Federated Learning Image Classification Dataset Process
This tutorial uses FEMNIST, a federated learning dataset from the LEAF benchmark. It contains 62 categories of handwritten digits and letters (digits 0 to 9, 26 lowercase letters, and 26 uppercase letters), and each image is 28 x 28 pixels. The dataset contains handwritten digits and letters from 3500 users, so up to 3500 clients can be simulated to participate in federated learning. The total data volume is 805,263 samples, the average data volume per user is 226.83, and the standard deviation of the data volume across users is 88.94.
Refer to the LEAF dataset instructions to download the dataset.
Environment requirements before downloading the dataset:

numpy==1.16.4
scipy                # conda install scipy
tensorflow==1.13.1   # pip install tensorflow
Pillow               # pip install Pillow
matplotlib           # pip install matplotlib
jupyter              # conda install jupyter notebook==5.7.8 tornado==4.5.3
pandas               # pip install pandas

Use git to download the official dataset generation script.
git clone https://github.com/TalwalkarLab/leaf.git
After downloading the project, the directory structure is as follows:

leaf/data/femnist
├── data           # Stores the dataset generated by the command
├── preprocess     # Code related to data preprocessing
├── preprocess.sh  # Shell script that generates the femnist dataset
└── README.md      # Official dataset download guide

Taking the femnist dataset as an example, run the following command to enter the specified path.

cd leaf/data/femnist

Running the command

./preprocess.sh -s niid --sf 1.0 -k 0 -t sample

generates a dataset containing 3500 users, with each user's data split into a training set and a test set in a ratio of 9:1. The meaning of the parameters in the command can be found in the leaf/data/femnist/README.md file. The directory structure after running is as follows:

leaf/data/femnist/35_client_sf1_data/
├── all_data  # All data mixed together, not split into training and test sets; 35 json files in total, each containing the data of 100 users
├── test      # Test sets, obtained by splitting each user's data into training and test sets in a ratio of 9:1; 35 json files in total, each containing the data of 100 users
├── train     # Training sets, obtained by splitting each user's data into training and test sets in a ratio of 9:1; 35 json files in total, each containing the data of 100 users
└── ...       # Other files are not used and are not described here

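As a quick sanity check after preprocessing, the following sketch counts the json files in each subfolder; with the settings above, 35 files are expected in each. The helper names count_json_files and check_preprocess_output are hypothetical and not part of LEAF, and the path is the one used in this tutorial.

```python
import os


def count_json_files(folder):
    """Return the number of .json files directly inside `folder` (hypothetical helper)."""
    return sum(1 for name in os.listdir(folder) if name.endswith(".json"))


def check_preprocess_output(data_root, expected=35):
    """Map each existing subfolder to (json_file_count, count_matches_expected)."""
    report = {}
    for sub in ("all_data", "train", "test"):
        path = os.path.join(data_root, sub)
        if os.path.isdir(path):
            n = count_json_files(path)
            report[sub] = (n, n == expected)
    return report


if __name__ == "__main__":
    # Path taken from this tutorial; nothing is printed if it does not exist.
    root = "leaf/data/femnist/35_client_sf1_data"
    if os.path.isdir(root):
        for sub, (n, ok) in check_preprocess_output(root).items():
            print(sub, n, "OK" if ok else "unexpected count")
```

A count other than 35 usually means an earlier run was interrupted or that different preprocess.sh options were used.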
Each json file contains the following three parts:
users: User list.
num_samples: The sample number list of each user.
user_data: A dictionary object with user names as keys and their respective data as values. For each user, the data is represented as a list of images, with each image represented as a list of 784 integers (obtained by flattening the 28 x 28 image array).
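To make this layout concrete, the sketch below builds a tiny synthetic object with the same three keys and checks that it is internally consistent. The user names and pixel values are made up, summarize_leaf_json is a hypothetical helper, and the 'x'/'y' sub-keys are assumed from the per-user format that the partition script in this tutorial documents.

```python
import json


def summarize_leaf_json(obj):
    """Check the three top-level keys and return (num_users, total_samples)."""
    users = obj["users"]
    num_samples = obj["num_samples"]
    user_data = obj["user_data"]
    assert len(users) == len(num_samples), "one sample count per user"
    for name, n in zip(users, num_samples):
        images = user_data[name]["x"]
        labels = user_data[name]["y"]
        assert len(images) == n and len(labels) == n
        assert all(len(img) == 784 for img in images), "28 x 28 flattened"
    return len(users), sum(num_samples)


# Synthetic stand-in for one raw file; user names and pixel values are made up.
example = {
    "users": ["u_a", "u_b"],
    "num_samples": [2, 1],
    "user_data": {
        "u_a": {"x": [[0.0] * 784, [1.0] * 784], "y": [3, 10]},
        "u_b": {"x": [[0.5] * 784], "y": [61]},
    },
}

# Round-trip through JSON text to mimic loading a file with json.load.
loaded = json.loads(json.dumps(example))
print(summarize_leaf_json(loaded))  # (2, 3)
```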
Before rerunning preprocess.sh, make sure to delete the rem_user_data, sampled_data, test and train subfolders from the data directory.
Divide the 35 json files into 3500 json files (each json file represents a user).
The code is as follows:

import os
import json


def mkdir(path):
    # Create the directory, including missing parent directories;
    # do nothing if it already exists.
    os.makedirs(path, exist_ok=True)


def partition_json(root_path, new_root_path):
    """
    partition 35 json files to 3500 json file

    Each raw .json file is an object with 3 keys:
    1. 'users', a list of users
    2. 'num_samples', a list of the number of samples for each user
    3. 'user_data', an object with user names as keys and their respective data as values;
       for each user, data is represented as a list of images, with each image
       represented as a size-784 integer list (flattened from 28 by 28)

    Each new .json file is an object with 3 keys:
    1. 'user_name', the name of user
    2. 'num_samples', the number of samples for the user
    3. 'user_data', a dict object with 'x' as keys and their respective data as values;
       with 'y' as keys and their respective label as values;

    Args:
        root_path (str): raw root path of 35 json files
        new_root_path (str): new root path of 3500 json files
    """
    # The output directory must exist before the per-user files are written.
    mkdir(new_root_path)
    paths = os.listdir(root_path)
    count = 0
    file_num = 0
    for i in paths:
        file_num += 1
        file_path = os.path.join(root_path, i)
        print('======== process ' + str(file_num) + ' file: ' + str(file_path) + '======================')
        with open(file_path, 'r') as load_f:
            load_dict = json.load(load_f)
            users = load_dict['users']
            num_users = len(users)
            num_samples = load_dict['num_samples']
            for j in range(num_users):
                count += 1
                print('---processing user: ' + str(count) + '---')
                cur_out = {'user_name': None, 'num_samples': None, 'user_data': {}}
                cur_user_id = users[j]
                cur_data_num = num_samples[j]
                cur_user_path = os.path.join(new_root_path, cur_user_id + '.json')
                cur_out['user_name'] = cur_user_id
                cur_out['num_samples'] = cur_data_num
                cur_out['user_data'].update(load_dict['user_data'][cur_user_id])
                with open(cur_user_path, 'w') as f:
                    json.dump(cur_out, f)
    f = os.listdir(new_root_path)
    print(len(f), ' users have been processed!')


# partition train json files
partition_json("leaf/data/femnist/35_client_sf1_data/train", "leaf/data/femnist/3500_client_json/train")
# partition test json files
partition_json("leaf/data/femnist/35_client_sf1_data/test", "leaf/data/femnist/3500_client_json/test")

Here, root_path is leaf/data/femnist/35_client_sf1_data/{train,test}, and new_root_path is a user-defined directory that stores the 3500 generated user json files. The training and test folders need to be processed separately.
Each of the 3500 newly generated user json files contains the following three parts:
user_name: User name.
num_samples: The number of samples of the user.
user_data: A dictionary object with 'x' as key and user data as value; with 'y' as key and the label corresponding to the user data as value.
Output like the following after running the script indicates a successful run:

======== process 1 file: /leaf/data/femnist/35_client_sf1_data/train/all_data_16_niid_0_keep_0_train_9.json======================
---processing user: 1---
---processing user: 2---
---processing user: 3---
......

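Beyond the printed log, the per-user files can be checked for consistency. The sketch below is one possible check, not part of the original scripts: check_user_json_dir is a hypothetical helper that counts the user files in a folder and lists any file whose name, num_samples, 'x' and 'y' do not agree.

```python
import json
import os


def check_user_json_dir(folder):
    """Return (number_of_user_files, sorted_list_of_inconsistent_file_names)."""
    bad = []
    names = [n for n in os.listdir(folder) if n.endswith(".json")]
    for name in names:
        with open(os.path.join(folder, name), "r") as f:
            record = json.load(f)
        n = record["num_samples"]
        data = record["user_data"]
        # The file name should be <user_name>.json, and both 'x' and 'y'
        # should hold exactly num_samples entries.
        if (record["user_name"] + ".json" != name
                or len(data["x"]) != n
                or len(data["y"]) != n):
            bad.append(name)
    return len(names), sorted(bad)


if __name__ == "__main__":
    # Paths taken from this tutorial; skipped if they do not exist.
    for tag in ("train", "test"):
        folder = os.path.join("leaf/data/femnist/3500_client_json", tag)
        if os.path.isdir(folder):
            total, bad = check_user_json_dir(folder)
            print(tag, "files:", total, "inconsistent:", len(bad))
```

With the settings used in this tutorial, each of the two folders is expected to contain 3500 files and no inconsistent entries.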
Convert a json file to an image file.
Refer to the following code:

import os
import json
import numpy as np
from PIL import Image

name_list = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
             'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
             'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
             'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
             'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


def mkdir(path):
    # Create the directory, including missing parent directories;
    # do nothing if it already exists.
    os.makedirs(path, exist_ok=True)


def json_2_numpy(img_size, file_path):
    """
    read json file to numpy
    Args:
        img_size (list): contain three elements: the height, width, channel of image
        file_path (str): path of one user json file
    return:
        image_numpy (numpy)
        label_numpy (numpy)
    """
    # open json file
    with open(file_path, 'r') as load_f_train:
        load_dict = json.load(load_f_train)
        num_samples = load_dict['num_samples']
        x = load_dict['user_data']['x']
        y = load_dict['user_data']['y']
        size = (num_samples, img_size[0], img_size[1], img_size[2])
        image_numpy = np.array(x, dtype=np.float32).reshape(size)  # mindspore doesn't support float64 and int64
        label_numpy = np.array(y, dtype=np.int32)
    return image_numpy, label_numpy


def json_2_img(json_path, save_path):
    """
    transform single json file to images

    Args:
        json_path (str): the path json file
        save_path (str): the root path to save images
    """
    data, label = json_2_numpy([28, 28, 1], json_path)
    for i in range(data.shape[0]):
        img = data[i] * 255  # PIL doesn't support 0/1 images, so convert to a 0~255 image
        im = Image.fromarray(np.squeeze(img))
        im = im.convert('L')
        # File name pattern: <label index>_<character>_<sample index>.png
        img_name = str(label[i]) + '_' + name_list[label[i]] + '_' + str(i) + '.png'
        path1 = os.path.join(save_path, str(label[i]))
        mkdir(path1)
        img_path = os.path.join(path1, img_name)
        im.save(img_path)
        print('-----', i, '-----')


def all_json_2_img(root_path, save_root_path):
    """
    transform json files to images
    Args:
        root_path (str): the root path of 3500 json files
        save_root_path (str): the root path to save images
    """
    usage = ['train', 'test']
    for i in range(2):
        x = usage[i]
        files_path = os.path.join(root_path, x)
        files = os.listdir(files_path)

        for name in files:
            user_name = name.split('.')[0]
            json_path = os.path.join(files_path, name)
            save_path1 = os.path.join(save_root_path, user_name)
            mkdir(save_path1)
            save_path = os.path.join(save_path1, x)
            mkdir(save_path)
            print('=============================' + name + '=======================')
            json_2_img(json_path, save_path)


all_json_2_img("leaf/data/femnist/3500_client_json/", "leaf/data/femnist/3500_client_img/")

Output like the following after running the script indicates a successful run:

=============================f0644_19.json=======================
----- 0 -----
----- 1 -----
----- 2 -----
......

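The two conversions the script relies on, mapping a label index to its character and turning a 784-value list back into 28 rows, can be illustrated without NumPy or PIL. This is a minimal standard-library sketch; label_to_char, unflatten and scale_to_255 are hypothetical names, and the rounding in scale_to_255 is a choice made here, whereas the script above leaves the final conversion to PIL.

```python
import string

# Same ordering as name_list in the script above: digits, uppercase, lowercase.
LABEL_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label):
    """Map a class index in 0..61 to its character."""
    return LABEL_CHARS[label]


def unflatten(pixels, height=28, width=28):
    """Turn a flat list of height * width values into a list of rows."""
    if len(pixels) != height * width:
        raise ValueError("expected %d values, got %d" % (height * width, len(pixels)))
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def scale_to_255(value):
    """Scale a pixel from [0, 1] to an integer in [0, 255] (rounding is this sketch's choice)."""
    return max(0, min(255, int(round(value * 255))))


rows = unflatten([0.0] * 784)
print(len(rows), len(rows[0]))                               # 28 28
print(label_to_char(10), label_to_char(36))                  # A a
print("10_" + label_to_char(10) + "_0.png")                  # 10_A_0.png
```

The last line reproduces the file naming pattern used by json_2_img for the first sample of a user whose label is 10.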
Some user folders contain only a few images. If the number of images in a folder is smaller than the batch size, the data needs to be randomly expanded.
The entire dataset "leaf/data/femnist/3500_client_img/" can be checked and expanded by referring to the following code:

import os
import shutil
from random import choice


def count_dir(path):
    # Count all files under path, including files in subfolders
    num = 0
    for root, dirs, files in os.walk(path):
        for file in files:
            num += 1
    return num


def get_img_list(path):
    # Collect the paths of all images under path/<label>/
    img_path_list = []
    label_list = os.listdir(path)
    for i in range(len(label_list)):
        label = label_list[i]
        imgs_path = os.path.join(path, label)
        imgs_name = os.listdir(imgs_path)
        for j in range(len(imgs_name)):
            img_name = imgs_name[j]
            img_path = os.path.join(imgs_path, img_name)
            img_path_list.append(img_path)
    return img_path_list


def data_aug(data_root_path, batch_size=32):
    users = os.listdir(data_root_path)
    tags = ["train", "test"]
    aug_users = []
    for i in range(len(users)):
        user = users[i]
        for tag in tags:
            data_path = os.path.join(data_root_path, user, tag)
            num_data = count_dir(data_path)
            if num_data < batch_size:
                aug_users.append(user + "_" + tag)
                print("user: ", user, " ", tag, " data number: ", num_data, " < ", batch_size, " should be aug")
                aug_num = batch_size - num_data
                img_path_list = get_img_list(data_path)
                for j in range(aug_num):
                    # Copy a randomly chosen existing image under a new name
                    img_path = choice(img_path_list)
                    # splitext removes only the final extension, so dots elsewhere in the path are safe
                    aug_img_path = os.path.splitext(img_path)[0] + "_aug_" + str(j) + ".png"
                    shutil.copy(img_path, aug_img_path)
                    print("[aug", j, "]", "============= copy file:", img_path, "to ->", aug_img_path)
    print("the number of all aug users: " + str(len(aug_users)))
    print("aug user name: ", end=" ")
    for k in range(len(aug_users)):
        print(aug_users[k], end=" ")


if __name__ == "__main__":
    data_root_path = "leaf/data/femnist/3500_client_img/"
    batch_size = 32
    data_aug(data_root_path, batch_size)

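The expansion rule itself is simple: a folder with fewer images than the batch size receives batch_size minus num_data copies of randomly chosen existing images. The sketch below shows that rule on an in-memory list of file names, without touching the file system. plan_augmentation is a hypothetical helper, the file names are made up, and the seed parameter is added here only to make the example reproducible.

```python
import random


def plan_augmentation(file_names, batch_size=32, seed=None):
    """Return (source, copy_name) pairs needed for the folder to reach batch_size files.

    Sources are sampled with replacement, and copies follow the naming
    used above: <stem>_aug_<j>.png.
    """
    rng = random.Random(seed)
    missing = max(0, batch_size - len(file_names))
    if missing and not file_names:
        raise ValueError("cannot expand an empty folder")
    plan = []
    for j in range(missing):
        src = rng.choice(file_names)
        stem = src.rsplit(".", 1)[0]
        plan.append((src, stem + "_aug_" + str(j) + ".png"))
    return plan


# A made-up folder with 29 images needs 3 copies to fill one batch of 32.
names = ["3_3_%d.png" % i for i in range(29)]
plan = plan_augmentation(names, batch_size=32, seed=0)
print(len(plan))  # 3
```

Because the copy index j is part of each new name, the copies never overwrite each other even when the same source image is chosen more than once.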
Convert the expanded image dataset into a bin file format usable in the Federated Learning Framework.
Refer to the following code:

import numpy as np
import os
import mindspore.dataset as ds
import mindspore.dataset.vision as vision
import mindspore.dataset.transforms as transforms
import mindspore


def mkdir(path):
    if not os.path.exists(path):
        os.mkdir(path)


def count_id(path):
    # Map each label folder name (e.g. "10") to its integer class index
    files = os.listdir(path)
    ids = {}
    for i in files:
        ids[i] = int(i)
    return ids


def create_dataset_from_folder(data_path, img_size, batch_size=32, repeat_size=1, num_parallel_workers=1, shuffle=False):
    """ create dataset for train or test
    Args:
        data_path: Data path
        batch_size: The number of data records in each group
        repeat_size: The number of replicated data records
        num_parallel_workers: The number of parallel workers
    """
    # define dataset
    ids = count_id(data_path)
    mnist_ds = ds.ImageFolderDataset(dataset_dir=data_path, decode=False, class_indexing=ids)
    # define operation parameters
    resize_height, resize_width = img_size[0], img_size[1]  # 32

    transform = [
        vision.Decode(True),
        vision.Grayscale(1),
        vision.Resize(size=(resize_height, resize_width)),
        vision.Grayscale(3),
        vision.ToTensor(),
    ]
    compose = transforms.Compose(transform)

    # apply map operations on images
    mnist_ds = mnist_ds.map(input_columns="label", operations=transforms.TypeCast(mindspore.int32))
    mnist_ds = mnist_ds.map(input_columns="image", operations=compose)

    # apply DatasetOps
    buffer_size = 10000
    if shuffle:
        mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)  # 10000 as in LeNet train script
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    mnist_ds = mnist_ds.repeat(repeat_size)
    return mnist_ds


def img2bin(root_path, root_save):
    """
    transform images to bin files

    Args:
        root_path: the root path of 3500 images files
        root_save: the root path to save bin files
    """
    use_list = []
    train_batch_num = []
    test_batch_num = []
    mkdir(root_save)
    users = os.listdir(root_path)
    for user in users:
        use_list.append(user)
        user_path = os.path.join(root_path, user)
        train_test = os.listdir(user_path)
        for tag in train_test:
            data_path = os.path.join(user_path, tag)
            dataset = create_dataset_from_folder(data_path, (32, 32, 1), 32)
            batch_num = 0
            img_list = []
            label_list = []
            for data in dataset.create_dict_iterator():
                batch_x_tensor = data['image']
                batch_y_tensor = data['label']
                # Reorder each batch from (N, C, H, W) to (N, H, W, C)
                trans_img = np.transpose(batch_x_tensor.asnumpy(), [0, 2, 3, 1])
                img_list.append(trans_img)
                label_list.append(batch_y_tensor.asnumpy())
                batch_num += 1

            if tag == "train":
                train_batch_num.append(batch_num)
            elif tag == "test":
                test_batch_num.append(batch_num)

            imgs = np.array(img_list)  # (batch_num, 32, 32, 32, 3) after the transpose above
            labels = np.array(label_list)
            path1 = os.path.join(root_save, user)
            mkdir(path1)
            image_path = os.path.join(path1, user + "_" + "bn_" + str(batch_num) + "_" + tag + "_data.bin")
            label_path = os.path.join(path1, user + "_" + "bn_" + str(batch_num) + "_" + tag + "_label.bin")

            imgs.tofile(image_path)
            labels.tofile(label_path)
            print("user: " + user + " " + tag + "_batch_num: " + str(batch_num))
    print("total " + str(len(use_list)) + " users finished!")


root_path = "leaf/data/femnist/3500_client_img/"
root_save = "leaf/data/femnist/3500_clients_bin"
img2bin(root_path, root_save)

Output like the following after running the script indicates a successful run:

user: f0141_43 test_batch_num: 1
user: f0141_43 train_batch_num: 10
user: f0137_14 test_batch_num: 1
user: f0137_14 train_batch_num: 11
......
total 3500 users finished!

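The batch counts in this output follow from drop_remainder=True in the script above: only complete batches of 32 images are kept, so the count is the integer quotient of the image count and the batch size. The sketch below works through that arithmetic; the function names and the image counts in the example are made up for illustration.

```python
def num_batches(num_images, batch_size=32):
    """Number of complete batches when the incomplete last batch is dropped."""
    return num_images // batch_size


def dropped_images(num_images, batch_size=32):
    """Number of images left out because they do not fill a complete batch."""
    return num_images % batch_size


# A hypothetical user with 345 training images: 10 batches, 25 images dropped.
print(num_batches(345), dropped_images(345))  # 10 25

# Without the expansion step, a folder with 31 images would yield no batch at all.
print(num_batches(31), num_batches(32))  # 0 1
```

This is why folders with fewer images than the batch size are expanded beforehand: otherwise such a user would produce empty bin files.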
The generated 3500_clients_bin folder contains a total of 3500 user folders with the following directory structure:

leaf/data/femnist/3500_clients_bin
├── f0000_14  # User number
│   ├── f0000_14_bn_10_train_data.bin   # Training data of user f0000_14 (the number 10 after bn_ is the number of batches)
│   ├── f0000_14_bn_10_train_label.bin  # Training labels of user f0000_14
│   ├── f0000_14_bn_1_test_data.bin     # Test data of user f0000_14 (the number 1 after bn_ is the number of batches)
│   └── f0000_14_bn_1_test_label.bin    # Test labels of user f0000_14
├── f0001_41  # User number
│   ├── f0001_41_bn_11_train_data.bin   # Training data of user f0001_41 (the number 11 after bn_ is the number of batches)
│   ├── f0001_41_bn_11_train_label.bin  # Training labels of user f0001_41
│   ├── f0001_41_bn_1_test_data.bin     # Test data of user f0001_41 (the number 1 after bn_ is the number of batches)
│   └── f0001_41_bn_1_test_label.bin    # Test labels of user f0001_41
│   ...
└── f4099_10  # User number
    ├── f4099_10_bn_4_train_data.bin    # Training data of user f4099_10 (the number 4 after bn_ is the number of batches)
    ├── f4099_10_bn_4_train_label.bin   # Training labels of user f4099_10
    ├── f4099_10_bn_1_test_data.bin     # Test data of user f4099_10 (the number 1 after bn_ is the number of batches)
    └── f4099_10_bn_1_test_label.bin    # Test labels of user f4099_10

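The file names encode the user, the number of batches, and the split, which makes a simple size check possible. The sketch below parses a name and computes the file size one would expect. It assumes that images are stored as float32 (believed to be the default output type of vision.ToTensor, not verified here) and labels as int32 (from the TypeCast in the script), with 32 images of 32 x 32 x 3 per batch; parse_bin_name and expected_size_bytes are hypothetical helpers.

```python
import re

# Matches names such as f0000_14_bn_10_train_data.bin
BIN_NAME = re.compile(
    r"^(?P<user>.+)_bn_(?P<batches>\d+)_(?P<tag>train|test)_(?P<kind>data|label)\.bin$"
)


def parse_bin_name(file_name):
    """Return (user, number_of_batches, tag, kind) parsed from a bin file name."""
    m = BIN_NAME.match(file_name)
    if m is None:
        raise ValueError("unexpected file name: " + file_name)
    return m.group("user"), int(m.group("batches")), m.group("tag"), m.group("kind")


def expected_size_bytes(batches, kind, batch_size=32, height=32, width=32, channels=3):
    """Expected file size, ASSUMING 4-byte float32 pixels and 4-byte int32 labels."""
    if kind == "data":
        return batches * batch_size * height * width * channels * 4
    return batches * batch_size * 4


user, batches, tag, kind = parse_bin_name("f0000_14_bn_10_train_data.bin")
print(user, batches, tag, kind)            # f0000_14 10 train data
print(expected_size_bytes(batches, kind))  # 3932160
```

Comparing this value with os.path.getsize for each file is a cheap way to spot truncated or mismatched outputs, provided the dtype assumptions hold for the MindSpore version in use.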
The 3500_clients_bin folder generated by following the steps above can be directly used as the input data for the device-cloud federated image classification task.