mindspore.dataset.CelebADataset
class mindspore.dataset.CelebADataset(dataset_dir, num_parallel_workers=None, shuffle=None, usage='all', sampler=None, decode=False, extensions=None, num_samples=None, num_shards=None, shard_id=None, cache=None, decrypt=None)
CelebA (CelebFaces Attributes) dataset.

Currently, only list_attr_celeba.txt, the attribute annotation file of the dataset, is supported. The generated dataset has two columns: [image, attr]. The tensor of column image is of type uint8. The tensor of column attr is of type uint32 and is one-hot encoded.

Parameters
- dataset_dir (str) – Path to the root directory that contains the dataset.
- num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); it can be set by mindspore.dataset.config.set_num_parallel_workers().
- shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None.
- usage (str, optional) – Specify the 'train', 'valid', 'test' part or 'all' parts of the dataset. Default: 'all', which reads all samples.
- sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None.
- decode (bool, optional) – Whether to decode the images after reading. Default: False.
- extensions (list[str], optional) – List of file extensions to be included in the dataset. Default: None.
- num_samples (int, optional) – The number of images to be included in the dataset. Default: None, which includes all images.
- num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number per shard. Used in data parallel training.
- shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.
- cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache. Default: None, which means no cache is used.
- decrypt (callable, optional) – Image decryption function, which receives the path of the encrypted image file and returns the decrypted bytes data. Default: None, no decryption. A hedged sketch of such a callable follows the parameter list.
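As referenced above, here is a minimal sketch of a decrypt callable. It assumes, purely for illustration, that the image files were encrypted by XOR-ing every byte with a fixed key; both the key and the directory path are hypothetical.

```python
import mindspore.dataset as ds

def xor_decrypt(path):
    """Hypothetical decrypt callable: read the encrypted file at `path` and
    return the decrypted image bytes (here, a simple XOR with key 0x5A)."""
    with open(path, "rb") as f:
        data = f.read()
    return bytes(b ^ 0x5A for b in data)

# The directory path below is a placeholder; point it at your own CelebA folder.
dataset = ds.CelebADataset(dataset_dir="/path/to/celeba_dataset_directory",
                           usage="train", decode=True, decrypt=xor_decrypt)
```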
 
Raises
- RuntimeError – If dataset_dir does not contain data files.
- RuntimeError – If sampler and shuffle are specified at the same time.
- RuntimeError – If sampler and num_shards/shard_id are specified at the same time.
- RuntimeError – If num_shards is specified but shard_id is None.
- RuntimeError – If shard_id is specified but num_shards is None.
- ValueError – If shard_id is not in range of [0, num_shards).
- ValueError – If num_parallel_workers exceeds the maximum number of threads.
- ValueError – If usage is not 'train', 'valid', 'test' or 'all'.
 
Note

The parameters num_samples, shuffle, num_shards and shard_id can be used to control the sampler used in the dataset; their effects when combined with the parameter sampler are listed in the table below. A hedged usage sketch following the citation section below illustrates two of these combinations.
Sampler obtained by different combinations of parameters sampler and num_samples, shuffle, num_shards, shard_id:

| Parameter sampler | Parameter num_shards / shard_id | Parameter shuffle | Parameter num_samples | Sampler Used |
| --- | --- | --- | --- | --- |
| mindspore.dataset.Sampler type | None | None | None | sampler |
| numpy.ndarray, list, tuple, int type | / | / | num_samples | SubsetSampler(indices=sampler, num_samples=num_samples) |
| iterable type | / | / | num_samples | IterSampler(sampler=sampler, num_samples=num_samples) |
| None | num_shards / shard_id | None / True | num_samples | DistributedSampler(num_shards=num_shards, shard_id=shard_id, shuffle=True, num_samples=num_samples) |
| None | num_shards / shard_id | False | num_samples | DistributedSampler(num_shards=num_shards, shard_id=shard_id, shuffle=False, num_samples=num_samples) |
| None | None | None / True | None | RandomSampler(num_samples=num_samples) |
| None | None | None / True | num_samples | RandomSampler(replacement=True, num_samples=num_samples) |
| None | None | False | num_samples | SequentialSampler(num_samples=num_samples) |

Examples

```python
>>> import mindspore.dataset as ds
>>> celeba_dataset_dir = "/path/to/celeba_dataset_directory"
>>>
>>> # Read 5 samples from CelebA dataset
>>> dataset = ds.CelebADataset(dataset_dir=celeba_dataset_dir, usage='train', num_samples=5)
>>>
>>> # Note: In celeba dataset, each data dictionary owns keys "image" and "attr"
```

About CelebA dataset:

CelebFaces Attributes Dataset (CelebA) is a large-scale dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including:

- 10,177 identities,
- 202,599 face images,
- 5 landmark locations and 40 binary attribute annotations per image.

The dataset can be employed as the training and test sets for the following computer vision tasks: face attribute recognition, face detection, and landmark (or facial part) localization.

Original CelebA dataset structure:

```text
.
└── CelebA
    ├── README.md
    ├── Img
    │   ├── img_celeba.7z
    │   ├── img_align_celeba_png.7z
    │   └── img_align_celeba.zip
    ├── Eval
    │   └── list_eval_partition.txt
    └── Anno
        ├── list_landmarks_celeba.txt
        ├── list_landmarks_align_celeba.txt
        ├── list_bbox_celeba.txt
        ├── list_attr_celeba.txt
        └── identity_CelebA.txt
```

You can unzip the dataset files into the following structure and read them with MindSpore's API:

```text
.
└── celeba_dataset_directory
    ├── list_attr_celeba.txt
    ├── 000001.jpg
    ├── 000002.jpg
    ├── 000003.jpg
    ├── ...
```

Citation:

```text
@article{DBLP:journals/corr/LiuLWT14,
  author        = {Ziwei Liu and Ping Luo and Xiaogang Wang and Xiaoou Tang},
  title         = {Deep Learning Face Attributes in the Wild},
  journal       = {CoRR},
  volume        = {abs/1411.7766},
  year          = {2014},
  url           = {http://arxiv.org/abs/1411.7766},
  archivePrefix = {arXiv},
  eprint        = {1411.7766},
  timestamp     = {Tue, 10 Dec 2019 15:37:26 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/LiuLWT14.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org},
  howpublished  = {http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html}
}
```
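To make the sampler-combination rules from the note above concrete, here is a minimal hedged sketch; the directory path is a placeholder and the shard counts are arbitrary.

```python
import mindspore.dataset as ds

celeba_dataset_dir = "/path/to/celeba_dataset_directory"  # placeholder path

# Case 1: an explicit sampler is passed, so shuffle/num_shards/shard_id must be omitted;
# the dataset simply uses the given sampler.
dataset_a = ds.CelebADataset(dataset_dir=celeba_dataset_dir,
                             sampler=ds.SequentialSampler(num_samples=8))

# Case 2: no sampler, but num_shards/shard_id are given, so a
# DistributedSampler(num_shards=4, shard_id=0, shuffle=True) is created internally.
dataset_b = ds.CelebADataset(dataset_dir=celeba_dataset_dir,
                             num_shards=4, shard_id=0, shuffle=True)
```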
Pre-processing Operation
| apply | Apply a function in this dataset. |
| concat | Concatenate the dataset objects in the input list. |
| filter | Filter dataset by predicate. |
| flat_map | Map func to each row in dataset and flatten the result. |
| map | Apply each operation in operations to this dataset. |
| project | The specified columns will be selected from the dataset and passed into the pipeline with the order specified. |
| rename | Rename the columns in input datasets. |
| repeat | Repeat this dataset count times. |
| reset | Reset the dataset for next epoch. |
| save | Save the dynamic data processed by the dataset pipeline in common dataset format. |
| shuffle | Shuffle the dataset by creating a cache with the size of buffer_size. |
| skip | Skip the first N elements of this dataset. |
| split | Split the dataset into smaller, non-overlapping datasets. |
| take | Take the first specified number of samples from the dataset. |
| zip | Zip the datasets in the sense of input tuple of datasets. |
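As a hedged illustration of chaining a few of these pre-processing operations onto a CelebADataset pipeline (the directory path and the 64x64 resize target are placeholders, not values the API requires):

```python
import mindspore.dataset as ds
import mindspore.dataset.vision as vision

dataset = ds.CelebADataset(dataset_dir="/path/to/celeba_dataset_directory",
                           usage="train", decode=True)

# Resize each decoded image, keep both columns explicitly, then repeat for 2 epochs.
dataset = dataset.map(operations=vision.Resize((64, 64)), input_columns=["image"])
dataset = dataset.project(columns=["image", "attr"])
dataset = dataset.repeat(2)
```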
Batch
| batch | Combine batch_size number of consecutive rows into batches, applying per_batch_map to the samples first. |
| bucket_batch_by_length | Bucket elements according to their lengths. |
| padded_batch | Combine batch_size number of consecutive rows into batches, applying pad_info to the samples first. |
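A minimal hedged sketch of batching; it assumes the images have first been decoded and resized to a common shape so that 32 of them can be stacked into one batch:

```python
import mindspore.dataset as ds
import mindspore.dataset.vision as vision

dataset = ds.CelebADataset(dataset_dir="/path/to/celeba_dataset_directory", decode=True)
dataset = dataset.map(operations=vision.Resize((64, 64)), input_columns=["image"])

# Combine every 32 consecutive rows into a batch; drop the last incomplete batch.
dataset = dataset.batch(batch_size=32, drop_remainder=True)
```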
Iterator
| create_dict_iterator | Create an iterator over the dataset that yields samples of type dict, where the key is the column name and the value is the data. |
| create_tuple_iterator | Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column. |
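A hedged sketch of consuming the pipeline with the dict-style iterator (output_numpy=True yields NumPy arrays rather than MindSpore tensors; the path is a placeholder):

```python
import mindspore.dataset as ds

dataset = ds.CelebADataset(dataset_dir="/path/to/celeba_dataset_directory",
                           usage="train", num_samples=3, decode=True)

for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    # Each row is a dict whose keys are the column names "image" and "attr".
    print(row["image"].shape, row["attr"].shape)
```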
Attribute
| get_batch_size | Return the size of batch. |
| get_class_indexing | Get the mapping dictionary from category names to category indexes. |
| get_col_names | Return the names of the columns in dataset. |
| get_dataset_size | Return the number of batches in an epoch. |
| get_repeat_count | Get the replication times in RepeatDataset. |
| input_indexs | Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode. |
| num_classes | Get the number of classes in a dataset. |
| output_shapes | Get the shapes of output data. |
| output_types | Get the types of output data. |
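A hedged sketch of querying pipeline metadata with a few of these getters (the path is a placeholder):

```python
import mindspore.dataset as ds

dataset = ds.CelebADataset(dataset_dir="/path/to/celeba_dataset_directory", usage="valid")

print(dataset.get_col_names())     # ['image', 'attr']
print(dataset.get_dataset_size())  # number of rows (batches, once batching is applied)
print(dataset.output_types())      # per-column data types
```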
Apply Sampler
| add_sampler | Add a child sampler for the current dataset. |
| use_sampler | Replace the last child sampler of the current dataset, keeping the parent sampler unchanged. |
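A hedged sketch of replacing the dataset's sampler after construction with use_sampler (the sample count of 100 is arbitrary):

```python
import mindspore.dataset as ds

dataset = ds.CelebADataset(dataset_dir="/path/to/celeba_dataset_directory")

# Swap in a sequential sampler that reads only the first 100 samples.
dataset.use_sampler(ds.SequentialSampler(num_samples=100))
```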
Others
| sync_update | Release a blocking condition and trigger callback with given data. |
| sync_wait | Add a blocking condition to the input Dataset; a synchronize action will be applied. |
| to_json | Serialize a pipeline into a JSON string and dump it into a file if filename is provided. |
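A hedged sketch of serializing a small pipeline with to_json; the output filename is a placeholder:

```python
import mindspore.dataset as ds

dataset = ds.CelebADataset(dataset_dir="/path/to/celeba_dataset_directory", num_samples=5)
dataset = dataset.shuffle(buffer_size=4).repeat(2)

# Dump the pipeline definition to a JSON file; the JSON string is also returned.
json_str = dataset.to_json(filename="celeba_pipeline.json")
```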