mindspore.dataset.MindDataset
- class mindspore.dataset.MindDataset(dataset_files, columns_list=None, num_parallel_workers=None, shuffle=None, num_shards=None, shard_id=None, sampler=None, padded_sample=None, num_padded=None, num_samples=None, cache=None)[source]
A source dataset that reads and parses MindRecord datasets.
The columns of generated dataset depend on the source MindRecord files.
- Parameters
  - dataset_files (Union[str, list[str]]) – If dataset_files is a str, it represents the file name of one component of a MindRecord source; other files with an identical source in the same path will be found and loaded automatically. If dataset_files is a list, it represents a list of dataset files to be read directly.
  - columns_list (list[str], optional) – List of columns to be read. Default: None, read all columns.
  - num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use the global default number of workers (8); it can be set by mindspore.dataset.config.set_num_parallel_workers().
  - shuffle (Union[bool, Shuffle], optional) – Perform reshuffling of the data every epoch. Default: None, performs mindspore.dataset.Shuffle.GLOBAL. Both bool values and Shuffle enum values can be passed in. If shuffle is False, no shuffling is performed. If shuffle is True, a global shuffle is performed. The three shuffle levels are defined by mindspore.dataset.Shuffle (a usage sketch follows this parameter list):
    - Shuffle.GLOBAL: Global shuffle of all rows of data in the dataset, same as setting shuffle to True.
    - Shuffle.FILES: Shuffle the file sequence but keep the order of data within each file. Not supported when the number of samples in the dataset is greater than 100 million.
    - Shuffle.INFILE: Keep the file sequence the same but shuffle the data within each file. Not supported when the number of samples in the dataset is greater than 100 million.
  - num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard. Used in data parallel training.
  - shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.
  - sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None. sampler is exclusive with shuffle and block_reader. Supported samplers: mindspore.dataset.SubsetRandomSampler, mindspore.dataset.PKSampler, mindspore.dataset.RandomSampler, mindspore.dataset.SequentialSampler, mindspore.dataset.DistributedSampler.
  - padded_sample (dict, optional) – Samples will be appended to the dataset, where the keys are the same as columns_list. Default: None.
  - num_padded (int, optional) – Number of padding samples. The dataset size plus num_padded should be divisible by num_shards. Default: None.
  - num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all samples.
  - cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. For more details, see Single-Node Data Cache. Default: None, which means no cache is used.
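A hedged usage sketch of the columns_list and shuffle parameters above (not part of the official example): read only two columns and shuffle at file level. The file path and the column names "data" and "label" are placeholders that must match how the source MindRecord file was written.
>>> import mindspore.dataset as ds
>>> # Placeholder path and column names; both depend on the source MindRecord file.
>>> dataset = ds.MindDataset(dataset_files=["/path/to/mind_dataset_file"],
...                          columns_list=["data", "label"],
...                          shuffle=ds.Shuffle.FILES,
...                          num_parallel_workers=4)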
- Raises
ValueError – If dataset_files are not valid or do not exist.
ValueError – If num_parallel_workers exceeds the maximum number of threads.
RuntimeError – If num_shards is specified but shard_id is None.
RuntimeError – If shard_id is specified but num_shards is None.
ValueError – If shard_id is not in range [0, num_shards).
Note
When sharding MindRecord (by configuring num_shards and shard_id), there are two strategies for implementing the data sharding logic. This API uses strategy 2. A minimal sharded-loading sketch follows the tables below.

Strategy 1: samples are assigned to shards at intervals (rank 0 reads samples 0, 4, 8, ...):

| rank 0 | rank 1 | rank 2 | rank 3 |
|--------|--------|--------|--------|
| 0      | 1      | 2      | 3      |
| 4      | 5      | 6      | 7      |
| 8      | 9      | 10     | 11     |

Strategy 2: each shard reads a contiguous block of samples (rank 0 reads samples 0, 1, 2):

| rank 0 | rank 1 | rank 2 | rank 3 |
|--------|--------|--------|--------|
| 0      | 3      | 6      | 9      |
| 1      | 4      | 7      | 10     |
| 2      | 5      | 8      | 11     |
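A minimal sharded-loading sketch from the user's side: each training process builds its own shard. The file path is a placeholder, and rank_id / rank_size are hard-coded here; in real data parallel training they would typically come from mindspore.communication.get_rank() and get_group_size().
>>> import mindspore.dataset as ds
>>> mindrecord_files = ["/path/to/mind_dataset_file"]  # placeholder path
>>> rank_id, rank_size = 0, 4  # normally obtained from the communication layer
>>> # This process reads only the rows belonging to shard rank_id out of rank_size shards.
>>> sharded_dataset = ds.MindDataset(dataset_files=mindrecord_files,
...                                  num_shards=rank_size,
...                                  shard_id=rank_id)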
Note
The parameters num_samples, shuffle, num_shards and shard_id can be used to control the sampler used in the dataset. Their effects when combined with the parameter sampler are as follows.

| Parameter sampler | Parameter num_shards / shard_id | Parameter shuffle | Parameter num_samples | Sampler Used |
|---|---|---|---|---|
| mindspore.dataset.Sampler type | None | None | None | sampler |
| numpy.ndarray, list, tuple, int type | / | / | num_samples | SubsetSampler(indices=sampler, num_samples=num_samples) |
| iterable type | / | / | num_samples | IterSampler(sampler=sampler, num_samples=num_samples) |
| None | num_shards / shard_id | None / True | num_samples | DistributedSampler(num_shards=num_shards, shard_id=shard_id, shuffle=True, num_samples=num_samples) |
| None | num_shards / shard_id | False | num_samples | DistributedSampler(num_shards=num_shards, shard_id=shard_id, shuffle=False, num_samples=num_samples) |
| None | None | None / True | None | RandomSampler(num_samples=num_samples) |
| None | None | None / True | num_samples | RandomSampler(replacement=True, num_samples=num_samples) |
| None | None | False | num_samples | SequentialSampler(num_samples=num_samples) |
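A hedged sketch of the first table row above, where an explicit sampler replaces shuffle / num_shards / shard_id; the file path and the index list are arbitrary placeholders.
>>> import mindspore.dataset as ds
>>> # An explicit sampler is mutually exclusive with shuffle, num_shards and shard_id.
>>> sampler = ds.SubsetRandomSampler(indices=[0, 2, 4, 6])
>>> dataset = ds.MindDataset(dataset_files=["/path/to/mind_dataset_file"],
...                          sampler=sampler)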
Examples
>>> import mindspore.dataset as ds
>>> mindrecord_files = ["/path/to/mind_dataset_file"]  # contains 1 or multiple MindRecord files
>>> dataset = ds.MindDataset(dataset_files=mindrecord_files)
Pre-processing Operation
- apply – Apply a function in this dataset.
- concat – Concatenate the dataset objects in the input list.
- filter – Filter dataset by predicate.
- flat_map – Map func to each row in dataset and flatten the result.
- map – Apply each operation in operations to this dataset.
- project – The specified columns will be selected from the dataset and passed into the pipeline in the order specified.
- rename – Rename the columns in input datasets.
- repeat – Repeat this dataset count times.
- reset – Reset the dataset for the next epoch.
- save – Save the dynamic data processed by the dataset pipeline in a common dataset format.
- shuffle – Shuffle the dataset by creating a cache with the size of buffer_size.
- skip – Skip the first N elements of this dataset.
- split – Split the dataset into smaller, non-overlapping datasets.
- take – Take the first specified number of samples from the dataset.
- zip – Zip the datasets in the sense of input tuple of datasets.
Batch
- batch – Combine batch_size consecutive rows into a batch, applying per_batch_map to the samples first.
- bucket_batch_by_length – Bucket elements according to their lengths.
- padded_batch – Combine batch_size consecutive rows into a batch, applying pad_info to the samples first.
Iterator
- create_dict_iterator – Create an iterator over the dataset that yields samples of type dict, where the key is the column name and the value is the data.
- create_tuple_iterator – Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column.
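A hedged sketch combining the batch and iterator entries above: batch a MindDataset and walk it with create_dict_iterator. The file path is a placeholder, and the printed column names depend on the source MindRecord file.
>>> import mindspore.dataset as ds
>>> dataset = ds.MindDataset(dataset_files=["/path/to/mind_dataset_file"])  # placeholder path
>>> dataset = dataset.batch(batch_size=32, drop_remainder=True)
>>> # Each row is a dict keyed by column name; output_numpy=True yields NumPy arrays.
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(list(row.keys()))
...     break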
Attribute
- get_batch_size – Return the size of the batch.
- get_class_indexing – Get the mapping dictionary from category names to category indexes.
- get_col_names – Return the names of the columns in the dataset.
- get_dataset_size – Return the number of batches in an epoch.
- get_repeat_count – Get the replication times in RepeatDataset.
- input_indexs – Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode.
- num_classes – Get the number of classes in a dataset.
- output_shapes – Get the shapes of output data.
- output_types – Get the types of output data.
Apply Sampler
- add_sampler – Add a child sampler for the current dataset.
- use_sampler – Replace the last child sampler of the current dataset, keeping the parent sampler unchanged.
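A hedged sketch of the sampler helpers above: replace the sampler of an existing dataset object with use_sampler. The file path is a placeholder and the SequentialSampler is an arbitrary choice for illustration.
>>> import mindspore.dataset as ds
>>> dataset = ds.MindDataset(dataset_files=["/path/to/mind_dataset_file"])  # placeholder path
>>> # Swap in a sequential sampler, keeping any parent sampler unchanged.
>>> dataset.use_sampler(ds.SequentialSampler())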
Others
- sync_update – Release a blocking condition and trigger callback with given data.
- sync_wait – Add a blocking condition to the input Dataset; a synchronize action will be applied.
- to_json – Serialize a pipeline into a JSON string and dump it into a file if filename is provided.