mindspore.dataset.TFRecordDataset
- class mindspore.dataset.TFRecordDataset(dataset_files, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None, shard_equal_rows=False, cache=None, compression_type=None)
A source dataset that reads and parses datasets stored on disk in TFRecord format.
The columns of the generated dataset depend on the source TFRecord files.
- Parameters
dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in lexicographical order.
schema (Union[str, Schema], optional) – Data format policy, which specifies the data types and shapes of the data columns to be read. Both a JSON file path and an object constructed by mindspore.dataset.Schema are acceptable. Default: None.
columns_list (list[str], optional) – List of columns to be read. Default: None, read all columns.
num_samples (int, optional) – The number of samples (rows) to be included in the dataset. Default: None. When num_shards and shard_id are specified, it is interpreted as the number of rows per shard. The processing priority for num_samples is as follows:
  - If num_samples is specified with a value > 0, read num_samples samples.
  - If num_samples is not specified and numRows (parsed from schema) has a value > 0, read numRows samples.
  - If neither num_samples nor schema is specified, read the full dataset.
num_parallel_workers (int, optional) – Number of worker threads used to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers().
shuffle (Union[bool, Shuffle], optional) – Whether to reshuffle the data every epoch. Default: Shuffle.GLOBAL. Both bool values and the Shuffle enum are accepted. If shuffle is False, no shuffling is performed. If shuffle is True, a global shuffle is performed. The shuffle level is selected with the enum defined by mindspore.dataset.Shuffle; the levels described here are:
  - Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
  - Shuffle.FILES: Shuffle files only.
num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.
shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.
shard_equal_rows (bool, optional) – Get equal rows for all shards. Default: False. If shard_equal_rows is False, the number of rows in each shard may not be equal, which may lead to a failure in distributed training. When the number of samples per TFRecord file is not equal, it is suggested to set it to True. This argument should only be specified when num_shards is also specified. When compression_type is not None and num_samples or numRows (parsed from schema) is provided, shard_equal_rows is implied to be True.
cache (DatasetCache, optional) – Use the tensor caching service to speed up dataset processing. More details: Single-Node Data Cache. Default: None, which means no cache is used.
compression_type (str, optional) – The type of compression used for all files; must be '', 'GZIP', or 'ZLIB'. Default: None, which is treated as an empty string. It is highly recommended to provide num_samples or numRows (parsed from schema) when compression_type is 'GZIP' or 'ZLIB', to avoid the performance degradation caused by decompressing the same file multiple times to obtain the file size.
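The processing priority described for num_samples can be sketched in plain Python. This is only an illustration of the documented rules, not MindSpore's implementation: the function name resolve_num_samples and the argument schema_num_rows (standing in for numRows parsed from the schema) are hypothetical, and treating num_samples=0 the same as "not specified" is an assumption.

```python
from typing import Optional


def resolve_num_samples(num_samples: Optional[int] = None,
                        schema_num_rows: Optional[int] = None) -> Optional[int]:
    """Return the number of rows to read, or None for "read the full dataset".

    Hypothetical helper mirroring the documented priority:
    1. num_samples > 0 wins.
    2. Otherwise numRows from the schema (> 0) is used.
    3. Otherwise the full dataset is read.
    """
    if num_samples is not None and num_samples < 0:
        # The Raises section lists ValueError for num_samples < 0.
        raise ValueError("num_samples must not be negative")
    if num_samples is not None and num_samples > 0:
        return num_samples
    # Assumption: num_samples of None or 0 falls through to the schema value.
    if schema_num_rows is not None and schema_num_rows > 0:
        return schema_num_rows
    return None


print(resolve_num_samples(num_samples=10, schema_num_rows=100))
print(resolve_num_samples(schema_num_rows=100))
print(resolve_num_samples())
```

When num_shards and shard_id are also given, the resolved value would be read as a per-shard row count rather than a total.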
- Raises
ValueError – If dataset_files are not valid or do not exist.
ValueError – If num_parallel_workers exceeds the maximum number of threads.
RuntimeError – If num_shards is specified but shard_id is None.
RuntimeError – If shard_id is specified but num_shards is None.
ValueError – If shard_id is not in range of [0, num_shards).
ValueError – If compression_type is not '', 'GZIP' or 'ZLIB'.
ValueError – If compression_type is provided, but the number of dataset files < num_shards.
ValueError – If num_samples < 0.
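As a rough illustration of how the sharding and compression conditions above fit together, the following pure-Python sketch returns the exception type that the list associates with a given combination of arguments. The helper expected_error and its num_files argument are hypothetical and are not part of MindSpore; the order of the checks, and treating '' and None as "compression_type not provided", are assumptions.

```python
from typing import Optional


def expected_error(num_files: int,
                   num_shards: Optional[int] = None,
                   shard_id: Optional[int] = None,
                   compression_type: Optional[str] = None,
                   num_samples: Optional[int] = None) -> Optional[Exception]:
    """Return the exception the documented rules suggest, or None if valid."""
    if num_samples is not None and num_samples < 0:
        return ValueError("num_samples < 0")
    if compression_type not in (None, "", "GZIP", "ZLIB"):
        return ValueError("compression_type must be '', 'GZIP' or 'ZLIB'")
    if num_shards is not None and shard_id is None:
        return RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        return RuntimeError("shard_id is specified but num_shards is None")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        return ValueError("shard_id is not in range [0, num_shards)")
    # Assumption: only 'GZIP' and 'ZLIB' count as "compression_type is provided".
    if compression_type and num_shards is not None and num_files < num_shards:
        return ValueError("fewer dataset files than num_shards")
    return None


# Four GZIP files split over two shards: no problem expected.
print(expected_error(num_files=4, num_shards=2, shard_id=1, compression_type="GZIP"))
# One GZIP file cannot be split over two shards.
print(type(expected_error(num_files=1, num_shards=2, shard_id=0, compression_type="GZIP")).__name__)
```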
Examples
>>> import mindspore.dataset as ds
>>> from mindspore import dtype as mstype
>>>
>>> tfrecord_dataset_dir = ["/path/to/tfrecord_dataset_file"]  # contains 1 or multiple TFRecord files
>>> tfrecord_schema_file = "/path/to/tfrecord_schema_file"
>>>
>>> # 1) Get all rows from tfrecord_dataset_dir with no explicit schema.
>>> # The meta-data in the first row will be used as a schema.
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir)
>>>
>>> # 2) Get all rows from tfrecord_dataset_dir with user-defined schema.
>>> schema = ds.Schema()
>>> schema.add_column(name='col_1d', de_type=mstype.int64, shape=[2])
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir, schema=schema)
>>>
>>> # 3) Get all rows from tfrecord_dataset_dir with the schema file.
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir, schema=tfrecord_schema_file)
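The sharding parameters are not covered by the examples above, so here is a minimal sketch of how they might be combined for distributed reading. The function make_shard_dataset and its rank and world_size arguments are hypothetical names chosen for illustration; only the TFRecordDataset keyword arguments come from the signature documented on this page. MindSpore is imported inside the function, so defining it does not require MindSpore to be installed.

```python
def make_shard_dataset(files, rank, world_size, rows_per_shard=None):
    """Build a TFRecordDataset that reads only the shard belonging to `rank`.

    Hypothetical helper: `rank` and `world_size` would normally come from
    the distributed training setup and are assumptions of this sketch.
    """
    import mindspore.dataset as ds  # lazy import; requires MindSpore at call time

    return ds.TFRecordDataset(
        dataset_files=files,
        num_samples=rows_per_shard,    # interpreted per shard when sharding is on
        shuffle=ds.Shuffle.FILES,      # shuffle file order only
        num_shards=world_size,
        shard_id=rank,                 # must lie in [0, num_shards)
        shard_equal_rows=True,         # suggested when files hold unequal row counts
    )


# Usage (needs MindSpore and real TFRecord files, so it is left commented out):
# dataset = make_shard_dataset(["/path/to/tfrecord_dataset_file"], rank=0, world_size=2)
# for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
#     print(row.keys())
```

Setting shard_equal_rows=True here follows the recommendation in the parameter description for files with unequal sample counts; whether it is needed depends on the data.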
Pre-processing Operation
Apply a function to this dataset.
Concatenate the dataset objects in the input list.
Filter the dataset by a predicate.
Map func to each row in the dataset and flatten the result.
Apply each operation in operations to this dataset.
The specified columns will be selected from the dataset and passed into the pipeline in the order specified.
Rename the columns in the input datasets.
Repeat this dataset count times.
Reset the dataset for the next epoch.
Save the dynamic data processed by the dataset pipeline in a common dataset format.
Shuffle the dataset by creating a cache with the size of buffer_size.
Skip the first N elements of this dataset.
Split the dataset into smaller, non-overlapping datasets.
Take at most the given number of elements from the dataset.
Zip the datasets in the input tuple of datasets.
Batch
Combine batch_size consecutive rows into a batch, applying per_batch_map to the samples first.
Bucket elements according to their lengths.
Combine batch_size consecutive rows into a batch, applying pad_info to the samples first.
Iterator
Create an iterator over the dataset.
Create an iterator over the dataset.
Attribute
Return the size of batch.
Return the class index.
Return the names of the columns in dataset.
Return the number of batches in an epoch.
Get the replication times in RepeatDataset.
Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode.
Get the number of classes in a dataset.
Get the shapes of output data.
Get the types of output data.
Apply Sampler
Add a child sampler for the current dataset.
Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.
Others
Return a transferred Dataset that transfers data through a device.
Release a blocking condition and trigger the callback with the given data.
Add a blocking condition to the input Dataset; a synchronize action will be applied.
Serialize a pipeline into a JSON string and dump it into a file if filename is provided.