mindspore.dataset

At the heart of MindSpore's data loading utility is the mindspore.dataset module, a dataset engine built on a pipeline design.

This module provides the following data loading methods to help users load datasets into MindSpore.

The module also provides data samplers, transformations, and batching, along with basic configuration such as the random seed and parallelism settings, for use together with dataset loading.

  • Data Sampler: Provides various common samplers, such as RandomSampler and DistributedSampler.

  • Data Transformations: Provides multiple dataset operations, such as data augmentation and batching.

  • Basic Configuration: Provides pipeline configuration for random seed setting, parallelism setting, data recovery mode, etc.

Descriptions of common dataset terms are as follows:

  • Dataset, the base class of all the datasets. It provides data processing methods to help preprocess the data.

  • SourceDataset, an abstract class representing the source of a dataset pipeline, which produces data from sources such as files and databases.

  • MappableDataset, an abstract class representing a source dataset that supports random access.

  • Iterator, the base class of dataset iterator for enumerating elements.

Introduction to data processing pipeline

../_images/dataset_pipeline_en.png

As shown in the figure above, the mindspore.dataset module lets users define data preprocessing pipelines and transform the samples in a dataset efficiently, using multiple processes or threads. The specific steps are as follows:

  • Loading datasets: Users can load supported datasets with the corresponding *Dataset class, or load datasets customized at the Python layer through a UDF loader combined with GeneratorDataset. The loading classes accept a variety of parameters, such as a sampler, data sharding, and shuffling;

  • Dataset operation: The user calls dataset object methods such as .shuffle / .filter / .skip / .split / .take / … to shuffle, filter, skip, split, or cap the number of samples taken from the dataset;

  • Dataset sample transform operation: The user can pass data transform operations (vision transform, NLP transform, audio transform) to the map operation to transform samples. During data preprocessing, multiple map operations can be defined to apply different transforms to different fields. A transform can also be a user-defined pyfunc (Python function);

  • Batch: After the transformation of the samples, the user can use the batch operation to organize multiple samples into batches, or use self-defined batch logic with the parameter per_batch_map applied;

  • Iterator: Finally, the user can use the dataset object method create_dict_iterator to create an iterator, which can output the preprocessed data cyclically.

Quick start of Dataset Pipeline

For a quick start with the Dataset Pipeline, download Load & Process Data With Dataset Pipeline to your local machine and run it step by step.

User Defined

mindspore.dataset.GeneratorDataset

A source dataset that generates data from Python by invoking Python data source each epoch.

Standard Format

mindspore.dataset.MindDataset

A source dataset that reads and parses MindRecord dataset.

mindspore.dataset.OBSMindDataset

A source dataset that reads and parses a MindRecord dataset stored in cloud storage such as OBS, Minio or AWS S3.

mindspore.dataset.TFRecordDataset

A source dataset that reads and parses datasets stored on disk in TFData format.

Open Source

Vision

mindspore.dataset.Caltech101Dataset

Caltech 101 dataset.

mindspore.dataset.Caltech256Dataset

Caltech 256 dataset.

mindspore.dataset.CelebADataset

CelebA(CelebFaces Attributes) dataset.

mindspore.dataset.Cifar10Dataset

CIFAR-10 dataset.

mindspore.dataset.Cifar100Dataset

CIFAR-100 dataset.

mindspore.dataset.CityscapesDataset

Cityscapes dataset.

mindspore.dataset.CocoDataset

COCO(Common Objects in Context) dataset.

mindspore.dataset.DIV2KDataset

DIV2K(DIVerse 2K resolution image) dataset.

mindspore.dataset.EMnistDataset

EMNIST(Extended MNIST) dataset.

mindspore.dataset.FakeImageDataset

A source dataset for generating fake images.

mindspore.dataset.FashionMnistDataset

Fashion-MNIST dataset.

mindspore.dataset.FlickrDataset

Flickr8k and Flickr30k datasets.

mindspore.dataset.Flowers102Dataset

Oxford 102 Flower dataset.

mindspore.dataset.Food101Dataset

Food101 dataset.

mindspore.dataset.ImageFolderDataset

A source dataset that reads images from a tree of directories.

mindspore.dataset.KITTIDataset

KITTI dataset.

mindspore.dataset.KMnistDataset

KMNIST(Kuzushiji-MNIST) dataset.

mindspore.dataset.LFWDataset

LFW(Labeled Faces in the Wild) dataset.

mindspore.dataset.LSUNDataset

LSUN(Large-scale Scene UNderstanding) dataset.

mindspore.dataset.ManifestDataset

A source dataset for reading images from a Manifest file.

mindspore.dataset.MnistDataset

MNIST dataset.

mindspore.dataset.OmniglotDataset

Omniglot dataset.

mindspore.dataset.PhotoTourDataset

PhotoTour dataset.

mindspore.dataset.Places365Dataset

Places365 dataset.

mindspore.dataset.QMnistDataset

QMNIST dataset.

mindspore.dataset.RenderedSST2Dataset

RenderedSST2(Rendered Stanford Sentiment Treebank v2) dataset.

mindspore.dataset.SBDataset

SB(Semantic Boundaries) Dataset.

mindspore.dataset.SBUDataset

SBU(SBU Captioned Photo) dataset.

mindspore.dataset.SemeionDataset

Semeion dataset.

mindspore.dataset.STL10Dataset

STL-10 dataset.

mindspore.dataset.SUN397Dataset

SUN397(Scene UNderstanding) dataset.

mindspore.dataset.SVHNDataset

SVHN(Street View House Numbers) dataset.

mindspore.dataset.USPSDataset

USPS(U.S. Postal Service) dataset.

mindspore.dataset.VOCDataset

VOC(Visual Object Classes) dataset.

mindspore.dataset.WIDERFaceDataset

WIDERFace dataset.

Text

mindspore.dataset.AGNewsDataset

AG News dataset.

mindspore.dataset.AmazonReviewDataset

Amazon Review Polarity and Amazon Review Full datasets.

mindspore.dataset.CLUEDataset

CLUE(Chinese Language Understanding Evaluation) dataset.

mindspore.dataset.CSVDataset

A source dataset that reads and parses comma-separated values (CSV) files as dataset.

mindspore.dataset.CoNLL2000Dataset

CoNLL-2000(Conference on Computational Natural Language Learning) chunking dataset.

mindspore.dataset.DBpediaDataset

DBpedia dataset.

mindspore.dataset.EnWik9Dataset

EnWik9 dataset.

mindspore.dataset.IMDBDataset

IMDb(Internet Movie Database) dataset.

mindspore.dataset.IWSLT2016Dataset

IWSLT2016(International Workshop on Spoken Language Translation) dataset.

mindspore.dataset.IWSLT2017Dataset

IWSLT2017(International Workshop on Spoken Language Translation) dataset.

mindspore.dataset.Multi30kDataset

Multi30k dataset.

mindspore.dataset.PennTreebankDataset

PennTreebank dataset.

mindspore.dataset.SogouNewsDataset

Sogou News dataset.

mindspore.dataset.SQuADDataset

SQuAD 1.1 and SQuAD 2.0 datasets.

mindspore.dataset.SST2Dataset

SST2(Stanford Sentiment Treebank v2) dataset.

mindspore.dataset.TextFileDataset

A source dataset that reads and parses datasets stored on disk in text format.

mindspore.dataset.UDPOSDataset

UDPOS(Universal Dependencies dataset for Part of Speech) dataset.

mindspore.dataset.WikiTextDataset

WikiText2 and WikiText103 datasets.

mindspore.dataset.YahooAnswersDataset

YahooAnswers dataset.

mindspore.dataset.YelpReviewDataset

Yelp Review Polarity and Yelp Review Full datasets.

Audio

mindspore.dataset.CMUArcticDataset

CMU Arctic dataset.

mindspore.dataset.GTZANDataset

GTZAN dataset.

mindspore.dataset.LibriTTSDataset

LibriTTS dataset.

mindspore.dataset.LJSpeechDataset

LJSpeech dataset.

mindspore.dataset.SpeechCommandsDataset

Speech Commands dataset.

mindspore.dataset.TedliumDataset

Tedlium dataset.

mindspore.dataset.YesNoDataset

YesNo dataset.

Others

mindspore.dataset.NumpySlicesDataset

Creates a dataset with given data slices, mainly for loading Python data into dataset.

mindspore.dataset.PaddedDataset

Creates a dataset with filler data provided by user.

mindspore.dataset.RandomDataset

A source dataset that generates random data.

Sampler

mindspore.dataset.DistributedSampler

A sampler that accesses a shard of the dataset. It helps divide the dataset into multiple subsets for distributed training.

mindspore.dataset.PKSampler

Samples K elements for each P class in the dataset.
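
To make the P and K terminology concrete, the following plain-Python sketch reproduces the idea without calling MindSpore. The helper `pk_sample` and the toy labels are hypothetical, and the real PKSampler may order or shuffle its samples differently.

```python
from collections import defaultdict

def pk_sample(labels, num_class, num_val):
    """Pick up to `num_val` sample indices for each of the first `num_class` classes."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    picked = []
    for label in sorted(by_class)[:num_class]:
        picked.extend(by_class[label][:num_val])
    return picked

labels = [0, 1, 0, 2, 1, 0, 2, 1]
picked = pk_sample(labels, num_class=2, num_val=2)
print(picked)  # [0, 2, 1, 4]: two samples of class 0, then two of class 1
```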

mindspore.dataset.RandomSampler

Samples the elements randomly.

mindspore.dataset.SequentialSampler

Samples the dataset elements sequentially, which is equivalent to not using a sampler.

mindspore.dataset.SubsetRandomSampler

Samples the elements randomly from a sequence of indices.

mindspore.dataset.SubsetSampler

Samples the elements from a sequence of indices.

mindspore.dataset.WeightedRandomSampler

Samples the elements from [0, len(weights) - 1] randomly with the given weights (probabilities).

Config

The configuration module provides various functions to set and get the supported configuration parameters, and read a configuration file.

mindspore.dataset.config.set_sending_batches

Set the upper limit on the number of batches of data that the Host can send to the Device.

mindspore.dataset.config.load

Load the project configuration from the file.

mindspore.dataset.config.set_seed

Set the seed for the random number generator in data pipeline.

mindspore.dataset.config.get_seed

Get random number seed.

mindspore.dataset.config.set_prefetch_size

Set the buffer queue size between dataset operations in the pipeline.

mindspore.dataset.config.get_prefetch_size

Get the prefetch size, in number of rows.

mindspore.dataset.config.set_num_parallel_workers

Set a new global configuration default value for the number of parallel workers.

mindspore.dataset.config.get_num_parallel_workers

Get the global configuration of number of parallel workers.

mindspore.dataset.config.set_numa_enable

Set whether NUMA is enabled by default.

mindspore.dataset.config.get_numa_enable

Get the state of numa to indicate enabled/disabled.

mindspore.dataset.config.set_monitor_sampling_interval

Set the default interval (in milliseconds) for monitor sampling.

mindspore.dataset.config.get_monitor_sampling_interval

Get the global configuration of sampling interval of performance monitor.

mindspore.dataset.config.set_callback_timeout

Set the default timeout (in seconds) for mindspore.dataset.WaitedDSCallback.

mindspore.dataset.config.get_callback_timeout

Get the default timeout (in seconds) for mindspore.dataset.WaitedDSCallback.

mindspore.dataset.config.set_auto_num_workers

Set num_parallel_workers for each op automatically (this feature is turned off by default).

mindspore.dataset.config.get_auto_num_workers

Get whether the automatic number of workers is turned on or off. It is disabled by default.

mindspore.dataset.config.set_enable_shared_mem

Set whether to use shared memory for interprocess communication when data processing multiprocessing is turned on.

mindspore.dataset.config.get_enable_shared_mem

Get whether shared memory is enabled by default.

mindspore.dataset.config.set_enable_autotune

Set whether to enable AutoTune for data pipeline parameters.

mindspore.dataset.config.get_enable_autotune

Get whether AutoTune is currently enabled. It is disabled by default.

mindspore.dataset.config.set_autotune_interval

Set the configuration adjustment interval (in steps) for AutoTune.

mindspore.dataset.config.get_autotune_interval

Get the current configuration adjustment interval (in steps) for AutoTune.

mindspore.dataset.config.set_auto_offload

Set the automatic offload flag of the dataset.

mindspore.dataset.config.get_auto_offload

Get the state of the automatic offload flag (True or False). It is disabled by default.

mindspore.dataset.config.set_enable_watchdog

Set whether the watchdog Python thread is enabled. It is enabled by default.

mindspore.dataset.config.get_enable_watchdog

Get the state of watchdog Python thread to indicate enabled or disabled state.

mindspore.dataset.config.set_fast_recovery

Set whether the dataset pipeline should recover in fast mode during failover. In fast mode, random augmentations may not produce the same results as before the failure occurred.

mindspore.dataset.config.get_fast_recovery

Get whether the fast recovery mode is enabled for the current dataset pipeline.

mindspore.dataset.config.set_multiprocessing_timeout_interval

Set the default interval (in seconds) for multiprocessing/multithreading timeout when main process/thread gets data from subprocesses/child threads.

mindspore.dataset.config.get_multiprocessing_timeout_interval

Get the global configuration of multiprocessing/multithreading timeout when main process/thread gets data from subprocesses/child threads.

mindspore.dataset.config.set_error_samples_mode

Set the method in which erroneous samples should be processed in a dataset pipeline.

mindspore.dataset.config.get_error_samples_mode

Get the currently configured strategy for processing erroneous samples in a dataset pipeline.

mindspore.dataset.config.ErrorSamplesMode

An enumeration for error_samples_mode.

mindspore.dataset.config.set_debug_mode

Set the debug_mode flag of the dataset pipeline.

mindspore.dataset.config.get_debug_mode

Get whether debug mode is currently enabled for the data pipeline.

Tools

mindspore.dataset.BatchInfo

This class helps to get dataset information dynamically when the input of batch_size or per_batch_map in batch operation is a callable object.

mindspore.dataset.DatasetCache

A client to interface with tensor caching service.

mindspore.dataset.DSCallback

Abstract base class used to build dataset callback classes.

mindspore.dataset.Schema

Class to represent a schema of a dataset.

mindspore.dataset.Shuffle

Specify the shuffle mode.

mindspore.dataset.WaitedDSCallback

Abstract base class used to build dataset callback classes that are synchronized with the training callback class mindspore.train.Callback.

mindspore.dataset.compare

Compare if two dataset pipelines are the same.

mindspore.dataset.debug.DebugHook

The base class for Dataset Pipeline Python Debugger hook.

mindspore.dataset.deserialize

Construct dataset pipeline from a JSON file produced by dataset serialize function.

mindspore.dataset.serialize

Serialize dataset pipeline into a JSON file.

mindspore.dataset.show

Write the dataset pipeline graph to logger.info file.

mindspore.dataset.sync_wait_for_dataset

Wait until the dataset files required by all devices are downloaded.

mindspore.dataset.utils.imshow_det_bbox

Draw an image with given bboxes and class labels (with scores).

mindspore.dataset.utils.LineReader

Line-based file reader.