Class Dataset

Inheritance Relationships

Base Type

  • public std::enable_shared_from_this< Dataset >

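Derived Types

  • public mindspore::dataset::AlbumDataset

  • public mindspore::dataset::BatchDataset

  • public mindspore::dataset::MapDataset

  • public mindspore::dataset::MnistDataset

  • public mindspore::dataset::ProjectDataset

  • public mindspore::dataset::ShuffleDataset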

Class Documentation

class Dataset : public std::enable_shared_from_this<Dataset>

A base class to represent a dataset in the data pipeline.

Subclassed by mindspore::dataset::AlbumDataset, mindspore::dataset::BatchDataset, mindspore::dataset::MapDataset, mindspore::dataset::MnistDataset, mindspore::dataset::ProjectDataset, mindspore::dataset::ShuffleDataset

Public Functions

Dataset()

Constructor.

virtual ~Dataset() = default

Destructor.

int64_t GetDatasetSize(bool estimate = false)

Gets the dataset size.

Parameters

estimate[in] If true, return an estimated size, which is faster to compute but less accurate. Only some of the ops support this.

Returns

dataset size. If failed, return -1

std::vector<mindspore::DataType> GetOutputTypes()

Gets the output type.

Returns

a vector of DataType. If failed, return an empty vector

std::vector<std::vector<int64_t>> GetOutputShapes()

Gets the output shape.

Returns

a vector of shapes, each given as a vector of int64_t dimensions. If failed, return an empty vector

int64_t GetBatchSize()

Gets the batch size.

Returns

The batch size, as an int64_t

int64_t GetRepeatCount()

Gets the repeat count.

Returns

The repeat count, as an int64_t

int64_t GetNumClasses()

Gets the number of classes.

Returns

number of classes. If failed, return -1

inline std::vector<std::string> GetColumnNames()

Gets the column names.

Returns

Names of the columns. If failed, return an empty vector

inline std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing()

Gets the class indexing.

Returns

The class indexing as a vector of (name, indices) pairs. If failed, return an empty vector
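
Because the return type is a vector of pairs rather than an associative container, callers that need lookups by name may want to convert it first. The following is a minimal sketch under that assumption; the alias `ClassIndexing` and the helper `ToLookup` are hypothetical names introduced here, not part of the documented API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Same shape as the documented return type of GetClassIndexing().
using ClassIndexing = std::vector<std::pair<std::string, std::vector<int32_t>>>;

// Hypothetical helper: builds a name -> indices lookup table.
// If a name appears more than once, the first entry is kept.
inline std::unordered_map<std::string, std::vector<int32_t>> ToLookup(
    const ClassIndexing &indexing) {
  std::unordered_map<std::string, std::vector<int32_t>> lookup;
  for (const auto &entry : indexing) {
    lookup.emplace(entry.first, entry.second);  // emplace does not overwrite
  }
  return lookup;
}
```

An empty input vector, which is what the function returns on failure, simply yields an empty lookup table.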

std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers)

Setter function for runtime number of workers.

Parameters

num_workers[in] The number of threads in this operator

Returns

Shared pointer to the original object

Example
/* Set number of workers(threads) to process the dataset in parallel */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->SetNumWorkers(16);
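
The class derives from std::enable_shared_from_this<Dataset>, which is the standard mechanism that lets a member function hand back a shared pointer to the object it was called on. The sketch below shows that pattern with a hypothetical `ToyDataset` stand-in; the member names and the precondition are assumptions for illustration and do not reflect the MindSpore implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for Dataset; not the MindSpore class.
class ToyDataset : public std::enable_shared_from_this<ToyDataset> {
 public:
  // Stores the worker count and returns a pointer to this same object,
  // sharing ownership with the caller's existing shared_ptr.
  std::shared_ptr<ToyDataset> SetNumWorkers(int32_t num_workers) {
    assert(num_workers > 0);  // assumed precondition for this sketch
    num_workers_ = num_workers;
    return shared_from_this();
  }

  int32_t num_workers() const { return num_workers_; }

 private:
  int32_t num_workers_ = 1;
};
```

With this pattern, `ds = ds->SetNumWorkers(16);` leaves `ds` pointing at the same object. Note that shared_from_this() is only valid when the object is already owned by a std::shared_ptr, for example one created with std::make_shared.
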
std::shared_ptr<PullIterator> CreatePullBasedIterator(const std::vector<std::vector<char>> &columns = {})

Function to create a pull-based iterator (PullIterator) over the Dataset.

Parameters

columns[in] List of columns to be used to specify the order of columns

Returns

Shared pointer to the Iterator

Example
/* dataset is an instance of Dataset object */
std::shared_ptr<Iterator> iter = dataset->CreatePullBasedIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);
inline std::shared_ptr<Iterator> CreateIterator(const std::vector<std::string> &columns = {}, int32_t num_epochs = -1)

Function to create an Iterator over the Dataset pipeline.

Parameters
  • columns[in] List of columns to be used to specify the order of columns

  • num_epochs[in] Number of epochs to run through the pipeline, default -1 which means infinite epochs. An empty row is returned at the end of each epoch

Returns

Shared pointer to the Iterator

Example
/* dataset is an instance of Dataset object */
std::shared_ptr<Iterator> iter = dataset->CreateIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);
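
The num_epochs description says an empty row is returned at the end of each epoch, so a consumer typically loops until it sees one. The sketch below illustrates that loop with hypothetical stand-ins: `ToyRow` uses int values instead of mindspore::MSTensor, and `ToyIterator` only mimics the empty-row convention.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row: column name -> value.
using ToyRow = std::unordered_map<std::string, int>;

// Toy iterator that yields its rows, then one empty row to mark the
// end of the epoch, then starts over for the next epoch.
class ToyIterator {
 public:
  explicit ToyIterator(std::vector<ToyRow> rows) : rows_(std::move(rows)) {}

  void GetNextRow(ToyRow *row) {
    assert(row != nullptr);
    if (pos_ == rows_.size()) {  // epoch boundary: hand back an empty row
      row->clear();
      pos_ = 0;
      return;
    }
    *row = rows_[pos_++];
  }

 private:
  std::vector<ToyRow> rows_;
  std::size_t pos_ = 0;
};

// Consumes exactly one epoch and returns the number of rows seen.
inline int CountRowsInEpoch(ToyIterator *iter) {
  int count = 0;
  ToyRow row;
  iter->GetNextRow(&row);
  while (!row.empty()) {  // an empty row signals the end of the epoch
    ++count;
    iter->GetNextRow(&row);
  }
  return count;
}
```

Calling `CountRowsInEpoch` repeatedly walks through successive epochs, which mirrors how a training loop would consume one epoch at a time.
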
inline bool DeviceQueue(const std::string &queue_name = "", const std::string &device_type = "", int32_t device_id = 0, int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false)

Function to transfer data through a device.

Note

If the device is Ascend, the features of the data are transferred one by one. Each transfer is limited to 256M.

Parameters
  • queue_name[in] Channel name (default="", create new unique name).

  • device_type[in] Type of device (default="", get from MSContext).

  • device_id[in] ID of device (default=0, get from MSContext).

  • num_epochs[in] Number of epochs (default=-1, infinite epochs).

  • send_epoch_end[in] Whether to send end of sequence to device or not (default=true).

  • total_batches[in] Number of batches to be sent to the device (default=0, all data).

  • create_data_info_queue[in] Whether to create a queue that stores the types and shapes of the data (default=false).

Returns

Returns true if no error encountered else false.

inline bool Save(const std::string &dataset_path, int32_t num_files = 1, const std::string &dataset_type = "mindrecord")

Function to create a Saver to save the dynamic data processed by the dataset pipeline.

Note

Usage restrictions:

  1. Supported dataset formats: "mindrecord" only.

  2. To save the samples in order, set the dataset's shuffle to false and num_files to 1.

  3. Before calling this function, do not use the batch operator, the repeat operator, or data augmentation operators with random attributes in the map operator.

  4. Mindrecord does not support bool, uint64, multi-dimensional uint8 (drop dimension), or multi-dimensional string.

Parameters
  • dataset_path[in] Path to dataset file

  • num_files[in] Number of dataset files (default=1)

  • dataset_type[in] Dataset format (default="mindrecord")

Returns

Returns true if no error encountered else false

Example
/* Create a dataset and save its data into MindRecord */
std::string folder_path = "/path/to/cifar_dataset";
std::shared_ptr<Dataset> ds = Cifar10(folder_path, "all", std::make_shared<SequentialSampler>(0, 10));
std::string save_file = "Cifar10Data.mindrecord";
bool rc = ds->Save(save_file);
std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false)

Function to create a BatchDataset.

Note

Combines every batch_size consecutive rows into one batch

Parameters
  • batch_size[in] The number of rows each batch is created with

  • drop_remainder[in] Determines whether to drop the last, possibly incomplete, batch. If true and fewer than batch_size rows are available for the last batch, those rows are dropped and not propagated to the next node

Returns

Shared pointer to the current BatchDataset

Example
/* Create a dataset where every 100 rows are combined into a batch */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->Batch(100, true);
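
To make the effect of drop_remainder concrete, the sketch below batches a plain vector of ints. `ToyBatch` is a hypothetical helper written for this illustration; it follows the behaviour described in the parameter list and is not the MindSpore implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Groups consecutive rows into batches of batch_size. When drop_remainder
// is true, a final batch with fewer than batch_size rows is discarded.
inline std::vector<std::vector<int>> ToyBatch(const std::vector<int> &rows,
                                              std::size_t batch_size,
                                              bool drop_remainder = false) {
  assert(batch_size > 0);
  std::vector<std::vector<int>> batches;
  for (std::size_t start = 0; start < rows.size(); start += batch_size) {
    std::size_t end = start + batch_size;
    if (end > rows.size()) {
      if (drop_remainder) break;  // incomplete last batch is dropped
      end = rows.size();          // otherwise keep the shorter batch
    }
    batches.emplace_back(rows.begin() + start, rows.begin() + end);
  }
  return batches;
}
```

With 7 rows and a batch size of 3, this produces 3 batches (the last holding a single row) when drop_remainder is false, and 2 full batches when it is true.
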
inline std::shared_ptr<MapDataset> Map(const std::vector<TensorTransform*> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset

Parameters
  • operations[in] Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column

  • output_columns[in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, the output columns have the same names as the input columns, i.e., the columns are replaced

  • project_columns[in] A list of column names to project

  • cache[in] Tensor cache to use. (default=nullptr which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current MapDataset

Example
// Create objects for the tensor ops
std::shared_ptr<TensorTransform> decode_op = std::make_shared<vision::Decode>(true);
std::shared_ptr<TensorTransform> random_color_op = std::make_shared<vision::RandomColor>(0.0, 0.0);

/* 1) Simple map example */
// Apply decode_op on column "image". This column will be replaced by the output
// column of decode_op. Since project_columns is not provided, both columns "image"
// and "label" will be propagated to the child node in their original order.
dataset = dataset->Map({decode_op}, {"image"});

// Decode and rename column "image" to "decoded_image".
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"});

// Specify the order of the output columns.
dataset = dataset->Map({decode_op}, {"image"}, {}, {"label", "image"});

// Rename column "image" to "decoded_image" and also specify the order of the output columns.
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"}, {"label", "decoded_image"});

// Rename column "image" to "decoded_image" and keep only this column.
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"}, {"decoded_image"});

/* 2) Map example with more than one operation */
// Create a dataset where the images are decoded, then randomly color jittered.
// decode_op takes column "image" as input and outputs one column. The column
// output by decode_op is passed as input to random_color_op.
// random_color_op will output one column. Column "image" will be replaced by
// the column output by random_color_op (the very last operation). All other
// columns are unchanged. Since project_columns is not specified, the order of the
// columns will remain the same.
dataset = dataset->Map({decode_op, random_color_op}, {"image"});
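
The chaining and renaming rules described for Map can be sketched on a toy row. Everything below is a hypothetical single-column simplification: `ToyColumns`, `ToyOp`, and `ToyMap` are names invented for this illustration, and values are ints rather than tensors.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using ToyColumns = std::map<std::string, int>;  // column name -> value
using ToyOp = std::function<int(int)>;          // single-column transform

// Applies ops in order to input_column. The result replaces input_column,
// or is stored under output_column (removing input_column) when a
// different, non-empty name is given.
inline ToyColumns ToyMap(ToyColumns row, const std::vector<ToyOp> &ops,
                         const std::string &input_column,
                         const std::string &output_column = "") {
  auto it = row.find(input_column);
  assert(it != row.end());  // assumed: the input column must exist
  int value = it->second;
  for (const auto &op : ops) value = op(value);  // each output feeds the next op
  if (!output_column.empty() && output_column != input_column) {
    row.erase(it);
    row[output_column] = value;
  } else {
    it->second = value;
  }
  return row;
}
```

Columns that are not named in input_column pass through unchanged, matching the statement above that all other columns are left as they are.
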
inline std::shared_ptr<MapDataset> Map(const std::vector<std::shared_ptr<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset

Parameters
  • operations[in] Vector of shared pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column

  • output_columns[in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, the output columns have the same names as the input columns, i.e., the columns are replaced

  • project_columns[in] A list of column names to project

  • cache[in] Tensor cache to use. (default=nullptr which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current MapDataset

inline std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset

Parameters
  • operations[in] Vector of TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column

  • output_columns[in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, the output columns have the same names as the input columns, i.e., the columns are replaced

  • project_columns[in] A list of column names to project

  • cache[in] Tensor cache to use. (default=nullptr which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current MapDataset

inline std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns)

Function to create a ProjectDataset.

Note

Applies a projection to the dataset: only the listed columns are kept, in the given order

Parameters

columns[in] The names of the columns to project

Returns

Shared pointer to the current ProjectDataset

Example
/* Reorder the original column names in dataset */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Project({"label", "image"});
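
As an illustration of what projecting columns does to a single row, the sketch below selects and reorders named columns. `OrderedRow` and `ToyProject` are hypothetical names for this example, and the rule that every requested column must exist is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type that preserves column order: (name, value) pairs.
using OrderedRow = std::vector<std::pair<std::string, int>>;

// Returns the requested columns in the requested order; columns that
// are not listed are dropped.
inline OrderedRow ToyProject(const OrderedRow &row,
                             const std::vector<std::string> &columns) {
  OrderedRow out;
  for (const auto &name : columns) {
    bool found = false;
    for (const auto &col : row) {
      if (col.first == name) {
        out.push_back(col);
        found = true;
        break;
      }
    }
    assert(found);  // assumed: every projected column must exist
    (void)found;    // avoids an unused-variable warning when NDEBUG is set
  }
  return out;
}
```

Projecting `{"label", "image"}` swaps the column order, while projecting `{"image"}` alone drops the label column.
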
inline std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size)

Function to create a ShuffleDataset.

Note

Randomly shuffles the rows of this dataset

Parameters

buffer_size[in] The size of the buffer (must be larger than 1) for shuffling

Returns

Shared pointer to the current ShuffleDataset

Example
/* Shuffle the rows of the dataset using a buffer of size 10 */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Shuffle(10);
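
The buffer_size parameter suggests a buffer-based streaming shuffle. The sketch below shows one common form of that technique on a vector of ints; it is an assumption about the general approach, not a description of how MindSpore implements Shuffle, and `ToyShuffle` and its `seed` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Buffer-based shuffle: keep up to buffer_size rows in memory, emit a
// randomly chosen one, and refill the freed slot with the next row.
inline std::vector<int> ToyShuffle(const std::vector<int> &rows,
                                   std::size_t buffer_size, unsigned seed) {
  assert(buffer_size > 1);  // mirrors the documented constraint
  std::mt19937 rng(seed);
  std::vector<int> buffer;
  std::vector<int> out;
  out.reserve(rows.size());
  // Moves a random buffered row to the output.
  auto emit_random = [&]() {
    std::uniform_int_distribution<std::size_t> pick(0, buffer.size() - 1);
    std::swap(buffer[pick(rng)], buffer.back());
    out.push_back(buffer.back());
    buffer.pop_back();
  };
  for (int row : rows) {
    if (buffer.size() == buffer_size) emit_random();  // make room first
    buffer.push_back(row);
  }
  while (!buffer.empty()) emit_random();  // drain what is left
  return out;
}
```

In this scheme a larger buffer mixes rows from further apart in the input, at the cost of holding more rows in memory at once.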