Class Dataset
Defined in File datasets.h
Inheritance Relationships
Base Type
public std::enable_shared_from_this< Dataset >
Derived Types
public mindspore::dataset::AlbumDataset (Class AlbumDataset)
public mindspore::dataset::BatchDataset (Class BatchDataset)
public mindspore::dataset::MapDataset (Class MapDataset)
public mindspore::dataset::MnistDataset (Class MnistDataset)
public mindspore::dataset::ProjectDataset (Class ProjectDataset)
public mindspore::dataset::ShuffleDataset (Class ShuffleDataset)
Class Documentation
-
class Dataset : public std::enable_shared_from_this<Dataset>
A base class to represent a dataset in the data pipeline.
Subclassed by mindspore::dataset::AlbumDataset, mindspore::dataset::BatchDataset, mindspore::dataset::MapDataset, mindspore::dataset::MnistDataset, mindspore::dataset::ProjectDataset, mindspore::dataset::ShuffleDataset
Public Functions
-
Dataset()
Constructor.
-
virtual ~Dataset() = default
Destructor.
-
int64_t GetDatasetSize(bool estimate = false)
Gets the dataset size.
- Parameters
estimate – [in] Only supported by some ops; when true, speeds up computing the dataset size at the expense of accuracy.
- Returns
dataset size. If failed, return -1
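A hedged usage sketch (not from the original documentation; assumes `folder_path` points at a valid MNIST directory and the usual MindSpore headers are included):

```cpp
/* Hypothetical sketch: query the pipeline size before and after batching.
   Assumes folder_path points at a valid MNIST directory. */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all");
int64_t rows = ds->GetDatasetSize();     // number of rows, or -1 on failure
ds = ds->Batch(32, true);
int64_t batches = ds->GetDatasetSize();  // rows / 32, remainder dropped
```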
-
std::vector<mindspore::DataType> GetOutputTypes()
Gets the output types.
- Returns
a vector of DataType. If failed, return an empty vector
-
std::vector<std::vector<int64_t>> GetOutputShapes()
Gets the output shapes.
- Returns
a vector of TensorShape. If failed, return an empty vector
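A hedged sketch of inspecting per-column output metadata (not from the original documentation; assumes `ds` is a valid Dataset and `<iostream>` is included):

```cpp
/* Hypothetical sketch: inspect the types and shapes of each output column.
   Entries are in column order; both vectors are empty on failure. */
std::vector<mindspore::DataType> types = ds->GetOutputTypes();
std::vector<std::vector<int64_t>> shapes = ds->GetOutputShapes();
for (size_t i = 0; i < shapes.size(); ++i) {
  // Each entry describes one output column of the pipeline.
  std::cout << "column " << i << " rank: " << shapes[i].size() << std::endl;
}
```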
-
int64_t GetBatchSize()
Gets the batch size.
- Returns
int64_t
-
int64_t GetRepeatCount()
Gets the repeat count.
- Returns
int64_t
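A hedged sketch of the batch-size and repeat-count getters (not from the original documentation; assumes `ds` is a valid Dataset):

```cpp
/* Hypothetical sketch: the getters reflect the Batch/Repeat nodes
   currently present in the pipeline. */
ds = ds->Batch(16);
int64_t batch_size = ds->GetBatchSize();  // 16 after the Batch above
int64_t repeats = ds->GetRepeatCount();   // 1 if no repeat was applied
```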
-
int64_t GetNumClasses()
Gets the number of classes.
- Returns
number of classes. If failed, return -1
-
inline std::vector<std::string> GetColumnNames()
Gets the column names.
- Returns
Names of the columns. If failed, return an empty vector
-
inline std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing()
Gets the class indexing.
- Returns
a vector of ClassIndexing pairs (class name, label indices). If failed, return an empty vector
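A hedged sketch combining the two getters above (not from the original documentation; class indexing is only meaningful for labeled datasets, and `ds` is assumed valid):

```cpp
/* Hypothetical sketch: list the column names, then walk the class indexing. */
std::vector<std::string> columns = ds->GetColumnNames();  // e.g. {"image", "label"}
std::vector<std::pair<std::string, std::vector<int32_t>>> indexing = ds->GetClassIndexing();
for (const auto &entry : indexing) {
  // entry.first is the class name; entry.second holds its label indices.
}
```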
-
std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers)
Setter function for runtime number of workers.
- Parameters
num_workers – [in] The number of threads in this operator
- Returns
Shared pointer to the original object
Example
/* Set number of workers (threads) to process the dataset in parallel */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->SetNumWorkers(16);
-
std::shared_ptr<PullIterator> CreatePullBasedIterator(const std::vector<std::vector<char>> &columns = {})
Function to create a PullBasedIterator over the Dataset.
- Parameters
columns – [in] List of columns to be used to specify the order of columns
- Returns
Shared pointer to the Iterator
Example
/* dataset is an instance of Dataset object */
std::shared_ptr<PullIterator> iter = dataset->CreatePullBasedIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);
-
inline std::shared_ptr<Iterator> CreateIterator(const std::vector<std::string> &columns = {}, int32_t num_epochs = -1)
Function to create an Iterator over the Dataset pipeline.
- Parameters
columns – [in] List of columns to be used to specify the order of columns
num_epochs – [in] Number of epochs to run through the pipeline (default=-1, which means infinite epochs). An empty row is returned at the end of each epoch.
- Returns
Shared pointer to the Iterator
Example
/* dataset is an instance of Dataset object */
std::shared_ptr<Iterator> iter = dataset->CreateIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);
-
inline bool DeviceQueue(const std::string &queue_name = "", const std::string &device_type = "", int32_t device_id = 0, int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false)
Function to transfer data through a device.
Note
If device is Ascend, features of data will be transferred one by one. The limitation of data transmission per time is 256M.
- Parameters
queue_name – [in] Channel name (default=””, create new unique name).
device_type – [in] Type of device (default=””, get from MSContext).
device_id – [in] ID of device (default=0, get from MSContext).
num_epochs – [in] Number of epochs (default=-1, infinite epochs).
send_epoch_end – [in] Whether to send end of sequence to device or not (default=true).
total_batches – [in] Number of batches to be sent to the device (default=0, all data).
create_data_info_queue – [in] Whether to create a queue which stores the types and shapes of data or not (default=false).
- Returns
Returns true if no error encountered else false.
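A hedged sketch of a device-queue transfer (not from the original documentation; assumes an Ascend device is available and `folder_path` points at a valid image folder):

```cpp
/* Hypothetical sketch: push the pipeline output to a device queue.
   An empty queue name and device type are resolved from MSContext. */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
bool ok = ds->DeviceQueue("", "Ascend", 0, 1);  // one epoch on device 0
if (!ok) {
  // Handle the transfer failure.
}
```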
-
inline bool Save(const std::string &dataset_path, int32_t num_files = 1, const std::string &dataset_type = "mindrecord")
Function to create a Saver to save the dynamic data processed by the dataset pipeline.
Note
Usage restrictions:
Supported dataset formats: ‘mindrecord’ only
To save the samples in order, set dataset’s shuffle to false and num_files to 1.
Before calling this function, do not use a batch operator, a repeat operator, or data augmentation operators with random attributes in a map operator.
MindRecord does not support bool, uint64, multi-dimensional uint8 (drop dimension), or multi-dimensional string.
- Parameters
dataset_path – [in] Path to dataset file
num_files – [in] Number of dataset files (default=1)
dataset_type – [in] Dataset format (default=”mindrecord”)
- Returns
Returns true if no error encountered else false
Example
/* Create a dataset and save its data into MindRecord */
std::string folder_path = "/path/to/cifar_dataset";
std::shared_ptr<Dataset> ds = Cifar10(folder_path, "all", std::make_shared<SequentialSampler>(0, 10));
std::string save_file = "Cifar10Data.mindrecord";
bool rc = ds->Save(save_file);
-
std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false)
Function to create a BatchDataset.
Note
Combines batch_size number of consecutive rows into batches
- Parameters
batch_size – [in] The number of rows each batch is created with
drop_remainder – [in] Determines whether or not to drop the last possibly incomplete batch. If true, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the next node
- Returns
Shared pointer to the current BatchDataset
Example
/* Create a dataset where every 100 rows are combined into a batch */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->Batch(100, true);
-
inline std::shared_ptr<MapDataset> Map(const std::vector<TensorTransform*> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset
- Parameters
operations – [in] Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column
output_columns – [in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, output_columns will have the same names as the input columns, i.e., the columns will be replaced
project_columns – [in] A list of column names to project
cache – [in] Tensor cache to use. (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.
- Returns
Shared pointer to the current MapDataset
Example
// Create objects for the tensor ops.
std::shared_ptr<TensorTransform> decode_op = std::make_shared<vision::Decode>(true);
std::shared_ptr<TensorTransform> random_jitter_op = std::make_shared<vision::RandomColor>(0.0, 0.0);

/* 1) Simple map example */

// Apply decode_op on column "image". This column will be replaced by the output
// column of decode_op. Since column_order is not provided, both columns "image"
// and "label" will be propagated to the child node in their original order.
dataset = dataset->Map({decode_op}, {"image"});

// Decode and rename column "image" to "decoded_image".
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"});

// Specify the order of the output columns.
dataset = dataset->Map({decode_op}, {"image"}, {}, {"label", "image"});

// Rename column "image" to "decoded_image" and also specify the order of the output columns.
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"}, {"label", "decoded_image"});

// Rename column "image" to "decoded_image" and keep only this column.
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"}, {"decoded_image"});

/* 2) Map example with more than one operation */

// Create a dataset where the images are decoded, then randomly color jittered.
// decode_op takes column "image" as input and outputs one column. The column
// output by decode_op is passed as input to random_jitter_op.
// random_jitter_op will output one column. Column "image" will be replaced by
// the column output by random_jitter_op (the very last operation). All other
// columns are unchanged. Since column_order is not specified, the order of the
// columns will remain the same.
dataset = dataset->Map({decode_op, random_jitter_op}, {"image"});
-
inline std::shared_ptr<MapDataset> Map(const std::vector<std::shared_ptr<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset
- Parameters
operations – [in] Vector of shared pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column
output_columns – [in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, output_columns will have the same names as the input columns, i.e., the columns will be replaced
project_columns – [in] A list of column names to project
cache – [in] Tensor cache to use. (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.
- Returns
Shared pointer to the current MapDataset
-
inline std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset
- Parameters
operations – [in] Vector of TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column
output_columns – [in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, output_columns will have the same names as the input columns, i.e., the columns will be replaced
project_columns – [in] A list of column names to project
cache – [in] Tensor cache to use. (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.
- Returns
Shared pointer to the current MapDataset
-
inline std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns)
Function to create a Project Dataset.
Note
Applies project to the dataset
- Parameters
columns – [in] The name of columns to project
- Returns
Shared pointer to the current Dataset
Example
/* Reorder the original columns in the dataset */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Project({"label", "image"});
-
inline std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size)
Function to create a Shuffle Dataset.
Note
Randomly shuffles the rows of this dataset
- Parameters
buffer_size – [in] The size of the buffer (must be larger than 1) for shuffling
- Returns
Shared pointer to the current ShuffleDataset
Example
/* Shuffle the rows in the dataset */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Shuffle(4);