Class Dataset

Defined in File datasets.h

Inheritance Relationships

Base Type

public std::enable_shared_from_this< Dataset >

Derived Types

public mindspore::dataset::AlbumDataset (Class AlbumDataset)
public mindspore::dataset::BatchDataset (Class BatchDataset)
public mindspore::dataset::BucketBatchByLengthDataset (Class BucketBatchByLengthDataset)
public mindspore::dataset::CLUEDataset (Class CLUEDataset)
public mindspore::dataset::CSVDataset (Class CSVDataset)
public mindspore::dataset::CelebADataset (Class CelebADataset)
public mindspore::dataset::Cifar100Dataset (Class Cifar100Dataset)
public mindspore::dataset::Cifar10Dataset (Class Cifar10Dataset)
public mindspore::dataset::CityscapesDataset (Class CityscapesDataset)
public mindspore::dataset::CocoDataset (Class CocoDataset)
public mindspore::dataset::ConcatDataset (Class ConcatDataset)
public mindspore::dataset::DIV2KDataset (Class DIV2KDataset)
public mindspore::dataset::FilterDataset (Class FilterDataset)
public mindspore::dataset::FlickrDataset (Class FlickrDataset)
public mindspore::dataset::ImageFolderDataset (Class ImageFolderDataset)
public mindspore::dataset::ManifestDataset (Class ManifestDataset)
public mindspore::dataset::MapDataset (Class MapDataset)
public mindspore::dataset::MindDataDataset (Class MindDataDataset)
public mindspore::dataset::MnistDataset (Class MnistDataset)
public mindspore::dataset::ProjectDataset (Class ProjectDataset)
public mindspore::dataset::RandomDataDataset (Class RandomDataDataset)
public mindspore::dataset::RenameDataset (Class RenameDataset)
public mindspore::dataset::RepeatDataset (Class RepeatDataset)
public mindspore::dataset::SBUDataset (Class SBUDataset)
public mindspore::dataset::ShuffleDataset (Class ShuffleDataset)
public mindspore::dataset::SkipDataset (Class SkipDataset)
public mindspore::dataset::TFRecordDataset (Class TFRecordDataset)
public mindspore::dataset::TakeDataset (Class TakeDataset)
public mindspore::dataset::TextFileDataset (Class TextFileDataset)
public mindspore::dataset::USPSDataset (Class USPSDataset)
public mindspore::dataset::VOCDataset (Class VOCDataset)
public mindspore::dataset::ZipDataset (Class ZipDataset)

Class Documentation

class Dataset : public std::enable_shared_from_this<Dataset>

A base class to represent a dataset in the data pipeline.

Subclassed by mindspore::dataset::AlbumDataset, mindspore::dataset::BatchDataset, mindspore::dataset::BucketBatchByLengthDataset, mindspore::dataset::CLUEDataset, mindspore::dataset::CSVDataset, mindspore::dataset::CelebADataset, mindspore::dataset::Cifar100Dataset, mindspore::dataset::Cifar10Dataset, mindspore::dataset::CityscapesDataset, mindspore::dataset::CocoDataset, mindspore::dataset::ConcatDataset, mindspore::dataset::DIV2KDataset, mindspore::dataset::FilterDataset, mindspore::dataset::FlickrDataset, mindspore::dataset::ImageFolderDataset, mindspore::dataset::ManifestDataset, mindspore::dataset::MapDataset, mindspore::dataset::MindDataDataset, mindspore::dataset::MnistDataset, mindspore::dataset::ProjectDataset, mindspore::dataset::RandomDataDataset, mindspore::dataset::RenameDataset, mindspore::dataset::RepeatDataset, mindspore::dataset::SBUDataset, mindspore::dataset::ShuffleDataset, mindspore::dataset::SkipDataset, mindspore::dataset::TFRecordDataset, mindspore::dataset::TakeDataset, mindspore::dataset::TextFileDataset, mindspore::dataset::USPSDataset, mindspore::dataset::VOCDataset, mindspore::dataset::ZipDataset

Public Functions

Dataset(): Constructor.

~Dataset() = default: Destructor.

int64_t GetDatasetSize(bool estimate = false)

Get the dataset size.

Parameters: estimate – [in] Supported by only some of the ops. When true, speeds up getting the dataset size at the expense of accuracy.
Returns: Dataset size. If failed, return -1.

std::vector<mindspore::DataType> GetOutputTypes()

Get the output type.

Returns: A vector containing the output DataType of the dataset. If failed, return an empty vector.

std::vector<std::vector<int64_t>> GetOutputShapes()

Get the output shape.

Returns: A vector containing the output TensorShape of the dataset. If failed, return an empty vector.

int64_t GetBatchSize()

Get the batch size.

Returns: Batch size configuration of dataset.

int64_t GetRepeatCount()

Get the repeat count.

Returns: Repeat count configuration of dataset.

int64_t GetNumClasses()

Get the number of classes.

Returns: Number of classes of dataset. If failed, return -1.

inline std::vector<std::string> GetColumnNames()

Get the column names.

Returns: A vector containing all column names of the dataset. If failed, return an empty vector.

inline std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing()

Get the class indexing.

Returns: A vector of pairs holding the ClassIndexing of the dataset. If failed, return an empty vector.

std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers)

Function to set runtime number of workers.

Parameters: num_workers – [in] The number of threads in this operator.
Returns: Shared pointer to the original object.

std::shared_ptr<PullIterator> CreatePullBasedIterator(std::vector<std::vector<char>> columns = {})

Function to create a PullBasedIterator over the Dataset.

Parameters: columns – [in] List of columns to be used to specify the order of columns.
Returns: Shared pointer to the Iterator.

inline std::shared_ptr<Iterator> CreateIterator(std::vector<std::string> columns = {}, int32_t num_epochs = -1)

Function to create an Iterator over the Dataset pipeline.

Parameters

columns – [in] List of columns to be used to specify the order of columns.
num_epochs – [in] Number of epochs to run through the pipeline (default=-1, which means infinite epochs). An empty row is returned at the end of each epoch.

Returns

Shared pointer to the Iterator.

inline bool DeviceQueue(std::string queue_name = "", std::string device_type = "", int32_t device_id = 0, int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false)

Function to transfer data through a device.

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Parameters

queue_name – [in] Channel name (default="", create new unique name).
device_type – [in] Type of device (default="", get from MSContext).
device_id – [in] ID of device (default=0, get from MSContext).
num_epochs – [in] Number of epochs (default=-1, infinite epochs).
send_epoch_end – [in] Whether to send end of sequence to device or not (default=true).
total_batches – [in] Number of batches to be sent to the device (default=0, all data).
create_data_info_queue – [in] Whether to create queue which stores types and shapes of data or not (default=false).

Returns

true if no error is encountered, false otherwise.

inline bool Save(std::string dataset_path, int32_t num_files = 1, std::string dataset_type = "mindrecord")

Function to create a Saver to save the dynamic data processed by the dataset pipeline.

Note

Usage restrictions:

Supported dataset formats: ‘mindrecord’ only.
To save the samples in order, set dataset’s shuffle to false and num_files to 1.
Before calling the function, do not use the batch operator, the repeat operator, or data augmentation operators with a random attribute in the map operator.
Mindrecord does not support bool, uint64, multi-dimensional uint8 (drop dimension), or multi-dimensional string.

Parameters

dataset_path – [in] Path to dataset file.
num_files – [in] Number of dataset files (default=1).
dataset_type – [in] Dataset format (default="mindrecord").

Returns

true if no error is encountered, false otherwise.

std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false)

Function to create a BatchDataset.

Note

Combines batch_size number of consecutive rows into batches.

Parameters

batch_size – [in] The number of rows each batch is created with.
drop_remainder – [in] Determines whether or not to drop the last possibly incomplete batch. If true, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the next node.

Returns

Shared pointer to the current Dataset.
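
The effect of drop_remainder on the number of batches can be sketched with plain arithmetic. The helper below, BatchCount, is a hypothetical name and not part of the MindSpore API. It only mirrors the behavior described for batch_size and drop_remainder above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a MindSpore function): number of batches produced
// from num_rows rows, following the drop_remainder description above.
int64_t BatchCount(int64_t num_rows, int32_t batch_size, bool drop_remainder) {
  assert(num_rows >= 0 && batch_size > 0);
  const int64_t full_batches = num_rows / batch_size;
  const bool has_partial = (num_rows % batch_size) != 0;
  // A trailing partial batch is kept unless drop_remainder is true.
  return full_batches + ((has_partial && !drop_remainder) ? 1 : 0);
}
```

For instance, 10 rows with batch_size 3 give 4 batches when drop_remainder is false and 3 batches when it is true.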

inline std::shared_ptr<BucketBatchByLengthDataset> BucketBatchByLength(const std::vector<std::string> &column_names, const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes, std::function<MSTensorVec(MSTensorVec)> element_length_function = nullptr, const std::map<std::string, std::pair<std::vector<int64_t>, MSTensor>> &pad_info = {}, bool pad_to_bucket_boundary = false, bool drop_remainder = false)

Function to create a BucketBatchByLengthDataset.

Note

Bucket elements according to their lengths. Each bucket will be padded and batched when it is full.

Parameters

column_names – [in] Columns passed to element_length_function.
bucket_boundaries – [in] A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).
bucket_batch_sizes – [in] A list consisting of the batch sizes for each bucket. Its size must equal the size of bucket_boundaries + 1.
element_length_function – [in] A function pointer that takes in an MSTensorVec and outputs an MSTensorVec. The output must contain a single tensor containing a single int32_t. If no value is provided, then the size of column_names must be 1, and the size of the first dimension of that column will be taken as the length (default=nullptr).
pad_info – [in] Represents how to batch each column. The key corresponds to the column name, and the value must be a pair. The first element of the pair is the shape to pad to, and the second element is the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any unspecified dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is wanted, set pad_info to None (default={}, an empty map).
pad_to_bucket_boundary – [in] If true, will pad each unspecified dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=false).
drop_remainder – [in] If true, will drop the last batch for each bucket if it is not a full batch (default=false).

Returns

Shared pointer to the current Dataset.
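
The bucket layout described for bucket_boundaries can be illustrated with a small lookup. This is a sketch only: BucketIndex is a hypothetical helper, not a MindSpore function, and it assumes the length has already been computed (for example by element_length_function).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: index of the bucket that an element of the given
// length falls into. With n boundaries b[0..n-1], bucket 0 covers [0, b[0]),
// bucket i covers [b[i-1], b[i]), and bucket n covers [b[n-1], inf).
std::size_t BucketIndex(const std::vector<int32_t> &bucket_boundaries, int32_t length) {
  // Weak precondition check: boundaries must at least be sorted.
  assert(std::is_sorted(bucket_boundaries.begin(), bucket_boundaries.end()));
  // upper_bound finds the first boundary strictly greater than length;
  // its position equals the bucket index.
  auto it = std::upper_bound(bucket_boundaries.begin(), bucket_boundaries.end(), length);
  return static_cast<std::size_t>(it - bucket_boundaries.begin());
}
```

With boundaries {5, 10, 20}, a length of 5 lands in bucket 1, which is [5, 10), and a length of 25 lands in the last bucket, which is [20, inf). The index returned would also select the matching entry of bucket_batch_sizes.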

inline std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocab(const std::vector<std::string> &col_names, int32_t vocab_size, float character_coverage, SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params)

Function to create a SentencePieceVocab from source dataset.

Note

Build a SentencePieceVocab from a dataset.

Parameters

col_names – [in] Column names to get words from. It can be a vector of column names.
vocab_size – [in] Vocabulary size.
character_coverage – [in] Percentage of characters covered by the model, must be between 0.98 and 1.0. Good defaults are: 0.9995 for languages with rich character sets like Japanese or Chinese character sets, and 1.0 for other languages with small character sets.
model_type – [in] Model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.
params – [in] A map containing additional option parameters of the sentencepiece library.

Returns

Shared pointer to the SentencePieceVocab.

inline std::shared_ptr<Vocab> BuildVocab(const std::vector<std::string> &columns = {}, const std::pair<int64_t, int64_t> &freq_range = {0, kDeMaxFreq}, int64_t top_k = kDeMaxTopk, const std::vector<std::string> &special_tokens = {}, bool special_first = true)

Function to create a Vocab from source dataset.

Note

Build a vocab from a dataset. This would collect all the unique words in a dataset and return a vocab which contains top_k most frequent words (if top_k is specified).

Parameters

columns – [in] Column names to get words from. It can be a vector of column names.
freq_range – [in] A pair of integers (min_frequency, max_frequency). Words within the frequency range are kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency/max_frequency can be left at their defaults, which correspond to 0/total_words respectively.
top_k – [in] Number of words to be built into the vocab. The top_k most frequent words are taken. top_k is applied after freq_range. If fewer than top_k words are available, all words are taken.
special_tokens – [in] A list of strings, each one is a special token.
special_first – [in] Whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is left at its default (true), special_tokens will be prepended.

Returns

Shared pointer to the Vocab.
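
The interaction between freq_range and top_k (range filter first, then the top_k most frequent words) can be sketched in standalone C++. SelectVocabWords is a hypothetical function and not the MindSpore implementation. Breaking ties alphabetically is an assumption made here for determinism, since the text above does not state a tie-breaking rule, and special_tokens are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int64_t>;

// Hypothetical sketch: keep words whose count lies in
// [freq_range.first, freq_range.second], then keep the top_k most frequent.
std::vector<std::string> SelectVocabWords(const std::vector<std::string> &words,
                                          std::pair<int64_t, int64_t> freq_range,
                                          int64_t top_k) {
  assert(freq_range.first >= 0 && freq_range.first <= freq_range.second && top_k >= 0);
  std::map<std::string, int64_t> counts;
  for (const auto &w : words) ++counts[w];

  // Step 1: frequency-range filter. std::map iterates in alphabetical order.
  std::vector<WordCount> kept;
  for (const auto &kv : counts) {
    if (kv.second >= freq_range.first && kv.second <= freq_range.second) {
      kept.emplace_back(kv.first, kv.second);
    }
  }
  // Step 2: most frequent first. stable_sort preserves alphabetical order
  // among equal counts (assumed tie-breaking rule).
  std::stable_sort(kept.begin(), kept.end(),
                   [](const WordCount &a, const WordCount &b) { return a.second > b.second; });
  if (static_cast<int64_t>(kept.size()) > top_k) kept.resize(static_cast<std::size_t>(top_k));

  std::vector<std::string> result;
  for (const auto &kv : kept) result.push_back(kv.first);
  return result;
}
```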

inline std::shared_ptr<ConcatDataset> Concat(const std::vector<std::shared_ptr<Dataset>> &datasets)

Function to create a ConcatDataset.

Note

Concat the datasets in the input.

Parameters: datasets – [in] List of shared pointers to the dataset that should be concatenated together.
Returns: Shared pointer to the current Dataset.

inline std::shared_ptr<FilterDataset> Filter(std::function<MSTensorVec(MSTensorVec)> predicate, const std::vector<std::string> &input_columns = {})

Function to filter dataset by predicate.

Note

If input_columns is not provided or empty, all columns will be used.

Parameters

predicate – [in] Callable that returns a boolean value. Elements for which it returns false are filtered out.
input_columns – [in] List of names of the input columns to filter.

Returns

Shared pointer to the current Dataset.
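
The predicate semantics can be shown with a simplified stand-in. In the sketch below, Row (a vector of int) replaces MSTensorVec and the predicate returns a plain bool. Both are assumptions for illustration, and FilterRows is a hypothetical helper rather than a MindSpore function.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Row = std::vector<int>;  // simplified stand-in for MSTensorVec (assumption)

// Hypothetical helper: keep the rows for which predicate returns true.
// Rows for which it returns false are filtered out.
std::vector<Row> FilterRows(const std::vector<Row> &rows,
                            const std::function<bool(const Row &)> &predicate) {
  assert(static_cast<bool>(predicate));  // predicate must be callable
  std::vector<Row> kept;
  for (const auto &row : rows) {
    if (predicate(row)) kept.push_back(row);
  }
  return kept;
}
```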

inline std::shared_ptr<MapDataset> Map(std::vector<TensorTransform*> operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, std::vector<std::shared_ptr<DSCallback>> callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters

operations – [in] Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.
output_columns – [in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.
project_columns – [in] A list of column names to project.
cache – [in] Tensor cache to use (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<MapDataset> Map(std::vector<std::shared_ptr<TensorTransform>> operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, std::vector<std::shared_ptr<DSCallback>> callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters

operations – [in] Vector of shared pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.
output_columns – [in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.
project_columns – [in] A list of column names to project.
cache – [in] Tensor cache to use (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, std::vector<std::shared_ptr<DSCallback>> callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters

operations – [in] Vector of references to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.
output_columns – [in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.
project_columns – [in] A list of column names to project.
cache – [in] Tensor cache to use (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current Dataset.
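
All three Map overloads apply operations in list order, so the output of each operation feeds the next. The sketch below shows only that ordering. Op (a function from int to int) is a simplified stand-in for TensorTransform, and ApplyInOrder is a hypothetical helper, not part of the MindSpore API.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Op = std::function<int(int)>;  // simplified stand-in for TensorTransform (assumption)

// Hypothetical helper: apply each operation to the value in list order,
// passing the result of one operation to the next.
int ApplyInOrder(const std::vector<Op> &operations, int value) {
  for (const auto &op : operations) {
    assert(static_cast<bool>(op));  // each operation must be callable
    value = op(value);
  }
  return value;
}
```

Reordering the list changes the result: adding 1 and then doubling turns 3 into 8, while doubling and then adding 1 turns 3 into 7.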

inline std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns)

Function to create a ProjectDataset.

Note

Applies project to the dataset.

Parameters: columns – [in] The names of the columns to project.
Returns: Shared pointer to the current Dataset.

inline std::shared_ptr<RenameDataset> Rename(const std::vector<std::string> &input_columns, const std::vector<std::string> &output_columns)

Function to create a RenameDataset.

Note

Renames the columns in the input dataset.

Parameters

input_columns – [in] List of the input columns to rename.
output_columns – [in] List of the output columns.

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<RepeatDataset> Repeat(int32_t count = -1)

Function to create a RepeatDataset.

Note

Repeats this dataset count times. Repeat indefinitely if count is -1.

Parameters: count – [in] Number of times the dataset should be repeated.
Returns: Shared pointer to the current Dataset.

inline std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size)

Function to create a ShuffleDataset.

Note

Randomly shuffles the rows of this dataset.

Parameters: buffer_size – [in] The size of the buffer (must be larger than 1) for shuffling.
Returns: Shared pointer to the current Dataset.
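
One common way a fixed-size buffer produces a shuffle is sketched below: fill the buffer, emit a random slot, and refill that slot with the next incoming row. That this matches MindSpore's internal algorithm is an assumption. BufferShuffle is a hypothetical helper working on plain int rows, and the seed parameter is added here only to make the sketch reproducible.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch of a buffer-based shuffle (assumed technique).
std::vector<int> BufferShuffle(const std::vector<int> &rows, std::size_t buffer_size, unsigned seed) {
  assert(buffer_size > 1);  // mirrors the "must be larger than 1" requirement
  std::mt19937 rng(seed);
  std::vector<int> buffer;
  std::vector<int> out;
  out.reserve(rows.size());
  std::size_t next = 0;

  // Fill the buffer with the first buffer_size rows (or fewer if the input is short).
  while (next < rows.size() && buffer.size() < buffer_size) buffer.push_back(rows[next++]);

  while (!buffer.empty()) {
    std::uniform_int_distribution<std::size_t> pick(0, buffer.size() - 1);
    const std::size_t i = pick(rng);
    out.push_back(buffer[i]);
    if (next < rows.size()) {
      buffer[i] = rows[next++];  // refill the emitted slot with the next row
    } else {
      buffer[i] = buffer.back();  // input exhausted: shrink the buffer
      buffer.pop_back();
    }
  }
  return out;
}
```

Under this scheme a larger buffer_size lets a row travel further from its original position, while a small buffer only shuffles locally.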

inline std::shared_ptr<SkipDataset> Skip(int32_t count)

Function to create a SkipDataset.

Note

Skips count elements in this dataset.

Parameters: count – [in] Number of elements to skip.
Returns: Shared pointer to the current Dataset.

inline std::shared_ptr<TakeDataset> Take(int32_t count = -1)

Function to create a TakeDataset.

Note

Takes count elements in this dataset.

Parameters: count – [in] Number of elements to take.
Returns: Shared pointer to the current Dataset.
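
Skip and Take are complementary: one drops a prefix of the rows, the other keeps a prefix. The sketch below models both on a plain vector of int for non-negative counts only. SkipRows and TakeRows are hypothetical helpers, and the behavior of the default count = -1 for Take is deliberately not modeled because it is not described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: drop the first count rows (or all rows if fewer exist).
std::vector<int> SkipRows(const std::vector<int> &rows, int32_t count) {
  assert(count >= 0);
  const std::size_t n = std::min(static_cast<std::size_t>(count), rows.size());
  return std::vector<int>(rows.begin() + static_cast<std::ptrdiff_t>(n), rows.end());
}

// Hypothetical helper: keep only the first count rows (or all rows if fewer exist).
std::vector<int> TakeRows(const std::vector<int> &rows, int32_t count) {
  assert(count >= 0);
  const std::size_t n = std::min(static_cast<std::size_t>(count), rows.size());
  return std::vector<int>(rows.begin(), rows.begin() + static_cast<std::ptrdiff_t>(n));
}
```

Chaining the two selects a window: skipping 1 row and then taking 3 from {1, 2, 3, 4, 5} yields {2, 3, 4}.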

inline std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets)

Function to create a ZipDataset.

Note

Applies zip to the dataset.

Parameters: datasets – [in] A list of shared pointers to the datasets that we want to zip.
Returns: Shared pointer to the current Dataset.