Class Dataset

Inheritance Relationships

Base Type

  • public std::enable_shared_from_this< Dataset >

Class Documentation

class Dataset : public std::enable_shared_from_this<Dataset>

A base class to represent a dataset in the data pipeline.

Subclassed by mindspore::dataset::AlbumDataset, mindspore::dataset::BatchDataset, mindspore::dataset::BucketBatchByLengthDataset, mindspore::dataset::CLUEDataset, mindspore::dataset::CSVDataset, mindspore::dataset::CelebADataset, mindspore::dataset::Cifar100Dataset, mindspore::dataset::Cifar10Dataset, mindspore::dataset::CocoDataset, mindspore::dataset::ConcatDataset, mindspore::dataset::FilterDataset, mindspore::dataset::ImageFolderDataset, mindspore::dataset::ManifestDataset, mindspore::dataset::MapDataset, mindspore::dataset::MindDataDataset, mindspore::dataset::MnistDataset, mindspore::dataset::ProjectDataset, mindspore::dataset::RandomDataDataset, mindspore::dataset::RenameDataset, mindspore::dataset::RepeatDataset, mindspore::dataset::ShuffleDataset, mindspore::dataset::SkipDataset, mindspore::dataset::TFRecordDataset, mindspore::dataset::TakeDataset, mindspore::dataset::TextFileDataset, mindspore::dataset::VOCDataset, mindspore::dataset::ZipDataset

Public Functions

Dataset()

Constructor.

~Dataset() = default

Destructor.

int64_t GetDatasetSize(bool estimate = false)

Get the dataset size.

Parameters

estimate[in] Estimate mode is only supported by some operators; it speeds up getting the dataset size at the expense of accuracy.

Returns

Dataset size. Returns -1 on failure.
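
For example, a minimal sketch of querying the size; the Mnist factory, sampler, and path are illustrative, and any dataset created through this API can be queried the same way:

    // Headers omitted: include the MindSpore C++ dataset headers for your build.
    using namespace mindspore::dataset;

    std::shared_ptr<Dataset> ds = Mnist("/path/to/mnist", "all", std::make_shared<RandomSampler>());
    int64_t size = ds->GetDatasetSize();        // exact size
    int64_t approx = ds->GetDatasetSize(true);  // faster for some ops, possibly less accurate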

std::vector<mindspore::DataType> GetOutputTypes()

Get the output type.

Returns

A vector containing the output DataType of the dataset. Returns an empty vector on failure.

std::vector<std::vector<int64_t>> GetOutputShapes()

Get the output shape.

Returns

A vector containing the output TensorShape of the dataset. Returns an empty vector on failure.
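
A sketch of inspecting the per-column types and shapes, continuing from the ds handle above (<iostream> assumed included):

    std::vector<mindspore::DataType> types = ds->GetOutputTypes();
    std::vector<std::vector<int64_t>> shapes = ds->GetOutputShapes();
    for (size_t i = 0; i < shapes.size(); ++i) {
      std::cout << "column " << i << " has rank " << shapes[i].size() << std::endl;
    }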

int64_t GetBatchSize()

Get the batch size.

Returns

Batch size configuration of the dataset.

int64_t GetRepeatCount()

Get the repeat count.

Returns

Repeat count configuration of the dataset.

int64_t GetNumClasses()

Get the number of classes.

Returns

Number of classes in the dataset. Returns -1 on failure.

inline std::vector<std::string> GetColumnNames()

Get the column names.

Returns

A vector containing all column names of the dataset. Returns an empty vector on failure.

inline std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing()

Get the class indexing.

Returns

A vector of pairs representing the class indexing of the dataset. Returns an empty vector on failure.
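
The remaining getters follow the same pattern; a sketch, assuming the ds handle from above:

    int64_t batch_size = ds->GetBatchSize();    // 1 when no Batch op is in the pipeline
    int64_t repeats = ds->GetRepeatCount();     // 1 when no Repeat op is in the pipeline
    int64_t num_classes = ds->GetNumClasses();  // -1 on failure
    std::vector<std::string> columns = ds->GetColumnNames();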

std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers)

Function to set the runtime number of workers.

Parameters

num_workers[in] The number of threads in this operator.

Returns

Shared pointer to the original object.
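
Because the method returns a shared pointer to the same object, it can be assigned back or chained; a sketch:

    ds = ds->SetNumWorkers(4);  // this operator now uses 4 worker threads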

std::shared_ptr<PullIterator> CreatePullBasedIterator(std::vector<std::vector<char>> columns = {})

Function to create a PullBasedIterator over the Dataset.

Parameters

columns[in] List of column names used to specify the order of the output columns.

Returns

Shared pointer to the Iterator.

inline std::shared_ptr<Iterator> CreateIterator(std::vector<std::string> columns = {}, int32_t num_epochs = -1)

Function to create an Iterator over the Dataset pipeline.

Parameters
  • columns[in] List of column names used to specify the order of the output columns.

  • num_epochs[in] Number of epochs to run through the pipeline (default=-1, which means infinite epochs). An empty row is returned at the end of each epoch.

Returns

Shared pointer to the Iterator.
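
A sketch of the usual consumption loop, assuming the ds handle from above (CreatePullBasedIterator is used analogously; the row type shown here matches this version of the API but may differ across releases):

    std::shared_ptr<Iterator> iter = ds->CreateIterator();
    std::unordered_map<std::string, mindspore::MSTensor> row;
    iter->GetNextRow(&row);
    while (!row.empty()) {        // an empty row marks the end of an epoch
      auto image = row["image"];  // "image" is illustrative; use your dataset's column names
      iter->GetNextRow(&row);
    }
    iter->Stop();                 // release pipeline resources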

inline bool DeviceQueue(std::string queue_name = "", std::string device_type = "", int32_t device_id = 0, int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false)

Function to transfer data through a device.

Note

If the device is Ascend, data features are transferred one by one; each transfer is limited to 256 MB.

Parameters
  • queue_name[in] Channel name (default="", a new unique name is created).

  • device_type[in] Type of device (default="", obtained from MSContext).

  • device_id[in] ID of the device (default=0, obtained from MSContext).

  • num_epochs[in] Number of epochs (default=-1, infinite epochs).

  • send_epoch_end[in] Whether to send end of sequence to device or not (default=true).

  • total_batches[in] Number of batches to be sent to the device (default=0, all data).

  • create_data_info_queue[in] Whether to create queue which stores types and shapes of data or not (default=false).

Returns

Returns true if no error was encountered, otherwise false.
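
A minimal sketch of sending one epoch to an Ascend device; all argument values are illustrative:

    bool ok = ds->DeviceQueue("", "Ascend", 0, 1);  // auto-named queue, device 0, one epoch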

inline bool Save(std::string dataset_path, int32_t num_files = 1, std::string dataset_type = "mindrecord")

Function to create a Saver to save the dynamic data processed by the dataset pipeline.

Note

Usage restrictions:

  1. Supported dataset formats: ‘mindrecord’ only.

  2. To save the samples in order, set dataset’s shuffle to false and num_files to 1.

  3. Before calling this function, do not use the batch operator, the repeat operator, or data augmentation operators with a random attribute in the map operator.

  4. MindRecord does not support bool, uint64, multi-dimensional uint8 (the dimension is dropped), or multi-dimensional string types.

Parameters
  • dataset_path[in] Path to the dataset file.

  • num_files[in] Number of dataset files (default=1).

  • dataset_type[in] Dataset format (default="mindrecord").

Returns

Returns true if no error was encountered, otherwise false.
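
A sketch, assuming the pipeline satisfies the restrictions above; the output path is illustrative:

    bool ok = ds->Save("/path/to/output/data.mindrecord", 1, "mindrecord");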

std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false)

Function to create a BatchDataset.

Note

Combines batch_size number of consecutive rows into batches.

Parameters
  • batch_size[in] The number of rows each batch is created with.

  • drop_remainder[in] Determines whether or not to drop the last possibly incomplete batch. If true, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the next node.

Returns

Shared pointer to the current Dataset.
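
For example, a sketch that groups consecutive rows into batches of 32 and drops a trailing partial batch:

    ds = ds->Batch(32, true);  // each output row now holds 32 input rows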

inline std::shared_ptr<BucketBatchByLengthDataset> BucketBatchByLength(const std::vector<std::string> &column_names, const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes, std::function<MSTensorVec(MSTensorVec)> element_length_function = nullptr, const std::map<std::string, std::pair<std::vector<int64_t>, MSTensor>> &pad_info = {}, bool pad_to_bucket_boundary = false, bool drop_remainder = false)

Function to create a BucketBatchByLengthDataset.

Note

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

Parameters
  • column_names[in] Columns passed to element_length_function.

  • bucket_boundaries[in] A list consisting of the upper boundaries of the buckets, which must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0 <= i < n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes[in] A list consisting of the batch sizes for each bucket. Must contain bucket_boundaries.size() + 1 elements.

  • element_length_function[in] A callable that takes an MSTensorVec and returns an MSTensorVec. The output must contain a single tensor holding a single int32_t. If no value is provided, the size of column_names must be 1, and the size of the first dimension of that column is taken as the length (default=nullptr).

  • pad_info[in] Represents how to pad and batch each column. The key corresponds to the column name; the value must be a pair of two elements. The first element is the shape to pad to and the second is the value to pad with. If a column is not specified, that column is padded to the longest shape in the current batch and 0 is used as the padding value. Any unspecified dimensions are padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is wanted, leave pad_info empty (default=empty map).

  • pad_to_bucket_boundary[in] If true, pads each unspecified dimension in pad_info to the bucket boundary minus 1. If any element falls into the last bucket, an error occurs (default=false).

  • drop_remainder[in] If true, will drop the last batch for each bucket if it is not a full batch (default=false).

Returns

Shared pointer to the current Dataset.
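
A sketch with two boundaries and therefore three buckets; the "text" column and all sizes are illustrative, and because element_length_function is left as nullptr, column_names holds exactly one column:

    ds = ds->BucketBatchByLength({"text"},      // length = first dimension of this column
                                 {10, 20},      // buckets: [0, 10), [10, 20), [20, inf)
                                 {16, 16, 8});  // one batch size per bucket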

inline std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocab(const std::vector<std::string> &col_names, int32_t vocab_size, float character_coverage, SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params)

Function to create a SentencePieceVocab from source dataset.

Note

Build a SentencePieceVocab from a dataset.

Parameters
  • col_names[in] Column names to get words from. It can be a vector of column names.

  • vocab_size[in] Vocabulary size.

  • character_coverage[in] Percentage of characters covered by the model; must be between 0.98 and 1.0. Good defaults are 0.9995 for languages with rich character sets such as Japanese or Chinese, and 1.0 for languages with small character sets.

  • model_type[in] Model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.

  • params[in] An unordered map containing additional option parameters for the sentencepiece library.

Returns

Shared pointer to the SentencePieceVocab.
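
A sketch; the "text" column, vocabulary size, and coverage value are illustrative:

    std::shared_ptr<SentencePieceVocab> vocab =
        ds->BuildSentencePieceVocab({"text"}, 5000, 0.9995, SentencePieceModel::kUnigram, {});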

inline std::shared_ptr<Vocab> BuildVocab(const std::vector<std::string> &columns = {}, const std::pair<int64_t, int64_t> &freq_range = {0, kDeMaxFreq}, int64_t top_k = kDeMaxTopk, const std::vector<std::string> &special_tokens = {}, bool special_first = true)

Function to create a Vocab from source dataset.

Note

Build a vocab from a dataset. This collects all the unique words in the dataset and returns a vocab containing the top_k most frequent words (if top_k is specified).

Parameters
  • columns[in] Column names to get words from. It can be a vector of column names.

  • freq_range[in] A pair of integers (min_frequency, max_frequency). Words within the frequency range are kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency and max_frequency can be left at their defaults, which correspond to 0 and total_words respectively.

  • top_k[in] Number of words to be built into the vocab. The top_k most frequent words are taken; top_k is applied after freq_range. If there are fewer than top_k words, all words are taken.

  • special_tokens[in] A list of strings, each one is a special token.

  • special_first[in] Whether special_tokens are prepended or appended to the vocab. If special_tokens is specified and special_first is left at its default, special_tokens are prepended.

Returns

Shared pointer to the Vocab.
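
A sketch that keeps the 10000 most frequent words of an illustrative "text" column, prepending two special tokens:

    std::shared_ptr<Vocab> vocab =
        ds->BuildVocab({"text"}, {0, kDeMaxFreq}, 10000, {"<pad>", "<unk>"}, true);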

inline std::shared_ptr<ConcatDataset> Concat(const std::vector<std::shared_ptr<Dataset>> &datasets)

Function to create a ConcatDataset.

Note

Concat the datasets in the input.

Parameters

datasets[in] List of shared pointers to the datasets that should be concatenated together.

Returns

Shared pointer to the current Dataset.
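
A sketch; ds2 stands for any second dataset whose columns match those of ds, and the Mnist factory call is illustrative:

    std::shared_ptr<Dataset> ds2 = Mnist("/path/to/mnist", "test", std::make_shared<RandomSampler>());
    ds = ds->Concat({ds2});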

inline std::shared_ptr<FilterDataset> Filter(std::function<MSTensorVec(MSTensorVec)> predicate, const std::vector<std::string> &input_columns = {})

Function to filter dataset by predicate.

Note

If input_columns is not provided or empty, all columns will be used.

Parameters
  • predicate[in] Callable which returns a boolean value; rows for which it returns false are filtered out.

  • input_columns[in] List of names of the input columns to filter.

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<MapDataset> Map(std::vector<TensorTransform*> operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, std::vector<std::shared_ptr<DSCallback>> callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters
  • operations[in] Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.

  • output_columns[in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.

  • project_columns[in] A list of column names to project.

  • cache[in] Tensor cache to use (default=nullptr which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called (default=empty).

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<MapDataset> Map(std::vector<std::shared_ptr<TensorTransform>> operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, std::vector<std::shared_ptr<DSCallback>> callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters
  • operations[in] Vector of shared pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.

  • output_columns[in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.

  • project_columns[in] A list of column names to project.

  • cache[in] Tensor cache to use (default=nullptr which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called (default=empty).

Returns

Shared pointer to the current Dataset.
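
A sketch using this shared_ptr overload with a vision transform; vision::Resize and the "image" column are illustrative of an image pipeline:

    std::shared_ptr<TensorTransform> resize =
        std::make_shared<vision::Resize>(std::vector<int32_t>{224, 224});
    ds = ds->Map({resize}, {"image"});  // the "image" column is replaced by resized tensors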

inline std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, std::vector<std::shared_ptr<DSCallback>> callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters
  • operations[in] Vector of TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.

  • output_columns[in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.

  • project_columns[in] A list of column names to project.

  • cache[in] Tensor cache to use (default=nullptr which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called (default=empty).

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns)

Function to create a Project Dataset.

Note

Applies project to the dataset.

Parameters

columns[in] The names of the columns to project.

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<RenameDataset> Rename(const std::vector<std::string> &input_columns, const std::vector<std::string> &output_columns)

Function to create a Rename Dataset.

Note

Renames the columns in the input dataset.

Parameters
  • input_columns[in] List of the input columns to rename.

  • output_columns[in] List of the output columns.

Returns

Shared pointer to the current Dataset.
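
A sketch combining Rename with the Project operation above; the column names are illustrative:

    ds = ds->Rename({"image"}, {"data"});
    ds = ds->Project({"data"});  // keep only the "data" column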

inline std::shared_ptr<RepeatDataset> Repeat(int32_t count = -1)

Function to create a RepeatDataset.

Note

Repeats this dataset count times. Repeats indefinitely if count is -1.

Note

Repeat returns a shared pointer to Dataset instead of RepeatDataset due to a limitation in the current implementation.

Parameters

count[in] Number of times the dataset should be repeated.

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size)

Function to create a Shuffle Dataset.

Note

Randomly shuffles the rows of this dataset.

Parameters

buffer_size[in] The size of the shuffle buffer (must be larger than 1).

Returns

Shared pointer to the current Dataset.
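
A sketch chaining Shuffle with the Repeat operation above; the buffer size and repeat count are illustrative:

    ds = ds->Shuffle(10000)  // shuffle within a 10000-row buffer
             ->Repeat(2);    // then traverse the shuffled data twice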

inline std::shared_ptr<SkipDataset> Skip(int32_t count)

Function to create a SkipDataset.

Note

Skips count elements in this dataset.

Parameters

count[in] Number of elements of this dataset to skip.

Returns

Shared pointer to the current Dataset.

inline std::shared_ptr<TakeDataset> Take(int32_t count = -1)

Function to create a TakeDataset.

Note

Takes count elements in this dataset.

Parameters

count[in] Number of elements of this dataset to take (default=-1, take all elements).

Returns

Shared pointer to the current Dataset.
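
A sketch combining Take with the Skip operation above to select a contiguous slice of rows:

    ds = ds->Skip(100)->Take(1000);  // keep rows 100..1099 of the original order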

inline std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets)

Function to create a Zip Dataset.

Note

Applies zip to the dataset.

Parameters

datasets[in] A list of shared pointers to the datasets to be zipped.

Returns

Shared pointer to the current Dataset.
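
A sketch; ds2 stands for a second dataset as above, and note that column names must be unique across the datasets being zipped:

    ds = ds->Zip({ds2});  // each output row holds the columns of both datasets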