Class Dataset
Defined in File datasets.h
Inheritance Relationships
Base Type
public std::enable_shared_from_this< Dataset >
Derived Types
public mindspore::dataset::AlbumDataset (Class AlbumDataset)
public mindspore::dataset::BatchDataset (Class BatchDataset)
public mindspore::dataset::BucketBatchByLengthDataset (Class BucketBatchByLengthDataset)
public mindspore::dataset::CLUEDataset (Class CLUEDataset)
public mindspore::dataset::CSVDataset (Class CSVDataset)
public mindspore::dataset::CelebADataset (Class CelebADataset)
public mindspore::dataset::Cifar100Dataset (Class Cifar100Dataset)
public mindspore::dataset::Cifar10Dataset (Class Cifar10Dataset)
public mindspore::dataset::CocoDataset (Class CocoDataset)
public mindspore::dataset::ConcatDataset (Class ConcatDataset)
public mindspore::dataset::FilterDataset (Class FilterDataset)
public mindspore::dataset::ImageFolderDataset (Class ImageFolderDataset)
public mindspore::dataset::ManifestDataset (Class ManifestDataset)
public mindspore::dataset::MapDataset (Class MapDataset)
public mindspore::dataset::MindDataDataset (Class MindDataDataset)
public mindspore::dataset::MnistDataset (Class MnistDataset)
public mindspore::dataset::ProjectDataset (Class ProjectDataset)
public mindspore::dataset::RandomDataDataset (Class RandomDataDataset)
public mindspore::dataset::RenameDataset (Class RenameDataset)
public mindspore::dataset::RepeatDataset (Class RepeatDataset)
public mindspore::dataset::ShuffleDataset (Class ShuffleDataset)
public mindspore::dataset::SkipDataset (Class SkipDataset)
public mindspore::dataset::TFRecordDataset (Class TFRecordDataset)
public mindspore::dataset::TakeDataset (Class TakeDataset)
public mindspore::dataset::TextFileDataset (Class TextFileDataset)
public mindspore::dataset::VOCDataset (Class VOCDataset)
public mindspore::dataset::ZipDataset (Class ZipDataset)
Class Documentation
-
class Dataset : public std::enable_shared_from_this<Dataset>
A base class to represent a dataset in the data pipeline.
Subclassed by mindspore::dataset::AlbumDataset, mindspore::dataset::BatchDataset, mindspore::dataset::BucketBatchByLengthDataset, mindspore::dataset::CLUEDataset, mindspore::dataset::CSVDataset, mindspore::dataset::CelebADataset, mindspore::dataset::Cifar100Dataset, mindspore::dataset::Cifar10Dataset, mindspore::dataset::CocoDataset, mindspore::dataset::ConcatDataset, mindspore::dataset::FilterDataset, mindspore::dataset::ImageFolderDataset, mindspore::dataset::ManifestDataset, mindspore::dataset::MapDataset, mindspore::dataset::MindDataDataset, mindspore::dataset::MnistDataset, mindspore::dataset::ProjectDataset, mindspore::dataset::RandomDataDataset, mindspore::dataset::RenameDataset, mindspore::dataset::RepeatDataset, mindspore::dataset::ShuffleDataset, mindspore::dataset::SkipDataset, mindspore::dataset::TFRecordDataset, mindspore::dataset::TakeDataset, mindspore::dataset::TextFileDataset, mindspore::dataset::VOCDataset, mindspore::dataset::ZipDataset
Public Functions
-
Dataset()
Constructor.
-
~Dataset() = default
Destructor.
-
int64_t GetDatasetSize(bool estimate = false)
Get the dataset size.
- Parameters
estimate – [in] This is only supported by some of the ops and it’s used to speed up the process of getting dataset size at the expense of accuracy.
- Returns
Dataset size. If failed, return -1.
-
std::vector<mindspore::DataType> GetOutputTypes()
Get the output type.
- Returns
A vector containing the output DataType of the dataset. If failed, return an empty vector.
-
std::vector<std::vector<int64_t>> GetOutputShapes()
Get the output shape.
- Returns
A vector containing the output TensorShape of the dataset. If failed, return an empty vector.
-
int64_t GetBatchSize()
Get the batch size.
- Returns
Batch size configuration of dataset.
-
int64_t GetRepeatCount()
Get the repeat count.
- Returns
Repeat count configuration of dataset.
-
int64_t GetNumClasses()
Get the number of classes.
- Returns
Number of classes of dataset. If failed, return -1.
-
inline std::vector<std::string> GetColumnNames()
Get the column names.
- Returns
A vector containing all column names of the dataset. If failed, return an empty vector.
-
inline std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing()
Get the class indexing.
- Returns
The ClassIndexing of the dataset, as a vector of (string, vector of int32_t) pairs. If failed, return an empty vector.
-
std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers)
Function to set runtime number of workers.
- Parameters
num_workers – [in] The number of threads in this operator.
- Returns
Shared pointer to the original object.
-
std::shared_ptr<PullIterator> CreatePullBasedIterator(std::vector<std::vector<char>> columns = {})
Function to create a PullBasedIterator over the Dataset.
- Parameters
columns – [in] List of columns to be used to specify the order of columns.
- Returns
Shared pointer to the Iterator.
-
inline std::shared_ptr<Iterator> CreateIterator(std::vector<std::string> columns = {}, int32_t num_epochs = -1)
Function to create an Iterator over the Dataset pipeline.
- Parameters
columns – [in] List of columns to be used to specify the order of columns.
num_epochs – [in] Number of epochs to run through the pipeline (default=-1, which means infinite epochs). An empty row is returned at the end of each epoch.
- Returns
Shared pointer to the Iterator.
-
inline bool DeviceQueue(std::string queue_name = "", std::string device_type = "", int32_t device_id = 0, int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false)
Function to transfer data through a device.
Note
If device is Ascend, features of data will be transferred one by one. The limitation of data transmission per time is 256M.
- Parameters
queue_name – [in] Channel name (default="", create new unique name).
device_type – [in] Type of device (default="", get from MSContext).
device_id – [in] ID of device (default=0, get from MSContext).
num_epochs – [in] Number of epochs (default=-1, infinite epochs).
send_epoch_end – [in] Whether to send end of sequence to device or not (default=true).
total_batches – [in] Number of batches to be sent to the device (default=0, all data).
create_data_info_queue – [in] Whether to create queue which stores types and shapes of data or not (default=false).
- Returns
True if no error is encountered, false otherwise.
-
inline bool Save(std::string dataset_path, int32_t num_files = 1, std::string dataset_type = "mindrecord")
Function to create a Saver to save the dynamic data processed by the dataset pipeline.
Note
Usage restrictions:
Supported dataset formats: ‘mindrecord’ only.
To save the samples in order, set dataset’s shuffle to false and num_files to 1.
Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.
Mindrecord does not support bool, uint64, multi-dimensional uint8 (drop dimension), or multi-dimensional string.
- Parameters
dataset_path – [in] Path to dataset file.
num_files – [in] Number of dataset files (default=1).
dataset_type – [in] Dataset format (default="mindrecord").
- Returns
True if no error is encountered, false otherwise.
-
std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false)
Function to create a BatchDataset.
Note
Combines batch_size number of consecutive rows into batches.
- Parameters
batch_size – [in] The number of rows each batch is created with.
drop_remainder – [in] Determines whether or not to drop the last, possibly incomplete, batch. If true and fewer than batch_size rows are available to make the last batch, those rows are dropped and not propagated to the next node.
- Returns
Shared pointer to the current Dataset.
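The effect of batch_size and drop_remainder can be illustrated with a small standalone sketch. It does not use the MindSpore API: BatchRows is a hypothetical helper that models rows as plain integers, purely to show which rows survive batching.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of MindSpore): groups consecutive rows into
// batches of batch_size. When drop_remainder is true, a trailing batch with
// fewer than batch_size rows is discarded.
std::vector<std::vector<int>> BatchRows(const std::vector<int> &rows,
                                        int32_t batch_size,
                                        bool drop_remainder = false) {
  assert(batch_size > 0);
  std::vector<std::vector<int>> batches;
  std::vector<int> current;
  for (int row : rows) {
    current.push_back(row);
    if (current.size() == static_cast<std::size_t>(batch_size)) {
      batches.push_back(current);
      current.clear();
    }
  }
  // Rows left over after the last full batch form an incomplete batch.
  if (!current.empty() && !drop_remainder) {
    batches.push_back(current);
  }
  return batches;
}
```

With 7 rows and batch_size 3, this sketch yields three batches when drop_remainder is false (the last one holding a single row) and two batches when it is true.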
-
inline std::shared_ptr<BucketBatchByLengthDataset> BucketBatchByLength(const std::vector<std::string> &column_names, const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes, std::function<MSTensorVec(MSTensorVec)> element_length_function = nullptr, const std::map<std::string, std::pair<std::vector<int64_t>, MSTensor>> &pad_info = {}, bool pad_to_bucket_boundary = false, bool drop_remainder = false)
Function to create a BucketBatchByLengthDataset.
Note
Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.
- Parameters
column_names – [in] Columns passed to element_length_function.
bucket_boundaries – [in] A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).
bucket_batch_sizes – [in] A list consisting of the batch sizes for each bucket. Its number of elements must equal the size of bucket_boundaries + 1.
element_length_function – [in] A function pointer that takes in MSTensorVec and outputs a MSTensorVec. The output must contain a single tensor containing a single int32_t. If no value is provided, then size of column_names must be 1, and the size of the first dimension of that column will be taken as the length (default=nullptr).
pad_info – [in] Represents how to batch each column. The key corresponds to the column name, the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any unspecified dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is wanted, set pad_info to None (default=empty dictionary).
pad_to_bucket_boundary – [in] If true, will pad each unspecified dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=false).
drop_remainder – [in] If true, will drop the last batch for each bucket if it is not a full batch (default=false).
- Returns
Shared pointer to the current Dataset.
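The mapping from an element length to a bucket, as described for bucket_boundaries, can be sketched with a binary search. BucketIndex is a hypothetical helper written for illustration only; it is not how MindSpore is known to implement the lookup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of MindSpore): returns the bucket index for
// an element of the given length. With n boundaries there are n + 1 buckets:
// index 0 covers [0, boundaries[0]) and index n covers [boundaries[n-1], inf).
std::size_t BucketIndex(const std::vector<int32_t> &bucket_boundaries,
                        int32_t length) {
  assert(length >= 0);
  assert(std::is_sorted(bucket_boundaries.begin(), bucket_boundaries.end()));
  // upper_bound finds the first boundary strictly greater than length. Its
  // position equals the number of boundaries <= length, which is the index
  // of the half-open bucket that contains length.
  auto it = std::upper_bound(bucket_boundaries.begin(),
                             bucket_boundaries.end(), length);
  return static_cast<std::size_t>(it - bucket_boundaries.begin());
}
```

For boundaries {5, 10, 20}, lengths 0 to 4 fall into bucket 0, lengths 5 to 9 into bucket 1, lengths 10 to 19 into bucket 2, and anything from 20 upward into the last bucket, so bucket_batch_sizes needs four entries.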
-
inline std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocab(const std::vector<std::string> &col_names, int32_t vocab_size, float character_coverage, SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params)
Function to create a SentencePieceVocab from source dataset.
Note
Build a SentencePieceVocab from a dataset.
- Parameters
col_names – [in] Column names to get words from. It can be a vector of column names.
vocab_size – [in] Vocabulary size.
character_coverage – [in] Percentage of characters covered by the model, must be between 0.98 and 1.0. Good defaults are: 0.9995 for languages with rich character sets like Japanese or Chinese, and 1.0 for other languages with small character sets.
model_type – [in] Model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.
params – [in] A map containing additional option parameters for the sentencepiece library.
- Returns
Shared pointer to the SentencePieceVocab.
-
inline std::shared_ptr<Vocab> BuildVocab(const std::vector<std::string> &columns = {}, const std::pair<int64_t, int64_t> &freq_range = {0, kDeMaxFreq}, int64_t top_k = kDeMaxTopk, const std::vector<std::string> &special_tokens = {}, bool special_first = true)
Function to create a Vocab from source dataset.
Note
Build a vocab from a dataset. This would collect all the unique words in a dataset and return a vocab which contains top_k most frequent words (if top_k is specified).
- Parameters
columns – [in] Column names to get words from. It can be a vector of column names.
freq_range – [in] A pair of integers (min_frequency, max_frequency). Words within the frequency range are kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency and max_frequency can be left at their defaults, which correspond to 0 and total_words respectively.
top_k – [in] Number of words to be built into vocab. top_k most frequent words are taken. The top_k is taken after freq_range. If not enough top_k, all words will be taken.
special_tokens – [in] A list of strings, each one is a special token.
special_first – [in] Whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is left at its default, special_tokens will be prepended.
- Returns
Shared pointer to the Vocab.
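The order in which freq_range, top_k, and special_tokens take effect can be sketched in plain C++. BuildVocabSketch is a hypothetical helper, not the MindSpore implementation. It assumes that the frequency range is inclusive at both ends and that ties in frequency are broken alphabetically; neither detail is stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int64_t>;

// Hypothetical helper (not part of MindSpore): builds an ordered word list
// from raw words. Assumes an inclusive frequency range and alphabetical
// tie-breaking.
std::vector<std::string> BuildVocabSketch(
    const std::vector<std::string> &words,
    const std::pair<int64_t, int64_t> &freq_range, int64_t top_k,
    const std::vector<std::string> &special_tokens = {},
    bool special_first = true) {
  assert(0 <= freq_range.first && freq_range.first <= freq_range.second);
  assert(top_k >= 0);

  // Count every unique word; std::map keeps the words in alphabetical order.
  std::map<std::string, int64_t> counts;
  for (const std::string &word : words) {
    ++counts[word];
  }

  // Step 1: keep only the words whose count lies within freq_range.
  std::vector<WordCount> kept;
  for (const auto &entry : counts) {
    if (entry.second >= freq_range.first && entry.second <= freq_range.second) {
      kept.emplace_back(entry.first, entry.second);
    }
  }

  // Step 2: take the top_k most frequent of the remaining words. A stable
  // sort preserves alphabetical order among words with equal counts.
  std::stable_sort(kept.begin(), kept.end(),
                   [](const WordCount &a, const WordCount &b) {
                     return a.second > b.second;
                   });
  if (kept.size() > static_cast<std::size_t>(top_k)) {
    kept.resize(static_cast<std::size_t>(top_k));
  }

  // Step 3: prepend or append the special tokens.
  std::vector<std::string> vocab;
  if (special_first) {
    vocab.insert(vocab.end(), special_tokens.begin(), special_tokens.end());
  }
  for (const WordCount &entry : kept) {
    vocab.push_back(entry.first);
  }
  if (!special_first) {
    vocab.insert(vocab.end(), special_tokens.begin(), special_tokens.end());
  }
  return vocab;
}
```

The point of the sketch is the ordering: top_k is applied only to the words that survive the frequency filter, and special tokens are added last, so they do not count against top_k here.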
-
Function to create a ConcatDataset.
Note
Concat the datasets in the input.
- Parameters
datasets – [in] List of shared pointers to the dataset that should be concatenated together.
- Returns
Shared pointer to the current Dataset.
-
inline std::shared_ptr<FilterDataset> Filter(std::function<MSTensorVec(MSTensorVec)> predicate, const std::vector<std::string> &input_columns = {})
Function to filter dataset by predicate.
Note
If input_columns is not provided or empty, all columns will be used.
- Parameters
predicate – [in] Function callable which returns a boolean value. If false then filter the element.
input_columns – [in] List of names of the input columns to filter.
- Returns
Shared pointer to the current Dataset.
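The interaction between predicate and input_columns can be sketched without the MindSpore types. FilterRows and Row are hypothetical names; rows are modeled as maps from column name to an integer value rather than as MSTensorVec, and the predicate returns a plain bool.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical row model (not a MindSpore type): column name -> value.
using Row = std::map<std::string, int>;

// Hypothetical helper (not part of MindSpore): keeps the rows for which the
// predicate returns true. The predicate only sees the values of
// input_columns, in the order given; if input_columns is empty it sees all
// columns (here in the map's alphabetical order, an assumption of this sketch).
std::vector<Row> FilterRows(
    const std::vector<Row> &rows,
    const std::function<bool(const std::vector<int> &)> &predicate,
    const std::vector<std::string> &input_columns = {}) {
  assert(predicate != nullptr);
  std::vector<Row> kept;
  for (const Row &row : rows) {
    std::vector<int> args;
    if (input_columns.empty()) {
      for (const auto &column : row) {
        args.push_back(column.second);
      }
    } else {
      for (const std::string &name : input_columns) {
        args.push_back(row.at(name));  // throws if the column is missing
      }
    }
    if (predicate(args)) {
      kept.push_back(row);  // the whole row is kept, not just input_columns
    }
  }
  return kept;
}
```

Note that input_columns only selects what the predicate sees; rows that pass keep all of their columns.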
-
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset.
- Parameters
operations – [in] Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.
output_columns – [in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.
project_columns – [in] A list of column names to project.
cache – [in] Tensor cache to use (default=nullptr which means no cache is used).
- Returns
Shared pointer to the current Dataset.
-
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset.
- Parameters
operations – [in] Vector of shared pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.
output_columns – [in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.
project_columns – [in] A list of column names to project.
cache – [in] Tensor cache to use (default=nullptr which means no cache is used).
- Returns
Shared pointer to the current Dataset.
-
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset.
- Parameters
operations – [in] Vector of TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column.
output_columns – [in] Vector of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. The default output_columns will have the same name as the input columns, i.e., the columns will be replaced.
project_columns – [in] A list of column names to project.
cache – [in] Tensor cache to use (default=nullptr which means no cache is used).
- Returns
Shared pointer to the current Dataset.
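Because operations are applied in the order they appear in the list, reordering them can change the result. The following sketch shows this with plain functions on integers. ApplyInOrder is a hypothetical helper and std::function<int(int)> merely stands in for a TensorTransform.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper (not part of MindSpore): feeds a value through each
// operation in list order, passing the output of one as input to the next.
int ApplyInOrder(const std::vector<std::function<int(int)>> &operations,
                 int value) {
  for (const std::function<int(int)> &op : operations) {
    assert(op != nullptr);
    value = op(value);
  }
  return value;
}
```

For example, applying "add 1" and then "multiply by 2" to 3 gives 8, while the reverse order gives 7.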
-
inline std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns)
Function to create a ProjectDataset.
Note
Applies project to the dataset.
- Parameters
columns – [in] The name of columns to project.
- Returns
Shared pointer to the current Dataset.
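As an illustration of projecting columns, the sketch below selects values from a single row by column name. ProjectRow is a hypothetical helper. It assumes that the projection keeps only the named columns and emits them in the order requested, which the description above does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of MindSpore): returns the values of the
// requested columns, in the order requested. column_names[i] names row[i].
std::vector<int> ProjectRow(const std::vector<std::string> &column_names,
                            const std::vector<int> &row,
                            const std::vector<std::string> &columns) {
  assert(column_names.size() == row.size());
  std::vector<int> projected;
  for (const std::string &name : columns) {
    auto it = std::find(column_names.begin(), column_names.end(), name);
    assert(it != column_names.end());  // every requested column must exist
    projected.push_back(row[static_cast<std::size_t>(it - column_names.begin())]);
  }
  return projected;
}
```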
-
inline std::shared_ptr<RenameDataset> Rename(const std::vector<std::string> &input_columns, const std::vector<std::string> &output_columns)
Function to create a RenameDataset.
Note
Renames the columns in the input dataset.
- Parameters
input_columns – [in] List of the input columns to rename.
output_columns – [in] List of the output columns.
- Returns
Shared pointer to the current Dataset.
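The pairing of input_columns with output_columns can be sketched as a positional mapping over column names. RenameColumns is a hypothetical helper. It assumes that input_columns[i] is renamed to output_columns[i] and that all lookups are made against the original names, so that swapping two names works.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of MindSpore): renames input_columns[i] to
// output_columns[i]. Columns that are not listed keep their names.
std::vector<std::string> RenameColumns(
    const std::vector<std::string> &columns,
    const std::vector<std::string> &input_columns,
    const std::vector<std::string> &output_columns) {
  assert(input_columns.size() == output_columns.size());
  std::vector<std::string> renamed = columns;
  for (std::size_t i = 0; i < input_columns.size(); ++i) {
    // Search the original names so earlier renames do not affect later ones.
    auto it = std::find(columns.begin(), columns.end(), input_columns[i]);
    assert(it != columns.end());  // every input column must exist
    renamed[static_cast<std::size_t>(it - columns.begin())] = output_columns[i];
  }
  return renamed;
}
```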
-
inline std::shared_ptr<RepeatDataset> Repeat(int32_t count = -1)
Function to create a RepeatDataset.
Note
Repeats this dataset count times. Repeat indefinitely if count is -1.
Note
Repeat will return a shared pointer to Dataset instead of RepeatDataset due to a limitation in the current implementation.
- Parameters
count – [in] Number of times the dataset should be repeated.
- Returns
Shared pointer to the current Dataset.
-
inline std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size)
Function to create a ShuffleDataset.
Note
Randomly shuffles the rows of this dataset.
- Parameters
buffer_size – [in] The size of the buffer used for shuffling (must be larger than 1).
- Returns
Shared pointer to the current Dataset.
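To give a feel for what a shuffle buffer does, here is a sketch of one common buffer-based scheme. Whether MindSpore uses exactly this scheme is an assumption that the text above does not confirm. BufferShuffle and its seed parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of MindSpore): shuffles rows through a
// fixed-size buffer. Each output row is drawn at random from the buffer and
// its slot is refilled with the next input row.
std::vector<int> BufferShuffle(const std::vector<int> &rows,
                               int32_t buffer_size, uint32_t seed) {
  assert(buffer_size > 1);
  std::mt19937 rng(seed);
  std::vector<int> buffer;
  std::vector<int> out;
  out.reserve(rows.size());
  for (int row : rows) {
    if (buffer.size() < static_cast<std::size_t>(buffer_size)) {
      buffer.push_back(row);  // still filling the buffer
      continue;
    }
    std::uniform_int_distribution<std::size_t> pick(0, buffer.size() - 1);
    std::size_t slot = pick(rng);
    out.push_back(buffer[slot]);  // emit a random buffered row
    buffer[slot] = row;           // refill its slot with the new row
  }
  // Input exhausted: emit what is left in the buffer in random order.
  std::shuffle(buffer.begin(), buffer.end(), rng);
  out.insert(out.end(), buffer.begin(), buffer.end());
  return out;
}
```

In this scheme a small buffer only mixes rows locally: the first output row always comes from the first buffer_size input rows. A buffer at least as large as the dataset gives a full shuffle.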
-
inline std::shared_ptr<SkipDataset> Skip(int32_t count)
Function to create a SkipDataset.
Note
Skips count elements in this dataset.
- Parameters
count – [in] Number of elements to skip from the dataset.
- Returns
Shared pointer to the current Dataset.
-
inline std::shared_ptr<TakeDataset> Take(int32_t count = -1)
Function to create a TakeDataset.
Note
Takes count elements in this dataset.
- Parameters
count – [in] Number of elements to take from the dataset.
- Returns
Shared pointer to the current Dataset.
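Skip and Take are complementary: one drops a prefix of the dataset, the other keeps one. The sketch below models both on a vector of integers. SkipRows and TakeRows are hypothetical helpers. Treating count = -1 in TakeRows as "take everything" is an assumption based on the default value in the signature, not something stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of MindSpore): drops the first count rows.
std::vector<int> SkipRows(const std::vector<int> &rows, int32_t count) {
  assert(count >= 0);
  std::size_t n = std::min(static_cast<std::size_t>(count), rows.size());
  return std::vector<int>(rows.begin() + static_cast<std::ptrdiff_t>(n),
                          rows.end());
}

// Hypothetical helper (not part of MindSpore): keeps the first count rows.
// Assumption: count == -1 means "take all rows".
std::vector<int> TakeRows(const std::vector<int> &rows, int32_t count = -1) {
  assert(count >= -1);
  std::size_t n = (count < 0)
                      ? rows.size()
                      : std::min(static_cast<std::size_t>(count), rows.size());
  return std::vector<int>(rows.begin(),
                          rows.begin() + static_cast<std::ptrdiff_t>(n));
}
```

Chaining the two selects a window: skipping 2 rows and then taking 3 from {0, 1, 2, 3, 4, 5} leaves {2, 3, 4}.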
-
Dataset()