Class Dataset
Defined in File datasets.h
Inheritance Relationships
Base Type
public std::enable_shared_from_this< Dataset >
Derived Types
public mindspore::dataset::AlbumDataset (Class AlbumDataset)
public mindspore::dataset::BatchDataset (Class BatchDataset)
public mindspore::dataset::MapDataset (Class MapDataset)
public mindspore::dataset::MnistDataset (Class MnistDataset)
public mindspore::dataset::ProjectDataset (Class ProjectDataset)
public mindspore::dataset::ShuffleDataset (Class ShuffleDataset)
Class Documentation
-
class Dataset : public std::enable_shared_from_this<Dataset>
A base class to represent a dataset in the data pipeline.
Subclassed by mindspore::dataset::AlbumDataset, mindspore::dataset::BatchDataset, mindspore::dataset::MapDataset, mindspore::dataset::MnistDataset, mindspore::dataset::ProjectDataset, mindspore::dataset::ShuffleDataset
Public Functions
-
Dataset()
Constructor.
-
virtual ~Dataset() = default
Destructor.
-
int64_t GetDatasetSize(bool estimate = false)
Gets the dataset size.
- Parameters
estimate – [in] Only supported by some ops; when true, speeds up computing the dataset size at the expense of accuracy.
- Returns
dataset size. If failed, return -1
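A hedged usage sketch (not from the original documentation; assumes `folder_path` points at a valid MNIST directory and the usual MindSpore headers are included):

```cpp
/* Hypothetical sketch: query the pipeline size before and after batching.
   Assumes folder_path points at a valid MNIST directory. */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all");
int64_t rows = ds->GetDatasetSize();     // number of rows, or -1 on failure
ds = ds->Batch(32, true);
int64_t batches = ds->GetDatasetSize();  // rows / 32, remainder dropped
```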
-
std::vector<mindspore::DataType> GetOutputTypes()
Gets the output types.
- Returns
a vector of DataType. If failed, return an empty vector
-
std::vector<std::vector<int64_t>> GetOutputShapes()
Gets the output shapes.
- Returns
a vector of TensorShape. If failed, return an empty vector
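A hedged sketch of inspecting per-column output metadata (not from the original documentation; assumes `ds` is a valid Dataset and `<iostream>` is included):

```cpp
/* Hypothetical sketch: inspect the types and shapes of each output column.
   Entries are in column order; both vectors are empty on failure. */
std::vector<mindspore::DataType> types = ds->GetOutputTypes();
std::vector<std::vector<int64_t>> shapes = ds->GetOutputShapes();
for (size_t i = 0; i < shapes.size(); ++i) {
  // Each entry describes one output column of the pipeline.
  std::cout << "column " << i << " rank: " << shapes[i].size() << std::endl;
}
```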
-
int64_t GetBatchSize()
Gets the batch size.
- Returns
int64_t
-
int64_t GetRepeatCount()
Gets the repeat count.
- Returns
int64_t
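A hedged sketch of the batch-size and repeat-count getters (not from the original documentation; assumes `ds` is a valid Dataset):

```cpp
/* Hypothetical sketch: the getters reflect the Batch/Repeat nodes
   currently present in the pipeline. */
ds = ds->Batch(16);
int64_t batch_size = ds->GetBatchSize();  // 16 after the Batch above
int64_t repeats = ds->GetRepeatCount();   // 1 if no repeat was applied
```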
-
int64_t GetNumClasses()
Gets the number of classes.
- Returns
number of classes. If failed, return -1
-
inline std::vector<std::string> GetColumnNames()
Gets the column names.
- Returns
Names of the columns. If failed, return an empty vector
-
inline std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing()
Gets the class indexing.
- Returns
a vector of ClassIndexing pairs (class name, label indices). If failed, return an empty vector
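A hedged sketch combining the two getters above (not from the original documentation; class indexing is only meaningful for labeled datasets, and `ds` is assumed valid):

```cpp
/* Hypothetical sketch: list the column names, then walk the class indexing. */
std::vector<std::string> columns = ds->GetColumnNames();  // e.g. {"image", "label"}
std::vector<std::pair<std::string, std::vector<int32_t>>> indexing = ds->GetClassIndexing();
for (const auto &entry : indexing) {
  // entry.first is the class name; entry.second holds its label indices.
}
```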
-
std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers)
Setter function for runtime number of workers.
- Parameters
num_workers – [in] The number of threads in this operator
- Returns
Shared pointer to the original object
Example
/* Set number of workers (threads) to process the dataset in parallel */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->SetNumWorkers(16);
-
std::shared_ptr<PullIterator> CreatePullBasedIterator(const std::vector<std::vector<char>> &columns = {})
Function to create a PullBasedIterator over the Dataset.
- Parameters
columns – [in] List of columns to be used to specify the order of columns
- Returns
Shared pointer to the Iterator
Example
/* dataset is an instance of Dataset object */
std::shared_ptr<PullIterator> iter = dataset->CreatePullBasedIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);
-
inline std::shared_ptr<Iterator> CreateIterator(const std::vector<std::string> &columns = {}, int32_t num_epochs = -1)
Function to create an Iterator over the Dataset pipeline.
- Parameters
columns – [in] List of columns to be used to specify the order of columns
num_epochs – [in] Number of epochs to run through the pipeline (default=-1, which means infinite epochs). An empty row is returned at the end of each epoch.
- Returns
Shared pointer to the Iterator
Example
/* dataset is an instance of Dataset object */
std::shared_ptr<Iterator> iter = dataset->CreateIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);
-
inline bool DeviceQueue(const std::string &queue_name = "", const std::string &device_type = "", int32_t device_id = 0, int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false)
Function to transfer data through a device.
Note
If device is Ascend, features of data will be transferred one by one. The limitation of data transmission per time is 256M.
- Parameters
queue_name – [in] Channel name (default=””, create new unique name).
device_type – [in] Type of device (default=””, get from MSContext).
device_id – [in] ID of device (default=0, get from MSContext).
num_epochs – [in] Number of epochs (default=-1, infinite epochs).
send_epoch_end – [in] Whether to send end of sequence to device or not (default=true).
total_batches – [in] Number of batches to be sent to the device (default=0, all data).
create_data_info_queue – [in] Whether to create a queue which stores the types and shapes of data or not (default=false).
- Returns
Returns true if no error encountered else false.
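A hedged sketch of a device-queue transfer (not from the original documentation; assumes an Ascend device is available and `folder_path` points at a valid image folder):

```cpp
/* Hypothetical sketch: push the pipeline output to a device queue.
   An empty queue name and device type are resolved from MSContext. */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
bool ok = ds->DeviceQueue("", "Ascend", 0, 1);  // one epoch on device 0
if (!ok) {
  // Handle the transfer failure.
}
```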
-
inline bool Save(const std::string &dataset_path, int32_t num_files = 1, const std::string &dataset_type = "mindrecord")
Function to create a Saver to save the dynamic data processed by the dataset pipeline.
Note
Usage restrictions:
Supported dataset formats: ‘mindrecord’ only
To save the samples in order, set dataset’s shuffle to false and num_files to 1.
Before calling this function, do not use a batch operator, a repeat operator, or data augmentation operators with random attributes in a map operator.
MindRecord does not support bool, uint64, multi-dimensional uint8 (drop dimension), or multi-dimensional string.
- Parameters
dataset_path – [in] Path to dataset file
num_files – [in] Number of dataset files (default=1)
dataset_type – [in] Dataset format (default=”mindrecord”)
- Returns
Returns true if no error encountered else false
Example
/* Create a dataset and save its data into MindRecord */
std::string folder_path = "/path/to/cifar_dataset";
std::shared_ptr<Dataset> ds = Cifar10(folder_path, "all", std::make_shared<SequentialSampler>(0, 10));
std::string save_file = "Cifar10Data.mindrecord";
bool rc = ds->Save(save_file);
-
std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false)
Function to create a BatchDataset.
Note
Combines batch_size number of consecutive rows into batches
- Parameters
batch_size – [in] The number of rows each batch is created with
drop_remainder – [in] Determines whether or not to drop the last possibly incomplete batch. If true, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the next node
- Returns
Shared pointer to the current BatchDataset
Example
/* Create a dataset where every 100 rows are combined into a batch */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->Batch(100, true);
-
inline std::shared_ptr<MapDataset> Map(const std::vector<TensorTransform*> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset
- Parameters
operations – [in] Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column
output_columns – [in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, output_columns will have the same names as the input columns, i.e., the columns will be replaced
project_columns – [in] A list of column names to project
cache – [in] Tensor cache to use. (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.
- Returns
Shared pointer to the current MapDataset
Example
// Create objects for the tensor ops.
std::shared_ptr<TensorTransform> decode_op = std::make_shared<vision::Decode>(true);
std::shared_ptr<TensorTransform> random_jitter_op = std::make_shared<vision::RandomColor>(0.0, 0.0);

/* 1) Simple map example */

// Apply decode_op on column "image". This column will be replaced by the output
// column of decode_op. Since column_order is not provided, both columns "image"
// and "label" will be propagated to the child node in their original order.
dataset = dataset->Map({decode_op}, {"image"});

// Decode and rename column "image" to "decoded_image".
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"});

// Specify the order of the output columns.
dataset = dataset->Map({decode_op}, {"image"}, {}, {"label", "image"});

// Rename column "image" to "decoded_image" and also specify the order of the output columns.
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"}, {"label", "decoded_image"});

// Rename column "image" to "decoded_image" and keep only this column.
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"}, {"decoded_image"});

/* 2) Map example with more than one operation */

// Create a dataset where the images are decoded, then randomly color jittered.
// decode_op takes column "image" as input and outputs one column. The column
// output by decode_op is passed as input to random_jitter_op.
// random_jitter_op will output one column. Column "image" will be replaced by
// the column output by random_jitter_op (the very last operation). All other
// columns are unchanged. Since column_order is not specified, the order of the
// columns will remain the same.
dataset = dataset->Map({decode_op, random_jitter_op}, {"image"});
-
inline std::shared_ptr<MapDataset> Map(const std::vector<std::shared_ptr<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset
- Parameters
operations – [in] Vector of shared pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column
output_columns – [in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, output_columns will have the same names as the input columns, i.e., the columns will be replaced
project_columns – [in] A list of column names to project
cache – [in] Tensor cache to use. (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.
- Returns
Shared pointer to the current MapDataset
-
inline std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::vector<std::string> &project_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})
Function to create a MapDataset.
Note
Applies each operation in operations to this dataset
- Parameters
operations – [in] Vector of TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list
input_columns – [in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. The default input_columns is the first column
output_columns – [in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, output_columns will have the same names as the input columns, i.e., the columns will be replaced
project_columns – [in] A list of column names to project
cache – [in] Tensor cache to use. (default=nullptr which means no cache is used).
callbacks – [in] List of Dataset callbacks to be called.
- Returns
Shared pointer to the current MapDataset
-
inline std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns)
Function to create a Project Dataset.
Note
Applies project to the dataset
- Parameters
columns – [in] The name of columns to project
- Returns
Shared pointer to the current Dataset
Example
/* Reorder the original columns in the dataset */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Project({"label", "image"});
-
inline std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size)
Function to create a Shuffle Dataset.
Note
Randomly shuffles the rows of this dataset
- Parameters
buffer_size – [in] The size of the buffer (must be larger than 1) for shuffling
- Returns
Shared pointer to the current ShuffleDataset
Example
/* Shuffle the rows in the dataset */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Shuffle(4);