Class BucketBatchByLengthDataset

Defined in File datasets.h

Inheritance Relationships

Base Type

public mindspore::dataset::Dataset (Class Dataset)

Class Documentation

class BucketBatchByLengthDataset : public mindspore::dataset::Dataset

The result of applying the BucketBatchByLength operator to the input dataset.

Public Functions

BucketBatchByLengthDataset(const std::shared_ptr<Dataset> &input, const std::vector<std::vector<char>> &column_names, const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes, const std::function<MSTensorVec(MSTensorVec)> &element_length_function = nullptr, const std::map<std::vector<char>, std::pair<std::vector<int64_t>, MSTensor>> &pad_info = {}, bool pad_to_bucket_boundary = false, bool drop_remainder = false)

Constructor of BucketBatchByLengthDataset.

Note

Elements are placed into buckets according to their lengths. Each bucket is padded and batched when it is full.
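The bucketing behaviour described in this note can be sketched without MindSpore. The snippet below is a minimal simulation, not the library's implementation: `BucketIndex` and `BucketSimulator` are hypothetical names introduced only for illustration, and elements are represented by their lengths alone. It assumes the bucket intervals given for bucket_boundaries, where a length equal to a boundary falls into the next bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of MindSpore): index of the bucket that an
// element of the given length falls into. Bucket 0 covers [0, b[0]), bucket i
// covers [b[i-1], b[i]), and the last bucket covers [b[n-1], inf).
std::size_t BucketIndex(const std::vector<int32_t> &boundaries, int32_t length) {
  // upper_bound returns the first boundary strictly greater than length, so a
  // length equal to a boundary lands in the following bucket.
  return static_cast<std::size_t>(
      std::upper_bound(boundaries.begin(), boundaries.end(), length) - boundaries.begin());
}

// Hypothetical simulation: stores element lengths per bucket and emits a batch
// as soon as a bucket holds as many elements as its configured batch size.
class BucketSimulator {
 public:
  BucketSimulator(std::vector<int32_t> boundaries, std::vector<int32_t> batch_sizes)
      : boundaries_(std::move(boundaries)),
        batch_sizes_(std::move(batch_sizes)),
        buckets_(batch_sizes_.size()) {
    // One batch size per bucket: n boundaries give n + 1 buckets.
    assert(batch_sizes_.size() == boundaries_.size() + 1);
  }

  // Adds one element. Returns the full batch (as a list of lengths) if this
  // element filled its bucket, otherwise an empty vector.
  std::vector<int32_t> Add(int32_t length) {
    const std::size_t index = BucketIndex(boundaries_, length);
    std::vector<int32_t> &bucket = buckets_[index];
    bucket.push_back(length);
    if (bucket.size() < static_cast<std::size_t>(batch_sizes_[index])) {
      return {};
    }
    std::vector<int32_t> batch;
    batch.swap(bucket);  // hand out the batch and leave the bucket empty
    return batch;
  }

 private:
  std::vector<int32_t> boundaries_;
  std::vector<int32_t> batch_sizes_;
  std::vector<std::vector<int32_t>> buckets_;
};
```

With boundaries {5, 10} and batch sizes {2, 3, 1}, adding lengths 3, 7 and 4 in that order emits the first batch on the third call, because 3 and 4 both fall into the bucket [0, 5), whose batch size is 2.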

Parameters

input – [in] The dataset to which the bucket-batch-by-length operation is applied.
column_names – [in] Columns passed to element_length_function.
bucket_boundaries – [in] A list of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0 <= i < n-1, and one bucket for [bucket_boundaries[n-1], inf).
bucket_batch_sizes – [in] A list of the batch sizes for each bucket. The number of elements must equal the size of bucket_boundaries plus 1.
element_length_function – [in] A function that takes an MSTensorVec and returns an MSTensorVec. The output must contain a single tensor holding a single int32_t. If no function is provided, column_names must contain exactly one column, and the size of the first dimension of that column is taken as the length (default=nullptr).
pad_info – [in] Describes how each column is padded when it is batched. The key is the column name and the value must be a pair. The first element of the pair is the shape to pad to, and the second element is the value to pad with. If a column is not specified, that column is padded to the longest element in the current batch, and 0 is used as the padding value. Any unspecified dimensions are padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is wanted, set pad_info to None (default=empty map).
pad_to_bucket_boundary – [in] If true, each unspecified dimension in pad_info is padded to the upper boundary of the element's bucket minus 1. If any element falls into the last bucket, an error occurs (default=false).
drop_remainder – [in] If true, the last batch of each bucket is dropped if it is not a full batch (default=false).
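The constraints and options in the parameter list can be illustrated with a small standalone sketch. It is not MindSpore code: `ValidBucketSpec`, `PadTarget` and `NumBatches` are hypothetical helpers that only restate the rules above (strictly increasing boundaries, one batch size per bucket, padding to the bucket boundary minus 1, and dropping an incomplete final batch).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical check of the documented constraints: boundaries must be
// strictly increasing and there must be one batch size per bucket (n + 1).
bool ValidBucketSpec(const std::vector<int32_t> &boundaries,
                     const std::vector<int32_t> &batch_sizes) {
  // adjacent_find with greater_equal locates the first pair where a >= b,
  // that is, the first violation of strict increase.
  const bool strictly_increasing =
      std::adjacent_find(boundaries.begin(), boundaries.end(),
                         std::greater_equal<int32_t>()) == boundaries.end();
  return strictly_increasing && batch_sizes.size() == boundaries.size() + 1;
}

// Hypothetical model of pad_to_bucket_boundary = true: writes the padded
// length (upper boundary of the element's bucket minus 1) to *target.
// Returns false for the last bucket, which has no upper boundary; the
// documentation states that this case is an error.
bool PadTarget(const std::vector<int32_t> &boundaries, int32_t length, int32_t *target) {
  const auto upper = std::upper_bound(boundaries.begin(), boundaries.end(), length);
  if (upper == boundaries.end()) {
    return false;
  }
  *target = *upper - 1;
  return true;
}

// Hypothetical model of drop_remainder: number of batches one bucket yields
// after receiving `count` elements in total.
int32_t NumBatches(int32_t count, int32_t batch_size, bool drop_remainder) {
  assert(batch_size > 0);
  const int32_t full = count / batch_size;
  const bool has_partial = (count % batch_size) != 0;
  return full + ((has_partial && !drop_remainder) ? 1 : 0);
}
```

For boundaries {5, 10}, an element of length 3 would be padded to 4 and an element of length 5 to 9, while an element of length 10 falls into the last bucket and is rejected. A bucket with batch size 3 that receives 7 elements yields 3 batches, or 2 when drop_remainder is true.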

~BucketBatchByLengthDataset() override = default

Destructor of BucketBatchByLengthDataset.