Class BucketBatchByLengthDataset

Inheritance Relationships

Base Type

  • public mindspore::dataset::Dataset

Class Documentation

class BucketBatchByLengthDataset : public mindspore::dataset::Dataset

The result of applying the BucketBatchByLength operator to the input dataset.

Public Functions

BucketBatchByLengthDataset(std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &column_names, const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes, std::function<MSTensorVec(MSTensorVec)> element_length_function = nullptr, const std::map<std::vector<char>, std::pair<std::vector<int64_t>, MSTensor>> &pad_info = {}, bool pad_to_bucket_boundary = false, bool drop_remainder = false)

Constructor of BucketBatchByLengthDataset.

Note

Elements are bucketed according to their lengths. Each bucket is padded and batched when it is full.
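The bucketing behavior described in this note can be illustrated with a small standalone sketch. It does not use MindSpore at all: SimulateBucketBatching is a hypothetical helper written for illustration, elements are represented only by their lengths, and padding is omitted. The order in which leftover partial batches are emitted at the end is an assumption of the sketch, not a documented guarantee.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of MindSpore): simulates grouping elements
// into buckets by length and emitting a batch whenever a bucket is full.
// Each returned batch lists the lengths of the elements it contains.
// Assumes bucket_batch_sizes.size() == bucket_boundaries.size() + 1.
std::vector<std::vector<int32_t>> SimulateBucketBatching(
    const std::vector<int32_t> &lengths,
    const std::vector<int32_t> &bucket_boundaries,
    const std::vector<int32_t> &bucket_batch_sizes,
    bool drop_remainder) {
  std::vector<std::vector<int32_t>> buckets(bucket_boundaries.size() + 1);
  std::vector<std::vector<int32_t>> batches;

  for (int32_t len : lengths) {
    // The bucket index is the position of the first boundary greater than len.
    const auto it = std::upper_bound(bucket_boundaries.begin(),
                                     bucket_boundaries.end(), len);
    const std::size_t b =
        static_cast<std::size_t>(it - bucket_boundaries.begin());

    buckets[b].push_back(len);
    if (buckets[b].size() == static_cast<std::size_t>(bucket_batch_sizes[b])) {
      batches.push_back(buckets[b]);  // bucket is full: emit it as one batch
      buckets[b].clear();
    }
  }

  // Leftover elements form partial batches unless drop_remainder is set.
  if (!drop_remainder) {
    for (const auto &bucket : buckets) {
      if (!bucket.empty()) {
        batches.push_back(bucket);
      }
    }
  }
  return batches;
}
```

For example, with lengths {1, 5, 2, 7, 12, 3}, boundaries {4, 10} and batch sizes {2, 2, 1}, the sketch emits the full batches {1, 2}, {5, 7} and {12}, followed by the partial batch {3} only when drop_remainder is false.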

Parameters
  • input[in] The dataset to which the bucket-batch-by-length operation is applied.

  • column_names[in] Columns passed to element_length_function.

  • bucket_boundaries[in] A list of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0 <= i < n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes[in] A list of the batch sizes for each bucket. Must contain exactly one more element than bucket_boundaries.

  • element_length_function[in] A function that takes an MSTensorVec and returns an MSTensorVec. The output must contain a single tensor holding a single int32_t value. If no function is provided, column_names must contain exactly one column, and the size of the first dimension of that column is used as the length (default=nullptr).

  • pad_info[in] Represents how to pad each column. The key is the column name, and the value must be a pair. The first element of the pair is the shape to pad to, and the second element is the value to pad with. If a column is not specified, that column is padded to the longest element in the current batch, and 0 is used as the padding value. Any unspecified dimensions are padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is wanted, set pad_info to None (default=empty map).

  • pad_to_bucket_boundary[in] If true, each dimension left unspecified in pad_info is padded to the upper boundary of the element's bucket minus 1. An error occurs if any element falls into the last bucket, which has no upper boundary (default=false).

  • drop_remainder[in] If true, the last batch of each bucket is dropped when it is not a full batch (default=false).
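The constraints on bucket_boundaries and bucket_batch_sizes, the mapping from an element length to a bucket, and the padding target used with pad_to_bucket_boundary can be sketched in standalone C++. The helper names ValidBucketArgs, BucketIndex and PadLengthForBucket are hypothetical and are not MindSpore functions; the sketch only mirrors the rules stated in the parameter descriptions above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical helper: checks the documented argument constraints, namely
// strictly increasing boundaries and one batch size per bucket (n + 1).
bool ValidBucketArgs(const std::vector<int32_t> &bucket_boundaries,
                     const std::vector<int32_t> &bucket_batch_sizes) {
  // adjacent_find returns end() when no neighbouring pair has a >= b.
  const bool strictly_increasing =
      std::adjacent_find(bucket_boundaries.begin(), bucket_boundaries.end(),
                         std::greater_equal<int32_t>()) ==
      bucket_boundaries.end();
  return strictly_increasing &&
         bucket_batch_sizes.size() == bucket_boundaries.size() + 1;
}

// Hypothetical helper: bucket 0 covers [0, b[0]), bucket i covers
// [b[i-1], b[i]), and the last bucket covers [b[n-1], inf).
std::size_t BucketIndex(int32_t length,
                        const std::vector<int32_t> &bucket_boundaries) {
  const auto it = std::upper_bound(bucket_boundaries.begin(),
                                   bucket_boundaries.end(), length);
  return static_cast<std::size_t>(it - bucket_boundaries.begin());
}

// Hypothetical helper: padded length when pad_to_bucket_boundary is true.
// The last bucket has no upper boundary, so it is reported as an error here.
int32_t PadLengthForBucket(std::size_t bucket_index,
                           const std::vector<int32_t> &bucket_boundaries) {
  if (bucket_index >= bucket_boundaries.size()) {
    throw std::out_of_range("last bucket has no upper boundary to pad to");
  }
  return bucket_boundaries[bucket_index] - 1;
}
```

With boundaries {4, 10}, lengths 0 to 3 map to bucket 0, lengths 4 to 9 map to bucket 1, and lengths of 10 or more map to bucket 2. Under these assumptions, elements in buckets 0 and 1 would be padded to lengths 3 and 9 respectively, while bucket 2 triggers the error case.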

~BucketBatchByLengthDataset() = default

Destructor of BucketBatchByLengthDataset.