mindflow.data.MindDataset

class mindflow.data.MindDataset(dataset_files, dataset_name='dataset', constraint_type='Label', shuffle=True, num_shards=None, shard_id=None, sampler=None, num_samples=None, num_parallel_workers=None)[source]

Create dataset from MindRecord-type data.

Parameters
  • dataset_files (Union[str, list[str]]) – If dataset_files is a str, it represents the file name of one component of a MindRecord source; other files with an identical source in the same path will be found and loaded automatically. If dataset_files is a list, it represents a list of dataset files to be read directly.

  • dataset_name (str, optional) – name of the dataset. Default: "dataset".

  • constraint_type (str, optional) – constraint type of the specified dataset, used to get its corresponding loss function. Default: "Label". Other supported types can be found in mindflow.data.Dataset.

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch. If shuffle is False, no shuffling is performed. If shuffle is True, a global shuffle is performed. Default: True. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and the samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None. The sampler is mutually exclusive with shuffle and block_reader. Supported list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all samples.

  • num_parallel_workers (int, optional) – The number of readers. Default: None.

Raises
  • ValueError – If dataset_files are not valid or do not exist.

  • TypeError – If dataset_name is not a string.

  • ValueError – If constraint_type.lower() not in ["equation", "bc", "ic", "label", "function", "custom"].

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Supported Platforms:

Ascend GPU

Examples

>>> from mindflow.data import MindDataset
>>> dataset_files = ["./data_dir"] # contains 1 or multiple MindRecord files
>>> dataset = MindDataset(dataset_files=dataset_files)
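
A slightly fuller sketch (assumptions: the same data_dir, a hypothetical two-process distributed read, and file-level shuffling via mindspore.dataset.Shuffle):

>>> from mindspore.dataset import Shuffle
>>> # Hypothetical sharded read: this process loads shard 0 of 2 and shuffles file order only.
>>> sharded_dataset = MindDataset(dataset_files=dataset_files, dataset_name="flow_data",
...                               constraint_type="Label", shuffle=Shuffle.FILES,
...                               num_shards=2, shard_id=0)
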
create_dataset(batch_size=1, preprocess_fn=None, updated_columns_list=None, drop_remainder=True, prebatched_data=False, num_parallel_workers=1, python_multiprocessing=False)[source]

Create the final MindSpore-type dataset.

Parameters
  • batch_size (int, optional) – The number of rows in each created batch. Default: 1.

  • preprocess_fn (Union[list[TensorOp], list[functions]], optional) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list. Default: None.

  • updated_columns_list (list, optional) – List of columns to be updated after the preprocessing operations are applied. Default: None.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last block whose data row number is less than batch size. If True, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node. Default: True.

  • prebatched_data (bool, optional) – Generate pre-batched data before data preprocessing. Default: False.

  • num_parallel_workers (int, optional) – Number of workers (threads) to process the dataset in parallel. Default: 1.

  • python_multiprocessing (bool, optional) – Parallelize the Python function per_batch_map with multiprocessing. This option can be beneficial if the function is computationally heavy. Default: False.

Returns

BatchDataset, the batched dataset.

Examples

>>> data = dataset.create_dataset()
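
A slightly longer sketch, assuming a hypothetical NumPy scaling function passed through preprocess_fn (operations are applied in list order):

>>> import numpy as np
>>> # Hypothetical preprocessing: scale every column by 2.0, purely for illustration.
>>> def scale_columns(*columns):
...     return tuple(np.asarray(col) * 2.0 for col in columns)
>>> data = dataset.create_dataset(batch_size=32,
...                               preprocess_fn=[scale_columns],
...                               drop_remainder=False)
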
get_columns_list()[source]

Get the list of column names.

Returns

list[str], the column names of the final unified dataset.

Examples

>>> columns_list = dataset.get_columns_list()
set_constraint_type(constraint_type='Equation')[source]

Set the constraint type of the dataset.

Parameters

constraint_type (Union[str, dict]) – The constraint type of the specified dataset. If it is a string, the constraint type of all sub-datasets will be set to the same one. If it is a dict, the sub-dataset and its constraint type are specified by the pair (key, value).

Examples

>>> dataset.set_constraint_type("Equation")
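
A sketch of the dict form, assuming the dataset has already been split into sub-datasets named "Equation" and "BC" (see split_dataset below); the names are illustrative:

>>> # Assign a constraint type per sub-dataset: key is the sub-dataset name, value is its constraint type.
>>> dataset.set_constraint_type({"Equation": "Equation", "BC": "BC"})
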
split_dataset(dataset_dict, constraint_dict=None)[source]

Split the original dataset in order to set different loss functions.

Parameters
  • dataset_dict (dict) – dictionary of each sub-dataset; the key is the labeled name while the value refers to the specified columns contained in the sub-dataset.

  • constraint_dict (Union[None, str, dict]) – The constraint type of the specified dataset. If None, "Label" will be set for all. If it is a string, all will be set to the same one. If it is a dict, the sub-dataset and its constraint type are specified by the pair (key, value). Default: None.

Examples

>>> dataset.split_dataset({"Equation" : "inner_points", "BC" : "bc_points"})
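
A sketch of the same split with an explicit constraint_dict; the column names inner_points and bc_points are illustrative:

>>> dataset.split_dataset(dataset_dict={"Equation": "inner_points", "BC": "bc_points"},
...                       constraint_dict={"Equation": "Equation", "BC": "BC"})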