mindspore.dataset.TextFileDataset

class mindspore.dataset.TextFileDataset(dataset_files, num_samples=None, num_parallel_workers=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None, cache=None)

A source dataset that reads and parses datasets stored on disk in text format. The generated dataset has one column [text] with type string.

Parameters:

dataset_files (Union[str, list[str]]) – A file path, a list of file paths, or glob patterns matching the files to read. The resulting file list is sorted in lexicographical order.
num_samples (int, optional) – The number of samples to include in the dataset. Default: None, which includes all samples.
num_parallel_workers (int, optional) – Number of worker threads used to read the data. Default: None, which uses the global default number of workers (8). The global default can be set with mindspore.dataset.config.set_num_parallel_workers().
shuffle (Union[bool, Shuffle], optional) –
Whether and how to reshuffle the data every epoch. Accepts either a bool or a Shuffle enum value. Default: Shuffle.GLOBAL. If shuffle is False, no shuffling is performed. If shuffle is True, a global shuffle is performed. The two shuffle levels are defined by the mindspore.dataset.Shuffle enum:
- Shuffle.GLOBAL: Shuffle both the files and the samples; same as setting shuffle to True.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard. Used in data parallel training.
shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.
cache (DatasetCache, optional) – Use the tensor caching service to speed up dataset processing. For details, see Single-Node Data Cache. Default: None, which means no cache is used.
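
The documented handling of dataset_files (glob expansion followed by a lexicographical sort) can be pictured with a short standard-library sketch. The helper name resolve_dataset_files is hypothetical and this is not MindSpore's internal code; it only mirrors the behaviour described for the parameter above.

```python
import glob
import os
import tempfile


def resolve_dataset_files(dataset_files):
    """Expand glob patterns and return a lexicographically sorted file list.

    Hypothetical helper: it imitates the documented behaviour of the
    dataset_files argument and is not part of MindSpore.
    """
    # Accept a single string as well as a list of strings.
    patterns = [dataset_files] if isinstance(dataset_files, str) else list(dataset_files)
    matched = []
    for pattern in patterns:
        # glob.glob returns [] for a pattern or plain path that matches nothing.
        matched.extend(glob.glob(pattern))
    if not matched:
        raise ValueError(f"no files match {patterns!r}")
    return sorted(matched)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("b.txt", "a.txt"):
            open(os.path.join(tmp, name), "w").close()
        print(resolve_dataset_files(os.path.join(tmp, "*.txt")))
```

Sorting the matches makes the file order independent of the order in which the file system happens to list them.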

Raises:

ValueError – If dataset_files are not valid or do not exist.
ValueError – If num_parallel_workers exceeds the maximum number of threads.
RuntimeError – If num_shards is specified but shard_id is None.
RuntimeError – If shard_id is specified but num_shards is None.
ValueError – If shard_id is not in the range [0, num_shards).
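
The three sharding conditions listed above can be summarised as a small check. This is a minimal sketch with a hypothetical helper name, check_shard_args; it restates the documented rules and is not MindSpore's own validator.

```python
def check_shard_args(num_shards=None, shard_id=None):
    """Raise the documented errors for inconsistent sharding arguments.

    Hypothetical helper that mirrors the Raises list; not part of MindSpore.
    """
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    # At this point both are None or both are set.
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError(f"shard_id must be in [0, {num_shards}), got {shard_id}")


if __name__ == "__main__":
    check_shard_args()                           # neither set: accepted
    check_shard_args(num_shards=4, shard_id=3)   # last valid shard: accepted
```

With num_shards=4, the valid shard IDs under this rule are 0, 1, 2 and 3.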

Tutorial Examples:

Load & Process Data With Dataset Pipeline

Examples

>>> import mindspore.dataset as ds
>>> text_file_list = ["/path/to/text_file_dataset_file"]  # one or more text files
>>> dataset = ds.TextFileDataset(dataset_files=text_file_list)
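
For an end-to-end run without a prepared dataset, the sketch below writes two small files to a temporary directory and reads them back through the single text column. The helper names write_sample_files and read_text_rows and the file contents are illustrative assumptions. MindSpore is imported inside read_text_rows so that the file-preparation helper can run on its own, and shuffle=False is passed so rows follow the sorted file order.

```python
import os
import tempfile


def write_sample_files(directory):
    """Create two small text files and return their paths in sorted order."""
    contents = {"a.txt": "first line\nsecond line\n", "b.txt": "third line\n"}
    paths = []
    for name, text in contents.items():
        path = os.path.join(directory, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    return sorted(paths)


def read_text_rows(paths):
    """Load the files with TextFileDataset and return the 'text' column as str."""
    # Imported here so the helper above does not require MindSpore.
    import mindspore.dataset as ds

    dataset = ds.TextFileDataset(dataset_files=paths, shuffle=False)
    rows = []
    for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
        value = row["text"].item()
        # Assumption: some versions may return bytes rather than str.
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        rows.append(value)
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(read_text_rows(write_sample_files(tmp)))
```

Each line of each input file becomes one row of the dataset.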

Pre-processing Operation

`mindspore.dataset.Dataset.apply`	Apply a function in this dataset.
`mindspore.dataset.Dataset.concat`	Concatenate the dataset objects in the input list.
`mindspore.dataset.Dataset.filter`	Filter dataset by predicate.
`mindspore.dataset.Dataset.flat_map`	Map func to each row in dataset and flatten the result.
`mindspore.dataset.Dataset.map`	Apply each operation in operations to this dataset.
`mindspore.dataset.Dataset.project`	The specified columns will be selected from the dataset and passed into the pipeline with the order specified.
`mindspore.dataset.Dataset.rename`	Rename the columns in input datasets.
`mindspore.dataset.Dataset.repeat`	Repeat this dataset count times.
`mindspore.dataset.Dataset.reset`	Reset the dataset for next epoch.
`mindspore.dataset.Dataset.save`	Save the dynamic data processed by the dataset pipeline in common dataset format.
`mindspore.dataset.Dataset.shuffle`	Shuffle the dataset by creating a cache with the size of buffer_size.
`mindspore.dataset.Dataset.skip`	Skip the first N elements of this dataset.
`mindspore.dataset.Dataset.split`	Split the dataset into smaller, non-overlapping datasets.
`mindspore.dataset.Dataset.take`	Take the first specified number of samples from the dataset.
`mindspore.dataset.Dataset.zip`	Zip this dataset with the datasets given in the input tuple.

Batch

`mindspore.dataset.Dataset.batch`	Combine batch_size consecutive rows into a batch, applying per_batch_map to the samples first.
`mindspore.dataset.Dataset.bucket_batch_by_length`	Bucket elements according to their lengths.
`mindspore.dataset.Dataset.padded_batch`	Combine batch_size consecutive rows into a batch, applying pad_info to the samples first.

Iterator

`mindspore.dataset.Dataset.create_dict_iterator`	Create an iterator over the dataset that yields samples of type dict, where each key is a column name and each value is the corresponding data.
`mindspore.dataset.Dataset.create_tuple_iterator`	Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column.

Attribute

`mindspore.dataset.Dataset.get_batch_size`	Return the size of batch.
`mindspore.dataset.Dataset.get_class_indexing`	Get the mapping dictionary from category names to category indexes.
`mindspore.dataset.Dataset.get_col_names`	Return the names of the columns in dataset.
`mindspore.dataset.Dataset.get_dataset_size`	Return the number of batches in an epoch.
`mindspore.dataset.Dataset.get_repeat_count`	Get the replication times in RepeatDataset.
`mindspore.dataset.Dataset.input_indexs`	Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode.
`mindspore.dataset.Dataset.num_classes`	Get the number of classes in a dataset.
`mindspore.dataset.Dataset.output_shapes`	Get the shapes of output data.
`mindspore.dataset.Dataset.output_types`	Get the types of output data.
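
Because get_dataset_size counts batches rather than rows, the expected value for a batched pipeline can be estimated with simple arithmetic. The sketch below is an assumption-labelled illustration: batches_per_epoch is a hypothetical helper, it assumes a fixed integer batch_size, and drop_remainder refers to the option of the same name on mindspore.dataset.Dataset.batch.

```python
import math


def batches_per_epoch(num_rows, batch_size, drop_remainder=False):
    """Estimate the number of batches produced from num_rows rows.

    Hypothetical helper for illustration; not part of MindSpore.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_remainder:
        # The final, incomplete batch is discarded.
        return num_rows // batch_size
    # The final batch may be smaller than batch_size but is still counted.
    return math.ceil(num_rows / batch_size)


if __name__ == "__main__":
    print(batches_per_epoch(10, 4))
    print(batches_per_epoch(10, 4, drop_remainder=True))
```

For 10 rows and a batch size of 4, this gives 3 batches when the remainder is kept and 2 when it is dropped.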

Apply Sampler

`mindspore.dataset.MappableDataset.add_sampler`	Add a child sampler for the current dataset.
`mindspore.dataset.MappableDataset.use_sampler`	Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.

Others

`mindspore.dataset.Dataset.recv`	The dataset communication interface receives data sent by the source Dataset using `mindspore.dataset.Dataset.send`.
`mindspore.dataset.Dataset.send`	The dataset communication interface sends data to the target Dataset, which can be received through `mindspore.dataset.Dataset.recv`.
`mindspore.dataset.Dataset.sync_update`	Release a blocking condition and trigger callback with given data.
`mindspore.dataset.Dataset.sync_wait`	Add a blocking condition to the input Dataset and a synchronize action will be applied.
`mindspore.dataset.Dataset.to_json`	Serialize the pipeline into a JSON string and write it to a file if filename is provided.