mindspore.dataset.SogouNewsDataset

class mindspore.dataset.SogouNewsDataset(dataset_dir, usage=None, num_samples=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)

Sogou News dataset.

The generated dataset has three columns: [index, title, content]. All three columns have the string data type.

Parameters:

dataset_dir (str) – Path to the root directory that contains the dataset.
usage (str, optional) – Usage of this dataset, can be 'train', 'test' or 'all'. 'train' reads from the 450,000 training samples, 'test' reads from the 60,000 test samples, and 'all' reads from all 510,000 samples. Default: None, which reads all samples.
num_samples (int, optional) – Number of samples (rows) to read. Default: None, which reads all samples.
shuffle (Union[bool, Shuffle], optional) –
Whether to reshuffle the data every epoch. Both bool values and Shuffle enum values are accepted. Default: Shuffle.GLOBAL. If shuffle is False, no shuffling is performed. If shuffle is True, it is equivalent to setting shuffle to Shuffle.GLOBAL. Set the shuffling mode by passing in one of the following enumeration values:
- Shuffle.GLOBAL: Shuffle both the files and the samples, same as setting shuffle to True.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples is the maximum number of samples per shard. Used in data parallel training.
shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.
num_parallel_workers (int, optional) – Number of worker threads used to read the data. Default: None, which uses the global default number of workers (8); this default can be changed with mindspore.dataset.config.set_num_parallel_workers().
cache (DatasetCache, optional) – Use the tensor caching service to speed up dataset processing. More details: Single-Node Data Cache. Default: None, which means no cache is used.

Raises:

RuntimeError – If dataset_dir does not contain data files.
RuntimeError – If num_shards is specified but shard_id is None.
RuntimeError – If shard_id is specified but num_shards is None.
ValueError – If num_parallel_workers exceeds the maximum number of threads.
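
The pairing rule for num_shards and shard_id listed above can be checked before the dataset is constructed. The following is a minimal sketch, not MindSpore code: check_shard_args is a hypothetical helper, it raises ValueError rather than the RuntimeError listed above, and the zero-based range check on shard_id is an assumption that this page does not state.

```python
def check_shard_args(num_shards=None, shard_id=None):
    """Validate shard arguments (hypothetical pre-check, not part of MindSpore)."""
    # The two arguments must be given together or not at all.
    if (num_shards is None) != (shard_id is None):
        raise ValueError("num_shards and shard_id must be specified together")
    # Assumption: shard IDs are zero-based and smaller than num_shards.
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return {"num_shards": num_shards, "shard_id": shard_id}
```

The returned dict could then be unpacked as keyword arguments, for example ds.SogouNewsDataset(dataset_dir, **check_shard_args(4, 0)).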

Tutorial Examples:

Load & Process Data With Dataset Pipeline

Examples

>>> import mindspore.dataset as ds
>>> sogou_news_dataset_dir = "/path/to/sogou_news_dataset_dir"
>>> dataset = ds.SogouNewsDataset(dataset_dir=sogou_news_dataset_dir, usage='all')

About Sogou News Dataset:

The Sogou News dataset includes 3 columns, corresponding to class index (1 to 5), title and content. The title and content are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). Newlines are escaped as a backslash followed by an "n" character, that is, "\n".
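
To make the quoting rules above concrete, here is a minimal sketch that parses one row with Python's standard csv module. It is independent of MindSpore's own reader; the sample row and the parse_rows helper are hypothetical, and whether to convert the escaped "\n" back into a real line break is a choice made here for illustration.

```python
import csv
import io

# Hypothetical row: class index, title, content.
# Internal double quotes are doubled; line breaks appear as backslash + "n".
raw = '"3","A ""quoted"" title","first line\\nsecond line"\n'

def parse_rows(text):
    """Yield (index, title, content) string tuples from CSV text."""
    reader = csv.reader(io.StringIO(text), quotechar='"', doublequote=True)
    for index, title, content in reader:
        # Turn the literal backslash-n sequence back into a real newline.
        yield index, title.replace("\\n", "\n"), content.replace("\\n", "\n")

rows = list(parse_rows(raw))
```

All three fields are kept as strings here, matching the column types stated for the generated dataset.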

You can unzip the dataset files into the following structure and read them with MindSpore's API:

.
└── sogou_news_dir
     ├── classes.txt
     ├── readme.txt
     ├── test.csv
     └── train.csv
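
Before passing the directory as dataset_dir, the layout above can be verified with a short check. This is a sketch using only the standard library; missing_files and EXPECTED_FILES are hypothetical names, and the list of files is taken from the tree shown above rather than from any requirement stated by MindSpore.

```python
from pathlib import Path

# File names taken from the directory layout shown above.
EXPECTED_FILES = ("classes.txt", "readme.txt", "test.csv", "train.csv")

def missing_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every file in the layout is present, for example missing_files("/path/to/sogou_news_dataset_dir").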

Citation:

@misc{zhang2015characterlevel,
    title={Character-level Convolutional Networks for Text Classification},
    author={Xiang Zhang and Junbo Zhao and Yann LeCun},
    year={2015},
    eprint={1509.01626},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Pre-processing Operation

`mindspore.dataset.Dataset.apply`	Apply a function to this dataset.
`mindspore.dataset.Dataset.concat`	Concatenate the dataset objects in the input list.
`mindspore.dataset.Dataset.filter`	Filter dataset by predicate.
`mindspore.dataset.Dataset.flat_map`	Map func to each row in dataset and flatten the result.
`mindspore.dataset.Dataset.map`	Apply each operation in operations to this dataset.
`mindspore.dataset.Dataset.project`	The specified columns will be selected from the dataset and passed into the pipeline with the order specified.
`mindspore.dataset.Dataset.rename`	Rename the columns in input datasets.
`mindspore.dataset.Dataset.repeat`	Repeat this dataset count times.
`mindspore.dataset.Dataset.reset`	Reset the dataset for next epoch.
`mindspore.dataset.Dataset.save`	Save the dynamic data processed by the dataset pipeline in common dataset format.
`mindspore.dataset.Dataset.shuffle`	Shuffle the dataset by creating a cache with the size of buffer_size.
`mindspore.dataset.Dataset.skip`	Skip the first N elements of this dataset.
`mindspore.dataset.Dataset.split`	Split the dataset into smaller, non-overlapping datasets.
`mindspore.dataset.Dataset.take`	Take the first specified number of samples from the dataset.
`mindspore.dataset.Dataset.zip`	Zip this dataset with the datasets in the input tuple.

Batch

`mindspore.dataset.Dataset.batch`	Combine batch_size consecutive rows into batches, applying per_batch_map to the samples first.
`mindspore.dataset.Dataset.bucket_batch_by_length`	Bucket elements according to their lengths.
`mindspore.dataset.Dataset.padded_batch`	Combine batch_size consecutive rows into batches, applying pad_info to the samples first.

Iterator

`mindspore.dataset.Dataset.create_dict_iterator`	Create an iterator over the dataset that yields samples of type dict, where each key is a column name and each value is that column's data.
`mindspore.dataset.Dataset.create_tuple_iterator`	Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column.

Attribute

`mindspore.dataset.Dataset.get_batch_size`	Return the size of batch.
`mindspore.dataset.Dataset.get_class_indexing`	Get the mapping dictionary from category names to category indexes.
`mindspore.dataset.Dataset.get_col_names`	Return the names of the columns in dataset.
`mindspore.dataset.Dataset.get_dataset_size`	Return the number of batches in an epoch.
`mindspore.dataset.Dataset.get_repeat_count`	Get the replication times in RepeatDataset.
`mindspore.dataset.Dataset.input_indexs`	Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode.
`mindspore.dataset.Dataset.num_classes`	Get the number of classes in a dataset.
`mindspore.dataset.Dataset.output_shapes`	Get the shapes of output data.
`mindspore.dataset.Dataset.output_types`	Get the types of output data.

Apply Sampler

`mindspore.dataset.MappableDataset.add_sampler`	Add a child sampler for the current dataset.
`mindspore.dataset.MappableDataset.use_sampler`	Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.

Others

`mindspore.dataset.Dataset.recv`	The dataset communication interface receives data sent by the source Dataset using `mindspore.dataset.Dataset.send`.
`mindspore.dataset.Dataset.send`	The dataset communication interface sends data to the target Dataset, which can be received through `mindspore.dataset.Dataset.recv`.
`mindspore.dataset.Dataset.sync_update`	Release a blocking condition and trigger callback with given data.
`mindspore.dataset.Dataset.sync_wait`	Add a blocking condition to the input Dataset and a synchronize action will be applied.
`mindspore.dataset.Dataset.to_json`	Serialize a pipeline into JSON string and dump into file if filename is provided.