mindspore.dataset.SogouNewsDataset
- class mindspore.dataset.SogouNewsDataset(dataset_dir, usage=None, num_samples=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)[source]
Sogou News dataset.
The generated dataset has three columns:
[index, title, content]
, and the data type of three columns is string.- Parameters
dataset_dir (str) – Path to the root directory that contains the dataset.
usage (str, optional) – Usage of this dataset, can be
'train'
,'test'
or'all'
.'train'
will read from 450,000 train samples,'test'
will read from 60,000 test samples,'all'
will read from all 510,000 samples. Default:None
, all samples.num_samples (int, optional) – Number of samples (rows) to read. Default:
None
, read all samples.shuffle (Union[bool, Shuffle], optional) –
Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default:
Shuffle.GLOBAL
. If shuffle isFalse
, no shuffling will be performed. If shuffle isTrue
, it is equivalent to setting shuffle tomindspore.dataset.Shuffle.GLOBAL
. Set the mode of data shuffling by passing in enumeration variables:Shuffle.GLOBAL
: Shuffle both the files and samples, same as setting shuffle to True.Shuffle.FILES
: Shuffle files only.
num_shards (int, optional) – Number of shards that the dataset will be divided into. Default:
None
. When this argument is specified, num_samples reflects the max sample number of per shard. Used in data parallel training .shard_id (int, optional) – The shard ID within num_shards . Default:
None
. This argument can only be specified when num_shards is also specified.num_parallel_workers (int, optional) – Number of worker threads to read the data. Default:
None
, will use global default workers(8), it can be set bymindspore.dataset.config.set_num_parallel_workers()
.cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default:
None
, which means no cache is used.
- Raises
RuntimeError – If dataset_dir does not contain data files.
RuntimeError – If num_shards is specified but shard_id is None.
RuntimeError – If shard_id is specified but num_shards is None.
ValueError – If num_parallel_workers exceeds the max thread numbers.
- Tutorial Examples:
Examples
>>> import mindspore.dataset as ds >>> sogou_news_dataset_dir = "/path/to/sogou_news_dataset_dir" >>> dataset = ds.SogouNewsDataset(dataset_dir=sogou_news_dataset_dir, usage='all')
About SogouNews Dataset:
SogouNews dataset includes 3 columns, corresponding to class index (1 to 5), title and content. The title and content are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "n".
You can unzip the dataset files into the following structure and read by MindSpore's API:
. └── sogou_news_dir ├── classes.txt ├── readme.txt ├── test.csv └── train.csv
Citation:
@misc{zhang2015characterlevel, title={Character-level Convolutional Networks for Text Classification}, author={Xiang Zhang and Junbo Zhao and Yann LeCun}, year={2015}, eprint={1509.01626}, archivePrefix={arXiv}, primaryClass={cs.LG} }
Pre-processing Operation
Apply a function in this dataset. |
|
Concatenate the dataset objects in the input list. |
|
Filter dataset by prediction. |
|
Map func to each row in dataset and flatten the result. |
|
Apply each operation in operations to this dataset. |
|
The specified columns will be selected from the dataset and passed into the pipeline with the order specified. |
|
Rename the columns in input datasets. |
|
Repeat this dataset count times. |
|
Reset the dataset for next epoch. |
|
Save the dynamic data processed by the dataset pipeline in common dataset format. |
|
Shuffle the dataset by creating a cache with the size of buffer_size . |
|
Skip the first N elements of this dataset. |
|
Split the dataset into smaller, non-overlapping datasets. |
|
Take the first specified number of samples from the dataset. |
|
Zip the datasets in the sense of input tuple of datasets. |
Batch
Combine batch_size number of consecutive rows into batch which apply per_batch_map to the samples first. |
|
Bucket elements according to their lengths. |
|
Combine batch_size number of consecutive rows into batch which apply pad_info to the samples first. |
Iterator
Create an iterator over the dataset that yields samples of type dict, while the key is the column name and the value is the data. |
|
Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column. |
Attribute
Return the size of batch. |
|
Get the mapping dictionary from category names to category indexes. |
|
Return the names of the columns in dataset. |
|
Return the number of batches in an epoch. |
|
Get the replication times in RepeatDataset. |
|
Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode. |
|
Get the number of classes in a dataset. |
|
Get the shapes of output data. |
|
Get the types of output data. |
Apply Sampler
Add a child sampler for the current dataset. |
|
Replace the last child sampler of the current dataset, remaining the parent sampler unchanged. |
Others
Release a blocking condition and trigger callback with given data. |
|
Add a blocking condition to the input Dataset and a synchronize action will be applied. |
|
Serialize a pipeline into JSON string and dump into file if filename is provided. |