mindspore.dataset.YelpReviewDataset
class mindspore.dataset.YelpReviewDataset(dataset_dir, usage=None, num_samples=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)
Yelp Review Polarity and Yelp Review Full datasets.

The generated dataset has two columns: [label, text], and the data type of both columns is string.

Parameters
- dataset_dir (str) – Path to the root directory that contains the dataset. 
- usage (str, optional) – Usage of this dataset, can be 'train', 'test' or 'all'. Default: None, reads all samples.
  - For Polarity, 'train' will read from 560,000 train samples, 'test' will read from 38,000 test samples, 'all' will read from all 598,000 samples.
  - For Full, 'train' will read from 650,000 train samples, 'test' will read from 50,000 test samples, 'all' will read from all 700,000 samples.
 
- num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads all samples.
- shuffle (Union[bool, Shuffle], optional) – Perform reshuffling of the data every epoch. Both a bool and a Shuffle enum value can be passed in. Default: Shuffle.GLOBAL. If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in one of the enumeration values (a construction sketch follows this parameter list):
  - Shuffle.GLOBAL: Shuffle both the files and the samples.
  - Shuffle.FILES: Shuffle the files only.
 
- num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard. Used in data parallel training.
- shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.
- num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, uses the global default number of workers (8); it can be set by mindspore.dataset.config.set_num_parallel_workers().
- cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache. Default: None, which means no cache is used.
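A minimal construction sketch tying these parameters together (the dataset path is a placeholder): it shuffles files and samples globally, reads one of two shards, and uses four worker threads.

```python
import mindspore.dataset as ds

# Hypothetical path; point it at your unzipped Yelp Review directory.
yelp_review_dataset_dir = "/path/to/yelp_review_dataset_dir"

dataset = ds.YelpReviewDataset(
    dataset_dir=yelp_review_dataset_dir,
    usage="train",
    shuffle=ds.Shuffle.GLOBAL,   # shuffle both files and samples
    num_shards=2,                # divide the data into two shards ...
    shard_id=0,                  # ... and read shard 0 in this process
    num_parallel_workers=4,
)
```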
 
Raises
- RuntimeError – If dataset_dir does not contain data files. 
- RuntimeError – If num_shards is specified but shard_id is None. 
- RuntimeError – If shard_id is specified but num_shards is None. 
- ValueError – If num_parallel_workers exceeds the max thread numbers. 
 
Examples

>>> import mindspore.dataset as ds
>>> yelp_review_dataset_dir = "/path/to/yelp_review_dataset_dir"
>>> dataset = ds.YelpReviewDataset(dataset_dir=yelp_review_dataset_dir, usage='all')

About YelpReview Dataset:

The Yelp Review Full dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data and is mainly used for text classification.

The Yelp Review Polarity dataset is constructed from the above dataset by considering stars 1 and 2 negative, and 3 and 4 positive.

The directory structures of the two datasets are the same. You can unzip the dataset files into the following structure and read them with MindSpore's API:

.
└── yelp_review_dir
    ├── train.csv
    ├── test.csv
    └── readme.txt

Citation (for both Yelp Review Polarity and Yelp Review Full):

@article{zhangCharacterlevelConvolutionalNetworks2015,
  archivePrefix = {arXiv},
  eprinttype = {arxiv},
  eprint = {1509.01626},
  primaryClass = {cs},
  title = {Character-Level {{Convolutional Networks}} for {{Text Classification}}},
  abstract = {This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.},
  journal = {arXiv:1509.01626 [cs]},
  author = {Zhang, Xiang and Zhao, Junbo and LeCun, Yann},
  month = sep,
  year = {2015},
}
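As a quick sanity check (the path is a placeholder), rows can be inspected with a dict iterator; each row carries the two string columns described above.

```python
import mindspore.dataset as ds

yelp_review_dataset_dir = "/path/to/yelp_review_dataset_dir"  # placeholder path
dataset = ds.YelpReviewDataset(dataset_dir=yelp_review_dataset_dir, usage="test", num_samples=3)

# Each row is a dict keyed by column name; output_numpy=True yields NumPy string scalars.
for row in dataset.create_dict_iterator(output_numpy=True):
    print(row["label"], row["text"])
```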
Pre-processing Operation
| API | Description |
| --- | --- |
| apply | Apply a function in this dataset. |
| concat | Concatenate the dataset objects in the input list. |
| filter | Filter dataset by predicate. |
| flat_map | Map func to each row in dataset and flatten the result. |
| map | Apply each operation in operations to this dataset. |
| project | The specified columns will be selected from the dataset and passed into the pipeline with the order specified. |
| rename | Rename the columns in input datasets. |
| repeat | Repeat this dataset count times. |
| reset | Reset the dataset for the next epoch. |
| save | Save the dynamic data processed by the dataset pipeline in a common dataset format. |
| shuffle | Shuffle the dataset by creating a cache with the size of buffer_size. |
| skip | Skip the first N elements of this dataset. |
| split | Split the dataset into smaller, non-overlapping datasets. |
| take | Take the first specified number of samples from the dataset. |
| zip | Zip the datasets in the sense of input tuple of datasets. |
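For illustration, a small pipeline sketch (the path is a placeholder) that chains a few of these operations on the text column, assuming the whitespace tokenizer from mindspore.dataset.text:

```python
import mindspore.dataset as ds
import mindspore.dataset.text as text

dataset = ds.YelpReviewDataset("/path/to/yelp_review_dataset_dir", usage="train")

# Tokenize the "text" column, then skip a few rows and keep a small slice for inspection.
dataset = dataset.map(operations=text.WhitespaceTokenizer(), input_columns=["text"])
dataset = dataset.skip(10).take(5)
```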
Batch
| API | Description |
| --- | --- |
| batch | Combine batch_size consecutive rows into batches, applying per_batch_map to the samples first. |
| bucket_batch_by_length | Bucket elements according to their lengths. |
| padded_batch | Combine batch_size consecutive rows into batches, applying pad_info to the samples first. |
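A brief batching sketch (the path is a placeholder): group 32 consecutive rows per batch and drop the last incomplete batch.

```python
import mindspore.dataset as ds

dataset = ds.YelpReviewDataset("/path/to/yelp_review_dataset_dir", usage="train")
dataset = dataset.batch(batch_size=32, drop_remainder=True)  # each element now holds 32 samples
```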
Iterator
| API | Description |
| --- | --- |
| create_dict_iterator | Create an iterator over the dataset that yields samples of type dict, where the key is the column name and the value is the data. |
| create_tuple_iterator | Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column. |
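For example (the path is a placeholder), a tuple iterator yields the columns in dataset order, here assumed to follow the [label, text] layout stated above.

```python
import mindspore.dataset as ds

dataset = ds.YelpReviewDataset("/path/to/yelp_review_dataset_dir", usage="test", num_samples=2)
for label, review in dataset.create_tuple_iterator(output_numpy=True):
    print(label, review)
```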
Attribute
| API | Description |
| --- | --- |
| get_batch_size | Return the size of the batch. |
| get_class_indexing | Get the mapping dictionary from category names to category indexes. |
| get_col_names | Return the names of the columns in the dataset. |
| get_dataset_size | Return the number of batches in an epoch. |
| get_repeat_count | Get the replication times in RepeatDataset. |
| input_indexs | Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode. |
| num_classes | Get the number of classes in a dataset. |
| output_shapes | Get the shapes of output data. |
| output_types | Get the types of output data. |
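A quick sketch (the path is a placeholder) of querying a few of these attributes:

```python
import mindspore.dataset as ds

dataset = ds.YelpReviewDataset("/path/to/yelp_review_dataset_dir", usage="test")
print(dataset.get_col_names())     # column names, e.g. ['label', 'text']
print(dataset.get_dataset_size())  # number of rows (or batches, once batching is applied)
```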
Apply Sampler
| API | Description |
| --- | --- |
| add_sampler | Add a child sampler for the current dataset. |
| use_sampler | Replace the last child sampler of the current dataset, leaving the parent sampler unchanged. |
Others
| API | Description |
| --- | --- |
| sync_update | Release a blocking condition and trigger callback with given data. |
| sync_wait | Add a blocking condition to the input Dataset; a synchronize action will be applied. |
| to_json | Serialize a pipeline into a JSON string and dump it to a file if filename is provided. |
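For instance, a pipeline built on this dataset can be serialized to JSON (the output filename is a placeholder); the string form is also returned.

```python
import mindspore.dataset as ds

dataset = ds.YelpReviewDataset("/path/to/yelp_review_dataset_dir", usage="all").batch(8)
json_str = dataset.to_json("yelp_review_pipeline.json")  # writes the file and returns the JSON string
```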