Loading Text Dataset

Linux Ascend GPU CPU Data Preparation Beginner Intermediate Expert

Overview

The mindspore.dataset module provided by MindSpore lets users customize how data is fetched from disk and apply data processing and tokenization operators to it as it is loaded. Processing the data in a pipeline keeps a continuous flow of data going to the training network, which improves overall performance.

In addition, MindSpore supports data loading in distributed scenarios. Users can specify the number of shards when loading a dataset. For more details, see Loading the Dataset in Data Parallel Mode.

This tutorial briefly demonstrates how to load and process text data using MindSpore.
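
To give a rough picture of what sharding means for the data-parallel loading mentioned above, the following plain-Python sketch splits a list of rows across shards in round-robin order. The helper `shard_lines` is hypothetical and written only for illustration; the parameter names mirror the `num_shards` and `shard_id` arguments of MindSpore dataset constructors, but the way MindSpore actually assigns rows to shards may differ.

```python
def shard_lines(lines, num_shards, shard_id):
    """Return the rows one shard would see under a round-robin split.

    Illustrative assumption only: not MindSpore's actual assignment policy.
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    # Take every num_shards-th row, starting at this shard's offset.
    return lines[shard_id::num_shards]


lines = ["Welcome to Beijing", "北京欢迎您!", "我喜欢English!"]
shard0 = shard_lines(lines, num_shards=2, shard_id=0)
shard1 = shard_lines(lines, num_shards=2, shard_id=1)

print(shard0)  # ['Welcome to Beijing', '我喜欢English!']
print(shard1)  # ['北京欢迎您!']
```

Each row lands in exactly one shard, so each worker in a data-parallel job would process a disjoint part of the dataset.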

Preparations

  1. Prepare the following text data.

Welcome to Beijing

北京欢迎您!

我喜欢English!
  2. Create the tokenizer.txt file, copy the text data into it, and save it in the ./datasets directory. The directory structure is as follows.

[5]:
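    import os

    # Create the dataset directory if it does not already exist.
    os.makedirs('./datasets', exist_ok=True)
    # Write the three sample sentences, one per line, as UTF-8.
    with open('./datasets/tokenizer.txt', mode='w', encoding='utf-8') as file_handle:
        file_handle.write('Welcome to Beijing \n北京欢迎您! \n我喜欢English! \n')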
[6]:
    ! tree ./datasets
./datasets
├── MNIST_Data
│   ├── test
│   │   ├── t10k-images-idx3-ubyte
│   │   └── t10k-labels-idx1-ubyte
│   └── train
│       ├── train-images-idx3-ubyte
│       └── train-labels-idx1-ubyte
└── tokenizer.txt

3 directories, 5 files
  3. Import the mindspore.dataset and mindspore.dataset.text modules.

[7]:
    import mindspore.dataset as ds
    import mindspore.dataset.text as text
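
As an optional sanity check on the file-creation step above, the sample text can be written and read back with plain Python. This sketch uses a temporary directory (an assumption made here so that it leaves no files behind) rather than ./datasets, and it shows that each stored line ends with a trailing space, exactly as written.

```python
import os
import tempfile

sample = 'Welcome to Beijing \n北京欢迎您! \n我喜欢English! \n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tokenizer.txt')
    # Write and read with an explicit encoding so the Chinese text
    # round-trips regardless of the platform default.
    with open(path, mode='w', encoding='utf-8') as f:
        f.write(sample)
    with open(path, encoding='utf-8') as f:
        lines = [line.rstrip('\n') for line in f]

print(len(lines))  # 3
for line in lines:
    print(repr(line))  # repr makes the trailing space visible
```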

Loading Dataset

MindSpore supports loading common text-processing datasets that come in a variety of on-disk formats. Users can also implement a custom dataset class to load their own data. For details on loading the various datasets, see the Loading Dataset chapter in the Programming Guide.

The following tutorial demonstrates loading datasets using the TextFileDataset in the mindspore.dataset module.

  1. Set the path to the dataset file as follows and create a dataset object.

[8]:
    DATA_FILE = "./datasets/tokenizer.txt"
    dataset = ds.TextFileDataset(DATA_FILE, shuffle=False)
  2. Create an iterator, then obtain the data through the iterator.

[9]:
    for data in dataset.create_dict_iterator(output_numpy=True):
        print(text.to_str(data['text']))
Welcome to Beijing
北京欢迎您!
我喜欢English!
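
The custom dataset class mentioned at the start of this section can be pictured as an object that supports indexing and `len()`. The following is a minimal sketch: the class name `LineDataset` is hypothetical, and the file is written to a temporary directory so the example is self-contained. A class of this shape could in principle be passed as the source of `ds.GeneratorDataset` with `column_names=["text"]`, but that step is not run here and string handling may vary between MindSpore versions.

```python
import os
import tempfile


class LineDataset:
    """Hypothetical random-access dataset: one text line per sample."""

    def __init__(self, path):
        with open(path, encoding='utf-8') as f:
            # Drop line terminators and skip blank lines.
            self._lines = [line.rstrip('\n') for line in f if line.strip()]

    def __getitem__(self, index):
        # Return a one-element tuple: one entry per dataset column.
        return (self._lines[index],)

    def __len__(self):
        return len(self._lines)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tokenizer.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write('Welcome to Beijing \n北京欢迎您! \n我喜欢English! \n')
    # The file is read eagerly, so the object stays usable after cleanup.
    custom = LineDataset(path)

print(len(custom))  # 3
print(custom[0])
```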

Processing Data

For the data processing operators currently supported by MindSpore and details on how to use them, see the Processing Data chapter in the Programming Guide.

The following tutorial demonstrates how to construct a pipeline and perform operations such as shuffle and RegexReplace on the text dataset.
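
For intuition about the `buffer_size` argument of the shuffle operation, the sketch below implements a generic buffer-based shuffle in plain Python. The function `buffered_shuffle` is hypothetical and only approximates the idea of holding `buffer_size` rows and emitting a random one each time a new row arrives; it is not MindSpore's implementation, and its random order will not match the order produced by `ds.config.set_seed`.

```python
import random


def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield rows in a randomized order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(row)
            continue
        # Emit a random buffered row and put the new row in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = row
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


rows = ["Welcome to Beijing", "北京欢迎您!", "我喜欢English!"]
shuffled = list(buffered_shuffle(rows, buffer_size=3, seed=58))
print(shuffled)
```

When the buffer is at least as large as the dataset, as with three rows and a buffer of three, every row is buffered before any is emitted, so the result is a full shuffle. A smaller buffer only mixes rows that are close together in the input.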

  1. Shuffle the dataset.

[10]:
    ds.config.set_seed(58)
    dataset = dataset.shuffle(buffer_size=3)

    for data in dataset.create_dict_iterator(output_numpy=True):
        print(text.to_str(data['text']))
我喜欢English!
Welcome to Beijing
北京欢迎您!
  2. Perform RegexReplace on the dataset.

[11]:
    replace_op1 = text.RegexReplace("Beijing", "Shanghai")
    replace_op2 = text.RegexReplace("北京", "上海")
    dataset = dataset.map(operations=replace_op1)
    dataset = dataset.map(operations=replace_op2)

    for data in dataset.create_dict_iterator(output_numpy=True):
        print(text.to_str(data['text']))
我喜欢English!
Welcome to Shanghai
上海欢迎您!
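
The effect of the two chained replacement operations can be reproduced on plain strings with Python's standard `re` module, which is handy for trying out a pattern before putting it into a pipeline. This is only an analogue: the helper `apply_replacements` is hypothetical, and Python's regular expression syntax is not guaranteed to match the regex engine used by `text.RegexReplace` for more complex patterns.

```python
import re

rows = ["我喜欢English!", "Welcome to Beijing", "北京欢迎您!"]
# (pattern, replacement) pairs, applied in order like chained map() calls.
rules = [("Beijing", "Shanghai"), ("北京", "上海")]


def apply_replacements(row, rules):
    """Apply each regex replacement to the row in sequence."""
    for pattern, replacement in rules:
        row = re.sub(pattern, replacement, row)
    return row


replaced = [apply_replacements(row, rules) for row in rows]
for row in replaced:
    print(row)
```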

Tokenization

For the data tokenization operators currently supported by MindSpore and their detailed usage methods, please refer to the Tokenizer chapter in the Programming Guide.

The following tutorial demonstrates how to use the WhitespaceTokenizer to split text into tokens on whitespace.

  1. Create a tokenizer.

[12]:
    tokenizer = text.WhitespaceTokenizer()
  2. Apply the tokenizer.

[13]:
    dataset = dataset.map(operations=tokenizer)
  3. Create an iterator and obtain data through the iterator.

[14]:
    for data in dataset.create_dict_iterator(output_numpy=True):
        print(text.to_str(data['text']).tolist())
['我喜欢English!']
['Welcome', 'to', 'Shanghai']
['上海欢迎您!']
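
The output above shows that the Chinese sentences stay as single tokens, because they contain no whitespace to split on. The same behavior can be seen with Python's built-in `str.split()`, used here as a rough stand-in for whitespace tokenization (an assumption for illustration, not a statement about how WhitespaceTokenizer is implemented).

```python
# Rows as they appear after the replacement step, including the
# trailing space that was written to the file.
rows = ["我喜欢English! ", "Welcome to Shanghai ", "上海欢迎您! "]

# str.split() with no arguments splits on runs of whitespace and
# discards leading and trailing whitespace.
tokens = [row.split() for row in rows]

for row_tokens in tokens:
    print(row_tokens)
# ['我喜欢English!']
# ['Welcome', 'to', 'Shanghai']
# ['上海欢迎您!']
```

Splitting Chinese text into words needs a tokenizer that does not rely on spaces.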