mindformers.dataset.ContrastiveLanguageImagePretrainDataset

class mindformers.dataset.ContrastiveLanguageImagePretrainDataset(dataset_config: Optional[dict] = None, data_loader: Union[dict, Callable] = None, transforms: Union[dict, list] = None, text_transforms: Union[dict, list] = None, tokenizer: Union[dict, Callable] = None, sampler: Union[dict, Callable] = None, batch_size: int = 8, drop_remainder: bool = True, num_parallel_workers: int = 8, python_multiprocessing: bool = False, repeat: int = 1, seed: int = 0, prefetch_size: int = 1, numa_enable: bool = False, auto_tune: bool = False, filepath_prefix: str = './autotune', autotune_per_step: int = 10, profile: bool = False, **kwargs)[source]

CLIP (Contrastive Language Image Pre-training) dataset.

The generated dataset has two columns: [image, text] . The data type of tensor of each column depend on the dataset files format and data transform operations in use.

Parameters

dataset_config (dict, optional) – Config for dataset. When dataset_config is an empty dict or is None, all arguments below will build a non-empty dataset_config. Otherwise, they will be ignored. Default: None.
data_loader (Union[dict, Callable]) –
Config for data loader or a data loader object. When data_loader is a dict, the string "type", "dataset_dir", "stage" and "column_names" are the keys can be parsed.
- type: Required. Indicates the type of dataset. The value must be string or class type.
- dataset_dir: Required. The directory of dataset.
- stage: Optional. The dataset subset to be loaded. The value can be 'train', "test", "dev" and "all". Default when missing: 'train'.
- column_names: Optional. The column names of dataset. Must be a list or tuple of string. Default when missing: py:obj:[image, text] .
transforms (Union[dict, list]) – Configurations or objects of one or more transformers. Default: None, no image transform operation to use.
text_transforms (Union[dict, list]) – Configurations or objects of one or more transformers of text. Default: None, no text transform operation to use.
tokenizer (Union[dict, Callable]) – Tokenizer configuration or object. Default: None, no tokenizer to use.
sampler (Union[dict, Callable]) – Sampler configuration or object. Default: None, no sampler to use.
batch_size (int) – Size of each batch. Default: 8.
drop_remainder (bool) – Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True.
num_parallel_workers (int) – Specifies the number of concurrent processes or threads for map operations to accelerate processing. Default: 8.
python_multiprocessing (bool) – Enabling the Python Multi-Process Mode to Accelerate Map Operations. Default: False.
repeat (int) – Number of times this dataset is repeated. Default: 1.
seed (int) – Random seed number. Default: 0.
prefetch_size (int) – Buffer queue size of each data processing operation in the pipeline. Default: 1.
numa_enable (bool) – Indicates whether to use the NUMA binding function. Default: False.
auto_tune (bool) – Indicates whether to enable automatic optimization of data processing parameters. Default: False.
autotune_per_step (int) – Specifies the interval for adjusting the configuration step of automatic data acceleration. Default: 10.
filepath_prefix (str) – Path for saving optimized parameter configurations. Default: './autotune'.
profile (bool) – Whether to enable data collection. Default: False.

Returns

Instance of ContrastiveLanguageImagePretrainDataset.

Examples

>>> from mindspore.dataset.vision import CenterCrop, ToTensor, Normalize
>>> from mindformers import AutoTokenizer
>>> from mindformers.dataset import Flickr8kDataLoader, ContrastiveLanguageImagePretrainDataset
>>> from mindformers.dataset import Resize, RandomChoiceTokenizerForward
>>> tokenizer = AutoTokenizer.from_pretrained("clip_vit_b_32")
>>> # Note:
>>> #     `"/dir/to/dataset"` should be replaced with the real directory of dataset.
>>> #     The detailed data setting could refer to
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/clip.md
>>> data_loader = Flickr8kDataLoader(dataset_dir="/dir/to/dataset", stage="train",
...                                  column_names=["image", "text"])
>>> text_transforms = RandomChoiceTokenizerForward(max_length=77, padding="max_length", random_seed=2022,
...                                                tokenizer=tokenizer)
>>> transforms = [Resize(size=224), CenterCrop(size=224), ToTensor(),
...               Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
...                         std=[0.26862954, 0.26130258, 0.27577711],
...                         is_hwc=False)]
>>> dataset_from_param = ContrastiveLanguageImagePretrainDataset(data_loader=data_loader,
...                                                              text_transforms=text_transforms,
...                                                              transforms=transforms,
...                                                              tokenizer=tokenizer)