mindspore.dataset.DatasetCache

class mindspore.dataset.DatasetCache(session_id, size=0, spilling=False, hostname=None, port=None, num_connections=None, prefetch_size=None)

A client for interfacing with the tensor caching service.

For details, see the tutorial.

Parameters

session_id (int) – A user-assigned session ID for the current pipeline.
size (int, optional) – Size of the memory set aside for row caching. Default: 0, which means unlimited. An unlimited cache risks exhausting the machine's memory.
spilling (bool, optional) – Whether to spill to disk when memory runs out. Default: False.
hostname (str, optional) – Host name of the cache server. Default: None, which uses the default hostname '127.0.0.1'.
port (int, optional) – Port used to connect to the server. Default: None, which uses the default port 50052.
num_connections (int, optional) – Number of TCP/IP connections. Default: None, which uses the default value 12.
prefetch_size (int, optional) – Size of the cache queue between operations. Default: None, which uses the default value 20.
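
The fallback rule for arguments left as None can be sketched in plain Python. This is only an illustration of the documented defaults, not the library's implementation; `DOCUMENTED_DEFAULTS` and `resolve_cache_kwargs` are hypothetical names introduced here.

```python
# Fallback values for arguments left as None, copied from the parameter list.
DOCUMENTED_DEFAULTS = {
    "hostname": "127.0.0.1",
    "port": 50052,
    "num_connections": 12,
    "prefetch_size": 20,
}


def resolve_cache_kwargs(session_id, size=0, spilling=False, **overrides):
    """Hypothetical helper: return the effective settings for a DatasetCache call."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    kwargs = {"session_id": session_id, "size": size, "spilling": spilling}
    for name, default in DOCUMENTED_DEFAULTS.items():
        value = overrides.get(name)
        # None means "use the documented default".
        kwargs[name] = default if value is None else value
    return kwargs


# Override only the port; every other optional argument falls back.
settings = resolve_cache_kwargs(session_id=1, port=50053)
```

With MindSpore installed and a cache server running, the resulting dictionary could be passed on as `ds.DatasetCache(**settings)`.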

Examples

>>> import subprocess
>>> import mindspore.dataset as ds
>>>
>>> # Create a cache instance with command line `dataset-cache --start`
>>> # Create a session with `dataset-cache -g`
>>> # After creating cache with a valid session, get session id with command `dataset-cache --list_sessions`
>>> command = "dataset-cache --list_sessions | tail -1 | awk -F ' ' '{print $1;}'"
>>> session_id = subprocess.getoutput(command).split('\n')[-1]
>>> some_cache = ds.DatasetCache(session_id=int(session_id), size=0)
>>>
>>> dataset_dir = "/path/to/image_folder_dataset_directory"
>>> dataset = ds.ImageFolderDataset(dataset_dir, cache=some_cache)
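
The session ID extraction above relies on a `tail | awk` shell pipeline. As a possible alternative, the same step (first field of the last line) can be done in Python. `parse_last_session_id` is a hypothetical helper, and the sample listing is made up for illustration; the real `dataset-cache --list_sessions` output may be laid out differently.

```python
def parse_last_session_id(listing: str) -> int:
    """Return the first whitespace-separated field of the last non-empty line as an int."""
    lines = [line for line in listing.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no session listing to parse")
    # int() raises ValueError if the field is not numeric, e.g. when no session exists.
    return int(lines[-1].split()[0])


# Made-up listing text used only to exercise the helper.
sample_listing = "Session    Cache Id\n1456416665  n/a\n"
session_id = parse_last_session_id(sample_listing)
```

In a real pipeline the input would come from `subprocess.getoutput("dataset-cache --list_sessions")`, and the returned integer would be passed as `session_id` to `ds.DatasetCache`.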

get_stat()

Get the statistics from a cache. After the data pipeline has run, three statistics are available: the average cache size (avg_cache_sz), the number of rows cached in memory (num_mem_cached), and the number of rows spilled to disk (num_disk_cached).

Examples

>>> import subprocess
>>> import mindspore.dataset as ds
>>>
>>> # In the example above, we created a cache with a valid session id
>>> command = "dataset-cache --list_sessions | tail -1 | awk -F ' ' '{print $1;}'"
>>> session_id = subprocess.getoutput(command).split('\n')[-1]
>>> some_cache = ds.DatasetCache(session_id=int(session_id), size=0)
>>>
>>> # Run the dataset pipeline to populate the cache
>>> dataset = ds.ImageFolderDataset("/path/to/image_folder_dataset_directory", cache=some_cache)
>>> data = list(dataset)
>>>
>>> # Get the cache statistics
>>> stat = some_cache.get_stat()
>>> # Average cache size
>>> cache_sz = stat.avg_cache_sz
>>> # Number of rows cached in memory
>>> num_mem_cached = stat.num_mem_cached
>>> # Number of rows spilled to disk
>>> num_disk_cached = stat.num_disk_cached
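
A minimal sketch of how the three fields might be gathered into one summary, for example to see what share of cached rows spilled to disk. `summarize_cache_stat` is a hypothetical helper, and `fake_stat` is a stand-in object with made-up values; it assumes only that the object returned by `get_stat()` exposes the three attributes named above.

```python
from types import SimpleNamespace


def summarize_cache_stat(stat):
    """Hypothetical helper: collect the three documented fields and the spilled share."""
    total_rows = stat.num_mem_cached + stat.num_disk_cached
    return {
        "avg_cache_sz": stat.avg_cache_sz,
        "num_mem_cached": stat.num_mem_cached,
        "num_disk_cached": stat.num_disk_cached,
        # Fraction of cached rows that live on disk; 0.0 when nothing is cached.
        "disk_fraction": stat.num_disk_cached / total_rows if total_rows else 0.0,
    }


# Stand-in for the object returned by some_cache.get_stat(); values are made up.
fake_stat = SimpleNamespace(avg_cache_sz=4096, num_mem_cached=75, num_disk_cached=25)
summary = summarize_cache_stat(fake_stat)
```

A high `disk_fraction` may suggest that the `size` limit is small relative to the dataset when `spilling=True`.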