mindspore.dataset.SBDataset
- class mindspore.dataset.SBDataset(dataset_dir, task='Boundaries', usage='all', num_samples=None, num_parallel_workers=1, shuffle=None, decode=None, sampler=None, num_shards=None, shard_id=None)[source]
SB (Semantic Boundaries) Dataset.
The output columns of the generated dataset depend on the task parameter:
- If task is 'Boundaries', there are two output columns: the 'image' column has the data type uint8, and the 'label' column contains one image of the data type uint8.
- If task is 'Segmentation', there are two output columns: the 'image' column has the data type uint8, and the 'label' column contains 20 images of the data type uint8.
- Parameters
dataset_dir (str) – Path to the root directory that contains the dataset.
task (str, optional) – Acceptable tasks include 'Boundaries' or 'Segmentation'. Default: 'Boundaries'.
usage (str, optional) – Acceptable usages include 'train', 'val', 'train_noval' and 'all'. Default: 'all'.
num_samples (int, optional) – The number of images to be included in the dataset. Default: None, which reads all images.
num_parallel_workers (int, optional) – Number of worker subprocesses used to read the data. Default: 1.
shuffle (bool, optional) – Whether to shuffle the dataset. Default: None, which gives the order behavior shown in the table below.
decode (bool, optional) – Whether to decode the images after reading. Default: None, which means False.
sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, which gives the order behavior shown in the table below.
num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples is the maximum number of samples per shard. Used in data parallel training.
shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.
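To make the interaction between num_shards, shard_id and num_samples concrete, here is a minimal pure-Python sketch. It assumes a simple round-robin assignment of sample indices to shards; that assignment rule and the helper name `shard_indices` are illustrative assumptions, and MindSpore's own sampler may shuffle or pad shards differently.

```python
def shard_indices(total, num_shards, shard_id, num_samples=None):
    """Return the sample indices one shard would see (hypothetical helper).

    Assumption: indices are dealt out round-robin, so shard k receives
    k, k + num_shards, k + 2 * num_shards, ...
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in the range [0, num_shards)")
    indices = list(range(shard_id, total, num_shards))
    # num_samples acts as an upper bound on the samples taken per shard.
    if num_samples is not None:
        indices = indices[:num_samples]
    return indices


if __name__ == "__main__":
    # 11355 is the total image count stated for the dataset.
    for shard in range(2):
        print(shard, len(shard_indices(11355, 2, shard)))
    print(len(shard_indices(11355, 2, 0, num_samples=350)))
```

Under this assumption the two shards differ in size by at most one sample, and num_samples only ever reduces the per-shard count.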
- Raises
RuntimeError – If dataset_dir is not valid or does not contain data files.
RuntimeError – If sampler and shuffle are specified at the same time.
RuntimeError – If sampler and num_shards/shard_id are specified at the same time.
RuntimeError – If num_shards is specified but shard_id is None.
RuntimeError – If shard_id is specified but num_shards is None.
ValueError – If dataset_dir does not exist.
ValueError – If num_parallel_workers exceeds the maximum number of threads.
ValueError – If task is not 'Boundaries' or 'Segmentation'.
ValueError – If usage is not 'train', 'val', 'train_noval' or 'all'.
ValueError – If shard_id is not in the range [0, num_shards).
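The argument-related conditions above can be checked before a dataset is constructed. The following is a minimal sketch that mirrors the documented rules in plain Python; `check_sb_arguments` is a hypothetical helper, not part of MindSpore, and the order in which MindSpore itself performs these checks is not specified here.

```python
VALID_TASKS = ("Boundaries", "Segmentation")
VALID_USAGES = ("train", "val", "train_noval", "all")


def check_sb_arguments(task="Boundaries", usage="all", shuffle=None,
                       sampler=None, num_shards=None, shard_id=None):
    """Raise the documented exception type for each invalid combination."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    if usage not in VALID_USAGES:
        raise ValueError(f"usage must be one of {VALID_USAGES}, got {usage!r}")
    if sampler is not None and shuffle is not None:
        raise RuntimeError("sampler and shuffle cannot both be specified")
    if sampler is not None and (num_shards is not None or shard_id is not None):
        raise RuntimeError("sampler cannot be combined with num_shards/shard_id")
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in the range [0, num_shards)")


if __name__ == "__main__":
    check_sb_arguments(task="Segmentation", usage="val")   # passes silently
    check_sb_arguments(num_shards=2, shard_id=0)           # passes silently
```

The path and thread-count checks are left out because they depend on the file system and on the MindSpore runtime configuration.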
Note
The parameters num_samples, shuffle, num_shards and shard_id can be used to control the sampler used in the dataset. Their effects when combined with the parameter sampler are as follows.
| Parameter sampler | Parameter num_shards / shard_id | Parameter shuffle | Parameter num_samples | Sampler Used |
|---|---|---|---|---|
| mindspore.dataset.Sampler type | None | None | None | sampler |
| numpy.ndarray, list, tuple, int type | / | / | num_samples | SubsetSampler(indices = sampler, num_samples = num_samples) |
| iterable type | / | / | num_samples | IterSampler(sampler = sampler, num_samples = num_samples) |
| None | num_shards / shard_id | None / True | num_samples | DistributedSampler(num_shards = num_shards, shard_id = shard_id, shuffle = True, num_samples = num_samples) |
| None | num_shards / shard_id | False | num_samples | DistributedSampler(num_shards = num_shards, shard_id = shard_id, shuffle = False, num_samples = num_samples) |
| None | None | None / True | None | RandomSampler(num_samples = num_samples) |
| None | None | None / True | num_samples | RandomSampler(replacement = True, num_samples = num_samples) |
| None | None | False | num_samples | SequentialSampler(num_samples = num_samples) |
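The rows of the table in which sampler is None can be read as a small decision procedure. The sketch below encodes only those rows and returns a descriptive string rather than a real sampler object; `default_sampler` is a hypothetical name, and treating shuffle=False with num_samples=None as a SequentialSampler is an assumption, since the table does not list that combination separately.

```python
def default_sampler(num_shards=None, shard_id=None, shuffle=None,
                    num_samples=None):
    """Describe the sampler chosen when no explicit sampler is passed."""
    # Per the table, shuffle=None behaves the same as shuffle=True.
    shuffled = shuffle is None or shuffle
    if num_shards is not None:
        return (f"DistributedSampler(num_shards={num_shards}, "
                f"shard_id={shard_id}, shuffle={shuffled}, "
                f"num_samples={num_samples})")
    if shuffled:
        if num_samples is None:
            return "RandomSampler(num_samples=None)"
        return f"RandomSampler(replacement=True, num_samples={num_samples})"
    return f"SequentialSampler(num_samples={num_samples})"


if __name__ == "__main__":
    print(default_sampler())
    print(default_sampler(shuffle=False, num_samples=10))
    print(default_sampler(num_shards=2, shard_id=0))
```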
Examples
>>> import mindspore.dataset as ds
>>> sb_dataset_dir = "/path/to/sb_dataset_directory"
>>>
>>> # 1) Get all samples from Semantic Boundaries Dataset in sequence
>>> dataset = ds.SBDataset(dataset_dir=sb_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from Semantic Boundaries Dataset
>>> dataset = ds.SBDataset(dataset_dir=sb_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from Semantic Boundaries Dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.SBDataset(dataset_dir=sb_dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In Semantic Boundaries Dataset, each dictionary has keys "image" and "task"
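To see which columns a given task actually produces, one option is to read a single row and print each column's name, shape and data type. This is a sketch that assumes MindSpore is installed and that the dataset has been unpacked locally; `preview_sb_columns` is a hypothetical helper, and the import is placed inside the function so that the definition itself loads without MindSpore.

```python
def preview_sb_columns(dataset_dir, task="Boundaries"):
    """Print the column names and the shape/dtype of the first row."""
    import mindspore.dataset as ds  # lazy import: requires MindSpore

    dataset = ds.SBDataset(dataset_dir=dataset_dir, task=task, usage="train",
                           num_samples=1, shuffle=False, decode=True)
    print(dataset.get_col_names())
    # output_numpy=True yields NumPy arrays, which expose shape and dtype.
    for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
        for name, value in row.items():
            print(name, value.shape, value.dtype)
```

Calling `preview_sb_columns("/path/to/sb_dataset_directory", task="Segmentation")` would then show how the second column differs between the two tasks.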
About Semantic Boundaries Dataset:
The Semantic Boundaries Dataset consists of 11355 color images. train.txt lists 8498 image names, val.txt lists 2857, and train_noval.txt lists 5623. The cls/ directory contains the category-level Segmentation and Boundaries results, and the inst/ directory contains the instance-level Segmentation and Boundaries results.
You can unzip the dataset files into the following structure and read them with MindSpore's API:
.
└── benchmark_RELEASE
     ├── dataset
     ├── img
     │    ├── 2008_000002.jpg
     │    ├── 2008_000003.jpg
     │    ├── ...
     ├── cls
     │    ├── 2008_000002.mat
     │    ├── 2008_000003.mat
     │    ├── ...
     ├── inst
     │    ├── 2008_000002.mat
     │    ├── 2008_000003.mat
     │    ├── ...
     ├── train.txt
     └── val.txt
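As a quick sanity check of an unpacked copy, the split files can be mapped to the image and annotation paths they refer to. The sketch below uses only the standard library and assumes that each split file holds one image name per line without a file extension, and that dataset_dir is the directory containing img/, cls/, inst/ and the split files. `list_sb_samples` is a hypothetical helper, not a MindSpore function.

```python
import os


def list_sb_samples(dataset_dir, usage="train"):
    """Return (image, cls, inst) path triples for one split file."""
    split_file = os.path.join(dataset_dir, usage + ".txt")
    with open(split_file, encoding="utf-8") as f:
        names = [line.strip() for line in f if line.strip()]
    return [
        (os.path.join(dataset_dir, "img", name + ".jpg"),
         os.path.join(dataset_dir, "cls", name + ".mat"),
         os.path.join(dataset_dir, "inst", name + ".mat"))
        for name in names
    ]


def missing_files(samples):
    """List every referenced path that does not exist on disk."""
    return [path for triple in samples for path in triple
            if not os.path.exists(path)]
```

An empty result from `missing_files(list_sb_samples(dataset_dir, "val"))` suggests that the val split is complete on disk.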
@InProceedings{BharathICCV2011,
    author    = "Bharath Hariharan and Pablo Arbelaez and Lubomir Bourdev and Subhransu Maji and Jitendra Malik",
    title     = "Semantic Contours from Inverse Detectors",
    booktitle = "International Conference on Computer Vision (ICCV)",
    year      = "2011",
}
Pre-processing Operation
Apply a function to this dataset.
Concatenate the dataset objects in the input list.
Filter the dataset by a predicate.
Map func to each row in the dataset and flatten the result.
Apply each operation in operations to this dataset.
Select the specified columns from the dataset and pass them into the pipeline in the order specified.
Rename the columns in the input datasets.
Repeat this dataset count times.
Reset the dataset for the next epoch.
Save the dynamic data processed by the dataset pipeline in a common dataset format.
Shuffle the dataset by creating a cache with the size of buffer_size.
Skip the first N elements of this dataset.
Split the dataset into smaller, non-overlapping datasets.
Take the first specified number of samples from the dataset.
Zip the datasets in the input tuple of datasets.
Batch
Combine batch_size consecutive rows into a batch, applying per_batch_map to the samples first.
Bucket elements according to their lengths.
Combine batch_size consecutive rows into a batch, applying pad_info to the samples first.
Iterator
Create an iterator over the dataset that yields samples of type dict, where the key is the column name and the value is the data.
Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column.
Attribute
Return the size of batch.
Get the mapping dictionary from category names to category indexes.
Return the names of the columns in dataset.
Return the number of batches in an epoch.
Get the replication times in RepeatDataset.
Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode.
Get the number of classes in a dataset.
Get the shapes of output data.
Get the types of output data.
Apply Sampler
Add a child sampler for the current dataset.
Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.
Others
Release a blocking condition and trigger the callback with the given data.
Add a blocking condition to the input Dataset; a synchronize action will be applied.
Serialize a pipeline into a JSON string and dump it into a file if filename is provided.