mindpandas.config

MindPandas config file

mindpandas.config.get_adaptive_concurrency()

Get the flag that indicates whether adaptive concurrency is enabled.

Returns: bool, value of adaptive_concurrency flag.

Examples

>>> # Get the adaptive concurrency flag.
>>> import mindpandas as pd
>>> adaptive = pd.get_adaptive_concurrency()

mindpandas.config.get_concurrency_mode()

Get the current concurrency mode. It is one of {‘multithread’, ‘multiprocess’}.

Returns: str, current concurrency mode.

Examples

>>> # Get the current concurrency mode.
>>> import mindpandas as pd
>>> mode = pd.get_concurrency_mode()

mindpandas.config.get_min_block_size()

Get the current min block size of each partition.

Returns: int, current min_block_size of each partition in config.

Examples

>>> # Get the current min block size.
>>> import mindpandas as pd
>>> min_block_size = pd.get_min_block_size()

mindpandas.config.get_partition_shape()

Get the current partition shape.

Returns: tuple, number of expected partitions along each axis. It is a tuple of two positive integers: the first element is the row-wise number of partitions and the second element is the column-wise number of partitions.

Examples

>>> # Get the current partition shape.
>>> import mindpandas as pd
>>> shape = pd.get_partition_shape()

mindpandas.config.set_adaptive_concurrency(adaptive)

Set adaptive concurrency, which lets MindPandas select the concurrency mode automatically based on data size. Available options are True and False. When set to True, a file larger than 18 MB read with read_csv, or a DataFrame initialized from a pandas DataFrame that uses more than 1 GB of CPU memory, uses the multiprocess mode; anything smaller uses the multithread mode. When set to False, the current concurrency mode is used.

Parameters: adaptive (bool) – True to turn on adaptive concurrency, False to turn off adaptive concurrency.
Raises: ValueError – If adaptive is not True or False.

Examples

>>> # Set adaptive concurrency to True.
>>> import mindpandas as pd
>>> pd.set_adaptive_concurrency(True)
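
The selection rule described above can be sketched in plain Python. This is only an illustration: `pick_mode`, `CSV_THRESHOLD`, and `MEMORY_THRESHOLD` are hypothetical names rather than part of the mindpandas API, and treating 1 MB as 2**20 bytes is an assumption.

```python
# Illustrative only: hypothetical helper, not the mindpandas implementation.
MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes
GB = 1024 * MB

CSV_THRESHOLD = 18 * MB    # file-size threshold quoted for read_csv
MEMORY_THRESHOLD = 1 * GB  # memory threshold quoted for pandas DataFrames


def pick_mode(adaptive, current_mode, csv_bytes=None, frame_bytes=None):
    """Return the concurrency mode the documented rule would select."""
    if not isinstance(adaptive, bool):
        raise ValueError("adaptive must be True or False")
    if not adaptive:
        # Adaptive concurrency is off: keep whatever mode is configured.
        return current_mode
    if csv_bytes is not None and csv_bytes > CSV_THRESHOLD:
        return "multiprocess"
    if frame_bytes is not None and frame_bytes > MEMORY_THRESHOLD:
        return "multiprocess"
    return "multithread"


print(pick_mode(True, "multithread", csv_bytes=50 * MB))   # multiprocess
print(pick_mode(True, "multiprocess", csv_bytes=1 * MB))   # multithread
print(pick_mode(False, "multiprocess", csv_bytes=1 * MB))  # multiprocess
```

With adaptive concurrency off, the configured mode is returned unchanged regardless of input size, which mirrors the last sentence of the description.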

mindpandas.config.set_concurrency_mode(mode, **kwargs)

Set the backend concurrency mode used to parallelize computation. The default mode is multithread. Available options are {‘multithread’, ‘multiprocess’}. For instructions on using the two modes, please refer to MindPandas execution mode introduction and configuration instructions.

Parameters

mode (str) – Either ‘multithread’ for the multithread backend or ‘multiprocess’ for the distributed multiprocess backend.
**kwargs – No additional keyword arguments are needed in multithread mode. In multiprocess mode, the additional parameters include:
- address: The IP address of the master node, required.

Raises

ValueError – If mode is not ‘multithread’ or ‘multiprocess’.

Examples

>>> # Change the mode to multiprocess.
>>> import mindpandas as pd
>>> pd.set_concurrency_mode('multiprocess', address='127.0.0.1')

mindpandas.config.set_min_block_size(min_block_size)

Set the minimum block size of each partition, that is, the minimum size of a partition along each axis. Each partition is at least (min_block_size, min_block_size) in size, unless the original data is smaller than that. For example, if min_block_size is 32, the DataFrame has only 16 columns, and the partition shape is (2, 2), the columns are not split further during partitioning.

Parameters: min_block_size (int) – Minimum number of rows and minimum number of columns of a partition during partitioning.
Raises: ValueError – If min_block_size is not of type int.

Examples

>>> # Set the min block size of each partition to 8.
>>> import mindpandas as pd
>>> pd.set_min_block_size(8)
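
To make the interaction between min_block_size and the partition shape concrete, here is a minimal sketch of one way the constraint could be applied along a single axis. `effective_partitions` is a hypothetical helper, and the formula is an assumption derived from the description above, not the library's actual partitioning code.

```python
# Illustrative only: hypothetical helper, not the mindpandas implementation.
def effective_partitions(axis_length, requested, min_block_size):
    """Number of partitions along one axis so that every partition keeps
    at least min_block_size elements (or 1 partition if the data is smaller)."""
    # axis_length // min_block_size is the most partitions that still
    # leave each one with min_block_size elements; never go below 1.
    return max(1, min(requested, axis_length // min_block_size))


# 16 columns, min_block_size 32, 2 requested: columns are not split.
print(effective_partitions(16, 2, 32))    # 1
# 1000 rows comfortably allow the 2 requested partitions.
print(effective_partitions(1000, 2, 32))  # 2
# 100 rows allow at most 3 partitions of 32 or more rows.
print(effective_partitions(100, 16, 32))  # 3
```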

mindpandas.config.set_partition_shape(shape)

Set the partition shape of the data, where shape[0] is the expected number of partitions along axis 0 (row-wise) and shape[1] is the expected number of partitions along axis 1 (column-wise). For example, if the shape is (16, 16), MindPandas will try to slice the original data into 16 * 16 partitions.

Parameters: shape (tuple) – Number of expected partitions along each axis. It should be a tuple of two positive integers. The first element is the row-wise number of partitions and the second element is the column-wise number of partitions.
Raises: ValueError – If shape is not a tuple or its elements are not of type int.

Examples

>>> # Set the partition shape to (16, 16).
>>> import mindpandas as pd
>>> pd.set_partition_shape((16, 16))
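
As a rough picture of what slicing data into shape[0] * shape[1] partitions means, the sketch below splits each axis into contiguous, near-equal index ranges. `split_bounds` is a hypothetical helper, and the even-split strategy is an assumption for illustration; the source does not describe how MindPandas actually chooses partition boundaries.

```python
# Illustrative only: hypothetical helper, not the mindpandas implementation.
def split_bounds(length, parts):
    """Split range(length) into at most `parts` contiguous, near-equal
    (start, stop) ranges."""
    parts = max(1, min(parts, length))  # cannot have more parts than items
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        # The first `extra` ranges take one additional element.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


rows = split_bounds(10, 4)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
cols = split_bounds(3, 4)   # [(0, 1), (1, 2), (2, 3)]
blocks = [(r, c) for r in rows for c in cols]
print(len(blocks))  # 12
```

Here a requested shape of (4, 4) yields only 4 * 3 blocks because the data has just 3 columns, which is one reading of "try to slice" in the description.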