mindspore_rl

Components for MindSpore Reinforcement Learning Framework.

mindspore_rl.agent

Components for agent, actor, learner, trainer.

class mindspore_rl.agent.Actor[source]

Base class for all actors. Actor is a class used to interact with the environment and generate data.

Examples

>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):
...   def __init__(self):
...     super(MyActor, self).__init__()
...     self.argmax = P.Argmax()
...     self.actor_net = FullyConnectedNet(4, 10, 2)
...     self.env = GymEnvironment({'name': 'CartPole-v0'})
>>> my_actor = MyActor()
>>> print(my_actor)
MyActor<
(actor_net): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
(environment): GymEnvironment<>

act(phase, params)[source]

The act function will take an enumerate value and observation or other data which is needed during calculating the action. It will return a set of output which contains new observation, or other experience. In this function, agent will interact with environment.

Parameters

phase (enum) – A enumerate value states for init, collect, eval or other user-defined stage.
params (tuple(Tensor)) – A tuple of tensor as input, which is used to calculate action

Returns

tuple(Tensor), a tuple of tensor as output, which states for experience data.

get_action(phase, params)[source]

get_action is the method used to obtain the action. User will need to overload this function according to the algorithm. But argument of this function should be phase and params. This interface will not interact with environment

Parameters

phase (enum) – A enumerate value states for init, collect, eval or other user-defined stage.
params (tuple(Tensor)) – A tuple of tensor as input, which is used to calculate action

Returns

tuple(Tensor), a tuple of tensor as output, containing actions and other data.

class mindspore_rl.agent.Agent(actors, learner)[source]

The base class for the Agent. As the definition of agent, it is composed of actor and leanrner. It has basic act and learn functions for interaction with environment and update itself.

Parameters

actors (Actor) – The actor instance.
learner (Learner) – The learner instance.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.agent.agent import Agent
>>> actors = Actor()
>>> learner = Learner()
>>> agent = Agent(actors, learner)
>>> print(agent)
Agent<
(_actors): Actor<>
(_learner): Learner<>
>

act(phase, params)[source]

The act function will take an enumerate value and observation or other data which is needed during calculating the action. It will return a set of output which contains new observation, or other experience. In this function, agent will interact with environment.

Parameters

phase (enum) – A enumerate value states for init, collect or eval stage.
params (tuple(Tensor)) – A tuple of tensor as input, which is used to calculate action

Returns

tuple(Tensor), a tuple of tensor as output, which states for experience

get_action(phase, params)[source]

The get_action function will take an enumerate value and observation or other data which is needed during calculating the action. It will return a set of outputs containing actions and other data. In this function, agent will not interact with environment.

Parameters

phase (enum) – A enumerate value states for init, collect or eval stage.
params (tuple(Tensor)) – A tuple of tensor as input, which is used to calculate action

Returns

tuple(Tensor), a tuple of tensor as output, containing actions and other data.

learn(experience)[source]

The learn function will take a set of experience as input to calculate the loss and update the weights.

Parameters: experience (tuple(Tensor)) – A tuple of tensor states for experience
Returns: tuple(Tensor), result which outputs after updating weights

class mindspore_rl.agent.Learner[source]

The base class of the learner. Calculate and update the self generated network through the input experience.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class MyLearner(Learner):
...   def init(self):
...     super(MyLearner, self).init()
...     self.target_network = FullyConnectedNet(4, 10, 2)
>>> my_learner = MyLearner()
>>> print(my_learner)
MyLearner<
(target_network): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>

learn(experience)[source]

The interface for the learn function. The behavior of the learn function depend on the user’s implementation. Usually, it takes the samples form replay buffer or other Tensors, and calculates the loss for updating the networks.

Parameters: experience (tuple(Tensor)) – Sampling from the buffer.
Returns: tuple(Tensor), result which outputs after updating weights

class mindspore_rl.agent.Trainer(msrl)[source]

The trainer base class. It is a process class that provides the basic mode of training.

Note

Reference to dqn_trainer.py.

Parameters: msrl (MSRL) – the function handler class.

evaluate()[source]: The interface of the evaluate function for evaluate in train.

load_and_eval(ckpt_path=None)[source]

The interface of the eval function for offline. A checkpoint must be provided.

Parameters: ckpt_path (string) – The checkpoint file to restore net. Default: None.

train(episodes, callbacks=None, ckpt_path=None)[source]

The train method provides a standard training process, including the whole loop and callbacks. Users can inherit or overwrite as needed.

Parameters

episodes (int) – the number of training episodes.
callbacks (Optional[list[Callback]]) – List of callback objects. Default: None.
ckpt_path (Optional[str]) – The checkpoint file path to init or restore net. Default: None.

train_one_episode()[source]: The interface of train one episode function in train. And the output of this function must be constricted as loss, rewards, steps, [optional]others in order.

trainable_variables()[source]: The variables for saving to checkpoint.

mindspore_rl.core

Helper components used to implement RL algorithms.

class mindspore_rl.core.MSRL(alg_config, deploy_config=None)[source]

The MSRL class provides the function handlers and APIs for reinforcement learning algorithm development.

It exposes the following function handler to the user. The input and output of these function handlers are identical to the user defined functions.

agent_act
agent_get_action
sample_buffer
agent_learn
replay_buffer_sample
replay_buffer_insert
replay_buffer_reset

Parameters

alg_config (dict) – provides the algorithm configuration.
deploy_config (dict) –
provides the distribute configuration.
- Top level: defines the algorithm components.
  - key: ‘actor’, value: the actor configuration (dict).
  - key: ‘learner’, value: the learner configuration (dict).
  - key: ‘policy_and_network’, value: the policy and networks used by actor and learner (dict).
  - key: ‘collect_environment’, value: the collect environment configuration (dict).
  - key: ‘eval_environment’, value: the eval environment configuration (dict).
  - key: ‘replay_buffer’, value: the replay buffer configuration (dict).
- Second level: the configuration of each algorithm component.
  - key: ‘number’, value: the number of actor/learner (int).
  - key: ‘type’, value: the type of the actor/learner/policy_and_network/environment (class name).
  - key: ‘params’, value: the parameters of actor/learner/policy_and_network/environment (dict).
  - key: ‘policies’, value: the list of policies used by the actor/learner (list).
  - key: ‘networks’, value: the list of networks used by the actor/learner (list).
  - key: ‘pass_environment’, value: True user needs to pass the environment instance into actor, False otherwise (Bool).

static create_environments(config, env_type, deploy_config=None, need_batched=False)[source]

Create the environments object from the configuration file, and return the instance of environment and evaluate environment.

Parameters

config (dict) – algorithm configuration file.
env_type (str) – type of environment in collect_environment and eval_environment.
deploy_config (dict) – the configuration for deploy. Default: None.
need_batched (bool) – whether to create batched environment. Default: False.

Returns

env (object), created environment object.
num_env (int), the number of environment.

get_replay_buffer()[source]

It will return the instance of replay buffer.

Returns: Buffers (object), The instance of relay buffer. If the buffer is None, the return value will be None.

get_replay_buffer_elements(transpose=False, shape=None)[source]

It will return all the elements in the replay buffer.

Parameters

transpose (bool) – whether the output element needs to be transpose, if transpose is True, shape will also need to be filled. Default: False.
shape (tuple[int]) – the shape used in transpose. Default: None.

Returns

elements (List[Tensor]), A set of tensor contains all the elements in the replay buffer.

init(config)[source]

Initialization of MSRL object. The function creates all the data/objects that the algorithm requires. It also initializes all the function handler.

Parameters: config (dict) – algorithm configuration file.

class mindspore_rl.core.PriorityReplayBuffer(alpha, capacity, sample_size, shapes, types, seed0=0, seed1=0)[source]

PriorityReplayBuffer is experience container used in Deep Q-Networks. The algorithm is proposed in Prioritized Experience Replay <https://arxiv.org/abs/1511.05952>. Same as the normal replay buffer, it lets the reinforcement learning agents remember and reuse experiences from the past. Besides, it replays important transitions more frequently and improve sample effciency.

Parameters

alpha (float) – parameter to control degree of prioritization. 0 means the uniform sampling, 1 means priority sampling.
capacity (int) – the capacity of the buffer.
sample_size (int) – size for sampling from the buffer.
shapes (list[int]) – the shape of each tensor in a buffer element.
types (list[mindspore.dtype]) – the data type of each tensor in a buffer element.
seed0 (int) – Seed0 value for random generating. Default: 0.
seed1 (int) – Seed1 value for random generating. Default: 0.

Examples

>>> import mindspore as ms
>>> from mindspore import Tensor
>>> from mindspore_rl.core.priority_replay_buffer import PriorityReplayBuffer
>>> capacity = 10000
>>> batch_size = 10
>>> shapes = [(4,), (1,), (1,), (4,)]
>>> types = [ms.float32, ms.int32, ms.float32, ms.float32]
>>> replaybuffer = PriorityReplayBuffer(alpha, capacity, batch_size, shapes, types)
>>> print(replaybuffer)
PriorityReplayBuffer<>

destroy()[source]

Destroy the replay buffer.

Returns: Priority replay buffer instance handle with dtype int64 and shape $(1,)$ .

full()[source]: whether the buffer is full

insert(*transition)[source]

Push a transition to the buffer. If the buffer is full, the oldest one will be removed.

Parameters: transition (List[Tensor]) – insert a list of tensor which matches with the initialized shapes and dtypes into the buffer.
Returns: handle(Tensor), Priority replay buffer instance handle with dtype int64 and shape $(1,)$ .

reset()[source]: reset the buffer

sample(beta)[source]

Samples a batch of transitions from the replay buffer.

Parameters: beta (float) – parameter to control degree of sampling correction. 0 means the no correction, 1 means full correction.
Returns: indices (Tensor), the transition indices in the replay buffer. weights (Tensor), the weight used to correct for sampling bias. transitions (tuple(Tensor)), transitions with variable-length tensors.

update_priorities(indices, priorities)[source]

Update transition prorities.

Parameters

indices (Tensor) – transition indices. The caller needs to ensure the validity of the indices.
priorities (Tensor) – transition priorities.

Returns

tuple(Tensor), Transition with its indices and correction weights.

class mindspore_rl.core.Session(alg_config, deploy_config=None, params=None, callbacks=None)[source]

The Session is a class for running MindSpore RL algorithms.

Parameters

alg_config (dict) – the algorithm configuration or the deployment configuration of the algorithm.
deploy_config (dict) – the deployment configuration for distribution. Default: None. For more details of configuration of algorithm, please have a look at detail.
params (dict) – The algorithm specific training parameters. Default: None.
callbacks (list[Callback]) – The callback list. Default: None.

run(class_type=None, is_train=True, episode=0, duration=0)[source]

Execute the reinforcement learning algorithm.

Parameters

class_type (Trainer) – The class type of the algorithm”s trainer class. Default: None.
is_train (bool) – Run the algorithm in train mode or eval mode. Default: True.
episode (int) – The number of episode of the training. Default: 0.
duration (int) – The number of duration of the training. Default: 0.

class mindspore_rl.core.UniformReplayBuffer(sample_size, capacity, shapes, types)[source]

The replay buffer class. The replay buffer will store the experience from environment. In replay buffer, each element is a list of tensors. Therefore, the constructor of the UniformReplayBuffer class takes the shape and type of each tensor as an argument.

Parameters

sample_size (int) – size for sampling from the buffer.
capacity (int) – the capacity of the buffer.
shapes (list[int]) – the shape of each tensor in a buffer element.
types (list[mindspore.dtype]) – the data type of each tensor in a buffer element.

Examples

>>> import mindspore as ms
>>> from mindspore_rl.core.uniform_replay_buffer import UniformReplayBuffer
>>> sample_size = 10
>>> capacity = 10000
>>> shapes = [(4,), (1,), (1,), (4,)]
>>> types = [ms.float32, ms.int32, ms.float32, ms.float32]
>>> replaybuffer = UniformReplayBuffer(sample_size, capacity, shapes, types)
>>> print(replaybuffer)
UniformReplayBuffer<>

full()[source]

Check if the replaybuffer is full or not.

Returns: Full(bool), True if the replaybuffer is full, False otherwise.

get_item(index)[source]

Get an element from the replaybuffer in specific position(index).

Parameters: index (int) – the location of the item.
Returns: element (List[Tensor]), the element from the buffer.

insert(exp)[source]

Insert an element to the buffer. If the buffer is full, FIFO strategy will be used to replace the element in the buffer.

Parameters: exp (list[Tensor]) – insert a list of tensor which matches with the initialized shape and type into the buffer.
Returns: element (list[Tensor]), return the whole buffer after insertion

reset()[source]

Reset the replaybuffer. It changes the value of self.count to zero.

Returns: success (boolean), whether the reset successful or not.

sample()[source]

Sampling the replaybuffer, which means that it will randomly choose a set of element and output them.

Returns: data (Tuple(Tensor)), A set of sampled elements from the buffer.

size()[source]

Return the size of the replybuffer.

Returns: size (int), the number of element in the replaybuffer.

mindspore_rl.environment

Component used to implement custom environments.

class mindspore_rl.environment.DeepMindControlEnvironment(params, env_id=0)[source]

DeepMindControlEnvironment is a wrapper which encapsulates the DeepMind Control Suite(DMC). It stacks for physics-based simulation and Reinforcement Learning environments, using MUJOCO physics.

Parameters

params (dict) –

A dictionary contains all the parameters which are used in this class.

Configuration Parameters	Notices
env_name	the name of game in DMC
seed	seed used in Gym
camera	The camera pos used in render
action_repeat	How many times an action interacts with env
normalize_action	Whether needs to normalize the input action
img_size	The rendered img size

env_id (int, optional) – A integer which is used to set the seed of this environment, default value means the 0th environment. Default: 0 .

Examples

>>> env_params = {'env_name': 'walker_walk', 'img_size': (64, 64),
                  'action_repeat': 2, 'normalize_action': True, 'seed': 1,
                  'episode_limits': 1000, 'prefill_value': 5000}
>>> environment = DeepMindControlEnvironment(env_params, 0)
>>> print(environment)
DeepMindControlEnvironment<>

close()[source]

Close the environment to release the resource.

Returns: Success(np.bool_), Whether shutdown the process or threading successfully.

class mindspore_rl.environment.Environment[source]

The abstract base class for environment. All the environments or wrappers need to inherit this base class. Moreover, subclass needs to overridden corresponding functions and properties.

Supported Platforms:: Ascend GPU CPU

property action_space: mindspore_rl.environment.space.Space

Get the action space of the environment.

Returns: action_space (Space), The action space of environment.

property batched: bool

Whether the environment is batched.

Returns: batched (bool), Whether the environment is batched. Default is False .

close()[source]

Close the environment to release the resource.

Returns: Success (np.bool_), Whether shutdown the process or threading successfully.

property config: dict

Get the config of environment.

Returns: config (dict), A dictionary which contains environment’s info.

property done_space: mindspore_rl.environment.space.Space

Get the done space of the environment.

Returns: done_space (Space), The done space of environment.

property num_agent: int

Number of agents in the environment.

Returns

num_agent (int), Number of agent in the current environment. If the environment is: single agent, it will return 1 . Otherwise, subclass needs to override this property to return correct number of agent. Default: 1 .

property num_environment: int

Number of environment

Returns: num_env (int), Number of environment.

property observation_space: mindspore_rl.environment.space.Space

Get the state space of the environment.

Returns: observation_space (Space), The state space of environment.

recv()[source]

Receive the result of interacting with environment.

Returns

state (Union[np.ndarray, Tensor]), The environment state after performing the action.
reward (Union[np.ndarray, Tensor]), The reward after performing the action.
done (Union[np.ndarray, Tensor]), whether the simulation finishes or not.
env_id (Union[np.ndarray, Tensor]), Which environments are interacted.env
args (Union[np.ndarray, Tensor], optional), Support arbitrary outputs, but user needs to ensure the dtype. This output is optional.

render()[source]

Generate the image for current frame of environment.

Returns: img (Union[Tensor, np.ndarray]), The image of environment at current frame.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state or other initial information.

Returns

state (Union[np.ndarray, Tensor]), A numpy array or Tensor which states for the initial state of environment.
args (Union[np.ndarray, Tensor], optional), Support arbitrary outputs, but user needs to ensure the dtype. This output is optional.

property reward_space: mindspore_rl.environment.space.Space

Get the reward space of the environment.

Returns: reward_space (Space), The reward space of environment.

send(action: Union[Tensor, np.ndarray], env_id: Union[Tensor, np.ndarray])[source]

Execute the environment step asynchronously. User can obtain result by using recv.

Parameters

action (Union[Tensor, np.ndarray]) – A tensor or array that contains the action information.
env_id (Union[Tensor, np.ndarray]) – Which environment these actions will interact with.

Returns

Success (bool), True if the action is successfully executed, otherwise False.

set_seed(seed_value: Union[int, Sequence[int]])[source]

Set seed to control the randomness of environment.

Parameters: seed_value (Union[int, Sequence[int]]) – The value that is used to set.
Returns: Success (np.bool_), Whether successfully set the seed.

step(action: Union[Tensor, np.ndarray])[source]

Execute the environment step, which means that interact with environment once.

Parameters

action (Union[Tensor, np.ndarray]) – A tensor that contains the action information.

Returns

state (Union[np.ndarray, Tensor]), The environment state after performing the action.
reward (Union[np.ndarray, Tensor]), The reward after performing the action.
done (Union[np.ndarray, Tensor]), Whether the simulation finishes or not.
args (Union[np.ndarray, Tensor], optional), Support arbitrary outputs, but user needs to ensure the dtype. This output is optional.

class mindspore_rl.environment.EnvironmentProcess(proc_no, env_num, envs, actions, observations, initial_states)[source]

An independent process responsible for creating and interacting with one or more environments.

Parameters

proc_no (int) – The process number assigned by the caller.
env_num (int) – The number of input environments.
envs (list(Environment)) – A list that contains instance of environment (subclass of Environment).
actions (Queue) – The queue used to pass actions to the environment process.
observations (Queue) – The queue used to pass observations to the caller process.
initial_states (Queue) – The queue used to pass initial states to the caller process.

Examples

>>> from multiprocessing import Queue
>>> actions = Queue()
>>> observations = Queue()
>>> initial_states = Queue()
>>> proc_no = 1
>>> env_num = 2
>>> env_params = {'name': 'CartPole-v0'}
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> env_proc = EnvironmentProcess(proc_no, env_num, multi_env, actions, observations, initial_states)
>>> env_proc.start()

run()[source]: Method to be run in sub-process; can be overridden in sub-class

class mindspore_rl.environment.GymEnvironment(params, env_id=0)[source]

The GymEnvironment class is a wrapper that encapsulates Gym to provide the ability to interact with Gym environments in MindSpore Graph Mode.

Parameters

params (dict) –
A dictionary contains all the parameters which are used in this class.

Configuration Parameters

Notices

name

the name of game in Gym

seed

seed used in Gym
env_id (int, optional) – A integer which is used to set the seed of this environment, default value means the 0th environment. Default: 0 .

Supported Platforms:: Ascend GPU CPU

Examples

>>> env_params = {'name': 'CartPole-v0'}
>>> environment = GymEnvironment(env_params, 0)
>>> print(environment)
GymEnvironment<>

close()[source]

Close the environment to release the resource.

Returns: Success(np.bool_), Whether shutdown the process or threading successfully.

class mindspore_rl.environment.MsEnvironment(kwargs=None)[source]

Class encapsulates built-in environment.

Parameters

kwargs (dict) –

The dictionary of environment specific configurations. See below table for details:

Environment name	Configuration Parameters	Default value	Notices
Tag	seed	42	random seed
	environment_num	2	number of environments
	predator_num	10	number of predators
	max_timestep	100	max timestep per episode
	map_length	100	length of map
	map_width	100	width of map
	wall_hit_penalty	0.1	agent wall hit penalty
	catch_reward	10	predator catch reward
	caught_penalty	5	prey caught penalty
	step_cost	0.01	step cost

Supported Platforms:: GPU

Examples

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)

property action_space

Get the valid action space of the environment.

Returns: The action space of environment.

property config

Get environment configuration.

Returns: The configuration of environment.

property done_space

Get the valid done space of the environment.

Returns: The done space of environment.

property observation_space

Get the valid observation space of the environment.

Returns: The state space of environment.

reset()[source]

Reset the environment to initial observation and return the initial observation.

Inputs:: No inputs.

Returns: Tensor, the initial observation.

Supported Platforms:: GPU

Examples

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> print(observation.shape)
(2, 5, 21)

property reward_space

Get the valid reward space of the environment.

Returns: The reward space of environment.

step(action)[source]

Run one timestep of environment to interact with environment.

Parameters

action (Tensor) – Action provided by the all of agents.

Returns

Tuple of 3 tensors, the observation, the reward and the done.

observation (Tensor) - Observations of all agents after action.
reward (Tensor) - Amount of reward returned by the environment.
done (Tensor) - Whether the episode has ended.

Supported Platforms:: GPU

Examples

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)

class mindspore_rl.environment.MultiEnvironmentWrapper(env_instance, num_proc=1)[source]

The MultiEnvironmentWrapper is a wrapper for multi environment scenario. User implements their single environment class and set the environment number larger than 1 in configuration file, framework will automatically invoke this class to create a multi environment class.

Parameters

env_instance (list[Environment]) – A list that contains instance of environment (subclass of Environment).
num_proc (int, optional) – Number of processing uses during interacting with environment. Default: 1 .

Supported Platforms:: Ascend GPU CPU

Examples

>>> env_params = {'name': 'CartPole-v0'}
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> wrapper = MultiEnvironmentWrapper(multi_env)
>>> print(wrapper)
MultiEnvironmentWrapper<>

property action_space

Get the action space of the environment.

Returns: A tuple which states for the space of action.

close()[source]

Close the environment to release the resource.

Returns: Success(np.bool_), Whether shutdown the process or threading successfully.

property config

Get the config of environment.

Returns: A dictionary which contains environment’s info.

property done_space

Get the done space of the environment.

Returns: A tuple which states for the space of done.

property observation_space

Get the state space of the environment.

Returns: A tuple which states for the space of state.

render()[source]: Render the game. Only support on PyNative mode.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state of each environment.

Returns: A list of tensor which states for all the initial states of each environment.

property reward_space

Get the reward space of the environment.

Returns: A tuple which states for the space of reward.

step(action)[source]

Execute the environment step, which means that interact with environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

state (list(Tensor)), a list of environment state after performing the action.
reward (list(Tensor)), a list of reward after performing the action.
done (list(Tensor)), whether the simulations of each environment finishes or not.

class mindspore_rl.environment.PettingZooMPEEnvironment(params, env_id=0)[source]

The PettingZooMPEEnvironment class is a wrapper that encapsulates PettingZoo to provide the ability to interact with PettingZoo environments in MindSpore Graph Mode.

Parameters

params (dict) –
A dictionary contains all the parameters which are used in this class.

Configuration Parameters

Notices

scenario_name

the name of game

num

Number of Environment

continuous_actions

type of actions space
env_id (int, optional) – A integer which is used to set the seed of this environment, default value means the 0th environment. Default: 0 .

Supported Platforms:: Ascend GPU CPU

Examples

>>> env_params = {'name': 'simple_spread', 'num': 3, 'continuous_actions': False}
>>> environment = PettingZooMPEEnvironment(env_params)
>>> print(environment)
PettingZooMPEEnvironment<>

close()[source]

Close the environment to release the resource.

Returns: Success(np.bool_), Whether shutdown the process or threading successfully.

class mindspore_rl.environment.Space(feature_shape, dtype, low=None, high=None, batch_shape=None, mask=None)[source]

The class for environment action/observation space.

Parameters

feature_shape (Union[list(int), tuple(int), int]) – The action/observation shape before batching.
dtype (np.dtype) – The action/observation space dtype.
low (Union[int, float], optional) – The action/observation space lower boundary. Default: None .
high (Union[int, float], optional) – The action/observation space upper boundary. Default: None .
batch_shape (Union[list(int), tuple(int), int], optional) – The batch shape for vectorization. It usually be used in multi-environment and multi-agent cases. Default: None .
mask (Sequence[int], optional) – The mask for discrete action space. Default: None .

Examples

>>> action_space = Space(feature_shape=(6,), dtype=np.int32)
>>> print(action_space.ms_dtype)
Int32

property boundary

The space boundary of current Space.

Returns: Uppoer and lower boundary of current space.

property is_discrete

Is discrete space.

Returns: Whether the current space is discrete or continuous.

property ms_dtype

MindSpore data type of current Space.

Returns: The mindspore data type of current space.

property np_dtype

Numpy data type of current Space.

Returns: The numpy dtype of current space.

property num_values

available action number of current Space.

Returns: The available action of current space.

sample()[source]

Sample a valid action from the space

Returns: Tensor, a valid action.

property shape

Space shape after batching.

Returns: The shape of current space.

class mindspore_rl.environment.StarCraft2Environment(params, env_id=0)[source]

StarCraft2Environment is a wrapper of SMAC. SMAC is WhiRL’s environment for research in the field of collaborative multi-agent reinforcement learning (MARL) based on Blizzard’s StarCraft II RTS game. SMAC makes use of Blizzard’s StarCraft II Machine Learning API and DeepMind’s PySC2 to provide a convenient interface for autonomous agents to interact with StarCraft II, getting observations and performing actions. More detail please have a look at the official github of SMAC: https://github.com/oxwhirl/smac.

Parameters

params (dict) –

A dictionary contains all the parameters which are used in this class.

Configuration Parameters	Notices
sc2_args	a dict which contains key value that is used to create instance of SMAC, such as map_name. For more detail please have a look at its official github.

env_id (int, optional) – A integer which is used to set the seed of this environment, default value means the 0th environment. Default: 0 .

Supported Platforms:: Ascend GPU CPU

Examples

>>> env_params = {'sc2_args': {'map_name': '2s3z'}}
>>> environment = StarCraft2Environment(env_params, 0)
>>> print(environment)

close()[source]

Close the environment to release the resource.

Returns: Success(np.bool_), Whether shutdown the process or threading successfully.

class mindspore_rl.environment.TicTacToeEnvironment(params, env_id=0)[source]

Tic-Tac-Toe is a famous paper-and-pencil game (https://en.wikipedia.org/wiki/Tic-tac-toe). The rule is that two players draw Os or Xs in a three-by-tree grid. When three of their marks are in a Horizontal, vertical or diagonal row, that player will be the winner. The following figure is an example of Tic-Tac-Toe.

o		x
x	o
	x	o

Parameters

params (dict) – A dictionary contains all the parameters which are used in this class.
env_id (int, optional) – A integer which is used to set the seed of this environment, default value means the 0th environment. Default: 0 .

Supported Platforms:: Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import TicTacToeEnvironment
>>> env_params = {}
>>> environment = TicTacToeEnvironment(env_params, 0)
>>> print(environment)
TicTacToeEnvironment<>

property action_space

Get the action space of the environment.

Returns: The action space of environment.

calculate_rewards()[source]

Return the rewards of current state.

Returns: A tensor which states for the rewards of current state.

property config

Get the config of environment.

Returns: A dictionary which contains environment’s info.

current_player()[source]

Return the current player of current state.

Returns: A tensor which states for current player.

property done_space

Get the done space of the environment.

Returns: The done space of environment.

is_terminal()[source]

Return whether the current state is terminal.

Returns: whether the current state is terminal or not.

legal_action()[source]

Return the legal action of current state.

Returns: A tensor which states for the legal action.

load(state)[source]

Load the input state. It will update the legal action, current state and done info of the game to the input checkpoint.

Parameters

state (Tensor) – The input checkpoint state.

Returns

state (Tensor), the state of checkpoint.
reward (Tensor), the reward of checkpoint.
done (Tensor), whether the checkpoint is terminal.

max_utility()[source]

Return the max utility of Tic-Tac-Toe.

Returns: A tensor which states for max utility.

property observation_space

Get the state space of the environment.

Returns: The state space of environment.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state.

Returns: A Tensor which states for initial state.

property reward_space

Get the reward space of the environment.

Returns: The reward space of environment.

save()[source]

Return a repilca of environment. Tic-Tac-Toe do not need a replica, thus it will return the current state

Returns: A tensor which states for the current state.

step(action)[source]

Execute the environment step, which means that interact with environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

state (Tensor), the environment state after performing the action.
reward (Tensor), the reward after performing the action.
done (Tensor), whether the simulation finishes or not.

total_num_player()[source]

Return the total number of player

Returns: Tensor, the total number of player.

mindspore_rl.network

Network component used to implement polices.

class mindspore_rl.network.FullyConnectedLayers(fc_layer_params, dropout_layer_params=None, activation_fn=nn.ReLU(), weight_init='normal', bias_init='zeros')[source]

This is a fully connected layers module. User can input abitrary number of fc_layer_params, then this module can create corresponding number of fully connect layers.

Parameters

fc_layer_params (list[int]) – A list of int states for the input and output size of fully connected layer. For example, if the input list is [10, 20, 3], then the module will create two fully connected layers whose input and output size are (10, 20) and (20, 3) respectively. The length of fc_layer_params should be greater than or equal to 3.
dropout_layer_params (list[float]) – A list of float states for the dropout rate. If the input list if [0.5, 0.3], then two dropout layers will be created after each fully connected layer. The length of dropout_layer_params should be one less than fc_layer_params. dropout_layer_params is a optional value. Default: None .
activation_fn (Union[str, Cell, Primitive]) – An instance of activation function. Default: nn.ReLu() .
weight_init (Union[Tensor, str, Initializer, numbers.Number]) – The trainable weight_init parameter. The dtype is same as x. The values of str refer to the function initializer, e.g. normal , uniform . Default: 'normal' .
bias_init (Union[Tensor, str, Initializer, numbers.Number]) – The trainable bias_init parameter. The dtype is same as x. The values of str refer to the function initializer, e.g. normal , uniform . Default: 'zeros' .

Inputs:

x (Tensor) - Tensor of shape $(*, f c_l a y e r s_p a r a m s [0])$ .

Outputs:

Tensor of shape $(*, f c_l a y e r s_p a r a m s [- 1])$ .

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network.fully_connected_net import FullyConnectedLayers
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))
>>> net = FullyConnectedLayers(fc_layer_params=[4, 10, 2])
>>> output = net(input)
>>> print(output.shape)
(2, 2)

construct(x)[source]

Parameters: x (Tensor) – Tensor of shape $(*, f c_l a y e r s_p a r a m s [0])$ .
Returns: Tensor of shape $(*, f c_l a y e r s_p a r a m s [- 1])$ .

class mindspore_rl.network.FullyConnectedNet(input_size, hidden_size, output_size, compute_type=mstype.float32)[source]

A basic fully connected neural network.

Parameters

input_size (int) – numbers of input size.
hidden_size (int) – numbers of hidden layers.
output_size (int) – numbers of output size.
compute_type (mindspore.dtype) – data type used for fully connected layer. Default: mindspore.dtype.float32 .

Examples

>>> from mindspore import Tensor
>>> from mindspore_rl.network.fully_connected_net import FullyConnectedNet
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))
>>> net = FullyConnectedNet(4, 10, 2)
>>> output = net(input)
>>> print(output.shape)
(2, 2)

construct(x)[source]

Returns output of Dense layer.

Parameters: x (Tensor) – Tensor as the input of network.
Returns: The output of the Dense layer.

class mindspore_rl.network.GruNet(input_size, hidden_size, weight_init='normal', num_layers=1, has_bias=True, batch_first=False, dropout=0.0, bidirectional=False, enable_fusion=True)[source]

Stacked GRU (Gated Recurrent Unit) layers.

Apply GRU layer to the input.

For detailed information, please refer to mindspore.nn.GRU.

Parameters

input_size (int) – Number of features of input.
hidden_size (int) – Number of features of hidden layer.
weight_init (str or Initializer) – Initialize method, e.g. 'normal' , 'uniform'. Default: 'normal' .
num_layers (int) – Number of layers of stacked GRU. Default: 1 .
has_bias (bool) – Whether the cell has bias. Default: True .
batch_first (bool) – Specifies whether the first dimension of input x is batch_size. Default: False .
dropout (float) – If not 0.0 , append Dropout layer on the outputs of each GRU layer except the last layer. Default 0.0 . The range of dropout is [0.0, 1.0).
bidirectional (bool) – Specifies whether it is a bidirectional GRU, num_directions is 2 if bidirectional is True otherwise 1. Default: False .
enable_fusion (bool) – Whether need to use GRU fusion ops. Default: True .

Inputs:

x_in (Tensor) - Tensor of data type mindspore.float32 and shape $(s e q_l e n, b a t c h_s i z e, i n p u t_s i z e)$ or $(b a t c h_s i z e, s e q_l e n, i n p u t_s i z e)$ .
h_in (Tensor) - Tensor of data type mindspore.float32 and shape $(n u m_d i r e c t i o n s * n u m_l a y e r s, b a t c h_s i z e, h i d d e n_s i z e)$ . The data type of h_in must be the same as x_in.

Outputs:

Tuple, a tuple contains (x_out, h_out).

x_out (Tensor) - Tensor of shape $(s e q_l e n, b a t c h_s i z e, n u m_d i r e c t i o n s * h i d d e n_s i z e)$ or $(b a t c h_s i z e, s e q_l e n, n u m_d i r e c t i o n s * h i d d e n_s i z e)$ .
h_out (Tensor) - Tensor of shape $(n u m_d i r e c t i o n s * n u m_l a y e r s, b a t c h_s i z e, h i d d e n_s i z e)$ .

Examples

>>> net = GruNet(10, 16, 2, has_bias=True, bidirectional=False)
>>> x_in = Tensor(np.ones([3, 5, 10]).astype(np.float32))
>>> h_in = Tensor(np.ones([1, 5, 16]).astype(np.float32))
>>> x_out, h_out = net(x_in, h_in)
>>> print(x_out.shape)
(3, 5, 16)

construct(x_in, h_in)[source]

The forward calculation of gru net

Parameters

x_in (Tensor) – Tensor of data type mindspore.float32 and shape $(s e q_l e n, b a t c h_s i z e, i n p u t_s i z e)$ or $(b a t c h_s i z e, s e q_l e n, i n p u t_s i z e)$ .
h_in (Tensor) – Tensor of data type mindspore.float32 and shape $(n u m_d i r e c t i o n s * n u m_l a y e r s, b a t c h_s i z e, h i d d e n_s i z e)$ . The data type of h_in must be the same as x_in.

Returns

x_out (Tensor) - Tensor of shape $(s e q_l e n, b a t c h_s i z e, n u m_d i r e c t i o n s * h i d d e n_s i z e)$ or $(b a t c h_s i z e, s e q_l e n, n u m_d i r e c t i o n s * h i d d e n_s i z e)$ .
h_out (Tensor) - Tensor of shape $(n u m_d i r e c t i o n s * n u m_l a y e r s, b a t c h_s i z e, h i d d e n_s i z e)$ .

mindspore_rl.policy

Policies used in RL algorithms.

class mindspore_rl.policy.EpsilonGreedyPolicy(input_network, size, epsi_high, epsi_low, decay, action_space_dim, shape=(1,))[source]

Produces a sample action base on the given epsilon-greedy policy.

Parameters

input_network (Cell) – A network returns policy action.
size (int) – Shape of epsilon.
epsi_high (float) – A high epsilon for exploration betweens [0, 1].
epsi_low (float) – A low epsilon for exploration betweens [0, epsi_high].
decay (float) – A decay factor applied to epsilon.
action_space_dim (int) – Dimensions of the action space.
shape (tuple, optional) – Shape of output action in random policy, it should be the same as action get in greedy policy. Default: (1,).

Examples

>>> state_dim, hidden_dim, action_dim = (4, 10, 2)
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = EpsilonGreedyPolicy(input_net, 1, 0.1, 0.1, 100, action_dim)
>>> state = Tensor(np.ones([1, state_dim]).astype(np.float32))
>>> step =  Tensor(np.array([10,]).astype(np.float32))
>>> output = policy(state, step)
>>> print(output.shape)
(1,)

construct(state, step)[source]

The interface of the construct function.

Parameters

state (Tensor) – The input tensor for network.
step (Tensor) – The current step, effects the epsilon decay.

Returns

The output action.

class mindspore_rl.policy.GreedyPolicy(input_network)[source]

Produces a sample action base on the given greedy policy.

Parameters: input_network (Cell) – network used to generate action probs by input state.

Examples

>>> state_dim, hidden_dim, action_dim = 4, 10, 2
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = GreedyPolicy(input_net)
>>> state = Tensor(np.ones([2, 4]).astype(np.float32))
>>> output = policy(state)
>>> print(output.shape)
(2,)

construct(state)[source]

Returns the best action.

Parameters: state (Tensor) – State tensor as the input of network.
Returns: action_max, the best action.

class mindspore_rl.policy.Policy[source]

The virtual base class for the policy. This class should be overridden before calling in the model.

construct(*inputs, **kwargs)[source]

The interface of the construct function. Inherited and used by users. Args can refer to ‘epsilongreedypolicy’, ‘randompolicy’, etc.

Parameters

inputs – it’s depended on the user definition.
kwargs – it’s depended on the user definition.

Returns

User defined. Usually, it returns an action value or the probability distribution of an action.

class mindspore_rl.policy.RandomPolicy(action_space_dim, shape=(1,))[source]

Produces a random action betweens [0, acton_space_dim).

Parameters

action_space_dim (int) – dimension of the action space.
shape (tuple, optional) – shape of output action in random policy. Default: (1,).

Examples

>>> action_space_dim = 2
>>> policy = RandomPolicy(action_space_dim)
>>> output = policy()
>>> print(output.shape)
(1,)

construct()[source]

Returns a random number betweens [0, acton_space_dim).

Returns: A random integer betweens [0, acton_space_dim).

mindspore_rl.utils

class mindspore_rl.utils.DiscountedReturn(gamma, need_bprop=False, dtype=ms.float32)[source]

Calculate discounted return.

Set discounted return as $G$ , discounted factor as $γ$ , reward as $R$ , timestep as $t$ , max timestep as $N$ . Then $G_{t} = Σ_{t = 0}^{N} γ^{t} R_{t + 1}$

For the reward sequence contain multi-episode, $d o n e$ is introduced for indicating episode boundary, $l a s t_s t a t e_v a l u e$ represents value after final step of last episode.

Parameters

gamma (float) – Discounted factor between [0, 1].
need_bprop (bool) – Whether need to calculate the backpropagation of discounted returns. Default: False .
dtype (mindspore.dtype) – Data type for the output. Default: ms.float32 .

Inputs:

reward (Tensor) - The reward sequence contains multi-episode. Tensor of shape $(T i m e s t e p, B a t c h, . . .)$
done (Tensor) - The episode done flag. Tensor of shape $(T i m e s t e p, B a t c h)$ . The data type must be bool.
last_state_value (Tensor) - The value after final step of last episode. Tensor of shape $(B a t c h, . . .)$

Returns: Discounted return.

Examples

>>> net = DiscountedReturn(gamma=0.99)
>>> reward = Tensor([[1, 1, 1, 1]], dtype=ms.float32)
>>> done = Tensor([[False, False, True, False]])
>>> last_state_value = Tensor([2.], dtype=ms.float32)
>>> ret = net(reward, done, last_state_value)
>>> print(output.shape)
(2, 2)

class mindspore_rl.utils.OUNoise(stddev, damping, action_shape)[source]

Perform Ornstein-Uhlenbeck (OU) noise base on actions.

Set zero-mean normal distribution as $N (0, s t d d e v)$ , Then the next temporal value is $x_n e x t = (1 - d a m p i n g) * x - N (0, s t d d e v)$ , The action with OU Noise is $a c t i o n + = x_n e x t$ .

Parameters

stddev (float) – stddev of Ornstein-Uhlenbeck (OU) noise.
damping (float) – damping of Ornstein-Uhlenbeck (OU) noise.
action_shape (tuple) – action shape.

Inputs:

actions (Tensor) - Actions before perferming noise.

Outputs:

actions (Tensor) - Actions after perferming noise.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.utils import OUNoise
>>> action_shape = (6,)
>>> actions = Tensor(np.ones(action_shape))
>>> net = OUNoise(stddev=0.2, damping=0.15, action_shape=action_shape)
>>> actions = net(actions)
>>> print(actions.shape)
(6,)

class mindspore_rl.utils.SoftUpdate(factor, update_interval, behavior_params, target_params)[source]

Update target network parameters with moving average algorithm.

Set target network parameter as $t a r g e t_p a r a m$ , behavior network parameter as $b e h a v i o r_p a r a m$ , moving averaget factor as $f a c t o r$ . Then $t a r g e t_p a r a m = (1. - f a c t o r) * b e h a v i o r_p a r a m + f a c t o r * t a r g e t_p a r a m$ .

Parameters

factor (float) – moving average factor between [0, 1].
update_interval (int) – The target network parameters will be updated every update_interval steps.
behavior_params (list(Parameter)) – list of behavior network parameters.
target_params (list(Parameter)) – list of target network parameters.

Examples

>>> import numpy as np
>>> import mindspore.nn as nn
>>> from mindspore.common.parameter import ParameterTuple
>>> from mindspore_rl.utils import SoftUpdate
>>> class Net(nn.Cell):
>>>     def __init__(self):
>>>         super().__init__()
>>>         self.behavior_params = ParameterTuple(nn.Dense(10, 20).trainable_params())
>>>         self.target_params = ParameterTuple(nn.Dense(10, 20).trainable_params())
>>>         self.updater = SoftUpdate(0.9, 2, self.behavior_params, self.target_params)
>>>     def construct(self):
>>>         return self.updater()
>>> net = Net()
>>> for _ in range(10):
>>>     net()
>>> np.allclose(net.behavior_params[0].asnumpy(), net.target_params[0].asnumpy(), atol=1e-5)
True

Network component used to implement polices.

class mindspore_rl.utils.AlgorithmFunc[source]

This is the base class for user to customize algorithm in MCTS. User need to inherit this base class and implement all the functions with SAME input and output.

calculate_prior(new_state, legal_action)[source]

Calculate prior of the input legal actions.

Parameters

new_state (mindspore.float32) – The state of environment.
legal_action (mindspore.int32) – The legal action of environment

Returns

prior (mindspore.float32), The probability (or prior) of all the input legal actions.

simulation(new_state)[source]

Simulation phase in MCTS. It takes the state as input and return the rewards.

Parameters: new_state (mindspore.float32) – The state of environment.
Returns: rewards (mindspore.float32), The results of simulation.

class mindspore_rl.utils.BatchRead[source]

Read a list of parameters to assign the target.

Warning

This is an experiential prototype that is subject to change and/or deletion.

Supported Platforms:: GPU CPU

Examples

>>> import mindspore
>>> from mindspore import nn
>>> from mindspore.common.parameter import Parameter, ParameterTuple
>>> from mindspore_rl.utils import BatchRead
>>> class SNet(nn.Cell):
...   def __init__(self):
...     super(SNet, self).__init__()
...     self.a = Parameter(Tensor(0.5, mstype.float32), name="a")
...     self.dense = nn.Dense(in_channels=16, out_channels=1, weight_init=0)
>>> class DNet(nn.Cell):
...   def __init__(self):
...     super(DNet, self).__init__()
...     self.a = Parameter(Tensor(0.1, mstype.float32), name="a")
...     self.dense = nn.Dense(in_channels=16, out_channels=1)
>>> class Read(nn.Cell):
...   def __init__(self, dst, src):
...     super(Read, self).__init__()
...     self.read = BatchRead()
...     self.dst = ParameterTuple(dst.trainable_params())
...     self.src = ParameterTuple(src.trainable_params())
...   def construct(self):
...     success = self.read(self.dst, self.src)
...     return success
>>> dst_net = DNet()
>>> source_net = SNet()
>>> nets = nn.CellList()
>>> nets.append(dst_net)
>>> nets.append(source_net)
>>> success = Read(nets[0], nets[1])()

construct(dst, src)[source]

Read the source parameter list to assign the dst.

Parameters

dst (tuple(Parameters)) – A paramameter tuple of the dst model.
src (tuple(Parameters)) – A paramameter tuple of the source model.

Returns

True.

class mindspore_rl.utils.BatchWrite[source]

Write a list of parameters to assign the target.

Warning

This is an experiential prototype that is subject to change and/or deletion.

Supported Platforms:: GPU CPU

Examples

>>> import mindspore
>>> from mindspore import nn
>>> from mindspore.common.parameter import Parameter, ParameterTuple
>>> from mindspore_rl.utils import BatchWrite
>>> class SourceNet(nn.Cell):
...   def __init__(self):
...     super(SourceNet, self).__init__()
...     self.a = Parameter(Tensor(0.5, mstype.float32), name="a")
...     self.dense = nn.Dense(in_channels=16, out_channels=1, weight_init=0)
>>> class DstNet(nn.Cell):
...   def __init__(self):
...     super(DstNet, self).__init__()
...     self.a = Parameter(Tensor(0.1, mstype.float32), name="a")
...     self.dense = nn.Dense(in_channels=16, out_channels=1)
>>> class Write(nn.Cell):
...   def __init__(self, dst, src):
...     super(Write, self).__init__()
...     self.w = BatchWrite()
...     self.dst = ParameterTuple(dst.trainable_params())
...     self.src = ParameterTuple(src.trainable_params())
...   def construct(self):
...     success = self.w(self.dst, self.src)
...     return success
>>> dst_net = DstNet()
>>> source_net = SourceNet()
>>> nets = nn.CellList()
>>> nets.append(dst_net)
>>> nets.append(source_net)
>>> success = Write(nets[0], nets[1])()

construct(dst, src)[source]

Write the source parameter list to assign the dst.

Parameters

dst (tuple(Parameters)) – A paramameter tuple of the dst model.
src (tuple(Parameters)) – A paramameter tuple of the source model.

Returns

True.

class mindspore_rl.utils.CallbackManager(callbacks)[source]

Execute callbacks sequentially

Parameters: callbacks (list[Callback]) – a list of Callbacks.

begin(params)[source]

Call only once before training.

Parameters: params (CallbackParam) – Parameters for begin.

end(params)[source]

Call only once after training.

Parameters: params (CallbackParam) – Parameters for end.

episode_begin(params)[source]

Call before each episode start

Parameters: params (CallbackParam) – Parameters for episode begin.

episode_end(params)[source]

Call before each episode end.

Parameters: params (CallbackParam) – Parameters for episode end.

class mindspore_rl.utils.CallbackParam[source]: It contains the parameters required for the execution of the callback function.

class mindspore_rl.utils.CheckpointCallback(save_per_episode=0, directory=None, max_ckpt_nums=5)[source]

Save the checkpoint file for all the model weights. And keep the latest max_ckpt_nums checkpoint files.

Parameters

save_per_episode (int, optional) – The frequency to save checkpoint. Default: 0 (not saved).
directory (str, optional) – The directory for saving checkpoints. Default: None , saving to './' .
max_ckpt_nums (int, optional) – Numbers of how many checkpoint files to be kept. Default: 5 .

Examples

>>> from mindspore_rl.utils.callback import CheckpointCallback
>>> from mindspore_rl.core import Session
>>> from mindspore_rl.algorithm.dqn import config
>>> ckpt_cb = CheckpointCallback()
>>> cbs = [ckpt_cb]
>>> session = Session(config.algorithm_config, None, None, cbs)

episode_end(params)[source]

Save checkpoint in the end of episode.

Parameters: params (CallbackParam) – Parameters of the tarining.

class mindspore_rl.utils.EvaluateCallback(eval_rate=0)[source]

Evaluate callback.

Parameters: eval_rate (int, optional) – The frequency to eval. Default: 0 (will not evaluate).

Examples

>>> from mindspore_rl.utils.callback import EvaluateCallback
>>> from mindspore_rl.core import Session
>>> from mindspore_rl.algorithm.dqn import config
>>> eval_cb = EvaluateCallback()
>>> cbs = [eval_cb]
>>> session = Session(config.algorithm_config, None, None, cbs)

begin(params)[source]

Store the eval rate in the begin of training, run once.

Parameters: params (CallbackParam) – Parameters for episode begin.

episode_end(params)[source]

Run evaluate in the end of episode, and print the rewards.

Parameters: params (CallbackParam) – Parameters for episode end.

class mindspore_rl.utils.LossCallback(print_rate=1)[source]

Print loss in each episode end.

Parameters: print_rate (int, optional) – The frequency to print loss. Default: 1 .

Examples

>>> from mindspore_rl.utils.callback import LossCallback
>>> from mindspore_rl.core import Session
>>> from mindspore_rl.algorithm.dqn import config
>>> loss_cb = LossCallback()
>>> cbs = [loss_cb]
>>> session = Session(config.algorithm_config, None, None, cbs)

episode_end(params)[source]

Print loss in the end of episode.

Parameters: params (CallbackParam) – Parameters of the tarining.

class mindspore_rl.utils.MCTS(env, tree_type, node_type, root_player, customized_func, device, args, has_init_reward=False, max_action=- 1.0, max_iteration=1000)[source]

Monte Carlo Tree Search(MCTS) is a general search algorithm for some kinds of decision processes, most notably those employed in software that plays board games, such as Go, chess. It was originally proposed in 2006. A general MCTS has four phases:

Selection - selects the next node according to the selection policy (like UCT, RAVE, AMAF and etc.).
Expansion - unless the selection reached a terminal state, expansion adds a new child node to the last node (leaf node) that is selected in Selection phase.
Simulation - performs an algorithm (random, neural net or other algorithms) to obtain the payoff.
Backpropagation - propagates the payoff for all visited node.

As the time goes by, these four phases of MCTS is evolved. AlphaGo introduced neural network to MCTS, which makes the MCTS more powerful.

This class is a mindspore ops implementation MCTS. User can use provided MCTS algorithm, or develop their own MCTS by derived base class (MonteCarloTreeNode) in c++.

Parameters

env (Environment) – It must be the subclass of Environment.
tree_type (str) – The name of tree type.
node_type (str) – The name of node type.
root_player (float) – The root player, which should be less than the total number of player.
customized_func (AlgorithmFunc) – Some algorithm specific class. For more detail, please have a look at documentation of AlgorithmFunc.
device (str) – The device type "CPU" , "GPU" . "Ascend" is not support yet.
args (Tensor) – any values which will be the input of MctsCreation. Please following the table below to provide the input value. These value will not be reset after invoke restore_tree_data.
has_init_reward (bool, optional) – Whether pass the reward to each node during the node initialization. Default: False.
max_action (float, optional) – The max number of action in environment. If the max_action is -1.0 , the step in Environment will accept the last action. Otherwise, it will accept max_action number of action. Default: -1.0 .
max_iteration (int, optional) –
The max training iteration of MCTS. Default: 1000 .

MCTS Tree Type

MCTS Node Type

Configuration Parameter

Notices

CPUCommon

CPUVanilla

UCT const

UCT const is used to calculate UCT value in Selection phase

GPUCommon

GPUVanilla

UCT const

Examples

>>> from mindspore import Tensor
>>> import mindspore as ms
>>> from mindspore_rl.environment import TicTacToeEnvironment
>>> from mindspore_rl.utils import VanillaFunc
>>> from mindspore_rl.utils import MCTS
>>> env = TicTacToeEnvironment(None)
>>> vanilla_func = VanillaFunc(env)
>>> uct = (Tensor(uct, ms.float32),)
>>> root_player = 0.0
>>> mcts = MCTS(env, "CPUCommon", "CPUVanilla", root_player, vanilla_func, device, args=uct)
>>> action, handle = mcts.mcts_search()
>>> print(action)

destroy(handle)[source]

destroy will destroy current tree. Please call this function ONLY when do not use this tree any more.

Parameters: handle (mindspore.int64) – The unique handle of mcts tree.
Returns: success (mindspore.bool_), Whether destroy is successful.

mcts_search(*args)[source]

mcts_search is the main function of MCTS. Invoke this function will return the best action of current state.

Parameters

*args (Tensor) – The variable which updates during each iteration. They will be restored after invoking restore_tree_data. The input value needs to match provied algorithm.

Returns

action (mindspore.int32), The action which is returned by monte carlo tree search.
handle (mindspore.int64), The unique handle of mcts tree.

restore_tree_data(handle)[source]

restore_tree_data will restore all the data in the tree, back to the initial state.

Parameters: handle (mindspore.int64) – The unique handle of mcts tree.
Returns: success (mindspore.bool_), Whether restore is successful.

class mindspore_rl.utils.TensorArray(dtype, element_shape, dynamic_size=True, size=0, name='TA')[source]

TensorArray: a dynamic array to store tensors.

Warning

This is an experiential prototype that is subject to change and/or deletion.

Parameters

dtype (mindspore.dtype) – the data type in the TensorArray.
element_shape (tuple(int)) – the shape of each tensor in a TensorArray.
dynamic_size (bool, optional) – if True , the size of TensorArray can be increased, otherwise it is a fixed size. Default: True .
size (int, optional) – if dynamic_size is False, size means the max size of the TensorArray. Default: 0 .
name (str, optional) – the name of this TensorArray, any str. Default: "TA" .

Supported Platforms:: GPU CPU

Examples

>>> import mindspore
>>> from mindspore_rl.utils import TensorArray
>>> ta = TensorArray(mindspore.int64, ())
>>> ta.write(0, 1)
>>> ta.write(1, 2)
>>> ans = ta.read(1)
>>> print(ans)
2
>>> s = ta.stack()
>>> print(s)
[1 2]
>>> ta.clear()
>>> ta.write(0, 3)
>>> ans = ta.read(0)
>>> print(ans)
3
>>> ta.close()

clear()[source]

Clear the created TensorArray. Only reset the TensorArray, clear the data and reset the size in TensorArray and keep the instance of this TensorArray.

Returns: Bool, true.

close()[source]

Close the created TensorArray.

Warning

Once close the TensorArray, every functions belong to this TensorArray will be disaviliable. Every resources created in TensorArray will be removed. If this TensorArray will be used in next step or somewhere, eg: next loop, please use clear instead.

Returns: Bool, true.

read(index)[source]

Read tensor form the TensorArray by the given position index.

Parameters: index ([int, mindspore.int64]) – The given index to get the tensor.
Returns: Tensor, the value in position index.

size()[source]

The logical size of TensorArray.

Returns: Tensor, the size of TensorArray.

stack()[source]

Stack the values in TensorArray into a stacked Tensor.

Returns: Tensor, all the values will be stacked into one tensor.

write(index, value)[source]

Write value(Tensor) to TensorArray in position index.

Parameters

index ([int, mindspore.int64]) – The position to write.
value (Tensor) – The value to add into the TensorArray.

Returns

Bool, true.

class mindspore_rl.utils.TensorsQueue(dtype, shapes, size=0, name='TQ')[source]

TensorsQueue: a queue which stores tensors lists.

Warning

This is an experiential prototype that is subject to change and/or deletion.

Parameters

dtype (mindspore.dtype) – the data type in the TensorsQueue. Each tensor should have the same dtype.
shapes (tuple[int64]) – the shape of each element in TensorsQueue.
size (int, optional) – the size of the TensorsQueue. Default: 0 .
name (str, optional) – the name of this TensorsQueue. Default: "TQ" .

Raises

TypeError – If dtype is not MindSpore number type.
ValueError – If size is less than 0.
ValueError – If shapes size is less than 1.

Supported Platforms:: GPU CPU

Examples

>>> import mindspore as ms
>>> from mindspore import Tensor
>>> from mindspore_rl.utils import TensorsQueue
>>> data1 = Tensor([[0, 1], [1, 2]], dtype=ms.float32)
>>> data2 = Tensor([1], dtype=ms.float32)
>>> tq = TensorsQueue(dtype=ms.float32, shapes=((2, 2), (1,)), size=5)
>>> tq.put((data1, data2))
>>> ans = tq.pop()

clear()[source]

Clear the created TensorsQueue. Only reset the TensorsQueue, clear the data and reset the size in TensorsQueue and keep the instance of this TensorsQueue.

Returns: Bool, true.

close()[source]

Close the created TensorsQueue.

Warning

Once close the TensorsQueue, every functions belong to this TensorsQueue will be disaviliable. Every resources created in TensorsQueue will be removed. If this TensorsQueue will be used in next step or somewhere, eg: next loop, please use clear instead.

Returns: Bool, true.

get()[source]

Get one element int the front of the TensorsQueue.

Returns: tuple(Tensors), the element in TensorsQueue.

pop()[source]

Get one element int the front of the TensorsQueue, and remove it.

Returns: tuple(Tensors), the element in TensorsQueue.

put(element)[source]

Put element(tuple(Tensors)) to TensorsQueue in the end of queue.

Parameters: element (tuple(Tensor) or list[tensor]) – The input element.
Returns: Bool, true.

size()[source]

Get the used size of the TensorsQueue.

Returns: Tensor(mindspore.int64), the used size of TensorsQueue.

class mindspore_rl.utils.TimeCallback(print_rate=1, fixed_steps_in_episode=None)[source]

Time Callback to monitor time costs for each episode.

Parameters

print_rate (int, optional) – The frequency to print time. Default: 1 .
fixed_steps_in_episode (int, optional) – If the number of steps in each episode is fixed, this number is used to calculate the step time. If None , the real steps number should be provided in params. Default: None .

Examples

>>> from mindspore_rl.utils.callback import TimeCallback
>>> from mindspore_rl.core import Session
>>> from mindspore_rl.algorithm.dqn import config
>>> time_cb = TimeCallback()
>>> cbs = [time_cb]
>>> session = Session(config.algorithm_config, None, None, cbs)

episode_begin(params)[source]

Get time in the begin of episode.

Parameters: params (CallbackParam) – Parameters of the tarining.

episode_end(params)[source]

Print time in the end of episode.

Parameters: params (CallbackParam) – Parameters of the tarining.

class mindspore_rl.utils.VanillaFunc(env)[source]

This is the customized algorithm for VanillaMCTS. The prior of each legal action is uniform distribution and it plays randomly to obtain the result of simulation.

Parameters: env (Environment) – The input environment.

Examples

>>> env = TicTacToeEnvironment(None)
>>> vanilla_func = VanillaFunc(env)
>>> legal_action = env.legal_action()
>>> prior = vanilla_func.calculate_prior(legal_action, legal_action)
>>> print(prior)

calculate_prior(new_state, legal_action)[source]

The functionality of calculate_prior is to calculate prior of the input legal actions.

Parameters

new_state (mindspore.float32) – The state of environment.
legal_action (mindspore.int32) – The legal action of environment

Returns

prior (mindspore.float32), The probability (or prior) of all the input legal actions.

simulation(new_state)[source]

The functionality of simulation is to calculate reward of the input state.

Parameters: new_state (mindspore.float32) – The state of environment.
Returns: rewards (mindspore.float32), The results of simulation.

mindspore_rl.utils.update_config(config, env_yaml, algo_yaml)[source]

Update the config by the provided yamls. Eg: see mindspore_rl/algorithm/dqn/config.py, mindspore_rl/example/env_yaml/ and mindspore_rl/example/algo_yaml/ for usage.

Parameters

config (dict) – the config to be update.
env_yaml (str) – the environment yaml file.
algo_yaml (str) – the algorithm yaml file.

MCTS Tree Type	MCTS Node Type	Configuration Parameter	Notices
CPUCommon	CPUVanilla	UCT const	UCT const is used to calculate UCT value in Selection phase
GPUCommon	GPUVanilla	UCT const	UCT const is used to calculate UCT value in Selection phase