mindspore_rl

Components for MindSpore Reinforcement Learning Framework.

mindspore_rl.agent

Components for agent, actor, learner, trainer.

class mindspore_rl.agent.Actor[source]

Base class for all actors. Actor is a class used to interact with the environment and generate data.

Examples

>>> from mindspore.ops import operations as P
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):
...   def __init__(self):
...     super(MyActor, self).__init__()
...     self.argmax = P.Argmax()
...     self.actor_net = FullyConnectedNet(4, 10, 2)
...     self.env = GymEnvironment({'name': 'CartPole-v0'})
>>> my_actor = MyActor()
>>> print(my_actor)
MyActor<
(actor_net): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
(environment): GymEnvironment<>
act(phase, params)[source]

The act function takes an enumerated value and an observation (or other data needed to compute the action) as input, and returns a set of outputs containing the new observation or other experience. In this function, the agent interacts with the environment.

Parameters
  • phase (enum) – An enumerated value indicating the init, collect, eval, or other user-defined stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors containing experience data.

Return type

observation (tuple(Tensor))
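
A minimal sketch of overriding act in a subclass is shown below; the phase handling, the network, and the experience layout are illustrative assumptions for this sketch, not part of the base class.

>>> import mindspore as ms
>>> from mindspore.ops import operations as P
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class SketchActor(Actor):
...     def __init__(self):
...         super(SketchActor, self).__init__()
...         self.argmax = P.Argmax()
...         self.actor_net = FullyConnectedNet(4, 10, 2)
...         self.env = GymEnvironment({'name': 'CartPole-v0'})
...     def act(self, phase, params):
...         # params is assumed to be a tuple whose first element is the observation
...         obs = params[0]
...         action = self.argmax(self.actor_net(obs))
...         # interact with the environment and return the experience tuple
...         new_obs, reward, done = self.env.step(action)
...         return new_obs, action, reward, done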

act_init(state)[source]

The interface of the act initialization function, which takes the state as input. Users need to overload this function according to their algorithm.

Parameters

state (Tensor) – the output state from the environment.

Returns

  • done (Tensor), whether the simulation is finished or not.

  • reward (Tensor), simulation reward.

  • state (Tensor), simulation state.

evaluate(state)[source]

The interface of the evaluate function, which takes the state as input. Users need to overload this function according to their algorithm.

Parameters

state (Tensor) – the output state from the environment.

Returns

  • done (Tensor), whether the simulation is finished or not.

  • reward (Tensor), simulation reward.

  • state (Tensor), simulation state.

get_action(phase, params)[source]

get_action is the method used to obtain the action. Users need to overload this function according to their algorithm, but its arguments must remain phase and params. This interface does not interact with the environment.

Parameters
  • phase (enum) – An enumerated value indicating the init, collect, eval, or other user-defined stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors containing actions and other data.

Return type

action (tuple(Tensor))

class mindspore_rl.agent.Agent(actors, learner)[source]

The base class for the Agent. By definition, an agent is composed of an actor and a learner. It provides basic act and learn functions for interacting with the environment and updating itself.

Parameters
  • actors (Actor) – The actor instance.

  • learner (Learner) – The learner instance.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.agent.agent import Agent
>>> actors = Actor()
>>> learner = Learner()
>>> agent = Agent(actors, learner)
>>> print(agent)
Agent<
(_actors): Actor<>
(_learner): Learner<>
>
act(phase, params)[source]

The act function takes an enumerated value and an observation (or other data needed to compute the action) as input, and returns a set of outputs containing the new observation or other experience. In this function, the agent interacts with the environment.

Parameters
  • phase (enum) – An enumerated value indicating the init, collect, or eval stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors containing experience data.

Return type

observation (tuple(Tensor))

get_action(phase, params)[source]

The get_action function takes an enumerated value and an observation (or other data needed to compute the action) as input, and returns a set of outputs containing actions and other data. In this function, the agent does not interact with the environment.

Parameters
  • phase (enum) – An enumerated value indicating the init, collect, or eval stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors containing actions and other data.

Return type

action (tuple(Tensor))

learn(experience)[source]

The learn function will take a set of experience as input to calculate the loss and update the weights.

Parameters

experience (tuple(Tensor)) – A tuple of tensors containing the experience.

Returns

The results output after updating the weights.

Return type

results (tuple(Tensor))
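
As a rough illustration, act and learn are usually combined in a collect-then-update loop. The sketch below assumes agent is the instance built in the example above; the phase constant, the observation shape, and the experience layout are assumptions made only for this sketch.

>>> import numpy as np
>>> from mindspore import Tensor
>>> observation = Tensor(np.zeros([1, 4]).astype(np.float32))  # placeholder observation
>>> COLLECT = 2                                                # assumed phase constant
>>> experience = agent.act(COLLECT, (observation,))            # interact with the environment
>>> results = agent.learn(experience)                          # update the learner with the collected data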

class mindspore_rl.agent.Learner[source]

The base class of the learner. It calculates and updates its self-generated network from the input experience.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class MyLearner(Learner):
...   def __init__(self):
...     super(MyLearner, self).__init__()
...     self.target_network = FullyConnectedNet(4, 10, 2)
>>> my_learner = MyLearner()
>>> print(my_learner)
MyLearner<
(target_network): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
learn(experience)[source]

The interface for the learn function. The behavior of the learn function depends on the user’s implementation. Usually, it takes samples from the replay buffer or other Tensors and calculates the loss used to update the networks.

Parameters

experience (tuple(Tensor)) – Sampling from the buffer.

Returns

The results output after updating the weights.

Return type

results (tuple(Tensor))
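
A minimal sketch of a learn implementation follows; the (state, next_state) experience layout and the mean-squared-error objective are illustrative assumptions rather than a prescribed algorithm.

>>> import mindspore.nn as nn
>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class SketchLearner(Learner):
...     def __init__(self):
...         super(SketchLearner, self).__init__()
...         self.policy_network = FullyConnectedNet(4, 10, 2)
...         self.target_network = FullyConnectedNet(4, 10, 2)
...         self.loss_fn = nn.MSELoss()
...     def learn(self, experience):
...         # the (state, next_state) layout is an assumption made for this sketch
...         state, next_state = experience
...         loss = self.loss_fn(self.policy_network(state), self.target_network(next_state))
...         return loss
>>> my_learner = SketchLearner()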

class mindspore_rl.agent.Trainer(msrl)[source]

The trainer base class. It is a process class that provides the basic mode of training.

Note

Refer to dqn_trainer.py.

Parameters

msrl (MSRL) – the function handler class.

evaluate()[source]

The interface of the evaluation function used during training.

load_and_eval(ckpt_path=None)[source]

The interface of the evaluation function for offline use. A checkpoint must be provided.

Parameters

ckpt_path (string) – The checkpoint file to restore net.

train(episodes, callbacks=None, ckpt_path=None)[source]

The train method provides a standard training process, including the whole loop and callbacks. Users can inherit or overwrite as needed.

Parameters
  • episodes (int) – the number of training episodes.

  • callbacks (Optional[list[Callback]]) – List of callback objects. Default: None

  • ckpt_path (Optional[str]) – The checkpoint file path to init or restore net. Default: None.

train_one_episode()[source]

The interface of the function that trains a single episode. The outputs of this function must follow the order loss, rewards, steps, [optional] others.

trainable_variables()[source]

The variables for saving to checkpoint.
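
A minimal sketch of a Trainer subclass follows; the handler names match those listed under mindspore_rl.core below, but the episode logic and the stored msrl attribute are purely illustrative.

>>> from mindspore_rl.agent.trainer import Trainer
>>> class SketchTrainer(Trainer):
...     def __init__(self, msrl, params=None):
...         super(SketchTrainer, self).__init__(msrl)
...         self.msrl = msrl                 # keep a handle on the function handlers
...     def train_one_episode(self):
...         # the outputs must follow the order: loss, rewards, steps, [optional] others
...         experience = self.msrl.replay_buffer_sample()
...         loss = self.msrl.agent_learn(experience)
...         return loss, 0, 0
...     def evaluate(self):
...         return 0
...     def trainable_variables(self):
...         return {}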

mindspore_rl.core

Helper components used to implement RL algorithms.

class mindspore_rl.core.MSRL(alg_config, deploy_config=None)[source]

The MSRL class provides the function handlers and APIs for reinforcement learning algorithm development.

It exposes the following function handlers to the user. The inputs and outputs of these function handlers are identical to those of the user-defined functions.

  • agent_act

  • agent_get_action

  • sample_buffer

  • agent_learn

  • replay_buffer_sample

  • replay_buffer_insert

  • replay_buffer_reset
Parameters

alg_config (dict) –

provides the algorithm configuration. A minimal sketch of such a dictionary is shown after the list below.

  • Top level: defines the algorithm components.

    • key: ‘actor’, value: the actor configuration (dict).

    • key: ‘learner’, value: the learner configuration (dict).

    • key: ‘policy_and_network’, value: the policy and networks used by actor and learner (dict).

    • key: ‘collect_environment’, value: the collect environment configuration (dict).

    • key: ‘eval_environment’, value: the eval environment configuration (dict).

    • key: ‘replay_buffer’, value: the replay buffer configuration (dict).

  • Second level: the configuration of each algorithm component.

    • key: ‘number’, value: the number of actor/learner (int).

    • key: ‘type’, value: the type of the actor/learner/policy_and_network/environment (class name).

    • key: ‘params’, value: the parameters of actor/learner/policy_and_network/environment (dict).

    • key: ‘policies’, value: the list of policies used by the actor/learner (list).

    • key: ‘networks’, value: the list of networks used by the actor/learner (list).

    • key: ‘pass_environment’, value: True if the user needs to pass the environment instance into the actor, False otherwise (Bool).
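
A minimal sketch of such a configuration dictionary is shown below; MyActor, MyLearner, and MyPolicyAndNetwork are hypothetical user-defined classes, and the exact keys each algorithm needs may differ.

>>> from mindspore_rl.core import ReplayBuffer
>>> from mindspore_rl.environment import GymEnvironment
>>> # MyActor, MyLearner and MyPolicyAndNetwork are assumed to be defined elsewhere
>>> algorithm_config = {
...     'actor': {'number': 1, 'type': MyActor,
...               'policies': ['collect_policy'], 'networks': ['actor_net']},
...     'learner': {'number': 1, 'type': MyLearner,
...                 'params': {'gamma': 0.99}, 'networks': ['actor_net']},
...     'policy_and_network': {'type': MyPolicyAndNetwork, 'params': {'lr': 0.001}},
...     'collect_environment': {'number': 1, 'type': GymEnvironment,
...                             'params': {'name': 'CartPole-v0'}},
...     'eval_environment': {'number': 1, 'type': GymEnvironment,
...                          'params': {'name': 'CartPole-v0'}},
...     'replay_buffer': {'number': 1, 'type': ReplayBuffer},
... }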

get_replay_buffer()[source]

It will return the instance of replay buffer.

Returns

Buffers (object), the instance of the replay buffer. If the buffer is None, the return value will be None.

get_replay_buffer_elements(transpose=False, shape=None)[source]

It will return all the elements in the replay buffer.

Parameters
  • transpose (bool) – whether the output elements need to be transposed; if transpose is True, shape must also be provided. Default: False.

  • shape (Tuple[int]) – the shape used in transpose. Default: None

Returns

elements (List[Tensor]), a list of tensors containing all the elements in the replay buffer.

init(config)[source]

Initialization of the MSRL object. The function creates all the data/objects that the algorithm requires and initializes all the function handlers.

Parameters

config (dict) – algorithm configuration file.

class mindspore_rl.core.PriorityReplayBuffer(alpha, beta, capacity, sample_size, shapes, dtypes, seed0=0, seed1=0)[source]

PriorityReplayBuffer is the experience container used in Deep Q-Networks. The algorithm is proposed in Prioritized Experience Replay <https://arxiv.org/abs/1511.05952>. Like the normal replay buffer, it lets reinforcement learning agents remember and reuse experiences from the past. In addition, it replays important transitions more frequently, which improves sample efficiency.

Parameters
  • alpha (float) – parameter to control the degree of prioritization. 0 means uniform sampling, 1 means full priority sampling.

  • beta (float) – parameter to control the degree of importance-sampling correction. 0 means no correction, 1 means full correction.

  • capacity (int) – the capacity of the buffer.

  • sample_size (int) – size for sampling from the buffer.

  • shapes (List[int]) – the shape of each tensor in a buffer element.

  • dtypes (List[mindspore.dtype]) – the data type of each tensor in a buffer element.

  • seed0 (int) – Seed0 value for random generating. Default: 0.

  • seed1 (int) – Seed1 value for random generating. Default: 0.

Examples

>>> import mindspore as ms
>>> from mindspore import Tensor
>>> from mindspore_rl.core.priority_replay_buffer import PriorityReplayBuffer
>>> capacity = 10000
>>> batch_size = 10
>>> alpha, beta = 1., 1.
>>> shapes = [(4,), (1,), (1,), (4,)]
>>> dtypes = [ms.float32, ms.int32, ms.float32, ms.float32]
>>> replaybuffer = PriorityReplayBuffer(alpha, beta, capacity, batch_size, shapes, dtypes)
>>> print(replaybuffer)
PriorityReplayBuffer<>
destroy()[source]

Destroy the replay buffer.

Returns

Priority replay buffer instance handle with dtype int64 and shape (1,).

push(*transition)[source]

Push a transition to the buffer. If the buffer is full, the oldest one will be removed.

Parameters

transition (List[Tensor]) – a list of tensors matching the initialized shapes and dtypes, to be inserted into the buffer.

Returns

handle(Tensor), Priority replay buffer instance handle with dtype int64 and shape (1,).

sample()[source]

Samples a batch of transitions from the replay buffer.

Returns

  • indices (Tensor), the transition indices in the replay buffer.

  • weights (Tensor), the weights used to correct for sampling bias.

  • transitions (tuple(Tensor)), the sampled transitions (tensors of varying shapes).

update_priorities(indices, priorities)[source]

Update transition priorities.

Parameters
  • indices (Tensor) – the transition indices whose priorities are to be updated.

  • priorities (Tensor) – the new priority values for these transitions.

Returns

tuple(Tensor), Transition with its indices and correction weights.
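
Building on the constructor example above, a rough push/sample/update cycle is sketched below; the transition contents are arbitrary placeholders, and the weights tensor is reused as the new priorities purely for illustration.

>>> import numpy as np
>>> state = Tensor(np.zeros((4,), np.float32))
>>> action = Tensor(np.zeros((1,), np.int32))
>>> reward = Tensor(np.ones((1,), np.float32))
>>> next_state = Tensor(np.zeros((4,), np.float32))
>>> for _ in range(batch_size):
...     handle = replaybuffer.push(state, action, reward, next_state)
>>> indices, weights, *transitions = replaybuffer.sample()
>>> # in a real algorithm the new priorities would be TD errors;
>>> # weights are reused here only as a placeholder
>>> handle = replaybuffer.update_priorities(indices, weights)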

class mindspore_rl.core.ReplayBuffer(batch_size, capacity, shapes, types)[source]

The replay buffer class. The replay buffer will store the experience from environment. In replay buffer, each element is a list of tensors. Therefore, the constructor of the ReplayBuffer class takes the shape and type of each tensor as an argument.

Parameters
  • batch_size (int) – size for sampling from the buffer.

  • capacity (int) – the capacity of the buffer.

  • shapes (List[int]) – the shape of each tensor in a buffer element.

  • types (List[mindspore.dtype]) – the data type of each tensor in a buffer element.

Examples

>>> import mindspore as ms
>>> from mindspore_rl.core import ReplayBuffer
>>> batch_size = 10
>>> capacity = 10000
>>> shapes = [(4,), (1,), (1,), (4,)]
>>> types = [ms.float32, ms.int32, ms.float32, ms.float32]
>>> replaybuffer = ReplayBuffer(batch_size, capacity, shapes, types)
>>> print(replaybuffer)
ReplayBuffer<>
full()[source]

Check if the replaybuffer is full or not.

Returns

True if the replaybuffer is full, False otherwise.

get_item(index)[source]

Get an element from the replay buffer at the specified position (index).

Parameters

index (int) – the location of the item.

Returns

element (List[Tensor]), the element from the buffer.

insert(exp)[source]

Insert an element to the buffer. If the buffer is full, FIFO strategy will be used to replace the element in the buffer.

Parameters

exp (List[Tensor]) – a list of tensors matching the initialized shapes and types, to be inserted into the buffer.

Returns

element (List[Tensor]), the whole buffer after the insertion.

reset()[source]

Reset the replaybuffer. It changes the value of self.count to zero.

Returns

success (bool), whether the reset was successful or not.

sample()[source]

Sample from the replay buffer: randomly choose a set of elements and output them.

Returns

A set of sampled elements from the buffer.

size()[source]

Return the size of the replay buffer.

Returns

size (int), the number of elements in the replay buffer.
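
Continuing the constructor example above, a short sketch of inserting experience and sampling a batch; the tensor values are placeholders.

>>> import numpy as np
>>> from mindspore import Tensor
>>> exp = [Tensor(np.zeros((4,), np.float32)), Tensor(np.zeros((1,), np.int32)),
...        Tensor(np.ones((1,), np.float32)), Tensor(np.zeros((4,), np.float32))]
>>> for _ in range(batch_size):
...     _ = replaybuffer.insert(exp)
>>> batch = replaybuffer.sample()      # a random batch of batch_size elements
>>> is_full = replaybuffer.full()      # False here, since the capacity is 10000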

class mindspore_rl.core.Session(alg_config, deploy_config=None)[source]

The Session is a class for running MindSpore RL algorithms.

Parameters
  • alg_config (dict) – the algorithm configuration (see the MSRL configuration description above).

  • deploy_config (dict) – the deployment configuration. Default: None.
run(class_type=None, is_train=True, episode=0, duration=0, params=None, callbacks=None)[source]

Execute the reinforcement learning algorithm.

Parameters
  • class_type (Trainer) – The class type of the algorithm’s trainer class. Default: None.

  • is_train (bool) – Run the algorithm in train mode or eval mode. Default: True

  • episode (int) – The number of episode of the training. Default: 0.

  • duration (int) – The duration (number of steps) of each training episode. Default: 0.

  • params (dict) – The algorithm specific training parameters. Default: None.

  • callbacks (list[Callback]) – The callback list. Default: None.
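
A rough sketch of running an algorithm through a Session; algorithm_config is assumed to be a configuration dictionary like the one sketched for MSRL above, and MyTrainer a hypothetical user-defined Trainer subclass.

>>> from mindspore_rl.core import Session
>>> session = Session(algorithm_config)
>>> session.run(class_type=MyTrainer, is_train=True, episode=500)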

mindspore_rl.environment

Component used to implement custom environments.

class mindspore_rl.environment.Environment[source]

The virtual base class for the environment. This class should be overridden before being used in the model.

property action_space

Get the action space of the environment.

Returns

The action space of environment.

property config

Get the config of environment.

Returns

A dictionary which contains environment’s info.

property done_space

Get the done space of the environment.

Returns

The done space of environment.

property observation_space

Get the state space of the environment.

Returns

The state space of environment.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state or other initial information.

Returns

A tensor representing the initial state of the environment, or a tuple containing initial information such as the new state, action, and reward.

property reward_space

Get the reward space of the environment.

Returns

The reward space of environment.

step(action)[source]

Execute one environment step, i.e. interact with the environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

A tuple of Tensors containing the information after interacting with the environment.
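
A minimal sketch of a custom environment that overrides reset and step follows; the toy state, reward, and done handling are purely illustrative, and the space properties are omitted.

>>> import numpy as np
>>> import mindspore as ms
>>> from mindspore import Tensor
>>> from mindspore_rl.environment import Environment
>>> class SketchEnvironment(Environment):
...     def __init__(self):
...         super(SketchEnvironment, self).__init__()
...         self._state = np.zeros((4,), np.float32)
...     def reset(self):
...         self._state = np.zeros((4,), np.float32)
...         return Tensor(self._state)
...     def step(self, action):
...         # a toy transition: the reward and done flag are placeholders
...         reward = Tensor(1.0, ms.float32)
...         done = Tensor(False, ms.bool_)
...         return Tensor(self._state), reward, done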

class mindspore_rl.environment.EnvironmentProcess(proc_no, env_num, envs, actions, observations, initial_states)[source]

An independent process responsible for creating and interacting with one or more environments.

Parameters
  • proc_no (int) – The process number assigned by the caller.

  • env_num (int) – The number of input environments.

  • envs (list(Environment)) – A list that contains instances of environments (subclasses of Environment).

  • actions (Queue) – The queue used to pass actions to the environment process.

  • observations (Queue) – The queue used to pass observations to the caller process.

  • initial_states (Queue) – The queue used to pass initial states to the caller process.

Examples

>>> from multiprocessing import Queue
>>> from mindspore_rl.environment import GymEnvironment, EnvironmentProcess
>>> actions = Queue()
>>> observations = Queue()
>>> initial_states = Queue()
>>> proc_no = 1
>>> env_num = 2
>>> env_params = {'name': 'CartPole-v0'}
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> env_proc = EnvironmentProcess(proc_no, env_num, multi_env, actions, observations, initial_states)
>>> env_proc.start()
run()[source]

Method to be run in sub-process; can be overridden in sub-class

class mindspore_rl.environment.GymEnvironment(params, env_id=0)[source]

The GymEnvironment class is a wrapper that encapsulates Gym (https://gym.openai.com/) to provide the ability to interact with Gym environments in MindSpore Graph Mode.

Parameters
  • params (dict) –

    A dictionary that contains all the parameters used in this class.

    Configuration Parameters:

    • name – the name of the game in Gym.

    • seed – the seed used in Gym.

  • env_id (int) – An integer used to set the seed of this environment.

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import GymEnvironment
>>> env_params = {'name': 'CartPole-v0'}
>>> environment = GymEnvironment(env_params, 0)
>>> print(environment)
GymEnvironment<>
property action_space

Get the action space of the environment.

Returns

The action space of environment.

property config

Get the config of environment.

Returns

A dictionary which contains environment’s info.

property done_space

Get the done space of the environment.

Returns

The done space of environment.

property observation_space

Get the state space of the environment.

Returns

The state space of environment.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state.

Returns

A tensor representing the initial state of the environment.

property reward_space

Get the reward space of the environment.

Returns

The reward space of environment.

step(action)[source]

Execute one environment step, i.e. interact with the environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (Tensor), the environment state after performing the action.

  • reward (Tensor), the reward after performing the action.

  • done (Tensor), whether the simulation finishes or not.
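
Following the constructor example above, a short interaction sketch; the action is simply drawn from the environment's action space.

>>> from mindspore import Tensor
>>> state = environment.reset()
>>> action = Tensor(environment.action_space.sample())
>>> new_state, reward, done = environment.step(action)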

class mindspore_rl.environment.MsEnvironment(kwargs=None)[source]

Class that encapsulates the built-in environments.

Parameters

kwargs (dict) –

The dictionary of environment-specific configurations. See the table below for details:

Environment name: Tag

  • seed – random seed. Default: 42.

  • environment_num – number of environments. Default: 2.

  • predator_num – number of predators. Default: 10.

  • max_timestep – max timestep per episode. Default: 100.

  • map_length – length of the map. Default: 100.

  • map_width – width of the map. Default: 100.

  • wall_hit_penalty – agent wall hit penalty. Default: 0.1.

  • catch_reward – predator catch reward. Default: 10.

  • caught_penalty – prey caught penalty. Default: 5.

  • step_cost – step cost. Default: 0.01.

Supported Platforms:

GPU

Examples

>>> from mindspore import Tensor
>>> from mindspore_rl.environment import MsEnvironment
>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)
property action_space

Get the valid action space of the environment.

property config

Get environment configuration.

property done_space

Get the valid done space of the environment.

property observation_space

Get the valid observation space of the environment.

reset()[source]

Reset the environment to initial observation and return the initial observation.

Inputs:

No inputs.

Returns

Tensor, the initial observation.

Supported Platforms:

GPU

Examples

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> print(observation.shape)
(2, 5, 21)
property reward_space

Get the valid reward space of the environment.

step(action)[source]

Run one timestep of environment to interact with environment.

Parameters

action (Tensor) – Action provided by all of the agents.

Returns

Tuple of 3 tensors, the observation, the reward and the done.

  • observation (Tensor) - Observations of all agents after action.

  • reward (Tensor) - Amount of reward returned by the environment.

  • done (Tensor) - Whether the episode has ended.

Supported Platforms:

GPU

Examples

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)
class mindspore_rl.environment.MultiEnvironmentWrapper(env_instance, num_proc=None)[source]

The MultiEnvironmentWrapper is a wrapper for the multi-environment scenario. Users implement their single-environment class and set the environment number larger than 1 in the configuration file; the framework will then automatically invoke this class to create a multi-environment class.

Parameters
  • env_instance (list(Class)) – A list that contains instances of the environment (a subclass of Environment).

  • num_proc (int) – Number of processes used when interacting with the environments. Default: None.

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import GymEnvironment, MultiEnvironmentWrapper
>>> env_params = {'name': 'CartPole-v0'}
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> wrapper = MultiEnvironmentWrapper(multi_env)
>>> print(wrapper)
MultiEnvironmentWrapper<>
property action_space

Get the action space of the environment.

Returns

A tuple which states for the space of action.

property config

Get the config of environment.

Returns

A dictionary which contains environment’s info.

property done_space

Get the done space of the environment.

Returns

A tuple which states for the space of done.

property observation_space

Get the state space of the environment.

Returns

A tuple which states for the space of state.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state of each environment.

Returns

A list of tensors representing the initial state of each environment.

property reward_space

Get the reward space of the environment.

Returns

A tuple which states for the space of reward.

step(action)[source]

Execute one environment step, i.e. interact with the environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (list(Tensor)), a list of environment states after performing the action.

  • reward (list(Tensor)), a list of rewards after performing the action.

  • done (list(Tensor)), whether the simulation of each environment finishes or not.
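
Putting the wrapper methods above together, a brief interaction sketch; passing one action per wrapped environment is an assumption about the expected layout.

>>> import numpy as np
>>> from mindspore import Tensor
>>> states = wrapper.reset()                      # one initial state per environment
>>> actions = Tensor(np.array([0, 1], np.int32))  # one action for each of the two environments (assumed layout)
>>> states, rewards, dones = wrapper.step(actions)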

class mindspore_rl.environment.Space(feature_shape, dtype, low=None, high=None, batch_shape=None)[source]

The class for environment action/observation space.

Parameters
  • feature_shape (Union[list(int), tuple(int), int]) – The action/observation shape before batching.

  • dtype (np.dtype) – The action/observation space dtype.

  • low (int, float, optional) – The action/observation space lower boundary.

  • high (int, float, optional) – The action/observation space upper boundary.

  • batch_shape (Union[list(int), tuple(int), int], optional) – The batch shape for vectorization. It is usually used in multi-environment and multi-agent cases.

Examples

>>> import numpy as np
>>> from mindspore_rl.environment import Space
>>> action_space = Space(feature_shape=(6,), dtype=np.int32)
>>> print(action_space.ms_dtype)
Int32
property boundary

The space boundary of current Space.

property is_discrete

Whether the current Space is discrete.

property ms_dtype

MindSpore data type of the current Space.

property np_dtype

Numpy data type of current Space.

property num_values

The number of available actions in the current Space.

sample()[source]

Sample a valid action from the space

Returns

Tensor, a valid action.

property shape

Space shape after batching.
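
Extending the example above, a continuous (boxed) space with boundaries and a batch dimension can be built in the same way; the concrete feature shape and bounds here are arbitrary.

>>> import numpy as np
>>> from mindspore_rl.environment import Space
>>> box_space = Space(feature_shape=(3,), dtype=np.float32, low=-1.0, high=1.0, batch_shape=(2,))
>>> action = box_space.sample()        # a valid sample drawn from the space
>>> batched_shape = box_space.shape    # the space shape after batching
>>> bounds = box_space.boundary        # the space boundary (lower/upper)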

class mindspore_rl.environment.StarCraft2Environment(params, env_id=0)[source]

StarCraft2Environment is a wrapper of SMAC, WhiRL's environment for research in the field of collaborative multi-agent reinforcement learning (MARL) based on Blizzard's StarCraft II RTS game. SMAC makes use of Blizzard's StarCraft II Machine Learning API and DeepMind's PySC2 to provide a convenient interface for autonomous agents to interact with StarCraft II, getting observations and performing actions. Unlike PySC2, SMAC concentrates on decentralised micromanagement scenarios, where each unit of the game is controlled by an individual RL agent. For more detail, please have a look at the official GitHub of SMAC: https://github.com/oxwhirl/smac.

Parameters
  • params (dict) –

    A dictionary that contains all the parameters used in this class.

    Configuration Parameters:

    • sc2_args – a dict of key/value pairs used to create the SMAC instance, such as map_name. For more detail, please have a look at its official GitHub.

  • env_id (int) – An integer used to set the seed of this environment.

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import StarCraft2Environment
>>> env_params = {'sc2_args': {'map_name': '2s3z'}}
>>> environment = StarCraft2Environment(env_params, 0)
>>> print(environment)
property action_space

Get the action space of the environment.

Returns

A tuple representing the action space.

property config

Get the config of environment.

Returns

A dictionary which contains environment’s info.

property done_space

Get the done space of the environment.

Returns

The done space of environment.

get_step_info()[source]

Get the information after interacting with environment.

Returns

  • battle_won, whether this game is won or not.

  • dead_allies, how many allies are dead.

  • dead_enemies, how many enemies are dead.

property observation_space

Get the state space of the environment.

Returns

A tuple which states for the space of state.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of the initial state, the global observation, and the new available actions.

Returns

A tuple of Tensors containing the initial state, the global observation, and the available actions.

property reward_space

Get the reward space of the environment.

Returns

The reward space of environment.

step(action)[source]

Execute one environment step, i.e. interact with the environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (Tensor), the environment state after performing the action.

  • reward (Tensor), the reward after performing the action.

  • done (mindspore.bool_), whether the simulation finishes or not.

  • global_obs, the global observation of this environment.

  • avail_actions, the available actions in this state.

class mindspore_rl.environment.TicTacToeEnvironment(params, env_id=0)[source]

Tic-Tac-Toe is a famous paper-and-pencil game (en.wikipedia.org/wiki/Tic-tac-toe). The rule is that two players draw Os or Xs in a three-by-three grid. When three of their marks are in a horizontal, vertical, or diagonal row, that player is the winner. The following figure is an example of Tic-Tac-Toe.

[Figure: an example three-by-three Tic-Tac-Toe board partially filled with o and x marks.]

Parameters
  • params (dict) – A dictionary contains all the parameters which are used in this class.

  • env_id (int) – An integer used to set the seed of this environment.

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import TicTacToeEnvironment
>>> env_params = {}
>>> environment = TicTacToeEnvironment(env_params, 0)
>>> print(environment)
TicTacToeEnvironment<>
property action_space

Get the action space of the environment.

Returns

The action space of environment.

calculate_rewards()[source]

Return the rewards of current state.

Returns

A tensor which states for the rewards of current state.

property config

Get the config of environment.

Returns

A dictionary which contains environment’s info.

current_player()[source]

Return the current player of current state.

Returns

A tensor which states for current player.

property done_space

Get the done space of the environment.

Returns

The done space of environment.

is_terminal()[source]

Return whether the current state is terminal.

Returns

whether the current state is terminal or not.

legal_action()[source]

Return the legal action of current state.

Returns

A tensor which states for the legal action.

load(state)[source]

Load the input state. It will update the legal action, current state and done info of the game to the input checkpoint.

Parameters

state (Tensor) – The input checkpoint state.

Returns

  • state (Tensor), the state of checkpoint.

  • reward (Tensor), the reward of checkpoint.

  • done (Tensor), whether the checkpoint is terminal.

max_utility()[source]

Return the max utility of Tic-Tac-Toe.

Returns

A tensor which states for max utility

property observation_space

Get the state space of the environment.

Returns

The state space of environment.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state.

Returns

A Tensor which states for initial state.

property reward_space

Get the reward space of the environment.

Returns

The reward space of environment.

save()[source]

Return a replica of the environment. Tic-Tac-Toe does not need a replica, thus it will return the current state.

Returns

A tensor which states for the current state.

step(action)[source]

Execute one environment step, i.e. interact with the environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (Tensor), the environment state after performing the action.

  • reward (Tensor), the reward after performing the action.

  • done (Tensor), whether the simulation finishes or not.
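
A short game-loop sketch built from the methods above; the move encoding passed to step is an assumption made only for illustration.

>>> import mindspore as ms
>>> from mindspore import Tensor
>>> state = environment.reset()
>>> legal = environment.legal_action()           # actions still available on the board
>>> state, reward, done = environment.step(Tensor(0, ms.int32))   # place a mark in cell 0 (assumed encoding)
>>> terminal = environment.is_terminal()
>>> snapshot = environment.save()                # for Tic-Tac-Toe this is just the current state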

mindspore_rl.network

Network component used to implement policies.

class mindspore_rl.network.FullyConnectedLayers(fc_layer_params, dropout_layer_params=None, activation_fn=nn.ReLU(), weight_init='normal', bias_init='zeros')[source]

This is a fully connected layers module. Users can pass in an arbitrary number of fc_layer_params, and the module creates the corresponding number of fully connected layers.

Parameters
  • fc_layer_params (List[int]) – A list of int values specifying the input and output sizes of the fully connected layers. For example, if the input list is [10, 20, 3], the module creates two fully connected layers whose input and output sizes are (10, 20) and (20, 3) respectively. The length of fc_layer_params should be at least 3.

  • dropout_layer_params (List[float]) – A list of floats specifying the dropout rates. If the input list is [0.5, 0.3], two dropout layers are created, one after each fully connected layer. The length of dropout_layer_params should be one less than that of fc_layer_params. dropout_layer_params is an optional value. Default: None.

  • activation_fn (Union[str, Cell, Primitive]) – An instance of the activation function. Default: nn.ReLU().

  • weight_init (Union[Tensor, str, Initializer, numbers.Number]) – The trainable weight_init parameter. The dtype is same as x. The values of str refer to the function initializer. Default: ‘normal’.

  • bias_init (Union[Tensor, str, Initializer, numbers.Number]) – The trainable bias_init parameter. The dtype is same as x. The values of str refer to the function initializer. Default: ‘zeros’.

Inputs:
  • x (Tensor) - Tensor of shape (*, fc_layer_params[0]).

Outputs:

Tensor of shape (*, fc_layer_params[-1]).

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedLayers
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))
>>> net = FullyConnectedLayers(fc_layer_params=[4, 10, 2])
>>> output = net(input)
>>> print(output.shape)
(2, 2)
construct(x)[source]
Parameters

x (Tensor) – Tensor of shape (*, fc_layer_params[0]).

Returns

Tensor of shape (*, fc_layer_params[-1]).
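
To illustrate the dropout_layer_params and activation_fn arguments described above, a slightly deeper stack is sketched here; the layer sizes and dropout rates are arbitrary.

>>> import numpy as np
>>> from mindspore import Tensor, nn
>>> from mindspore_rl.network import FullyConnectedLayers
>>> net = FullyConnectedLayers(fc_layer_params=[4, 64, 64, 2],
...                            dropout_layer_params=[0.5, 0.5, 0.5],
...                            activation_fn=nn.Tanh())
>>> x = Tensor(np.ones([8, 4]).astype(np.float32))
>>> print(net(x).shape)
(8, 2)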

class mindspore_rl.network.FullyConnectedNet(input_size, hidden_size, output_size, compute_type=mstype.float32)[source]

A basic fully connected neural network.

Parameters
  • input_size (int) – the number of input features.

  • hidden_size (int) – the number of hidden units.

  • output_size (int) – the number of output features.

  • compute_type (mindspore.dtype) – the data type used for the fully connected layer. Default: mindspore.dtype.float32.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))
>>> net = FullyConnectedNet(4, 10, 2)
>>> output = net(input)
>>> print(output.shape)
(2, 2)
construct(x)[source]

Returns output of Dense layer.

Parameters

x (Tensor) – Tensor as the input of network.

Returns

The output of the Dense layer.

class mindspore_rl.network.GruNet(input_size, hidden_size, weight_init='normal', num_layers=1, has_bias=True, batch_first=False, dropout=0.0, bidirectional=False)[source]

Stacked GRU (Gated Recurrent Unit) layers.

Apply GRU layer to the input.

For detailed information, please refer to mindspore.nn.GRU.

Parameters
  • input_size (int) – Number of features of input.

  • hidden_size (int) – Number of features of hidden layer.

  • weight_init (str or initializer) – Initialization method. Default: 'normal'.

  • num_layers (int) – Number of layers of stacked GRU. Default: 1.

  • has_bias (bool) – Whether the cell has bias. Default: True.

  • batch_first (bool) – Specifies whether the first dimension of input x is batch_size. Default: False.

  • dropout (float) – If not 0.0, append Dropout layer on the outputs of each GRU layer except the last layer. Default 0.0. The range of dropout is [0.0, 1.0).

  • bidirectional (bool) – Specifies whether it is a bidirectional GRU, num_directions=2 if bidirectional=True otherwise 1. Default: False.

Inputs:
  • x_in (Tensor) - Tensor of data type mindspore.float32 and shape (seq_len, batch_size, input_size) or (batch_size, seq_len, input_size).

  • h_in (Tensor) - Tensor of data type mindspore.float32 and shape (num_directions * num_layers, batch_size, hidden_size). The data type of h_in must be the same as x_in.

Outputs:

Tuple, a tuple contains (x_out, h_out).

  • x_out (Tensor) - Tensor of shape (seq_len, batch_size, num_directions * hidden_size) or (batch_size, seq_len, num_directions * hidden_size).

  • h_out (Tensor) - Tensor of shape (num_directions * num_layers, batch_size, hidden_size).

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import GruNet
>>> net = GruNet(10, 16, 2, has_bias=True, bidirectional=False)
>>> x_in = Tensor(np.ones([3, 5, 10]).astype(np.float32))
>>> h_in = Tensor(np.ones([1, 5, 16]).astype(np.float32))
>>> x_out, h_out = net(x_in, h_in)
>>> print(x_out.shape)
(3, 5, 16)
construct(x_in, h_in)[source]

The forward calculation of the GRU net.

Parameters
  • x_in (Tensor) – Tensor of data type mindspore.float32 and shape (seq_len, batch_size, input_size) or (batch_size, seq_len, input_size).

  • h_in (Tensor) – Tensor of data type mindspore.float32 and shape (num_directions * num_layers, batch_size, hidden_size). The data type of h_in must be the same as x_in.

Returns

  • x_out (Tensor) - Tensor of shape (seq_len, batch_size, num_directions * hidden_size) or (batch_size, seq_len, num_directions * hidden_size).

  • h_out (Tensor) - Tensor of shape (num_directions * num_layers, batch_size, hidden_size).

mindspore_rl.policy

Policies used in RL algorithms.

class mindspore_rl.policy.EpsilonGreedyPolicy(input_network, size, epsi_high, epsi_low, decay, action_space_dim)[source]

Produces an epsilon-greedy sample action based on the given policy.

Parameters
  • input_network (Cell) – A network returns policy action.

  • size (int) – Shape of epsilon.

  • epsi_high (float) – The upper bound of epsilon for exploration, in [0, 1].

  • epsi_low (float) – The lower bound of epsilon for exploration, in [0, epsi_high].

  • decay (float) – A decay factor applied to epsilon.

  • action_space_dim (int) – Dimensions of the action space.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import EpsilonGreedyPolicy
>>> state_dim, hidden_dim, action_dim = (4, 10, 2)
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = EpsilonGreedyPolicy(input_net, 1, 0.1, 0.1, 100, action_dim)
>>> state = Tensor(np.ones([1, state_dim]).astype(np.float32))
>>> step =  Tensor(np.array([10,]).astype(np.float32))
>>> output = policy(state, step)
>>> print(output.shape)
(1,)
construct(state, step)[source]

The interface of the construct function.

Parameters
  • state (Tensor) – The input tensor for network.

  • step (Tensor) – The current step, which affects the epsilon decay.

Returns

The output action.

class mindspore_rl.policy.GreedyPolicy(input_network)[source]

Produces a greedy action based on the given policy.

Parameters

input_network (Cell) – the network used to generate action probabilities from the input state.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import GreedyPolicy
>>> state_dim, hidden_dim, action_dim = 4, 10, 2
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = GreedyPolicy(input_net)
>>> state = Tensor(np.ones([2, 4]).astype(np.float32))
>>> output = policy(state)
>>> print(output.shape)
(2,)
construct(state)[source]

Returns the best action.

Parameters

state (Tensor) – State tensor as the input of network.

Returns

action_max, the best action.

class mindspore_rl.policy.Policy[source]

The virtual base class for the policy. This class should be overridden before being used in the model.

construct(*inputs, **kwargs)[source]

The interface of the construct function, to be inherited and implemented by users. For the arguments, refer to EpsilonGreedyPolicy, RandomPolicy, etc.

Parameters
  • inputs – depends on the user's definition.

  • kwargs – depends on the user's definition.

Returns

User defined. Usually, it returns an action value or the probability distribution of an action.
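
A minimal sketch of a user-defined policy follows; greedy selection over a small network is just one possible construct body.

>>> from mindspore.ops import operations as P
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import Policy
>>> class SketchPolicy(Policy):
...     def __init__(self, input_network):
...         super(SketchPolicy, self).__init__()
...         self.network = input_network
...         self.argmax = P.Argmax()
...     def construct(self, state):
...         # return the index of the highest-scoring action
...         return self.argmax(self.network(state))
>>> policy = SketchPolicy(FullyConnectedNet(4, 10, 2))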

class mindspore_rl.policy.RandomPolicy(action_space_dim)[source]

Produces a random action in [0, action_space_dim).

Parameters

action_space_dim (int) – dimension of the action space.

Examples

>>> from mindspore_rl.policy import RandomPolicy
>>> action_space_dim = 2
>>> policy = RandomPolicy(action_space_dim)
>>> output = policy()
>>> print(output.shape)
(1,)
construct()[source]

Returns a random number in [0, action_space_dim).

Returns

A random integer in [0, action_space_dim).