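mindspore_rl

Components of the MindSpore reinforcement learning framework.

mindspore_rl.agent

Components for the agent, actor, learner, and trainer.

class mindspore_rl.agent.Actor[source]

Base class for all actors. An actor is a class that interacts with the environment and generates data.

Examples:

>>> from mindspore.ops import operations as P
>>> from mindspore_rl.agent.actor import Actor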
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):
...   def __init__(self):
...     super(MyActor, self).__init__()
...     self.argmax = P.Argmax()
...     self.actor_net = FullyConnectedNet(4, 10, 2)
...     self.env = GymEnvironment({'name': 'CartPole-v0'})
>>> my_actor = MyActor()
>>> print(my_actor)
MyActor<
(actor_net): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
(environment): GymEnvironment<>

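act(phase, params)[source]

The act method takes an enum value together with observation data or other data needed to compute the action. It returns a set of outputs containing the new observation or other experience. This interface interacts with the environment.

Parameters:

phase (enum) - An int-valued enum denoting the phase: init, collect, eval, or another user-defined phase.
params (tuple(Tensor)) - A tuple of input tensors used to compute the action.

Returns:

observation (tuple(Tensor)) - A tuple of output tensors used to generate experience data.

act_init(state)[source]

Interface for the initialization phase of act, driven by the input state. Users need to override it according to their algorithm.

Parameters:

state (Tensor) - State data returned by interacting with the environment.

Returns:

done (Tensor) - Flag indicating whether the simulation has ended.
reward (Tensor) - Result of the simulation.
state (Tensor) - State of the simulation.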

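evaluate(state)[source]

Interface for evaluation, driven by the input state. Users need to override it according to their algorithm.

Parameters:

state (Tensor) - State data returned by interacting with the environment.

Returns:

done (Tensor) - Flag indicating whether the simulation has ended.
reward (Tensor) - Result of the simulation.
state (Tensor) - State of the simulation.

get_action(phase, params)[source]

get_action is the method used to obtain an action. Users need to override it according to their algorithm, but its arguments must be phase and params. This interface does not interact with the environment.

Parameters:

phase (enum) - An int-valued enum denoting the phase: init, collect, eval, or another user-defined phase.
params (tuple(Tensor)) - A tuple of input tensors used to compute the action.

Returns:

action (tuple(Tensor)) - A tuple of output tensors containing the action and any other required data.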

class mindspore_rl.agent.Learner[source]

Base class for learners. A learner computes and updates its own networks from the input experience data.

Examples:

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class MyLearner(Learner):
...   def __init__(self):
...     super(MyLearner, self).__init__()
...     self.target_network = FullyConnectedNet(4, 10, 2)
>>> my_learner = MyLearner()
>>> print(my_learner)
MyLearner<
(target_network): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>

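learn(experience)[source]

Interface of the learn method. Its behavior depends on the user's implementation. Typically, it takes samples from the replay buffer or other Tensors and computes the loss used to update the networks.

Parameters:

experience (tuple(Tensor)) - Experience data from the buffer.

Returns:

results (tuple(Tensor)) - Results output after the weights are updated.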

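class mindspore_rl.agent.Trainer(msrl)[source]

Base class for trainers. It is a workflow class that provides the basic training pattern.

Parameters:

msrl (MSRL) - The function handle.

evaluate()[source]: Evaluation method used during training.

load_and_eval(ckpt_path=None)[source]

Offline evaluation method. A checkpoint must be provided.

Parameters:

ckpt_path (string) - The checkpoint file to load into the network. Default: None.

train(episodes, callbacks=None, ckpt_path=None)[source]

The train method provides a standard training workflow, including the whole loop and the callbacks. Users can inherit or override it as needed.

Parameters:

episodes (int) - Number of training episodes.
callbacks (Optional[list[Callback]]) - List of callbacks. Default: None.
ckpt_path (Optional[str]) - Path of the network file to initialize from or reload. Default: None.

train_one_episode()[source]: Interface for training one episode. Its output must be, in this order: loss, rewards, steps, [Optional]others.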
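The output-order contract of train_one_episode can be illustrated with a plain-Python sketch. ToyTrainer below is a hypothetical stand-in, not the mindspore_rl Trainer, and the returned values are made up for illustration; it only shows how a training loop might consume the (loss, rewards, steps) triple.

```python
# Hypothetical stand-in for a Trainer subclass; NOT the mindspore_rl implementation.
class ToyTrainer:
    def train_one_episode(self):
        # Output order: loss, rewards, steps (optionally followed by others).
        loss, rewards, steps = 0.5, 10.0, 20  # made-up numbers
        return loss, rewards, steps

    def train(self, episodes):
        history = []
        for episode in range(episodes):
            # Only the first three outputs are relied on; extras are ignored here.
            loss, rewards, steps = self.train_one_episode()[:3]
            history.append((episode, loss, rewards, steps))
        return history


history = ToyTrainer().train(episodes=3)
print(history[-1])
```

Keeping the first three outputs in a fixed order lets a generic loop or callback log them without knowing the algorithm.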

trainable_variables()[source]: The variables to be saved to the checkpoint.

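class mindspore_rl.agent.Agent(actors, learner)[source]

Base class for agents. As the definition of an agent, it is composed of an Actor and a Learner. It provides the basic act and learn functions for interacting with the environment and updating itself.

Parameters:

actors (Actor) - The Actor instance.
learner (Learner) - The Learner instance.

Examples: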

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.agent.agent import Agent
>>> actors = Actor()
>>> learner = Learner()
>>> agent = Agent(actors, learner)
>>> print(agent)
Agent<
(_actors): Actor<>
(_learner): Learner<>
>

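act(phase, params)[source]

The act method takes an enum value together with observation data or other data needed to compute the action. It returns a set of outputs containing the new observation or other experience. In this interface, the Agent interacts with the environment.

Parameters:

phase (enum) - An int-valued enum denoting the phase: init, collect, or eval.
params (tuple(Tensor)) - A tuple of input tensors used to compute the action.

Returns:

observation (tuple(Tensor)) - A tuple of output tensors used to generate experience data.

get_action(phase, params)[source]

The get_action method takes an enum value together with observation data or other data needed to compute the action. It returns a set of outputs containing the action and other data. In this interface, the Agent does not interact with the environment.

Parameters:

phase (enum) - An int-valued enum denoting the phase: init, collect, eval, or another user-defined phase.
params (tuple(Tensor)) - A tuple of input tensors used to compute the action.

Returns:

action (tuple(Tensor)) - A tuple of output tensors containing the action and any other required data.

learn(experience)[source]

The learn method takes a set of experience data as input, computes the loss, and updates the weights.

Parameters:

experience (tuple(Tensor)) - A tuple of tensors holding the experience.

Returns:

results (tuple(Tensor)) - Results output after the weights are updated.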

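mindspore_rl.core

Helper components for implementing RL algorithms.

class mindspore_rl.core.MSRL(alg_config, deploy_config=None)[source]

MSRL provides the methods and APIs for developing reinforcement learning algorithms. It exposes the following methods to the user. Their inputs and outputs are the same as those of the user-defined methods.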

agent_act
agent_get_action
sample_buffer
agent_learn
replay_buffer_sample
replay_buffer_insert
replay_buffer_reset

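Parameters:

alg_config (dict) - Provides the algorithm configuration.
deploy_config (dict) - Provides the distributed configuration.
- Top level - Defines the algorithm components.
Key: actor, value: configuration of the actor (dict).
Key: learner, value: configuration of the learner (dict).
Key: policy_and_network, value: the policies and networks used by the actor and learner (dict).
Key: collect_environment, value: configuration of the collection environment (dict).
Key: eval_environment, value: configuration of the evaluation environment (dict).
Key: replay_buffer, value: configuration of the replay buffer (dict).
- Second level - Configuration of each algorithm component.
Key: number, value: number of actors/learners (int).
Key: type, value: actor/learner/policy_and_network/environment (class).
Key: params, value: parameters of the actor/learner/policy_and_network/environment (dict).
Key: policies, value: list of policies used by the actor/learner (list).
Key: networks, value: list of networks used by the actor/learner (list).
Key: pass_environment, value: if True, the user needs to pass the environment instance to the actor; if False, this is not needed (bool).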
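The two-level layout above can be pictured as a nested dict. The sketch below is illustrative only: the placeholder classes, the policy and network names, and the parameter values are hypothetical, and the contents of the environment and replay_buffer entries are left empty because this section does not spell them out.

```python
# Placeholder classes standing in for user-defined components (hypothetical names).
class MyActor:
    pass

class MyLearner:
    pass

class MyPolicyAndNetwork:
    pass

DOCUMENTED_TOP_LEVEL_KEYS = (
    'actor', 'learner', 'policy_and_network',
    'collect_environment', 'eval_environment', 'replay_buffer',
)

alg_config = {
    # Top level: one entry per algorithm component.
    'actor': {
        # Second level: configuration of that component.
        'number': 1,
        'type': MyActor,
        'policies': ['collect_policy'],      # hypothetical policy name
        'pass_environment': True,
    },
    'learner': {
        'number': 1,
        'type': MyLearner,
        'params': {'gamma': 0.99},           # hypothetical parameter
        'networks': ['policy_network'],      # hypothetical network name
    },
    'policy_and_network': {'type': MyPolicyAndNetwork, 'params': {}},
    'collect_environment': {},   # layout not detailed in this section
    'eval_environment': {},      # layout not detailed in this section
    'replay_buffer': {},         # layout not detailed in this section
}

# Quick sanity check that only documented top-level keys are used.
unknown_keys = [key for key in alg_config if key not in DOCUMENTED_TOP_LEVEL_KEYS]
print(unknown_keys)
```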

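get_replay_buffer()[source]

Returns the replay buffer instance.

Returns:

buffers (object) - The replay buffer instance. If the buffer is None, None is returned.

get_replay_buffer_elements(transpose=False, shape=None)[source]

Returns all elements in the replay buffer.

Parameters:

transpose (bool) - Whether the output elements need to be transposed. If True, shape must also be specified. Default: False.
shape (tuple[int]) - The shape used for the transpose. Default: None.

Returns:

elements (List[Tensor]) - A list of tensors containing all the data in the replay buffer.

init(config)[source]

Initializes the MSRL object. This method creates all the data and objects required by the algorithm and initializes all the methods.

Parameters:

config (dict) - The algorithm configuration.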

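class mindspore_rl.core.Session(alg_config, deploy_config=None)[source]

Session is a class for running MindSpore RL algorithms.

Parameters:

alg_config (dict) - The algorithm configuration or the deployment configuration of the algorithm.
deploy_config (dict) - The distributed deployment configuration. For details on the algorithm configuration, see https://www.mindspore.cn/reinforcement/docs/zh-CN/r0.5/custom_config_info.html

run(class_type=None, is_train=True, episode=0, duration=0, params=None, callbacks=None)[source]

Executes the reinforcement learning algorithm.

Parameters:

class_type (Trainer) - The type of the algorithm's trainer class. Default: None.
is_train (bool) - Whether to run the algorithm in training or inference mode: True for training, False for inference. Default: True.
episode (int) - Number of training episodes. Default: 0.
duration (int) - Number of steps per episode. Default: 0.
params (dict) - Algorithm-specific training parameters. Default: None.
callbacks (list[Callback]) - List of callbacks. Default: None.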

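class mindspore_rl.core.UniformReplayBuffer(batch_size, capacity, shapes, types)

The replay buffer class. The replay buffer stores experience data from the environment. In this class, each element is a group of Tensors, so the constructor takes the shape and type of each Tensor as arguments.

Parameters:

batch_size (int) - Batch size used when sampling from the buffer.
capacity (int) - Capacity of the buffer.
shapes (list[int]) - List of Tensor shapes, one per Tensor in a buffer element.
types (list[mindspore.dtype]) - List of Tensor dtypes, one per Tensor in a buffer element.

full()

Checks whether the buffer is full.

Returns:

Full (bool) - True if the buffer is full, False otherwise.

get_item(index)

Gets the element at the specified position in the buffer.

Parameters:

index (int) - Index of the element.

Returns:

element (list[Tensor]) - The element at the specified position.

insert(exp)

Inserts an element into the buffer. If the buffer is full, elements are replaced using a first-in-first-out strategy.

Parameters:

exp (list[Tensor]) - The group of Tensors to insert. It must match the shapes and types given when the buffer was initialized.

Returns:

element (list[Tensor]) - The buffer after the insertion.

reset()

Resets the buffer by setting its count to zero.

Returns:

success (bool) - Whether the reset succeeded.

sample()

Samples the buffer: randomly selects a group of elements and outputs them.

Returns:

data (Tuple(Tensor)) - A group of elements randomly sampled from the buffer.

size()

Returns the size of the buffer.

Returns:

size (int) - Number of elements in the buffer.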
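The first-in-first-out replacement described for insert can be sketched in plain Python. ToyReplayBuffer is a hypothetical illustration, not the library class: it stores arbitrary Python objects instead of Tensors, and sampling without replacement is an assumption made for simplicity.

```python
import random


class ToyReplayBuffer:
    """Plain-Python illustration of a FIFO replay buffer (not the library class)."""

    def __init__(self, batch_size, capacity):
        self.batch_size = batch_size
        self.capacity = capacity
        self.storage = []
        self.head = 0  # slot that will be overwritten next once the buffer is full

    def insert(self, exp):
        if len(self.storage) < self.capacity:
            self.storage.append(exp)
        else:
            self.storage[self.head] = exp  # overwrite the oldest element
        self.head = (self.head + 1) % self.capacity
        return self.storage

    def full(self):
        return len(self.storage) == self.capacity

    def size(self):
        return len(self.storage)

    def get_item(self, index):
        return self.storage[index]

    def sample(self):
        # Assumption: uniform sampling without replacement.
        return random.sample(self.storage, self.batch_size)

    def reset(self):
        self.storage = []
        self.head = 0
        return True


buffer = ToyReplayBuffer(batch_size=2, capacity=3)
for step in range(4):
    buffer.insert((step,))
print(buffer.full(), buffer.size(), buffer.get_item(0))
```

After four inserts into a buffer of capacity three, the oldest element `(0,)` has been replaced by `(3,)`.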

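class mindspore_rl.core.PriorityReplayBuffer(alpha, beta, capacity, sample_size, shapes, dtypes, seed0=0, seed1=0)[source]

Prioritized experience replay buffer, used to store experience data for deep Q-learning. The algorithm was proposed in Prioritized Experience Replay. Like a regular experience replay buffer, it lets a reinforcement learning agent remember and reuse past experience. In addition, it replays important transitions more frequently, which improves sample efficiency.

Parameters:

alpha (float) - Controls the degree of prioritization. 0 means uniform sampling, 1 means fully prioritized sampling.
beta (float) - Controls the degree of sampling correction. 0 means no correction, 1 means full correction.
capacity (int) - Capacity of the buffer.
sample_size (int) - Number of transitions sampled from the buffer.
shapes (list[int]) - List of tensor shapes in the buffer.
dtypes (list[mindspore.dtype]) - List of tensor data types in the buffer.
seed0 (int) - Value of random seed 0. Default: 0.
seed1 (int) - Value of random seed 1. Default: 0.

Examples: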

>>> import mindspore as ms
>>> from mindspore import Tensor
>>> from mindspore_rl.core.priority_replay_buffer import PriorityReplayBuffer
>>> capacity = 10000
>>> batch_size = 10
>>> alpha, beta = 1., 1.
>>> shapes = [(4,), (1,), (1,), (4,)]
>>> dtypes = [ms.float32, ms.int32, ms.float32, ms.float32]
>>> replaybuffer = PriorityReplayBuffer(alpha, beta, capacity, batch_size, shapes, dtypes)
>>> print(replaybuffer)
PriorityReplayBuffer<>
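The roles of alpha and beta can be made concrete with the formulas from the Prioritized Experience Replay paper: the sampling probability is P(i) = p_i^alpha / sum_k p_k^alpha, and the importance-sampling weight is w_i = (N * P(i))^(-beta), normalized by the largest weight. The plain-Python sketch below follows the paper; whether the library normalizes weights in exactly this way is an assumption, and the function names are hypothetical.

```python
def sampling_probabilities(priorities, alpha):
    """P(i) = p_i**alpha / sum_k p_k**alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta):
    """w_i = (N * P(i))**(-beta), divided by the maximum weight."""
    n = len(probabilities)
    weights = [(n * p) ** (-beta) for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


priorities = [1.0, 2.0, 4.0, 1.0]
uniform = sampling_probabilities(priorities, alpha=0.0)      # alpha = 0: uniform sampling
prioritized = sampling_probabilities(priorities, alpha=1.0)  # alpha = 1: proportional to priority
weights = importance_weights(prioritized, beta=1.0)          # beta = 1: full correction
print(uniform, prioritized, weights)
```

With beta = 1, frequently sampled transitions receive proportionally smaller weights, which compensates for the non-uniform sampling.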

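push(*transition)[source]

Pushes a transition into the buffer. If the buffer is full, the oldest data is overwritten.

Parameters:

transition (List(Tensor)) - List of tensors matching the shapes and dtypes given at initialization.

Returns:

handle (Tensor) - Handle of the prioritized replay buffer, with dtype int64 and shape (1,).

sample()[source]

Samples a batch of transitions from the buffer.

Returns:

indices (Tensor) - Indices of the transitions in the buffer.
weights (Tensor) - Weights used to correct the sampling bias.
transition - The sampled transitions.

update_priorities(indices, priorities)[source]

Updates the priorities of transitions.

Parameters:

indices (Tensor) - Indices of the transitions in the buffer.
priorities (Tensor) - Priorities of the transitions.

Returns:

handle (Tensor) - Handle of the prioritized replay buffer, with dtype int64 and shape (1,).

destroy()[source]

Destroys the replay buffer.

Returns:

handle (Tensor) - Handle of the prioritized replay buffer, with dtype int64 and shape (1,).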

class mindspore_rl.core.ReplayBuffer(batch_size, capacity, shapes, types)[source]

The replay buffer class. The replay buffer stores experience from the environment. Each element in the buffer is a list of tensors, so the constructor takes the shape and type of each tensor as arguments.

Parameters:

batch_size (int) - Batch size used when sampling from the buffer.
capacity (int) - Capacity of the buffer.
shapes (List[int]) - Shape of each tensor in a buffer element.
types (List[mindspore.dtype]) - Data type of each tensor in a buffer element.

Examples:

>>> import mindspore as ms
>>> from mindspore_rl.core import ReplayBuffer
>>> batch_size = 10
>>> capacity = 10000
>>> shapes = [(4,), (1,), (1,), (4,)]
>>> types = [ms.float32, ms.int32, ms.float32, ms.float32]
>>> replaybuffer = ReplayBuffer(batch_size, capacity, shapes, types)
>>> print(replaybuffer)
ReplayBuffer<>

full()[source]

Checks whether the replay buffer is full.

Returns: True if the replay buffer is full, False otherwise.

get_item(index)[source]

Gets the element at a specific position (index) in the replay buffer.

Parameters: index (int) - Position of the item.
Returns: element (List[Tensor]), the element from the buffer.

insert(exp)[source]

Inserts an element into the buffer. If the buffer is full, a FIFO strategy is used to replace elements in the buffer.

Parameters: exp (List[Tensor]) - A list of tensors to insert, matching the shapes and types given at initialization.
Returns: element (List[Tensor]), the whole buffer after the insertion.

reset()[source]

Resets the replay buffer by setting self.count to zero.

Returns: success (boolean), whether the reset succeeded.

sample()[source]

Samples the replay buffer: randomly chooses a set of elements and outputs them.

Returns: A set of sampled elements from the buffer.

size()[source]

Returns the size of the replay buffer.

Returns: size (int), the number of elements in the replay buffer.

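mindspore_rl.environment

Components for implementing custom environments.

class mindspore_rl.environment.GymEnvironment(params, env_id=0)[source]

GymEnvironment wraps Gym in a class so that Gym environments can be interacted with even in MindSpore graph mode.

Parameters:

params (dict) - Dictionary containing all the parameters required by the GymEnvironment class.

Parameter	Notes
name	Name of the game in Gym
seed	Random seed used in Gym

env_id (int) - Environment id, used to set the seed inside the environment.

Supported Platforms:

Ascend GPU CPU

Examples:

>>> from mindspore_rl.environment import GymEnvironment
>>> env_params = {'name': 'CartPole-v0'}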
>>> environment = GymEnvironment(env_params, 0)
>>> print(environment)
GymEnvironment<>

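reset()[source]

Resets the environment to its initial state. The reset method is typically called at the start of each episode and returns the initial state of the environment.

Returns: Tensor, the initial state of the environment.

step(action)[source]

Executes the environment's step function to interact with the environment for one step.

Parameters:

action (Tensor) - Tensor containing the action information.

Returns:

state (Tensor) - The new state returned by the environment after the action.
reward (Tensor) - The reward returned by the environment after the action.
done (Tensor) - Whether the environment has terminated after the action.

property action_space

Returns:

Space, the action space of the environment.

property config

Returns:

dict, a dictionary containing the environment information.

property done_space

Returns:

Space, the done space of the environment.

property observation_space

Returns:

Space, the observation space of the environment.

property reward_space

Returns:

Space, the reward space of the environment.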

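class mindspore_rl.environment.MultiEnvironmentWrapper(env_instance, num_proc=None)[source]

MultiEnvironmentWrapper is the wrapper for multi-environment scenarios. When users implement their own single-environment class and set the number of environments in the configuration file to more than 1, the framework automatically uses this class to create multiple environments.

Parameters:

env_instance (list(Environment)) - List of environment instances (subclasses of Environment).
num_proc (int) - Number of processes used when interacting with the environments. Default: None.

Examples:

>>> from mindspore_rl.environment import GymEnvironment, MultiEnvironmentWrapper
>>> env_params = {'name': 'CartPole-v0'}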
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> wrapper = MultiEnvironmentWrapper(multi_env)
>>> print(wrapper)
MultiEnvironmentWrapper<>

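reset()[source]

Resets the environments to their initial state. The reset method is typically called at the start of each episode and returns the initial states of the environments.

Returns: List of Tensors representing the initial states of the environments.

step(action)[source]

Executes the environment's step function to interact with the environments for one step.

Parameters:

action (Tensor) - Tensor containing the action information.

Returns:

state (list(Tensor)) - List of new states returned by the environments after the action.
reward (list(Tensor)) - List of rewards returned by the environments after the action.
done (list(Tensor)) - List indicating whether each environment has terminated after the action.

property action_space

Returns:

Space, the action space of the environment.

property config

Returns:

dict, a dictionary containing the environment information.

property done_space

Returns:

Space, the done space of the environment.

property observation_space

Returns:

Space, the observation space of the environment.

property reward_space

Returns:

Space, the reward space of the environment.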

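class mindspore_rl.environment.Environment[source]

Abstract base class for environments. Override its methods before using this class.

property action_space

Gets the action space of the environment.

Returns: The action space of the environment.

property config

Gets the configuration of the environment.

Returns: A dictionary containing the environment information.

property done_space

Gets the done space of the environment.

Returns: The done space of the environment.

property observation_space

Gets the observation space of the environment.

Returns: The observation space of the environment.

reset()[source]

Resets the environment to its initial state. The reset method is typically called at the start of each episode and returns the initial state of the environment together with any initial information produced by the reset.

Returns: A Tensor representing the initial state of the environment, or a Tuple containing the initial information, such as the new state, action, and reward.

property reward_space

Gets the reward space of the environment.

Returns: The reward space of the environment.

step(action)[source]

Executes the environment's step function to interact with the environment for one step.

Parameters:

action (Tensor) - Tensor containing the action information.

Returns:

tuple, containing the information produced by interacting with the environment.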

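class mindspore_rl.environment.Space(feature_shape, dtype, low=None, high=None, batch_shape=None)[source]

Class describing an action or observation space of the environment.

Parameters:

feature_shape (Union[list(int), tuple(int), int]) - Shape of the action/observation before batching.
dtype (np.dtype) - Data type of the action/observation space.
low (int, float) - Lower bound of the action/observation space. Default: None.
high (int, float) - Upper bound of the action/observation space. Default: None.
batch_shape (Union[list(int), tuple(int), int]) - Batch shape for vectorization. Typically used in multi-environment and multi-agent scenarios. Default: None.

Examples:

>>> import numpy as np
>>> from mindspore_rl.environment import Space
>>> action_space = Space(feature_shape=(6,), dtype=np.int32)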
>>> print(action_space.ms_dtype)
Int32

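property boundary

Returns:

The lower and upper bounds of the current space.

property is_discrete

Returns:

Whether the space is discrete.

property ms_dtype

Returns:

The MindSpore data type of the current space.

property np_dtype

Returns:

The NumPy data type of the current space.

property num_values

Returns:

The number of available actions in the current space.

property shape

Returns:

The shape of the Space after batching.

sample()[source]

Randomly samples a legal action from the current Space.

Returns:

action (Tensor) - Tensor holding a legal action.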

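class mindspore_rl.environment.MsEnvironment(kwargs=None)[source]

Class that wraps the built-in environments (environments implemented in C++).

Parameters:

kwargs (dict) - Environment-specific configuration. See the table below for details:

Environment	Parameter	Default	Notes
Tag	seed	42	Random seed
	environment_num	2	Number of environments
	predator_num	10	Number of predators
	max_timestep	100	Maximum number of steps per episode
	map_length	100	Length of the map
	map_width	100	Width of the map
	wall_hit_penalty	0.1	Penalty for an agent hitting a wall
	catch_reward	10	Reward for a predator catching the target
	caught_penalty	5	Penalty for the prey being caught
	step_cost	0.01	Base cost of a single step

Supported Platforms:

GPU

Examples:

>>> from mindspore import Tensor
>>> from mindspore_rl.environment import MsEnvironment
>>> config = {'name': 'Tag', 'predator_num': 4}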
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)

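property action_space

Returns:

Space, the action space of the environment.

property config

Returns:

dict, the environment information.

property done_space

Returns:

Space, the done space of the environment.

property observation_space

Returns:

Space, the observation space of the environment.

reset()[source]

Resets the environment to its initial state and returns the initial state of the environment.

Inputs: None.
Returns: Tensor, the initial state of the environment.
Supported Platforms: GPU

Examples:

>>> from mindspore_rl.environment import MsEnvironment
>>> config = {'name': 'Tag', 'predator_num': 4}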
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> print(observation.shape)
(2, 5, 21)

property reward_space: Space，获取环境的奖励空间。

step(action)[源代码]

执行环境Step函数来和环境交互一回合。

参数：

action (Tensor) - 由所有智能体提供的动作。

返回：

3 个张量的元组，状态、奖励和终止。

observation (Tensor) - 输入动作后的环境返回的所有智能体的新状态。
reward (Tensor) - 输入动作后环境返回的奖励。
done (Tensor) - 输入动作后环境是否终止。

支持平台：

GPU

样例：

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)

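class mindspore_rl.environment.EnvironmentProcess(proc_no, env_num, envs, actions, observations, initial_states)[source]

Creates a separate process that interacts with one or more environments.

Parameters:

proc_no (int) - The assigned process number.
env_num (int) - Number of environments passed to this process.
envs (list(Environment)) - List of environment instances (subclasses of Environment).
actions (Queue) - Queue used to pass actions to the environment process.
observations (Queue) - Queue used to pass states to the environment process.
initial_states (Queue) - Queue used to pass initial states to the environment process.

Examples:

>>> from multiprocessing import Queue
>>> from mindspore_rl.environment import GymEnvironment, EnvironmentProcess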
>>> actions = Queue()
>>> observations = Queue()
>>> initial_states = Queue()
>>> proc_no = 1
>>> env_num = 2
>>> env_params = {'name': 'CartPole-v0'}
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> env_proc = EnvironmentProcess(proc_no, env_num, multi_env, actions, observations, initial_states)
>>> env_proc.start()

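run()[source]: The method that runs in the child process. It can be overridden in a subclass.

class mindspore_rl.environment.StarCraft2Environment(params, env_id=0)[source]

StarCraft2Environment is a wrapper around SMAC. SMAC is WhiRL's environment for cooperative multi-agent reinforcement learning (MARL), built on Blizzard's strategy game StarCraft II. Using Blizzard's StarCraft II Machine Learning API and DeepMind's PySC2, SMAC provides a convenient interface through which agents interact with StarCraft II to obtain the environment state and the legal actions. Unlike PySC2, SMAC focuses on decentralized micromanagement scenarios, in which each unit in the game is controlled by an independent RL agent. For more information, see the official SMAC GitHub repository: <https://github.com/oxwhirl/smac>.

Parameters:

params (dict) - Dictionary containing all the parameters required by the StarCraft2Environment class.

Parameter	Notes
sc2_args	Dictionary used to create the SMAC instance, containing keys required by SMAC such as map_name. See the official GitHub repository for the detailed configuration.

env_id (int) - Environment id, used to set the seed inside the environment.

Supported Platforms:

Ascend GPU CPU

Examples:

>>> from mindspore_rl.environment import StarCraft2Environment
>>> env_params = {'sc2_args': {'map_name': '2s3z'}}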
>>> environment = StarCraft2Environment(env_params, 0)
>>> print(environment)

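property action_space

Gets the action space of the environment.

Returns: Space, the action space of the environment.

property config

Gets the configuration of the environment.

Returns: dict, a dictionary containing the environment information.

property done_space

Gets the done space of the environment.

Returns: Space, the done space of the environment.

get_step_info()[source]

Gets information about the environment after interacting with it.

Returns:

battle_won (Tensor) - Whether this episode was won.
dead_allies (Tensor) - Number of allied units killed.
dead_enemies (Tensor) - Number of enemy units killed.

property observation_space

Gets the observation space of the environment.

Returns: The observation space of the environment.

reset()[source]

Resets the environment to its initial state. The reset method is typically called at the start of each episode and returns the initial state of the environment, the global state, and the new legal actions.

Returns: tuple, containing the Tensors for the initial state of the environment, the global state, and the new legal actions.

property reward_space

Gets the reward space of the environment.

Returns: Space, the reward space of the environment.

step(action)[source]

Executes the environment's step function to interact with the environment for one step.

Parameters:

action (Tensor) - Tensor containing the action information.

Returns:

state (Tensor) - The new state returned by the environment after the action.
reward (Tensor) - The reward returned by the environment after the action.
done (Tensor) - Whether the environment has terminated after the action.
global_obs (Tensor) - The new global state returned by the environment after the action.
avail_actions (Tensor) - The new legal actions returned by the environment after the action.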

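class mindspore_rl.environment.TicTacToeEnvironment(params, env_id=0)[source]

Tic-Tac-Toe is a well-known paper-and-pencil game <en.wikipedia.org/wiki/Tic-tac-toe>. Two players take turns marking O and X on a 3x3 grid. The player who places three identical marks in a horizontal, vertical, or diagonal line wins. The grid below is an example of a Tic-Tac-Toe game.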

o		x
x	o
	x	o

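Parameters:

params (dict) - Dictionary containing all the parameters required by the TicTacToeEnvironment class.
env_id (int) - Environment id, used to set the seed inside the environment.

Supported Platforms:

Ascend GPU CPU

Examples:

>>> from mindspore_rl.environment import TicTacToeEnvironment
>>> env_params = {}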
>>> environment = TicTacToeEnvironment(env_params, 0)
>>> print(environment)
TicTacToeEnvironment<>
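The winning rule (three identical marks in a row, column, or diagonal) can be sketched in plain Python. The board encoding and the winner function are hypothetical and unrelated to the library's internal representation; the board is the example grid shown above.

```python
# All eight winning lines on a 3x3 board, using cell indices 0..8 in row-major order.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'o' or 'x' if that mark fills a winning line, otherwise None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


# The example grid from the class description (None marks an empty cell).
board = ['o', None, 'x',
         'x', 'o', None,
         None, 'x', 'o']
print(winner(board))
```

Here the O marks fill the main diagonal, so `winner(board)` returns `'o'`.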

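property action_space

Returns:

Space, the action space of the environment.

property config

Returns:

dict, a dictionary containing the environment information.

property done_space

Returns:

Space, the done space of the environment.

property observation_space

Returns:

Space, the observation space of the environment.

property reward_space

Returns:

Space, the reward space of the environment.

reset()[source]

Resets the environment to its initial state. The reset method is typically called at the start of each episode and returns the initial state of the environment.

Returns: Tensor, the initial state of the environment.

step(action)[source]

Executes the environment's step function to interact with the environment for one step.

Parameters:

action (Tensor) - Tensor containing the action information.

Returns:

state (Tensor) - The new state returned by the environment after the action.
reward (Tensor) - The reward returned by the environment after the action.
done (Tensor) - Whether the environment has terminated after the action.

save()[source]

Returns a copy of the environment. In Tic-Tac-Toe there is no need to return a copy of the environment, so it returns the current state.

Returns: A Tensor representing the current state.

load(state)[source]

Loads the input state. The environment updates its current state, legal actions, and done flag according to the input state.

Parameters:

state (Tensor) - The input environment state.

Returns:

state (Tensor) - The state at the save point.
reward (Tensor) - The reward at the save point.
done (Tensor) - Whether the game has already ended at the input save point.

calculate_rewards()[source]

Returns the reward of the current state.

Returns: Tensor, the reward of the current state.

legal_action()[source]

Returns the legal actions of the current state.

Returns: Tensor, the legal actions.

max_utility()[source]

Returns the maximum utility of the Tic-Tac-Toe game.

Returns: Tensor, the maximum utility.

current_player()[source]

Returns which player's turn it is in the current state.

Returns: Tensor, the current player.

is_terminal()[source]

Returns whether the game has terminated in the current state.

Returns: Whether the game has terminated in the current state.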

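mindspore_rl.network

Network components for implementing policies.

class mindspore_rl.network.FullyConnectedNet(input_size, hidden_size, output_size, compute_type=mstype.float32)[source]

A basic fully connected neural network.

Parameters:

input_size (int) - Size of the input.
hidden_size (int) - Size of the hidden layer.
output_size (int) - Size of the output.
compute_type (mindspore.dtype) - Data type used by the fully connected layers. Default: mindspore.float32.

Examples:

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))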
>>> net = FullyConnectedNet(4, 10, 2)
>>> output = net(input)
>>> print(output.shape)
(2, 2)

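construct(x)[source]

Returns the output of the network.

Parameters:

x (Tensor) - The input tensor of the network.

Returns:

The output of the network.

class mindspore_rl.network.FullyConnectedLayers(fc_layer_params, dropout_layer_params=None, activation_fn=nn.ReLU(), weight_init='normal', bias_init='zeros')[source]

A module of fully connected layers. Users can pass any number of fc_layer_params, and the module creates the corresponding number of fully connected layers.

Parameters:

fc_layer_params (list[int]) - List of input and output sizes of the fully connected layers. For example, with the input list [10, 20, 30], the module creates two fully connected layers whose input and output sizes are (10, 20) and (20, 30). The length of fc_layer_params should be greater than or equal to 3.
dropout_layer_params (list[float]) - List of dropout rates. If the input is [0.5, 0.3], two dropout layers are created, one after each fully connected layer. The length of dropout_layer_params should be less than that of fc_layer_params. dropout_layer_params is optional. Default: None.
activation_fn (Union[str, Cell, Primitive]) - Instance of the activation function. Default: nn.ReLU().
weight_init (Union[Tensor, str, Initializer, numbers.Number]) - Initializer for the trainable weight parameter. The dtype is the same as x. A str value refers to an Initializer function, such as normal or uniform. Default: 'normal'.
bias_init (Union[Tensor, str, Initializer, numbers.Number]) - Initializer for the trainable bias parameter. The dtype is the same as x. A str value refers to an Initializer function, such as normal or uniform. Default: 'zeros'.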
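The way fc_layer_params maps to layers can be shown with a few lines of plain Python. The helper name layer_sizes is hypothetical; it only reproduces the pairing described for fc_layer_params and does not build any network.

```python
def layer_sizes(fc_layer_params):
    """Pair consecutive entries into (input_size, output_size) for each layer."""
    return list(zip(fc_layer_params[:-1], fc_layer_params[1:]))


print(layer_sizes([10, 20, 30]))  # [(10, 20), (20, 30)]
```

A list of n sizes therefore yields n - 1 fully connected layers.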

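Inputs:

x (Tensor) - Tensor of shape $(*, fc\_layer\_params[0])$.

Outputs:

Tensor of shape $(*, fc\_layer\_params[-1])$.

Examples:

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedLayers
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))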
>>> net = FullyConnectedLayers(fc_layer_params=[4, 10, 2])
>>> output = net(input)
>>> print(output.shape)
(2, 2)

construct(x)[source]

Returns the output of the network.

Parameters:

x (Tensor) - Tensor of shape $(*, fc\_layer\_params[0])$.

Returns:

Tensor of shape $(*, fc\_layer\_params[-1])$.

class mindspore_rl.network.GruNet(input_size, hidden_size, weight_init='normal', num_layers=1, has_bias=True, batch_first=False, dropout=0.0, bidirectional=False)[source]

GRU (gated recurrent unit) layer. Applies a GRU layer to the input. For details, see mindspore.nn.GRU.

Parameters:

input_size (int) - Number of features of the input.
hidden_size (int) - Number of features of the hidden layer.
weight_init (str/Initializer) - Initialization method, such as normal or uniform. Default: 'normal'.
num_layers (int) - Number of GRU layers. Default: 1.
has_bias (bool) - Whether the cell has a bias. Default: True.
batch_first (bool) - Whether the first dimension of the input x is the batch size. Default: False.
dropout (float) - If not 0.0, a Dropout layer is appended to the output of every GRU layer except the last one. Default: 0.0. Value range: [0.0, 1.0).
bidirectional (bool) - Whether this is a bidirectional GRU: bidirectional if bidirectional=True, otherwise unidirectional. Default: False.

Inputs:

x_in (Tensor) - Tensor of data type mindspore.float32 and shape (seq_len, batch_size, input_size) or (batch_size, seq_len, input_size).
h_in (Tensor) - Tensor of data type mindspore.float32 and shape (num_directions * num_layers, batch_size, hidden_size). The data type of h_in must be the same as that of x_in.

Outputs:

A tuple (x_out, h_out).

x_out (Tensor) - Tensor of shape (seq_len, batch_size, num_directions * hidden_size) or (batch_size, seq_len, num_directions * hidden_size).
h_out (Tensor) - Tensor of shape (num_directions * num_layers, batch_size, hidden_size).
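The shape rules above can be checked with a small plain-Python helper. gru_output_shapes is a hypothetical function that only does the shape arithmetic stated in the Inputs and Outputs descriptions; it does not run a GRU.

```python
def gru_output_shapes(seq_len, batch_size, hidden_size,
                      num_layers=1, bidirectional=False, batch_first=False):
    """Compute the (x_out, h_out) shapes from the documented rules."""
    num_directions = 2 if bidirectional else 1
    if batch_first:
        x_out = (batch_size, seq_len, num_directions * hidden_size)
    else:
        x_out = (seq_len, batch_size, num_directions * hidden_size)
    h_out = (num_directions * num_layers, batch_size, hidden_size)
    return x_out, h_out


# One unidirectional layer, sequence length 3, batch size 5, hidden size 16.
print(gru_output_shapes(seq_len=3, batch_size=5, hidden_size=16))
# Two bidirectional layers: the feature axis doubles and h_out has 4 leading entries.
print(gru_output_shapes(seq_len=3, batch_size=5, hidden_size=16,
                        num_layers=2, bidirectional=True))
```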

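Examples:

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import GruNet
>>> net = GruNet(10, 16, has_bias=True, bidirectional=False)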
>>> x_in = Tensor(np.ones([3, 5, 10]).astype(np.float32))
>>> h_in = Tensor(np.ones([1, 5, 16]).astype(np.float32))
>>> x_out, h_out = net(x_in, h_in)
>>> print(x_out.shape)
(3, 5, 16)

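construct(x_in, h_in)[source]

Forward output of the GRU network.

Parameters:

x_in (Tensor) - Tensor of data type mindspore.float32 and shape (seq_len, batch_size, input_size) or (batch_size, seq_len, input_size).
h_in (Tensor) - Tensor of data type mindspore.float32 and shape (num_directions * num_layers, batch_size, hidden_size). The data type of h_in must be the same as that of x_in.

Returns:

x_out (Tensor) - Tensor of shape (seq_len, batch_size, num_directions * hidden_size) or (batch_size, seq_len, num_directions * hidden_size).
h_out (Tensor) - Tensor of shape (num_directions * num_layers, batch_size, hidden_size).

mindspore_rl.policy

Policies used in RL algorithms.

class mindspore_rl.policy.Policy[source]

Abstract base class for policies. Override this class before calling the model.

construct(*inputs, **kwargs)[source]

The construct interface. It is meant to be inherited by users; for the arguments, refer to EpsilonGreedyPolicy, RandomPolicy, and so on.

Parameters:

inputs - Depends on the user's definition.
kwargs - Depends on the user's definition.

Returns:

Depends on the user's definition. Usually an action value or a probability distribution over actions.

class mindspore_rl.policy.RandomPolicy(action_space_dim)[source]

Generates random actions in [0, action_space_dim).

Parameters:

action_space_dim (int) - Dimension of the action space.

Examples:

>>> from mindspore_rl.policy import RandomPolicy
>>> action_space_dim = 2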
>>> policy = RandomPolicy(action_space_dim)
>>> output = policy()
>>> print(output.shape)
(1,)

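construct()[source]

Returns a random number in [0, action_space_dim).

Returns: A random number in [0, action_space_dim).

class mindspore_rl.policy.GreedyPolicy(input_network)[source]

Generates sampled actions according to the greedy policy.

Parameters:

input_network (Cell) - The network used to produce actions from the input state.

Examples:

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import GreedyPolicy
>>> state_dim, hidden_dim, action_dim = 4, 10, 2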
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = GreedyPolicy(input_net)
>>> state = Tensor(np.ones([2, 4]).astype(np.float32))
>>> output = policy(state)
>>> print(output.shape)
(2,)

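construct(state)[source]

Returns the best action.

Parameters:

state (Tensor) - The input state Tensor of the network.

Returns:

action_max, the best action.

class mindspore_rl.policy.EpsilonGreedyPolicy(input_network, size, epsi_high, epsi_low, decay, action_space_dim)[source]

Generates sampled actions according to the epsilon-greedy policy.

Parameters:

input_network (Cell) - The input network that returns the policy action.
size (int) - Shape of epsilon.
epsi_high (float) - Upper bound of the exploration epsilon, in [0, 1].
epsi_low (float) - Lower bound of the exploration epsilon, in [0, epsi_high].
decay (float) - Decay coefficient of epsilon.
action_space_dim (int) - Dimension of the action space.

Examples:

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import EpsilonGreedyPolicy
>>> state_dim, hidden_dim, action_dim = (4, 10, 2)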
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = EpsilonGreedyPolicy(input_net, 1, 0.1, 0.1, 100, action_dim)
>>> state = Tensor(np.ones([1, state_dim]).astype(np.float32))
>>> step =  Tensor(np.array([10,]).astype(np.float32))
>>> output = policy(state, step)
>>> print(output.shape)
(1,)

construct(state, step)[source]

The construct interface.

Parameters:

state (Tensor) - The input Tensor of the network.
step (Tensor) - The current step, which affects the decay of epsilon.

Returns:

The output action.
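As an illustration of how epsi_high, epsi_low, decay, and step might interact, the plain-Python sketch below uses an exponential schedule, epsilon = epsi_low + (epsi_high - epsi_low) * exp(-step / decay). This schedule is an assumption chosen for illustration; the formula used by the library may differ. The function names are hypothetical.

```python
import math
import random


def decayed_epsilon(step, epsi_high, epsi_low, decay):
    # Assumed schedule: exponential decay from epsi_high toward epsi_low.
    return epsi_low + (epsi_high - epsi_low) * math.exp(-step / decay)


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.7, 0.2]

epsilons = [decayed_epsilon(step, 1.0, 0.1, 100) for step in (0, 100, 1000)]
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)     # always the argmax
explored_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)   # always random
print(epsilons, greedy_action, explored_action)
```

Early in training epsilon is close to epsi_high, so most actions are exploratory; as step grows, epsilon approaches epsi_low and the greedy action dominates.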