<no title>

class mindspore_rl.utils.DiscountedReturn(gamma, need_bprop)[源代码]

计算折扣回报。

设折扣回报为 \(G\)，折扣系数为 \(\gamma\)，奖励为 \(R\)，时间步 \(t\)，最大时间步 \(N\)。则 \(G_{t} = \Sigma_{t=0}^N{\gamma^tR_{t+1}}\)。

对于奖励序列包含多个episode的情况， \(done\) 用来标识episode边界， \(last\_state\_value\) 表示最后一个epsode的最后一个step的价值。

参数：

gamma (float) - 折扣系数。
need_bprop (bool) - 是否需要计算discounted return的反向

输入：

reward (Tensor) - 包含多个episode的奖励序列。张量的维度 \((Timestep, Batch, ...)\)。
done (Tensor) - Episode结束标识。张量维度 \((Timestep, Batch)\)。
last_state_value (Tensor) - 表示最后一个epsode的最后一个step的价值，张量的维度 \((Batch, ...)\)。

返回：

折扣回报。

样例：

>>> net = DiscountedReturn(gamma=0.99)
>>> reward = Tensor([[1, 1, 1, 1]], dtype=mindspore.float32)
>>> done = Tensor([[False, False, True, False]])
>>> last_state_value = Tensor([2.], dtype=mindspore.float32)
>>> ret = net(reward, done, last_state_value)
>>> print(output.shape)
(2, 2)