Paper Title

Policy Gradient Reinforcement Learning for Policy Represented by Fuzzy Rules: Application to Simulations of Speed Control of an Automobile

Paper Authors

Seiji Ishihara, Harukazu Igarashi

Paper Abstract

A method fusing fuzzy inference with policy gradient reinforcement learning has been proposed that directly learns the parameters of a policy function represented by weighted fuzzy rules, so as to maximize the expected reward per episode. A previous study applied this method to the task of automobile speed control and obtained correct policies; some of them control the speed of the automobile appropriately, but many others produce inappropriate speed oscillation. In general, a policy that causes sudden temporal changes or oscillation in the output value is undesirable, and in many cases a policy that changes the output value smoothly is preferable. In this paper, as a measure to suppress sudden changes in the output of the fuzzy controller, we propose a fusion method whose objective function incorporates defuzzification by a stochastically weighted center-of-gravity model together with a constraint term for the smoothness of temporal change. We then derive the learning rule for this fusion and also examine how the reward function affects fluctuation of the output value. Experiments applying our method to automobile speed control confirm that the proposed method suppresses undesirable fluctuation in the time series of the output value. Moreover, the results also show that differences between reward functions can adversely affect the learning results.
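The abstract combines three ingredients: a fuzzy-rule policy, center-of-gravity defuzzification, and a policy gradient update whose objective is penalized for non-smooth output. The sketch below illustrates that combination in Python. It is a minimal sketch under assumptions, not the authors' formulation: the Gaussian membership functions, the Gaussian exploration noise, the REINFORCE-style update, and all names (FuzzySpeedController, smooth_coef, etc.) are hypothetical choices made for illustration only.

```python
import numpy as np

class FuzzySpeedController:
    """Illustrative fuzzy policy: each rule i has a membership grade mu_i(x)
    over the speed error x and a learnable consequent w_i (an acceleration
    command). The crisp action is the center-of-gravity average of the
    consequents, perturbed by Gaussian noise for exploration (an assumption,
    not the paper's stochastic weighting scheme)."""

    def __init__(self, centers, widths, consequents, sigma=0.1, smooth_coef=1.0):
        self.centers = np.asarray(centers, dtype=float)  # antecedent centers (assumed)
        self.widths = np.asarray(widths, dtype=float)    # antecedent widths (assumed)
        self.w = np.asarray(consequents, dtype=float)    # learnable consequent values
        self.sigma = sigma                               # exploration noise std (assumed)
        self.smooth_coef = smooth_coef                   # weight of smoothness penalty

    def memberships(self, x):
        # Normalized Gaussian membership grades (assumed shape of the fuzzy sets).
        mu = np.exp(-((x - self.centers) / self.widths) ** 2)
        return mu / mu.sum()

    def act(self, x, rng):
        # Center-of-gravity defuzzification, then Gaussian exploration noise.
        mu = self.memberships(x)
        mean = float(mu @ self.w)
        action = mean + self.sigma * rng.standard_normal()
        return action, mu, mean

    def update(self, episode, episode_reward, lr=0.01):
        """REINFORCE-style update of the consequents. The smoothness constraint
        is folded in by subtracting a penalty on |a_t - a_{t-1}| from the
        episode reward, so policies that make the output oscillate are
        discouraged."""
        actions = [a for (_, a, _, _) in episode]
        smooth_penalty = sum(abs(a1 - a0) for a0, a1 in zip(actions, actions[1:]))
        ret = episode_reward - self.smooth_coef * smooth_penalty
        for x, a, mu, mean in episode:
            # Gradient of log N(a; mu.w, sigma^2) with respect to w.
            grad_log_pi = (a - mean) / self.sigma ** 2 * mu
            self.w += lr * ret * grad_log_pi


# Hypothetical usage on a toy speed-control step (x = target speed - current speed).
rng = np.random.default_rng(0)
ctrl = FuzzySpeedController(centers=[-10, 0, 10], widths=[5, 5, 5],
                            consequents=[-1.0, 0.0, 1.0])
action, mu, mean = ctrl.act(4.2, rng)
```

In this sketch the smoothness constraint is applied as a penalty on the episode return rather than as a separate term inside the objective function; how the paper actually incorporates the constraint and the stochastic weighting of the center-of-gravity model should be taken from the paper itself.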
