Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations

Paper Title

Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations

Authors

Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, Cho-Jui Hsieh

Abstract

A deep reinforcement learning (DRL) agent perceives its state through observations, which may contain natural measurement errors or adversarial noise. Because the observations deviate from the true states, they can mislead the agent into taking suboptimal actions. Several works have demonstrated this vulnerability via adversarial attacks, but existing approaches to improving the robustness of DRL in this setting have had limited success and lack theoretical grounding. We show that naively applying existing techniques for improving robustness in classification tasks, such as adversarial training, is ineffective for many RL tasks. We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled policy regularization that can be applied to a large family of DRL algorithms, including proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and deep Q-networks (DQN), for both discrete and continuous action control problems. We significantly improve the robustness of PPO, DDPG, and DQN agents under a suite of strong white-box adversarial attacks, including new attacks of our own. Additionally, we find that a robust policy noticeably improves DRL performance even without an adversary in a number of environments. Our code is available at https://github.com/chenhongge/StateAdvDRL.
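The abstract does not spell out the form of the policy regularization. As a rough illustration only, the sketch below shows one way to penalize a policy whose action distribution changes when the observed state is perturbed within a small l-infinity ball. Everything in it is an assumption made for illustration and is not taken from the authors' code: the linear-softmax `policy`, the `smoothness_penalty` helper, the choice of KL divergence as the distance, and the use of random sampling as a cheap stand-in for a worst-case search over perturbations.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(weights, state):
    """Hypothetical linear-softmax policy: one weight row per discrete action."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def smoothness_penalty(weights, state, eps, n_samples=32, rng=None):
    """Largest KL(pi(.|s) || pi(.|s_hat)) over sampled s_hat with
    ||s_hat - s||_inf <= eps. Random sampling only approximates the
    worst case from below; it is an assumption of this sketch."""
    rng = rng or random.Random(0)
    clean = policy(weights, state)
    worst = 0.0
    for _ in range(n_samples):
        perturbed = [s + rng.uniform(-eps, eps) for s in state]
        worst = max(worst, kl(clean, policy(weights, perturbed)))
    return worst


# Illustrative values only: two actions, two state features.
weights = [[1.0, -0.5], [-1.0, 0.5]]
state = [0.2, 0.4]
penalty = smoothness_penalty(weights, state, eps=0.1)
```

In a training loop, a term like `penalty` would be scaled by a coefficient and added to the usual policy loss, so that the agent is rewarded for acting consistently on nearby observations. How the perturbation set is defined and how the inner maximization is solved are design choices that this sketch leaves open.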
