Seminar: Assistant Professor 王柏安 (Institute of Statistics and Data Science, National Tsing Hua University)

Title: Policy Testing in Reinforcement Learning
Speaker: Assistant Professor 王柏安 (Institute of Statistics and Data Science, National Tsing Hua University)
Time: Friday, October 3, 2025, 10:40-11:30 AM
(Tea reception 10:20-10:40 AM in Room 428, General Building I)
Venue: Room 304, General Building I
Abstract
Pure exploration refers to a family of problems in which a learner aims to identify a specific property of an unknown distribution under a fixed-confidence regime. For example, in best policy identification the goal is to determine the policy with the highest value, while in policy testing the objective is to decide whether the value of a given policy exceeds a predetermined threshold. Following the demonstration of instance-specific optimality for best arm identification in multi-armed bandits in 2016, researchers have attempted to extend these optimality guarantees to Markov decision processes (MDPs). However, existing algorithms are often either computationally infeasible or sacrifice optimality.
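To make the policy-testing formulation concrete, here is a minimal Python sketch; it is only an illustration of the problem statement, not the speaker's algorithm. It repeatedly samples transitions along a fixed policy in a small tabular MDP through an assumed generative model sample(s, a), evaluates the policy on the empirical model, and stops once a crude confidence interval places the estimated value on one side of the threshold. The sampler, the Hoeffding-style width, and all constants are illustrative assumptions.

import numpy as np

def policy_value(P, r, pi, gamma):
    # Exact value of the deterministic policy pi under transition tensor P[s, a, s']
    # and reward table r[s, a], via the linear system (I - gamma * P_pi) V = r_pi.
    n = P.shape[0]
    P_pi = P[np.arange(n), pi]
    r_pi = r[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_test(sample, n_states, n_actions, pi, threshold,
                gamma=0.9, delta=0.05, max_rounds=5000):
    # Decide whether the value of pi at state 0 exceeds `threshold`.
    # `sample(s, a)` is a generative model returning (reward, next_state).
    # Sampling is round-robin along the policy; the confidence width is a crude
    # Hoeffding-style bound, not an instance-dependent one.
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    v_hat = 0.0
    for t in range(1, max_rounds + 1):
        for s in range(n_states):
            a = pi[s]
            r_obs, s_next = sample(s, a)
            counts[s, a, s_next] += 1
            rew_sum[s, a] += r_obs
        n_sa = counts.sum(axis=2, keepdims=True)
        P_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / n_states)
        r_hat = rew_sum / np.maximum(n_sa[..., 0], 1)
        v_hat = policy_value(P_hat, r_hat, pi, gamma)[0]
        width = np.sqrt(np.log(1.0 / delta) / t) / (1.0 - gamma)
        if v_hat - width > threshold:
            return True, t      # confidently above the threshold
        if v_hat + width < threshold:
            return False, t     # confidently below the threshold
    return v_hat > threshold, max_rounds   # budget exhausted: report the point estimate's side

An instance-optimal procedure would allocate samples and stop far more carefully than this round-robin rule with a generic width; the sketch only pins down what "policy testing" asks.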
In this talk, my co-author and I propose a simple remedy for the non-convex optimization problems that cause this difficulty by introducing a reversed MDP, in which the roles of the transition parameters and the policy are exchanged. By incorporating a policy gradient method, we establish a new framework called parameter gradient. This framework makes it possible to tackle the previously intractable non-convex optimization by leveraging recent breakthroughs in policy gradient research. To our knowledge, this is the first algorithm to achieve both statistical optimality and computational feasibility for pure exploration problems in MDPs.
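To illustrate the role reversal in spirit only (this is not the speaker's parameter-gradient algorithm), one can hold the policy fixed and take gradient steps on softmax-parameterized transition probabilities, treating the model, rather than the policy, as the optimization variable, analogously to how policy gradient differentiates the value with respect to the policy. The tabular sketch below uses exact policy evaluation and the discounted visitation distribution; the toy MDP, step size, and all names are illustrative assumptions.

import numpy as np

def value_and_parameter_gradient(logits, r, pi, gamma, s0=0):
    # Exact value of the fixed policy pi at state s0 under softmax-parameterized
    # transitions, and the gradient of that value w.r.t. the transition logits.
    n = logits.shape[0]
    P = np.exp(logits - logits.max(axis=2, keepdims=True))
    P /= P.sum(axis=2, keepdims=True)                      # P[s, a, :] is a distribution
    P_pi = P[np.arange(n), pi]                             # kernel induced by pi
    r_pi = r[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)    # exact policy evaluation
    e0 = np.zeros(n); e0[s0] = 1.0
    d = np.linalg.solve((np.eye(n) - gamma * P_pi).T, e0)  # discounted visitation from s0
    grad = np.zeros_like(logits)
    for s in range(n):
        a = pi[s]
        ev = P[s, a] @ V                                   # expected next-state value
        grad[s, a] = d[s] * gamma * P[s, a] * (V - ev)     # softmax chain rule
    return V[s0], grad

# Illustrative use: ascend on the transition logits with the policy held fixed,
# pushing the fixed policy's value upward; the optimization variable is the model.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
logits = rng.normal(size=(n_states, n_actions, n_states))
rewards = rng.uniform(size=(n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)                # a fixed deterministic policy
for _ in range(200):
    v, g = value_and_parameter_gradient(logits, rewards, pi, gamma=0.9)
    logits += 0.5 * g                                      # gradient step on the model
print(f"value of pi at s0 after model ascent: {v:.3f}")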