Topic: Policy Testing in Reinforcement Learning
Speaker: Asst. Prof. Po-An Wang (Institute of Statistics and Data Science, NTHU)
Time: Oct 3 (Friday), 2025, 10:40-11:30
Place: 3F-304, Assembly Building I
Abstract
Pure exploration refers to a family of problems in which a learner aims to identify a specific property of an unknown distribution in the fixed-confidence regime. For example, in best policy identification the goal is to find the policy with the highest value, while in policy testing the objective is to decide whether the value of a given policy exceeds a predetermined threshold.
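Concretely, policy testing in the fixed-confidence regime can be phrased as a sequential hypothesis test (the notation below is ours, not necessarily the speaker's): given a policy \pi, a threshold \mu_0, and a confidence level \delta \in (0, 1), decide between

    H_0 : V^{\pi} \le \mu_0 \qquad \text{versus} \qquad H_1 : V^{\pi} > \mu_0,

where V^{\pi} is the value of \pi. A \delta-correct learner samples adaptively, stops at a random time \tau_\delta, and returns an answer that is wrong with probability at most \delta; the aim is to minimize \mathbb{E}[\tau_\delta].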
Following the demonstration of instance-specific optimality in best arm identification for multi-armed bandits in 2016, researchers have attempted to extend these optimality guarantees to Markov decision processes (MDPs). However, existing algorithms are often either computationally infeasible or sacrifice optimality.
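To make "instance-specific optimality" concrete (again in our notation): in the 2016 bandit result, any \delta-correct algorithm obeys the lower bound

    \mathbb{E}_{\nu}[\tau_\delta] \ge T^*(\nu)\,\mathrm{kl}(\delta, 1-\delta),

where \tau_\delta is the stopping time, T^*(\nu) is a characteristic time for the instance \nu obtained from a max-min optimization over sampling proportions, and \mathrm{kl}(\delta, 1-\delta) \sim \log(1/\delta) as \delta \to 0. An algorithm is instance-specifically optimal when its expected stopping time matches this bound asymptotically. In MDPs, the analogous max-min problem is non-convex in general, which is the root of the computational difficulty.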
In this talk, my co-author and I propose a simple way around this non-convexity by introducing a reversed MDP, in which the roles of the transition parameters and the policy are exchanged. Incorporating a policy gradient method, we establish a new framework we call the parameter gradient. This framework makes it possible to tackle the previously intractable non-convex optimization by leveraging recent breakthroughs in policy gradient research. To our knowledge, this yields the first algorithm to achieve both statistical optimality and computational feasibility for pure exploration problems in MDPs.
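To convey the flavor of the role reversal, here is a minimal, self-contained Python sketch (entirely our illustration, not the speaker's algorithm): it evaluates a fixed policy exactly, then performs gradient ascent on softmax-parameterized transition logits, so the transition parameters are updated the way a policy would be in ordinary policy gradient. The toy objective (mean value under a uniform start) and all names are assumptions; the actual objective in this line of work arises from the lower-bound optimization.

import numpy as np

# Illustrative sketch (ours, not the speaker's code): in a "reversed MDP" the
# transition parameters play the role the policy plays in an ordinary MDP, so
# they can be updated with a policy-gradient-style ascent step.

rng = np.random.default_rng(0)
S, A = 4, 3                          # number of states / actions in a toy MDP
gamma = 0.9                          # discount factor
theta = rng.normal(size=(S, A, S))   # logits parameterizing P(s' | s, a)
policy = np.full((S, A), 1.0 / A)    # fixed policy under test (uniform here)
reward = rng.uniform(size=(S, A))    # fixed reward table r(s, a)

def transitions(theta):
    """Softmax over next states: P[s, a, s'] proportional to exp(theta)."""
    z = np.exp(theta - theta.max(axis=2, keepdims=True))
    return z / z.sum(axis=2, keepdims=True)

def value(theta):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    P = transitions(theta)
    P_pi = np.einsum('sa,sat->st', policy, P)   # state-to-state kernel under policy
    r_pi = (policy * reward).sum(axis=1)        # expected one-step reward
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return v.mean()                             # scalar objective, uniform start

def grad(theta, eps=1e-5):
    """Finite-difference gradient of the objective w.r.t. the logits."""
    base = value(theta)
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        t = theta.copy()
        t[idx] += eps
        g[idx] = (value(t) - base) / eps
    return g

# "Parameter gradient" ascent: the update mirrors policy gradient, but the
# variable being optimized is the transition parameter, not the policy.
for step in range(50):
    theta += 0.5 * grad(theta)
print("objective after ascent:", value(theta))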