Sharp Minimax Rates for Imitation Learning
We establish sharp minimax bounds on Imitation Learning (IL) in episodic Markov Decision Processes (MDPs), where the learner is provided a dataset of demonstrations from an expert. It is known that Behavior Cloning (BC) achieves suboptimality growing quadratically in the horizon, which is termed error compounding in the literature. We show that when the MDP transition function is unknown, every algorithm must suffer a suboptimality that grows quadratically with the horizon, even if the algorithm can interactively query the expert, as in the setting of DAGGER. We then propose the setting of known transitions and show that one can provably break the quadratic dependence and improve the exponent to 3/2, which is shown to be tight. Our upper bound is established using a computationally efficient algorithm, which we name Mimic-MD. The lower bound is established by proving a two-way reduction between IL and the problem of estimating the value of the unknown expert policy under any given reward function, as well as linear functional estimation with subsampled observations. We further show that under the additional assumption that the expert is optimal for the true reward function, there exists an efficient algorithm, which we term Mimic-Mixture, that provably achieves suboptimality independent of the horizon for arbitrary 3-state MDPs with rewards only at the terminal layer. In contrast, if the expert is not constrained to be optimal, no algorithm can achieve, with high probability, suboptimality growing slower than the square root of the horizon. We thus formally establish the benefit of the expert-optimality assumption in the known-transition setting, and show that this additional assumption does not help when the transition functions are unknown. Based on joint work with Nived Rajaraman, Yanjun Han, Lin F. Yang, and Kannan Ramchandran.
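
The horizon dependence stated in the abstract can be restated compactly as follows. This is only a sketch: the symbol $H$ (the horizon) and the shorthand $\mathrm{SubOpt}$ (the expert's value minus the learner's value) are notational assumptions not used in the abstract, and the dependence on every other quantity, such as the number of states and the size of the demonstration dataset, is deliberately suppressed.

```latex
% Growth in the horizon H only; all other problem parameters are suppressed.
% H and SubOpt are assumed notation, not taken from the abstract.
\begin{align*}
&\text{Unknown transitions (BC upper bound; lower bound even with expert queries):}
  && \mathrm{SubOpt} = \Theta\!\left(H^{2}\right) \\
&\text{Known transitions (Mimic-MD upper bound; matching lower bound):}
  && \mathrm{SubOpt} = \Theta\!\left(H^{3/2}\right) \\
&\text{Known transitions, optimal expert, 3-state MDPs with terminal rewards (Mimic-Mixture):}
  && \mathrm{SubOpt} = O(1) \\
&\text{Same setting, expert not constrained to be optimal (with high probability):}
  && \mathrm{SubOpt} = \Omega\!\left(H^{1/2}\right)
\end{align*}
```

Read this way, the gap between the last two lines is what the abstract calls the benefit of the expert-optimality assumption under known transitions.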