Expert Iteration (EI) is a method under the umbrella of reinforcement learning (RL) that combines the strengths of supervised learning and reinforcement learning so that an agent can learn more rapidly. The basic idea behind EI is to bootstrap the agent's learning by first imitating expert behavior and then improving on it through reinforcement learning and self-play. The process is iterative: the agent refines its policy incrementally by combining expert demonstrations with its own experience in the environment.
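The overall loop can be sketched at a high level as alternating imitation and self-play steps. The helper functions below are placeholders for the two phases described in the rest of this post, not a specific library API:

```python
# A high-level sketch of the Expert Iteration loop; the three helper
# functions are placeholders supplied by the caller (illustrative only).
def expert_iteration(policy, gather_expert_data, imitate, improve_by_self_play,
                     num_iterations=10):
    """Alternate between imitating expert demonstrations and RL-based self-play."""
    demonstrations = []
    for _ in range(num_iterations):
        # Collect expert demonstrations in states reachable by the current policy.
        demonstrations += gather_expert_data(policy)
        # Imitation step: supervised learning on the accumulated demonstrations.
        policy = imitate(policy, demonstrations)
        # Improvement step: refine the policy with self-play and reward feedback.
        policy = improve_by_self_play(policy)
    return policy
```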
At its core, Expert Iteration is built around expert demonstrations: sequences of actions taken by an expert (a human or a well-trained agent) in particular states of the environment. These demonstrations give the agent strong signals about which actions are good in different contexts. The first stage of EI is behavior cloning, in which the agent uses supervised learning to imitate the expert's behavior. This lets the agent learn a baseline policy efficiently, without the extensive exploration and trial-and-error required by pure reinforcement learning techniques such as Q-learning or Proximal Policy Optimization (PPO).
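As a concrete illustration, here is a minimal behavior-cloning sketch in PyTorch. The state and action dimensions, network size, and the random "expert" data are all illustrative stand-ins; in practice the dataset would be real (state, action) pairs recorded from the expert:

```python
import torch
import torch.nn as nn

state_dim, num_actions = 4, 2          # assumed environment sizes (illustrative)
policy = nn.Sequential(
    nn.Linear(state_dim, 32),
    nn.ReLU(),
    nn.Linear(32, num_actions),        # outputs action logits
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder expert data: states visited by the expert and the actions it chose.
expert_states = torch.randn(256, state_dim)
expert_actions = torch.randint(0, num_actions, (256,))

for epoch in range(20):
    logits = policy(expert_states)
    loss = loss_fn(logits, expert_actions)   # imitate the expert's action choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```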
After acquiring an initial policy from the expert demonstrations, the agent moves on to self-play, in which it interacts with the environment on its own and uses reinforcement learning to improve the policy. During this phase, feedback from the environment comes in the form of rewards, which lets the agent explore new strategies and push its performance beyond that of the expert. By alternating between learning from expert demonstrations and self-play, Expert Iteration allows the agent to improve its policy over time, combining the efficiency of supervised learning with the exploration benefits of reinforcement learning.
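The refinement step can be any RL algorithm; as a minimal sketch, here is a REINFORCE-style policy-gradient update applied to the cloned policy from the previous block. The environment interface (`reset()` returning a state, `step()` returning state, reward, done) is an assumption standing in for whatever self-play setup is used:

```python
import torch

def run_episode(policy, env, max_steps=100):
    """Roll out one episode, recording log-probabilities and rewards."""
    state = env.reset()                      # assumed env interface
    log_probs, rewards = [], []
    for _ in range(max_steps):
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())   # assumed env interface
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards

def reinforce_update(policy, optimizer, log_probs, rewards, gamma=0.99):
    """One policy-gradient step: raise the probability of high-return actions."""
    returns, g = [], 0.0
    for r in reversed(rewards):              # discounted return, computed backwards
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```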
A key benefit of EI is improved sample efficiency, that is, the amount of data required to train an agent. Expert demonstrations let the agent bypass the typically time-consuming initial exploration phase of traditional RL methods, which matters most in environments where exploration is costly or even dangerous (for example, robotics or autonomous driving). EI also allows progressive improvement: through self-play, the agent can surpass the expert's performance by discovering new strategies that are not contained in the expert's data.
Expert Iteration also has drawbacks. Its effectiveness depends on the quality of the expert demonstrations: poor or suboptimal demonstrations can push the agent toward ineffective or suboptimal policies. Furthermore, EI may not scale well to environments with large state and action spaces, because collecting high-quality expert demonstrations there can be slow and expensive. Another problem is that EI can cause the agent to overfit to the expert's behavior, discouraging it from exploring alternative strategies during self-play.
Expert Iteration has been applied successfully across a range of domains: robots learning manipulation or navigation tasks quickly, and game-playing AI systems such as AlphaZero, which showed that self-play combined with expert knowledge can reach superhuman performance. In autonomous driving, learning from expert drivers can serve as the EI bootstrap for an initial policy, which is then refined through reinforcement learning to drive safely.