Game Theory in Action: Pluribus and CFR in Texas Hold'em (2) CFR Counterfactual Regret Minimization Algorithm

The previous chapter introduced the Nash equilibrium strategy, which is derived using an algorithm called CFR. Before discussing CFR, let's first introduce regret matching, taking rock-paper-scissors as an example. Its payoff matrix is as follows.

| Row player's payoff | Rock | Paper | Scissors |
| --- | --- | --- | --- |
| Rock | 0 | -1 | +1 |
| Paper | +1 | 0 | -1 |
| Scissors | -1 | +1 | 0 |
The algorithm iterates as follows:
For each player, initialize regretSum[action] to 0.
Iterate T times:
    Normalize the positive values in regretSum to derive the current strategy (if no regret is positive, use the uniform strategy).
    Add the strategy to strategySum.
    Sample an action a based on the strategy.
    Calculate the regret of every other action a' as regret[a'] = u(a') - u(a).
    Add each regret[a'] to regretSum[a'].
Normalize strategySum to obtain the average strategy.
With enough iterations, this converges to the equilibrium solution for rock-paper-scissors: 33%, 33%, 33%.
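Here is a minimal, runnable sketch of regret matching in self-play for rock-paper-scissors. The names PAYOFF, get_strategy, and train are my own; treat this as an illustration of the steps above, not any reference implementation.

```python
import numpy as np

# Row player's payoff for rock-paper-scissors; the game is symmetric,
# so PAYOFF[my_action, opponent_action] works for either player.
PAYOFF = np.array([[ 0, -1,  1],   # rock
                   [ 1,  0, -1],   # paper
                   [-1,  1,  0]])  # scissors

def get_strategy(regret_sum):
    # Normalize the positive regrets; fall back to uniform if none are positive.
    positive = np.maximum(regret_sum, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(3, 1.0 / 3.0)

def train(iterations=100_000, seed=0):
    rng = np.random.default_rng(seed)
    regret_sum = [np.zeros(3), np.zeros(3)]
    strategy_sum = [np.zeros(3), np.zeros(3)]
    for _ in range(iterations):
        strategies = [get_strategy(r) for r in regret_sum]
        actions = [rng.choice(3, p=s) for s in strategies]
        for p in (0, 1):
            me, opp = actions[p], actions[1 - p]
            strategy_sum[p] += strategies[p]
            # Regret of each alternative a' versus the sampled action a:
            # regret[a'] = u(a') - u(a).
            regret_sum[p] += PAYOFF[:, opp] - PAYOFF[me, opp]
    return [s / s.sum() for s in strategy_sum]

if __name__ == "__main__":
    for p, avg in enumerate(train()):
        print(f"player {p} average strategy:", avg.round(3))
```

Both players' average strategies hover around (1/3, 1/3, 1/3), as expected.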
In simple terms, if playing rock would have earned 2 more than the paper we actually played this round, we add 2 to regretSum[rock]. The sampling distribution for actions, i.e. the strategy, follows the distribution of regretSum, so we will try rock more often. Noam Brown's paper proves the convergence of regret matching; it is shown below.

[Image: excerpt from the paper]
Here, two concepts are introduced:
Best Response: assuming we know the opponent will play rock 100% of the time, the strategy of playing paper 100% of the time is the best response. A best response can always be chosen to be one-hot, executing a single pure action 100% of the time.
Exploitability of the opponent's strategy: the profit that the best response earns against it.
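Reusing the PAYOFF matrix from the sketch above, both ideas fit in a few lines (the helper name is my own, shown only for illustration):

```python
def best_response_value(opponent_strategy):
    # Expected payoff of each pure action against the opponent's fixed mix;
    # the best response simply plays the argmax action 100% of the time.
    action_values = PAYOFF @ opponent_strategy
    return action_values.max()

# Opponent always plays rock -> best response is paper, winning 1 per game.
print(best_response_value(np.array([1.0, 0.0, 0.0])))  # 1.0
```

Since the value of rock-paper-scissors is 0, this best-response profit is exactly the exploitability of the opponent's strategy; the equilibrium mix (1/3, 1/3, 1/3) has exploitability 0.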
An interesting thought: if we modify the rules of rock-paper-scissors so that scissors wins 2 points and loses 2 points, while rock and paper still win and lose 1 point, what is the players' Nash equilibrium strategy?

| Row player's payoff | Rock | Paper | Scissors |
| --- | --- | --- | --- |
| Rock | 0 | -1 | +2 |
| Paper | +1 | 0 | -2 |
| Scissors | -2 | +2 | 0 |
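To check your answer numerically, you can swap the payoff matrix in the regret-matching sketch above for the modified one and rerun train():

```python
# Modified rules: scissors wins or loses 2; rock and paper win or lose 1.
PAYOFF = np.array([[ 0, -1,  2],   # rock
                   [ 1,  0, -2],   # paper
                   [-2,  2,  0]])  # scissors
```

Hint: the equilibrium is no longer uniform; the riskier action gets played less often.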
Here is a recommended paper that provides a detailed introduction to CFR and includes code:
Introduction of cfr
CFR: Counterfactual Regret Minimization
The regret matching algorithm used for rock-paper-scissors receives its payoff immediately, which makes it unsuitable for extensive-form games. When the game takes the form of a game tree, a strategy must be trained at every information set on the tree. Defining a state value function for intermediate nodes, updating the regret values, and guaranteeing that the final result is a Nash equilibrium are problems that regret matching alone cannot solve. This is where the CFR counterfactual regret minimization algorithm comes into play.

[Image: the first paper]

[Image: the second paper]
I extracted an introduction to CFR from the first paper; together with the Kuhn poker example (code included), it should help you understand the implementation. Other bloggers have also explained this in detail online. At first glance the CFR algorithm seems complex, but we can analyze it as an extension of regret matching.
CFR's regret calculation formula
Comparing CFR's regret calculation with regret matching's, the biggest difference is the inclusion of the opponent's reach probability. It took me a long time to understand its physical meaning; trying to remove this probability led to incorrect results. After practical testing, my view is this: in a game tree, both we and the opponent make a large number of different choices, and we need to combine those choices into a single state value to propagate upward, which means the payoffs of different choices must be weighted differently. If the opponent never reaches a node, then that node's payoff contributes nothing to the node above it.
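For reference, here is my transcription of the standard definitions (assuming the paper follows the notation of Zinkevich et al.; a paraphrase, not a quote). The counterfactual value of an information set $I$ weights each history $h \in I$ by the probability that the opponent (and chance) reach it:

$$v_i(\sigma, I) = \sum_{h \in I} \pi_{-i}^{\sigma}(h) \sum_{z \in Z} \pi^{\sigma}(h, z)\, u_i(z)$$

The counterfactual regret of action $a$ at iteration $t$ is then

$$r^t(I, a) = v_i(\sigma^t_{I \to a}, I) - v_i(\sigma^t, I)$$

where $\sigma^t_{I \to a}$ is $\sigma^t$ modified to always play $a$ at $I$. The factor $\pi_{-i}^{\sigma}(h)$ is exactly the opponent reach probability discussed above.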
Another point to note: when accumulating the average strategy, the current strategy should be weighted by one's own reach probability. This makes sense: if the probability of reaching a node under the current strategy is very low, then mixing that node's strategy into the average is nearly meaningless.
This is just my personal inference; a detailed mathematical proof is still necessary.
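To make both weighting rules concrete, here is a minimal vanilla-CFR sketch for Kuhn poker, modeled on the style of the tutorial paper recommended above. All names (Node, cfr, train) are my own; treat it as a sketch of the textbook algorithm, not Pluribus's implementation.

```python
import random
import numpy as np

NUM_ACTIONS = 2  # 0 = pass/check/fold ('p'), 1 = bet/call ('b')

class Node:
    """Regret matcher for one information set (private card + betting history)."""
    def __init__(self):
        self.regret_sum = np.zeros(NUM_ACTIONS)
        self.strategy_sum = np.zeros(NUM_ACTIONS)

    def get_strategy(self):
        positive = np.maximum(self.regret_sum, 0.0)
        total = positive.sum()
        return positive / total if total > 0 else np.full(NUM_ACTIONS, 0.5)

    def get_average_strategy(self):
        total = self.strategy_sum.sum()
        return self.strategy_sum / total if total > 0 else np.full(NUM_ACTIONS, 0.5)

node_map = {}

def cfr(cards, history, p0, p1):
    """Expected utility for the player to act; p0/p1 are the players' reach probs."""
    player = len(history) % 2
    # Terminal payoffs (Kuhn poker: 1-chip ante, 1-chip bet).
    if len(history) > 1:
        higher = cards[player] > cards[1 - player]
        if history[-1] == 'p':
            if history == 'pp':            # check-check: showdown for the antes
                return 1 if higher else -1
            return 1                       # a bet was folded to: bettor wins 1
        if history[-2:] == 'bb':           # bet-call: showdown for 2 chips
            return 2 if higher else -2

    info_set = str(cards[player]) + history
    node = node_map.setdefault(info_set, Node())
    strategy = node.get_strategy()
    my_reach = p0 if player == 0 else p1
    opp_reach = p1 if player == 0 else p0

    util = np.zeros(NUM_ACTIONS)
    for a, ch in enumerate('pb'):
        if player == 0:
            util[a] = -cfr(cards, history + ch, p0 * strategy[a], p1)
        else:
            util[a] = -cfr(cards, history + ch, p0, p1 * strategy[a])
    node_util = float(strategy @ util)

    # Regrets are weighted by the OPPONENT's reach probability...
    node.regret_sum += opp_reach * (util - node_util)
    # ...while the average strategy is weighted by OUR OWN reach probability.
    node.strategy_sum += my_reach * strategy
    return node_util

def train(iterations=100_000):
    cards, value = [1, 2, 3], 0.0
    for _ in range(iterations):
        random.shuffle(cards)
        value += cfr(cards, '', 1.0, 1.0)
    print('average game value:', value / iterations)
    for info_set in sorted(node_map):
        print(info_set, node_map[info_set].get_average_strategy().round(3))

train()
```

With enough iterations, the average game value approaches -1/18 ≈ -0.056 for the first player, the known value of Kuhn poker, and the information-set strategies approach one of its equilibria.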
The paper first defines the average overall regret and proves that if each player's average overall regret is less than ε, then the average strategy is within 2ε of a Nash equilibrium (its exploitability is at most 2ε).

[Image: definition from the paper]
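For reference (again my transcription, assuming the notation of Zinkevich et al.), player $i$'s average overall regret after $T$ iterations is

$$R_i^T = \frac{1}{T} \max_{\sigma_i^*} \sum_{t=1}^{T} \left( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \right)$$

that is, how much better the best fixed strategy in hindsight would have performed against the opponent's actual sequence of strategies.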
It then proves that the sum of the counterfactual regrets over all information sets is an upper bound on the average overall regret.
[Image: theorem from the paper]
Lastly, it proves that the counterfactual regret at each information set converges, and therefore that the CFR algorithm converges to a Nash equilibrium.

[Image: convergence bound from the paper]
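In the same notation (my summary of the standard results, not a quote from the paper), the two bounds are

$$R_i^T \le \sum_{I \in \mathcal{I}_i} R_{i,\mathrm{imm}}^{T,+}(I) \qquad \text{and} \qquad R_{i,\mathrm{imm}}^{T}(I) \le \frac{\Delta_{u,i}\sqrt{|A(I)|}}{\sqrt{T}}$$

so the average overall regret shrinks like $1/\sqrt{T}$, and combined with the 2ε theorem above, the average strategy converges to a Nash equilibrium.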
The next chapter will introduce the principles of, and differences between, current solvers such as Libratus, Pluribus, and DeepStack.
I am looking for remote job opportunities. Please contact me at my email: justfunnychen@gmail.com.
