+ ~~~~~~~~~~~~~~~~ GUSB.US ~~~~~~~~~~~~~~~~ +
| ~ the personal website of gus brocchini ~ |
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +

    GB-gammon

In May and June of 2020, I wrote GB-gammon, a program that teaches itself how to play backgammon using a neural network and temporal difference learning. I more or less followed Gerald Tesauro's seminal TD-Gammon papers. You can see the code here.

The first thing I learned working on GB-gammon was how little I actually knew about what I was doing. Within a few days, I went from "I think I know basically everything I need to" to "I don't even know what things I need to know." Luckily, after poring over online backgammon magazines from the 90s and Wikipedia pages I didn't understand, I was pointed towards David Silver's reinforcement learning lectures. After watching those, and re-reading Tesauro's papers, I finally knew enough to actually write it.

    how it works

The neural network is just a function that takes in a representation of where all the pieces are on a backgammon board and spits out what it thinks is the likelihood of each side winning. (Backgammon has two special ways of winning, a "gammon" and a "backgammon", which it also predicts likelihoods for.) At the beginning, its guesses are completely random. To make a move, it looks at how the board would look after every possible move, then chooses the one it thinks has the highest likelihood of winning (or really, the highest equity, as a backgammon player would say).
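
Here's a rough sketch of that evaluation step in Python. The outcome layout, the net function, and the legal_moves/apply_move helpers are stand-ins of mine, not GB-gammon's actual interfaces:

    import numpy as np

    # Assumed outcome order for this sketch:
    # [win, gammon win, backgammon win, loss, gammon loss, backgammon loss]
    POINTS = np.array([1.0, 2.0, 3.0, -1.0, -2.0, -3.0])

    def equity(net, board):
        # Expected points for the side to move, given the net's
        # estimated probability of each (mutually exclusive) outcome.
        return float(POINTS @ net(board))

    def choose_move(net, board, legal_moves, apply_move):
        # 1-ply lookahead: evaluate the position after each legal move
        # and pick the move leading to the best equity.
        return max(legal_moves(board),
                   key=lambda m: equity(net, apply_move(board, m)))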

As it plays against itself, it nudges its parameters so that its past guesses, if made according to the new parameters, would be closer to its current guess. At the end of a game, it looks at the actual result and does a similar nudge. This way, it gets a little bit better every move.
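
That nudge is temporal difference learning, specifically TD(lambda) with eligibility traces. Here's a minimal sketch of a single update, with names and constants that are illustrative rather than GB-gammon's actual ones:

    def td_update(params, traces, grad, v_old, v_new, alpha=0.1, lam=0.7):
        # params: network weights    traces: eligibility traces (same shape)
        # grad:   gradient of the value estimate with respect to params
        # v_old:  the net's guess before the move
        # v_new:  its guess after the move (or the actual result at game end)
        delta = v_new - v_old             # the temporal-difference error
        traces = lam * traces + grad      # decay old credit, add the new gradient
        params = params + alpha * delta * traces
        return params, traces

The traces are what let the update reach backwards: every past position still carries a little credit, so one surprise at the end of a game adjusts the guesses that led up to it.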

    a note on the game of backgammon

I thought that writing the logic of the actual game of backgammon would be comparatively easy. But what seems like a simple set of rules ("just move the pieces by the number on the dice") is riddled with complications. The first one, that you can't move onto a point controlled by your opponent, is simple enough. But then think about the fact that you have to enter off the bar before you can make the rest of your move. Then there are the rules about bearing off: you need an exact number, unless there are no pieces more distant, and, oh, everything needs to be in your home board before you can do any of that. And your ability to do those things can change within a single move! The real kicker was the rule that you have to make the largest move possible. I had to rewrite nearly all the logic when I remembered that rule.
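
To give a flavor of that last rule, here's a simplified sketch of the filtering it requires. The helpers moves_for_die and apply_submove, which would encode the single-die rules (blocking, the bar, bearing off), are assumed here rather than taken from GB-gammon, and a submove is a (die, from, to) tuple:

    def legal_plays(board, dice, moves_for_die, apply_submove):
        # Doubles let you move the rolled number four times.
        dice_left = list(dice) * (2 if dice[0] == dice[1] else 1)
        plays = []
        _search(board, dice_left, [], plays, moves_for_die, apply_submove)
        if not plays:
            return []                  # no legal move at all: the turn is lost
        # You must use as much of the roll as possible...
        longest = max(len(p) for p in plays)
        plays = [p for p in plays if len(p) == longest]
        # ...and if only one die can be played, it must be the larger one.
        if longest == 1 and dice[0] != dice[1]:
            larger = [p for p in plays if p[0][0] == max(dice)]
            plays = larger or plays
        return plays

    def _search(board, dice_left, play, out, moves_for_die, apply_submove):
        extended = False
        for die in set(dice_left):     # distinct die values only
            rest = list(dice_left)
            rest.remove(die)
            for sub in moves_for_die(board, die):
                extended = True
                _search(apply_submove(board, sub), rest,
                        play + [sub], out, moves_for_die, apply_submove)
        if not extended and play:
            out.append(play)           # a play that can't be extended further

The annoying part is exactly what bit me: you can't know whether a short play is legal until you've searched for every longer one.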

Probably the best computer backgammon player before Tesauro's neural nets was Hans Berliner's BKG. I skimmed his paper when I was mostly done with the project, and I was heartened to see him describe the same struggle: "It would seem that the design of a move generator for a backgammon program should be a trivial exercise.... However, there are problems."

    results

There are a few things I looked at to measure how GB-gammon improved over time. The first was how many moves each game took:


[plot: moves per game over the course of training]

As you can see, the early games take forever, as the neural net has no idea what it's doing and haphazardly flops towards winning or losing. Then, fairly quickly, it learns not to leave blots (single pieces that the opponent can hit and then move all the way back), making the games much shorter. After that, it learns more complex techniques (e.g., back games) that extend gameplay.

The next metric was the number of gammons and backgammons over time. Gammons and backgammons are ways of super-winning (you get 1 point for a normal win, 2 points for a gammon, and 3 points for a backgammon).


[plot: gammons and backgammons over the course of training]

I interpret the backgammon data like the moves-per-game data, as three steps: at first, playing randomly, it's pretty likely to escape its back checkers while its opponent is bearing off; then, it starts learning to keep its back checkers in case it can hit; finally, it learns to try to escape the backgammon. As for the gammon data, I'm not sure what to think, except to say that it makes sense that, in a winning position, the net would initially focus on winning and only later on gammoning its opponent.

I was initially surprised at the number of gammons and backgammons, but I realized that, because I had not implemented doubling (in standard backgammon, you can offer to double the stakes of the current game, and your opponent must either accept the new stakes or resign at the current stakes), a lot of games which would've ended in a dropped double were ending as gammons or backgammons.

I also saved some versions of the neural net, after 1,000, 5,000, 15,000, 50,000, and 100,000 simulated games. I ran a little tournament, where I played each version against each other version 100 times and then took the average score. The columns show the net playing black, and the rows show the net playing white, so row 2, column 1 shows that, when the 1k net plays black and the 5k net plays white, white wins 1.73 points per game. (The diagonal cells show a net playing itself.)

           1k     5k    15k    50k   100k
    1k B+0.10 B+1.41 B+1.99 B+1.93 B+2.23
    5k W+1.73 W+0.78 B+0.32 B+0.79 B+0.89
   15k W+1.92 W+0.84 W+0.02 B+0.48 B+0.81
   50k W+2.15 W+0.91 W+0.25 B+0.04 B+0.25
  100k W+2.10 W+1.04 B+0.02 W+0.27 B+0.47
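
The tournament loop itself is short; here's a sketch, with play_game as a stand-in that returns black's points for one game (negative when white wins):

    def round_robin(nets, play_game, games=100):
        # nets: dict mapping labels like "1k" to saved networks.
        # Returns black's average points per game for every pairing.
        table = {}
        for white_label, white in nets.items():
            for black_label, black in nets.items():
                total = sum(play_game(white, black) for _ in range(games))
                table[(white_label, black_label)] = total / games
        return table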

The results mostly make sense. I'm surprised that the 100k net is so imbalanced when playing itself, especially when the 15k and 50k nets both stay within a tenth of a point per game of even when playing themselves. I'm really not sure what could cause this, except exceptionally bad luck in the training or the tournament.

The other big surprise is that the 15k net, playing black, actually beats the 100k net, playing white. This could be something about the playing styles of the two nets, or similar bad luck. If you can figure out what went on with either of these, please let me know.

UPDATE: A new theory, from the same person who showed me the David Silver lectures, is that the net learns a certain strategy and then learns to counter it. Then it can forget the strategy, since it's no longer effective when it's just playing itself; and once it no longer knows the strategy, it can forget how to counter it. So, it's possible that the 15k net knows some secret trick, long forgotten by the comparatively advanced 100k net. (It reminds me of this story, where a Viking-era medicinal potion was found to kill modern, antibiotic-resistant bacteria.)

I'm really glad I did this project. There's a deep cleverness to reinforcement learning that I'm happy to at least partially understand, and it's cured me of my playing-backgammon-on-my-phone addiction. After all, why try to win this one game when you can figure out a strategy to win 60% of all future games?

Gus Brocchini
Burlingame, CA
January 2021