A Tutorial on Google AlphaGo
Bo An
18/3/2016
Slides based on:
Silver, David, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016): 484-489.
Shane Moon’s presentation
Games in AI
Ideal test bed for AI research
Clear results
Clear motivation
Good challenge
Success with search-based approaches
chess (1997, Deep Blue), and others
Not successful in the game of Go
Go is to Chess as Poetry is to Double-entry accounting
It goes to the core of artificial intelligence, which involves the study of learning and decision-making, strategic thinking, knowledge representation, pattern recognition and, perhaps most intriguingly, intuition.
The game of Go
A 4,000-year-old board game from China
Standard board size: 19×19
Two players, Black and White, place stones in turn
Stones cannot be moved, but can be captured and removed from the board
The player with the larger territory wins
AlphaGo vs European Champion (Fan Hui, 2-dan professional rank)
October 5–9, 2015
- Time limit: 1 hour
- AlphaGo Wins (5:0)
AlphaGo vs World Champion (Lee Sedol 9-Dan)
March 9–15, 2016
• Time limit: 2 hours
• Venue: Four Seasons Hotel, Seoul
• AlphaGo Wins (4:1)
Computer Go AI - Definition
Computer Go as an AI problem: given a state s, choose an action a, which leads to a new state s'
s (state): e.g. we can represent the board in a matrix-like form
a (action): where to place the next stone
Given s, pick the best a
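As a rough sketch (not AlphaGo's actual code), the state s can be held in a 19×19 matrix and the AI reduced to a function from that state to an action; the helper names below (new_board, legal_actions, choose_action) and the simplified legality check are my own illustration.

import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2

def new_board(size=19):
    # s (state): the board stored as a size x size matrix
    return np.zeros((size, size), dtype=np.int8)

def legal_actions(board):
    # a (action): any empty intersection (ko and suicide rules ignored for brevity)
    return list(zip(*np.where(board == EMPTY)))

def choose_action(board, score):
    # "Given s, pick the best a": score(board, a) is whatever evaluation we have for action a
    return max(legal_actions(board), key=lambda a: score(board, a))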
Computer Go AI – An Implementation Idea?
How about simulating all possible board positions?
Search depth d = 1, 2, 3, ..., up to d = max D ~ 19×19 = 361
Process the simulation until the game ends, then report win / lose results
e.g. the next stone placed here wins 13 times, there 37,839 times, there 431,320 times, ...
Choose the “next action / stone” that has the most win-counts in the full-scale simulation
Can also apply the Minimax search learned in CZ3005
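A minimal sketch of that idea, using random (Monte Carlo style) simulations rather than a literally exhaustive enumeration; step(board, move, player), game_over(board) and winner(board) are hypothetical game helpers, and legal_actions is reused from the earlier sketch.

import random
from collections import Counter

def rollout(board, player, step, game_over, winner):
    # Play random legal moves until the game ends, then report who won
    # ("process the simulation until the game ends").
    while not game_over(board):
        board, player = step(board, random.choice(legal_actions(board)), player)
    return winner(board)

def most_winning_move(board, player, n_sims, step, game_over, winner):
    # Count how often each candidate first move leads to a win,
    # then choose the move with the most win-counts.
    wins = Counter()
    for _ in range(n_sims):
        first = random.choice(legal_actions(board))
        b, p = step(board.copy(), first, player)
        if rollout(b, p, step, game_over, winner) == player:
            wins[first] += 1
    return wins.most_common(1)[0][0]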
AlphaGo: Key Ideas
Objective: Reduce search space without sacrificing quality
Key Idea 1: Take advantage of top human players’ data
Deep learning
Key Idea 2: Self-play
Reinforcement learning
Key Idea 3: Looking ahead
Monte Carlo tree search
We learned Minimax search with evaluation functions
Reducing Search Space
1. Reducing “action candidates” (Breadth Reduction)
If there is a model that can tell you that certain moves are not common / probable (e.g. rarely played by experts), remove them from the search candidates in advance.
2. Position evaluation ahead of time (Depth Reduction)
Instead of simulating until the maximum depth, use a function V(s), the “board evaluation of state s”, that measures how good a position is (e.g. V = 1, V = 2, V = 10) and decide win / lose without searching deeper.
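The two reductions can be sketched together as follows; policy(s) and value(s) are stand-ins for the networks introduced on the next slides, and play, opponent, top_k and max_depth are my own illustrative helpers and parameters, not the paper's.

def reduced_search(board, player, policy, value, depth=0, max_depth=3, top_k=5):
    # play(board, a, player) and opponent(player) are hypothetical board helpers;
    # value(board) is assumed to score the position from the root player's view.
    # Depth reduction: stop simulating and trust V(s) once max_depth is reached.
    if depth == max_depth:
        return value(board)
    # Breadth reduction: only expand the top-k moves the policy considers probable.
    probs = policy(board)                      # dict: action -> probability
    candidates = sorted(probs, key=probs.get, reverse=True)[:top_k]
    scores = [reduced_search(play(board, a, player), opponent(player),
                             policy, value, depth + 1, max_depth, top_k)
              for a in candidates]
    # Minimax-style backup: our turn maximizes V, the opponent's turn minimizes it.
    return max(scores) if depth % 2 == 0 else min(scores)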
1. Reducing “action candidates”
Learning: P(next action | current state) = P(a | s)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Prediction Model: Current State → Next State (s1 → s2, s2 → s3, s3 → s4, ...)
Data: online Go experts (5~9 dan), 160K games, 30M board positions
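A hypothetical sketch of how those (current state, expert next move) training pairs could be assembled; the games format assumed here (a list of (board, move) records per game) is my own illustration, not the paper's pipeline.

def expert_dataset(games):
    # Each game is assumed to be a sequence of (board_state, expert_move) records.
    # 160K games with a few hundred positions each yields the ~30M training pairs.
    states, actions = [], []
    for game in games:
        for board, (row, col) in game:
            states.append(board)            # s: the current board position
            actions.append(row * 19 + col)  # a: the expert's next move as an index in 0..360
    return states, actions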
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Prediction Model: Current Board → Next Action, i.e. f: s → a
There are 19×19 = 361 possible actions (with different probabilities)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Deep Learning Prediction Model (13-layer CNN): Current Board → Next Action
g: s → p(a|s), then play a = argmax_a p(a|s)
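A simplified PyTorch stand-in for such a 13-layer policy CNN; the paper's network uses 48 input feature planes and 192 filters, but the exact layer shapes and names below are illustrative, not the published architecture.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Simplified sketch of the policy CNN: input planes encoding the board,
    # a stack of convolutions, and a softmax over the 361 possible moves.
    def __init__(self, in_planes=48, filters=192, layers=13):
        super().__init__()
        convs = [nn.Conv2d(in_planes, filters, 5, padding=2), nn.ReLU()]
        for _ in range(layers - 2):
            convs += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
        convs += [nn.Conv2d(filters, 1, 1)]           # one score per intersection
        self.net = nn.Sequential(*convs)

    def forward(self, s):                             # s: (batch, in_planes, 19, 19)
        logits = self.net(s).flatten(1)               # (batch, 361)
        return torch.softmax(logits, dim=1)           # p(a|s); play argmax at test time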
Convolutional Neural Network (CNN)
Go: abstraction is the key to winning
CNN: abstraction is its forte
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Expert Moves Imitator Model (with CNN): Current Board → Next Action
Training: maximize the likelihood of the expert move by stochastic gradient ascent, Δσ ∝ ∂log p_σ(a|s)/∂σ
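A minimal sketch of that training step, assuming the PolicyNet sketch above and mini-batches where the expert move is encoded as an index in 0..360; the optimizer choice and batch format are assumptions.

import torch
import torch.nn.functional as F

def sl_training_step(policy, optimizer, boards, expert_moves):
    # Maximize log p(a|s) for the expert move a, i.e. minimize the cross-entropy
    # between the predicted distribution and the move the expert actually played.
    probs = policy(boards)                                     # (batch, 361)
    loss = F.nll_loss(torch.log(probs + 1e-8), expert_moves)   # expert_moves: (batch,) long indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()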
1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)
Improving by playing against itself:
Expert Moves Imitator Model (with CNN) vs. Expert Moves Imitator Model (with CNN)
Return: board positions, win/lose info
1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)
Expert Moves Imitator Model (with CNN): board position → win/loss
z = +1 for a win, z = −1 for a loss
Training: policy gradient scaled by the game outcome, Δρ ∝ ∂log p_ρ(a|s)/∂ρ · z
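A minimal sketch of that REINFORCE-style update, scaled by the game outcome z; tensor shapes and names are assumptions, not the paper's code, and PolicyNet/torch come from the earlier sketches.

import torch

def rl_training_step(policy, optimizer, boards, moves_played, z):
    # Policy gradient: raise log p(a|s) for moves from games that were won (z = +1)
    # and lower it for moves from games that were lost (z = -1).
    probs = policy(boards)                                              # (batch, 361)
    log_p = torch.log(probs.gather(1, moves_played.unsqueeze(1)) + 1e-8).squeeze(1)
    loss = -(z * log_p).mean()                                          # z: (batch,) of +1 / -1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()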
1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)
Older models vs. newer models: the current model (e.g. ver 1.3) plays against a randomly picked earlier version (ver 1.0, 1.1, 1.2, ...), repeated over roughly 1,000,000 updates
It uses the same topology as the expert moves imitator model, and just uses the updated parameters
Return: board positions, win/lose info
The final model wins 80% of the time when playing against the first model
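A sketch of that opponent-pool idea; play_game(p1, p2), assumed to return the board positions, moves and win/lose outcome z for p1, is a hypothetical helper, and the pool/snapshot mechanics are my own illustration.

import copy
import random

def self_play_round(current_policy, opponent_pool, play_game, n_games=128):
    # Play the current model against randomly picked earlier versions,
    # collecting (board positions, moves, win/lose z) for the next training step.
    games = [play_game(current_policy, random.choice(opponent_pool))
             for _ in range(n_games)]
    # Snapshot the current parameters so later versions can play against it too.
    opponent_pool.append(copy.deepcopy(current_policy))
    return games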
2. Board Evaluation
Adds a regression layer to the model
Predicts a value between 0 and 1
Close to 1: a good board position
Close to 0: a bad board position
Value Prediction Model (regression), built on top of the updated model (ver 1,000,000): board position → Win / Lose (0 / 1)
Training: regress the predicted value toward the actual game outcome (win = 1, lose = 0)
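A simplified PyTorch sketch of such a value network (board position in, one number between 0 and 1 out) and its regression training step; layer sizes and names are illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNet(nn.Module):
    # Convolutional trunk like the policy network, plus a regression head that
    # outputs one value in (0, 1): close to 1 = good position, close to 0 = bad.
    def __init__(self, in_planes=48, filters=192):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, filters, 5, padding=2), nn.ReLU(),
            nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(filters * 19 * 19, 1)

    def forward(self, s):                         # s: (batch, in_planes, 19, 19)
        return torch.sigmoid(self.fc(self.conv(s).flatten(1)))

def value_training_step(value_net, optimizer, boards, outcome):
    # Regress the prediction toward the actual result (win = 1, lose = 0).
    loss = F.mse_loss(value_net(boards).squeeze(1), outcome)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()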
Reducing Search Space
1. Reducing “action candidates” (Breadth Reduction): Policy Network
2. Board Evaluation (Depth Reduction): Value Network
Looking Ahead (Monte Carlo Tree Search)
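In the paper, the tree search repeatedly selects moves that maximize Q plus an exploration bonus proportional to the policy network's prior, expands leaves, evaluates them with the value network (and rollouts), and backs the result up the tree. Below is a minimal sketch of that selection rule only, with class and parameter names of my own choosing.

import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): the policy network's probability for this move
        self.visits = 0           # N(s, a): how often this move has been explored
        self.value_sum = 0.0      # accumulated evaluations backed up through this move

    def q(self):
        # Q(s, a): average of the value-network / rollout evaluations seen so far
        return self.value_sum / self.visits if self.visits else 0.0

def select_move(children, c_puct=5.0):
    # Pick the child maximizing Q + u, where the bonus u is large for moves the
    # policy network likes and shrinks as the move gets visited more often.
    total_visits = sum(child.visits for child in children.values())
    def score(item):
        move, child = item
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return child.q() + u
    return max(children.items(), key=score)[0]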
Results
AlphaGo
Lee Sedol 9-dan vs AlphaGo: energy consumption
A very, very tough calculation ;)
AlphaGo
Taking CPU / GPU resources to virtually infinity
Discussions
AlphaGo’s weakness: make the game state complicated
……
What is the next step?
Poker, Mahjong
                            Chess          RoboCup
Environment                 Static         Dynamic
State change                Turn taking    Real time
Information accessibility   Complete       Incomplete
Sensor readings             Symbolic       Non-symbolic
Control                     Central        Distributed