38 changes: 10 additions & 28 deletions README.md
@@ -1,36 +1,13 @@
# RL challenge

Your challenge is to learn to play [Flappy Bird](https://en.wikipedia.org/wiki/Flappy_Bird)!
This project is part of the Supaero reinforcement learning module. The objective is to play Flappy Bird using Deep Q-learning.
A similar project was done in class for a Breakout game, which provides an example of the overall method and of a MemoryBuffer class.

Flappy Bird is a side-scrolling game where the agent must successfully navigate through gaps between pipes. There are only two actions in this game: at each time step, either you click and the bird flaps, or you don't click and gravity plays its role.
# Method

There are three levels of difficulty in this challenge:
- Learn an optimal policy with hand-crafted features
- Learn an optimal policy with raw variables
- Learn an optimal policy from pixels.
This project uses a deep neural network to learn the Q-function, i.e., the maximum expected cumulative reward obtainable when taking a given action in a given state. Knowing this Q-function defines a policy: in each state, choose the action with the highest Q-value.
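As a rough sketch (illustrative names only; the actual inference code lives in `FlappyAgent.py` below), acting greedily with respect to the learned Q-values looks like this:

```python
import numpy as np

# Greedy policy derived from a learned Q-network.
# `model` is assumed to map a stack of four preprocessed 80x80 frames
# to one Q-value per action (flap / do nothing).
def greedy_policy(model, stacked_frames, actions=(119, None)):
    q_values = model.predict(np.expand_dims(stacked_frames, axis=0))  # shape (1, 2)
    return actions[int(np.argmax(q_values))]  # action with the highest predicted return
```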

# Your job

Your job is to:
<ol>
<li> fork the project at [https://github.com/SupaeroDataScience/RLchallenge](https://github.com/SupaeroDataScience/RLchallenge) on your own GitHub account (yes, you'll need one).
<li> rename the "RandomBird" folder to "YourLastName".
<li> modify 'FlappyAgent.py' in order to implement the function `FlappyPolicy(state,screen)` used in 'run.py'. You're free to add as many extra files as you need; however, you're not allowed to change 'run.py'.
<li> you are encouraged, however, to copy-paste the contents of 'run.py' as a basis for your learning algorithm.
<li> add any useful material (comments, text files, analysis, etc.)
<li> make a pull request on the original repository <i>when you're done</i> (please don't make a pull request before you think your work is ready to be merged on the original repository).
</ol>

**All the files you create must be placed inside the directory "YourLastName".**

`FlappyPolicy(state,screen)` takes both the game state and the screen as input. It gives you the choice of what you base your policy on:
<ul>
<li> If you use the state variables vector and perform some handcrafted feature engineering, you're playing in the "easy" league. If your agent reaches an average score of 15, you're sure to have a grade of at least 10/20 (possibly more if you implement smart stuff and/or provide a smart discussion).
<li> If you use the state variables vector without altering it (no feature engineering), you're playing in the "good job" league. If your agent reaches an average score of 15, you're sure to have at least 15/20 (possibly more if you implement smart stuff and/or provide a smart discussion).
<li> If your agent uses only the raw pixels from the image, you're playing in the "Deepmind" league. If your agent reaches an average score of 15, you're sure to have the maximum grade (plus possible additional benefits).
</ul>

Recall that the evaluation will start by running 'run.py' on our side, so 'FlappyPolicy' should call an already trained policy; otherwise we would be evaluating your agent during learning, which is not the goal. Of course, we will check your learning code, and we will greatly appreciate insightful comments and additional material (documentation, discussion, comparisons, perspectives, state of the art...).
The network takes raw pixels as input, preprocessed into 80x80 grayscale images and stacked over the last four frames. All transitions (state, action, reward, next state) are stored in a MemoryBuffer, which allows the network to be trained on randomly sampled transitions (experience replay). Exploration is epsilon-greedy, with epsilon decreasing linearly from 0.1 to 0.001 over the first 300,000 steps.
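For reference, one experience-replay update boils down to the standard Q-learning target. The sketch below reuses the variable names of `Training.py` (shown further down); `dqn` is the Keras model, `gamma` the discount factor, and the minibatch holds (states, actions, rewards, next states, terminals):

```python
X, A, R, Y, D = replay_memory.minibatch(32)
QY = dqn.predict(Y)                                 # Q-values of the next states
target = R + gamma * (1 - D) * QY.max(axis=1, keepdims=True)  # no bootstrap on terminal transitions
QX = dqn.predict(X)                                 # current predictions
QX[np.arange(len(A)), A.ravel()] = target.ravel()   # only the taken action's entry is changed
dqn.train_on_batch(x=X, y=QX)                       # regress Q towards the targets
```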

# Installation

@@ -47,3 +24,8 @@ cd PyGame-Learning-Environment/
pip install -e .
```
Note that this version of FlappyBird in PLE has been slightly changed to make the challenge a bit easier: the background is turned to plain black, and the bird and pipe colors are constant (red and green, respectively).

# See Flappy Bird fly

The "Training.py" is the script used to generate the neural network "DQN".
To test it, execute "run.py" which will launch 10 games and store the mean score and max score for each game.
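Assuming the dependencies above are installed, a typical workflow is to run `python Training.py` (from the folder containing the scripts) to train the network and produce the "DQN" file, then `python run.py` to evaluate the saved model over 10 games.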
9 changes: 0 additions & 9 deletions RandomBird/FlappyAgent.py

This file was deleted.

Binary file added Vivant/DQN
25 changes: 25 additions & 0 deletions Vivant/FlappyAgent.py
@@ -0,0 +1,25 @@
from keras.models import load_model
from collections import deque
import numpy as np
from skimage.transform import resize
from skimage.color import rgb2gray

def process_screen(x):
    # Crop the play area, convert to grayscale and resize to 80x80 (values in [0, 255])
    return 255 * resize(rgb2gray(x[60:, 25:310, :]), (80, 80))

model = load_model("DQN")      # trained network produced by Training.py
list_actions = [119, None]     # 119 = flap ('w' key), None = do nothing
# Rolling stack of the last 4 processed screens, initialized with blank frames
frames = deque([np.zeros((80, 80)), np.zeros((80, 80)), np.zeros((80, 80)), np.zeros((80, 80))], maxlen=4)

def FlappyPolicy(state, screen):
    # Preprocess the current screen and push it into the frame stack
    screen = process_screen(screen)
    frames.append(screen)
    frameStacked = np.stack(frames, axis=-1)
    # Greedy action with respect to the predicted Q-values
    action = list_actions[np.argmax(model.predict(np.expand_dims(frameStacked, axis=0)))]
    return action



185 changes: 185 additions & 0 deletions Vivant/Training.py
@@ -0,0 +1,185 @@
from ple.games.flappybird import FlappyBird
from ple import PLE

import numpy as np

from collections import deque

from skimage.color import rgb2gray
from skimage.transform import resize

from keras.optimizers import Adam
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten

class MemoryBuffer:
    "An experience replay buffer using numpy arrays"
    def __init__(self, length, screen_shape, action_shape):
        self.length = length
        self.screen_shape = screen_shape
        self.action_shape = action_shape
        shape = (length,) + screen_shape
        self.screens_x = np.zeros(shape, dtype=np.uint8) # starting states
        self.screens_y = np.zeros(shape, dtype=np.uint8) # resulting states
        shape = (length,) + action_shape
        self.actions = np.zeros(shape, dtype=np.uint8) # actions
        self.rewards = np.zeros((length,1), dtype=np.float32) # rewards (float: clipped rewards can be 0.1)
        self.terminals = np.zeros((length,1), dtype=bool) # true if resulting state is terminal
        self.terminals[-1] = True
        self.index = 0 # points one position past the last inserted element
        self.size = 0 # current size of the buffer

    def append(self, screenx, a, r, screeny, d):
        # Store one transition (s, a, r, s', done) at the current write position
        self.screens_x[self.index] = screenx
        self.actions[self.index] = a
        self.rewards[self.index] = r
        self.screens_y[self.index] = screeny
        self.terminals[self.index] = d
        self.index = (self.index+1) % self.length
        self.size = np.min([self.size+1, self.length])

    def stacked_frames_x(self, index):
        # Stack the 4 most recent starting screens ending at `index`,
        # without crossing an episode boundary (terminal flag)
        im_deque = deque(maxlen=4)
        pos = index % self.length
        for i in range(4):
            im = self.screens_x[pos]
            im_deque.appendleft(im)
            test_pos = (pos-1) % self.length
            if self.terminals[test_pos] == False:
                pos = test_pos
        return np.stack(im_deque, axis=-1)

    def stacked_frames_y(self, index):
        # Same as stacked_frames_x, but for the resulting screens
        im_deque = deque(maxlen=4)
        pos = index % self.length
        for i in range(4):
            im = self.screens_y[pos]
            im_deque.appendleft(im)
            test_pos = (pos-1) % self.length
            if self.terminals[test_pos] == False:
                pos = test_pos
        return np.stack(im_deque, axis=-1)

    def minibatch(self, size):
        # Sample `size` random transitions and return stacked frames for states and next states
        indices = np.random.choice(self.size, size=size, replace=False)
        x = np.zeros((size,) + self.screen_shape + (4,))
        y = np.zeros((size,) + self.screen_shape + (4,))
        for i in range(size):
            x[i] = self.stacked_frames_x(indices[i])
            y[i] = self.stacked_frames_y(indices[i])
        return x, self.actions[indices], self.rewards[indices], y, self.terminals[indices]

def build_dqn():
    # Convolutional network mapping an 80x80x4 frame stack to 2 Q-values
    dqn = Sequential()
    # 1st layer
    dqn.add(Conv2D(filters=16, kernel_size=(8,8), strides=4, activation="relu", input_shape=(80,80,4)))
    # 2nd layer
    dqn.add(Conv2D(filters=32, kernel_size=(4,4), strides=2, activation="relu"))
    dqn.add(Flatten())
    # 3rd layer
    dqn.add(Dense(units=256, activation="relu"))
    # output layer: one Q-value per action
    dqn.add(Dense(units=2, activation="linear"))
    adam = Adam(lr=1e-4)
    dqn.compile(optimizer=adam, loss="mean_squared_error")
    return dqn

# Epsilon decays linearly from 0.1 to 0.001 over the first 300,000 steps, then stays at 0.001
def epsilon(step):
    if step < 300000:
        return 0.1 - (0.1-0.001)/300000 * step
    return 0.001

# Reward clipping: keep 1 when a pipe is passed, map everything else (survival step or collision) to 0.1
def clip_reward(r):
    if r != 1:
        rr = 0.1
    else:
        rr = r
    return rr

def greedy_action(network, x):
    # Index of the action with the highest predicted Q-value for state x
    Q = network.predict(np.array([x]))
    return np.argmax(Q)

def process_screen(x):
    # Crop the play area, convert to grayscale and resize to 80x80 (values in [0, 255])
    return 255 * resize(rgb2gray(x[60:, 25:310, :]), (80, 80))

game = FlappyBird(graphics="fixed") # use "fancy" for full background, random bird color and random pipe color, use "fixed" (default) for black background and constant bird and pipe colors.
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
p.init()

dqn = build_dqn()

# Training hyperparameters
total_steps = 300000
replay_memory_size = 300000
mini_batch_size = 32
gamma = 0.99              # discount factor

# initialize state and replay memory
p.reset_game()
screen_x = process_screen(p.getScreenRGB())
stacked_x = deque([screen_x, screen_x, screen_x, screen_x], maxlen=4)
x = np.stack(stacked_x, axis=-1)

replay_memory = MemoryBuffer(replay_memory_size, (80, 80), (1,))
evaluation_period = 3200  # number of warm-up steps before training on minibatches starts

list_actions = p.getActionSet()

# Deep Q-learning with experience replay
for step in range(total_steps):

    # Epsilon-greedy action selection
    if np.random.rand() < epsilon(step):
        # Biased random exploration: action index 1 with probability 1/5, index 0 otherwise
        a_test = np.random.randint(0, 5)
        if a_test == 1:
            a = 1
        else:
            a = 0
    else:
        a = greedy_action(dqn, x)

    # Act in the environment and observe the clipped reward and next screen
    action = list_actions[a]
    r = clip_reward(p.act(action))
    raw_screen_y = p.getScreenRGB()
    screen_y = process_screen(raw_screen_y)

    d = p.game_over()

    # Store the transition in the replay buffer
    replay_memory.append(screen_x, a, r, screen_y, d)

    # Train on a random minibatch once the buffer has collected enough transitions
    if step > evaluation_period:
        X, A, R, Y, D = replay_memory.minibatch(mini_batch_size)
        QY = dqn.predict(Y)
        QYmax = QY.max(1).reshape((mini_batch_size, 1))
        update = R + gamma * (1-D) * QYmax            # Q-learning target (no bootstrap on terminal states)
        QX = dqn.predict(X)
        QX[np.arange(mini_batch_size), A.ravel()] = update.ravel()
        dqn.train_on_batch(x=X, y=QX)

    if d:
        # Episode over: reset the game and the frame stack
        p.reset_game()
        screen_x = process_screen(p.getScreenRGB())
        stacked_x = deque([screen_x, screen_x, screen_x, screen_x], maxlen=4)
        x = np.stack(stacked_x, axis=-1)
    else:
        # Slide the frame stack forward
        screen_x = screen_y
        stacked_x.append(screen_x)
        x = np.stack(stacked_x, axis=-1)

    # Periodically save the model
    if step % 100000 == 0:
        dqn.save("DQN")

# Save the final model (the in-loop save only triggers every 100000 steps)
dqn.save("DQN")
7 changes: 5 additions & 2 deletions RandomBird/run.py → Vivant/run.py
@@ -5,13 +5,13 @@
from FlappyAgent import FlappyPolicy

game = FlappyBird(graphics="fixed") # use "fancy" for full background, random bird color and random pipe color, use "fixed" (default) for black background and constant bird and pipe colors.
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=False, display_screen=True)
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
# Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for learning, just for display purposes.

p.init()
reward = 0.0

nb_games = 100
nb_games = 10
cumulated = np.zeros((nb_games))

for i in range(nb_games):
@@ -27,3 +27,6 @@

average_score = np.mean(cumulated)
max_score = np.max(cumulated)

print("Average score:", average_score)
print("Max score:", max_score)