38 changes: 10 additions & 28 deletions README.md
@@ -1,36 +1,13 @@
# RL challenge

Your challenge is to learn to play [Flappy Bird](https://en.wikipedia.org/wiki/Flappy_Bird)!
This project is part of the Supaero reinforcement learning module. The objective is to play Flappy Bird using Deep Q-learning.
A similar project was done in class for a Breakout game, which provides an example of the overall method and of a MemoryBuffer class.

Flappy Bird is a side-scrolling game where the agent must successfully navigate through gaps between pipes. There are only two actions in this game: at each time step, either you click and the bird flaps, or you don't click and gravity plays its role.
# Method

There are three levels of difficulty in this challenge:
- Learn an optimal policy with hand-crafted features
- Learn an optimal policy with raw variables
- Learn an optimal policy from pixels.
This project uses a deep neural network to learn the Q-function, i.e., the maximum expected cumulative reward obtainable when taking a given action in a given state. Knowing this Q-function defines a policy: in each state, choose the action with the highest Q-value.
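As a rough sketch (illustrative names only; the actual inference code lives in `FlappyAgent.py` below), acting greedily with respect to the learned Q-values looks like this:

```python
import numpy as np

# Greedy policy derived from a learned Q-network.
# `model` is assumed to map a stack of four preprocessed 80x80 frames
# to one Q-value per action (flap / do nothing).
def greedy_policy(model, stacked_frames, actions=(119, None)):
    q_values = model.predict(np.expand_dims(stacked_frames, axis=0))  # shape (1, 2)
    return actions[int(np.argmax(q_values))]  # action with the highest predicted return
```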

# Your job

Your job is to:
<ol>
<li> fork the project at [https://github.com/SupaeroDataScience/RLchallenge](https://github.com/SupaeroDataScience/RLchallenge) on your own GitHub account (yes, you'll need one).
<li> rename the "RandomBird" folder to "YourLastName".
<li> modify 'FlappyAgent.py' in order to implement the function `FlappyPolicy(state,screen)` used in 'run.py'. You're free to add as many extra files as you need; however, you're not allowed to change 'run.py'.
<li> you are encouraged, however, to copy-paste the contents of 'run.py' as a basis for your learning algorithm.
<li> add any useful material (comments, text files, analysis, etc.)
<li> make a pull request on the original repository <i>when you're done</i> (please don't make a pull request before you think your work is ready to be merged on the original repository).
</ol>

**All the files you create must be placed inside the directory "YourLastName".**

`FlappyPolicy(state,screen)` takes both the game state and the screen as input. It gives you the choice of what you base your policy on:
<ul>
<li> If you use the state variables vector and perform some handcrafted feature engineering, you're playing in the "easy" league. If your agent reaches an average score of 15, you're sure to have a grade of at least 10/20 (possibly more if you implement smart stuff and/or provide a smart discussion).
<li> If you use the state variables vector without altering it (no feature engineering), you're playing in the "good job" league. If your agent reaches an average score of 15, you're sure to have at least 15/20 (possibly more if you implement smart stuff and/or provide a smart discussion).
<li> If your agent uses only the raw pixels from the image, you're playing in the "Deepmind" league. If your agent reaches an average score of 15, you're sure to have the maximum grade (plus possible additional benefits).
</ul>

Recall that the evaluation will start by running 'run.py' on our side, so 'FlappyPolicy' should call an already trained policy; otherwise we would be evaluating your agent during learning, which is not the goal. Of course, we will check your learning code, and we will greatly appreciate insightful comments and additional material (documentation, discussion, comparisons, perspectives, state of the art...).
The network takes raw pixels as input, preprocessed into 80x80 grayscale images and stacked over the last four frames. All transitions (state, action, reward, next state) are stored in a MemoryBuffer, which allows the network to be trained on randomly sampled transitions (experience replay). Exploration is epsilon-greedy, with epsilon decreasing linearly from 0.1 to 0.001 over the first 300,000 steps.
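For reference, one experience-replay update boils down to the standard Q-learning target. The sketch below reuses the variable names of `Training.py` (shown further down); `dqn` is the Keras model, `gamma` the discount factor, and the minibatch holds (states, actions, rewards, next states, terminals):

```python
X, A, R, Y, D = replay_memory.minibatch(32)
QY = dqn.predict(Y)                                 # Q-values of the next states
target = R + gamma * (1 - D) * QY.max(axis=1, keepdims=True)  # no bootstrap on terminal transitions
QX = dqn.predict(X)                                 # current predictions
QX[np.arange(len(A)), A.ravel()] = target.ravel()   # only the taken action's entry is changed
dqn.train_on_batch(x=X, y=QX)                       # regress Q towards the targets
```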

# Installation

@@ -47,3 +24,8 @@ cd PyGame-Learning-Environment/
pip install -e .
```
Note that this version of FlappyBird in PLE has been slightly changed to make the challenge a bit easier: the background is turned to plain black, and the bird and pipe colors are constant (red and green, respectively).

# See Flappy Bird fly

The "Training.py" is the script used to generate the neural network "DQN".
To test it, execute "run.py" which will launch 10 games and store the mean score and max score for each game.
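Assuming the dependencies above are installed, a typical workflow is to run `python Training.py` (from the folder containing the scripts) to train the network and produce the "DQN" file, then `python run.py` to evaluate the saved model over 10 games.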
9 changes: 0 additions & 9 deletions RandomBird/FlappyAgent.py

This file was deleted.

Binary file added Vivant/DQN
25 changes: 25 additions & 0 deletions Vivant/FlappyAgent.py
@@ -0,0 +1,25 @@
from keras.models import load_model
from collections import deque
import numpy as np
from skimage.transform import resize
from skimage.color import rgb2gray

def process_screen(x):
    # Crop the play area, convert to grayscale and resize to 80x80 (values in [0, 255])
    return 255 * resize(rgb2gray(x[60:, 25:310, :]), (80, 80))

model = load_model("DQN")      # trained network produced by Training.py
list_actions = [119, None]     # 119 = flap ('w' key), None = do nothing
# Rolling stack of the last 4 processed screens, initialized with blank frames
frames = deque([np.zeros((80, 80)), np.zeros((80, 80)), np.zeros((80, 80)), np.zeros((80, 80))], maxlen=4)

def FlappyPolicy(state, screen):
    # Preprocess the current screen and push it into the frame stack
    screen = process_screen(screen)
    frames.append(screen)
    frameStacked = np.stack(frames, axis=-1)
    # Greedy action with respect to the predicted Q-values
    action = list_actions[np.argmax(model.predict(np.expand_dims(frameStacked, axis=0)))]
    return action



185 changes: 185 additions & 0 deletions Vivant/Training.py
@@ -0,0 +1,185 @@
from ple.games.flappybird import FlappyBird
from ple import PLE

import numpy as np

from collections import deque

from skimage.color import rgb2gray
from skimage.transform import resize

from keras.optimizers import Adam
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten

class MemoryBuffer:
    "An experience replay buffer using numpy arrays"
    def __init__(self, length, screen_shape, action_shape):
        self.length = length
        self.screen_shape = screen_shape
        self.action_shape = action_shape
        shape = (length,) + screen_shape
        self.screens_x = np.zeros(shape, dtype=np.uint8) # starting states
        self.screens_y = np.zeros(shape, dtype=np.uint8) # resulting states
        shape = (length,) + action_shape
        self.actions = np.zeros(shape, dtype=np.uint8) # actions
        self.rewards = np.zeros((length,1), dtype=np.float32) # rewards (float: clipped rewards can be 0.1)
        self.terminals = np.zeros((length,1), dtype=bool) # true if resulting state is terminal
        self.terminals[-1] = True
        self.index = 0 # points one position past the last inserted element
        self.size = 0 # current size of the buffer

    def append(self, screenx, a, r, screeny, d):
        # Store one transition (s, a, r, s', done) at the current write position
        self.screens_x[self.index] = screenx
        self.actions[self.index] = a
        self.rewards[self.index] = r
        self.screens_y[self.index] = screeny
        self.terminals[self.index] = d
        self.index = (self.index+1) % self.length
        self.size = np.min([self.size+1, self.length])

    def stacked_frames_x(self, index):
        # Stack the 4 most recent starting screens ending at `index`,
        # without crossing an episode boundary (terminal flag)
        im_deque = deque(maxlen=4)
        pos = index % self.length
        for i in range(4):
            im = self.screens_x[pos]
            im_deque.appendleft(im)
            test_pos = (pos-1) % self.length
            if self.terminals[test_pos] == False:
                pos = test_pos
        return np.stack(im_deque, axis=-1)

    def stacked_frames_y(self, index):
        # Same as stacked_frames_x, but for the resulting screens
        im_deque = deque(maxlen=4)
        pos = index % self.length
        for i in range(4):
            im = self.screens_y[pos]
            im_deque.appendleft(im)
            test_pos = (pos-1) % self.length
            if self.terminals[test_pos] == False:
                pos = test_pos
        return np.stack(im_deque, axis=-1)

    def minibatch(self, size):
        # Sample `size` random transitions and return stacked frames for states and next states
        indices = np.random.choice(self.size, size=size, replace=False)
        x = np.zeros((size,) + self.screen_shape + (4,))
        y = np.zeros((size,) + self.screen_shape + (4,))
        for i in range(size):
            x[i] = self.stacked_frames_x(indices[i])
            y[i] = self.stacked_frames_y(indices[i])
        return x, self.actions[indices], self.rewards[indices], y, self.terminals[indices]

def build_dqn():
    # Convolutional network mapping an 80x80x4 frame stack to 2 Q-values
    dqn = Sequential()
    # 1st layer
    dqn.add(Conv2D(filters=16, kernel_size=(8,8), strides=4, activation="relu", input_shape=(80,80,4)))
    # 2nd layer
    dqn.add(Conv2D(filters=32, kernel_size=(4,4), strides=2, activation="relu"))
    dqn.add(Flatten())
    # 3rd layer
    dqn.add(Dense(units=256, activation="relu"))
    # output layer: one Q-value per action
    dqn.add(Dense(units=2, activation="linear"))
    adam = Adam(lr=1e-4)
    dqn.compile(optimizer=adam, loss="mean_squared_error")
    return dqn

# Epsilon decays linearly from 0.1 to 0.001 over the first 300,000 steps, then stays at 0.001
def epsilon(step):
    if step < 300000:
        return 0.1 - (0.1-0.001)/300000 * step
    return 0.001

# Reward clipping: keep 1 when a pipe is passed, map everything else (survival step or collision) to 0.1
def clip_reward(r):
    if r != 1:
        rr = 0.1
    else:
        rr = r
    return rr

def greedy_action(network, x):
    # Index of the action with the highest predicted Q-value for state x
    Q = network.predict(np.array([x]))
    return np.argmax(Q)

def process_screen(x):
    # Crop the play area, convert to grayscale and resize to 80x80 (values in [0, 255])
    return 255 * resize(rgb2gray(x[60:, 25:310, :]), (80, 80))

game = FlappyBird(graphics="fixed") # use "fancy" for full background, random bird color and random pipe color, use "fixed" (default) for black background and constant bird and pipe colors.
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
p.init()

dqn = build_dqn()

# Training hyperparameters
total_steps = 300000
replay_memory_size = 300000
mini_batch_size = 32
gamma = 0.99              # discount factor

# initialize state and replay memory
p.reset_game()
screen_x = process_screen(p.getScreenRGB())
stacked_x = deque([screen_x, screen_x, screen_x, screen_x], maxlen=4)
x = np.stack(stacked_x, axis=-1)

replay_memory = MemoryBuffer(replay_memory_size, (80, 80), (1,))
evaluation_period = 3200  # number of warm-up steps before training on minibatches starts

list_actions = p.getActionSet()

# Deep Q-learning with experience replay
for step in range(total_steps):

    # Epsilon-greedy action selection
    if np.random.rand() < epsilon(step):
        # Biased random exploration: action index 1 with probability 1/5, index 0 otherwise
        a_test = np.random.randint(0, 5)
        if a_test == 1:
            a = 1
        else:
            a = 0
    else:
        a = greedy_action(dqn, x)

    # Act in the environment and observe the clipped reward and next screen
    action = list_actions[a]
    r = clip_reward(p.act(action))
    raw_screen_y = p.getScreenRGB()
    screen_y = process_screen(raw_screen_y)

    d = p.game_over()

    # Store the transition in the replay buffer
    replay_memory.append(screen_x, a, r, screen_y, d)

    # Train on a random minibatch once the buffer has collected enough transitions
    if step > evaluation_period:
        X, A, R, Y, D = replay_memory.minibatch(mini_batch_size)
        QY = dqn.predict(Y)
        QYmax = QY.max(1).reshape((mini_batch_size, 1))
        update = R + gamma * (1-D) * QYmax            # Q-learning target (no bootstrap on terminal states)
        QX = dqn.predict(X)
        QX[np.arange(mini_batch_size), A.ravel()] = update.ravel()
        dqn.train_on_batch(x=X, y=QX)

    if d:
        # Episode over: reset the game and the frame stack
        p.reset_game()
        screen_x = process_screen(p.getScreenRGB())
        stacked_x = deque([screen_x, screen_x, screen_x, screen_x], maxlen=4)
        x = np.stack(stacked_x, axis=-1)
    else:
        # Slide the frame stack forward
        screen_x = screen_y
        stacked_x.append(screen_x)
        x = np.stack(stacked_x, axis=-1)

    # Periodically save the model
    if step % 100000 == 0:
        dqn.save("DQN")

# Save the final model (the in-loop save only triggers every 100000 steps)
dqn.save("DQN")
7 changes: 5 additions & 2 deletions RandomBird/run.py → Vivant/run.py
@@ -5,13 +5,13 @@
from FlappyAgent import FlappyPolicy

game = FlappyBird(graphics="fixed") # use "fancy" for full background, random bird color and random pipe color, use "fixed" (default) for black background and constant bird and pipe colors.
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=False, display_screen=True)
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
# Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for learning, just for display purposes.

p.init()
reward = 0.0

nb_games = 100
nb_games = 10
cumulated = np.zeros((nb_games))

for i in range(nb_games):
@@ -27,3 +27,6 @@

average_score = np.mean(cumulated)
max_score = np.max(cumulated)

print("Average score:", average_score)
print("Max score:", max_score)