[Full paper translation + supplementary explanations] DQN, Playing Atari with Deep Reinforcement Learning
+ 이나스AI offers a variety of lectures on artificial intelligence.
- Object detection, Text To Speech, Reinforcement learning, OCR, Chatbot, Time series prediction,
- Meta learning, eXplainable AI, Transformer/BERT/GPT, Graph-based ML, etc.
- The lectures explain how AI algorithms work and how they are structured, clearly and accurately, based on source code, diagrams, and intermediate data.
+ We also run a live-streamed "study with me" channel that stays on while we work.
Square brackets [] contain the original English expressions, kept to make the meaning precise;
parentheses () contain supplementary explanations written in the original paper;
angle brackets <> contain supplementary explanations added by the translator to aid understanding.
This post is continually updated with better phrasing and comments. Last revised 2022-10-04 (Tue).
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Abstract
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
1 Introduction
Learning to control agents directly from high-dimensional sensory inputs such as vision and speech is one of the long-standing challenges of reinforcement learning (RL). Most successful RL applications operating in these domains rely on hand-crafted features combined with linear value functions or policy representations. Clearly, the performance of such systems depends strongly on the quality of the feature representation.
Recent advances in deep learning have made it possible to extract high-level features from raw sensory data, leading to breakthroughs in computer vision and speech recognition.
These methods utilise a range of neural network architectures, including convolutional networks, multilayer perceptrons, restricted Boltzmann machines and recurrent neural networks, and exploit both supervised and unsupervised learning. It seems natural to ask whether similar techniques could also be beneficial for RL with sensory data.
However, reinforcement learning presents several challenges from a deep learning perspective.
First, most successful deep learning applications to date have required large amounts of hand-labelled training data.
RL algorithms, on the other hand, must often learn from a scalar reward signal that is sparse, noisy and delayed.
The delay between actions and the resulting rewards can be thousands of timesteps long, which seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning.
Another issue is that most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper demonstrates that a convolutional neural network can overcome these challenges and learn successful control policies from raw video data in complex RL environments. The network is trained with a variant of the Q-learning algorithm, with stochastic gradient descent used to update the weights.
To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism, which randomly samples previous transitions and thereby smooths the training distribution over many past behaviours.
We apply our approach to Atari 2600 games implemented in The Arcade Learning Environment (ALE).
Atari 2600 is a challenging RL testbed that presents agents with a high-dimensional visual input (210 × 160 RGB video at 60Hz) and a diverse and interesting set of tasks that are difficult even for human players.
Our goal is to create a single neural network agent that is able to learn to play as many of the games as possible.
The network was not provided with any game-specific information or hand-designed visual features; it was trained on nothing but the video input, the reward and terminal signals, and the set of possible actions, just as a human player would be.
Furthermore, the network architecture and all hyperparameters used for training were kept constant across the games.
Our network outperforms all previous RL algorithms on six of the seven games and surpasses an expert human player on three of them. Figure 1 provides sample screenshots from five of the games used for training.
Figure 1: Screen shots from five Atari 2600 games: (left to right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider
2 Background
We consider tasks in which an agent interacts with an environment \mathcal{E}, in this case the Atari emulator, in a sequence of actions, observations and rewards.
At each time-step the agent selects an action a_t from the set of legal game actions, \mathcal{A}=\{1,\dots,K\}.
The action is passed to the emulator and modifies its internal state and the game score. In general, the environment \mathcal{E} may be stochastic.
The emulator's internal state is not observed by the agent; instead, the agent observes an image x_t\in\mathbb{R}^{d} from the emulator, which is a vector of raw pixel values representing the current screen.
In addition, the agent receives a reward r_t representing the change in game score <a larger reward corresponds to a larger increase in the game score>. Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.
Since the agent only observes images of the current screen, the task is partially observed: it is impossible to fully understand the current situation from the current screen x_t alone.
We therefore consider sequences of actions and observations, s_t = x_1,a_1,x_2,\dots,a_{t-1},x_t, and learn game strategies that depend upon these sequences.
All sequences in the emulator are assumed to terminate in a finite number of time-steps.
This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state.
As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence s_t as the state representation at time t.
The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. We make the standard assumption that future rewards are discounted by a factor of \gamma per time-step, and define the future discounted return at time t as R_t = \sum\limits_{t'=t}^{T} \gamma^{t'-t}r_{t'}, where T is the time-step at which the game terminates.
We define the optimal action-value function Q^{*}(s,a) as the maximum expected return achievable by following any strategy, after seeing some sequence s and then taking some action a:
Q^{*}(s,a)=\max_{\pi} \mathbb{E}[R_t\,|\,s_t=s,a_t=a,\pi]
where \pi is a policy mapping sequences to actions (or to distributions over actions).
The optimal action-value function obeys an important identity known as the Bellman equation.
It is based on the following intuition: if the optimal value Q^{*}(s',a') of the sequence s' at the next time-step were known for all possible actions a', then the optimal strategy is to select the action a' maximising the expected value of r+\gamma Q^{*}(s',a'):
Q^{*}(s,a)=\mathbb{E}_{s'\sim \mathcal{E}}[r+\gamma\max_{a'}Q^{*}(s',a')\,|\,s,a]\;\;\;\;\;(1)
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update,
Q_{i+1}(s,a)=\mathbb{E}[r+\gamma\max_{a'}Q_{i}(s',a')\,|\,s,a]
Such value iteration algorithms converge to the optimal action-value function: Q_i\rightarrow Q^{*} as i\rightarrow \infty.
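<To make the iterative Bellman update concrete, here is a minimal NumPy sketch of tabular Q-value iteration on a tiny, made-up MDP; the transition probabilities and rewards below are illustrative assumptions, not anything from the paper.>

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (purely illustrative).
n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.6, 0.4], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])

Q = np.zeros((n_states, n_actions))
for i in range(1000):
    # Bellman backup: Q_{i+1}(s,a) = E[r + gamma * max_a' Q_i(s',a')]
    Q_next = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_next - Q)) < 1e-8:   # converged: Q_i -> Q*
        break
    Q = Q_next
print(Q)
```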
In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, without any generalisation.
Instead, it is common to use a function approximator to estimate the action-value function, Q(s,a;\theta)\approx Q^{*}(s,a).
In the reinforcement learning community this is typically a linear function approximator, but sometimes a non-linear function approximator is used instead, such as a neural network.
We refer to a neural network function approximator with weights \theta as a Q-network.
A Q-network can be trained by minimising a sequence of loss functions L_i(\theta_i) that changes at each iteration i,
L_i(\theta_i) = \mathbb{E}_{s,a\sim \rho(\cdot)}[(y_i-Q(s,a;\theta_i))^2] \;\;\;\;\; (2)
where y_i = \mathbb{E}_{s'\sim \mathcal{E}}[r+\gamma \max_{a'}Q(s',a';\theta_{i-1})\,|\,s,a] is the target for iteration i and \rho(s,a) is a probability distribution over sequences s and actions a that we refer to as the behaviour distribution. The parameters from the previous iteration, \theta_{i-1}, are held fixed when optimising the loss function L_i(\theta_i).
Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. Differentiating the loss function with respect to the weights, we arrive at the following gradient:
\nabla_{\theta_i}L_{i}(\theta_i) = \mathbb{E}_{s,a\sim \rho(\cdot);\,s'\sim \mathcal{E}} \left[ \left(r+\gamma\max_{a'}Q(s',a';\theta_{i-1})-Q(s,a;\theta_i)\right)\nabla_{\theta_i}Q(s,a;\theta_i) \right] \;\;\;\;\; (3)
Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimise the loss function by stochastic gradient descent.
If the weights are updated after every time-step, and the expectations are replaced by single samples from the behaviour distribution \rho and the emulator \mathcal{E} respectively, then we arrive at the familiar Q-learning algorithm.
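<A minimal sketch of this single-sample update, assuming a simple linear approximator Q(s,a;\theta)=\theta_a\cdot\phi(s); the feature vectors and the sampled transition below are made-up placeholders.>

```python
import numpy as np

n_actions, n_features, gamma, lr = 4, 8, 0.99, 0.01
theta = np.zeros((n_actions, n_features))        # current parameters  theta_i
theta_old = theta.copy()                         # frozen parameters   theta_{i-1}

def q_values(features, params):
    """Q(s, a; params) for all actions, with a linear approximator."""
    return params @ features

# One sampled transition (s, a, r, s') -- placeholder random features here.
rng = np.random.default_rng(0)
phi_s, phi_s_next = rng.normal(size=n_features), rng.normal(size=n_features)
a, r, terminal = 2, 1.0, False

# Sample-based target  y = r + gamma * max_a' Q(s', a'; theta_{i-1})
y = r if terminal else r + gamma * np.max(q_values(phi_s_next, theta_old))
# Gradient step on (y - Q(s, a; theta_i))^2 w.r.t. theta_i (equation 3 with one sample)
td_error = y - q_values(phi_s, theta)[a]
theta[a] += lr * td_error * phi_s                # d/dtheta_a Q(s,a;theta) = phi(s)
```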
Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator \mathcal{E}, without explicitly constructing an estimate of \mathcal{E}.
It is also off-policy: it learns about the greedy strategy while following a behaviour distribution that ensures adequate exploration of the state space.
In practice, the behaviour distribution is often selected by an \epsilon-greedy strategy that follows the greedy strategy with probability 1-\epsilon and selects a random action with probability \epsilon.
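<For example, \epsilon-greedy action selection over the Q-values of a state can be sketched as follows; a minimal illustration, not the paper's code.>

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action

action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1)
```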
3 Related Work
Perhaps the best-known success story of reinforcement learning is TD-gammon, a backgammon-playing program which learnt entirely by reinforcement learning and self-play, and achieved a super-human level of play. TD-gammon used a model-free reinforcement learning algorithm similar to Q-learning, and approximated the value function using a multi-layer perceptron with one hidden layer. However, early attempts to follow up on TD-gammon in chess, Go and checkers were less successful.
This led to a widespread belief that the TD-gammon approach was a special case that only worked in backgammon, perhaps because the stochasticity in the dice rolls helps explore the state space and also makes the value function particularly smooth [19]. Furthermore, it was shown that combining model-free reinforcement learning algorithms such as Q-learning with non-linear function approximators [25], or indeed with off-policy learning [1], could cause the Q-network to diverge.
Subsequently, the majority of work in reinforcement learning focused on linear function approximators with better convergence guarantees [25]. More recently, there has been a revival of interest in combining deep learning with reinforcement learning. Deep neural networks have been used to estimate the environment \mathcal{E}; restricted Boltzmann machines have been used to estimate the value function [21] or the policy [9]. In addition, the divergence issues with Q-learning have been partially addressed by gradient temporal-difference methods.
These methods are proven to converge when evaluating a fixed policy with a nonlinear function approximator [14], or when learning a control policy with linear function approximation using a restricted variant of Q-learning [15]. However, these methods have not yet been extended to nonlinear control.
Perhaps the most similar prior work to our own approach is neural fitted Q-learning (NFQ) [20]. NFQ optimises the sequence of loss functions in Equation 2, using the RPROP algorithm to update the parameters of the Q-network. However, it uses a batch update that has a computational cost per iteration that is proportional to the size of the data set, whereas we consider stochastic gradient updates that have a low constant cost per iteration and scale to large data-sets. NFQ has also been successfully applied to simple real-world control tasks using purely visual input, by first using deep autoencoders to learn a low dimensional representation of the task, and then applying NFQ to this representation [12]. In contrast, our approach applies reinforcement learning end-to-end, directly from the visual inputs; as a result it may learn features that are directly relevant to discriminating action-values. Q-learning has also previously been combined with experience replay and a simple neural network [13], but again starting with a low-dimensional state rather than raw visual inputs.
The use of the Atari 2600 emulator as a reinforcement learning platform was introduced by [3], who applied standard reinforcement learning algorithms with linear function approximation and generic visual features. Subsequently, results were improved by using a larger number of features, and using tug-of-war hashing to randomly project the features into a lower-dimensional space [2]. The HyperNEAT evolutionary architecture [8] has also been applied to the Atari platform, where it was used to evolve (separately, for each distinct game) a neural network representing a strategy for that game. When trained repeatedly against deterministic sequences using the emulator's reset facility, these strategies were able to exploit design flaws in several Atari games.
4 Deep Reinforcement Learning
Recent breakthroughs in computer vision and speech recognition have relied on efficiently training deep neural networks on very large training sets. The most successful approaches are trained directly from the raw inputs, using lightweight updates based on stochastic gradient descent. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [11]. These successes motivate our approach to reinforcement learning. Our goal is to connect a reinforcement learning algorithm to a deep neural network which operates directly on RGB images and efficiently processes training data by using stochastic gradient updates.
Tesauro's TD-Gammon architecture provides a starting point for such an approach. This architecture updates the parameters of a network that estimates the value function, directly from on-policy samples of experience, s_t, a_t, r_t, s_{t+1}, a_{t+1}, drawn from the algorithm's interactions with the environment (or by self-play, in the case of backgammon). Since this approach was able to outperform the best human backgammon players 20 years ago, it is natural to wonder whether two decades of hardware improvements, coupled with modern deep neural network architectures and scalable RL algorithms, might produce significant progress.
In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay [13] where we store the agent's experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, e \sim D, drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to an \epsilon-greedy policy. Since using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed-length representation of histories produced by a function \phi. The full algorithm, which we call deep Q-learning, is presented in Algorithm 1. <A rough Python sketch of this loop is given below.>
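<The following is a minimal PyTorch sketch of deep Q-learning with experience replay. It uses gymnasium's CartPole and a small fully-connected network as stand-ins for the Atari emulator and the convolutional Q-network, treats the raw observation as the output of \phi, and uses arbitrary hyperparameters; it is an illustration of the loop structure, not the paper's implementation.>

```python
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

gamma, epsilon, batch_size, memory_size = 0.99, 0.1, 32, 10_000
env = gym.make("CartPole-v1")                       # stand-in for the Atari emulator
n_obs, n_actions = env.observation_space.shape[0], env.action_space.n

q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
replay = deque(maxlen=memory_size)                  # replay memory D

for episode in range(200):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection on Q(phi(s_t), a; theta)
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay.append((state, action, reward, next_state, terminated))  # store e_t in D
        state = next_state

        if len(replay) >= batch_size:
            # sample a random minibatch of transitions e_j ~ D
            s, a, r, s2, term = zip(*random.sample(replay, batch_size))
            s = torch.as_tensor(np.array(s), dtype=torch.float32)
            s2 = torch.as_tensor(np.array(s2), dtype=torch.float32)
            a = torch.as_tensor(a)
            r = torch.as_tensor(r, dtype=torch.float32)
            nonterminal = ~torch.as_tensor(term)
            # y_j = r_j for terminal s2, else r_j + gamma * max_a' Q(s2, a'; theta)
            with torch.no_grad():
                y = r + gamma * q_net(s2).max(1).values * nonterminal
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((y - q_sa) ** 2).mean()         # gradient step on (y_j - Q(s_j, a_j; theta))^2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```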
This approach has several advantages over standard online Q-learning [23]. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically [25]. By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates. This approach is in some respects limited since the memory buffer does not differentiate important transitions and always overwrites with recent transitions due to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping [17].
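<A bounded replay memory with uniform sampling can be as simple as the following sketch; the capacity and field names are illustrative.>

```python
import random
from collections import deque

class ReplayMemory:
    """Keeps only the last `capacity` transitions; older ones are overwritten."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        # Uniform sampling: every stored transition is equally likely to be drawn.
        return random.sample(self.buffer, batch_size)
```

<A prioritized variant would replace the uniform `random.sample` with sampling weighted by, for example, each transition's TD error magnitude, in the spirit of prioritized sweeping mentioned above.>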
Algorithm 1 Deep Q-learning with Experience Replay
4.1 Preprocessing and Model Architecture
Working directly with raw Atari frames, which are 210 × 160 pixel images with a 128 color palette, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality. The raw frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a 110 × 84 image. The final input representation is obtained by cropping an 84 × 84 region of the image that roughly captures the playing area. The final cropping stage is only required because we use the GPU implementation of 2D convolutions from [11], which expects square inputs. For the experiments in this paper, the function \phi from Algorithm 1 applies this preprocessing to the last 4 frames of a history and stacks them to produce the input to the Q-function.
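<A sketch of this preprocessing, assuming OpenCV is available for the resize; the crop offset used to isolate the playing area is an assumption, since the paper does not specify it.>

```python
import cv2
import numpy as np

def preprocess(frame_rgb):
    """210x160x3 RGB frame -> 84x84 gray-scale crop of the playing area."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)   # convert to gray-scale
    small = cv2.resize(gray, (84, 110))                  # down-sample to 110x84 (height x width)
    return small[18:102, :]                              # crop an 84x84 region (offset assumed)

def phi(last_four_frames):
    """Stack the last 4 preprocessed frames into the 84x84x4 network input."""
    return np.stack([preprocess(f) for f in last_four_frames], axis=-1)
```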
There are several possible ways of parameterizing Q using a neural network. Since Q maps history-action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches [20, 12]. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.
We now describe the exact architecture used for all seven Atari games. The input to the neural network consists of an 84 × 84 × 4 image produced by \phi. The first hidden layer convolves 16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity [10, 18]. The second hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered. We refer to convolutional networks trained with our approach as Deep Q-Networks (DQN).
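<This architecture can be written down directly in PyTorch; a minimal sketch in which the layer sizes follow the description above, while everything else (weight initialisation, etc.) is left at library defaults.>

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x16
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 20x20x16 -> 9x9x32
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # fully-connected, 256 rectifier units
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one linear output per valid action
        )

    def forward(self, x):                                # x: (batch, 4, 84, 84)
        return self.net(x)                               # Q-values for all actions in one pass

q = DQN(n_actions=18)(torch.zeros(1, 4, 84, 84))         # -> shape (1, 18)
```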
5 Experiments
So far, we have performed experiments on seven popular ATARI games – Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders. We use the same network architecture, learning algorithm and hyperparameter settings across all seven games, showing that our approach is robust enough to work on a variety of games without incorporating game-specific information. While we evaluated our agents on the real and unmodified games, we made one change to the reward structure of the games during training only. Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude.
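<This reward clipping amounts to taking the sign of the raw score change; a one-line sketch.>

```python
import numpy as np

def clip_reward(raw_reward):
    """Positive rewards -> +1, negative -> -1, zero stays 0."""
    return float(np.sign(raw_reward))
```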
In these experiments, we used the RMSProp algorithm with minibatches of size 32. The behavior policy during training was \epsilon-greedy with \epsilon annealed linearly from 1 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 10 million frames and used a replay memory of one million most recent frames.
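<The linearly annealed exploration schedule can be sketched as follows.>

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end` over the first million frames."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * frame / anneal_frames
```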
Following previous approaches to playing Atari games, we also use a simple frame-skipping technique [3]. More precisely, the agent sees and selects actions on every k-th frame instead of every frame, and its last action is repeated on skipped frames. Since running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games except Space Invaders, where we noticed that using k = 4 makes the lasers invisible because of the period at which they blink. We used k = 3 to make the lasers visible and this change was the only difference in hyperparameter values between any of the games.
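<Frame skipping can be implemented as a thin wrapper that repeats the chosen action for k emulator steps and accumulates the reward; a sketch assuming a gymnasium-style `env.step` interface.>

```python
def skip_step(env, action, k=4):
    """Repeat `action` on k consecutive frames and sum the rewards."""
    total_reward, terminated, truncated = 0.0, False, False
    for _ in range(k):
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return obs, total_reward, terminated, truncated, info
```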
5.1 Training and Stability
In supervised learning, one can easily track the performance of a model during training by evaluating it on the training and validation sets. In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric, as suggested by [3], is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits. The leftmost two plots in Figure 2 show how the average total reward evolves during training on the games Seaquest and Breakout. Both averaged reward plots are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress. Another, more stable, metric is the policy's estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. We collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states. The two rightmost plots in Figure 2 show that average predicted Q increases much more smoothly than the average total reward obtained by the agent, and plotting the same metrics on the other five games produces similarly smooth curves. In addition to seeing relatively smooth improvement to predicted Q during training, we did not experience any divergence issues in any of our experiments. This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.
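<Tracking this metric only requires a fixed batch of held-out states, collected once with a random policy before training, and one forward pass per evaluation; a minimal sketch assuming a network like the DQN sketch in Section 4.1 above.>

```python
import torch

def average_max_q(q_net, holdout_states):
    """holdout_states: fixed (N, 4, 84, 84) batch gathered once with a random policy.
    Returns the average over these states of max_a Q(s, a; theta)."""
    with torch.no_grad():
        return q_net(holdout_states).max(dim=1).values.mean().item()
```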
Figure 2: The two plots on the left show average reward per episode on Breakout and Seaquest respectively during training. The statistics were computed by running an \epsilon-greedy policy with \epsilon = 0.05 for 10000 steps. The two plots on the right show the average maximum predicted action-value of a held out set of states on Breakout and Seaquest respectively. One epoch corresponds to 50000 minibatch weight updates or roughly 30 minutes of training time.
Figure 3: The leftmost plot shows the predicted value function for a 30 frame segment of the game Seaquest. The three screenshots correspond to the frames labeled by A, B, and C respectively.
5.2 Visualizing the Value Function
Figure 3 shows a visualization of the learned value function on the game Seaquest. The figure shows that the predicted value jumps after an enemy appears on the left of the screen (point A). The agent then fires a torpedo at the enemy and the predicted value peaks as the torpedo is about to hit the enemy (point B). Finally, the value falls to roughly its original value after the enemy disappears (point C). Figure 3 demonstrates that our method is able to learn how the value function evolves for a reasonably complex sequence of events.
5.3 Main Evaluation
We compare our results with the best performing methods from the RL literature [3, 4]. The method labeled Sarsa used the Sarsa algorithm to learn linear policies on several different feature sets hand-engineered for the Atari task, and we report the score for the best performing feature set [3]. Contingency used the same basic approach as Sarsa but augmented the feature sets with a learned representation of the parts of the screen that are under the agent's control [4]. Note that both of these methods incorporate significant prior knowledge about the visual problem by using background subtraction and treating each of the 128 colors as a separate channel. Since many of the Atari games use one distinct color for each type of object, treating each color as a separate channel can be similar to producing a separate binary map encoding the presence of each object type. In contrast, our agents only receive the raw RGB screenshots as input and must learn to detect objects on their own.
In addition to the learned agents, we also report scores for an expert human game player and a policy that selects actions uniformly at random. The human performance is the median reward achieved after around two hours of playing each game. Note that our reported human scores are much higher than the ones in Bellemare et al. [3]. For the learned methods, we follow the evaluation strategy used in Bellemare et al. [3, 5] and report the average score obtained by running an \epsilon-greedy policy with \epsilon = 0.05 for a fixed number of steps. The first five rows of Table 1 show the per-game average scores on all games. Our approach (labeled DQN) outperforms the other learning methods by a substantial margin on all seven games despite incorporating almost no prior knowledge about the inputs.
We also include a comparison to the evolutionary policy search approach from [8] in the last three rows of Table 1. We report two sets of results for this method. The HNeat Best score reflects the results obtained by using a hand-engineered object detector algorithm that outputs the locations and types of objects on the Atari screen. The HNeat Pixel score is obtained by using the special 8 color-channel representation of the Atari emulator that represents an object label map at each channel. This method relies heavily on finding a deterministic sequence of states that represents a successful exploit. It is unlikely that strategies learnt in this way will generalize to random perturbations; therefore the algorithm was only evaluated on the highest scoring single episode. In contrast, our algorithm is evaluated on \epsilon-greedy control sequences, and must therefore generalize across a wide variety of possible situations. Nevertheless, we show that on all the games, except Space Invaders, not only our max evaluation results (row 8), but also our average results (row 4) achieve better performance. Finally, we show that our method achieves better performance than an expert human player on Breakout, Enduro and Pong, and it achieves close to human performance on Beam Rider. The games Q*bert, Seaquest and Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.
Table 1: The upper table compares average total reward for various learning methods by running an \epsilon-greedy policy with \epsilon = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score, while DQN used an \epsilon-greedy policy with \epsilon = 0.05.
6 Conclusion
This paper introduced a new deep learning model for reinforcement learning, and demonstrated its ability to master difficult control policies for Atari 2600 computer games, using only raw pixels as input. We also presented a variant of online Q-learning that combines stochastic minibatch updates with experience replay memory to ease the training of deep networks for RL. Our approach gave state-of-the-art results in six of the seven games it was tested on, with no adjustment of the architecture or hyperparameters.