Reinforcement Learning
Automatically determine the ideal behavior (action) within a specific situation (environment) in order to obtain the optimal outcome (reward).

- Properties
    - No supervisor, only rewards
    - Importance of time: current input depends on previous inputs
    - Delayed rewards: rewards may be given only after the completion of the entire task
    - The agent's action affects its next input
    - Performance-focused: balance between exploration (of uncharted territory) & exploitation (of current knowledge)
- Elements
    - Agent
    - Environment
    - Reward
    - State
- Training
    - Action & policy
        - Stochastic policy: $$\sum_a \pi(a|s) = 1$$
        - Deterministic policy: $$\pi(s): S \rightarrow A$$
    - Value: long-term discounted rewards expected from the current state
    - Q-value: long-term discounted rewards, taking into account the current state & action
Credit Assignment Problem
When the agent gets a reward, it doesn't know which of its actions to credit or to blame.
- Discount rate
    - e.g. if an agent gets rewards 10, 0, -50, and the discount rate is 0.8, the score of the first action is $$10 + 0.8 \cdot 0 + 0.8^2 \cdot (-50) = -22$$
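A minimal sketch of how these discounted scores can be computed for every action in an episode by working backwards through the rewards (plain Python; the reward list is just the example above):
def discounted_scores(rewards, discount_rate):
    # Score of each action = discounted sum of the rewards that follow it.
    scores, cumulative = [], 0.0
    for r in reversed(rewards):
        cumulative = r + discount_rate * cumulative
        scores.append(cumulative)
    return list(reversed(scores))

print(discounted_scores([10, 0, -50], 0.8))  # [-22.0, -40.0, -50.0]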
Policy Gradient
Optimizes the parameters of a policy by following the gradients toward higher rewards; learns the policy function directly by maximizing rewards.
Algorithm
- At each step, compute the gradients that would make the chosen action even more likely, but don't apply them yet
- At the end of an episode, compute each action's score
- Multiply each gradient vector by the corresponding action's score
- Compute mean of all resulting gradient vectors, use it to perform a gradient descent step
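A minimal sketch of this algorithm with a tabular softmax policy. The environment interface (reset() and step(action) -> (state, reward, done)), the state/action counts, and the hyperparameters are illustrative assumptions, not a reference implementation:
import numpy as np

n_states, n_actions = 10, 2
theta = np.zeros((n_states, n_actions))      # policy parameters

def policy(state):
    # Softmax over the preferences for this state.
    prefs = theta[state]
    exp = np.exp(prefs - prefs.max())
    return exp / exp.sum()

def run_episode(env):
    grads, rewards = [], []
    state, done = env.reset(), False
    while not done:
        probs = policy(state)
        action = np.random.choice(n_actions, p=probs)
        # Gradient of log pi(action|state) w.r.t. theta[state]; stored, not applied yet.
        grad = -probs
        grad[action] += 1.0
        grads.append((state, grad))
        state, reward, done = env.step(action)
        rewards.append(reward)
    return grads, rewards

def policy_gradient_update(grads, rewards, gamma=0.95, lr=0.01):
    # Each action's score = discounted sum of subsequent rewards.
    scores, cumulative = [], 0.0
    for r in reversed(rewards):
        cumulative = r + gamma * cumulative
        scores.append(cumulative)
    scores.reverse()
    # Multiply each stored gradient by its action's score and apply the mean.
    for (state, grad), score in zip(grads, scores):
        theta[state] += lr * score * grad / len(grads)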
Q-Learning
Arrives at the optimal policy through interactions with the environment. Learns the state-action value function Q by minimizing the temporal difference error.
initialize Q[s, a] arbitrarily for all s, a
observe current state s
repeat
    select an action a
    do(a)
    observe reward r and state s'
    Q[s, a] = Q[s, a] + alpha * (r + gamma * max_a'(Q[s', a']) - Q[s, a])
    s = s'
until termination
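A minimal tabular sketch of this loop in Python; the environment interface and the ε-greedy exploration choice are illustrative assumptions:
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (see the exploration section below)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Move Q[s, a] toward r + gamma * max_a' Q[s', a']
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q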
Markov Decision Process (MDP)
- Markov chain
    - A sequence of possible events
    - Probability of each event depends only on the current state
    - Memoryless
- Markov decision process
    - Adds actions & rewards
- Fundamental solutions: assume the agent knows the transition & reward probability functions and plans actions offline
    - Value iteration
    - Policy iteration: try policy after policy
- Q-learning solution: model-free, the agent improves its behavior through online learning
Bellman Optimality Equation
Optimal State Value
The sum of all discounted future rewards the agent can expect after reaching state $$s$$, assuming it acts optimally.
Bellman optimality equation:
$$ V^*(s) = \max_a \sum_{s'} T(s, a, s')\left(R(s, a, s') + \gamma \cdot V^*(s')\right)
$$
- $$T(s, a, s')$$: transition probability
- $$R(s, a, s')$$: reward
- $$\gamma$$: discount rate
Value iteration algorithm: initialize all estimates $$V_0(s)$$ to 0, then update the estimated state values at each iteration:
$$ V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\left(R(s, a, s') + \gamma \cdot V_k(s')\right)
$$
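A minimal sketch of this update, assuming the known model is stored as hypothetical numpy arrays T and R indexed by [s, a, s']:
import numpy as np

def value_iteration(T, R, gamma=0.95, n_iterations=50):
    n_states = T.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iterations):
        # V_{k+1}(s) = max_a sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V_k(s'))
        V = np.max(np.sum(T * (R + gamma * V), axis=2), axis=1)
    return V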
Optimal State-Action Value (Q-Value)
The sum of all discounted future rewards the agent can expect after reaching state $$s$$ and choosing action $$a$$ but before it sees the outcome of this action, assuming it acts optimally.
Q-value iteration algorithm:
$$ Q_{k+1}(s, a) = \sum_{s'} T(s, a, s')\left(R(s, a, s') + \gamma \cdot \max_{a'} Q_k(s', a')\right)
$$
Optimal policy:
$$ \pi^*(s) = \operatorname{argmax}_a Q^*(s, a)
$$
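A minimal sketch under the same known-model assumption (T and R as [s, a, s'] arrays), ending with the greedy policy extraction:
import numpy as np

def q_value_iteration(T, R, gamma=0.95, n_iterations=50):
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iterations):
        # Q_{k+1}(s, a) = sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * max_a' Q_k(s', a'))
        Q = np.sum(T * (R + gamma * np.max(Q, axis=1)), axis=2)
    return Q

# Optimal policy: pi*(s) = argmax_a Q*(s, a)
# policy = np.argmax(Q, axis=1)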
Temporal Difference Learning
Now the agent does not know $$T(s, a, s')$$ or $$R(s, a, s')$$; it has to learn the values from observed transitions and rewards:
$$ V_{k+1}(s) = (1 - \alpha)V_k(s) + \alpha(r + \gamma \cdot V_k(s'))
$$
This converges only if the learning rate $$\alpha$$ is gradually reduced.
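A minimal sketch of this update for a fixed policy, with an illustrative 1/visit-count learning-rate schedule; the policy and environment interface are assumptions:
import numpy as np

def td_value_estimation(env, policy, n_states, episodes=1000, gamma=0.95):
    V = np.zeros(n_states)
    visits = np.zeros(n_states)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            visits[s] += 1
            alpha = 1.0 / visits[s]      # gradually reduced learning rate
            V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
            s = s_next
    return V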
Adapting the Q-value iteration algorithm gives the Q-learning update:
$$ Q_{k+1}(s, a) = (1 - \alpha)Q_k(s, a) + \alpha\left(r + \gamma \cdot \max_{a'} Q_k(s', a')\right)
$$
This can converge to the optimal policy if the MDP is explored thoroughly enough.

Exploitation vs. Exploration
Finding the optimal policy may take an extremely long time.
- Exploitation: spend time at the interesting parts of the environment
- Exploration: spend time visiting unknown regions of the MDP
ε-Greedy Policy
At each step, the agent acts randomly with probability ε, or greedily (choosing the action with the highest Q-value) with probability 1-ε. It is common to start with a high ε and gradually reduce it.
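A minimal sketch of this policy over a tabular Q array; the decay schedule shown is an illustrative choice:
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: random action
    return int(np.argmax(Q[state]))           # exploit: highest Q-value

# Example decay schedule: epsilon = max(0.05, 1.0 - episode / 500)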
Bonus
Offer a bonus for trying actions that have not been tried much before.
$$ Q_{k+1}(s, a) = (1 - \alpha)Q_k(s, a) + \alpha\left(r + \gamma \cdot \max_{a'} f(Q_k(s', a'), N(s', a'))\right)
$$
- $$N(s', a')$$: the number of times action $$a'$$ was chosen in state $$s'$$
- $$f(q, n) = q + \frac{K}{1+n}$$: exploration function, where $$K$$ is the curiosity parameter
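A minimal sketch of this bonus-based update, assuming tabular Q-values and visit counts N held in numpy arrays:
import numpy as np

K = 1.0  # curiosity parameter

def f(q, n):
    # Boosts the value of rarely tried actions; the bonus fades as n grows.
    return q + K / (1.0 + n)

def update_with_bonus(Q, N, s, a, r, s_next, alpha=0.1, gamma=0.95):
    target = r + gamma * np.max(f(Q[s_next], N[s_next]))
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    N[s, a] += 1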
SARSA
State-action-reward-state-action. Uses the current policy to choose both the current & next actions, then updates the Q-function.
- Off-policy learner
    - Learns the values of an optimal policy independently of the agent's actions
    - Can learn the optimal policy even while acting randomly, but this might take a long time
    - Exploitation
    - e.g. Q-learning
- On-policy learner
    - Learns the values of the policy the agent is actually carrying out
    - Exploration
    - e.g. SARSA
Algorithm
initialize Q[s, a] arbitrarily for all s, a
observe current state s
select an action a using a policy based on Q
repeat
    do(a)
    observe reward r and state s'
    select an action a' using a policy based on Q
    Q[s, a] = Q[s, a] + alpha * (r + gamma * Q[s', a'] - Q[s, a])
    s = s'
    a = a'
until termination
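A minimal tabular sketch of this loop, with an ε-greedy policy based on the current Q; the environment interface and hyperparameters are illustrative assumptions:
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))

    def select(s):
        # epsilon-greedy policy based on the current Q
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next)       # on-policy: next action from the same policy
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q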
- Properties
    - Can find a different policy than Q-learning when exploring may incur large penalties
    - Will find the optimal policy
Function Approximation
When the state space gets large, Q-value updates become extremely slow. Instead, learn a value function as a linear combination of features.
- For each state/action pair
    - Determine its representation in terms of features
    - Perform the Q-learning update on each feature
    - The Q-value estimate is a weighted sum over the state/action pair's features
SARSA with Linear Function Approximation
Let $$F_1(s, a), \dots, F_n(s, a)$$ be the features of state $$s$$ and action $$a$$.
$$ Q_{\bar{w}}(s, a) = w_0 + w_1F_1(s, a) + \dots + w_nF_n(s, a)
$$
Algorithm
assign weights W arbitrarily
observe current state s
select an action a
repeat
    do(a)
    observe reward r and state s'
    select an action a' using a policy based on Q_w
    let delta = r + gamma * Q_w[s', a'] - Q_w[s, a]
    for i = 0..n
        W[i] = W[i] + eta * delta * F[s, a][i]
    s = s'
    a = a'
until termination
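A minimal sketch of this algorithm, assuming a hypothetical feature function features(s, a) whose first entry is fixed at 1 so that W[0] plays the role of the intercept $$w_0$$; the environment interface and hyperparameters are also assumptions:
import numpy as np

def q_value(w, features, s, a):
    return float(np.dot(w, features(s, a)))

def linear_sarsa(env, features, n_features, n_actions,
                 episodes=500, eta=0.01, gamma=0.95, epsilon=0.1):
    w = np.zeros(n_features)

    def select(s):
        # epsilon-greedy over the approximated Q-values
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return max(range(n_actions), key=lambda a: q_value(w, features, s, a))

    for _ in range(episodes):
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next)
            delta = r + gamma * q_value(w, features, s_next, a_next) - q_value(w, features, s, a)
            w += eta * delta * features(s, a)   # gradient-descent-style weight update
            s, a = s_next, a_next
    return w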
- Properties
    - Reduces the size of the problem
    - Feature selection/extraction is non-trivial
    - Learns weights for a tiny number of features
    - No longer tracks values for individual state/action pairs; values are estimated from features
    - The weight update is a form of gradient descent
    - A variant of linear regression
    - Restricted accuracy of the learned rewards