
LRL-7: Temporal Difference (TD) Learning - Learning from Incomplete Episodes

Let's dive into Chapter 7: Temporal Difference (TD) Learning - Learning from Incomplete Episodes. In the last chapter, we explored Monte Carlo methods, which learn from complete episodes. Now, we're going to discover Temporal Difference (TD) Learning, a powerful class of model-free RL methods that can learn from incomplete episodes and even in continuing tasks. TD learning is a cornerstone of modern RL and is often considered more efficient and versatile than Monte Carlo methods.

Chapter 7: Temporal Difference (TD) Learning - Learning from Incomplete Episodes

In this chapter, we'll explore Temporal Difference (TD) Learning. TD learning is a central idea in Reinforcement Learning, combining ideas from both Dynamic Programming (DP) and Monte Carlo (MC) methods. Like MC methods, TD learning is model-free and learns from experience. But unlike MC methods, TD learning can learn from incomplete episodes by bootstrapping – updating value function estimates based on other value function estimates. This ability to learn online and from partial information makes TD learning highly efficient and applicable to a broader range of problems, including continuing tasks.

1. Introduction to Temporal Difference (TD) Learning: Learning from Bootstrapping

  • What is Temporal Difference Learning? TD learning is a type of Reinforcement Learning that learns to predict a quantity (like value functions) by updating its predictions based on the difference between its current prediction and a better prediction (or target). This "difference" is called the temporal difference error.

    (Analogy: Learning to Predict the Weather) Imagine you are learning to predict tomorrow's temperature.

    • Monte Carlo Approach: On Monday, you predict Saturday's temperature. You wait until Saturday, observe the actual temperature, and only then adjust Monday's prediction based on the difference between the two. You learn only from complete "episodes" of weather (the whole week).
    • Temporal Difference (TD) Approach: On Monday, you predict Saturday's temperature. On Tuesday, new information arrives (Tuesday's actual weather), and you form an updated, presumably better, prediction of Saturday's temperature. Instead of waiting until Saturday, you immediately adjust Monday's prediction toward Tuesday's prediction. You are using one prediction to improve another prediction – this is "bootstrapping."

    (Diagram: Comparing MC and TD Learning in Weather Prediction)

    MC (Wait for the Actual Outcome):
    Monday's Prediction -> Wait until Saturday -> Saturday's Actual Temp -> Update Monday's Prediction toward Saturday's actual temperature

    TD (Bootstrap from the Next Prediction):
    Monday's Prediction -> Tuesday Arrives -> Tuesday's Updated Prediction of Saturday -> Update Monday's Prediction toward Tuesday's prediction
    
  • Bootstrapping in TD Learning: The key characteristic of TD learning is bootstrapping. TD methods update value function estimates based on other estimated values. In contrast, Monte Carlo methods wait until the end of an episode and update values based on the actual return.

    • TD learns a guess from a guess. It's like pulling yourself up by your own bootstraps (hence "bootstrapping"). At first, this might seem strange – can you really learn something new from just guesses? The answer is yes, especially in sequential decision problems where future states are related to current states.
  • Model-Free and Sample-Based: Like Monte Carlo methods, TD learning is model-free. It doesn't require knowledge of transition probabilities or reward functions. It learns directly from experience. It's also sample-based, meaning it learns from sampled trajectories or episodes of interaction with the environment.

  • Applicable to Episodic and Continuing Tasks: TD learning can be applied to both episodic tasks (like Monte Carlo) and continuing tasks (tasks that don't have natural episodes and go on indefinitely). This is a significant advantage over basic Monte Carlo methods.

  • Advantages of TD Learning over Monte Carlo:

    • Learn from Incomplete Episodes: Can learn without waiting for the end of an episode. Update value functions after each step.
    • Can be Used in Continuing Tasks: Applicable to tasks without termination.
    • Lower Variance: Typically have lower variance than Monte Carlo estimates because they don't rely on full episode returns, which can be highly variable.
    • Online Learning: Can learn online, updating value functions as experience unfolds, without needing to wait for the end of an episode.
    • More Efficient: Often converge faster than Monte Carlo methods in practice.
  • Disadvantages of TD Learning compared to Monte Carlo:

    • Bias: TD estimates can be biased, especially if bootstrapping from inaccurate initial estimates. Monte Carlo estimates are generally unbiased (but have higher variance).
    • Markov Assumption More Important: TD methods rely more heavily on the Markov property. If the Markov property is significantly violated, TD methods might perform poorly.

2. TD Prediction: Estimating Vπ(s) using TD(0)

Let's start with TD Prediction, where we want to estimate the state value function Vπ(s) for a given policy π. The simplest TD method is called TD(0) (pronounced "TD zero").

TD(0) Update Rule for Vπ(s):

For each transition St, At, Rt+1, St+1 observed during interaction with the environment, TD(0) updates the value function estimate for state St as follows:

  • V(St) ← V(St) + α [Rt+1 + γV(St+1) - V(St)]

    (Equation: TD(0) Update Rule for Vπ)

    • V(St) is the current estimate of the value of state St.
    • V(St+1) is the current estimate of the value of the next state St+1 (this is the "bootstrapping" part).
    • Rt+1 is the reward received after transitioning to St+1.
    • γ is the discount factor.
    • α (alpha) is the step-size parameter (learning rate), 0 < α ≤ 1. It controls how much we update the value estimate in each step. A smaller α means slower learning, but potentially more stable convergence. A larger α means faster learning, but potentially more unstable.
  • TD Error (δt): The term in the square brackets is called the TD error at time t:

    • δt = Rt+1 + γV(St+1) - V(St)

    (Equation: TD Error)

    • Rt+1 + γV(St+1) is the TD target – it's an estimate of what the return should be, based on the immediate reward and the estimated value of the next state.
    • V(St) is the current prediction.
    • The TD error δt is the difference between the TD target and the current prediction. It represents the "surprise" or "error" in our current prediction. We update V(St) in the direction of reducing this error.
  • Comparison with Monte Carlo Update:

    • MC Update (after an episode): V(St) ← V(St) + α [Gt - V(St)] (where Gt is the actual return from time t to the end of the episode).
    • TD(0) Update (after each step): V(St) ← V(St) + α [Rt+1 + γV(St+1) - V(St)]

    • Key Difference: MC updates are based on the actual return Gt, which is only known at the end of the episode. TD updates are based on the TD target Rt+1 + γV(St+1), which uses the next reward and the current estimate of the value of the next state V(St+1).
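
A quick worked example of a single TD(0) update, with made-up numbers: suppose V(St) = 0.5, Rt+1 = 1, γ = 0.9, V(St+1) = 0.8, and α = 0.1. The TD target is Rt+1 + γV(St+1) = 1 + 0.9 × 0.8 = 1.72, the TD error is δt = 1.72 − 0.5 = 1.22, and the updated estimate is V(St) ← 0.5 + 0.1 × 1.22 = 0.622. The estimate moves only a fraction α of the way toward the target.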

TD(0) Algorithm for Policy Evaluation:

Algorithm: TD(0) for Policy Evaluation

Input: Policy π, MDP environment, step-size α
Output: Estimated state value function Vπ

Initialize: V(s) = 0 for all s ∈ S (or arbitrarily, except that V(terminal) = 0)

Loop (for each episode):
    Initialize S (starting state of episode)
    Loop (for each step of episode):
        Choose action A using policy π given state S
        Take action A, observe reward R and next state S'
        V(S) ← V(S) + α [R + γV(S') - V(S)]  (TD(0) Update)
        S ← S'
        If S is a terminal state, break loop

Return: V (Estimated value function Vπ)
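
To make the update rule concrete, here is a minimal Python sketch of TD(0) policy evaluation. The env object (with reset() returning a start state and step(action) returning the next state, the reward, and a done flag) and the policy function are assumed interfaces for illustration only, not part of any particular library:

import collections

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    # V(s) defaults to 0 for every state we encounter
    V = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(S) a step of size alpha toward the TD target R + γV(S')
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V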

3. TD Control: Learning Optimal Policies

Now, let's see how we can use TD learning for control, i.e., finding optimal policies. Similar to Monte Carlo control, we can use the idea of Generalized Policy Iteration (GPI), but now using TD methods for both policy evaluation and improvement.

Two Main TD Control Algorithms:

  • SARSA (State-Action-Reward-State-Action): On-Policy TD Control
  • Q-Learning: Off-Policy TD Control

a) SARSA: On-Policy TD Control

SARSA is an on-policy TD control algorithm. "On-policy" means that it learns the value function for the policy that is being used to make decisions. In SARSA, we learn the action value function Qπ(s, a).

SARSA Update Rule for Qπ(s, a):

For each step St, At, Rt+1, St+1, At+1 observed during interaction (note that we need the next action At+1 as well), SARSA updates the Q-value for the state-action pair (St, At) as follows:

  • Q(St, At) ← Q(St, At) + α [Rt+1 + γQ(St+1, At+1) - Q(St, At)]

    (Equation: SARSA Update Rule for Qπ)

    • Q(St, At) is the current estimate of the Q-value for state-action pair (St, At).
    • Q(St+1, At+1) is the current estimate of the Q-value for the next state-action pair (St+1, At+1).
    • At+1 is the action actually taken in state St+1. This is crucial for SARSA being "on-policy". At+1 is chosen according to the current policy π.
    • Rt+1 is the reward.
    • γ is the discount factor.
    • α is the step-size parameter.

SARSA Algorithm (with ε-greedy policy):

Algorithm: SARSA (On-Policy TD Control)

Input: MDP environment, step-size α, small ε > 0
Output: Optimal policy π* and optimal action value function Q* (approximations)

Initialize: Q(s, a) = 0 for all s ∈ S, a ∈ A (or arbitrarily, except that Q(terminal, ·) = 0)

Loop (for each episode):
    Initialize S (starting state of episode)
    Choose action A using ε-greedy policy based on Q(S, .) (e.g., π(·|S))
    Loop (for each step of episode):
        Take action A, observe reward R and next state S'
        Choose next action A' using ε-greedy policy based on Q(S', .) (e.g., π(·|S'))
        Q(S, A) ← Q(S, A) + α [R + γQ(S', A') - Q(S, A)]  (SARSA Update)
        S ← S'
        A ← A'
        If S is a terminal state, break loop

Return: Policy derived from Q (e.g., greedy policy π(s) = argmax_a Q(s, a)) and Q (approximation of the optimal action value function Q*)
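
Below is a minimal Python sketch of SARSA, using the same assumed env interface as in the TD(0) sketch above (reset()/step() returning next state, reward, done) and a finite list of actions; it is illustrative, not a reference implementation:

import collections
import random

def epsilon_greedy(Q, state, actions, epsilon):
    # Explore with probability epsilon, otherwise act greedily with respect to Q
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = collections.defaultdict(float)   # Q(s, a) defaults to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: bootstrap from the action actually chosen in S'
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q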

b) Q-Learning: Off-Policy TD Control

Q-Learning is an off-policy TD control algorithm. "Off-policy" means that it learns the value function for a policy (typically the optimal policy) independent of the policy being used to make decisions (the behavior policy). This can be advantageous because it allows us to learn about the optimal policy even while following an exploratory policy.

Q-Learning Update Rule for Q*(s, a):

For each transition St, At, Rt+1, St+1 observed, Q-Learning updates the Q-value for the state-action pair (St, At) as follows:

  • Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_{a'∈A} Q(St+1, a') - Q(St, At)]

    (Equation: Q-Learning Update Rule for Q*)

    • Q(St, At) is the current estimate of the Q-value for state-action pair (St, At).
    • max_{a'∈A} Q(St+1, a') is the maximum Q-value over all possible actions a' in the next state St+1. This is the key difference from SARSA: Q-Learning learns the value of taking action At in state St and then following the optimal policy afterwards, regardless of which action is actually taken in state St+1 in the current episode.
    • Rt+1 is the reward.
    • γ is the discount factor.
    • α is the step-size parameter.

Q-Learning Algorithm (with ε-greedy behavior policy):

Algorithm: Q-Learning (Off-Policy TD Control)

Input: MDP environment, step-size α, small ε > 0
Output: Optimal policy π* and optimal action value function Q* (approximations)

Initialize: Q(s, a) = 0 for all s ∈ S, a ∈ A (or arbitrarily, except that Q(terminal, ·) = 0)

Loop (for each episode):
    Initialize S (starting state of episode)
    Loop (for each step of episode):
        Choose action A using ε-greedy policy based on Q(S, .) (behavior policy, e.g., π_b(·|S))
        Take action A, observe reward R and next state S'
        Q(S, A) ← Q(S, A) + α [R + γ max_{a'∈A} Q(S', a') - Q(S, A)]  (Q-Learning Update)
        S ← S'
        If S is a terminal state, break loop

Return: Policy derived from Q (e.g., greedy policy π*(s) = argmax_a Q(s, a)) and Q (approximation of the optimal action value function Q*)
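
Here is a matching Python sketch of Q-Learning, again under the assumed env/actions interface used in the SARSA sketch; note that only the target computation changes:

import collections
import random

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = collections.defaultdict(float)   # Q(s, a) defaults to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # ε-greedy behavior policy (exploration)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap from the greedy value of S', regardless of the next action actually taken
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q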

Key Difference: SARSA vs. Q-Learning:

  • SARSA (On-Policy): Updates Q-values based on the action actually taken in the next state (At+1, chosen by the current policy). It learns the Q-value for the policy it is currently following (e.g., ε-greedy policy). It's learning about the "policy in action."
  • Q-Learning (Off-Policy): Updates Q-values based on the maximum possible Q-value in the next state (max_{a'∈A} Q(St+1, a')). It learns an estimate of the optimal Q-value, independent of the policy being followed. It's learning about the "optimal policy," even while behaving differently (e.g., ε-greedily).

(Diagram: SARSA vs. Q-Learning Update Targets)

SARSA Target:   R + γQ(S', A')   (where A' is the action actually taken by the policy in S')
Q-Learning Target: R + γ max_{a'} Q(S', a') (maximum Q-value in S', regardless of the action actually taken)
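
To make the difference concrete, here is a toy Python comparison of the two targets for a single transition, using made-up Q-values:

# Hypothetical numbers for one transition (S, A, R, S')
gamma = 0.9
R = 1.0
Q_next = {"left": 2.0, "right": 5.0}   # assumed Q(S', a) estimates
A_next = "left"                         # exploratory action chosen by the ε-greedy behavior policy in S'

sarsa_target = R + gamma * Q_next[A_next]        # 1 + 0.9 * 2.0 = 2.8  (on-policy: value of the action actually taken)
q_target = R + gamma * max(Q_next.values())      # 1 + 0.9 * 5.0 = 5.5  (off-policy: value of the greedy action)
print(sarsa_target, q_target)
# The two targets coincide only when the behavior policy happens to pick the greedy action in S'.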

Which one to choose: SARSA or Q-Learning?

  • SARSA: Learns about the policy it's executing. Can be more cautious because it considers the expected rewards under its current policy, including exploration. Might be preferred in situations where safety or risk-avoidance is important, as it tends to learn policies that are smoother and avoid taking actions that might lead to very bad outcomes (even if they might also lead to very good outcomes).
  • Q-Learning: Learns directly about the optimal policy, regardless of the exploration policy. Can be more aggressive in seeking out optimal actions. Might be preferred when the goal is to find the absolutely best policy, even if it involves some risk during learning. Can converge to the optimal policy even with more exploratory behavior policies.

4. Comparing Monte Carlo and TD Learning (Summary)

Feature            | Monte Carlo (MC)                                           | Temporal Difference (TD)
Learning Type      | Learns from complete episodes                              | Learns from incomplete episodes (can bootstrap)
Update Time        | Episode-end update                                         | Step-by-step update
Variance           | High variance (due to variable episode returns)            | Lower variance (bootstraps from the next state's value, less noisy)
Bias               | Unbiased (estimates converge to true values in the limit)  | Biased during learning (bootstraps from current, possibly inaccurate, estimates)
Markov Assumption  | Less reliant on the Markov property                        | More reliant on the Markov property
Applicable Tasks   | Episodic tasks (naturally)                                 | Episodic and continuing tasks
Online Learning    | Not directly (needs complete episodes)                     | Yes (can learn online, updating after each step)
Efficiency         | Less efficient (high variance, episode-end updates)        | More efficient (lower variance, step-by-step updates)
Example Algorithms | First-Visit MC, Every-Visit MC                             | TD(0), SARSA, Q-Learning

5. Advantages of TD Learning (Reiterated)

  • Learn from Incomplete Episodes: Major advantage for real-time and continuous tasks.
  • Lower Variance: More stable and faster learning in many cases.
  • Online Learning: Can learn and update policies as the agent interacts with the environment.
  • Applicable to Continuing Tasks: Opens up RL to a wider range of problems.

6. When to Use TD Learning:

TD learning methods are generally preferred over Monte Carlo methods in many Reinforcement Learning scenarios, especially when:

  • Tasks are continuing or episodes are very long.
  • Online learning is required.
  • Lower variance in value estimates is desired.
  • Computational efficiency is important.

What's Next?

We've now covered Dynamic Programming (for known environments), Monte Carlo methods (model-free, episodic), and Temporal Difference Learning (model-free, more efficient, episodic and continuing). However, all these methods, in their basic forms, assume that we can represent value functions as tables (e.g., Q-tables). This works for small state spaces and action spaces. But in many real-world problems, the state space is huge or even continuous (think of images, sensor readings, etc.).

In the next chapter, we'll tackle the challenge of scaling up RL to large state spaces by introducing Function Approximation. We'll see how we can use functions (like linear functions, neural networks, etc.) to approximate value functions and policies, allowing us to generalize from limited experience to unseen states and actions. This is a crucial step towards applying Reinforcement Learning to complex, real-world problems – leading us into the realm of Deep Reinforcement Learning.
