
LRL-7: Temporal Difference (TD) Learning - Learning from Incomplete Episodes

Let's dive into Chapter 7: Temporal Difference (TD) Learning - Learning from Incomplete Episodes. In the last chapter, we explored Monte Carlo methods, which learn from complete episodes. Now, we're going to discover Temporal Difference (TD) Learning, a powerful class of model-free RL methods that can learn from incomplete episodes and even in continuing tasks. TD learning is a cornerstone of modern RL and is often considered more efficient and versatile than Monte Carlo methods.

Chapter 7: Temporal Difference (TD) Learning - Learning from Incomplete Episodes

In this chapter, we'll explore Temporal Difference (TD) Learning. TD learning is a central idea in Reinforcement Learning, combining ideas from both Dynamic Programming (DP) and Monte Carlo (MC) methods. Like MC methods, TD learning is model-free and learns from experience. But unlike MC methods, TD learning can learn from incomplete episodes by bootstrapping – updating value function estimates based on other value function estimates. This ability to learn online and from partial information makes TD learning highly efficient and applicable to a broader range of problems, including continuing tasks.

1. Introduction to Temporal Difference (TD) Learning: Learning from Bootstrapping

  • What is Temporal Difference Learning? TD learning is a type of Reinforcement Learning that learns to predict a quantity (like value functions) by updating its predictions based on the difference between its current prediction and a better prediction (or target). This "difference" is called the temporal difference error.

    (Analogy: Learning to Predict the Weather) Imagine you are learning to predict tomorrow's temperature.

    • Monte Carlo Approach: On Monday, you predict Saturday's temperature. You wait until Saturday, observe the actual temperature, and only then adjust Monday's prediction based on the difference between the two. You learn only from complete "episodes" of weather (the whole week).
    • Temporal Difference (TD) Approach: On Monday, you predict Saturday's temperature. On Tuesday, new information arrives (Tuesday's actual weather), and you form an updated, presumably better, prediction of Saturday's temperature. Instead of waiting until Saturday, you immediately adjust Monday's prediction toward Tuesday's prediction. You are using one prediction to improve another prediction – this is "bootstrapping."

    (Diagram: Comparing MC and TD Learning in Weather Prediction)

    MC (Wait for the Actual Outcome):
    Monday's Prediction -> Wait until Saturday -> Saturday's Actual Temp -> Update Monday's Prediction toward Saturday's actual temperature

    TD (Bootstrap from the Next Prediction):
    Monday's Prediction -> Tuesday Arrives -> Tuesday's Updated Prediction of Saturday -> Update Monday's Prediction toward Tuesday's prediction
    
  • Bootstrapping in TD Learning: The key characteristic of TD learning is bootstrapping. TD methods update value function estimates based on other estimated values. In contrast, Monte Carlo methods wait until the end of an episode and update values based on the actual return.

    • TD learns a guess from a guess. It's like pulling yourself up by your own bootstraps (hence "bootstrapping"). At first, this might seem strange – can you really learn something new from just guesses? The answer is yes, especially in sequential decision problems where future states are related to current states.
  • Model-Free and Sample-Based: Like Monte Carlo methods, TD learning is model-free. It doesn't require knowledge of transition probabilities or reward functions. It learns directly from experience. It's also sample-based, meaning it learns from sampled trajectories or episodes of interaction with the environment.

  • Applicable to Episodic and Continuing Tasks: TD learning can be applied to both episodic tasks (like Monte Carlo) and continuing tasks (tasks that don't have natural episodes and go on indefinitely). This is a significant advantage over basic Monte Carlo methods.

  • Advantages of TD Learning over Monte Carlo:

    • Learn from Incomplete Episodes: Can learn without waiting for the end of an episode. Update value functions after each step.
    • Can be Used in Continuing Tasks: Applicable to tasks without termination.
    • Lower Variance: Typically have lower variance than Monte Carlo estimates because they don't rely on full episode returns, which can be highly variable.
    • Online Learning: Can learn online, updating value functions as experience unfolds, without needing to wait for the end of an episode.
    • More Efficient: Often converge faster than Monte Carlo methods in practice.
  • Disadvantages of TD Learning compared to Monte Carlo:

    • Bias: TD estimates can be biased, especially if bootstrapping from inaccurate initial estimates. Monte Carlo estimates are generally unbiased (but have higher variance).
    • Markov Assumption More Important: TD methods rely more heavily on the Markov property. If the Markov property is significantly violated, TD methods might perform poorly.

2. TD Prediction: Estimating Vπ(s) using TD(0)

Let's start with TD Prediction, where we want to estimate the state value function Vπ(s) for a given policy π. The simplest TD method is called TD(0) (pronounced "TD zero").

TD(0) Update Rule for Vπ(s):

For each transition St, At, Rt+1, St+1 observed during interaction with the environment, TD(0) updates the value function estimate for state St as follows:

  • V(St) ← V(St) + α [Rt+1 + γV(St+1) - V(St)]

    (Equation: TD(0) Update Rule for Vπ)

    • V(St) is the current estimate of the value of state St.
    • V(St+1) is the current estimate of the value of the next state St+1 (this is the "bootstrapping" part).
    • Rt+1 is the reward received after transitioning to St+1.
    • γ is the discount factor.
    • α (alpha) is the step-size parameter (learning rate), 0 < α ≤ 1. It controls how much we update the value estimate in each step. A smaller α means slower learning, but potentially more stable convergence. A larger α means faster learning, but potentially more unstable.
  • TD Error (δt): The term in the square brackets is called the TD error at time t:

    • δt = Rt+1 + γV(St+1) - V(St)

    (Equation: TD Error)

    • Rt+1 + γV(St+1) is the TD target – it's an estimate of what the return should be, based on the immediate reward and the estimated value of the next state.
    • V(St) is the current prediction.
    • The TD error δt is the difference between the TD target and the current prediction. It represents the "surprise" or "error" in our current prediction. We update V(St) in the direction of reducing this error.
  • Comparison with Monte Carlo Update:

    • MC Update (after an episode): V(St) ← V(St) + α [Gt - V(St)] (where Gt is the actual return from time t to the end of the episode).
    • TD(0) Update (after each step): V(St) ← V(St) + α [Rt+1 + γV(St+1) - V(St)]

    • Key Difference: MC updates are based on the actual return Gt, which is only known at the end of the episode. TD updates are based on the TD target Rt+1 + γV(St+1), which uses the next reward and the current estimate of the value of the next state V(St+1).
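
A quick worked example of a single TD(0) update, with made-up numbers: suppose V(St) = 0.5, Rt+1 = 1, γ = 0.9, V(St+1) = 0.8, and α = 0.1. The TD target is Rt+1 + γV(St+1) = 1 + 0.9 × 0.8 = 1.72, the TD error is δt = 1.72 − 0.5 = 1.22, and the updated estimate is V(St) ← 0.5 + 0.1 × 1.22 = 0.622. The estimate moves only a fraction α of the way toward the target.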

TD(0) Algorithm for Policy Evaluation:

Algorithm: TD(0) for Policy Evaluation

Input: Policy π, MDP environment, step-size α
Output: Estimated state value function Vπ

Initialize: V(s) = 0 for all s ∈ S (or arbitrarily, except that V(terminal) = 0)

Loop (for each episode):
    Initialize S (starting state of episode)
    Loop (for each step of episode):
        Choose action A using policy π given state S
        Take action A, observe reward R and next state S'
        V(S) ← V(S) + α [R + γV(S') - V(S)]  (TD(0) Update)
        S ← S'
        If S is a terminal state, break loop

Return: V (Estimated value function Vπ)
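
To make the update rule concrete, here is a minimal Python sketch of TD(0) policy evaluation. The env object (with reset() returning a start state and step(action) returning the next state, the reward, and a done flag) and the policy function are assumed interfaces for illustration only, not part of any particular library:

import collections

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    # V(s) defaults to 0 for every state we encounter
    V = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(S) a step of size alpha toward the TD target R + γV(S')
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V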

3. TD Control: Learning Optimal Policies

Now, let's see how we can use TD learning for control, i.e., finding optimal policies. Similar to Monte Carlo control, we can use the idea of Generalized Policy Iteration (GPI), but now using TD methods for both policy evaluation and improvement.

Two Main TD Control Algorithms:

  • SARSA (State-Action-Reward-State-Action): On-Policy TD Control
  • Q-Learning: Off-Policy TD Control

a) SARSA: On-Policy TD Control

SARSA is an on-policy TD control algorithm. "On-policy" means that it learns the value function for the policy that is being used to make decisions. In SARSA, we learn the action value function Qπ(s, a).

SARSA Update Rule for Qπ(s, a):

For each step St, At, Rt+1, St+1, At+1 observed during interaction (note that we need the next action At+1 as well), SARSA updates the Q-value for the state-action pair (St, At) as follows:

  • Q(St, At) ← Q(St, At) + α [Rt+1 + γQ(St+1, At+1) - Q(St, At)]

    (Equation: SARSA Update Rule for Qπ)

    • Q(St, At) is the current estimate of the Q-value for state-action pair (St, At).
    • Q(St+1, At+1) is the current estimate of the Q-value for the next state-action pair (St+1, At+1).
    • At+1 is the action actually taken in state St+1. This is crucial for SARSA being "on-policy". At+1 is chosen according to the current policy π.
    • Rt+1 is the reward.
    • γ is the discount factor.
    • α is the step-size parameter.

SARSA Algorithm (with ε-greedy policy):

Algorithm: SARSA (On-Policy TD Control)

Input: MDP environment, step-size α, small ε > 0
Output: Optimal policy π* and optimal action value function Q* (approximations)

Initialize: Q(s, a) = 0 for all s ∈ S, a ∈ A (or arbitrarily, except that Q(terminal, ·) = 0)

Loop (for each episode):
    Initialize S (starting state of episode)
    Choose action A using ε-greedy policy based on Q(S, .) (e.g., π(·|S))
    Loop (for each step of episode):
        Take action A, observe reward R and next state S'
        Choose next action A' using ε-greedy policy based on Q(S', .) (e.g., π(·|S'))
        Q(S, A) ← Q(S, A) + α [R + γQ(S', A') - Q(S, A)]  (SARSA Update)
        S ← S'
        A ← A'
        If S is a terminal state, break loop

Return: Policy derived from Q (e.g., greedy policy π(s) = argmax_a Q(s, a)) and Q (approximation of the optimal action value function Q*)
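
Below is a minimal Python sketch of SARSA, using the same assumed env interface as in the TD(0) sketch above (reset()/step() returning next state, reward, done) and a finite list of actions; it is illustrative, not a reference implementation:

import collections
import random

def epsilon_greedy(Q, state, actions, epsilon):
    # Explore with probability epsilon, otherwise act greedily with respect to Q
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = collections.defaultdict(float)   # Q(s, a) defaults to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: bootstrap from the action actually chosen in S'
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q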

b) Q-Learning: Off-Policy TD Control

Q-Learning is an off-policy TD control algorithm. "Off-policy" means that it learns the value function for a policy (typically the optimal policy) independent of the policy being used to make decisions (the behavior policy). This can be advantageous because it allows us to learn about the optimal policy even while following an exploratory policy.

Q-Learning Update Rule for Q*(s, a):

For each transition St, At, Rt+1, St+1 observed, Q-Learning updates the Q-value for the state-action pair (St, At) as follows:

  • Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_{a'∈A} Q(St+1, a') - Q(St, At)]

    (Equation: Q-Learning Update Rule for Q*)

    • Q(St, At) is the current estimate of the Q-value for state-action pair (St, At).
    • max_{a'∈A} Q(St+1, a') is the maximum Q-value over all possible actions a' in the next state St+1. This is the key difference from SARSA: Q-Learning learns the value of taking action At in state St and then following the optimal policy afterwards, regardless of which action is actually taken in state St+1 in the current episode.
    • Rt+1 is the reward.
    • γ is the discount factor.
    • α is the step-size parameter.

Q-Learning Algorithm (with ε-greedy behavior policy):

Algorithm: Q-Learning (Off-Policy TD Control)

Input: MDP environment, step-size α, small ε > 0
Output: Optimal policy π* and optimal action value function Q* (approximations)

Initialize: Q(s, a) = 0 for all s ∈ S, a ∈ A (or arbitrarily, except that Q(terminal, ·) = 0)

Loop (for each episode):
    Initialize S (starting state of episode)
    Loop (for each step of episode):
        Choose action A using ε-greedy policy based on Q(S, .) (behavior policy, e.g., π_b(·|S))
        Take action A, observe reward R and next state S'
        Q(S, A) ← Q(S, A) + α [R + γ max_{a'∈A} Q(S', a') - Q(S, A)]  (Q-Learning Update)
        S ← S'
        If S is a terminal state, break loop

Return: Policy derived from Q (e.g., greedy policy π*(s) = argmax_a Q(s, a)) and Q (approximation of the optimal action value function Q*)
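
Here is a matching Python sketch of Q-Learning, again under the assumed env/actions interface used in the SARSA sketch; note that only the target computation changes:

import collections
import random

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = collections.defaultdict(float)   # Q(s, a) defaults to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # ε-greedy behavior policy (exploration)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap from the greedy value of S', regardless of the next action actually taken
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q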

Key Difference: SARSA vs. Q-Learning:

  • SARSA (On-Policy): Updates Q-values based on the action actually taken in the next state (At+1, chosen by the current policy). It learns the Q-value for the policy it is currently following (e.g., ε-greedy policy). It's learning about the "policy in action."
  • Q-Learning (Off-Policy): Updates Q-values based on the maximum possible Q-value in the next state (max_{a'∈A} Q(St+1, a')). It learns an estimate of the optimal Q-value, independent of the policy being followed. It's learning about the "optimal policy," even while behaving differently (e.g., ε-greedily).

(Diagram: SARSA vs. Q-Learning Update Targets)

SARSA Target:   R + γQ(S', A')   (where A' is the action actually taken by the policy in S')
Q-Learning Target: R + γ max_{a'} Q(S', a') (maximum Q-value in S', regardless of the action actually taken)
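
To make the difference concrete, here is a toy Python comparison of the two targets for a single transition, using made-up Q-values:

# Hypothetical numbers for one transition (S, A, R, S')
gamma = 0.9
R = 1.0
Q_next = {"left": 2.0, "right": 5.0}   # assumed Q(S', a) estimates
A_next = "left"                         # exploratory action chosen by the ε-greedy behavior policy in S'

sarsa_target = R + gamma * Q_next[A_next]        # 1 + 0.9 * 2.0 = 2.8  (on-policy: value of the action actually taken)
q_target = R + gamma * max(Q_next.values())      # 1 + 0.9 * 5.0 = 5.5  (off-policy: value of the greedy action)
print(sarsa_target, q_target)
# The two targets coincide only when the behavior policy happens to pick the greedy action in S'.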

Which one to choose: SARSA or Q-Learning?

  • SARSA: Learns about the policy it's executing. Can be more cautious because it considers the expected rewards under its current policy, including exploration. Might be preferred in situations where safety or risk-avoidance is important, as it tends to learn policies that are smoother and avoid taking actions that might lead to very bad outcomes (even if they might also lead to very good outcomes).
  • Q-Learning: Learns directly about the optimal policy, regardless of the exploration policy. Can be more aggressive in seeking out optimal actions. Might be preferred when the goal is to find the absolutely best policy, even if it involves some risk during learning. Can converge to the optimal policy even with more exploratory behavior policies.

4. Comparing Monte Carlo and TD Learning (Summary)

Feature            | Monte Carlo (MC)                                           | Temporal Difference (TD)
Learning Type      | Learns from complete episodes                              | Learns from incomplete episodes (can bootstrap)
Update Time        | Episode-end update                                         | Step-by-step update
Variance           | High variance (due to variable episode returns)            | Lower variance (bootstraps from the next state's value, less noisy)
Bias               | Unbiased (estimates converge to true values in the limit)  | Biased during learning (bootstraps from current, possibly inaccurate, estimates)
Markov Assumption  | Less reliant on the Markov property                        | More reliant on the Markov property
Applicable Tasks   | Episodic tasks (naturally)                                 | Episodic and continuing tasks
Online Learning    | Not directly (needs complete episodes)                     | Yes (can learn online, updating after each step)
Efficiency         | Less efficient (high variance, episode-end updates)        | More efficient (lower variance, step-by-step updates)
Example Algorithms | First-Visit MC, Every-Visit MC                             | TD(0), SARSA, Q-Learning

5. Advantages of TD Learning (Reiterated)

  • Learn from Incomplete Episodes: Major advantage for real-time and continuous tasks.
  • Lower Variance: More stable and faster learning in many cases.
  • Online Learning: Can learn and update policies as the agent interacts with the environment.
  • Applicable to Continuing Tasks: Opens up RL to a wider range of problems.

6. When to Use TD Learning:

TD learning methods are generally preferred over Monte Carlo methods in many Reinforcement Learning scenarios, especially when:

  • Tasks are continuing or episodes are very long.
  • Online learning is required.
  • Lower variance in value estimates is desired.
  • Computational efficiency is important.

What's Next?

We've now covered Dynamic Programming (for known environments), Monte Carlo methods (model-free, episodic), and Temporal Difference Learning (model-free, more efficient, episodic and continuing). However, all these methods, in their basic forms, assume that we can represent value functions as tables (e.g., Q-tables). This works for small state spaces and action spaces. But in many real-world problems, the state space is huge or even continuous (think of images, sensor readings, etc.).

In the next chapter, we'll tackle the challenge of scaling up RL to large state spaces by introducing Function Approximation. We'll see how we can use functions (like linear functions, neural networks, etc.) to approximate value functions and policies, allowing us to generalize from limited experience to unseen states and actions. This is a crucial step towards applying Reinforcement Learning to complex, real-world problems – leading us into the realm of Deep Reinforcement Learning.
