1. How the Values of Each Cell Are Calculated
The value of each cell is obtained from the Bellman optimality equation, V(s) = max_a Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ·V(s')], where:
- V(s): Value of the current state s.
- a: Action taken in state s.
- P(s'|s,a): Probability of transitioning to state s' after taking action a in state s.
- R(s,a,s'): Immediate reward for transitioning to s'.
- γ: Discount factor (0 ≤ γ ≤ 1).
- If moving "right" from state s leads deterministically to state s' with reward -0.01 (step cost) and γ = 1, the contribution to V(s) for this action is -0.01 + γ·V(s') (see the sketch below).
- The value V(s) is the maximum of such contributions across all actions.
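As an illustration of this backup, here is a minimal Python sketch for a single cell of a deterministic grid world. The specific states, neighbouring values, and step cost below are assumptions made up for the example; only the max-over-actions rule itself comes from the equation above.

```python
# Bellman backup for one state s in a deterministic grid world.
# All numbers here are illustrative assumptions, not values from the post.

GAMMA = 1.0          # discount factor γ
STEP_COST = -0.01    # immediate reward for a non-goal move

# Deterministic transitions from state s = (1, 1): action -> next state s'
transitions = {
    "up":    (0, 1),
    "down":  (2, 1),
    "left":  (1, 0),
    "right": (1, 2),
}

# Current value estimates V(s') for the neighbouring states (assumed)
V = {(0, 1): 0.2, (2, 1): -0.1, (1, 0): 0.0, (1, 2): 0.5}

# Contribution of each action: R + γ·V(s') (transition probability is 1 here)
contributions = {
    action: STEP_COST + GAMMA * V[s_next]
    for action, s_next in transitions.items()
}

# V(s) is the maximum contribution across all actions ("right" wins: ≈ 0.49)
v_s = max(contributions.values())
print(contributions, v_s)
```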
- If γ = 1:
  - Future rewards are valued equally with immediate rewards.
  - The agent prioritizes minimizing step costs (e.g., shorter paths) to maximize cumulative reward.
  - Example: A path with 5 steps (total cost = -0.05) is better than a path with 10 steps (total cost = -0.10), even if both reach the goal.
- If γ < 1:
  - Future rewards are discounted.
  - The agent prefers shorter paths to "lock in" the goal reward sooner (e.g., γ^5 · 10 > γ^10 · 10).
  - Example: With γ = 0.9, a 5-step path yields about 0.9^5 × 10 ≈ 5.9 from the goal reward, while a 10-step path yields only about 0.9^10 × 10 ≈ 3.5 (ignoring the small step costs; see the comparison sketch after this list).
- If there are no step costs, γ = 1 has no impact on path length, as the total reward is always 10 (the goal reward) regardless of how many steps the path takes.
  - The agent can take any path (long or short) without penalty.
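The sketch below works through the three cases above with plain arithmetic. The step cost of -0.01, goal reward of 10, and path lengths are taken from the examples; the reward-timing convention (the reward for step t is discounted by γ^t) is an assumption.

```python
# Total (discounted) return for a path that reaches the goal after n_steps moves.
# Step cost and goal reward match the examples above; the discounting
# convention (reward for step t is discounted by gamma**t) is an assumption.

def path_return(n_steps, gamma, step_cost=-0.01, goal_reward=10.0):
    total = sum((gamma ** t) * step_cost for t in range(1, n_steps + 1))
    total += (gamma ** n_steps) * goal_reward  # goal reward arrives on the last step
    return total

# Case 1: step costs, γ = 1 -> the shorter path wins by its lower cumulative cost
print(path_return(5, 1.0), path_return(10, 1.0))    # 9.95 vs 9.90

# Case 2: step costs, γ = 0.9 -> the shorter path wins by a much larger margin
print(path_return(5, 0.9), path_return(10, 0.9))    # ≈ 5.87 vs ≈ 3.43

# Case 3: no step costs, γ = 1 -> every path scores the same
print(path_return(5, 1.0, step_cost=0.0), path_return(10, 1.0, step_cost=0.0))  # 10.0 vs 10.0
```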
- Values are computed via the Bellman Equation, incorporating rewards and discounted future values.
- Gamma affects shortest-path preference only if step costs exist. A higher γ (closer to 1) still favors shorter paths through cumulative cost minimization, while a lower γ amplifies this effect by devaluing delayed rewards.
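Finally, a minimal value-iteration sketch, assuming a 3×3 deterministic grid with a single goal cell: it repeatedly applies the Bellman backup to every cell until the values stop changing. The grid size, goal position, and convergence threshold are illustrative choices, not details from the post.

```python
# Value iteration on a small deterministic grid world.
# Grid size, goal cell, rewards, and threshold are illustrative assumptions.

GAMMA = 0.9
STEP_COST = -0.01
GOAL_REWARD = 10.0
ROWS, COLS = 3, 3
GOAL = (0, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic move; bumping into the boundary leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state

# Start with all cell values at 0; the goal cell is treated as terminal.
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}

while True:
    delta = 0.0
    for s in V:
        if s == GOAL:
            continue  # terminal state keeps its value
        # Bellman backup: best action's reward plus discounted next-state value
        best = max(
            (GOAL_REWARD if step(s, a) == GOAL else STEP_COST) + GAMMA * V[step(s, a)]
            for a in ACTIONS
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-6:  # values have converged
        break

for r in range(ROWS):
    print([round(V[(r, c)], 2) for c in range(COLS)])
```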