Eligibility Traces in Temporal Difference Learning

Guilin Zhu
Sep 30, 2020 · 3 min read

Question: what is an eligibility trace in temporal difference learning (TD(λ)), and why do we need it?

Answer: an eligibility trace is typically used to represent how frequently and how recently a state has been visited during the learning of an episode.

Recency: more credit is assigned to more recently visited states.

Frequency: more credit is assigned to states that have been visited more often. One way to capture both is the backward view of TD(λ), which relies on eligibility traces.
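
In symbols, the accumulating-trace update from the standard tabular backward view captures both ideas; the notation below (γ for the discount, α for the step size) follows the usual convention rather than anything quoted from this post.

```latex
% Eligibility trace: decays every step (recency) and grows by one
% whenever the state is revisited (frequency).
e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}\{S_t = s\}

% All state values are then nudged at once by the one-step TD error,
% each scaled by its own eligibility.
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad
V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s)
```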

  • Eligibility traces are a way to keep a history of what happened in the past and how the states we’ve visited affected the reward we’re seeing. They allow us to update multiple state-value estimates at once, weighted by recency and frequency.
  • The formulation above is called an accumulating eligibility trace: each visit to a state adds 1 to its eligibility. An alternative is the replacing eligibility trace, which instead resets the eligibility of the visited state back to 1 (a minimal sketch of both variants appears after this list).
  • This is still a temporal difference method. It is still biased because of the bootstrapping. It should intuitively have less variance than a Monte Carlo method because it averages in low-variance information, though I have no theory to back up this claim.
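
For reference, here is a minimal sketch of tabular TD(λ) prediction in the backward view with both trace variants. The environment/policy interface (env.reset(), env.step(), policy(state)) and the default hyperparameters are illustrative assumptions, not something specified in this post.

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=0.99, lam=0.9,
              trace="accumulating"):
    """Tabular TD(lambda) prediction, backward view (sketch)."""
    V = defaultdict(float)                 # state-value estimates
    for _ in range(num_episodes):
        e = defaultdict(float)             # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD error (bootstrapping, hence the bias noted above)
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            # Mark the current state: frequency and recency
            if trace == "accumulating":
                e[state] += 1.0            # repeated visits pile up
            else:                          # "replacing"
                e[state] = 1.0             # repeated visits reset the trace to 1
            # Every visited state shares in the TD error, in proportion to
            # how recently and how often it was visited.
            for s in e:
                V[s] += alpha * delta * e[s]
                e[s] *= gamma * lam        # traces decay toward zero
            state = next_state
    return V
```

The per-step cost grows with the number of states carrying a non-zero trace, which is the price of updating many state values at once instead of only the most recent one.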

Note: The figure below shows the algorithm with eligibility traces for TD(0), TD(λ), and TD(1).

Figure 1: Algorithm with eligibility traces for TD(0), TD(λ), and TD(1)

OK, so now another question comes to mind: why do we need λ for TD learning, and is it the same thing as the eligibility trace we used to represent the recency and frequency of visited states? As we have learned, TD(0) is a one-step estimator that only looks at the most recent state, while TD(1) takes all past states into consideration, weighting each visited state by γ through its eligibility trace, as discussed earlier.

Therefore, TD(λ) is really just a weighted combination of k-step estimators, as shown in Figure 2 below. We want the weight on each k-step return to decay exponentially with time, which we can do by introducing λ (between 0 and 1) and giving the n-step estimator the weight λ^{n-1}. Because Σ_{n=1}^{∞} λ^{n-1} = 1/(1 - λ), we multiply by (1 - λ) to normalize the weights so they sum to 1, which gives the λ-return: G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)}.

Figure 2: TD(λ) as a weighted combination of k-step estimators
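
To make the normalization concrete, here is a small sketch of the episodic λ-return built from a list of n-step returns; the function name and interface are my own illustration, not code from the post.

```python
def lambda_return(n_step_returns, lam):
    """Episodic lambda-return for one time step (sketch).

    n_step_returns[n - 1] holds the n-step return G_t^{(n)} for n = 1..N,
    and the last entry is assumed to be the full (Monte Carlo) return.
    """
    N = len(n_step_returns)
    g_lambda = 0.0
    # Intermediate n-step returns get weight (1 - lam) * lam^(n - 1) ...
    for n in range(1, N):
        g_lambda += (1 - lam) * lam ** (n - 1) * n_step_returns[n - 1]
    # ... and the remaining weight lam^(N - 1) goes to the full return,
    # so all the weights sum to exactly 1.
    g_lambda += lam ** (N - 1) * n_step_returns[-1]
    return g_lambda
```

With lam = 0 this collapses to the one-step return (TD(0)), and with lam = 1 it collapses to the full Monte Carlo return (TD(1)), matching the two extremes discussed above.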
