The reward is propagated back to each step as well, so that the model knows what to do at every step.
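
One common way to propagate a reward back over the steps is a discounted return, sketched below. The note does not say which scheme is meant, so this is an assumption: the function name `propagate_reward`, the discount factor `gamma`, and the example episode are all hypothetical and only illustrate the idea.

```python
def propagate_reward(rewards, gamma=0.9):
    """Return a per-step credit signal (discounted return) for one episode.

    Assumption: credit is assigned with a discounted sum of future rewards;
    the note itself does not specify the propagation rule.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the reward that comes after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: only the final step receives a reward.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
step_returns = propagate_reward(episode_rewards, gamma=0.5)
print(step_returns)  # [0.125, 0.25, 0.5, 1.0]
```

Every step ends up with a nonzero signal even though only the last one was rewarded directly, and earlier steps receive less credit because they are further from the outcome.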
