Assume that we have a policy $\pi$ that is able to output a variable-length sequence of actions $a_{1:k}$ given any state $s$. We define a Q function $Q(s, a_{1:k})$ which computes the state-action value of any action sequence $a_{1:k}$ given $s$. The intuitive meaning of this quantity is: what expected future return will we get if, starting from $s$, we take the actions $a_{1:k}$ regardless of the future states in the following $k$ steps, after which we follow our policy $\pi$? In fact, our $k$-step action sequence is open-loop because it doesn't depend on the environment feedback in the next $k-1$ steps.
We can use the Bellman equation to learn $Q$ from the data of a memory (e.g., a replay buffer or a search tree). The bootstrapping target for $Q(s_t, a_{t:t+k-1})$ is defined as

$$\mathbb{E}_{s_{t+1:t+k} \sim P}\left[\sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k}\, \mathbb{E}_{a'_{1:m} \sim \pi(\cdot \mid s_{t+k})} Q\big(s_{t+k}, a'_{1:m}\big)\right],$$

where $P$ is the environment transition probability. In practice, each time we sample a $k$-step transition $(s_t, a_{t:t+k-1}, r_{t:t+k-1}, s_{t+k})$ from the memory, we sample a new sequence of actions $a'_{1:m}$ at $s_{t+k}$ with the current policy, and use $\sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} Q(s_{t+k}, a'_{1:m})$ as a point estimate of the above target to fit $Q(s_t, a_{t:t+k-1})$. So the key question is: does the empirical frequency of $s_{t+k}$ (from the memory) correctly match the theoretical distribution of $s_{t+k}$ after we execute $a_{t:t+k-1}$ in an open-loop manner at $s_t$?
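To make the fitting procedure above concrete, here is a minimal Python sketch of computing such a point estimate from one sampled segment. The `q_fn` and `policy.sample` interfaces and the dictionary layout of the transition are hypothetical names chosen for illustration, not something defined in this post.

```python
def td_target(q_fn, policy, transition, gamma=0.99):
    """Point estimate of the bootstrapping target for Q(s_t, a_{t:t+k-1}).

    `transition` is a k-step segment sampled from the memory:
        state      : s_t
        actions    : [a_t, ..., a_{t+k-1}]
        rewards    : [r_t, ..., r_{t+k-1}]
        next_state : s_{t+k}
        done       : whether s_{t+k} is terminal
    """
    k = len(transition["actions"])
    # Discounted sum of the k rewards observed along the replayed segment.
    segment_return = sum(gamma**i * r for i, r in enumerate(transition["rewards"]))
    if transition["done"]:
        return segment_return
    # Sample a fresh (variable-length) action sequence a'_{1:m} from the current
    # policy at s_{t+k} and bootstrap with the sequence-level Q value.
    next_actions = policy.sample(transition["next_state"])
    return segment_return + gamma**k * q_fn(transition["next_state"], next_actions)
```

Fitting $Q(s_t, a_{t:t+k-1})$ then amounts to regressing it toward this target, e.g., with a squared TD error.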
The answer to the question above is: not always, and it depends on how the transition $(s_t, a_{t:t+k-1}, r_{t:t+k-1}, s_{t+k})$ is obtained from the memory (e.g., a replay buffer or a search tree). In a typical case, the action sequence $a_{t:t+k-1}$ might come from stitching together two action-sequence outputs produced during rollout, as the example below shows.

Here, the replayed transition actually spans two rollout policy samplings, i.e., the $k$ actions were not produced by a single call to the policy. This might cause issues for learning the state-action value $Q(s_t, a_{t:t+k-1})$ in some MDPs.
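For illustration, such stitched segments naturally arise if we sample any contiguous $k$-step window from rollout data that is stored step by step. The per-step record layout below (including the `chunk_id` field marking which policy call produced each action) is an assumption of this sketch.

```python
import random

def sample_k_step_window(episodes, k):
    """Sample a contiguous k-step window (s_t, a_{t:t+k-1}, r_{t:t+k-1}, s_{t+k})
    from rollout data stored step by step.

    Each episode is a list of per-step records
        {"state", "action", "reward", "next_state", "done", "chunk_id"},
    where `chunk_id` identifies which policy call produced the action.
    Because the start index t is unconstrained, the k actions in the window
    may straddle two different policy samplings (two chunk_ids).
    """
    ep = random.choice(episodes)
    t = random.randrange(len(ep) - k + 1)
    window = ep[t:t + k]
    return {
        "state": window[0]["state"],
        "actions": [step["action"] for step in window],
        "rewards": [step["reward"] for step in window],
        "next_state": window[-1]["next_state"],
        "done": window[-1]["done"],
        "chunk_ids": {step["chunk_id"] for step in window},  # more than one id => stitched
    }
```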
Deterministic environment
For a deterministic environment, it's completely fine to learn $Q$ by arbitrarily stitching together temporally contiguous actions $a_{t:t+k-1}$ from a memory, even though these actions were not generated in one shot (i.e., generated once and for all at $s_t$). The reason is that, given $s_t$ and any action subsequence $a_{t:t+k-1}$, the resulting state $s_{t+k}$ is uniquely determined in a deterministic environment. So the final state in the replayed transition reflects the true transition destination with a probability of 1.
Stochastic environment

However, in a stochastic environment, arbitrarily stitching together contiguous actions will cause issues. To see this, consider the example above, where action $b$ can be generated by the policy at two different states: $s_1$ (with a prob of 0.2) and $s_2$ (with a prob of 1). These two states are valid stochastic successors of taking $a$ at $s_0$, each with a chance of 0.5. We assume that the episode always starts at $s_0$ and ends after either $s_1$ or $s_2$ (suppose the end state is denoted by $s_3$).
By our definition, $Q(s_0, ab)$ (the state-action value of taking $a$ and then $b$ starting from $s_0$) should be $0.5$. Note that when we compute this value, we should not look at the action distribution at $s_1$ or $s_2$: even though $b$ has only a prob of 0.2 at $s_1$, we always take it after $a$.
Now suppose all data in the memory are generated step by step. When using TD learning to estimate $Q(s_0, ab)$, we sample either the stitched transition $(s_0, a, s_1, b)$ or $(s_0, a, s_2, b)$ from the memory. The problem is, the second transition appears five times as frequently as the first one ($0.5 \times 1$ vs. $0.5 \times 0.2$)! Thus the learned $Q(s_0, ab)$ will be a 5:1 weighted average of the two branches' returns rather than the correct 1:1 average. This is completely wrong given the correct value of 0.5.
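This bias is easy to reproduce with a short Monte-Carlo simulation. Since the figure's reward values are not spelled out in the text, the sketch below assumes, purely for illustration, that taking $b$ at $s_1$ yields a reward of 1 and every other transition yields 0 (so the true open-loop value is $0.5 \times 1 + 0.5 \times 0 = 0.5$); the alternative action $c$ at $s_1$ is likewise a made-up name.

```python
import random

def rollout_after_a():
    """One step-by-step rollout after taking a at s0 (rewards are assumed, see above)."""
    s = "s1" if random.random() < 0.5 else "s2"  # stochastic successor, prob 0.5 each
    if s == "s1":
        # The rollout policy samples b with prob 0.2 at s1, otherwise action c.
        action, reward = ("b", 1.0) if random.random() < 0.2 else ("c", 0.0)
    else:
        # The rollout policy always samples b at s2.
        action, reward = "b", 0.0
    return s, action, reward

def naive_estimate(num_episodes=200_000):
    """Average return over all stitched (s0, a, ., b) segments found in the memory."""
    returns = []
    for _ in range(num_episodes):
        _, action, reward = rollout_after_a()
        if action == "b":                 # only these segments match the sequence ab
            returns.append(0.0 + reward)  # reward of a is assumed to be 0, no discount
    return sum(returns) / len(returns)

print(naive_estimate())  # ~1/6 under this reward assumption, not the correct 0.5
```

Under this assumed reward assignment the naive estimate converges to about $1/6$; with the opposite assignment it would converge to about $5/6$. Either way, it reflects the 5:1 sampling frequency in the memory rather than the 1:1 environment dynamics.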
So what happened? In this case, we can only learn the correlation between the sequence $ab$ executed from $s_0$ and the observed return, but not their causal relationship (i.e., what expected return will we get if we take $ab$ starting from $s_0$?).
How to avoid causal confusion
For a policy $\pi$ that is able to output a variable-length sequence of actions given any state $s$, in general we need to honor the action-sequence boundaries when sampling from the memory for TD learning. That is to say, in the first figure, we can only sample subsequences that lie within a single action-sequence output of the policy, but never across the boundary between two consecutive outputs.
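Reusing the hypothetical buffer layout from the earlier sketch, honoring the boundary simply means only accepting windows whose steps all come from the same policy call:

```python
import random

def sample_within_one_output(episodes, k):
    """Sample a k-step window whose actions all come from a single policy call.

    Assumes the same per-step records as before, where `chunk_id` marks which
    action-sequence output of the policy each step belongs to.
    """
    ep = random.choice(episodes)
    valid_starts = [
        t for t in range(len(ep) - k + 1)
        if len({step["chunk_id"] for step in ep[t:t + k]}) == 1  # no boundary crossed
    ]
    if not valid_starts:  # e.g., k is longer than every policy output in this episode
        return None
    t = random.choice(valid_starts)
    window = ep[t:t + k]
    return {
        "state": window[0]["state"],
        "actions": [step["action"] for step in window],
        "rewards": [step["reward"] for step in window],
        "next_state": window[-1]["next_state"],
        "done": window[-1]["done"],
    }
```

This restriction keeps the empirical distribution of $s_{t+k}$ matched to what open-loop execution of $a_{t:t+k-1}$ at $s_t$ would actually produce; in a deterministic environment it can be dropped, as argued above.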