A two-way switch example to better understand Total Correlation

Recently, I was working on a project that requires learning a latent representation with disentangled factors for high-dimensional inputs. As a brief introduction to disentanglement: while we could use an autoencoder (AE) to compress a high-dimensional input into a compact embedding, there is usually dependence among the embedding dimensions, meaning that multiple dimensions tend to change together rather than independently. This is undesirable in many scenarios, for example:
- We’d like to train a generative model that maps a latent embedding to a high-dimensional output (e.g., an image), but meanwhile we wish to control the generation result by modifying only one dimension of the embedding at a time. (This also facilitates our interpretation of the embedding space structure.)
- We’d like to train a policy operating on the latent embedding. For training efficiency, we need to keep the action space from growing combinatorially. To do so, we make each action modify only one dimension, while the action can still result in meaningful changes in the observation space. This is similar in spirit to Independently Controllable Factors.
There are many techniques to enforce disentanglement in the representation learning literature, among which the variational autoencoder (VAE) is probably the best known. A research work in 2019 delved into the KLD term of the VAE objective and concluded that the component of the KLD most responsible for disentanglement is the total correlation (TC) among latent dimensions. (As an aside, the dimension-wise match of each latent to its prior, also contained in the KLD, can actually hinder reconstruction precision.)
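For context, this conclusion is usually stated via a decomposition of the aggregated KLD term. Here is a sketch of the decomposition commonly used in that line of work, with notation I'm assuming here ($q(z)=\mathbb{E}_{p(x)}[q(z|x)]$ denotes the aggregated posterior over the dataset):

$$\mathbb{E}_{p(x)}\!\left[D_{\mathrm{KL}}\big(q(z|x)\,\|\,p(z)\big)\right] = \underbrace{I_q(x;z)}_{\text{index-code MI}} \;+\; \underbrace{D_{\mathrm{KL}}\!\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)}_{\text{total correlation}} \;+\; \underbrace{\sum_j D_{\mathrm{KL}}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}$$

The middle term is the TC this post focuses on; the last term is the dimension-wise match to the prior mentioned in the aside above.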
Formally, let the latent embedding be $z = (z_1, z_2, \ldots, z_d)$ with joint distribution $p(z)$. The total correlation is defined as

$$\mathrm{TC}(z) = D_{\mathrm{KL}}\!\Big(p(z)\,\Big\|\,\prod_{i=1}^{d} p(z_i)\Big),$$

where $p(z_i)$ denotes the marginal distribution of the $i$-th dimension. $\mathrm{TC}(z) = 0$ if and only if all $d$ dimensions are mutually independent. There are many ways to estimate the TC given samples of $z$.
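As one concrete illustration (not the estimator from the work cited above), here is a minimal sketch that estimates TC from a minibatch of latents under the simplifying assumption that the aggregated posterior is roughly Gaussian; in that case the TC reduces to $-\tfrac{1}{2}\log\det R$, where $R$ is the correlation matrix of $z$:

```python
import numpy as np

def gaussian_tc_estimate(z: np.ndarray, eps: float = 1e-6) -> float:
    """Estimate total correlation (in nats) from a batch of latents.

    Assumes z is approximately Gaussian. For a Gaussian,
    TC = 0.5 * (sum_i log Sigma_ii - log det Sigma)
       = -0.5 * log det(correlation matrix).

    Args:
        z: array of shape [batch_size, latent_dim], samples of the latent.
        eps: small jitter for numerical stability.
    """
    z = z - z.mean(axis=0, keepdims=True)
    cov = (z.T @ z) / (len(z) - 1)            # [d, d] sample covariance
    std = np.sqrt(np.diag(cov)) + eps
    corr = cov / np.outer(std, std)           # correlation matrix R
    # slogdet is more stable than log(det(...)) for near-singular matrices
    _, logdet = np.linalg.slogdet(corr + eps * np.eye(len(corr)))
    return -0.5 * logdet

# Example: correlated 2-D latents should give TC > 0, independent ones ~ 0.
rng = np.random.default_rng(0)
a = rng.normal(size=(10_000, 1))
z_dependent = np.concatenate([a, a + 0.1 * rng.normal(size=(10_000, 1))], axis=1)
z_independent = rng.normal(size=(10_000, 2))
print(gaussian_tc_estimate(z_dependent))    # clearly positive
print(gaussian_tc_estimate(z_independent))  # close to 0
```

Note that this Gaussian sketch only sees linear dependence; it would actually report (near) zero on the switch example coming up below, which is exactly why practical TC estimators work with the full joint density instead, for example by training a discriminator to approximate the density ratio.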
Before I found the above literature on TC definition and estimation, in my project I tended to believe that I should encourage minimal mutual information (MI) between every two dimensions, i.e., $I(z_i; z_j) = 0$ for all $i \neq j$.
It’s easy to verify that $\mathrm{TC}(z) = 0$ implies $I(z_i; z_j) = 0$ for every pair, but the converse does not hold: pairwise independence does not guarantee mutual independence.
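One way to see why the converse can fail (my own rephrasing via the chain rule of mutual information) is to expand the TC of three variables:

$$\mathrm{TC}(z_1, z_2, z_3) = I(z_1; z_2) + I\big(z_3; (z_1, z_2)\big).$$

Driving all pairwise MIs to zero kills $I(z_1; z_2)$ (and its permutations), but says nothing about the higher-order term $I\big(z_3; (z_1, z_2)\big)$, which can stay strictly positive.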
But can we actually come up with an intuitive example? Below I show a simple example of three random variables where $I(z_i; z_j) = 0$ for every pair $(i, j)$, yet $\mathrm{TC} > 0$, i.e., the three variables are not mutually independent.
Suppose that we have three Bernoulli variables: two switches $A$ and $B$, and a light $L$, related by the following truth table:

| $A$ (switch) | $B$ (switch) | $L$ (light) |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
Namely, the light is on only when the two switches are not in the same state. This is called two-way switching for lighting. This design is usually used when one wants to control the light from two places that are far apart. For example, the two switches can be placed at the two ends of a long hallway, controlling the light in the middle. (I realized that there is in fact a two-way switching design in my own kitchen!)
Now suppose each switch is set independently and uniformly at random, i.e., $A, B \sim \mathrm{Bernoulli}(0.5)$ with $A \perp B$, so the four rows above are equally likely. Then $L = A \oplus B$ is also $\mathrm{Bernoulli}(0.5)$, and one can check that any two of the three variables are independent: $I(A;B) = I(A;L) = I(B;L) = 0$. Yet the three are clearly not mutually independent, since knowing any two of them determines the third; in fact $\mathrm{TC}(A,B,L) = H(A) + H(B) + H(L) - H(A,B,L) = 1 + 1 + 1 - 2 = 1$ bit.
That is, in this scenario it’s impossible to infer the state of a switch/light without looking at both remaining variables.
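To double-check the numbers above, here is a small self-contained script (my own sanity check, not from any library) that enumerates the four equally likely outcomes and computes the pairwise MIs and the TC directly from the definitions:

```python
import itertools
import math
from collections import Counter

# The four equally likely outcomes of the two-way switch: L = A XOR B.
names = ("A", "B", "L")
outcomes = [(a, b, a ^ b) for a, b in itertools.product([0, 1], repeat=2)]
p_joint = {o: 1 / len(outcomes) for o in outcomes}  # uniform over the 4 rows

def marginal(dims):
    """Marginal distribution over the given subset of variable indices."""
    m = Counter()
    for o, p in p_joint.items():
        m[tuple(o[d] for d in dims)] += p
    return m

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

H = {dims: entropy(marginal(dims)) for dims in
     [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]}

# Pairwise mutual information: I(X;Y) = H(X) + H(Y) - H(X,Y)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    mi = H[(i,)] + H[(j,)] - H[(i, j)]
    print(f"I({names[i]}; {names[j]}) = {mi:.3f} bits")  # all 0.000

# Total correlation: sum of marginal entropies minus joint entropy
tc = H[(0,)] + H[(1,)] + H[(2,)] - H[(0, 1, 2)]
print(f"TC(A, B, L) = {tc:.3f} bits")                   # 1.000
```

Every pairwise MI comes out to exactly zero, while the TC is 1 bit, matching the hand calculation above.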
What does this example imply? It implies that if I use pairwise MI as the disentanglement objective, the learned representation could still end up like $(A, B, L)$: every pair of dimensions looks independent, yet the dimensions are entangled through a higher-order dependency. The total correlation, on the other hand, is zero only when the joint distribution factorizes into the product of all marginals, so it correctly penalizes this case. That is why TC, rather than pairwise MI, is the right quantity to minimize for disentanglement.