We propose an unsupervised variational model for disentangling video into independent factors, i.e. each factors future can be predicted from its past without considering the others. We show that our approach often learns factors which are interpretable as objects in a scene.