Deep supervised neural networks trained to classify objects have emerged as popular models of computation in the primate ventral stream. These models represent information with a high-dimensional distributed population code, implying that inferotemporal (IT) responses are also too complex to interpret at the single-neuron level. We challenge this view by modelling neural responses to faces in macaque IT with a deep unsupervised generative model, the beta-variational autoencoder (beta-VAE). Unlike deep classifiers, beta-VAE disentangles sensory data into interpretable latent factors, such as gender or hair length. We found a remarkable correspondence between the generative factors discovered by the model and those coded by single IT neurons. Moreover, we were able to reconstruct face images using the signals from just a handful of cells. This suggests that the ventral visual stream may be optimising the disentangling objective, producing a neural code that is low-dimensional and semantically interpretable at the single-unit level.
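
For concreteness, the disentangling objective optimised by beta-VAE can be sketched in its standard form; the notation below follows common VAE conventions and is illustrative rather than a statement of this study's exact training configuration. The model maximises the variational lower bound

\[
\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\bigl[\log p_\theta(\mathbf{x} \mid \mathbf{z})\bigr] - \beta \, D_{\mathrm{KL}}\bigl(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\bigr),
\]

where \(\mathbf{x}\) is an input image, \(\mathbf{z}\) the latent factors, \(q_\phi\) the encoder, \(p_\theta\) the decoder, and \(p(\mathbf{z})\) an isotropic unit Gaussian prior. Setting \(\beta = 1\) recovers the standard VAE; \(\beta > 1\) strengthens the pressure on the latent posterior to match the factorised prior, which encourages the latents to code for independent, semantically interpretable generative factors.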