We discuss work extraction from classical information engines (e.g., Szilard) with $N$-particles, $q$ partitions, and initial arbitrary non-equilibrium states. In particular, we focus on their {em optimal} behaviour, which includes the measurement of a set of quantities $Phi$ with a feedback protocol that extracts the maximal average amount of work. We show that the optimal non-equilibrium state to which the engine should be driven before the measurement is given by the normalised maximum-likelihood probability distribution of a statistical model that admits $Phi$ as sufficient statistics. Furthermore, we show that the minimax universal code redundancy $mathcal{R}^*$ associated to this model, provides an upper bound to the work that the demon can extract on average from the cycle, in units of $k_{rm B}T$. We also find that, in the limit of $N$ large, the maximum average extracted work cannot exceed $H[Phi]/2$, i.e. one half times the Shannon entropy of the measurement. Our results establish a connection between optimal work extraction in stochastic thermodynamics and optimal universal data compression, providing design principles for optimal information engines. In particular, they suggest that: (i) optimal coding is thermodynamically efficient, and (ii) it is essential to drive the system into a critical state in order to achieve optimal performance.