Collective motion is found in a wide range of animal systems, active suspensions and robotic or virtual agents. It is often understood using high-level models that directly encode selected empirical features, such as co-alignment and cohesion. Can these features instead be shown to emerge from an underlying, low-level principle? We find that they emerge naturally under Future State Maximisation (FSM). Here, agents perceive a visual representation of the world around them, such as might be recorded on a simple retina, and then move to maximise the number of distinct visual environments that they expect to be able to access in the future. Such a control principle may confer evolutionary fitness in an uncertain world by enabling agents to cope with a wide variety of future scenarios. The collective dynamics that spontaneously emerge under FSM resemble animal systems in several qualitative respects, including cohesion, co-alignment and collision suppression, none of which are explicitly encoded in the model. A multi-layered neural network trained on simulated trajectories is shown to learn a heuristic that mimics FSM. Similar levels of reasoning would seem to be accessible to animal cognition, demonstrating a possible route to the emergence of collective motion in social animals directly from the control principle underlying FSM. Such models may also be good candidates for encoding into possible future realisations of artificial, intelligent matter, able to sense light, process information and move.
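
The control loop described above can be illustrated with a minimal sketch: at each step an agent enumerates short sequences of candidate actions, renders the coarse "retinal" view it would receive at each future state, and takes the first action of whichever branch reaches the greatest number of distinct views. This is not the authors' implementation; the retina resolution, action set, planning horizon and the assumption that other agents remain static over the horizon are all illustrative choices.

```python
import itertools
import numpy as np

N_PIXELS = 16              # resolution of the 1D retina (assumption)
ACTIONS = [-0.1, 0.0, 0.1] # candidate heading changes, radians (assumption)
HORIZON = 3                # planning depth in time steps (assumption)
SPEED = 1.0                # distance moved per step (assumption)

def retina(pos, heading, others):
    """Coarse visual state: which retinal pixels see another agent."""
    pixels = np.zeros(N_PIXELS, dtype=bool)
    for p in others:
        bearing = np.arctan2(p[1] - pos[1], p[0] - pos[0]) - heading
        idx = int(((bearing + np.pi) % (2 * np.pi)) / (2 * np.pi) * N_PIXELS)
        pixels[idx] = True
    return pixels.tobytes()  # hashable, so distinct views can be counted

def step(pos, heading, action):
    """Advance the agent one time step after turning by `action`."""
    heading = heading + action
    pos = pos + SPEED * np.array([np.cos(heading), np.sin(heading)])
    return pos, heading

def fsm_action(pos, heading, others):
    """Pick the first action of the branch that reaches the most
    distinct future visual states (others held static, an assumption)."""
    best_action, best_count = ACTIONS[0], -1
    for first in ACTIONS:
        states = set()
        for tail in itertools.product(ACTIONS, repeat=HORIZON - 1):
            p, h = step(pos, heading, first)
            states.add(retina(p, h, others))
            for a in tail:
                p, h = step(p, h, a)
                states.add(retina(p, h, others))
        if len(states) > best_count:
            best_action, best_count = first, len(states)
    return best_action
```

Iterating `fsm_action` for every agent in a small group, with each agent treating the others' current positions as its `others` list, is enough to observe the qualitative tendencies described in the abstract; no cohesion, alignment or collision-avoidance terms appear anywhere in the objective.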