Wireless visual sensor networks (VSNs) are expected to play a major role in future IEEE 802.15.4 personal area networks (PAN) under recently-established collision-free medium access control (MAC) protocols, such as the IEEE 802.15.4e-2012 MAC. In such environments, the VSN energy consumption is affected by the number of camera sensors deployed (spatial coverage), as well as the number of captured video frames out of which each node processes and transmits data (temporal coverage). In this paper, we explore this aspect for uniformly-formed VSNs, i.e., networks comprising identical wireless visual sensor nodes connected to a collection node via a balanced cluster-tree topology, with each node producing independent identically-distributed bitstream sizes after processing the video frames captured within each network activation interval. We derive analytic results for the energy-optimal spatio-temporal coverage parameters of such VSNs under a-priori known bounds for the number of frames to process per sensor and the number of nodes to deploy within each tier of the VSN. Our results are parametric to the probability density function characterizing the bitstream size produced by each node and the energy consumption rates of the system of interest. Experimental results reveal that our analytic results are always within 7% of the energy consumption measurements for a wide range of settings. In addition, results obtained via a multimedia subsystem show that the optimal spatio-temporal settings derived by the proposed framework allow for substantial reduction of energy consumption in comparison to ad-hoc settings. As such, our analytic modeling is useful for early-stage studies of possible VSN deployments under collision-free MAC protocols prior to costly and time-consuming experiments in the field.