Driving requires reacting to a wide variety of complex environment conditions and agent behaviors. Explicitly modeling each possible scenario is unrealistic. In contrast, imitation learning can, in theory, leverage data from large fleets of human-driven cars. Behavior cloning in particular has been successfully used to learn simple visuomotor policies end-to-end, but scaling to the full spectrum of driving behaviors remains an unsolved problem. In this paper, we propose a new benchmark to experimentally investigate the scalability and limitations of behavior cloning. We show that behavior cloning leads to state-of-the-art results, including in unseen environments, executing complex lateral and longitudinal maneuvers without these reactions being explicitly programmed. However, we confirm well-known limitations (due to dataset bias and overfitting), new generalization issues (due to dynamic objects and the lack of a causal model), and training instability requiring further research before behavior cloning can graduate to real-world driving. The code of the studied behavior cloning approaches can be found at https://github.com/felipecode/coiltraine .