To rapidly process temporal information at a low metabolic cost, biological neurons integrate inputs as an analog sum but communicate with spikes, binary events in time. Analog neuromorphic hardware uses the same principles to emulate spiking neural networks with exceptional energy-efficiency. However, instantiating high-performing spiking networks on such hardware remains a significant challenge due to device mismatch and the lack of efficient training algorithms. Here, we introduce a general in-the-loop learning framework based on surrogate gradients that resolves these issues. Using the BrainScaleS-2 neuromorphic system, we show that learning self-corrects for device mismatch resulting in competitive spiking network performance on both vision and speech benchmarks. Our networks display sparse spiking activity with, on average, far less than one spike per hidden neuron and input, perform inference at rates of up to 85 k frames/second, and consume less than 200 mW. In summary, our work sets several new benchmarks for low-energy spiking network processing on analog neuromorphic hardware and paves the way for future on-chip learning algorithms.