If NISQ-era quantum computers are to perform useful tasks, they will need to employ powerful error mitigation techniques. Quasi-probability methods can permit perfect error compensation at the cost of additional circuit executions, provided that the nature of the error model is fully understood and sufficiently local both spatially and temporally. Unfortunately, these conditions are challenging to satisfy. Here we present a method by which the proper compensation strategy can instead be learned ab initio. Our training process uses multiple variants of the primary circuit in which all non-Clifford gates are substituted with gates that are efficient to simulate classically. The process yields a configuration that is near-optimal against the noise in the real system with its non-Clifford gate set. Having presented a range of learning strategies, we demonstrate the power of the technique on both real quantum hardware (IBM devices) and exactly emulated imperfect quantum computers. These systems suffer from a range of noise severities and types, including spatially and temporally correlated variants. In all cases the protocol successfully adapts to the noise and mitigates it to a high degree.
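The idea of learning a quasi-probability compensation from classically simulable training data can be illustrated with a deliberately small toy sketch. It is not the implementation used in this work. It assumes a single qubit, a depolarizing noise channel that stands in for the unknown device noise, a correction set of the four Pauli operators, and exact expectation values rather than sampled ones. The function names (`depolarize`, `bloch_state`, `expectations`, `learn_weights`) are hypothetical. The weights are fitted by least squares on stabilizer (Clifford) inputs only, and are then applied to a non-stabilizer state.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]  # candidate corrections applied after the noisy step


def depolarize(rho, p):
    """Assumed toy noise: depolarizing channel with error probability p."""
    return (1 - p) * rho + (p / 3) * sum(P @ rho @ P for P in (X, Y, Z))


def bloch_state(x, y, z):
    """Density matrix with Bloch vector (x, y, z)."""
    return 0.5 * (I2 + x * X + y * Y + z * Z)


def expectations(rho, obs, p):
    """<obs> after the noisy step followed by each candidate Pauli correction.

    In this sketch p only drives the simulated device; the learner never reads it.
    """
    noisy = depolarize(rho, p)
    return [np.trace(obs @ P @ noisy @ P.conj().T).real for P in PAULIS]


def learn_weights(p):
    """Fit quasi-probability weights q using stabilizer training data only."""
    training = [
        (bloch_state(0, 0, 1), Z),   # |0>,  measure Z
        (bloch_state(1, 0, 0), X),   # |+>,  measure X
        (bloch_state(0, 1, 0), Y),   # |+i>, measure Y
        (bloch_state(0, 0, 1), I2),  # identity observable enforces sum(q) = 1
    ]
    A = np.array([expectations(rho, obs, p) for rho, obs in training])
    b = np.array([np.trace(obs @ rho).real for rho, obs in training])  # exact values
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    return q


p_true = 0.05                      # noise strength of the simulated device
q = learn_weights(p_true)

# Apply the learned weights to a non-stabilizer input state.
theta, phi = 0.7, 1.1
rho = bloch_state(np.sin(theta) * np.cos(phi),
                  np.sin(theta) * np.sin(phi),
                  np.cos(theta))
ideal = np.trace(X @ rho).real
raw = expectations(rho, X, p_true)[0]                     # no correction
mitigated = float(np.dot(q, expectations(rho, X, p_true)))

print("learned weights:", np.round(q, 4))
print("ideal, raw, mitigated:", ideal, raw, mitigated)
```

In this toy setting the learned weights contain negative entries, which is what makes them a quasi-probability rather than a probability distribution. On hardware the weighted sum would be estimated by sampling corrections with probability proportional to the absolute value of each weight and carrying the sign into the average, which is the source of the additional circuit executions.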