Complex dynamical systems are used for predictions in many domains. Because of computational costs, models are truncated, coarsened, or aggregated. As the neglected and unresolved terms become important, the utility of model predictions diminishes. We develop a novel, versatile, and rigorous methodology to learn non-Markovian closure parameterizations for known-physics/low-fidelity models using data from high-fidelity simulations. The new neural closure models augment low-fidelity models with neural delay differential equations (nDDEs), motivated by the Mori-Zwanzig formulation and the inherent delays in complex dynamical systems. We demonstrate that neural closures efficiently account for truncated modes in reduced-order-models, capture the effects of subgrid-scale processes in coarse models, and augment the simplification of complex biological and physical-biogeochemical models. We find that using non-Markovian over Markovian closures improves long-term prediction accuracy and requires smaller networks. We derive adjoint equations and network architectures needed to efficiently implement the new discrete and distributed nDDEs, for any time-integration schemes and allowing nonuniformly-spaced temporal training data. The performance of discrete over distributed delays in closure models is explained using information theory, and we find an optimal amount of past information for a specified architecture. Finally, we analyze computational complexity and explain the limited additional cost due to neural closure models.