Recent literature established that neural networks can represent good policies across a range of stochastic dynamic models in supply chain and logistics. We incorporate variance reduction techniques in a newly proposed algorithm, to overcome limitations of the model-free algorithms typically employed to learn such neural network policies. For the classical lost sales inventory model, the algorithm learns neural network policies that are superior to those learned using model-free algorithms, while outperforming the best heuristic benchmarks by an order of magnitude. The algorithm is an interesting candidate to apply to other stochastic dynamic problems in supply chain and logistics, because the ideas in its development are generic.