In the domain of social signal processing, audio event detection is a promising avenue for accessing daily behaviors that contribute to health and well-being. However, despite advances in mobile computing and machine learning, audio behavior detection models are largely constrained to data collected in controlled settings, such as call centers. This is problematic as it means their performance is unlikely to generalize to real-world applications. In this paper, we present a novel dataset of infant distress vocalizations compiled from over 780 hours of real-world audio data, collected via recorders worn by infants. We develop a model that combines deep spectrum and acoustic features to detect and classify infant distress vocalizations, which dramatically outperforms models trained on equivalent real-world data (F1 score of 0.630 vs 0.166). We end by discussing how dataset size can facilitate such gains in accuracy, critical when considering noisy and complex naturalistic data.