Identifying governing equations and physically relevant information from high-dimensional observational data has long been a challenge in the physical sciences. With recent advances in sensing technology and the growing availability of datasets, various machine learning techniques have made it possible to distill mathematical models from sufficiently clean and usable data. However, most of these techniques rely on prior knowledge of the system and on noise-free data obtained either by simulating the physical system or by measuring its signals directly. The resulting inferences are therefore often too unreliable for real-world use, where observed data is noisy and feature engineering is needed to extract the relevant features. In this work, we present a deep-learning framework that extracts relevant information from real-world videos of highly stochastic systems with no prior knowledge of the system, and distills the underlying governing equation that represents it. We demonstrate the approach on videos of confined multi-agent/particle systems of ants, termites, and fish, as well as on a simulated confined multi-particle system with elastic collision interactions. Furthermore, we explore how these seemingly diverse systems exhibit predictable underlying behavior. We use computer vision and motion tracking to extract the spatial trajectories of individual agents/particles in a system, project these features onto a low-dimensional latent space with a long short-term memory variational autoencoder (LSTM VAE), and extract the differential equation representing the data from that latent space using the sparse identification of nonlinear dynamics (SINDy) framework.
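To make the final step of the pipeline concrete, the following is a minimal sketch of the sparse regression idea behind SINDy (sequentially thresholded least squares), written in plain NumPy. It is not the implementation used in this work: the two-dimensional synthetic trajectory stands in for latent variables that an LSTM VAE would produce, and the function names `build_library` and `stlsq`, the polynomial candidate library, and the threshold value are all assumptions chosen for illustration.

```python
import numpy as np

def build_library(z):
    """Candidate terms [1, z1, z2, z1^2, z1*z2, z2^2] for a 2-D state (assumed library)."""
    z1, z2 = z[:, 0], z[:, 1]
    return np.column_stack([np.ones_like(z1), z1, z2, z1**2, z1 * z2, z2**2])

def stlsq(theta, dz, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dz, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dz.shape[1]):
            big = ~small[:, k]
            if big.any():
                # Refit only the terms that survived thresholding.
                xi[big, k] = np.linalg.lstsq(theta[:, big], dz[:, k], rcond=None)[0]
    return xi

# Synthetic stand-in for a latent trajectory (NOT data from this work):
# a damped oscillator with dz1/dt = -0.1*z1 + 2*z2 and dz2/dt = -2*z1 - 0.1*z2.
t = np.linspace(0.0, 10.0, 2001)
decay = np.exp(-0.1 * t)
z = np.column_stack([decay * np.cos(2 * t), -decay * np.sin(2 * t)])

# Estimate time derivatives numerically, as one would for a learned latent series.
dz = np.gradient(z, t, axis=0, edge_order=2)

# Rows of xi follow the library order; columns are the equations for z1 and z2.
xi = stlsq(build_library(z), dz)
print(np.round(xi, 3))
```

Each column of `xi` holds the coefficients of one recovered equation, so the nonzero entries indicate which candidate terms the regression retains. With noisy latent variables, the derivative estimate and the threshold would likely need more care than this sketch gives them.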