N-body simulations are essential tools in physical cosmology to understand the large-scale structure (LSS) formation of the Universe. Large-scale simulations with high resolution are important for exploring the substructure of universe and for determining fundamental physical parameters like neutrino mass. However, traditional particle-mesh (PM) based algorithms use considerable amounts of memory, which limits the scalability of simulations. Therefore, we designed a two-level PM algorithm CUBE towards optimal performance in memory consumption reduction. By using the fixed-point compression technique, CUBE reduces the memory consumption per N-body particle toward 6 bytes, an order of magnitude lower than the traditional PM-based algorithms. We scaled CUBE to 512 nodes (20,480 cores) on an Intel Cascade Lake based supercomputer with $simeq$95% weak-scaling efficiency. This scaling test was performed in Cosmo-$pi$ -- a cosmological LSS simulation using $simeq$4.4 trillion particles, tracing the evolution of the universe over $simeq$13.7 billion years. To our best knowledge, Cosmo-$pi$ is the largest completed cosmological N-body simulation. We believe CUBE has a huge potential to scale on exascale supercomputers for larger simulations.