Most deep network methods for compressive sensing reconstruction suffer from the black-box characteristic of DNN. In this paper, a deep neural network with interpretable motion estimation named CSMCNet is proposed. The network is able to realize high-quality reconstruction of video compressive sensing by unfolding the iterative steps of optimization based algorithms. A DNN based, multi-hypothesis motion estimation module is designed to improve the reconstruction quality, and a residual module is employed to further narrow down the gap between re-construction results and original signal in our proposed method. Besides, we propose an interpolation module with corresponding training strategy to realize scalable CS reconstruction, which is capable of using the same model to decode various compression ratios. Experiments show that a PSNR of 29.34dB can be achieved at 2% CS ratio (compressed by 98%), which is superior than other state-of-the-art methods. Moreover, the interpolation module is proved to be effective, with significant cost saving and acceptable performance losses.