Model interpretability has become important for engendering appropriate user trust by providing insight into model predictions. However, most existing machine learning methods for depression prediction offer no interpretability, so their predictions remain opaque to humans. In this work, we propose an interpretable Multi-Modal Depression Detection with Hierarchical Attention Network (MDHAN) for detecting depressed users on social media and explaining the model's predictions. We consider user posts together with Twitter-based multi-modal features. Specifically, we encode user posts using two levels of attention, applied at the tweet level and the word level, to compute the importance of each tweet and each word and to capture semantic sequence features from user timelines (posts). Our experiments show that MDHAN outperforms several popular and robust baseline methods, demonstrating the effectiveness of combining deep learning with multi-modal features. We also show that our model helps improve predictive performance when detecting depression in users who post messages publicly on social media. MDHAN achieves excellent performance and provides adequate evidence to explain its predictions.
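
To make the two-level attention idea concrete, the following is a minimal illustrative sketch, not the MDHAN implementation. It assumes a simplified scoring rule (a dot product between a context vector and the tanh of each input vector), random stand-in embeddings, and no recurrent encoders or multi-modal features. The names `attention_pool`, `word_context`, and `tweet_context` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Score each vector against a context vector, then return the
    attention-weighted sum together with the attention weights."""
    scores = [
        sum(c * math.tanh(x) for c, x in zip(context, v)) for v in vectors
    ]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [
        sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)
    ]
    return pooled, weights


random.seed(0)
DIM, N_TWEETS, N_WORDS = 4, 3, 5

# Stand-in data (assumption): one user timeline of N_TWEETS tweets,
# each tweet a list of N_WORDS random word embeddings.
timeline = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_WORDS)]
    for _ in range(N_TWEETS)
]

# Hypothetical context vectors; in a trained model these would be learned.
word_context = [random.gauss(0, 1) for _ in range(DIM)]
tweet_context = [random.gauss(0, 1) for _ in range(DIM)]

# Level 1: word-level attention pools the words of each tweet into a tweet vector.
tweet_vectors, word_weights = [], []
for tweet in timeline:
    vec, weights = attention_pool(tweet, word_context)
    tweet_vectors.append(vec)
    word_weights.append(weights)

# Level 2: tweet-level attention pools the tweet vectors into a user vector.
user_vector, tweet_weights = attention_pool(tweet_vectors, tweet_context)
```

In this sketch, `word_weights` and `tweet_weights` play the role of the importance scores that can be inspected to see which words and tweets contributed most to the user representation.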