With recent advances in MOOCs, current e-learning systems alleviate the barriers of time differences and geographic separation between teachers and students. However, these systems lack supervision: an e-learner's learning unit state (LUS) cannot be monitored automatically. In this paper, we present a fusion framework that considers three data channels: 1) videos/images from a camera, 2) eye movement information tracked by a low-resolution eye tracker, and 3) mouse movements. Based on these data modalities, we propose a novel multi-channel data fusion approach to learning unit state recognition. We also propose a method for building a learning state recognition model that avoids manually labeling image data. The experiments were carried out on an online learning prototype system that we designed, and we chose CART, Random Forest, and GBDT regression models to predict the e-learners' learning state. The results show that the multi-channel data fusion model achieves better recognition performance than the single-channel models. In addition, the best recognition performance is reached when image, eye movement, and mouse movement features are all fused.