High-speed trains (HSTs) are being widely deployed around the world. To meet the high-rate data transmission requirements on HSTs, millimeter wave (mmWave) HST communications have drawn increasing attention. To achieve sufficient link margin, mmWave HST systems employ directional beamforming with large antenna arrays, which makes channel estimation rather time-consuming. In HST scenarios, channel conditions vary quickly, so channel estimation must be performed frequently. Since each transmission time interval (TTI) is too short to allocate enough time for accurate channel estimation, the key challenge is to design an efficient beam searching scheme that leaves more time for data transmission. Motivated by the successful applications of machine learning, this paper exploits the similarities between current and historical wireless propagation environments. Drawing on reinforcement learning, the beam searching problem of mmWave HST communications is formulated as a multi-armed bandit (MAB) problem, and a bandit-inspired beam searching scheme is proposed to reduce the number of measurements as much as possible. Unlike popular deep learning methods, the proposed scheme does not need to collect and store a massive amount of training data in advance, which saves substantial resources such as storage space, computing time, and energy. Moreover, the performance of the proposed scheme is analyzed in terms of regret. The regret analysis indicates that the proposed scheme can approach the theoretical limit very quickly, which is further verified by simulation results.
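To make the MAB formulation concrete, the following minimal sketch treats each candidate beam as an arm and each beam measurement as a pull. It uses the standard UCB1 rule as a stand-in, since the abstract does not specify the proposed scheme; the function names, the toy Bernoulli channel, and the per-beam mean gains are all hypothetical and chosen only for illustration.

```python
import math
import random


def make_toy_channel(mean_gains, seed=0):
    """Hypothetical channel: measuring beam b returns 1.0 with probability mean_gains[b]."""
    rng = random.Random(seed)

    def measure(beam):
        return 1.0 if rng.random() < mean_gains[beam] else 0.0

    return measure


def ucb1_beam_search(measure, num_beams, num_rounds):
    """Select one beam per round with UCB1 (a standard bandit rule, assumed here).

    Returns (counts, means): how often each beam was measured and its
    running average reward.
    """
    counts = [0] * num_beams
    means = [0.0] * num_beams
    for t in range(1, num_rounds + 1):
        if t <= num_beams:
            # Initialization: measure every beam once.
            beam = t - 1
        else:
            # Pick the beam with the largest upper confidence bound.
            beam = max(
                range(num_beams),
                key=lambda b: means[b] + math.sqrt(2.0 * math.log(t) / counts[b]),
            )
        reward = measure(beam)
        counts[beam] += 1
        # Incremental update of the running mean reward.
        means[beam] += (reward - means[beam]) / counts[beam]
    return counts, means


if __name__ == "__main__":
    gains = [0.2, 0.3, 0.9, 0.4]  # assumed mean normalized gains per beam
    counts, means = ucb1_beam_search(make_toy_channel(gains, seed=1), len(gains), 2000)
    best = max(gains)
    # Pseudo-regret: expected reward lost by not always using the best beam.
    regret = sum(n * (best - g) for n, g in zip(counts, gains))
    print("measurements per beam:", counts)
    print("pseudo-regret:", round(regret, 1))
```

In this toy setting the measurements concentrate on the beam with the highest mean gain, and the pseudo-regret accumulated on the other beams grows slowly with the number of rounds, which is the sense in which regret is used as a performance measure.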