Federated learning is an effective approach to collaborative learning among edge devices without exchanging raw data. In practice, these devices may connect to local hubs instead of connecting directly to the global server (aggregator). Given the (possibly limited) computation capability of these local hubs, it is reasonable to assume that they can perform simple averaging operations. A natural question is whether such local averaging is beneficial under different system parameters, and how much gain it provides compared to the case without such averaging. In this paper, we study hierarchical federated learning with stochastic gradient descent (HF-SGD) and provide a thorough theoretical analysis of its convergence behavior. In particular, we first consider two-level HF-SGD (one level of local averaging) and then extend the result to an arbitrary number of levels (multiple levels of local averaging). The analysis characterizes the impact of local averaging precisely as a function of the system parameters. Because global averaging incurs a higher communication cost, we propose a strategy of decreasing the global averaging frequency and increasing the local averaging frequency. Experiments validate the theoretical analysis and demonstrate the advantages of HF-SGD.
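
To make the averaging schedule concrete, the following is a minimal Python sketch of two-level HF-SGD on a toy problem. It is an illustration under stated assumptions, not the authors' implementation: the function name `hf_sgd`, the parameters `local_period` and `global_period`, and the synthetic quadratic objectives f_i(x) = 0.5 * ||x - c_i||^2 are all hypothetical choices made for this example, and `global_period` is assumed to be a multiple of `local_period`. The sketch only shows how local (hub-level) and global (server-level) averaging interleave with local SGD steps; it does not reproduce the convergence analysis or the experiments.

```python
import numpy as np

def hf_sgd(centers, local_period, global_period, steps,
           lr=0.1, noise=0.0, seed=0):
    """Two-level hierarchical SGD on toy quadratics (illustrative only).

    centers: array of shape (num_hubs, workers_per_hub, dim); worker (h, w)
             minimizes the assumed objective 0.5 * ||x - centers[h, w]||^2.
    local_period:  hub-level averaging happens every `local_period` steps.
    global_period: server-level averaging happens every `global_period` steps.
    Returns the average of all worker models after the last step.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros_like(centers, dtype=float)  # one model per worker
    for t in range(1, steps + 1):
        # Stochastic gradient of each worker's quadratic, with additive noise.
        grad = (x - centers) + noise * rng.standard_normal(x.shape)
        x = x - lr * grad
        if t % global_period == 0:
            # Global averaging: every worker receives the mean over all workers.
            x[:] = x.mean(axis=(0, 1), keepdims=True)
        elif t % local_period == 0:
            # Local averaging: each hub averages only its own workers.
            x[:] = x.mean(axis=1, keepdims=True)
    return x.mean(axis=(0, 1))

# Example: 2 hubs, 3 workers per hub, 2-dimensional models.
centers = np.arange(12, dtype=float).reshape(2, 3, 2)
model = hf_sgd(centers, local_period=2, global_period=8, steps=200, noise=0.01)
print(model)
```

Choosing a large `global_period` together with a small `local_period` corresponds to the strategy of infrequent global averaging and frequent local averaging described above.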