Polar codes are a class of linear block codes that provably achieve channel capacity. They have been selected as a coding scheme for the control channel of the enhanced mobile broadband (eMBB) scenario in $5^{\text{th}}$ generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works targeting improved throughput for successive-cancellation (SC) decoding of polar codes are semi-parallel implementations that exploit special maximum-likelihood (ML) nodes. In this work, we present a new fast simplified SC (Fast-SSC) decoder architecture. Compared to a baseline Fast-SSC decoder, our solution reduces the memory requirements. We achieve this through more efficient memory utilization, which also enables the execution of multiple operations in a single clock cycle. Finally, we propose new special node merging techniques that further improve the throughput, and detail a new Fast-SSC-based decoder architecture that supports the merged operations. The proposed decoder reduces the operation sequence requirement by up to $39\%$, which reduces the number of time steps needed to decode a codeword by $35\%$. ASIC implementation results in 65 nm TSMC technology show that the proposed decoder achieves a throughput improvement of up to $31\%$ compared to previous Fast-SSC decoder architectures.