Layered decoding is well appreciated in Low-Density Parity-Check (LDPC) decoder implementation since it can achieve effectively high decoding throughput with low computation complexity. This work, for the first time, addresses low complexity column-layered decoding schemes and VLSI architectures for multi-Gb/s applications. At first, the Min-Sum algorithm is incorporated into the column-layered decoding. Then algorithmic transformations and judicious approximations are explored to minimize the overall computation complexity. Compared to the original column-layered decoding, the new approach can reduce the computation complexity in check node processing for high-rate LDPC codes by up to 90% while maintaining the fast convergence speed of layered decoding. Furthermore, a relaxed pipelining scheme is presented to enable very high clock speed for VLSI implementation. Equipped with these new techniques, an efficient decoder architecture for quasi-cyclic LDPC codes is developed and implemented with 0.13um CMOS technology. It is shown that a decoding throughput of nearly 4 Gb/s at maximum of 10 iterations can be achieved for a (4096, 3584) LDPC code. Hence, this work has facilitated practical applications of column-layered decoding and particularly made it very attractive in high-speed, high-rate LDPC decoder implementation.