Polar codes have recently emerged as one of the most favorable capacity-achieving error correction codes due to their low encoding and decoding complexity. However, because practical applications require long code lengths, the few existing successive cancellation (SC) decoder implementations still suffer from both high hardware cost and long decoding latency. In this paper, a data-flow graph (DFG) for the SC decoder is derived. A complete hardware architecture is first developed for the conventional tree SC decoder, and its feedback part is presented next. The pre-computation look-ahead technique is exploited to reduce the minimum achievable decoding latency. Substructure sharing is used to design a merged processing element (PE) with higher hardware utilization. To meet the throughput requirements of a diverse set of application scenarios, a systematic approach to constructing different overlapped SC polar decoder architectures is also presented. Compared with a conventional
-bit tree SC decoder, the proposed overlapped architectures can achieve a speedup of up to
times with only
merged PEs. The proposed pre-computation approach leads to a 50% reduction in latency for
, and a 40% reduction for
.
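The following is a minimal Python sketch of SC decoding with min-sum f/g updates. It is meant to show what the pre-computation look-ahead idea does to latency. It is not the paper's hardware architecture, and it rests on several assumptions:

- Encoding uses the natural (non-bit-reversed) order, x = u F^{⊗n}.
- Timing follows a simplified model in which each f or g stage of the decoding tree costs one clock cycle.
- Hard decisions and multiplexers are treated as free.

The names polar_encode, sc_decode, and the precompute flag are hypothetical. When precompute=True, both g candidates (for partial-sum bits 0 and 1) are computed in the same cycle as f. The correct candidate is then selected once the left subtree's partial sums are known. Under this model the latency drops from 2N-2 to N-1 cycles, which is one way to see how a 50% latency reduction can arise.

```python
import math
from typing import List, Set, Tuple


def f_minsum(a: float, b: float) -> float:
    # Min-sum approximation of the check-node ("f") LLR update.
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def g_update(a: float, b: float, u: int) -> float:
    # Variable-node ("g") LLR update given partial-sum bit u: b + (1 - 2u) * a.
    return b + a if u == 0 else b - a


def polar_encode(u: List[int]) -> List[int]:
    # Natural-order polar encoding x = u * F^{(x)n}, with F = [[1, 0], [1, 1]].
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    x1, x2 = polar_encode(u[:half]), polar_encode(u[half:])
    return [p ^ q for p, q in zip(x1, x2)] + x2


def sc_decode(llr: List[float], frozen: Set[int], offset: int = 0,
              precompute: bool = False) -> Tuple[List[int], List[int], int]:
    """Return (u_hat, partial sums, cycles) under a one-cycle-per-stage model."""
    n = len(llr)
    if n == 1:
        # Frozen bits are forced to 0; otherwise take a hard decision on the LLR.
        bit = 0 if (offset in frozen or llr[0] >= 0) else 1
        return [bit], [bit], 0
    half = n // 2
    a, b = llr[:half], llr[half:]
    left_llr = [f_minsum(x, y) for x, y in zip(a, b)]
    cycles = 1  # f stage
    if precompute:
        # Look-ahead: both g candidates are computed in the same cycle as f.
        g0 = [g_update(x, y, 0) for x, y in zip(a, b)]
        g1 = [g_update(x, y, 1) for x, y in zip(a, b)]
    u_l, v_l, c_l = sc_decode(left_llr, frozen, offset, precompute)
    if precompute:
        # Select the precomputed candidate using the left subtree's partial sums.
        right_llr = [g1[i] if v_l[i] else g0[i] for i in range(half)]
    else:
        right_llr = [g_update(x, y, v) for x, y, v in zip(a, b, v_l)]
        cycles += 1  # separate g stage
    u_r, v_r, c_r = sc_decode(right_llr, frozen, offset + half, precompute)
    partial_sums = [p ^ q for p, q in zip(v_l, v_r)] + v_r
    return u_l + u_r, partial_sums, cycles + c_l + c_r
```

Usage example, assuming N = 8 with frozen positions {0, 1, 2, 4} and noiseless BPSK LLRs. Both schedules decode the same bits. The conventional schedule reports 2N-2 = 14 cycles and the pre-computation schedule reports N-1 = 7.

```python
u = [0, 0, 0, 1, 0, 1, 1, 0]
llr = [4.0 if bit == 0 else -4.0 for bit in polar_encode(u)]
print(sc_decode(llr, {0, 1, 2, 4})[2], sc_decode(llr, {0, 1, 2, 4}, precompute=True)[2])
```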