
Even with generational improvements in DRAM technology, memory access latency remains the major bottleneck for application accelerators, primarily because memory interface IPs cannot fully account for variations in target applications, the algorithms used, and accelerator architectures. Since developing memory controllers for different applications is time-consuming, this paper introduces a modular and programmable memory controller that can be configured for different target applications on available hardware resources. The proposed memory controller efficiently supports cache-line accesses along with bulk memory transfers. The user can configure the controller depending on the available logic resources on the FPGA, the memory access pattern, and the external memory specifications. The modular design supports various memory access optimization techniques, including request scheduling, internal caching, and direct memory access. These techniques contribute to reducing the overall latency while maintaining high sustained bandwidth. We implement the system on a state-of-the-art FPGA and evaluate its performance using two widely studied domains: graph analytics and deep learning workloads. We show that overall memory access time improves by up to 58% on CNN and GCN workloads compared with commercial memory controller IPs.
The hash table is a fundamental data structure for quick search and retrieval of data. It is a key component in complex graph analytics and AI/ML applications. State-of-the-art parallel hash table implementations either make simplifying assumptions, such as supporting only a subset of hash table operations, or employ optimizations that lead to performance that is highly data dependent and, in the worst case, can be similar to that of a sequential implementation. In contrast, in this work we develop a dynamic hash table that supports all the hash table queries (search, insert, delete, and update) while supporting p parallel queries (p>1) per clock cycle via p processing engines (PEs) in the worst case, i.e., the performance is data agnostic. We achieve this by implementing novel XOR-based multi-ported block memories on FPGAs. Additionally, we develop a technique to optimize the memory requirement of the hash table if the ratio of search to insert/update/delete queries is known beforehand. We implement our design on state-of-the-art FPGA devices. Our design is scalable to 16 PEs and supports throughput up to 5926 MOPS. It matches the throughput of the state-of-the-art hash table design, FASTHash, which only supports search and insert operations. Compared with the best FPGA design that supports the same set of operations, our hash table achieves up to 12.3x speedup.
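The abstract does not describe how its XOR-based multi-ported memories are built, so the following is only a minimal software sketch of the general XOR technique for emulating several write ports, not the authors' design. The class name `XorMultiPortMemory`, the method names, and the assumption that ports write to distinct addresses within a cycle are all hypothetical choices made for illustration. Each write port owns one bank; a write stores the value XORed with the other banks' words at that address, so XORing all banks on a read cancels the stale words and returns the latest value.

```python
from functools import reduce
from operator import xor


class XorMultiPortMemory:
    """Logical model of an XOR-based memory with several write ports.

    Illustrative only: a hardware version would also replicate banks
    so that every physical block RAM needs just simple ports.
    """

    def __init__(self, depth, write_ports):
        # One bank per write port; all banks start at zero.
        self.banks = [[0] * depth for _ in range(write_ports)]

    def read(self, addr):
        # XOR of every bank's word recovers the most recently written value.
        return reduce(xor, (bank[addr] for bank in self.banks), 0)

    def write_cycle(self, writes):
        """Apply one cycle of writes given as {port: (addr, value)}.

        Assumption: ports target distinct addresses within a cycle.
        """
        addrs = [addr for addr, _ in writes.values()]
        if len(addrs) != len(set(addrs)):
            raise ValueError("address conflict within one cycle")
        # Encode every word from the pre-cycle state, then commit together.
        encoded = {}
        for port, (addr, value) in writes.items():
            others = reduce(
                xor,
                (bank[addr] for i, bank in enumerate(self.banks) if i != port),
                0,
            )
            encoded[port] = (addr, value ^ others)
        for port, (addr, word) in encoded.items():
            self.banks[port][addr] = word


mem = XorMultiPortMemory(depth=8, write_ports=2)
mem.write_cycle({0: (3, 0xAB), 1: (5, 0xCD)})  # two writes in one cycle
mem.write_cycle({1: (3, 0x11)})                # overwrite address 3 via port 1
```

In this model each port only ever writes its own bank, which is why several writes can proceed in the same cycle without contending for a single write port.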
