Proceedings Article, DOI: 10.1109/DSD53832.2021.00033
Near-Data-Processing Architectures Performance Estimation and Ranking using Machine Learning Predictors
Veronia Iskandar,Mohamed A. Abd El Ghany,Diana Goehringer +2 more
- 01 Sep 2021
- pp 158-165
TL;DR: In this paper, a machine learning framework is proposed to predict the performance of near-data processing (NDP) applications on 3D-stacked DRAM systems, based on an input set of application characteristics.
Abstract: The near-data processing (NDP) paradigm has emerged as a promising solution for the memory wall challenges of future computing architectures. Modern 3D-stacked DRAM systems can be exploited to prevent unnecessary data movement between the main memory and the CPU. To date, no standardized simulation frameworks or benchmarks are available for the systematic evaluation of NDP systems. Identifying which type of high-performance 3D memory is suitable to use in an NDP system remains a challenge. This is mainly due to the fact that understanding the interactions between modern workloads and the memory subsystem is not a trivial task. Each memory type has its advantages and drawbacks. Additionally, memory access patterns vary greatly across applications. As a result, the performance of a given application on a given memory type is difficult to predict intuitively. There is no specific memory type that can effectively provide high performance for all applications. In this work, we propose a machine learning framework that can efficiently decide which NDP system is suitable for an application. The framework relies on performance prediction based on an input set of application characteristics. For each NDP system we are examining, we build a machine learning model that can accurately predict the performance of previously unseen applications on this system. Our models are on average 200x faster than architectural simulation. They can accurately predict performance with coefficients of determination ranging between 0.88 and 0.92, and root mean square errors ranging between 0.08 and 0.19.
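The two accuracy metrics quoted in the abstract, and the ranking step the framework is built around, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the system names in the usage note, and the assumption that a higher predicted value means better performance are all hypothetical.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def rank_systems(predicted):
    """Order system names by predicted performance, best first.

    Assumption for this sketch: a higher predicted value is better.
    """
    return sorted(predicted, key=predicted.get, reverse=True)
```

Given one prediction per candidate system for a new application, for example `rank_systems({"system_a": 0.5, "system_b": 0.9})`, the first name in the returned list is the system the predictors consider most suitable. The quality of each per-system predictor would be checked beforehand by comparing its predictions against simulated results with `r2_score` and `rmse`.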
Citations
AI/ML Algorithms and Applications in VLSI Design and Technology
Deepthi Amuru,Harsha V. Vudumula,Pavan K. Cherupally,Sushanth R. Gurram,Amir Ahmad,Andleeb Zahra,Ziaee Pour Abbas +6 more
TL;DR: In this article, the authors discuss the future scope of AI/ML applications to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations.
NDP-RANK: Prediction and ranking of NDP systems performance using machine learning
TL;DR: In this article, a machine learning framework is proposed to decide which near-data processing (NDP) system is suitable for an application based on an input set of application and micro-architecture characteristics.
Performance Estimation and Prototyping of Reconfigurable Near-Memory Computing Systems
Veronia Iskandar,Mohamed A. Abd El Ghany,Diana Goehringer +2 more
- 04 Sep 2023
TL;DR: This work addresses the research questions of how to leverage the full bandwidth of 3D-stacked high-bandwidth memory and how to facilitate the adoption of the near-memory computing paradigm.
Auto-DOK: Compiler-Assisted Automatic Detection of Offload Kernels for FPGA-HBM Architectures
Veronia Iskandar,Mohamed A. Abd El Ghany,Diana Goehringer +2 more
- 06 Sep 2023
TL;DR: Auto-DOK is a compiler-assisted tool for identifying offload kernels for FPGA-HBM architectures. It analyzes application code based on hardware design goals and automatically identifies kernels suitable for offloading.
An End-to-End Framework for Compiling Dense and Sparse Matrix-Vector Multiplications for FPGA-HBM Acceleration
Veronia Iskandar,Mohamed A. Abd El Ghany,Diana Goehringer +2 more
Abstract: The bandwidth improvement provided by high-bandwidth memory (HBM) and the capability of FPGAs to customize the processing and memory hierarchy result in a considerable performance increase for memory-intensive workloads such as graph processing, sorting, machine learning, and database analytics. Modern systems integrating 3D-stacked DRAM memory can be leveraged to realize the Near-Memory Computing (NMC) paradigm by offloading some computations to accelerators placed near the HBM. Matrix-vector multiplication (MVM) kernels, which are memory-bound, can significantly benefit from being executed on FPGA-HBM platforms. MVM kernels can be broadly categorized into two types: dense (General Matrix-Vector Multiplication, GEMV) and sparse (Sparse Matrix-Vector Multiplication, SpMV). Recent literature has predominantly focused on optimizing SpMV for FPGA-HBM, leaving a unified solution relatively unexplored. In this work, we introduce an end-to-end framework for compiling MVM kernels for FPGA-HBM acceleration. It consists of a software component and a hardware component. The software component introduces the MATIO compiler, a novel toolflow for detecting MVM and matrix multiplication (MM) kernels in C or C++ code, and replacing MVM kernels with a call to our FPGA accelerator. MATIO is capable of detecting 90% of MVM and MM kernels in real-world benchmarks collected from GitHub. Additionally, it is faster than state-of-the-art detection methods by at least 45x. On the hardware side, we introduce VecMADS, a novel FPGA architecture designed to efficiently handle both GEMV and SpMV operations. Our architecture leverages the high memory bandwidth of HBM to overcome memory bottlenecks, providing a comprehensive solution for accelerating matrix-vector multiplication on FPGAs. Evaluation results show that VecMADS delivers 1.5x higher throughput and 4.8x higher energy efficiency compared to the cuSPARSE library on a GPU. Considering dense benchmarks, VecMADS achieves 1.26x higher throughput than the hipBLAS library running on a GPU.
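The distinction between the two kernel types named in the abstract can be illustrated with a small software sketch. It assumes the compressed sparse row (CSR) layout for the sparse case; the abstract does not say which sparse format VecMADS uses, and the function names here are hypothetical.

```python
def gemv(matrix, x):
    """Dense matrix-vector product: every element of every row is read."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def dense_to_csr(matrix):
    """Convert a dense row-major matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in the values array
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product: only stored nonzeros are read."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access into x is what makes SpMV access patterns irregular
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

Both routines return the same result for the same matrix. GEMV streams the matrix with a regular access pattern, while SpMV trades less data movement for indirect, data-dependent reads of the input vector, which is one reason the two cases are usually optimized separately.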
References
Journal Article
Scikit-learn: Machine Learning in Python
Fabian Pedregosa,Gaël Varoquaux,Alexandre Gramfort,Vincent Michel,Bertrand Thirion,Olivier Grisel,Mathieu Blondel,Peter Prettenhofer,Ron Weiss,Vincent Dubourg,Jake Vanderplas,Alexandre Passos,David Cournapeau,Matthieu Brucher,Matthieu Perrot,Edouard Duchesnay +15 more
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
LLVM: a compilation framework for lifelong program analysis & transformation
Chris Lattner,Vikram Adve +1 more
- 20 Mar 2004
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
The gem5 simulator
Nathan Binkert,Bradford M. Beckmann,Gabriel Black,Steven K. Reinhardt,Ali G. Saidi,Arkaprava Basu,Joel Hestness,Derek R. Hower,Tushar Krishna,Somayeh Sardashti,Rathijit Sen,Korey Sewell,Muhammad Shoaib,Nilay Vaish,Mark D. Hill,Darien Wood +15 more
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
A scalable processing-in-memory accelerator for parallel graph processing
Junwhan Ahn,Sungpack Hong,Sungjoo Yoo,Onur Mutlu,Kiyoung Choi +4 more
- 13 Jun 2015
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Ramulator: A Fast and Extensible DRAM Simulator
TL;DR: This paper presents Ramulator, a fast and cycle-accurate DRAM simulator that is built from the ground up for extensibility, and is able to provide out-of-the-box support for a wide array of DRAM standards.