Journal Article | DOI: 10.1109/ICCD56317.2022.00084
A Highly-Efficient Error Detection Technique for General Matrix Multiplication using Tiled Processing on SIMD Architecture
4
TL;DR: This article proposes an algorithm-based fault tolerance (ABFT) approach to detect silent, hardware-induced errors in general matrix multiplication (GEMM) computations.
Abstract: General Matrix Multiplication (GEMM) is instrumental in a myriad of scientific, high-performance computing, and machine learning applications such as computer vision, recommendation models, and weather forecasts. Making GEMM computations fail-safe is vital in safety-critical and high-precision applications. Companies like Meta and Google have recently reported sporadic silent errors in GEMM computations traced to hardware sources. Silent errors are hard to detect and require specialized solutions. Hardware redundancy approaches such as double or triple modular redundancy effectively detect or correct such errors, but they incur a large area and power overhead. Algorithm-based Fault Tolerance (ABFT) has been shown to offer an effective alternative at a far lower overhead. Modern CPUs feature advanced vector extensions (AVX) capable of executing SIMD instructions. This paper describes a new ABFT approach designed to take advantage of the AVX feature. Our core algorithm relies on the classical tile-based outer-product approach but enhances the standard checksum calculation using a tile vector. The implementation parameters are fine-tuned to fit the available number of AVX registers. Our results indicate that we can achieve 100% error detection in GEMM at an overhead of just 0.21% for the integer data type. For floating-point data, however, rounding errors make addition non-associative, which creates difficulties for ABFT. To mitigate the impact of rounding errors, we introduce the concept of relative error checking and perform error analysis for various error classes to show that the proposed approach eliminates false positives.
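As a rough illustration of the checksum idea behind ABFT, the sketch below verifies a GEMM result with the classical column-checksum identity (e^T A) B = e^T C, where e is the all-ones vector. It is not the tiled AVX scheme described in the abstract: the function names, the `rel_tol` parameter, and the choice of scale for the relative check are assumptions made for this example only. With `rel_tol=0.0` the check is exact, which suits integer data; a small positive `rel_tol` compares the mismatch against the magnitude of the summed terms so that floating-point rounding alone does not raise a false alarm.

```python
def matmul(a, b):
    """Plain triple-loop GEMM on lists of lists (reference implementation)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def column_checksum_ok(a, b, c, rel_tol=0.0):
    """Return True if every column of c satisfies (e^T a) b == e^T c.

    rel_tol is a hypothetical knob for this sketch: 0.0 demands exact
    equality, a positive value allows a mismatch proportional to the
    magnitude of the products that were accumulated.
    """
    k, m = len(b), len(b[0])
    # Column sums of a (the checksum row), computed once.
    a_sum = [sum(row[p] for row in a) for p in range(k)]
    # Column sums of |a|, used only to scale the relative threshold.
    a_abs = [sum(abs(row[p]) for row in a) for p in range(k)]
    for j in range(m):
        expected = sum(a_sum[p] * b[p][j] for p in range(k))
        actual = sum(row[j] for row in c)
        scale = sum(a_abs[p] * abs(b[p][j]) for p in range(k))
        if abs(expected - actual) > rel_tol * scale:
            return False
    return True


# Integer case: exact comparison detects a single corrupted element.
a_int = [[1, 2], [3, 4]]
b_int = [[5, 6], [7, 8]]
c_int = matmul(a_int, b_int)
int_clean = column_checksum_ok(a_int, b_int, c_int)
c_int[1][0] += 1                      # inject a fault
int_faulty = column_checksum_ok(a_int, b_int, c_int)

# Floating-point case: a relative threshold absorbs rounding differences.
a_flt = [[0.1, 0.2], [0.3, 0.4]]
b_flt = [[0.5, 0.6], [0.7, 0.8]]
c_flt = matmul(a_flt, b_flt)
flt_clean = column_checksum_ok(a_flt, b_flt, c_flt, rel_tol=1e-12)
c_flt[0][1] += 1e-3                   # inject a fault
flt_faulty = column_checksum_ok(a_flt, b_flt, c_flt, rel_tol=1e-12)

print(int_clean, int_faulty, flt_clean, flt_faulty)
```

The threshold value `1e-12` is chosen only for this small double-precision example; a real implementation would derive it from an error analysis of the data type and matrix dimensions.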
Citations
Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators
Chandra Sekhar Mummidi,Victor C. Ferreira,Sudarshan K. Srinivasan,Sandip Kundu +3 more
TL;DR: An adaptive error threshold that takes the TMUL input values into account is proposed to address the problem of false triggers and error escapes, and a taxonomy of various error classes is provided to ensure full coverage for all hardware faults.
1
A Novel Self-Repair Mechanism for Tiled Matrix Multiplication Unit
Chandra Sekhar Mummidi,Sandeep Bal,Sandip Kundu +2 more
- 08 Oct 2024
TL;DR: This paper presents a novel self-repair mechanism for Tiled Matrix Multiplication Units, leveraging software-based column avoidance to detect and circumvent persistent faults, enabling real-time correction of both persistent and intermittent faults with low performance overhead.
1
Exploring the Architecture of Multiple GEMM Accelerators in Heterogeneous Systems
Jianfeng Zhang,Li Zhou,Hengzhu Liu +2 more
- 16 Jun 2023
TL;DR: A detailed workload characterization for different sizes of GEMM kernels is performed and a comprehensive design space exploration is made to find the Pareto optimal architecture configurations.
Testability and Dependability of AI Hardware: Survey, Trends, Challenges, and Perspectives
TL;DR: Tahoori et al. present a survey on the dependability and testability of artificial intelligence (AI) systems; their focus was not on the underlying technologies and their dependability requirements, challenges, and solutions.
References
Radiation-induced soft errors in advanced semiconductor technologies
TL;DR: In this article, the authors review the types of failure modes for soft errors, the three dominant radiation mechanisms responsible for creating soft errors in terrestrial applications, and how these soft errors are generated by the collection of radiation-induced charge.
Algorithm-Based Fault Tolerance for Matrix Operations
Kuang-Hua Huang,Abraham +1 more
TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.
1.4K
Worldwide electricity used in data centers
TL;DR: This study estimates historical electricity use by data centers worldwide and regionally on the basis of more detailed data than were available for previous assessments, including electricity used by servers, data center communications, and storage equipment.
1K
Intel Math Kernel Library
Endong Wang,Qing Zhang,Bo Shen,Guangyong Zhang,Xiaowei Lu,Qing Wu,Yajuan Wang +6 more
- 01 Jan 2014
TL;DR: To achieve optimal performance on multi-core and multi-processor systems, parallelism must be exploited and the memory hierarchy managed efficiently.
737
Understanding error propagation in deep learning neural network (DNN) accelerators and applications
Guanpeng Li,Siva Kumar Sastry Hari,Michael J. Sullivan,Timothy Tsai,Karthik Pattabiraman,Joel Emer,Stephen W. Keckler +6 more
- 12 Nov 2017
TL;DR: It is found that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design, and two efficient protection techniques are proposed.
594