A Highly-Efficient Error Detection Technique for General Matrix Multiplication using Tiled Processing on SIMD Architecture

doi:10.1109/ICCD56317.2022.00084

Journal Article•10.1109/ICCD56317.2022.00084

Chandra Sekhar Mummidi, +4 more

- 01 Oct 2022

- International Conference on Community De...

- pp 529-536

4

TL;DR: This paper proposes an algorithm-based fault tolerance (ABFT) approach to detect silent errors, traced to hardware sources, in general matrix multiplication (GEMM) computations.

Abstract: General Matrix Multiplication (GEMM) is instrumental in a myriad of scientific, high-performance computing, and machine learning applications such as computer vision, recommendation models, and weather forecasting. It is vital to make GEMM fail-safe in safety-critical and high-precision applications. Companies like Meta and Google have recently reported sporadic silent errors in GEMM computations traced to hardware sources. Silent errors are hard to detect and require specialized solutions. Hardware redundancy approaches such as double or triple modular redundancy effectively detect or correct such errors, but they have a large area and power overhead. Algorithm-based Fault Tolerance (ABFT) has been shown to offer an effective alternative at a far lower overhead. Modern CPUs feature Advanced Vector Extensions (AVX) capable of executing SIMD instructions. This paper describes a new ABFT approach designed to take advantage of AVX. Our core algorithm relies on the classical tile-based outer-product approach but enhances the standard checksum calculation using a tile vector. The implementation parameters are fine-tuned to fit the available number of AVX registers. Our results indicate that we can achieve 100% error detection in GEMM at an overhead of just 0.21% for the integer data type. Unfortunately, due to rounding errors, addition of floating-point numbers is not an associative operation, creating difficulties for ABFT. To mitigate the impact of rounding errors, we introduce the concept of relative error checking and perform error analysis for various error classes to show that the proposed approach eliminates false positives entirely.
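The checksum idea behind ABFT for GEMM can be illustrated with a minimal sketch. This is the classical column-checksum check (in the spirit of Huang and Abraham's scheme listed in the references), not the paper's tiled AVX implementation, whose details are not given in the abstract. The function names, the `rel_tol` parameter, and the choice of scaling term are assumptions made for illustration only. For integers the check is exact (`rel_tol=0.0`); for floating-point data a small relative tolerance absorbs rounding differences caused by non-associative addition.

```python
def matmul(a, b):
    """Plain triple-loop GEMM on lists of lists: returns C = A * B."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)]
            for row in a]


def column_checksum_ok(a, b, c, rel_tol=0.0):
    """Verify C = A * B using one checksum per column of C.

    The sum of column j of C must equal (column sums of A) dotted with
    column j of B. With rel_tol=0.0 the comparison is exact, which suits
    integer data. For floats, the allowed deviation is rel_tol times the
    sum of |a[i][p] * b[p][j]|, an illustrative (assumed) error scale.
    """
    k, m = len(b), len(b[0])
    # Checksum vector of A: the sum of each column of A.
    a_colsum = [sum(row[p] for row in a) for p in range(k)]
    for j in range(m):
        expected = sum(a_colsum[p] * b[p][j] for p in range(k))
        observed = sum(row[j] for row in c)
        # Magnitude of all products contributing to column j of C.
        scale = sum(abs(row[p] * b[p][j]) for row in a for p in range(k))
        if abs(expected - observed) > rel_tol * scale:
            return False  # mismatch: a silent error is suspected
    return True


def flip_bit(c, i, j, bit):
    """Return a copy of integer matrix C with one bit of C[i][j] flipped."""
    corrupted = [row[:] for row in c]
    corrupted[i][j] ^= 1 << bit
    return corrupted
```

A typical use is to compute `c = matmul(a, b)`, then call `column_checksum_ok(a, b, c)` for integer data or `column_checksum_ok(a, b, c, rel_tol=1e-12)` for floating-point data. The tolerance value here is only a placeholder; choosing it so that rounding noise never triggers a false alarm while real errors are still caught is the problem the abstract's relative error checking addresses.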

Citations

Journal Article•10.1145/3633332

Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators

Chandra Sekhar Mummidi, +3 more

- 22 Nov 2023

- ACM Transactions on Architecture and Cod...

TL;DR: An adaptive error threshold that takes into account the TMUL input values is proposed to address the problem of false triggers and error escapes, and a taxonomy of various error classes is provided to ensure full coverage for all hardware faults.

1

Journal Article•10.1109/dft63277.2024.10753526

A Novel Self-Repair Mechanism for Tiled Matrix Multiplication Unit

Chandra Sekhar Mummidi, +2 more

- 08 Oct 2024

TL;DR: This paper presents a novel self-repair mechanism for Tiled Matrix Multiplication Units that uses software-based column avoidance to detect and circumvent persistent faults, enabling real-time correction of both persistent and intermittent faults with low performance overhead.

1

Proceedings Article•10.1109/iccsse59359.2023.10245050

Exploring the Architecture of Multiple GEMM Accelerators in Heterogeneous Systems

Jianfeng Zhang, +2 more

- 16 Jun 2023

TL;DR: A detailed workload characterization for different sizes of GEMM kernels is performed and a comprehensive design space exploration is made to find the Pareto optimal architecture configurations.

Journal Article•10.1109/MDAT.2023.3241116

Testability and Dependability of AI Hardware: Survey, Trends, Challenges, and Perspectives

Fei Su, +2 more

- 01 Apr 2023

- IEEE design & test

TL;DR: Tahoori et al. present a survey on the dependability and testability of artificial intelligence (AI) systems; their focus, however, is not on the underlying technologies and their dependability requirements, challenges, and solutions.

References

Journal Article•10.1109/TDMR.2005.853449

Radiation-induced soft errors in advanced semiconductor technologies

Robert Baumann

- 05 Dec 2005

- IEEE Transactions on Device and Material...

TL;DR: In this article, the authors review the types of failure modes for soft errors, the three dominant radiation mechanisms responsible for creating soft errors in terrestrial applications, and how these soft errors are generated by the collection of radiation-induced charge.

1.4K

Journal Article•10.1109/TC.1984.1676475

Algorithm-Based Fault Tolerance for Matrix Operations

Kuang-Hua Huang, +1 more

- 01 Jun 1984

- IEEE Transactions on Computers

TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.

1.4K

Journal Article•10.1088/1748-9326/3/3/034008

Worldwide electricity used in data centers

Jonathan G. Koomey

- 01 Jul 2008

- Environmental Research Letters

TL;DR: This study estimates historical electricity use by data centers worldwide and regionally on the basis of more detailed data than were available for previous assessments, including electricity used by servers, data center communications, and storage equipment.

1K

Book Chapter•10.1007/978-3-319-06486-4_7

Intel Math Kernel Library

Endong Wang, +6 more

- 01 Jan 2014

TL;DR: To achieve optimal performance on multi-core and multi-processor systems, software needs to exploit parallelism and manage the memory hierarchy efficiently.

737

Proceedings Article•10.1145/3126908.3126964

Understanding error propagation in deep learning neural network (DNN) accelerators and applications

Guanpeng Li, +6 more

- 12 Nov 2017

TL;DR: It is found that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design, and two efficient protection techniques are proposed.

594
