Matthew Boyd
6 Papers
2 Citations
Matthew Boyd is an academic researcher. The author has contributed to research in topics: Computer science & Scalability. The author has an h-index of 2 and has co-authored 6 publications.
Papers
A software-defined tensor streaming multiprocessor for large-scale machine learning
Dennis Abts, Garrin Kimmell, Andrew S. Ling, John Kim, Matthew Boyd, Andrew Bitar, Sahil Parmar, Ibrahim Ahmed, Roberto DiCecco, David Han, John Matthew Thompson, Michael Bye, Jennifer Hwang, Jeremy Fowers, Peter Lillian, Ashwin Murthy, Elyas Mehtabuddin, Chetan Tekur, Thomas Sohmers, Kris Kang, Stephen Maresh, Jonathan K. Ross +21 more
- 11 Jun 2022
TL;DR: The topology, routing and flow control are described to characterize the performance of the network that serves as the fabric for a large-scale parallel machine learning system with up to 10,440 TSPs and more than 2 TeraBytes of global memory accessible in less than 3 microseconds of end-to-end system latency.
A Comprehensive Evaluation of Novel AI Accelerators for Deep Learning Workloads
Murali Emani, Zheng Xie, Siddhisanket Raskar, Varuni K. Sastry, William Arnold, Bruce Wilson, Rajeev Thakur, Venkatram Vishwanath, Zhengchun Liu, Michael E. Papka, Cindy Orozco Bohorquez, Rickey C. Weisner, Yongning Sheng, Yun Du, Jian Zhang, A. I. Tsyplikhin, Gurdaman Khaira, Jeremy Fowers, R. Sivakumar, Victoria Godsoe, Adrian Macias, Chetan Tekur, Matthew Boyd +22 more
- 01 Nov 2022
TL;DR: In this article, the authors present an overview of dataflow-based AI accelerators from SambaNova, Cerebras, Graphcore, and Groq and evaluate the performance of collective communication, which is key for distributed DL implementation.
10
Answer Fast: Accelerating BERT on the Tensor Streaming Processor
Ibrahim Ahmed, Sahil Parmar, Matthew Boyd, Michael Beidler, Kris Kang, Bill Liu, Kyle Roach, John Kim, Dennis Abts +8 more
- 22 Jun 2022
TL;DR: By carefully fusing all the nonlinear components with the matrix multiplication components, the on-chip matrix multiplication units are efficiently utilized, resulting in a deterministic tail latency of 130 μs for a batch-1 inference through BERT-base, which is 6× faster than the current state-of-the-art.
Challenges/Opportunities to Enable Dependable Scale-out System with Groq Deterministic Tensor-Streaming Processors
Dennis Abts, Ibrahim Omer Ahmed, Andrew Bitar, Matthew Boyd, John Kim, Garrin Kimmell, Andrew S. Ling +6 more
- 01 Jun 2022
TL;DR: This work explores the challenges and opportunities of scaling such a deterministic architecture across multiple processors to ensure a dependable scale-out system. The high-radix, low-diameter topology enables N + 1 redundancy to improve the reliability of the system.
2