Topic

Xeon

About: Xeon is a research topic. Over the lifetime, 1910 publications have been published within this topic receiving 30536 citations. The topic is also known as: Intel Xeon.

Papers

Posted Content•

Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems

Vasimuddin¹, Sanchit Misra¹, Heng Li², Srinivas Aluru³•Institutions (3)

Intel¹, Harvard University², Georgia Institute of Technology³

27 Jul 2019-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This work improves the performance of the three kernels of BWA-MEM on a single-socket multicore processor by improving cache reuse, simplifying the algorithms, and replacing many small memory allocations with a few large contiguous ones to improve hardware prefetching of data.

Abstract: Innovations in Next-Generation Sequencing are enabling generation of DNA sequence data at ever faster rates and at very low cost. Large sequencing centers typically employ hundreds of such systems. Such high-throughput and low-cost generation of data underscores the need for commensurate acceleration in downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping of the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture-aware implementation, while maintaining identical output. The volume of data requires a distributed computing environment, usually deploying multicore processors. Since the application can be easily parallelized for distributed memory systems, we focus on performance improvements on a single-socket multicore processor. BWA-MEM run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time. We improved the performance of these kernels by 1) improving cache reuse, 2) simplifying the algorithms, 3) replacing small fragmented memory allocations with a few large contiguous ones, 4) software prefetching, and 5) SIMD utilization wherever applicable, along with a massive reorganization of the source code to enable these improvements. As a result, we achieved nearly 2x, 183x, and 8x speedups on the three kernels, respectively, resulting in up to 3.5x and 2.4x speedups on end-to-end compute time over the original BWA-MEM on a single thread and a single socket of an Intel Xeon Skylake processor, respectively. To the best of our knowledge, this is the highest reported speedup over BWA-MEM.

1,096 citations
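
The BWA-MEM abstract above says three kernels account for more than 85% of compute time, which invites a quick Amdahl's-law check on the end-to-end figures. The sketch below is only an illustration and not the authors' analysis: it assumes the accelerated share is exactly 85% and lumps the three kernels into one fraction. The function name `amdahl_speedup` and the constant `KERNEL_FRACTION` are hypothetical.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when a fraction of run time is made kernel_speedup times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    # Time left after acceleration, as a share of the original run time.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


# Assumption: the abstract says "more than 85%"; 0.85 is used as a round figure.
KERNEL_FRACTION = 0.85

# Even infinitely fast kernels leave the other 15% untouched.
upper_bound = amdahl_speedup(KERNEL_FRACTION, float("inf"))
print(f"Upper bound with 85% accelerated: {upper_bound:.2f}x")

# Effect of a uniform kernel speedup at the low and high end of the modest figures.
for s in (2.0, 8.0):
    print(f"Uniform {s:.0f}x kernel speedup -> {amdahl_speedup(KERNEL_FRACTION, s):.2f}x overall")
```

Under this assumption the ceiling is about 6.67x, so the reported end-to-end speedup of up to 3.5x on a single thread sits below the bound, as expected.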

The microarchitecture of the Pentium 4 processor

G. Hinton

1 Jan 2001

TL;DR: The main features and functions of the NetBurst microarchitecture of Intel’s new flagship Pentium 4 processor are described, including its new form of instruction cache called the Execution Trace Cache.

Abstract: This paper describes the Intel NetBurst™ microarchitecture of Intel’s new flagship Pentium 4 processor. This microarchitecture is the basis of a new family of processors from Intel starting with the Pentium 4 processor. The Pentium 4 processor provides a substantial performance gain for many key application areas where the end user can truly appreciate the difference. In this paper we describe the main features and functions of the NetBurst microarchitecture. We present the frontend of the machine, including its new form of instruction cache called the Execution Trace Cache. We also describe the out-of-order execution engine, including the extremely low latency double-pumped Arithmetic Logic Unit (ALU) that runs at 3GHz. We also discuss the memory subsystem, including the very low latency Level 1 data cache that is accessed in just two clock cycles. We then touch on some of the key features that allow the Pentium 4 processor to have outstanding floating-point and multi-media performance. We provide some key performance numbers for this processor, comparing it to the Pentium III processor.

671 citations

Journal Article•10.1007/S13389-012-0027-1•

High-speed high-security signatures

Daniel J. Bernstein¹, Niels Duif², Tanja Lange², Peter Schwabe³, Bo-Yin Yang⁴•Institutions (4)

University of Illinois at Chicago¹, Eindhoven University of Technology², National Taiwan University³, Academia Sinica⁴

14 Aug 2012-Journal of Cryptographic Engineering

TL;DR: In this paper, the authors show that a $390 mass-market quad-core 2.4GHz Intel Westmere (Xeon E5620) CPU can create 109000 signatures per second and verify 71000 signatures per second on an elliptic curve at a 2^128 security level.

Abstract: This paper shows that a $390 mass-market quad-core 2.4GHz Intel Westmere (Xeon E5620) CPU can create 109000 signatures per second and verify 71000 signatures per second on an elliptic curve at a 2^128 security level. Public keys are 32 bytes, and signatures are 64 bytes. These performance figures include strong defenses against software side-channel attacks: there is no data flow from secret keys to array indices, and there is no data flow from secret keys to branch conditions.

629 citations
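
The throughput figures in the signature abstract above can be turned into a rough per-operation cycle budget. This is back-of-the-envelope arithmetic on the abstract's numbers only, assuming all four cores run at the nominal 2.4 GHz and are fully occupied, and ignoring hyperthreading and frequency scaling. The helper `cycles_per_operation` is a hypothetical name, and the results are estimates rather than measurements from the paper.

```python
CORES = 4                   # quad-core CPU, per the abstract
CLOCK_HZ = 2.4e9            # nominal 2.4 GHz clock, per the abstract
SIGNS_PER_SECOND = 109_000
VERIFIES_PER_SECOND = 71_000
SIGNATURE_BYTES = 64        # signature size, per the abstract


def cycles_per_operation(ops_per_second: float) -> float:
    """Aggregate core cycles available per operation (assumes every core is busy)."""
    return CORES * CLOCK_HZ / ops_per_second


print(f"~{cycles_per_operation(SIGNS_PER_SECOND):,.0f} cycles per signature")
print(f"~{cycles_per_operation(VERIFIES_PER_SECOND):,.0f} cycles per verification")

# Signature bytes produced per second at the stated signing rate.
print(f"{SIGNS_PER_SECOND * SIGNATURE_BYTES:,} signature bytes per second")
```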

Journal Article•10.1093/BIOINFORMATICS/BTL582•

Striped Smith-Waterman speeds database searches six times over other SIMD implementations

Michael Farrar

05 Jan 2007-Bioinformatics

TL;DR: To speed up the Smith-Waterman algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm at the instruction level.

Abstract: Motivation: The only algorithm guaranteed to find the optimal local alignment is the Smith-Waterman. It is also one of the slowest due to the number of computations required for the search. To speed up the algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm at the instruction level. Results: A faster implementation of the Smith-Waterman algorithm is presented. This algorithm achieved 2-8 times performance improvement over other SIMD based Smith-Waterman implementations. On a 2.0 GHz Xeon Core 2 Duo processor, speeds of >3.0 billion cell updates/s were achieved. Availability: http://farrar.michael.googlepages.com/Smith-waterman Contact: farrar.michael@gmail.com

551 citations
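
To make the "cell updates" unit in the Smith-Waterman abstract above concrete, here is a minimal scalar sketch of the local-alignment score. It is not Farrar's striped SIMD method: it is the plain dynamic-programming recurrence with a linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local-alignment score of query against target (linear gap penalty)."""
    prev = [0] * (len(target) + 1)   # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]                   # column 0 is always 0
        for j, t in enumerate(target, start=1):
            # One "cell update": take the best of the three predecessors, floored at 0.
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("ACGT", "AGGT"))
```

Each inner-loop iteration is one cell update, so scoring a query of length m against a target of length n costs m × n updates. SIMD implementations such as the one described in the abstract compute several of these cells per instruction.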

Proceedings Article•10.1145/3225058.3225069•

ImageNet Training in Minutes

Yang You¹, Zhao Zhang, Cho-Jui Hsieh², James Demmel¹, Kurt Keutzer¹•Institutions (2)

University of California, Berkeley¹, University of California, Davis²

13 Aug 2018

TL;DR: This paper uses large batch sizes, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, to make efficient use of massive computing resources, and empirically evaluates the approach on two neural networks, AlexNet and ResNet-50, trained with the ImageNet-1k dataset while preserving the state-of-the-art test accuracy.

Abstract: In this paper, we investigate large scale computers' capability of speeding up deep neural networks (DNN) training. Our approach is to use large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources. Our approach is generic, as we empirically evaluate the effectiveness on two neural networks: AlexNet and ResNet-50 trained with the ImageNet-1k dataset while preserving the state-of-the-art test accuracy. Compared to the baseline of a previous study from a group of researchers at Facebook, our approach shows higher test accuracy on batch sizes that are larger than 16K. Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. With 2,048 Intel Xeon Phi 7250 Processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe v1.0.7.

541 citations
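
The ImageNet abstract above names Layer-wise Adaptive Rate Scaling (LARS) without stating its rule. The sketch below follows the formulation as it is commonly described: each layer's local learning rate is scaled by the ratio of its weight norm to its gradient norm. It is not taken from this listing or from the authors' Caffe code. Momentum is omitted, the default hyperparameters are assumptions, and the fallback of 1.0 for zero norms is a choice made for this sketch.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coefficient=0.001, weight_decay=0.0005):
    """Layer-wise rate: trust * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    denom = l2_norm(grads) + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # sketch-level fallback, not part of the published rule
    return trust_coefficient * w_norm / denom


def lars_step(weights, grads, global_lr, trust_coefficient=0.001, weight_decay=0.0005):
    """One momentum-free update for a single layer's flattened weights."""
    local_lr = lars_local_lr(weights, grads, trust_coefficient, weight_decay)
    return [w - global_lr * local_lr * (g + weight_decay * w)
            for w, g in zip(weights, grads)]


layer_weights = [3.0, 4.0]   # toy layer with ||w|| = 5
layer_grads = [0.6, 0.8]     # toy gradient with ||g|| = 1
print(lars_step(layer_weights, layer_grads, global_lr=1.0, weight_decay=0.0))
```

With weight decay set to zero, multiplying the gradient by a constant leaves the update unchanged, because the local rate shrinks by the same factor. That normalization is the property that makes very large batch sizes workable in this family of methods.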

Performance Metrics

2,096

Papers

10,907

Citations

No. of papers in the topic in previous years
Year	Papers
2025	2
2024	9
2023	46
2022	110
2021	84
2020	110
