Xeon Phi

Papers published on a yearly basis

Papers

Face classification using electronic synapses

Peng Yao¹, Huaqiang Wu¹, Bin Gao¹, Sukru Burc Eryilmaz², Xueyao Huang¹, Wenqiang Zhang¹, Qingtian Zhang¹, Ning Deng¹, Luping Shi¹, H-S Philip Wong², He Qian¹

Tsinghua University¹, Stanford University²

12 May 2017-Nature Communications

TL;DR: An analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials is presented and shows bidirectional continuous weight modulation behaviour, consolidating the feasibility of analogue synaptic array and paving the way toward building an energy efficient and large-scale neuromorphic system.

Abstract: Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow cognitive tasks to be completed more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000× (20×) lower than that of an implementation using an Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.

870 citations

Journal Article•10.1093/BIOINFORMATICS/BTY648

Scaling read aligners to hundreds of threads on general-purpose processors.

Ben Langmead¹, Christopher Wilks¹, Valentin Antonescu¹, Rone Charles¹

Johns Hopkins University¹

01 Feb 2019-Bioinformatics

TL;DR: This work greatly improves thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture, highlights how bottlenecks are exacerbated by variable-record-length file formats like FASTQ, and suggests changes that enable superior scaling.

Abstract: Motivation: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. Results: We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling. Availability and implementation: Experiments for this study: https://github.com/BenLangmead/bowtie-scaling. Bowtie: http://bowtie-bio.sourceforge.net. Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2. HISAT: http://www.ccb.jhu.edu/software/hisat. Supplementary information: Supplementary data are available at Bioinformatics online.

686 citations

Journal Article•10.1016/J.CPC.2013.06.003

A flexible algorithm for calculating pair interactions on SIMD architectures

Szilárd Páll¹,², Berk Hess¹,²

Science for Life Laboratory¹, Royal Institute of Technology²

01 Dec 2013-Computer Physics Communications

TL;DR: This work presents an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters, which improves data reuse compared to the traditional scheme and results in more efficient SIMD parallelization.
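
The clustering idea in this summary can be illustrated with a small sketch. This is not the authors' implementation: the sort-along-x clustering, the bounding-box test, and every function name below are assumptions chosen for brevity, and the code only counts particle pairs within a cutoff instead of evaluating forces. It shows the general shape of the technique: particles are grouped into fixed-size clusters, a pair list is built between clusters rather than between particles, and each listed cluster pair is then processed as a small dense block of particle pairs (the part that maps naturally onto SIMD lanes).

```python
import math

CLUSTER_SIZE = 4  # fixed particles per cluster; the summary mentions 2, 4, or 8


def build_clusters(positions, size=CLUSTER_SIZE):
    """Group particle indices into fixed-size clusters.

    Assumption: sorting by coordinates is a crude stand-in for real spatial
    gridding. The last cluster may be short; a production code would pad it.
    """
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    return [order[k:k + size] for k in range(0, len(order), size)]


def bounding_box(cluster, positions):
    """Axis-aligned bounding box (lower corner, upper corner) of a cluster."""
    pts = [positions[i] for i in cluster]
    lo = tuple(min(p[d] for p in pts) for d in range(3))
    hi = tuple(max(p[d] for p in pts) for d in range(3))
    return lo, hi


def box_distance(box_a, box_b):
    """Minimum distance between two axis-aligned boxes (0 if they overlap)."""
    (alo, ahi), (blo, bhi) = box_a, box_b
    gaps = [max(0.0, blo[d] - ahi[d], alo[d] - bhi[d]) for d in range(3)]
    return math.sqrt(sum(g * g for g in gaps))


def cluster_pair_list(clusters, positions, cutoff):
    """List cluster pairs (i <= j) whose boxes come within the cutoff."""
    boxes = [bounding_box(c, positions) for c in clusters]
    n = len(clusters)
    return [(i, j) for i in range(n) for j in range(i, n)
            if box_distance(boxes[i], boxes[j]) <= cutoff]


def count_pairs_clustered(positions, cutoff):
    """Count particle pairs within the cutoff via the cluster pair list."""
    clusters = build_clusters(positions)
    count = 0
    for ci, cj in cluster_pair_list(clusters, positions, cutoff):
        # Dense block of up to CLUSTER_SIZE x CLUSTER_SIZE particle pairs.
        for a in clusters[ci]:
            for b in clusters[cj]:
                if ci == cj and b <= a:
                    continue  # count each pair inside one cluster only once
                if math.dist(positions[a], positions[b]) <= cutoff:
                    count += 1
    return count


def count_pairs_brute_force(positions, cutoff):
    """Reference count over all particle pairs, used to check the sketch."""
    n = len(positions)
    return sum(1 for a in range(n) for b in range(a + 1, n)
               if math.dist(positions[a], positions[b]) <= cutoff)
```

Because the box distance never exceeds the distance between any two particles inside the boxes, skipping cluster pairs whose boxes are farther apart than the cutoff cannot drop a valid particle pair, so the clustered count should match the brute-force count.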

683 citations

Proceedings Article•10.1145/3225058.3225069

ImageNet Training in Minutes

Yang You¹, Zhao Zhang, Cho-Jui Hsieh², James Demmel¹, Kurt Keutzer¹

University of California, Berkeley¹, University of California, Davis²

13 Aug 2018

TL;DR: This paper uses a large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, to make efficient use of massive computing resources, and empirically evaluates the approach on two neural networks, AlexNet and ResNet-50, trained with the ImageNet-1k dataset while preserving state-of-the-art test accuracy.

Abstract: In this paper, we investigate the capability of large-scale computers to speed up deep neural network (DNN) training. Our approach is to use a large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources. Our approach is generic, as we empirically evaluate the effectiveness on two neural networks: AlexNet and ResNet-50 trained with the ImageNet-1k dataset while preserving the state-of-the-art test accuracy. Compared to the baseline of a previous study from a group of researchers at Facebook, our approach shows higher test accuracy on batch sizes that are larger than 16K. Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. With 2,048 Intel Xeon Phi 7250 processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe v1.0.7.
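
The abstract names LARS without describing it. As a rough sketch of the layer-wise scaling idea as it is commonly described (each layer gets a local learning rate proportional to the ratio of its weight norm to its gradient norm), the following may help. It is not the released Caffe implementation: the function names, the default `trust_coef` and `weight_decay` values, and the fallback for zero norms are assumptions, and momentum is omitted.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0005):
    """Layer-wise local learning rate.

    local_lr = trust_coef * ||w|| / (||g|| + weight_decay * ||w||)
    The default coefficients are illustrative assumptions.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # assumed fallback: no layer-wise scaling
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm)


def lars_step(layers, grads, global_lr=1.0, trust_coef=0.001,
              weight_decay=0.0005):
    """Apply one plain (momentum-free) LARS-style update to every layer.

    `layers` and `grads` are lists of flat float lists, one per layer.
    """
    updated = []
    for w, g in zip(layers, grads):
        local_lr = lars_local_lr(w, g, trust_coef, weight_decay)
        step = global_lr * local_lr
        updated.append([wi - step * (gi + weight_decay * wi)
                        for wi, gi in zip(w, g)])
    return updated
```

With `weight_decay=0.0`, the size of each layer's update depends only on `trust_coef` and the layer's weight norm, not on the scale of its gradient, which is the property usually credited with keeping very large batch training stable.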

541 citations

Journal Article•10.1007/S11432-016-5588-7

The Sunway TaihuLight supercomputer: system and applications

Haohuan Fu¹, Junfeng Liao¹, Jinzhe Yang¹, Lanning Wang², Zhenya Song³, Xiaomeng Huang¹, Chao Yang⁴, Wei Xue¹, Fangfang Liu⁴, Fangli Qiao³, Wei Zhao³, Xunqiang Yin³, Chaofeng Hou⁴, Chenglong Zhang⁴, Wei Ge⁴, Jian Zhang⁴, Yangang Wang⁴, Chunbo Zhou⁴, Guangwen Yang¹

Tsinghua University¹, Beijing Normal University², State Oceanic Administration³, Chinese Academy of Sciences⁴

21 Jun 2016-Science in China Series F: Information Sciences

TL;DR: Preliminary efforts on developing and optimizing applications on the TaihuLight system are reported, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.

Abstract: The Sunway TaihuLight supercomputer is the world's first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over three TFlops. To alleviate the memory bandwidth bottleneck in most applications, each CPE comes with a scratch pad memory, which serves as a user-controlled cache. To support the parallelization of programs on the new many-core architecture, in addition to the basic C/C++ and Fortran compilers, the system provides a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax. This paper also reports our preliminary efforts on developing and optimizing applications on the TaihuLight system, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.

523 citations

Year	Papers
2025	2
2024	7
2023	23
2022	82
2021	35
2020	59

Related Topics (5)

Performance Metrics