InfiniBand


Papers published on a yearly basis

Papers

Journal Article•10.1016/J.CPC.2011.10.012•

Implementing Molecular Dynamics on Hybrid High Performance Computers - Particle-Particle Particle-Mesh


W. Michael Brown¹, Axel Kohlmeyer², Steven J. Plimpton³, Arnold N. Tharrington¹•Institutions (3)

National Center for Computational Sciences¹, Temple University², Sandia National Laboratories³

01 Mar 2012-Computer Physics Communications

TL;DR: This paper presents an efficient implementation of the particle–particle particle-mesh method based on the work by Harvey and De Fabritiis, and provides a performance comparison of the same kernels compiled with both CUDA and OpenCL.


499 citations

Proceedings Article•10.1109/CVPR.2016.284•

FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters


Forrest Iandola¹, Matthew W. Moskewicz¹, Khalid Ashraf¹, Kurt Keutzer¹•Institutions (1)

University of California, Berkeley¹

01 Jun 2016

TL;DR: FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.


Abstract: Long training times for high-accuracy deep neural networks (DNNs) impede research into new DNN architectures and slow the development of high-accuracy DNNs. In this paper we present FireCaffe, which successfully scales deep neural network training across a cluster of GPUs. We also present a number of best practices to aid in comparing advancements in methods for scaling and accelerating the training of deep neural networks. The speed and scalability of distributed algorithms is almost always limited by the overhead of communicating between servers; DNN training is not an exception to this rule. Therefore, the key consideration here is to reduce communication overhead wherever possible, while not degrading the accuracy of the DNN models that we train. Our approach has three key pillars. First, we select network hardware that achieves high bandwidth between GPU servers – Infiniband or Cray interconnects are ideal for this. Second, we consider a number of communication algorithms, and we find that reduction trees are more efficient and scalable than the traditional parameter server approach. Third, we optionally increase the batch size to reduce the total quantity of communication during DNN training, and we identify hyperparameters that allow us to reproduce the small-batch accuracy while training with large batch sizes. When training GoogLeNet and Network-in-Network on ImageNet, we achieve a 47x and 39x speedup, respectively, when training on a cluster of 128 GPUs.


370 citations
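The FireCaffe abstract above reports that reduction trees scale better than a parameter server. The following is a toy sketch of why that can hold, not the authors' implementation: it assumes a parameter server that accepts one worker's gradient per step over a single link, and a binary tree in which all pairs at a level exchange gradients in parallel. The function names `parameter_server_sum` and `reduction_tree_sum` are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized gradient vectors."""
    return [x + y for x, y in zip(a, b)]


def parameter_server_sum(grads: List[Vector]) -> Tuple[Vector, int]:
    """Sum gradients at one server; returns (sum, sequential steps).

    Assumption: the server receives one gradient per step, so the
    number of steps grows linearly with the number of workers.
    """
    total = [0.0] * len(grads[0])
    steps = 0
    for g in grads:
        total = add(total, g)
        steps += 1
    return total, steps


def reduction_tree_sum(grads: List[Vector]) -> Tuple[Vector, int]:
    """Sum gradients pairwise in a binary tree; returns (sum, rounds).

    Assumption: every pair within a round communicates in parallel,
    so the number of rounds grows logarithmically with the workers.
    """
    level = list(grads)
    rounds = 0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ... in one parallel round.
        nxt = [add(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd worker out waits for the next round
        level = nxt
        rounds += 1
    return level[0], rounds


if __name__ == "__main__":
    workers = 128  # cluster size quoted in the abstract
    grads = [[float(w), 1.0] for w in range(workers)]
    ps_total, ps_steps = parameter_server_sum(grads)
    tree_total, tree_rounds = reduction_tree_sum(grads)
    print("parameter server steps:", ps_steps)
    print("reduction tree rounds:", tree_rounds)
    print("same result:", ps_total == tree_total)
```

Under these assumptions both schemes produce the same summed gradient, but the tree needs a number of rounds that grows with the logarithm of the worker count rather than linearly. Real systems also have to account for message size, link bandwidth, and the broadcast of updated weights, which this sketch ignores.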

Proceedings Article•

High performance VMM-bypass I/O in virtual machines


Jiuxing Liu¹, Wei Huang², Bulent Abali¹, Dhabaleswar K. Panda²•Institutions (2)

IBM¹, Ohio State University²

30 May 2006

TL;DR: VMM-bypass allows time-critical I/O operations to be carried out directly in guest VMs without involvement of the VMM and/or a privileged VM by exploiting the intelligence found in modern high speed network interfaces.


Abstract: Currently, I/O device virtualization models in virtual machine (VM) environments require involvement of a virtual machine monitor (VMM) and/or a privileged VM for each I/O operation, which may turn out to be a performance bottleneck for systems with high I/O demands, especially those equipped with modern high speed interconnects such as InfiniBand. In this paper, we propose a new device virtualization model called VMM-bypass I/O, which extends the idea of OS-bypass originated from user-level communication. Essentially, VMM-bypass allows time-critical I/O operations to be carried out directly in guest VMs without involvement of the VMM and/or a privileged VM. By exploiting the intelligence found in modern high speed network interfaces, VMM-bypass can significantly improve I/O and communication performance for VMs without sacrificing safety or isolation. To demonstrate the idea of VMM-bypass, we have developed a prototype called Xen-IB, which offers InfiniBand virtualization support in the Xen 3.0 VM environment. Xen-IB runs with current InfiniBand hardware and does not require modifications to existing user-level applications or kernel-level drivers that use InfiniBand. Our performance measurements show that Xen-IB is able to achieve nearly the same raw performance as the original InfiniBand driver running in a non-virtualized environment.


339 citations

Journal Article•10.17815/JLSRF-5-171•

JUWELS: Modular Tier-0/1 Supercomputer at the Jülich Supercomputing Centre


Dorian Krause¹•Institutions (1)

Forschungszentrum Jülich¹

06 Feb 2019-Journal of large-scale research facilities JLSRF

TL;DR: JUWELS is a multi-petaflop modular supercomputer operated by Jülich Supercomputing Centre at Forschungszentrum Jülich as a European and national supercomputing resource for the Gauss Centre for Supercomputing.


Abstract: JUWELS is a multi-petaflop modular supercomputer operated by Jülich Supercomputing Centre at Forschungszentrum Jülich as a European and national supercomputing resource for the Gauss Centre for Supercomputing. In addition, JUWELS serves the Earth system modeling community within the Helmholtz Association. The first module, deployed in 2018, is a Cluster module based on the BullSequana X1000 architecture with Intel Xeon Skylake-SP processors and Mellanox EDR InfiniBand. An extension by a second Booster module is scheduled for deployment in 2020.


283 citations

Posted Content•

FireCaffe: near-linear acceleration of deep neural network training on compute clusters


Forrest Iandola, Khalid Ashraf, Matthew W. Moskewicz, Kurt Keutzer

31 Oct 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors present FireCaffe, which scales deep neural network training across a cluster of GPUs by selecting network hardware that achieves high bandwidth between GPU servers and using reduction trees to reduce communication overhead.


Abstract: Long training times for high-accuracy deep neural networks (DNNs) impede research into new DNN architectures and slow the development of high-accuracy DNNs. In this paper we present FireCaffe, which successfully scales deep neural network training across a cluster of GPUs. We also present a number of best practices to aid in comparing advancements in methods for scaling and accelerating the training of deep neural networks. The speed and scalability of distributed algorithms is almost always limited by the overhead of communicating between servers; DNN training is not an exception to this rule. Therefore, the key consideration here is to reduce communication overhead wherever possible, while not degrading the accuracy of the DNN models that we train. Our approach has three key pillars. First, we select network hardware that achieves high bandwidth between GPU servers -- Infiniband or Cray interconnects are ideal for this. Second, we consider a number of communication algorithms, and we find that reduction trees are more efficient and scalable than the traditional parameter server approach. Third, we optionally increase the batch size to reduce the total quantity of communication during DNN training, and we identify hyperparameters that allow us to reproduce the small-batch accuracy while training with large batch sizes. When training GoogLeNet and Network-in-Network on ImageNet, we achieve a 47x and 39x speedup, respectively, when training on a cluster of 128 GPUs.


214 citations


Year	Papers
2025	5
2024	4
2023	29
2022	37
2021	25
2020	63


Related Topics (5)

Performance Metrics