Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2011
Journal Article•10.1016/J.CPC.2010.12.021•
Implementing molecular dynamics on hybrid high performance computers - short range forces

W. Michael Brown1, Peng Wang2, Steven J. Plimpton3, Arnold N. Tharrington1•
National Center for Computational Sciences1, Nvidia2, Sandia National Laboratories3
01 Apr 2011-Computer Physics Communications
TL;DR: An algorithm for efficient short-range force calculation on hybrid high-performance machines, an approach for dynamic load balancing of work between CPU and accelerator cores, and the Geryon library, which allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators, are described.

686 citations

Journal Article•10.1177/1094342011403516•
The Combinatorial BLAS: design, implementation, and applications

Aydin Buluc1, John R. Gilbert2•
Lawrence Berkeley National Laboratory1, University of California, Berkeley2
1 Nov 2011
TL;DR: The parallel Combinatorial BLAS is described, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications, and an extensible library interface and some guiding principles for future development are provided.
Abstract: This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extensible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease-of-use. The scalability and raw performance of the example applications, using the Combinatorial BLAS, are unprecedented on distributed memory clusters.

481 citations

Proceedings Article•10.1145/2155620.2155664•
Reducing memory interference in multicore systems via application-aware memory channel partitioning

Sai Prashanth Muralidhara1, Lavanya Subramanian2, Onur Mutlu2, Mahmut Kandemir1, Thomas Moscibroda3 •
Pennsylvania State University1, Carnegie Mellon University2, Microsoft3
3 Dec 2011
TL;DR: In this paper, the authors present an alternative approach to reduce inter-application interference in the memory system: application-aware memory channel partitioning (MCP), which maps the data of applications that are likely to severely interfere with each other to different memory channels.
Abstract: Main memory is a major shared resource among cores in a multicore system. If the interference between different applications' memory requests is not controlled effectively, system performance can degrade significantly. Previous work aimed to mitigate the problem of interference between applications by changing the scheduling policy in the memory controller, i.e., by prioritizing memory requests from applications in a way that benefits system performance. In this paper, we first present an alternative approach to reducing inter-application interference in the memory system: application-aware memory channel partitioning (MCP). The idea is to map the data of applications that are likely to severely interfere with each other to different memory channels. The key principles are to partition onto separate channels 1) the data of light (memory non-intensive) and heavy (memory-intensive) applications, 2) the data of applications with low and high row-buffer locality. Second, we observe that interference can be further reduced with a combination of memory channel partitioning and scheduling, which we call integrated memory partitioning and scheduling (IMPS). The key idea is to 1) always prioritize very light applications in the memory scheduler since such applications cause negligible interference to others, 2) use MCP to reduce interference among the remaining applications. We evaluate MCP and IMPS on a variety of multi-programmed workloads and system configurations and compare them to four previously proposed state-of-the-art memory scheduling policies. Averaged over 240 workloads on a 24-core system with 4 memory channels, MCP improves system throughput by 7.1% over an application-unaware memory scheduler and 1% over the previous best scheduler, while avoiding modifications to existing memory schedulers. IMPS improves system throughput by 11.1% over an application-unaware scheduler and 5% over the previous best scheduler, while incurring much lower hardware complexity than the latter.
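The two partitioning principles stated in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the `App` record, the use of misses per kilo-instruction and row-buffer hit rate as the two metrics, the threshold values, and the round-robin assignment of channels to groups are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    mpki: float          # assumed intensity metric: misses per kilo-instruction
    row_hit_rate: float  # assumed locality metric: fraction of row-buffer hits


def classify(app, mpki_threshold=1.0, hit_threshold=0.5):
    """Place an application in one of three groups (thresholds are hypothetical)."""
    if app.mpki < mpki_threshold:
        return "light"
    if app.row_hit_rate >= hit_threshold:
        return "heavy_high_locality"
    return "heavy_low_locality"


def partition_channels(apps, num_channels, mpki_threshold=1.0, hit_threshold=0.5):
    """Map each application name to a list of channels, disjoint across groups."""
    groups = {}
    for app in apps:
        group = classify(app, mpki_threshold, hit_threshold)
        groups.setdefault(group, []).append(app.name)

    order = [g for g in ("light", "heavy_low_locality", "heavy_high_locality")
             if g in groups]
    if len(order) > num_channels:
        raise ValueError("need at least one channel per non-empty group")

    # Hand out channels to the non-empty groups in round-robin order.
    channels = {g: [] for g in order}
    for ch in range(num_channels):
        channels[order[ch % len(order)]].append(ch)

    return {name: channels[g] for g in order for name in groups[g]}
```

With three applications (one light, one heavy with high locality, one heavy with low locality) and four channels, each group receives its own channels, so the data of applications in different groups never share a channel.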

295 citations

Journal Article•10.1016/J.PARCO.2011.05.004•
A Hybrid MPI-OpenMP Scheme for Scalable Parallel Pseudospectral Computations for Fluid Turbulence

Pablo D. Mininni1, Pablo D. Mininni2, Duane Rosenberg2, Raghu Reddy3, Annick Pouquet2 •
Facultad de Ciencias Exactas y Naturales1, National Center for Atmospheric Research2, Pittsburgh Supercomputing Center3
1 Jun 2011
TL;DR: It is shown that the hybrid scheme achieves good scalability up to ∼20,000 compute cores with a maximum efficiency of 89%, and a mean of 79%.
Abstract: A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from and augments a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretizations. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves good scalability up to ∼20,000 compute cores with a maximum efficiency of 89%, and a mean of 79%. Data are presented that help guide the choice of the optimal number of MPI tasks and OpenMP threads in order to maximize code performance on two different platforms.

253 citations

Proceedings Article•10.1145/1963405.1963461•
Parallel boosted regression trees for web search ranking

Stephen Tyree1, Kilian Q. Weinberger1, Kunal Agrawal1, Jennifer Paykin2•
Washington University in St. Louis1, Wesleyan University2
28 Mar 2011
TL;DR: A novel method for parallelizing the training of GBRT, which parallelizes the construction of the individual regression trees and operates using the master-worker paradigm, and demonstrates that the loss in accuracy induced due to the histogram approximation in the regression tree creation can be compensated for through slightly deeper trees.
Abstract: Gradient Boosted Regression Trees (GBRT) are the current state-of-the-art learning paradigm for machine learned web-search ranking - a domain notorious for very large data sets. In this paper, we propose a novel method for parallelizing the training of GBRT. Our technique parallelizes the construction of the individual regression trees and operates using the master-worker paradigm as follows. The data are partitioned among the workers. At each iteration, the worker summarizes its data-partition using histograms. The master processor uses these to build one layer of a regression tree, and then sends this layer to the workers, allowing the workers to build histograms for the next layer. Our algorithm carefully orchestrates overlap between communication and computation to achieve good performance. Since this approach is based on data partitioning, and requires a small amount of communication, it generalizes to distributed and shared memory machines, as well as clouds. We present experimental results on both shared memory machines and clusters for two large scale web search ranking data sets. We demonstrate that the loss in accuracy induced due to the histogram approximation in the regression tree creation can be compensated for through slightly deeper trees. As a result, we see no significant loss in accuracy on the Yahoo data sets and a very small reduction in accuracy for the Microsoft LETOR data. In addition, on shared memory machines, we obtain almost perfect linear speed-up with up to about 48 cores on the large data sets. On distributed memory machines, we get a speedup of 25 with 32 processors. Due to data partitioning our approach can scale to even larger data sets, on which one can reasonably expect even higher speedups.
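The master-worker histogram step described in the abstract can be sketched for a single feature and a single tree node. This is a simplified illustration, not the paper's implementation: it assumes fixed equal-width bins over a known range `[lo, hi)`, a squared-error split criterion, and the hypothetical function names `worker_histogram`, `merge_histograms` and `best_split`.

```python
def worker_histogram(xs, ys, lo, hi, bins):
    """Worker side: summarize a data partition as per-bin counts and target sums."""
    counts = [0] * bins
    sums = [0.0] * bins
    width = (hi - lo) / bins
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), bins - 1)  # clamp the right edge
        counts[b] += 1
        sums[b] += y
    return counts, sums


def merge_histograms(hists):
    """Master side: add the workers' histograms bin by bin."""
    counts = [sum(c) for c in zip(*(h[0] for h in hists))]
    sums = [sum(s) for s in zip(*(h[1] for h in hists))]
    return counts, sums


def best_split(counts, sums, lo, hi):
    """Master side: pick the bin boundary that minimizes squared error.

    Minimizing squared error is equivalent to maximizing
    s_left^2 / n_left + s_right^2 / n_right.
    """
    bins = len(counts)
    width = (hi - lo) / bins
    total_n, total_s = sum(counts), sum(sums)
    best_threshold, best_score = None, float("-inf")
    n_left, s_left = 0, 0.0
    for b in range(bins - 1):
        n_left += counts[b]
        s_left += sums[b]
        n_right = total_n - n_left
        if n_left == 0 or n_right == 0:
            continue
        score = s_left ** 2 / n_left + (total_s - s_left) ** 2 / n_right
        if score > best_score:
            best_threshold, best_score = lo + (b + 1) * width, score
    return best_threshold
```

Only the histograms cross the network, so the communication volume depends on the number of bins rather than on the size of each worker's partition.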

204 citations

Journal Article•10.1016/J.PARCO.2011.02.002•
High performance computing using MPI and OpenMP on multi-core parallel systems

Haoqiang Jin1, Dennis C. Jespersen1, Piyush Mehrotra1, Rupak Biswas1, Lei Huang2, Barbara Chapman2 •
Ames Research Center1, University of Houston2
1 Sep 2011
TL;DR: This paper presents the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems and presents new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.
Abstract: The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.

195 citations

Proceedings Article•10.1145/2155620.2155663•
Parallel application memory scheduling

Eiman Ebrahimi1, Rustam Miftakhutdinov1, Chris Fallin2, Chang Joo Lee3, José A. Joao1, Onur Mutlu2, Yale N. Patt1 •
University of Texas at Austin1, Carnegie Mellon University2, Intel3
3 Dec 2011
TL;DR: This paper proposes a memory scheduling algorithm designed specifically for parallel applications, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers, and shows that it speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.
Abstract: A primary use of chip-multiprocessor (CMP) systems is to speed up a single application by exploiting thread-level parallelism. In such systems, threads may slow each other down by issuing memory requests that interfere in the shared memory subsystem. This inter-thread memory system interference can significantly degrade parallel application performance. Better memory request scheduling may mitigate such performance degradation. However, previously proposed memory scheduling algorithms for CMPs are designed for multi-programmed workloads where each core runs an independent application, and thus do not take into account the inter-dependent nature of threads in a parallel application. In this paper, we propose a memory scheduling algorithm designed specifically for parallel applications. Our approach has two main components, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers. First, the runtime system estimates threads holding the locks that cause the most serialization as the set of limiter threads, which are prioritized by the memory scheduler. Second, the memory scheduler shuffles thread priorities to reduce the time threads take to reach the barrier. We show that our memory scheduler speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.

156 citations

Patent•
Inter-processor communication

James C. Wilshire1•
Western Digital1
23 Jun 2011
TL;DR: A multi-processor system is described comprising a first processor, a first memory coupled to the first processor, a second processor, and a shared memory subsystem that includes a shared memory and a data transfer unit.
Abstract: A multi-processor system is disclosed comprising a first processor, a first memory coupled to the first processor, a second processor, and a shared memory subsystem including a shared memory and a data transfer unit. The first processor is configured to build a data structure in the first memory and to send a direct memory access (DMA) transfer request to the data transfer unit of the shared memory subsystem, the DMA transfer request including an address of the data structure in the first memory. The data transfer unit is configured to retrieve the data structure from the first memory based on the DMA transfer request, to store the data structure in the shared memory, and to send a shared memory pointer to the second processor indicating an address of the data structure in the shared memory.

147 citations

Book•
Efficient synchronization on multiprocessors with shared memory

Clyde P. Kruskal1, Larry Rudolph2, Marc Snir2•
University of Maryland, College Park1, Hebrew University of Jerusalem2
25 Aug 2011
TL;DR: A new formalism is given for read-modify-write (RMW) synchronization operations, used to extend the memory reference combining mechanism introduced in the NYU Ultracomputer, to arbitrary RMW operations.
Abstract: A new formalism is given for read-modify-write (RMW) synchronization operations. This formalism is used to extend the memory reference combining mechanism introduced in the NYU Ultracomputer, to arbitrary RMW operations. A formal correctness proof of this combining mechanism is given. General requirements for the practicality of combining are discussed. Combining is shown to be practical for many useful memory access operations. This includes memory updates of the form mem_val := mem_val op val, where op need not be associative, and a variety of synchronization primitives. The computation involved is shown to be closely related to parallel prefix evaluation.
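The link between combining and parallel prefix can be illustrated with a small simulation of updates of the form mem_val := mem_val op val. This sketch covers only the simpler case of an associative `op` (the abstract notes that the paper's formalism also handles non-associative operators), and the function names `serial_rmw` and `combined_rmw` are hypothetical.

```python
from itertools import accumulate
import operator


def serial_rmw(mem, vals, op):
    """Reference behavior: apply each RMW in turn; each reply is the old value."""
    replies = []
    for v in vals:
        replies.append(mem)
        mem = op(mem, v)
    return mem, replies


def combined_rmw(mem, vals, op):
    """Combine all requests into one memory update (assumes op is associative).

    The network forwards a single request carrying vals[0] op ... op vals[-1].
    Memory is touched once, and each requester's reply is rebuilt from the old
    memory value and a prefix of the combined operands.
    """
    if not vals:
        return mem, []
    prefixes = list(accumulate(vals, op))     # inclusive prefix results
    new_mem = op(mem, prefixes[-1])           # the single memory update
    replies = [mem] + [op(mem, p) for p in prefixes[:-1]]
    return new_mem, replies


if __name__ == "__main__":
    # Three fetch-and-add requests against a cell that initially holds 10.
    print(combined_rmw(10, [1, 2, 3], operator.add))
```

For an associative operator both functions return the same final memory value and the same replies, which is the serialization property a combining network has to preserve.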

117 citations

Patent•
Cluster of processing nodes with distributed global flash memory using commodity server technology

Thomas A. Riddle, Darpan Dinker, Andrew D. Eckhardt1, Michael Koster1•
SanDisk1
11 Apr 2011
TL;DR: A distributed storage system comprising a plurality of nodes is described, in which each node executes one or more application processes that are capable of accessing persistent shared memory.
Abstract: Approaches for a distributed storage system that comprises a plurality of nodes. Each node, of the plurality of nodes, executes one or more application processes which are capable of accessing persistent shared memory. The persistent shared memory is implemented by solid state devices physically maintained on each of the plurality of nodes. Each of the one or more application processes, maintained on a particular node, of the plurality of nodes, communicates with a shared data fabric (SDF) to access the persistent shared memory. The persistent shared memory comprises a scoreboard implemented in shared DRAM memory that is mapped to a persistent storage. The scoreboard provides a crash tolerant mechanism for enabling application processes to communicate with the shared data fabric (SDF).

92 citations

Proceedings Article•10.5555/2016802.2016889•
The K computer: Japanese next-generation supercomputer development project

Mitsuo Yokokawa, Fumiyoshi Shoji, Atsuya Uno, Motoyoshi Kurokawa, Tadashi Watanabe 
1 Aug 2011
TL;DR: The K computer is a distributed memory supercomputer system consisting of more than 80,000 compute nodes which is being developed by RIKEN as a Japanese national project and its performance is aimed at achieving 10 peta-flops sustained in the LINPACK benchmark.
Abstract: The K computer is a distributed memory supercomputer system consisting of more than 80,000 compute nodes which is being developed by RIKEN as a Japanese national project. Its performance is aimed at achieving 10 peta-flops sustained in the LINPACK benchmark. The system is under installation and adjustment. The whole system will be operational in 2012.
Proceedings Article•10.1109/CLOUDCOM.2011.27•
Applications Know Best: Performance-Driven Memory Overcommit with Ginkgo

Michael R. Hines1, Abel Gordon1, Marcio Silva1, Dilma Da Silva1, Kyung Dong Ryu1, Muli Ben-Yehuda1 •
IBM1
29 Nov 2011
TL;DR: Ginkgo, a policy framework for overcommitting memory in an informed and automated fashion, automates the redistribution of scarce memory across all virtual machines, satisfying performance and capacity constraints.
Abstract: Memory overcommitment enables cloud providers to host more virtual machines on a single physical server, exploiting spare CPU and I/O capacity when physical memory becomes the bottleneck for virtual machine deployment. However, overcommitting memory can also cause noticeable application performance degradation. We present Ginkgo, a policy framework for overcommitting memory in an informed and automated fashion. By directly correlating application-level performance to memory, Ginkgo automates the redistribution of scarce memory across all virtual machines, satisfying performance and capacity constraints. Ginkgo also achieves memory gains for traditionally fixed-size Java applications by coordinating the redistribution of available memory with the activities of the Java Virtual Machine heap. When compared to a non-overcommitted system, Ginkgo runs the DayTrader 2.0 and SPEC Web 2009 benchmarks with the same number of virtual machines while saving up to 73% (50% omitting free space) of a physical server's memory while keeping application performance degradation within 7%.
Patent•
Disabling communication in a multiprocessor system

Michael B. Doerr, Carl S. Dobbs, Michael B. Solka, Michael R. Trocino, David A. Gibson 
14 Oct 2011
TL;DR: A configuration for a multiprocessor fabric is received that specifies disabling of communication paths between processors and communication elements, between processors, or between communication elements, and the fabric is automatically configured in hardware to disable the specified paths.
Abstract: Disabling communication in a multiprocessor fabric. The multiprocessor fabric may include a plurality of processors and a plurality of communication elements and each of the plurality of communication elements may include a memory. A configuration may be received for the multiprocessor fabric, which specifies disabling of communication paths between one or more of: one or more processors and one or more communication elements; one or more processors and one or more other processors; or one or more communication elements and one or more other communication elements. Accordingly, the multiprocessor fabric may be automatically configured in hardware to disable the communication paths specified by the configuration. The multiprocessor fabric may be operated to execute a software application according to the configuration.
Migration of a generic multi-physics framework to HPC environments

Pooyan Dadvand, Riccardo Rossi, Marisa Gil, Xavier Martorell Bofill, Jordi Cotela Dalmau, E. Juanpere, Sergio Rodolfo Idelsohn Barg, Eugenio Oñate Ibáñez de Navarra 
1 Jan 2011
TL;DR: This presentation explains how a relatively complex but modular framework can be prepared for high performance computing with minimum modification and presents the Telescope problem.
Abstract: Creating a highly parallelizable code is a challenge, and development for distributed memory machines (DMMs) can be very different from developing a serial code in terms of algorithms and structure. For this reason, many developers in the field prefer to develop their own code from scratch. However, for an existing framework with a large development background, the idea of transformation becomes attractive in order to reuse the effort invested during years of development. In this presentation we explain how a relatively complex but modular framework can be prepared for high performance computing with minimum modification. Kratos Multi-Physics [1] is an open source generic multi-disciplinary platform for the solution of coupled problems consisting of fluid, structure, thermal and electromagnetic fields. The parallelization of this framework is performed with the objective of making the fewest possible changes to its different solver modules and encapsulating the changes as much as possible in its common kernel. This objective is achieved thanks to the Kratos design and to an innovative way of dealing with data transfers for a multi-disciplinary code. This work is completed by the migration of the framework from the x86 architecture to the MareNostrum supercomputing platform. The migration has been verified by a set of benchmarks which show very good scalability, from which we present the Telescope problem in this paper.
Journal Article•10.1109/TVLSI.2010.2077314•
A Generalized Conflict-Free Memory Addressing Scheme for Continuous-Flow Parallel-Processing FFT Processors With Rescheduling

Pei-Yun Tsai1, Chung-Yi Lin1•
National Central University1
01 Dec 2011-IEEE Transactions on Very Large Scale Integration Systems
TL;DR: A generalized conflict-free memory addressing scheme for memory-based fast Fourier transform (FFT) processors with parallel arithmetic processing units made up of radix-2^q multi-path delay commutator (MDC) is presented.
Abstract: This paper presents a generalized conflict-free memory addressing scheme for memory-based fast Fourier transform (FFT) processors with parallel arithmetic processing units made up of radix-2^q multi-path delay commutator (MDC). The proposed addressing scheme considers the continuous-flow operation with minimum shared memory requirements. To improve throughput, parallel high-radix processing units are employed. We prove that the solution to non-conflict memory access satisfying the constraints of the continuous-flow, variable-size, higher-radix, and parallel-processing operations indeed exists. In addition, a rescheduling technique for twiddle-factor multiplication is developed to reduce hardware complexity and to enhance hardware efficiency. From the results, we can see that the proposed processor has high utilization and efficiency to support flexible configurability for various FFT sizes with fewer computation cycles than the conventional radix-2/radix-4 memory-based FFT processors.
Journal Article•10.1145/1880050.1880058•
Real-time performance analysis of multiprocessor systems with shared memory

Simon Schliecker1, Rolf Ernst1•
Braunschweig University of Technology1
07 Jan 2011-ACM Transactions on Embedded Computing Systems
TL;DR: This article presents a general methodology to model the shared resource traffic and consider its effect on the local task execution, and proposes an iterative approach to tackle the analysis dependencies that exist in systems with event-driven task activation and dynamic resource arbitration.
Abstract: Predicting timing behavior is key to reliable real-time system design and verification, but becomes increasingly difficult for current multiprocessor systems on chip. The integration of formerly separate functionality into a single multicore system introduces new intercore timing dependencies resulting from the common use of the now shared resources. This feedback of system timing on local timing makes traditional performance analysis approaches inappropriate. This article presents a general methodology to model the shared resource traffic and consider its effect on the local task execution. The aggregate busy time captures the timing of multiple accesses to a shared memory far better than the traditional models that focus on the timing of individual events. An iterative approach is proposed to tackle the analysis dependencies that exist in systems with event-driven task activation and dynamic resource arbitration.
Journal Article•
Large Data Visualization on Distributed Memory Multi-GPU Clusters

Hank Childs
17 Jun 2011-Lawrence Berkeley National Laboratory
TL;DR: This work studies a common visualization technique in a GPU-accelerated, distributed memory setting, and presents performance characteristics when scaling to extremely large data sets.
Abstract: Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualized on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory, and more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
Proceedings Article•10.1109/IPDPS.2011.80•
Using Shared Memory to Accelerate MapReduce on Graphics Processing Units

Feng Ji1, Xiaosong Ma1•
North Carolina State University1
16 May 2011
TL;DR: This work designed and implemented a GPU MapReduce framework, whose key techniques include shared memory staging area management, thread-role partitioning, and intra-block thread synchronization, and proposes a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy.
Abstract: Modern General Purpose Graphics Processing Units (GPGPUs) provide high degrees of parallelism in computation and memory access, making them suitable for data parallel applications such as those using the elastic MapReduce model. Yet designing a MapReduce framework for GPUs faces significant challenges brought by their multi-level memory hierarchy. Due to the absence of atomic operations in the earlier generations of GPUs, existing GPU MapReduce frameworks have problems in handling input/output data with varied or unpredictable sizes. Also, existing frameworks utilize mostly a single level of memory, i.e., the relatively spacious yet slow global memory. In this work, we attempt to explore the potential benefit of enabling a GPU MapReduce framework to use multiple levels of the GPU memory hierarchy. We propose a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy. Centering around the efficient utilization of the fast but very small shared memory, we designed and implemented a GPU MapReduce framework, whose key techniques include (1) shared memory staging area management, (2) thread-role partitioning, and (3) intra-block thread synchronization. We carried out evaluation with five popular MapReduce workloads and studied their performance under different GPU memory usage choices. Our results reveal that exploiting GPU shared memory is highly promising for the Map phase (with an average 2.85x speedup over using global memory only), while in the Reduce phase the benefit of using shared memory is much less pronounced, due to the high input-to-output ratio. In addition, when compared to Mars, an existing GPU MapReduce framework, our system is shown to bring a significant speedup in Map/Reduce phases.
Proceedings Article•10.1109/ICRA.2011.5979751•
Parallelizing RRT on distributed-memory architectures

Didier Devaurs, Thierry Siméon, Juan Cortés
9 May 2011
TL;DR: This paper addresses the problem of improving the performance of the Rapidly-exploring Random Tree (RRT) algorithm by parallelizing it, and presents three parallel versions of RRT along with the technicalities involved in their implementation.
Abstract: This paper addresses the problem of improving the performance of the Rapidly-exploring Random Tree (RRT) algorithm by parallelizing it. For scalability reasons we do so on a distributed-memory architecture, using the message-passing paradigm. We present three parallel versions of RRT along with the technicalities involved in their implementation. We also evaluate the algorithms and study how they behave on different motion planning problems.
Patent•
Cooperative memory resource management for virtualized computing devices

Harvey Tuch1, Craig Newell1, Cyprien Laplace1•
VMware1
23 Aug 2011
TL;DR: A cooperative memory management technique is employed to dynamically balance memory resources between host and guest systems running within a computing device, where memory that is allocated to the guest system is dynamically adjusted up and down according to a fairness policy that takes into account various factors.
Abstract: A computing device employs a cooperative memory management technique to dynamically balance memory resources between host and guest systems running therein. According to this cooperative memory management technique, memory that is allocated to the guest system is dynamically adjusted up and down according to a fairness policy that takes into account various factors including the relative amount of readily freeable memory resources in the host and guest systems and the relative amount of memory allocated to hidden applications in the host and guest systems.
Proceedings Article•10.1145/2063384.2063452•
Efficient data race detection for distributed memory parallel programs

Chang-Seo Park1, Koushik Sen1, Paul Hargrove2, Costin Iancu2•
University of California, Berkeley1, Lawrence Berkeley National Laboratory2
12 Nov 2011
TL;DR: The framework for the UPC programming language is implemented and scalability up to a thousand cores for programs with both fine-grained and bulk (MPI style) communication and single- and split-phase barriers is demonstrated.
Abstract: In this paper we present a precise data race detection technique for distributed memory parallel programs. Our technique, which we call Active Testing, builds on our previous work on race detection for shared memory Java and C programs and it handles programs written using shared memory approaches as well as bulk communication. Active testing works in two phases: in the first phase, it performs an imprecise dynamic analysis of an execution of the program and finds potential data races that could happen if the program is executed with a different thread schedule. In the second phase, active testing re-executes the program by actively controlling the thread schedule so that the data races reported in the first phase can be confirmed. A key highlight of our technique is that it can scalably handle distributed programs with bulk communication and single- and split-phase barriers. Another key feature of our technique is that it is precise — a data race confirmed by active testing is an actual data race present in the program; however, being a testing approach, our technique can miss actual data races. We implement the framework for the UPC programming language and demonstrate scalability up to a thousand cores for programs with both fine-grained and bulk (MPI style) communication. The tool confirms previously known bugs and uncovers several unknown ones. Our extensions capture constructs proposed in several modern programming languages for High Performance Computing, most notably non-blocking barriers and collectives.
Proceedings Article•10.1109/ICPP.2011.59•
Understanding Off-Chip Memory Contention of Parallel Programs in Multicore Systems

Bogdan Marius Tudor1, Yong Meng Teo1, Simon See2•
National University of Singapore1, Nvidia2
13 Sep 2011
TL;DR: An analytical model is proposed that relates the growth of memory contention to the number of active cores and to the problem size, for both uniform (UMA) and non-uniform memory access (NUMA) systems; the model differs from measurements on average by less than 14%.
Abstract: Memory contention is an important performance issue in current multicore architectures. In this paper, we focus on understanding how off-chip memory contention affects the performance of parallel applications. Using measurements conducted on state-of-the-art multicore systems, we observed that off-chip memory traffic is not always bursty, as was previously reported in the literature. Burstiness depends on the problem size. Small problem sizes lead to bursty memory traffic and generate little off-chip contention. In contrast, when large problem sizes cause memory contention, the memory traffic is non-bursty. Based on these observations, we propose an analytical model that relates the growth of memory contention to the number of active cores and to the problem size, for both uniform (UMA) and non-uniform memory access (NUMA) systems. Our model differs from measurements on average by less than 14%. Contention for off-chip memory grows exponentially with the number of active cores, but adding memory controllers reduces the memory contention. For programs such as the pentadiagonal solver SP from the NPB benchmark, with a large matrix of $162^3$ elements (input size C), our analysis shows that memory contention increases the total number of processor cycles to execute the program by more than ten times on a machine with 24 cores.
Journal Article•10.1088/0067-0049/195/2/20•
Fish: a three-dimensional parallel magnetohydrodynamics code for astrophysical applications

Roger Käppeli1, S. C. Whitehouse1, Simon Scheidegger1, Ue-Li Pen2, Matthias Liebendörfer1 •
University of Basel1, University of Toronto2
04 Aug 2011-Astrophysical Journal Supplement Series
TL;DR: FISH is a fast and simple ideal magnetohydrodynamics code that scales to ~10,000 processes for a Cartesian computational domain of ~$1000^3$ cells; its simplicity is achieved by the rigorous application of the operator splitting technique.
Abstract: FISH is a fast and simple ideal magnetohydrodynamics code that scales to ~10,000 processes for a Cartesian computational domain of ~$1000^3$ cells. The simplicity of FISH has been achieved by the rigorous application of the operator splitting technique, while second-order accuracy is maintained by the symmetric ordering of the operators. Between directional sweeps, the three-dimensional data are rotated in memory so that the sweep is always performed in a cache-efficient way along the direction of contiguous memory. Hence, the code only requires a one-dimensional description of the conservation equations to be solved. This approach also enables an elegant novel parallelization of the code that is based on persistent communications with MPI for cubic domain decomposition on machines with distributed memory. This scheme is then combined with an additional OpenMP parallelization of different sweeps that can take advantage of clusters of shared memory. We document the detailed implementation of a second-order total variation diminishing advection scheme based on flux reconstruction. The magnetic fields are evolved by a constrained transport scheme. We show that the subtraction of a simple estimate of the hydrostatic gradient from the total gradients can significantly reduce the dissipation of the advection scheme in simulations of gravitationally bound hydrostatic objects. Through its simplicity and efficiency, FISH is as well suited for hydrodynamics classes as for large-scale astrophysical simulations on high-performance computer clusters. In preparation for the release of a public version, we demonstrate the performance of FISH in a suite of astrophysically orientated test cases.
Proceedings Article•10.1109/FPL.2011.70•
Heracles: Fully Synthesizable Parameterized MIPS-Based Multicore System

Michel A. Kinsy1, Michael Pellauer1, Srinivas Devadas1•
Massachusetts Institute of Technology1
5 Sep 2011
TL;DR: This work shows different topology configurations of the Heracles system, and their synthesis results on the Xilinx Virtex-5 LX330T FPGA board, and provides a small MIPS cross-compiler tool chain to assist in developing software for Heracles.
Abstract: Heracles is an open-source complete multicore system written in Verilog. It is fully parameterized and can be reconfigured and synthesized into different topologies and sizes. Each processing node has a fully bypassed, 7-stage pipelined microprocessor running the MIPS-III ISA, a 4-stage input-buffer, virtual-channel router, and a local variable-size shared memory. Our design is highly modular with clear interfaces between the core, the memory hierarchy, and the on-chip network. In the baseline design, the microprocessor is attached to two caches, one instruction cache and one data cache, which are oblivious to the global memory organization. The memory system in Heracles can be configured as one single global shared memory (SM), or distributed shared memory (DSM), or any combination thereof. Each core is connected to the rest of the network of processors by a parameterized, realistic, wormhole router. We show different topology configurations of the system, and their synthesis results on the Xilinx Virtex-5 LX330T FPGA board. We also provide a small MIPS cross-compiler tool chain to assist in developing software for Heracles.
Proceedings Article•10.1109/IPDPS.2011.197•
Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms

Eduardo H.M. Cruz1, Marco A. Z. Alves1, Alexandre Carissimi1, Philippe O. A. Navaux1, Christiane Pousa Ribeiro2, Jean-François Méhaut2 •
University of Rio Grande1, University of Grenoble2
16 May 2011
TL;DR: This work aims to provide a method for static mapping for NUMA architectures which does not require any prior knowledge of the application and shows performance gains of up to 75% compared to the native scheduler and memory allocator of the operating system.
Abstract: In parallel programs, the tasks of a given application must cooperate in order to accomplish the required computation. However, the communication time between the tasks may differ depending on the cores on which they execute and on how the memory hierarchy and interconnection are used. The problem is even more important in multi-core machines with NUMA characteristics, since remote accesses impose high overhead, making these machines more sensitive to thread and data mapping. In this context, process mapping is a technique that provides performance gains by improving the use of resources such as interconnections, main memory and cache memory. The problem of detecting the best mapping is considered NP-Hard. Furthermore, in shared memory environments, there is an additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. This work aims to provide a method for static mapping for NUMA architectures which does not require any prior knowledge of the application. Different metrics were adopted and a heuristic method based on the Edmonds matching algorithm was used to obtain the mapping. In order to evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) and two modern multi-core NUMA machines. Results show performance gains of up to 75% compared to the native scheduler and memory allocator of the operating system.
Proceedings Article•
Low cost working set size tracking

Weiming Zhao1, Xinxin Jin2, Zhenlin Wang1, Xiaolin Wang2, Yingwei Luo2, Xiaoming Li2 •
Michigan Technological University1, Peking University2
15 Jun 2011
TL;DR: An intermittent tracking scheme, which can temporarily turn off memory tracking when memory demands are predicted to be stable, and is orthogonal to existing optimizing techniques, such as AVL-tree based LRU structure and dynamic hot set sizing.
Abstract: Efficient memory resource management requires knowledge of the memory demands of applications or systems at runtime. A widely proposed approach is to construct an LRU-based miss ratio curve (MRC), which provides not only the current working set size (WSS) but also the relationship between performance and target memory allocation size. Unfortunately, the cost of LRU MRC monitoring is nontrivial. Even when optimized with an AVL-tree based LRU structure and dynamic hot set sizing, the overhead is still as high as 16% on average. Based on a key insight that for most programs the WSSs are stable most of the time, we design an intermittent tracking scheme, which can temporarily turn off memory tracking when memory demands are predicted to be stable. With the assistance of hardware performance counters, memory tracking can be turned on again if a significant change in memory demands is expected. Experimental results show that, by using this intermittent tracking design, memory tracking can be turned off for 82% of the execution time while the accuracy loss is no more than 4%. More importantly, this design is orthogonal to existing optimizing techniques, such as AVL-tree based LRU structure and dynamic hot set sizing. By combining the three approaches, the mean overhead is lowered to only 2%. We show that when applied to memory balancing for virtual machines, our scheme brings a speedup of 1.85.
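To make the notion of an LRU-based miss ratio curve concrete, the following is a minimal sketch using the classic stack-distance idea: for each access, record how deep the page sits in the LRU stack, then accumulate hits per memory size. It is not the paper's implementation; it scans a plain list instead of using an AVL tree, and the function names `lru_miss_ratio_curve` and `working_set_size`, as well as the `tolerance` parameter, are hypothetical.

```python
from collections import OrderedDict


def lru_miss_ratio_curve(trace, max_pages):
    """Miss ratio for LRU memory sizes 1..max_pages (illustrative sketch).

    `trace` is a sequence of page identifiers. A linear scan finds each
    stack distance, so this is O(n * m); it only illustrates the idea.
    """
    stack = OrderedDict()                      # most recently used page is last
    hits_at_distance = [0] * (max_pages + 1)   # index d: hits at stack distance d
    for page in trace:
        if page in stack:
            keys = list(stack.keys())
            # 1-based distance from the most-recently-used end of the stack
            distance = len(keys) - keys.index(page)
            if distance <= max_pages:
                hits_at_distance[distance] += 1
            stack.move_to_end(page)
        else:
            stack[page] = True                 # first touch: always a miss
    total = len(trace)
    if total == 0:
        return [0.0] * max_pages
    curve, hits = [], 0
    for size in range(1, max_pages + 1):
        hits += hits_at_distance[size]         # a memory of `size` pages hits these
        curve.append((total - hits) / total)
    return curve


def working_set_size(curve, tolerance=0.0):
    """Smallest size whose miss ratio is within `tolerance` of the best one."""
    best = curve[-1]
    for size, ratio in enumerate(curve, start=1):
        if ratio - best <= tolerance:
            return size
    return len(curve)
```

For example, `lru_miss_ratio_curve([1, 2, 3, 1, 2, 3, 4, 1], 4)` yields a curve that stays at 1.0 until three pages are available, which is the kind of knee a working set estimate looks for.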
Journal Article•10.1016/J.PARCO.2011.05.006•
Tuning collective communication for Partitioned Global Address Space programming models

Rajesh Nishtala1, Yili Zheng2, Paul Hargrove2, Katherine Yelick2•
University of California, Berkeley1, Lawrence Berkeley National Laboratory2
1 Sep 2011
TL;DR: An implementation framework for PGAS collectives is presented as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids and supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned.
Abstract: Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and implementation. In particular, PGAS collectives have semantic issues that are different than in send-receive style message passing programs, and different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect.
Proceedings Article•10.1109/PDCAT.2011.18•
An Experimental Study on Memory Allocators in Multicore and Multithreaded Applications

Tais B. Ferreira1, Rivalino Matias1, Autran Macedo1, Lúcio Borges de Araújo•
Federal University of Uberlandia1
20 Oct 2011
TL;DR: This paper presents an experimental comparative study of seven widely adopted memory allocators using real-world multithreaded applications and considers the applications' response time, memory consumption, and memory fragmentation, in order to compare the performance of the investigated memory allocators running on a multicore machine.
Abstract: Memory allocation is one of the most ubiquitous operations in computer programs. The performance of memory allocation operations is a very important aspect to be considered in software design; however, it is frequently neglected. This paper presents an experimental comparative study of seven widely adopted memory allocators. Unlike other related works, we assess the selected memory allocators using real-world multithreaded applications. We consider the applications' response time, memory consumption, and memory fragmentation, in order to compare the performance of the investigated memory allocators running on a multicore machine. All test results are evaluated with respect to their statistical significance using the ANOVA method.
Proceedings Article•10.1109/IPDPS.2011.360•
Distributed-Memory Parallel Algorithms for Matching and Coloring

Ümit V. Çatalyürek1, Florin Dobrian, Assefaw H. Gebremedhin2, Mahantesh Halappanavar3, Alex Pothen2 •
Ohio State University1, Purdue University2, Pacific Northwest National Laboratory3
16 May 2011
TL;DR: The design and implementation of new highly-scalable distributed-memory parallel algorithms for two prototypical graph problems, edge-weighted matching and distance-1 vertex coloring, and the results show that the algorithms hold strong potential for computing at petascale.
Abstract: We discuss the design and implementation of new highly-scalable distributed-memory parallel algorithms for two prototypical graph problems, edge-weighted matching and distance-1 vertex coloring. Graph algorithms in general have low concurrency, poor data locality, and high ratio of data access to computation costs, making it challenging to achieve scalability on massively parallel machines. We overcome this challenge by employing a variety of techniques, including speculation and iteration, optimized communication, and randomization. We present preliminary results on weak and strong scalability studies conducted on an IBM Blue Gene/P machine employing up to tens of thousands of processors. The results show that the algorithms hold strong potential for computing at petascale.
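The "speculation and iteration" idea for distance-1 coloring can be illustrated with a small sequential simulation: every uncolored vertex picks the smallest color not seen on its neighbours using a stale view of the graph, then conflicts are detected and one endpoint of each conflicting edge is recolored in the next round. This is a sketch under stated assumptions, not the authors' distributed algorithm; the function name `speculative_coloring` and the tie-break rule (the higher-numbered endpoint is recolored) are hypothetical choices.

```python
def speculative_coloring(adj):
    """Distance-1 coloring by speculation and iteration (sequential simulation).

    `adj` maps each vertex to the set of its neighbours in an undirected graph.
    Returns a dict mapping each vertex to a non-negative integer color.
    """
    color = {v: None for v in adj}
    worklist = sorted(adj)
    while worklist:
        # Stale view: every vertex in this round sees colors as of round start,
        # mimicking processors that have not yet exchanged updates.
        snapshot = dict(color)
        # Phase 1: speculative coloring against the stale snapshot.
        for v in worklist:
            used = {snapshot[u] for u in adj[v] if snapshot[u] is not None}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        # Phase 2: conflict detection. For each conflicting edge, the
        # higher-numbered endpoint is queued for recoloring (assumed rule).
        worklist = sorted(
            v for v in worklist
            if any(color[u] == color[v] and u < v for u in adj[v])
        )
    return color
```

The lowest-numbered vertex in each round is never re-queued, so the worklist shrinks every round and the loop terminates. For instance, `speculative_coloring({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}})` returns a coloring in which no edge joins two vertices of the same color.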
Proceedings Article•10.1145/1950365.1950372•
Hardware acceleration of transactional memory on commodity systems

Jared Casper1, Tayo Oguntebi1, Sungpack Hong1, Nathan G. Bronson1, Christos Kozyrakis1, Kunle Olukotun1 •
Stanford University1
5 Mar 2011
TL;DR: It is demonstrated that hardware can substantially accelerate the performance of an STM on unmodified commodity processors, and it is shown that, for all but short transactions, it is not necessary to modify the processor to obtain substantial improvement in TM performance.
Abstract: The adoption of transactional memory is hindered by the high overhead of software transactional memory and the intrusive design changes required by previously proposed TM hardware. We propose that hardware to accelerate software transactional memory (STM) can reside outside an unmodified commodity processor core, thereby substantially reducing implementation costs. This paper introduces Transactional Memory Acceleration using Commodity Cores (TMACC), a hardware-accelerated TM system that does not modify the processor, caches, or coherence protocol. We present a complete hardware implementation of TMACC using a rapid prototyping platform. Using this hardware, we implement two unique conflict detection schemes which are accelerated using Bloom filters on an FPGA. These schemes employ novel techniques for tolerating the latency of fine-grained asynchronous communication with an out-of-core accelerator. We then conduct experiments to explore the feasibility of accelerating TM without modifying existing system hardware. We show that, for all but short transactions, it is not necessary to modify the processor to obtain substantial improvement in TM performance. In these cases, TMACC outperforms an STM by an average of 69% in applications using moderate-length transactions, showing maximum speedup within 8% of an upper bound on TM acceleration. Overall, we demonstrate that hardware can substantially accelerate the performance of an STM on unmodified commodity processors.
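As a rough illustration of how a Bloom filter can stand in for a transaction's read set during conflict detection, the sketch below keeps a fixed-size bit signature per transaction and tests a committing transaction's written addresses against it. This is a software sketch only and does not reproduce either TMACC scheme; the class `BloomSignature`, the function `conflicts`, and the parameters `bits` and `hashes` are hypothetical.

```python
import hashlib


class BloomSignature:
    """Fixed-size Bloom filter over memory addresses (illustrative only)."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.field = 0  # a Python int used as a bit vector

    def _positions(self, address):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, address):
        for p in self._positions(address):
            self.field |= 1 << p

    def might_contain(self, address):
        # True may be a false positive; False is always definite.
        return all((self.field >> p) & 1 for p in self._positions(address))


def conflicts(committing_writes, other_read_signature):
    """Report a possible conflict if any written address may have been read."""
    return any(other_read_signature.might_contain(a) for a in committing_writes)
```

Because a Bloom filter has no false negatives, a real conflict is never missed; a false positive only causes an unnecessary abort, which is the usual trade-off for keeping the signature small.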