Top 353 papers published in the topic of Distributed memory in 2012

Showing papers on "Distributed memory" published in 2012

Proceedings Article•

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia¹, Mosharaf Chowdhury¹, Tathagata Das¹, Ankur Dave¹, Justin Ma¹, Murphy McCauley¹, Michael J. Franklin¹, Scott Shenker¹, Ion Stoica¹•Institutions (1)

University of California, Berkeley¹

25 Apr 2012

TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

Abstract: We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

4,605 citations

Journal Article•10.1016/J.CPC.2011.10.012•

Implementing Molecular Dynamics on Hybrid High Performance Computers - Particle-Particle Particle-Mesh

W. Michael Brown¹, Axel Kohlmeyer², Steven J. Plimpton³, Arnold N. Tharrington¹•Institutions (3)

National Center for Computational Sciences¹, Temple University², Sandia National Laboratories³

01 Mar 2012-Computer Physics Communications

TL;DR: This paper presents an efficient implementation of the particle–particle particle-mesh method based on the work by Harvey and De Fabritiis, and provides a performance comparison of the same kernels compiled with both CUDA and OpenCL.

499 citations

Journal Article•10.1109/TPDS.2011.308•

A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era

J. Diaz¹, Camelia Muñoz-Caro, Alfonso Niño•Institutions (1)

Indiana University¹

01 Aug 2012-IEEE Transactions on Parallel and Distributed Systems

TL;DR: A survey of the different parallel programming models and tools available today with special consideration to their suitability for high-performance computing finds that hybrid parallel programming is the current way of harnessing the capabilities of computer clusters with multi-core nodes.

Abstract: In this work, we present a survey of the different parallel programming models and tools available today with special consideration to their suitability for high-performance computing. Thus, we review the shared and distributed memory approaches, as well as the current heterogeneous parallel programming model. In addition, we analyze how the partitioned global address space (PGAS) and hybrid parallel programming models are used to combine the advantages of shared and distributed memory systems. The work is completed by considering languages with specific parallel support and the distributed programming paradigm. In all cases, we present characteristics, strengths, and weaknesses. The study shows that the availability of multi-core CPUs has given new impulse to the shared memory parallel programming approach. In addition, we find that hybrid parallel programming is the current way of harnessing the capabilities of computer clusters with multi-core nodes. On the other hand, heterogeneous programming is found to be an increasingly popular paradigm, as a consequence of the availability of multi-core CPUs+GPUs systems. The use of open industry standards like OpenMP, MPI, or OpenCL, as opposed to proprietary solutions, seems to be the way to uniformize and extend the use of parallel programming models.

268 citations

Journal Article•10.1137/110848244•

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments

Aydin Buluc, John R. Gilbert

26 Jul 2012-SIAM Journal on Scientific Computing

TL;DR: It is demonstrated that the parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case.

Abstract: Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.
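
One common way to express sparse-matrix indexing through SpGEMM is to multiply by selection matrices: A[I, J] = R · A · Q, where R picks rows and Q picks columns. The sketch below assumes that formulation and uses a serial dict-of-dicts representation. The function names are hypothetical, and none of the 2D block distribution or hypersparse kernels from the paper are modeled.

```python
def spgemm(A, B):
    """Multiply sparse matrices stored as {row: {col: value}} (row-by-row accumulation)."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C


def row_selector(rows):
    # R[p][rows[p]] = 1, so R @ A lists the requested rows of A in order.
    return {p: {r: 1} for p, r in enumerate(rows)}


def col_selector(cols):
    # Q[cols[q]][q] = 1, so A @ Q lists the requested columns of A in order.
    Q = {}
    for q, c in enumerate(cols):
        Q.setdefault(c, {})[q] = 1
    return Q


def sparse_index(A, rows, cols):
    """Return A[rows, cols] using two sparse matrix products."""
    return spgemm(spgemm(row_selector(rows), A), col_selector(cols))


# 3x3 example matrix with five nonzeros.
A = {0: {0: 5, 2: 7}, 1: {1: 3}, 2: {0: 2, 2: 9}}
sub = sparse_index(A, rows=[2, 0], cols=[2, 0])
```

Here `sub` holds the 2x2 submatrix made of rows 2 and 0 and columns 2 and 0 of `A`, in that order.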

265 citations

Proceedings Article•10.1109/INPAR.2012.6339594•

OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures

Gihan R. Mudalige¹, Michael B. Giles¹, Istvan Z. Reguly², Carlo Bertolli³, Paul H. J. Kelly³•Institutions (3)

University of Oxford¹, Pázmány Péter Catholic University², Imperial College London³

13 May 2012

TL;DR: It is demonstrated that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.

Abstract: OP2 is an “active” library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2's recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs. We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.

115 citations

Patent•

Apparatus and method for controlling distributed memory cluster

Ki-Young Jung

14 Feb 2012

TL;DR: A distributed computing system may include a computing node cluster, a distributed memory cluster, and a controlling node, where the controlling node is configured to select memory blocks of the associated data for distribution on the distributed memory cluster based on a node selection rule and memory cluster structure information.

Abstract: Provided are an apparatus and method for controlling a distributed memory cluster. A distributed computing system may include a computing node cluster, a distributed memory cluster, and a controlling node. The computing node cluster may include a plurality of computing nodes including first computing nodes that each generates associated data. The distributed memory cluster may be configured to store the associated data of the first computing nodes. The controlling node may be configured to select memory blocks of the associated data for distribution on the distributed memory cluster based on a node selection rule and memory cluster structure information, and to select second computing nodes from the computing node cluster based on a location selection rule and the memory cluster structure information.

94 citations

Proceedings Article•10.1145/2287076.2287104•

Highly scalable graph search for the Graph500 benchmark

Koji Ueno¹, Toyotaro Suzumura¹•Institutions (1)

Tokyo Institute of Technology¹

18 Jun 2012

TL;DR: An optimized method based on 2D partitioning and other methods such as communication compression and vertex sorting is devised to handle BFS (Breadth First Search) of a large graph with 2^36 vertices and 2^40 edges in 10.58 seconds, which corresponds to 103.9 GE/s.

Abstract: Graph500 is a new benchmark to rank supercomputers with a large-scale graph search problem. We found that the provided reference implementations are not scalable in a large distributed environment. We devised an optimized method based on 2D partitioning and other methods such as communication compression and vertex sorting. Our optimized implementation can handle BFS (Breadth First Search) of a large graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds while using 1366 nodes and 16,392 CPU cores. This performance corresponds to 103.9 GE/s. We also studied the performance characteristics of our optimized implementation and reference implementations on a large distributed memory supercomputer with a Fat-Tree-based Infiniband network.
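
The reported rate can be checked with a short calculation. It reads the graph size as 2^36 vertices and 2^40 edges, which matches the stated 68.7 billion and 1.1 trillion, and assumes "GE/s" means 10^9 edges per second.

```python
# Graph size read as powers of two (assumption consistent with the stated totals).
vertices = 2 ** 36
edges = 2 ** 40
seconds = 10.58          # reported BFS time

billions_of_vertices = vertices / 1e9
trillions_of_edges = edges / 1e12
ge_per_s = edges / seconds / 1e9   # giga-edges per second

print(f"{billions_of_vertices:.1f} billion vertices")
print(f"{trillions_of_edges:.1f} trillion edges")
print(f"{ge_per_s:.1f} GE/s")
```

Under these assumptions the three printed values agree with the 68.7 billion, 1.1 trillion, and 103.9 GE/s figures quoted in the abstract.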

90 citations

Proceedings Article•10.1145/2254064.2254115•

Dynamic synthesis for relaxed memory models

Feng Liu¹, Nayden Nedev¹, Nedyalko Prisadnikov², Martin Vechev³, Eran Yahav⁴•Institutions (4)

Princeton University¹, Sofia University², ETH Zurich³, Technion – Israel Institute of Technology⁴

11 Jun 2012

TL;DR: This paper presents the first scalable framework for handling real-world concurrent algorithms running on relaxed architectures, and performs the first in-depth study of the interaction between fences in real-world concurrent C programs, correctness criteria such as sequential consistency and linearizability, and memory models such as TSO and PSO.

Abstract: Modern architectures implement relaxed memory models which may reorder memory operations or execute them non-atomically. Special instructions called memory fences are provided, allowing control of this behavior. To implement a concurrent algorithm for a modern architecture, the programmer is forced to manually reason about subtle relaxed behaviors and figure out ways to control these behaviors by adding fences to the program. Not only is this process time consuming and error-prone, but it has to be repeated every time the implementation is ported to a different architecture. In this paper, we present the first scalable framework for handling real-world concurrent algorithms running on relaxed architectures. Given a concurrent C program, a safety specification, and a description of the memory model, our framework tests the program on the memory model to expose violations of the specification, and synthesizes a set of necessary ordering constraints that prevent these violations. The ordering constraints are then realized as additional fences in the program. We implemented our approach in a tool called DFence based on LLVM and used it to infer fences in a number of concurrent algorithms. Using DFence, we perform the first in-depth study of the interaction between fences in real-world concurrent C programs, correctness criteria such as sequential consistency and linearizability, and memory models such as TSO and PSO, yielding many interesting observations. We believe that this is the first tool that can handle programs at the scale and complexity of a lock-free memory allocator.

90 citations

Journal Article•10.1145/2366231.2337222•

The dynamic granularity memory system

Doe Hyun Yoon¹, Min Kyu Jeong², Michael J. Sullivan², Mattan Erez²•Institutions (2)

Hewlett-Packard¹, University of Texas at Austin²

9 Jun 2012

TL;DR: A memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses.

Abstract: Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.

78 citations

The Trinity Graph Engine

Bin Shao, Haixun Wang, Yatao Li

1 Mar 2012

TL;DR: Trinity as discussed by the authors is a general purpose graph engine over a distributed memory cloud, which leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance.

Abstract: Computations performed by graph algorithms are data driven, and require a high degree of random data access. Despite the great progress made in disk technology, it still cannot provide the level of efficient random access required by graph computation. On the other hand, memory-based approaches usually do not scale due to the capacity limit of single machines. In this paper, we introduce Trinity, a general purpose graph engine over a distributed memory cloud. Through optimized memory management and network communication, Trinity supports fast graph exploration as well as efficient parallel computing. In particular, Trinity leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance. These enable Trinity to support efficient online query processing and offline analytics on large graphs with just a few commodity machines. Furthermore, Trinity provides a high level specification language called TSL for users to declare data schema and communication protocols, which brings great ease-of-use for general purpose graph management and computing. Our experiments show Trinity’s performance in both low latency graph queries and high throughput graph analytics on web-scale, billion-node graphs.

76 citations

Proceedings Article•10.1109/IPDPS.2012.64•

ShyLU: A Hybrid-Hybrid Solver for Multicore Platforms

Sivasankaran Rajamanickam¹, Erik G. Boman¹, Michael A. Heroux¹•Institutions (1)

Sandia National Laboratories¹

21 May 2012

TL;DR: ShyLU is presented, a “hybrid-hybrid” solver for general sparse linear systems that is hybrid in two ways: first, it combines direct and iterative methods, and second, the solver uses two levels of parallelism via hybrid programming (MPI+threads).

Abstract: With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a “hybrid-hybrid” solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements where we compute the approximate Schur complement using a value-based dropping strategy or structure-based probing strategy. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). ShyLU is useful both in shared-memory environments and on large parallel computers with distributed memory. In the latter case, it should be used as a subdomain solver. We argue that with the increasing complexity of compute nodes, it is important to exploit multiple levels of parallelism even within a single compute node. We show the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 384 cores for a given problem size. We also study the MPI-only performance of ShyLU against a hybrid implementation and conclude that on present multicore nodes MPI-only implementation is better. However, for future multicore machines (96 or more cores) hybrid/ hierarchical algorithms and implementations are important for sustained performance.

Proceedings Article•10.1145/2287076.2287103•

Work stealing and persistence-based load balancers for iterative overdecomposed applications

Jonathan Lifflander¹, Sriram Krishnamoorthy², Laxmikant V. Kale¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, Pacific Northwest National Laboratory²

18 Jun 2012

TL;DR: This paper presents a hierarchical persistence-based rebalancing algorithm that performs localized incremental rebalancing, and an active-message-based retentive work stealing algorithm optimized for iterative applications on distributed memory machines.

Abstract: Applications often involve iterative execution of identical or slowly evolving calculations. Such applications require incremental rebalancing to improve load balance across iterations. In this paper, we consider the design and evaluation of two distinct approaches to addressing this challenge: persistence-based load balancing and work stealing. The work to be performed is overdecomposed into tasks, enabling automatic rebalancing by the middleware. We present a hierarchical persistence-based rebalancing algorithm that performs localized incremental rebalancing. We also present an active-message-based retentive work stealing algorithm optimized for iterative applications on distributed memory machines. We demonstrate low overheads and high efficiencies on the full NERSC Hopper (146,400 cores) and ALCF Intrepid systems (163,840 cores), and on up to 128,000 cores on OLCF Titan.
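
The work-stealing side of this abstract can be illustrated with a small single-process simulation. This is a generic sketch, not the paper's algorithm: `run_iteration` is a hypothetical helper, and the "retentive" step is modeled under the assumption that each worker starts the next iteration with the tasks it executed in the previous one.

```python
from collections import deque
import random


def run_iteration(initial_queues, seed=0):
    """Simulate one iteration; return (tasks executed per worker, number of steals)."""
    rng = random.Random(seed)
    queues = [deque(tasks) for tasks in initial_queues]
    executed = [[] for _ in queues]
    steals = 0
    while any(queues):
        for worker, queue in enumerate(queues):
            if queue:
                task = queue.pop()                 # owner takes from the tail of its own deque
            else:
                victims = [v for v, q in enumerate(queues) if q]
                if not victims:
                    continue                       # nothing left to steal in this round
                victim = rng.choice(victims)
                task = queues[victim].popleft()    # thief takes from the head of the victim's deque
                steals += 1
            executed[worker].append(task)          # a stolen task is run immediately
    return executed, steals


# Iteration 1: all eight tasks start on worker 0, so the other workers must steal.
first_run, first_steals = run_iteration([[1, 2, 3, 4, 5, 6, 7, 8], [], [], []])

# Iteration 2 (assumed retentive seeding): each worker starts from what it ran last time.
second_run, second_steals = run_iteration(first_run)
```

In this toy setup the first iteration needs six steals to spread the work, while the second iteration starts balanced and needs none, which is the kind of incremental rebalancing across iterations the abstract describes.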

Proceedings Article•10.1109/IPDPSW.2012.204•

An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture

Erik Saule¹, Ümit V. Çatalyürek¹•Institutions (1)

Ohio State University¹

21 May 2012

TL;DR: This paper presents scalability results of a parallel graph coloring algorithm, three variations of a breadth-first search algorithm and a micro benchmark for irregular computations using three programming models, namely OpenMP, Cilk Plus and Intel's TBB.

Abstract: Graph algorithms are notorious for not getting good speedup on parallel architectures. These algorithms tend to suffer from irregular dependencies and a high synchronization cost that prevent an efficient execution on distributed memory machines. Hence such algorithms are mostly parallelized on shared memory machines. However, current commodity shared memory machines do not typically offer enough parallelism to process these problems. In this paper, we present an early investigation of the scalability of such algorithms on Intel's upcoming Many Integrated Core (Intel MIC) architecture which, when it is released in 2012, is expected to provide more than 50 physical cores with SMT capability. The Intel MIC architecture can be programmed through many programming models; here we investigate the three most popular of these models, namely OpenMP, Cilk Plus and Intel's TBB. We present scalability results of a parallel graph coloring algorithm, three variations of a breadth-first search algorithm and a micro benchmark for irregular computations using these three programming models. Our results on a prototype board show that the multi-threaded architecture of Intel MIC can be effectively used for hiding latencies in irregular applications to achieve almost perfect speedup.

Journal Article•10.1145/2366231.2337163•

BOOM: enabling mobile memory based low-power server DIMMs

Doe Hyun Yoon¹, Jichuan Chang¹, Naveen Muralimanohar¹, Parthasarathy Ranganathan¹•Institutions (1)

Hewlett-Packard¹

9 Jun 2012

TL;DR: Buffered Output On Module (BOOM) is proposed that buffers the data outputs from multiple ranks of low-frequency mobile DRAM devices, which in aggregation provide high bandwidth and achieve chipkill-correct or even stronger reliability.

Abstract: To address the real-time processing needs of large and growing amounts of data, modern software increasingly uses main memory as the primary data store for critical information. This trend creates a new emphasis on high-capacity, high-bandwidth, and high-reliability main memory systems. Conventional and recently-proposed server memory techniques can satisfy these requirements, but at the cost of significantly increased memory power, a key constraint for future memory systems. In this paper, we exploit the low-power nature of another high volume memory component (mobile DRAM) while improving its bandwidth and reliability shortcomings with a new DIMM architecture. We propose Buffered Output On Module (BOOM) that buffers the data outputs from multiple ranks of low-frequency mobile DRAM devices, which in aggregation provide high bandwidth and achieve chipkill-correct or even stronger reliability. Our evaluation shows that BOOM can reduce main memory power by more than 73% relative to the baseline chipkill system, while improving average performance by 5% and providing strong reliability. For memory-intensive applications, BOOM can improve performance by 30-40%.

Proceedings Article•10.1145/2259016.2259046•

Matching memory access patterns and data placement for NUMA systems

Zoltan Majo¹, Thomas R. Gross¹•Institutions (1)

ETH Zurich¹

31 Mar 2012

TL;DR: A small set of language-level primitives for memory allocation and loop scheduling is described, which eliminates mutually incompatible access patterns from OpenMP-style parallel programs.

Abstract: Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA). A mismatch between the data access patterns of programs and the mapping of data to memory incurs a high overhead, as remote accesses have higher latency and lower throughput than local accesses. This paper reports on a limit study that shows that many scientific loop-parallel programs include multiple, mutually incompatible data access patterns, therefore these programs encounter a high fraction of costly remote memory accesses. Matching the data distribution of a program to the individual data access patterns is possible, however it is difficult to find a data distribution that matches all access patterns. Directives as included in, e.g., OpenMP provide a way to distribute the computation, but the induced data partitioning does not take into account the placement of data into the processors' memory. To alleviate this problem we describe a small set of language-level primitives for memory allocation and loop scheduling. Using the primitives together with simple program-level transformations eliminates mutually incompatible access patterns from OpenMP-style parallel programs. This result represents an improvement of up to 3.3X over the default setup, and the programs obtain a speedup of up to 33.6X over single-core execution (19X on average) on a 4-processor 32-core machine.

Proceedings Article•10.1109/CCGRID.2012.141•

Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports

Jithin Jose¹, Hari Subramoni¹, Krishna Kandalla¹, Md. Wasi-ur-Rahman¹, Hao Wang¹, S. Narravula¹, Dhabaleswar K. Panda¹•Institutions (1)

Ohio State University¹

13 May 2012

TL;DR: This work introduces a hybrid transport model which takes advantage of the best features of RC and UD to deliver scalability and performance higher than that of a single-transport, and presents comprehensive performance analysis using micro benchmarks, application benchmarks and realistic industry workloads.

Abstract: Memcached is a general-purpose key-value based distributed memory object caching system. It is widely used in the data-center domain for caching results of database calls, API calls or page rendering. An efficient Memcached design is critical to achieve high transaction throughput and scalability. Previous research in the field has shown that the use of high performance interconnects like InfiniBand can dramatically improve the performance of Memcached. The Reliable Connection (RC) is the most commonly used transport model for InfiniBand implementations. However, it has been shown that RC transport imposes scalability issues due to high memory consumption per connection. Such a characteristic is not favorable for middleware like Memcached, where the server is required to serve thousands of clients. The Unreliable Datagram (UD) transport offers higher scalability, but has several other limitations, which need to be efficiently handled. In this context, we introduce a hybrid transport model which takes advantage of the best features of RC and UD to deliver scalability and performance higher than that of a single-transport. To the best of our knowledge, this is the first effort aimed at studying the impact of using a hybrid of multiple transport protocols on Memcached performance. We present comprehensive performance analysis using micro benchmarks, application benchmarks and realistic industry workloads. Our performance evaluations reveal that our Hybrid transport delivers performance comparable to that of RC, while maintaining a steady memory footprint. Memcached Get latency for 4-byte data size is 4.28µs and 4.86µs for RC and hybrid transports, respectively. This represents a factor of twelve improvement over the performance of SDP. In evaluations using Apache Olio benchmark with 1,024 clients, Memcached execution time using RC, UD and hybrid transports are 1.61, 1.96 and 1.70 seconds, respectively. Further, our scalability analysis with 4,096 client connections reveals that our proposed hybrid transport achieves good memory scalability.

Patent•

Methods and systems for processing graphs using distributed memory and set operations

Arun Kumar Jagota¹, Matthew Fuchs¹•Institutions (1)

Salesforce.com¹

11 Jan 2012

TL;DR: In this article, the authors present a distributed memory system using a key/value schema wherein each node is stored as a key and a set of neighbor nodes are stored as the corresponding value.

Abstract: Systems and methods for operating a database using distributed memory and set operations, and for evaluating graphs stored in the database. Any system may be represented as a graph structure having nodes and edges. The graph structure is stored in a distributed memory system using a key/value schema wherein each node is stored as a key and a set of neighbor nodes is stored as the corresponding value. A short path from one node to another may be determined by traversing the graph in stages, moving outward from each node in stages until common midpoint nodes are found indicating connections between the nodes. When the midpoint nodes are found, the paths connecting the nodes may be reconstructed.
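
The staged search described here resembles a bidirectional breadth-first search over a key/value adjacency store. The sketch below is one possible reading, not the patented method: a plain dict stands in for the distributed store, the graph is assumed undirected, and `expand`, `build_path`, and `short_path` are hypothetical names.

```python
def expand(store, frontier, parents):
    """Move one stage outward from a frontier, recording each new node's parent."""
    next_frontier = set()
    for node in frontier:
        # Set difference drops neighbors that this side has already reached.
        for neighbor in store.get(node, set()).difference(parents):
            parents[neighbor] = node
            next_frontier.add(neighbor)
    return next_frontier


def build_path(mid, parents_a, parents_b):
    """Join the two half-paths that meet at a midpoint node."""
    path = []
    node = mid
    while node is not None:          # walk back from the midpoint to the source
        path.append(node)
        node = parents_a[node]
    path.reverse()
    node = parents_b[mid]
    while node is not None:          # walk forward from the midpoint to the target
        path.append(node)
        node = parents_b[node]
    return path


def short_path(store, src, dst, max_stages=10):
    if src == dst:
        return [src]
    parents_a, parents_b = {src: None}, {dst: None}
    frontier_a, frontier_b = {src}, {dst}
    for _ in range(max_stages):
        for side in ("a", "b"):
            if side == "a":
                frontier_a = expand(store, frontier_a, parents_a)
            else:
                frontier_b = expand(store, frontier_b, parents_b)
            # Set intersection finds common midpoint nodes.
            midpoints = set(parents_a) & set(parents_b)
            if midpoints:
                return build_path(min(midpoints), parents_a, parents_b)
        if not frontier_a and not frontier_b:
            break                    # both sides are exhausted without meeting
    return None


# Key/value store: each node maps to the set of its neighbors.
store = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "z": set(),
}
path = short_path(store, "a", "e")
```

Each stage needs only key lookups plus set difference and intersection, which is why the neighbor-set schema fits this kind of traversal.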

Proceedings Article•10.1145/2370816.2370858•

Shared memory multiplexing: a novel way to improve GPGPU throughput

Yi Yang¹, Ping Xiang¹, Mike Mantor², Norm Rubin², Huiyang Zhou¹•Institutions (2)

North Carolina State University¹, Advanced Micro Devices²

19 Sep 2012

TL;DR: This paper proposes three software approaches to enable shared memory multiplexing and implements them using a source-to-source compiler, and proposes hardware support for shared memory multiplexing, which incurs minor hardware changes to existing hardware and enables significant performance improvements to be achieved with very little change in GPGPU code.

Abstract: On-chip shared memory (a.k.a. local data share) is a critical resource to many GPGPU applications. In current GPUs, the shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and is released when the thread block is completed. As a result, the limited capacity of shared memory becomes a bottleneck for a GPU to host a high number of thread blocks, limiting the otherwise available thread-level parallelism (TLP). In this paper, we propose software and/or hardware approaches to multiplex the shared memory among multiple thread blocks. Our proposed approaches are based on our observation that the current shared memory management reserves shared memory too conservatively, for the entire lifetime of a thread block. If the shared memory is allocated only when it is actually used and freed immediately after, more thread blocks can be hosted in an SM without increasing the shared memory capacity. We propose three software approaches to enable shared memory multiplexing and implement them using a source-to-source compiler. The experimental results show that our proposed software approaches effectively improve the throughput of many GPGPU applications on both NVIDIA GTX285 and GTX480 GPUs (an average of 1.44X on GTX285, 1.70X on GTX480 with 16kB shared memory, and 1.26X on GTX480 with 48kB shared memory). We also propose hardware support for shared memory multiplexing, which incurs minor hardware changes to existing hardware and enables significant performance improvements (an average of 1.53X) to be achieved with very little change in GPGPU code.

Proceedings Article•10.1145/2312005.2312025•

A scalable framework for heterogeneous GPU-based clusters

Fengguang Song¹, Jack Dongarra¹•Institutions (1)

University of Tennessee¹

25 Jun 2012

TL;DR: A distributed dynamic scheduling runtime system to schedule tasks, and transfer data between hybrid CPU-GPU compute nodes transparently, and is able to attain high performance on distributed-memory clusters without GPUs, and shared-system multiGPUs.

Abstract: GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance; however, there is little parallel software available that can utilize all CPU cores and all GPUs on the heterogeneous system efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), such that communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed memory GPU clusters. Our main idea is to execute a serial program and generate hybrid-size tasks, and follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks, and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework is able to attain high performance on distributed-memory clusters without GPUs, and shared-system multiGPUs.


Journal Article•10.1007/S11227-009-0270-0•

F-MPJ: scalable Java message-passing communications on parallel systems


Guillermo L. Taboada¹, Juan Touriño¹, Ramón Doallo¹•Institutions (1)

University of A Coruña¹

01 Apr 2012-The Journal of Supercomputing

TL;DR: F-MPJ significantly improves the scalability of current MPJ implementations by providing efficient non-blocking communication, taking advantage of shared memory systems and high-performance networks, and optimizing MPJ collective primitives.


Abstract: This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ improves this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries by up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance by up to seven times.


Patent•

Generation of a disparity result with low latency


John Iselin Woodfill, H. Harlyn Baker, Brian Von Herzen, Robert Dale Alkire

1 Aug 2012

TL;DR: This patent describes a reconfigurable image processing system with a toroidal topology, distributed memory, and wide bandwidth I/O, which is capable of solving real applications at real-time speeds.


Abstract: A powerful, scalable, and reconfigurable image processing system and method of processing data therein is described. This general purpose, reconfigurable engine with toroidal topology, distributed memory, and wide bandwidth I/O is capable of solving real applications at real-time speeds. The reconfigurable image processing system can be optimized to efficiently perform specialized computations, such as real-time video and audio processing. This reconfigurable image processing system provides high performance via high computational density, high memory bandwidth, and high I/O bandwidth. Generally, the reconfigurable image processing system and its control structure include a homogeneous array of 16 field programmable gate arrays (FPGA) and 16 static random access memories (SRAM) arranged in a partial torus configuration. The reconfigurable image processing system also includes a PCI bus interface chip, a clock control chip, and a datapath chip. It can be implemented in a single board. It receives data from its external environment, computes correspondence, and uses the results of the correspondence computations for various post-processing industrial applications. The reconfigurable image processing system determines correspondence by using non-parametric local transforms followed by correlation. These non-parametric local transforms include the census and rank transforms. Other embodiments involve a combination of correspondence, rectification, a left-right consistency check, and the application of an interest operator.


Other•10.1090/CONM/588/11704•

UMPa: A Multi-objective, multi-level partitioner for communication minimization


Ümit V. Çatalyürek, Mehmet Deveci, Kamer Kaya, Bora Uçar¹•Institutions (1)

Centre national de la recherche scientifique¹

1 Jan 2012

TL;DR: This work proposes a directed hypergraph model and a refinement heuristic to distribute communicating tasks among the processing units in a distributed memory setting and shows that this model improves the maximum data sent by a processor by up to 34% for parallel environments with 4, 16, 64 and 256 processing units.


Abstract: We propose a directed hypergraph model and a refinement heuristic to distribute communicating tasks among the processing units in a distributed memory setting. The aim is to achieve load balance and minimize the maximum data sent by a processing unit. We also take two other communication metrics into account with a tie-breaking scheme. With this approach, task distributions causing an excessive use of the network or a bottleneck processor which participates in almost all of the communication are avoided. We show on a large number of problem instances that our model improves the maximum data sent by a processor by up to 34% for parallel environments with 4, 16, 64 and 256 processing units compared to the state of the art, which only minimizes the total communication volume.
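The difference between total communication volume and the maximum data sent by one processing unit can be shown with a small worked example. This is only an illustrative sketch, not UMPa's hypergraph model or refinement heuristic; the function name, the data layout `(producer, size, consumers)`, and the sample values are assumptions.

```python
def send_volumes(part, items, num_parts):
    """Return the data volume sent by each part.

    part[t]  : part (processing unit) that task t is assigned to
    items    : list of (producer_task, size, consumer_tasks)
    A data item is assumed to be sent once to every other part that
    holds at least one of its consumers.
    """
    sent = [0] * num_parts
    for producer, size, consumers in items:
        src = part[producer]
        targets = {part[c] for c in consumers} - {src}
        sent[src] += size * len(targets)
    return sent


# Two items of size 5: task 0 feeds task 2, and task 1 feeds task 3.
items = [(0, 5, [2]), (1, 5, [3])]

a = send_volumes([0, 0, 1, 1], items, 2)  # [10, 0]: total 10, max 10
b = send_volumes([0, 1, 1, 0], items, 2)  # [5, 5]:  total 10, max 5
```

Both assignments place two tasks on each part and have the same total volume, yet the second halves the maximum send volume. An objective that only minimizes total volume cannot tell them apart.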


Patent•

Usage aware NUMA process scheduling


Alexander G. Gounares

19 Jun 2012

TL;DR: This patent proposes assigning processes to specific processors when memory objects consumed by the processes are located in memory banks closely associated with the processors, which may minimize the memory access time of the process.


Abstract: Processes may be assigned to specific processors when memory objects consumed by the processes are located in memory banks closely associated with the processors. When assigning processes to threads operating in a multiple processor NUMA architecture system, an analysis of the memory objects accessed by a process may identify a processor or group of processors that may minimize the memory access time of the process. The selection may take into account the connections between memory banks and processors to identify the shortest communication path between the memory objects and the process. The processes may be pre-identified as functional processes that make little or no changes to memory objects other than information passed to or from the processes.
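The selection step described in this abstract can be sketched as a cost minimization. The code below is a hedged illustration rather than the patented method: the cost model (bytes multiplied by a processor-to-bank distance), the function name, and the sample distance matrix are all assumptions.

```python
def best_processor(objects, distance):
    """Pick the processor with the lowest estimated memory access cost.

    objects  : list of (bank, size_in_bytes) for the memory objects
               the process consumes
    distance : distance[p][b] is an assumed relative cost for
               processor p to reach memory bank b
    """
    costs = [sum(size * row[bank] for bank, size in objects)
             for row in distance]
    return min(range(len(costs)), key=costs.__getitem__)


# Two processors, each with one local bank (cost 1) and one remote bank (cost 3).
distance = [[1, 3],
            [3, 1]]
# The process reads 100 bytes from bank 0 and 400 bytes from bank 1.
choice = best_processor([(0, 100), (1, 400)], distance)  # processor 1
```

With these hypothetical numbers, processor 0 would cost 1300 and processor 1 would cost 700, so the process is placed next to the bank holding most of its data.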


Proceedings Article•10.1109/MICRO.2012.35•

Addressing End-to-End Memory Access Latency in NoC-Based Multicores


Akbar Sharifi¹, Emre Kultursay¹, Mahmut Kandemir¹, Chita R. Das¹•Institutions (1)

Pennsylvania State University¹

1 Dec 2012

TL;DR: This work proposes two network prioritization schemes that cooperatively improve performance by reducing end-to-end memory access latencies in Network-on-Chip (NoC) based multicores: the first expedites memory response messages that experience above-average latency, and the second prioritizes request messages destined for idle memory banks; together they lead to uniform memory access latencies with a low average value.


Abstract: To achieve high performance in emerging multicores, it is crucial to reduce the number of memory accesses that suffer from very high latencies. However, this should be done with care as improving latency of an access can worsen the latency of another as a result of resource sharing. Therefore, the goal should be to balance latencies of memory accesses issued by an application in an execution phase, while ensuring a low average latency value. Targeting Network-on-Chip (NoC) based multicores, we propose two network prioritization schemes that can cooperatively improve performance by reducing end-to-end memory access latencies. Our first scheme prioritizes memory response messages such that, in a given period of time, messages of an application that experience higher latencies than the average message latency for that application are expedited and a more uniform memory latency pattern is achieved. Our second scheme prioritizes the request messages that are destined for idle memory banks over others, with the goal of improving bank utilization and preventing long queues from being built in front of the memory banks. These two network prioritization-based optimizations together lead to uniform memory access latencies with a low average value. Our experiments with a 4x8 mesh network-based multicore show that, when applied together, our schemes can achieve 15%, 10% and 13% performance improvement on memory intensive, memory non-intensive, and mixed multiprogrammed workloads, respectively.
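The two prioritization rules in this abstract can be expressed in a few lines. This is a software sketch of the decision logic only, not the authors' router design: the function names and tuple layouts are hypothetical, and the per-application average is computed from the given batch rather than over a hardware time window.

```python
from collections import defaultdict


def mark_late_responses(responses):
    """Scheme 1 (sketch): flag responses slower than their application's average.

    responses: list of (app_id, latency_so_far)
    Returns a list of booleans, True meaning "expedite this message".
    """
    totals = defaultdict(lambda: [0, 0])  # app_id -> [latency sum, count]
    for app, latency in responses:
        totals[app][0] += latency
        totals[app][1] += 1
    return [latency > totals[app][0] / totals[app][1]
            for app, latency in responses]


def order_requests(requests, idle_banks):
    """Scheme 2 (sketch): serve requests bound for idle banks first.

    requests: list of (request_id, bank). sorted() is stable, so arrival
    order is preserved within each priority class.
    """
    return sorted(requests, key=lambda r: r[1] not in idle_banks)


flags = mark_late_responses([("A", 10), ("A", 30), ("B", 5)])
# flags == [False, True, False]
queue = order_requests([(1, 0), (2, 1), (3, 0), (4, 2)], idle_banks={1, 2})
# queue == [(2, 1), (4, 2), (1, 0), (3, 0)]
```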


Book•

Automatic Parallelization: An Overview of Fundamental Compiler Techniques


Samuel P. Midkiff¹•Institutions (1)

Purdue University¹

1 Jan 2012

TL;DR: This book describes the fundamental principles of compiling regular numerical programs for parallelism, discusses transformations that expose parallelism to target shared memory multicore and vector processors, and covers some problems that arise when parallelizing programs for execution on distributed memory machines.


Abstract: Compiling for parallelism is a longstanding topic of compiler research. This book describes the fundamental principles of compiling "regular" numerical programs for parallelism. We begin with an explanation of analyses that allow a compiler to understand the interaction of data reads and writes in different statements and loop iterations during program execution. These analyses include dependence analysis, use-def analysis and pointer analysis. Next, we describe how the results of these analyses are used to enable transformations that make loops more amenable to parallelization, and discuss transformations that expose parallelism to target shared memory multicore and vector processors. We then discuss some problems that arise when parallelizing programs for execution on distributed memory machines. Finally, we conclude with an overview of solving Diophantine equations and suggestions for further readings in the topics of this book to enable the interested reader to delve deeper into the field. Table of Contents: Introduction and overview / Dependence analysis, dependence graphs and alias analysis / Program parallelization / Transformations to modify and eliminate dependences / Transformation of iterative and recursive constructs / Compiling for distributed memory machines / Solving Diophantine equations / A guide to further reading


Journal Article•10.1109/TCAD.2011.2160348•

Memory-Efficient On-Chip Network With Adaptive Interfaces


Masoud Daneshtalab¹, Masoumeh Ebrahimi¹, Pasi Liljeberg¹, Juha Plosila¹, Hannu Tenhunen¹•Institutions (1)

Information Technology University¹

01 Jan 2012-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: The proposed on-chip network architecture exploits a streamlined reordering mechanism to handle in-order delivery and uses the Advanced eXtensible Interface (AXI) transaction-based protocol to maintain compatibility with existing intellectual property cores.


Abstract: To achieve higher memory bandwidth in network-based multiprocessor architectures, multiple dynamic random access memories can be accessed simultaneously. In such architectures, resource utilization and latency are critical issues, and a reordering mechanism is also required to deliver the response transactions of concurrent memory accesses in order. In this paper, we present a memory-efficient on-chip network architecture to cope with these issues efficiently. Each node of the network is equipped with a novel network interface (NI) to deal with out-of-order delivery, and a priority-based router to decrease the network latency. The proposed NI exploits a streamlined reordering mechanism to handle in-order delivery and uses the Advanced eXtensible Interface (AXI) transaction-based protocol to maintain compatibility with existing intellectual property cores. To improve the memory utilization and reduce the memory latency, an optimized memory controller is integrated in the presented NI. Experimental results with synthetic test cases demonstrate that the proposed on-chip network architecture provides significant improvements in average network latency (16%), average memory access latency (19%), and average memory utilization (22%).


Book Chapter•10.1007/978-3-642-33460-3_35•

CC-MR: finding connected components in huge graphs with MapReduce


Thomas Seidl¹, Brigitte Boden¹, Sergej Fries¹•Institutions (1)

RWTH Aachen University¹

24 Sep 2012

TL;DR: The algorithm CC-MR significantly outperforms the existing approaches for the MapReduce framework in terms of the number of necessary iterations, communication costs and execution runtime, as shown in an experimental evaluation on synthetic and real-world data.


Abstract: The detection of connected components in graphs is a well-known problem arising in a large number of applications including data mining, analysis of social networks, image analysis and many other related problems. In spite of the existing very efficient serial algorithms, this problem remains a subject of research due to the increasing amounts of data produced by modern information systems, which cannot be handled by single workstations. Only highly parallelized approaches on multi-core servers or computer clusters are able to deal with these large-scale data sets. In this work we present a solution for this problem for distributed memory architectures, and provide an implementation for the well-known MapReduce framework developed by Google. Our algorithm CC-MR significantly outperforms the existing approaches for the MapReduce framework in terms of the number of necessary iterations, communication costs and execution runtime, as we show in our experimental evaluation on synthetic and real-world data. Furthermore, we present a technique for accelerating our implementation for datasets with very heterogeneous component sizes as they often appear in real data sets.
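To make the iteration count mentioned in this abstract concrete, the sketch below runs a plain minimum-label propagation in map/reduce-style rounds on a single machine. It is explicitly not the CC-MR algorithm, whose details are not given here; it is a simple baseline of the kind such work aims to improve on, and the function name and data layout are assumptions.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    Each round mimics one MapReduce job:
      map    : every edge proposes the smaller endpoint label to both ends
      reduce : every vertex keeps the minimum label proposed to it
    Returns (labels, rounds).
    """
    label = {v: v for v in vertices}
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        proposals = {}
        for u, v in edges:                      # map phase
            smallest = min(label[u], label[v])
            for w in (u, v):
                proposals[w] = min(proposals.get(w, label[w]), smallest)
        for w, smallest in proposals.items():   # reduce phase
            if smallest < label[w]:
                label[w] = smallest
                changed = True
    return label, rounds


labels, rounds = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
# labels == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

In this baseline a label moves by only one edge per round, so long chains need many rounds. That cost is why the number of necessary iterations is one of the metrics reported for MapReduce-based approaches.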


Journal Article•

Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Mode


Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky

01 Jan 2012-Cluster Computing

TL;DR: In this article, the authors consider a hybrid platform as a heterogeneous distributed-memory system and apply the approach of functional performance models, which was originally designed for uniprocessor machines.


Abstract: Transition to hybrid CPU/GPU platforms in high performance computing is challenging in the aspect of efficient utilisation of the heterogeneous hardware and existing optimised software. During recent years, scientific software has been ported to multicore and GPU architectures and now should be reused on hybrid platforms. In this paper, we model the performance of such scientific applications in order to execute them efficiently on hybrid platforms. We consider a hybrid platform as a heterogeneous distributed-memory system and apply the approach of functional performance models, which was originally designed for uniprocessor machines. The functional performance model (FPM) represents the processor speed by a function of problem size and integrates many important features characterising the performance of the architecture and the application. We demonstrate that FPMs facilitate performance evaluation of scientific applications on hybrid platforms. FPM-based data partitioning algorithms have been proved to be accurate for load balancing on heterogeneous networks of uniprocessor computers. We apply FPM-based data partitioning to balance the load between cores and GPUs in the hybrid architecture. In our experiments with parallel matrix multiplication, we couple the existing software optimised for multicores and GPUs and achieve high performance of the whole hybrid system.
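The idea of representing speed as a function of problem size and then balancing load can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' FPM-based partitioning algorithm: it assumes each device's execution time `x / speed(x)` does not decrease as `x` grows, and the function name and speed functions are hypothetical.

```python
def partition(n, speeds):
    """Split n work units so that all devices finish at about the same time.

    speeds: list of functions mapping a problem size x (>= 1) to a
            speed in units per second (assumed positive).
    """
    def load_within(speed, t):
        # Largest x in [0, n] whose execution time x / speed(x) is <= t.
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if mid / speed(mid) <= t:
                lo = mid
            else:
                hi = mid - 1
        return lo

    # Bisect on the common finishing time t.
    t_lo, t_hi = 0.0, max(n / s(n) for s in speeds)
    for _ in range(60):
        t = (t_lo + t_hi) / 2
        if sum(load_within(s, t) for s in speeds) >= n:
            t_hi = t
        else:
            t_lo = t

    parts = [load_within(s, t_hi) for s in speeds]
    excess = sum(parts) - n
    while excess > 0:                 # trim rounding surplus from the largest share
        i = parts.index(max(parts))
        parts[i] -= 1
        excess -= 1
    return parts


# A device three times faster receives three times the work.
shares = partition(100, [lambda x: 1.0, lambda x: 3.0])  # [25, 75]
```

With constant speeds this reduces to proportional splitting. A size-dependent speed function, for example one that drops once the data no longer fits in device memory, shifts the split accordingly.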


Book Chapter•10.4324/9780203052242-11•

A Distributed Memory Model of Context Effects in Word Identification


Michael E. J. Masson

12 Nov 2012

Journal Article•10.1109/TPDS.2012.19•

pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs


Changjun Wu¹, Ananth Kalyanaraman², William R. Cannon³•Institutions (3)

Xerox¹, Washington State University², Pacific Northwest National Laboratory³

01 Oct 2012-IEEE Transactions on Parallel and Distributed Systems

TL;DR: The method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation.


Abstract: Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for detecting homology on large data sets using distributed memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048 processor distributed memory cluster for a wide range of inputs ranging from as small as 20,000 sequences to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.

