Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2012
Proceedings Article•
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia1, Mosharaf Chowdhury1, Tathagata Das1, Ankur Dave1, Justin Ma1, Murphy McCauley1, Michael J. Franklin1, Scott Shenker1, Ion Stoica1 •
University of California, Berkeley1
25 Apr 2012
TL;DR: This paper presents Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Abstract: We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

4,605 citations
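
The abstract above contrasts coarse-grained transformations with fine-grained updates to shared state. The following is a toy, single-process sketch of that idea and is not Spark's API: `ToyRDD`, `from_lists`, and `lose_partition` are hypothetical names chosen for illustration. Each dataset keeps only the function that derives a partition from its parent (its lineage), so a lost in-memory partition can be recomputed rather than restored from a replica.

```python
class ToyRDD:
    """Hypothetical toy dataset: partitions are derived on demand from lineage."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: partition index -> list of records
        self._cache = {}          # partitions currently held in memory

    @classmethod
    def from_lists(cls, parts):
        # The copied lists stand in for stable storage in this sketch.
        stored = [list(p) for p in parts]
        return cls(len(stored), lambda i: list(stored[i]))

    def map(self, f):
        # Coarse-grained transformation: one function applied to every record.
        return ToyRDD(self.num_partitions,
                      lambda i: [f(x) for x in self.partition(i)])

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)])

    def partition(self, i):
        # Recompute from lineage whenever the partition is not in memory.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_lists([[1, 2, 3], [4, 5, 6]])
evens = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens.collect()
evens.lose_partition(1)      # drop one partition after it was computed
after = evens.collect()      # rebuilt from the recorded transformations
print(before, after)
```

Because the transformations are deterministic and the source data is immutable, the recomputed partition matches the lost one; that property is what makes lineage a substitute for replication in this sketch.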

Journal Article•10.1016/J.CPC.2011.10.012•
Implementing Molecular Dynamics on Hybrid High Performance Computers - Particle-Particle Particle-Mesh

W. Michael Brown1, Axel Kohlmeyer2, Steven J. Plimpton3, Arnold N. Tharrington1•
National Center for Computational Sciences1, Temple University2, Sandia National Laboratories3
01 Mar 2012-Computer Physics Communications
TL;DR: This paper presents an efficient implementation of the particle–particle particle-mesh method based on the work by Harvey and De Fabritiis, and provides a performance comparison of the same kernels compiled with both CUDA and OpenCL.

499 citations

Journal Article•10.1109/TPDS.2011.308•
A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era

J. Diaz1, Camelia Muñoz-Caro, Alfonso Niño•
Indiana University1
01 Aug 2012-IEEE Transactions on Parallel and Distributed Systems
TL;DR: A survey of the different parallel programming models and tools available today with special consideration to their suitability for high-performance computing finds that hybrid parallel programming is the current way of harnessing the capabilities of computer clusters with multi-core nodes.
Abstract: In this work, we present a survey of the different parallel programming models and tools available today with special consideration to their suitability for high-performance computing. Thus, we review the shared and distributed memory approaches, as well as the current heterogeneous parallel programming model. In addition, we analyze how the partitioned global address space (PGAS) and hybrid parallel programming models are used to combine the advantages of shared and distributed memory systems. The work is completed by considering languages with specific parallel support and the distributed programming paradigm. In all cases, we present characteristics, strengths, and weaknesses. The study shows that the availability of multi-core CPUs has given new impulse to the shared memory parallel programming approach. In addition, we find that hybrid parallel programming is the current way of harnessing the capabilities of computer clusters with multi-core nodes. On the other hand, heterogeneous programming is found to be an increasingly popular paradigm, as a consequence of the availability of multi-core CPUs+GPUs systems. The use of open industry standards like OpenMP, MPI, or OpenCL, as opposed to proprietary solutions, seems to be the way to uniformize and extend the use of parallel programming models.

268 citations

Journal Article•10.1137/110848244•
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments

Aydin Buluc, John R. Gilbert
26 Jul 2012-SIAM Journal on Scientific Computing
TL;DR: It is demonstrated that the parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case.
Abstract: Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.

265 citations
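
The abstract above states that sparse-matrix indexing can be built on SpGEMM. One standard way to do this, shown here as an assumption since the abstract does not spell out the formulation, is to write A(I, J) as the product R·A·Q, where R selects rows and Q selects columns. The sketch below is serial and uses plain `{row: {col: value}}` dictionaries; it does not reproduce the two-dimensional distribution or hypersparse kernels the abstract mentions, and all function names are hypothetical.

```python
def spgemm(a, b):
    """Row-by-row sparse product of matrices stored as {row: {col: value}}."""
    c = {}
    for i, a_row in a.items():
        acc = {}
        for k, a_ik in a_row.items():
            # Only rows of b that match a nonzero column of a contribute.
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            c[i] = acc
    return c


def row_selector(rows):
    """R with R[k, rows[k]] = 1, so (R A) stacks the requested rows of A."""
    return {k: {r: 1} for k, r in enumerate(rows)}


def col_selector(cols):
    """Q with Q[cols[k], k] = 1, so (A Q) gathers the requested columns of A."""
    q = {}
    for k, c in enumerate(cols):
        q.setdefault(c, {})[k] = 1
    return q


def sparse_index(a, rows, cols):
    """Compute A(rows, cols) using two sparse matrix products."""
    return spgemm(spgemm(row_selector(rows), a), col_selector(cols))


A = {0: {0: 1, 2: 2}, 1: {1: 3}, 2: {0: 4, 2: 5}}
B = sparse_index(A, rows=[2, 0], cols=[2, 0])
print(B)
```

Here `B[k][l]` equals `A[rows[k]][cols[l]]` wherever that entry is nonzero, so indexing inherits whatever scalability the underlying product provides.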

Proceedings Article•10.1109/INPAR.2012.6339594•
OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures

Gihan R. Mudalige1, Michael B. Giles1, Istvan Z. Reguly2, Carlo Bertolli3, Paul H. J. Kelly3 •
University of Oxford1, Pázmány Péter Catholic University2, Imperial College London3
13 May 2012
TL;DR: It is demonstrated that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
Abstract: OP2 is an “active” library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2's recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs. We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API can be easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.

115 citations

Patent•
Apparatus and method for controlling distributed memory cluster

Ki-Young Jung
14 Feb 2012
TL;DR: A distributed computing system may include a computing node cluster, a distributed memory cluster, and a controlling node. The controlling node is configured to select memory blocks of the associated data for distribution on the distributed memory cluster based on a node selection rule and memory cluster structure information.
Abstract: Provided are an apparatus and method for controlling a distributed memory cluster. A distributed computing system may include a computing node cluster, a distributed memory cluster, and a controlling node. The computing node cluster may include a plurality of computing nodes including first computing nodes that each generates associated data. The distributed memory cluster may be configured to store the associated data of the first computing nodes. The controlling node may be configured to select memory blocks of the associated data for distribution on the distributed memory cluster based on a node selection rule and memory cluster structure information, and to select second computing nodes from the computing node cluster based on a location selection rule and the memory cluster structure information.

94 citations

Proceedings Article•10.1145/2287076.2287104•
Highly scalable graph search for the Graph500 benchmark

Koji Ueno1, Toyotaro Suzumura1•
Tokyo Institute of Technology1
18 Jun 2012
TL;DR: An optimized method based on 2D partitioning and other methods such as communication compression and vertex sorting is devised to handle BFS (Breadth First Search) of a large graph with 2^36 vertices and 2^40 edges in 10.58 seconds, which corresponds to 103.9 GE/s.
Abstract: Graph500 is a new benchmark to rank supercomputers with a large-scale graph search problem. We found that the provided reference implementations are not scalable in a large distributed environment. We devised an optimized method based on 2D partitioning and other methods such as communication compression and vertex sorting. Our optimized implementation can handle BFS (Breadth First Search) of a large graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds while using 1366 nodes and 16,392 CPU cores. This performance corresponds to 103.9 GE/s. We also studied the performance characteristics of our optimized implementation and reference implementations on a large distributed memory supercomputer with a Fat-Tree-based Infiniband network.

90 citations
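
The figures quoted in the abstract above are mutually consistent when the problem sizes are read as powers of two (2^36 vertices and 2^40 edges). The short check below redoes that arithmetic; the variable names are illustrative only, and "GE/s" is taken to mean 10^9 traversed edges per second.

```python
# Problem size and timing as quoted in the abstract.
vertices = 2 ** 36        # about 68.7 billion
edges = 2 ** 40           # about 1.1 trillion
seconds = 10.58

# Throughput in giga-edges per second (assuming 1 GE = 1e9 edges).
giga_edges_per_second = edges / seconds / 1e9

print(f"{vertices / 1e9:.1f} billion vertices")
print(f"{edges / 1e12:.1f} trillion edges")
print(f"{giga_edges_per_second:.1f} GE/s")
```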

Proceedings Article•10.1145/2254064.2254115•
Dynamic synthesis for relaxed memory models

Feng Liu1, Nayden Nedev1, Nedyalko Prisadnikov2, Martin Vechev3, Eran Yahav4 •
Princeton University1, Sofia University2, ETH Zurich3, Technion – Israel Institute of Technology4
11 Jun 2012
TL;DR: This paper presents the first scalable framework for handling real-world concurrent algorithms running on relaxed architectures, and performs the first in-depth study of the interaction between fences in real-world concurrent C programs, correctness criteria such as sequential consistency and linearizability, and memory models such as TSO and PSO.
Abstract: Modern architectures implement relaxed memory models which may reorder memory operations or execute them non-atomically. Special instructions called memory fences are provided, allowing control of this behavior. To implement a concurrent algorithm for a modern architecture, the programmer is forced to manually reason about subtle relaxed behaviors and figure out ways to control these behaviors by adding fences to the program. Not only is this process time consuming and error-prone, but it has to be repeated every time the implementation is ported to a different architecture. In this paper, we present the first scalable framework for handling real-world concurrent algorithms running on relaxed architectures. Given a concurrent C program, a safety specification, and a description of the memory model, our framework tests the program on the memory model to expose violations of the specification, and synthesizes a set of necessary ordering constraints that prevent these violations. The ordering constraints are then realized as additional fences in the program. We implemented our approach in a tool called DFence based on LLVM and used it to infer fences in a number of concurrent algorithms. Using DFence, we perform the first in-depth study of the interaction between fences in real-world concurrent C programs, correctness criteria such as sequential consistency and linearizability, and memory models such as TSO and PSO, yielding many interesting observations. We believe that this is the first tool that can handle programs at the scale and complexity of a lock-free memory allocator.

90 citations

Journal Article•10.1145/2366231.2337222•
The dynamic granularity memory system

Doe Hyun Yoon1, Min Kyu Jeong2, Michael J. Sullivan2, Mattan Erez2•
Hewlett-Packard1, University of Texas at Austin2
9 Jun 2012
TL;DR: This paper presents the dynamic granularity memory system (DGMS), a memory system that avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses.
Abstract: Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.

78 citations

The Trinity Graph Engine

Bin Shao, Haixun Wang, Yatao Li
1 Mar 2012
TL;DR: Trinity is a general purpose graph engine over a distributed memory cloud, which leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance.
Abstract: Computations performed by graph algorithms are data driven, and require a high degree of random data access. Despite the great progresses made in disk technology, it still cannot provide the level of efficient random access required by graph computation. On the other hand, memory-based approaches usually do not scale due to the capacity limit of single machines. In this paper, we introduce Trinity, a general purpose graph engine over a distributed memory cloud. Through optimized memory management and network communication, Trinity supports fast graph exploration as well as efficient parallel computing. In particular, Trinity leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance. These enable Trinity to support efficient online query processing and offline analytics on large graphs with just a few commodity machines. Furthermore, Trinity provides a high level specification language called TSL for users to declare data schema and communication protocols, which brings great ease-of-use for general purpose graph management and computing. Our experiments show Trinity’s performance in both low latency graph queries as well as high throughput graph analytics on web-scale, billion-node graphs.

76 citations

Proceedings Article•10.1109/IPDPS.2012.64•
ShyLU: A Hybrid-Hybrid Solver for Multicore Platforms

Sivasankaran Rajamanickam1, Erik G. Boman1, Michael A. Heroux1•
Sandia National Laboratories1
21 May 2012
TL;DR: ShyLU is presented, a “hybrid-hybrid” solver for general sparse linear systems that is hybrid in two ways: first, it combines direct and iterative methods, and second, the solver uses two levels of parallelism via hybrid programming (MPI+threads).
Abstract: With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a “hybrid-hybrid” solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements where we compute the approximate Schur complement using a value-based dropping strategy or structure-based probing strategy. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). ShyLU is useful both in shared-memory environments and on large parallel computers with distributed memory. In the latter case, it should be used as a subdomain solver. We argue that with the increasing complexity of compute nodes, it is important to exploit multiple levels of parallelism even within a single compute node. We show the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 384 cores for a given problem size. We also study the MPI-only performance of ShyLU against a hybrid implementation and conclude that on present multicore nodes MPI-only implementation is better. However, for future multicore machines (96 or more cores) hybrid/hierarchical algorithms and implementations are important for sustained performance.
Proceedings Article•10.1145/2287076.2287103•
Work stealing and persistence-based load balancers for iterative overdecomposed applications

Jonathan Lifflander1, Sriram Krishnamoorthy2, Laxmikant V. Kale1•
University of Illinois at Urbana–Champaign1, Pacific Northwest National Laboratory2
18 Jun 2012
TL;DR: This paper presents a hierarchical persistence-based rebalancing algorithm that performs localized incremental rebalancing and an active-message-based retentive work stealing algorithm optimized for iterative applications on distributed memory machines.
Abstract: Applications often involve iterative execution of identical or slowly evolving calculations. Such applications require incremental rebalancing to improve load balance across iterations. In this paper, we consider the design and evaluation of two distinct approaches to addressing this challenge: persistence-based load balancing and work stealing. The work to be performed is overdecomposed into tasks, enabling automatic rebalancing by the middleware. We present a hierarchical persistence-based rebalancing algorithm that performs localized incremental rebalancing. We also present an active-message-based retentive work stealing algorithm optimized for iterative applications on distributed memory machines. We demonstrate low overheads and high efficiencies on the full NERSC Hopper (146,400 cores) and ALCF Intrepid systems (163,840 cores), and on up to 128,000 cores on OLCF Titan.
Proceedings Article•10.1109/IPDPSW.2012.204•
An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture

Erik Saule1, Ümit V. Çatalyürek1•
Ohio State University1
21 May 2012
TL;DR: This paper presents scalability results of a parallel graph coloring algorithm, three variations of a breadth-first search algorithm, and a micro benchmark for irregular computations using three programming models: OpenMP, Cilk Plus, and Intel's TBB.
Abstract: Graph algorithms are notorious for not getting good speedup on parallel architectures. These algorithms tend to suffer from irregular dependencies and a high synchronization cost that prevent an efficient execution on distributed memory machines. Hence such algorithms are mostly parallelized on shared memory machines. However, current commodity shared memory machines do not typically offer enough parallelism to process these problems. In this paper, we are presenting an early investigation of the scalability of such algorithms on Intel's upcoming Many Integrated Core (Intel MIC) architecture which, when it will be released in 2012, is expected to provide more than 50 physical cores with SMT capability. The Intel MIC architecture can be programmed through many programming models, here we investigate the three most popular of these models namely OpenMP, Cilk Plus and Intel's TBB. We present scalability results of a parallel graph coloring algorithm, three variations of a breadth-first search algorithm and a micro benchmark for irregular computations using these three programming models. Our results on a prototype board show that the multi-threaded architecture of Intel MIC can be effectively used for hiding latencies in irregular applications to achieve almost perfect speedup.
Journal Article•10.1145/2366231.2337163•
BOOM: enabling mobile memory based low-power server DIMMs

Doe Hyun Yoon1, Jichuan Chang1, Naveen Muralimanohar1, Parthasarathy Ranganathan1•
Hewlett-Packard1
9 Jun 2012
TL;DR: This paper proposes Buffered Output On Module (BOOM), a DIMM architecture that buffers the data outputs from multiple ranks of low-frequency mobile DRAM devices, which in aggregation provide high bandwidth and achieve chipkill-correct or even stronger reliability.
Abstract: To address the real-time processing needs of large and growing amounts of data, modern software increasingly uses main memory as the primary data store for critical information. This trend creates a new emphasis on high-capacity, high-bandwidth, and high-reliability main memory systems. Conventional and recently-proposed server memory techniques can satisfy these requirements, but at the cost of significantly increased memory power, a key constraint for future memory systems. In this paper, we exploit the low-power nature of another high volume memory component, mobile DRAM, while improving its bandwidth and reliability shortcomings with a new DIMM architecture. We propose Buffered Output On Module (BOOM) that buffers the data outputs from multiple ranks of low-frequency mobile DRAM devices, which in aggregation provide high bandwidth and achieve chipkill-correct or even stronger reliability. Our evaluation shows that BOOM can reduce main memory power by more than 73% relative to the baseline chipkill system, while improving average performance by 5% and providing strong reliability. For memory-intensive applications, BOOM can improve performance by 30-40%.
Proceedings Article•10.1145/2259016.2259046•
Matching memory access patterns and data placement for NUMA systems

Zoltan Majo1, Thomas R. Gross1•
ETH Zurich1
31 Mar 2012
TL;DR: A small set of language-level primitives for memory allocation and loop scheduling is described, which eliminates mutually incompatible access patterns from OpenMP-style parallel programs.
Abstract: Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA). A mismatch between the data access patterns of programs and the mapping of data to memory incurs a high overhead, as remote accesses have higher latency and lower throughput than local accesses. This paper reports on a limit study that shows that many scientific loop-parallel programs include multiple, mutually incompatible data access patterns, therefore these programs encounter a high fraction of costly remote memory accesses. Matching the data distribution of a program to the individual data access patterns is possible, however it is difficult to find a data distribution that matches all access patterns. Directives as included in, e.g., OpenMP provide a way to distribute the computation, but the induced data partitioning does not take into account the placement of data into the processors' memory. To alleviate this problem we describe a small set of language-level primitives for memory allocation and loop scheduling. Using the primitives together with simple program-level transformations eliminates mutually incompatible access patterns from OpenMP-style parallel programs. This result represents an improvement of up to 3.3X over the default setup, and the programs obtain a speedup of up to 33.6X over single-core execution (19X on average) on a 4-processor 32-core machine.
Proceedings Article•10.1109/CCGRID.2012.141•
Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports

Jithin Jose1, Hari Subramoni1, Krishna Kandalla1, Md. Wasi-ur-Rahman1, Hao Wang1, S. Narravula1, Dhabaleswar K. Panda1 •
Ohio State University1
13 May 2012
TL;DR: This work introduces a hybrid transport model which takes advantage of the best features of RC and UD to deliver scalability and performance higher than that of a single-transport, and presents comprehensive performance analysis using micro benchmarks, application benchmarks and realistic industry workloads.
Abstract: Memcached is a general-purpose key-value based distributed memory object caching system. It is widely used in data-center domain for caching results of database calls, API calls or page rendering. An efficient Memcached design is critical to achieve high transaction throughput and scalability. Previous research in the field has shown that the use of high performance interconnects like InfiniBand can dramatically improve the performance of Memcached. The Reliable Connection (RC) is the most commonly used transport model for InfiniBand implementations. However, it has been shown that RC transport imposes scalability issues due to high memory consumption per connection. Such a characteristic is not favorable for middleware like Memcached, where the server is required to serve thousands of clients. The Unreliable Datagram (UD) transport offers higher scalability, but has several other limitations, which need to be efficiently handled. In this context, we introduce a hybrid transport model which takes advantage of the best features of RC and UD to deliver scalability and performance higher than that of a single-transport. To the best of our knowledge, this is the first effort aimed at studying the impact of using a hybrid of multiple transport protocols on Memcached performance. We present comprehensive performance analysis using micro benchmarks, application benchmarks and realistic industry workloads. Our performance evaluations reveal that our Hybrid transport delivers performance comparable to that of RC, while maintaining a steady memory footprint. Memcached Get latency for 4-byte data size is 4.28µs and 4.86µs for RC and hybrid transports, respectively. This represents a factor of twelve improvement over the performance of SDP. In evaluations using Apache Olio benchmark with 1,024 clients, Memcached execution time using RC, UD and hybrid transports are 1.61, 1.96 and 1.70 seconds, respectively. Further, our scalability analysis with 4,096 client connections reveals that our proposed hybrid transport achieves good memory scalability.
Patent•
Methods and systems for processing graphs using distributed memory and set operations

Arun Kumar Jagota1, Matthew Fuchs1•
Salesforce.com1
11 Jan 2012
TL;DR: A graph structure is stored in a distributed memory system using a key/value schema, wherein each node is stored as a key and its set of neighbor nodes is stored as the corresponding value.
Abstract: Systems and methods for operating a database using distributed memory and set operations, and for evaluating graphs stored in the database. Any system may be represented as a graph structure having nodes and edges. The graph structure is stored in a distributed memory system using a key/value schema wherein each node is stored as a key and a set of neighbor nodes is stored as the corresponding value. A short path from one node to another may be determined by traversing the graph in stages, moving outward from each node in stages until common midpoint nodes are found indicating connections between the nodes. When the midpoint nodes are found, the paths connecting the nodes may be reconstructed.
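
The staged traversal described in the abstract above can be sketched as a two-sided search over a key/value store. This is an illustration under stated assumptions, not the patented method: a plain Python `dict` stands in for the distributed memory system, the graph is assumed undirected (neighbor sets are symmetric), and the names `short_path`, `_expand`, and `_walk` are hypothetical. The path returned is a short one through the first midpoint found, not necessarily the shortest.

```python
def _expand(store, frontier, parents):
    """Advance one stage: look up each frontier node's neighbor set by key."""
    next_frontier = set()
    for node in frontier:
        for neighbor in store.get(node, set()):
            if neighbor not in parents:
                parents[neighbor] = node      # remember how we reached it
                next_frontier.add(neighbor)
    return next_frontier


def _walk(parents, node):
    """Follow parent links from node back to the root of one search side."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def short_path(store, start, goal):
    """Search outward from both endpoints in alternating stages."""
    if start == goal:
        return [start]
    parents = [{start: None}, {goal: None}]
    frontiers = [{start}, {goal}]
    side = 0
    while frontiers[0] and frontiers[1]:
        frontiers[side] = _expand(store, frontiers[side], parents[side])
        # Set intersection finds nodes reached from both ends (midpoints).
        midpoints = parents[0].keys() & parents[1].keys()
        if midpoints:
            mid = next(iter(midpoints))
            left = _walk(parents[0], mid)[::-1]   # start ... mid
            right = _walk(parents[1], mid)[1:]    # node after mid ... goal
            return left + right
        side = 1 - side
    return None                                   # the two sides never met


# Key = node, value = set of neighbor nodes (a simple chain a-b-c-d-e).
store = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(short_path(store, "a", "e"))
```

Each stage touches only the neighbor sets of the current frontier, so in a real key/value deployment a stage would map to a batch of key lookups followed by a set intersection.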
Proceedings Article•10.1145/2370816.2370858•
Shared memory multiplexing: a novel way to improve GPGPU throughput

Yi Yang1, Ping Xiang1, Mike Mantor2, Norm Rubin2, Huiyang Zhou1 •
North Carolina State University1, Advanced Micro Devices2
19 Sep 2012
TL;DR: This paper proposes three software approaches to enable shared memory multiplexing and implements them using a source-to-source compiler. It also proposes hardware support for shared memory multiplexing, which requires only minor hardware changes and enables significant performance improvements with very little change in GPGPU code.
Abstract: On-chip shared memory (a.k.a. local data share) is a critical resource to many GPGPU applications. In current GPUs, the shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and is released when the thread block is completed. As a result, the limited capacity of shared memory becomes a bottleneck for a GPU to host a high number of thread blocks, limiting the otherwise available thread-level parallelism (TLP). In this paper, we propose software and/or hardware approaches to multiplex the shared memory among multiple thread blocks. Our proposed approaches are based on our observation that the current shared memory management reserves shared memory too conservatively, for the entire lifetime of a thread block. If the shared memory is allocated only when it is actually used and freed immediately after, more thread blocks can be hosted in an SM without increasing the shared memory capacity. We propose three software approaches to enable shared memory multiplexing and implement them using a source-to-source compiler. The experimental results show that our proposed software approaches effectively improve the throughput of many GPGPU applications on both NVIDIA GTX285 and GTX480 GPUs (an average of 1.44X on GTX285, 1.70X on GTX480 with 16kB shared memory, and 1.26X on GTX480 with 48kB shared memory). We also propose hardware support for shared memory multiplexing, which incurs minor hardware changes to existing hardware and enables significant performance improvements (an average of 1.53X) to be achieved with very little change in GPGPU code.
Proceedings Article•10.1145/2312005.2312025•
A scalable framework for heterogeneous GPU-based clusters

Fengguang Song1, Jack Dongarra1•
University of Tennessee1
25 Jun 2012
TL;DR: This paper presents a distributed dynamic scheduling runtime system that schedules tasks and transfers data between hybrid CPU-GPU compute nodes transparently. The framework is also able to attain high performance on distributed-memory clusters without GPUs and on shared-system multiGPUs.
Abstract: GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance, however, there is little parallel software available that can utilize all CPU cores and all GPUs on the heterogeneous system efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases in a much faster rate than the performance of the PCI-Express connection (or the interconnection network) such that communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed memory GPU clusters. Our main idea is to execute a serial program and generate hybrid-size tasks, and follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks, and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework is able to attain high performance on distributed-memory clusters without GPUs, and shared-system multiGPUs.
Journal Article•10.1007/S11227-009-0270-0•
F-MPJ: scalable Java message-passing communications on parallel systems

[...]

Guillermo L. Taboada1, Juan Touriño1, Ramón Doallo1•
University of A Coruña1
01 Apr 2012-The Journal of Supercomputing
TL;DR: F-MPJ significantly improves the scalability of current MPJ implementations by providing efficient non-blocking communication, taking advantage of shared memory systems and high-performance networks, and optimizing MPJ collective primitives.
Abstract: This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ improves this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries by up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance by up to seven times.
Patent•
Generation of a disparity result with low latency

[...]

John Iselin Woodfill, H. Harlyn Baker, Brian Von Herzen, Robert Dale Alkire
1 Aug 2012
TL;DR: In this paper, a reconfigurable image processing system with a toroidal topology, distributed memory, and wide bandwidth I/O is described, which is capable of solving real applications at real-time speeds.
Abstract: A powerful, scalable, and reconfigurable image processing system and method of processing data therein is described. This general purpose, reconfigurable engine with toroidal topology, distributed memory, and wide bandwidth I/O is capable of solving real applications at real-time speeds. The reconfigurable image processing system can be optimized to efficiently perform specialized computations, such as real-time video and audio processing. This reconfigurable image processing system provides high performance via high computational density, high memory bandwidth, and high I/O bandwidth. Generally, the reconfigurable image processing system and its control structure include a homogeneous array of 16 field programmable gate arrays (FPGA) and 16 static random access memories (SRAM) arranged in a partial torus configuration. The reconfigurable image processing system also includes a PCI bus interface chip, a clock control chip, and a datapath chip. It can be implemented in a single board. It receives data from its external environment, computes correspondence, and uses the results of the correspondence computations for various post-processing industrial applications. The reconfigurable image processing system determines correspondence by using non-parametric local transforms followed by correlation. These non-parametric local transforms include the census and rank transforms. Other embodiments involve a combination of correspondence, rectification, a left-right consistency check, and the application of an interest operator.
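As a rough illustration of a census transform followed by correlation, the sketch below encodes each interior pixel as a bit string recording which neighbours are darker than it, and compares two such codes by Hamming distance. The function names, the 3x3 window, the list-of-lists image layout, and the choice of Hamming distance as the matching cost are assumptions for this example; the patent's FPGA implementation is not represented.

```python
def census_transform(image, radius=1):
    """Return a census code (as an int) for every interior pixel of a 2-D list image.

    Each neighbour contributes one bit: 1 if it is darker than the centre pixel.
    Border pixels that lack a full window are left at 0.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            center = image[y][x]
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    bits = (bits << 1) | (1 if image[y + dy][x + dx] < center else 0)
            out[y][x] = bits
    return out


def hamming(a, b):
    """Number of differing bits between two census codes (lower means a better match)."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    brighter = [[v + 10 for v in row] for row in patch]
    # The code depends only on the ordering of intensities, so a uniform
    # brightness offset leaves it unchanged.
    print(census_transform(patch)[1][1] == census_transform(brighter)[1][1])
```

Because the code depends on relative ordering rather than absolute intensity, it tolerates gain and offset differences between the two cameras of a stereo pair.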
Other•10.1090/CONM/588/11704•
UMPa: A Multi-objective, multi-level partitioner for communication minimization

[...]

Ümit V. Çatalyürek, Mehmet Deveci, Kamer Kaya, Bora Uçar1•
Centre national de la recherche scientifique1
1 Jan 2012
TL;DR: This work proposes a directed hypergraph model and a refinement heuristic to distribute communicating tasks among the processing units in a distributed memory setting and shows that this model improves the maximum data sent by a processor by up to 34% for parallel environments with 4, 16, 64 and 256 processing units.
Abstract: We propose a directed hypergraph model and a refinement heuristic to distribute communicating tasks among the processing units in a distributed memory setting. The aim is to achieve load balance and minimize the maximum data sent by a processing unit. We also take two other communication metrics into account with a tie-breaking scheme. With this approach, we avoid task distributions that cause excessive network use or create a bottleneck processor that participates in almost all of the communication. We show on a large number of problem instances that our model improves the maximum data sent by a processor by up to 34% for parallel environments with 4, 16, 64 and 256 processing units compared to the state of the art, which only minimizes the total communication volume.
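The difference between minimizing total communication volume and minimizing the maximum data sent by one processor can be seen in a small sketch. It uses plain pairwise messages rather than the directed hypergraph model of the paper, and the function names and the toy task graph are assumptions for illustration only.

```python
from collections import defaultdict


def send_volumes(messages, part):
    """Return processor -> data sent, for messages that cross processor boundaries.

    messages: list of (source_task, destination_task, size)
    part:     dict mapping each task to its processor
    """
    sent = defaultdict(int)
    for src, dst, size in messages:
        if part[src] != part[dst]:
            sent[part[src]] += size
    return dict(sent)


def total_and_max(messages, part):
    """Return (total communication volume, maximum volume sent by any one processor)."""
    sent = send_volumes(messages, part)
    return sum(sent.values()), max(sent.values(), default=0)


if __name__ == "__main__":
    messages = [("a", "c", 5), ("b", "d", 5)]
    clustered = {"a": 0, "b": 0, "c": 1, "d": 2}   # both senders on processor 0
    spread = {"a": 0, "b": 1, "c": 1, "d": 2}      # senders on different processors
    print(total_and_max(messages, clustered))
    print(total_and_max(messages, spread))
```

Both placements have the same total volume, yet the first one makes a single processor send everything, which is the kind of bottleneck the maximum-send metric is meant to expose.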
Patent•
Usage aware NUMA process scheduling

[...]

Alexander G. Gounares
19 Jun 2012
TL;DR: In this paper, the authors propose to assign processes to specific processors when memory objects consumed by the processes are located in memory banks closely associated with the processors, which may minimize the memory access time of the process.
Abstract: Processes may be assigned to specific processors when memory objects consumed by the processes are located in memory banks closely associated with the processors. When assigning processes to threads operating in a multiple processor NUMA architecture system, an analysis of the memory objects accessed by a process may identify a processor or group of processors that may minimize the memory access time of the process. The selection may take into account the connections between memory banks and processors to identify the shortest communication path between the memory objects and the process. The processes may be pre-identified as functional processes that make little or no changes to memory objects other than information passed to or from the processes.
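One way to picture the selection step is a cost function over processors: for each candidate, sum the distance to the bank of every memory object the process uses, and pick the cheapest. The names `best_processor`, `accesses`, `bank_of`, and `distance`, and the use of access counts as weights, are hypothetical choices for this sketch and are not taken from the patent.

```python
def best_processor(accesses, bank_of, distance):
    """Pick the processor with the lowest weighted distance to a process's memory objects.

    accesses: dict object -> number of accesses by the process (assumed weighting)
    bank_of:  dict object -> memory bank holding that object
    distance: dict processor -> dict bank -> path cost between them
    """
    def cost(cpu):
        return sum(count * distance[cpu][bank_of[obj]] for obj, count in accesses.items())

    return min(distance, key=cost)


if __name__ == "__main__":
    distance = {
        "cpu0": {"bank0": 1, "bank1": 3},
        "cpu1": {"bank0": 3, "bank1": 1},
    }
    accesses = {"a": 10, "b": 2}
    bank_of = {"a": "bank1", "b": "bank0"}
    print(best_processor(accesses, bank_of, distance))
```

Here the heavily used object lives in `bank1`, so the processor closest to that bank wins even though the other object is further away.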
Proceedings Article•10.1109/MICRO.2012.35•
Addressing End-to-End Memory Access Latency in NoC-Based Multicores

[...]

Akbar Sharifi1, Emre Kultursay1, Mahmut Kandemir1, Chita R. Das1•
Pennsylvania State University1
1 Dec 2012
TL;DR: This work proposes two network prioritization schemes that cooperatively improve performance by reducing end-to-end memory access latencies in Network-on-Chip (NoC) based multicores: one expedites memory response messages whose latency exceeds the application's average, and the other prioritizes request messages destined for idle memory banks. Together they lead to uniform memory access latencies with a low average value.
Abstract: To achieve high performance in emerging multicores, it is crucial to reduce the number of memory accesses that suffer from very high latencies. However, this should be done with care as improving latency of an access can worsen the latency of another as a result of resource sharing. Therefore, the goal should be to balance latencies of memory accesses issued by an application in an execution phase, while ensuring a low average latency value. Targeting Network-on-Chip (NoC) based multicores, we propose two network prioritization schemes that can cooperatively improve performance by reducing end-to-end memory access latencies. Our first scheme prioritizes memory response messages such that, in a given period of time, messages of an application that experience higher latencies than the average message latency for that application are expedited and a more uniform memory latency pattern is achieved. Our second scheme prioritizes the request messages that are destined for idle memory banks over others, with the goal of improving bank utilization and preventing long queues from being built in front of the memory banks. These two network prioritization-based optimizations together lead to uniform memory access latencies with a low average value. Our experiments with a 4x8 mesh network-based multicore show that, when applied together, our schemes can achieve 15%, 10% and 13% performance improvement on memory intensive, memory non-intensive, and mixed multiprogrammed workloads, respectively.
Book•
Automatic Parallelization: An Overview of Fundamental Compiler Techniques

[...]

Samuel P. Midkiff1•
Purdue University1
1 Jan 2012
TL;DR: In this paper, the authors describe the fundamental principles of compiling regular numerical programs for parallelism and discuss transformations that expose parallelism to target shared memory multicore and vector processors, and discuss some problems that arise when parallelizing programs for execution on distributed memory machines.
Abstract: Compiling for parallelism is a longstanding topic of compiler research. This book describes the fundamental principles of compiling "regular" numerical programs for parallelism. We begin with an explanation of analyses that allow a compiler to understand the interaction of data reads and writes in different statements and loop iterations during program execution. These analyses include dependence analysis, use-def analysis and pointer analysis. Next, we describe how the results of these analyses are used to enable transformations that make loops more amenable to parallelization, and discuss transformations that expose parallelism to target shared memory multicore and vector processors. We then discuss some problems that arise when parallelizing programs for execution on distributed memory machines. Finally, we conclude with an overview of solving Diophantine equations and suggestions for further readings in the topics of this book to enable the interested reader to delve deeper into the field. Table of Contents: Introduction and overview / Dependence analysis, dependence graphs and alias analysis / Program parallelization / Transformations to modify and eliminate dependences / Transformation of iterative and recursive constructs / Compiling for distributed memory machines / Solving Diophantine equations / A guide to further reading
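The link between dependence analysis and Diophantine equations can be shown with the classic GCD test. Two array references `a[c1*i + k1]` and `a[c2*j + k2]` can touch the same element only if `c1*i - c2*j = k2 - k1` has an integer solution, which requires `gcd(c1, c2)` to divide `k2 - k1`. The function name and argument layout below are choices made for this sketch; the test ignores loop bounds, so a positive answer means only that a dependence is possible.

```python
from math import gcd


def gcd_test(a, b, c):
    """Return True if a*i + b*j = c has an integer solution (a dependence may exist).

    A False result proves independence; a True result is conservative because
    loop bounds are not considered.
    """
    g = gcd(a, b)
    if g == 0:              # both coefficients are zero: solvable only when c is zero
        return c == 0
    return c % g == 0


if __name__ == "__main__":
    # a[2*i] written, a[2*j + 1] read: 2*i - 2*j = 1 has no integer solution.
    print(gcd_test(2, -2, 1))
    # a[4*i] written, a[6*j + 2] read: 4*i - 6*j = 2 is solvable (i = 2, j = 1).
    print(gcd_test(4, -6, 2))
```

In the first case even and odd indices never meet, so the loop iterations can run in parallel without further analysis.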
Journal Article•10.1109/TCAD.2011.2160348•
Memory-Efficient On-Chip Network With Adaptive Interfaces

[...]

Masoud Daneshtalab1, Masoumeh Ebrahimi1, Pasi Liljeberg1, Juha Plosila1, Hannu Tenhunen1 •
Information Technology University1
01 Jan 2012-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
TL;DR: The proposed on-chip network architecture exploits a streamlined reordering mechanism to handle the in-order delivery and utilizes the advanced extensible interface transaction-based protocol to maintain compatibility with existing intellectual property cores.
Abstract: To achieve higher memory bandwidth in network-based multiprocessor architectures, multiple dynamic random access memories can be accessed simultaneously. In such architectures, resource utilization and latency are critical issues, and a reordering mechanism is also required to deliver the response transactions of concurrent memory accesses in order. In this paper, we present a memory-efficient on-chip network architecture to cope with these issues efficiently. Each node of the network is equipped with a novel network interface (NI) to deal with out-of-order delivery, and a priority-based router to decrease the network latency. The proposed NI exploits a streamlined reordering mechanism to handle the in-order delivery and utilizes the advanced extensible interface transaction-based protocol to maintain compatibility with existing intellectual property cores. To improve the memory utilization and reduce the memory latency, an optimized memory controller is integrated in the presented NI. Experimental results with synthetic test cases demonstrate that the proposed on-chip network architecture provides significant improvements in average network latency (16%), average memory access latency (19%), and average memory utilization (22%).
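The need for reordering can be illustrated with a generic sequence-number buffer: responses that arrive early are held until every earlier response has been delivered. This is a textbook sketch under assumed names (`ReorderBuffer`, `receive`), not the streamlined mechanism or the network interface described in the paper.

```python
class ReorderBuffer:
    """Hold out-of-order responses and release them strictly by sequence number."""

    def __init__(self):
        self.next_seq = 0      # sequence number the consumer expects next
        self.pending = {}      # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one response and return the payloads that are now deliverable in order."""
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    buffer = ReorderBuffer()
    print(buffer.receive(1, "second"))   # held back until response 0 arrives
    print(buffer.receive(0, "first"))    # releases both, in order
```

The size of `pending` in such a scheme is what a memory-efficient design would try to keep small.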
Book Chapter•10.1007/978-3-642-33460-3_35•
CC-MR: finding connected components in huge graphs with MapReduce

[...]

Thomas Seidl1, Brigitte Boden1, Sergej Fries1•
RWTH Aachen University1
24 Sep 2012
TL;DR: The algorithm CC-MR significantly outperforms the existing approaches for the MapReduce framework in terms of the number of necessary iterations, communication costs and execution runtime, as shown in an experimental evaluation on synthetic and real-world data.
Abstract: The detection of connected components in graphs is a well-known problem arising in a large number of applications including data mining, analysis of social networks, image analysis and a lot of other related problems. In spite of the existing very efficient serial algorithms, this problem remains a subject of research due to increasing data amounts produced by modern information systems which cannot be handled by single workstations. Only highly parallelized approaches on multi-core-servers or computer clusters are able to deal with these large-scale data sets. In this work we present a solution for this problem for distributed memory architectures, and provide an implementation for the well-known MapReduce framework developed by Google. Our algorithm CC-MR significantly outperforms the existing approaches for the MapReduce framework in terms of the number of necessary iterations, communication costs and execution runtime, as we show in our experimental evaluation on synthetic and real-world data. Furthermore, we present a technique for accelerating our implementation for datasets with very heterogeneous component sizes as they often appear in real data sets.
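To make the iteration cost concrete, the sketch below finds connected components with a simple minimum-label propagation written as map and reduce phases. This is explicitly not the CC-MR algorithm, whose details the abstract does not give; it is a baseline-style illustration with assumed function names, run in a single process rather than on a MapReduce cluster.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Emit (vertex, candidate_label) pairs: each vertex's own label and its neighbours' labels."""
    for v, label in labels.items():
        yield v, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]


def reduce_phase(pairs):
    """Keep the smallest candidate label seen for each vertex."""
    grouped = defaultdict(list)
    for v, label in pairs:
        grouped[v].append(label)
    return {v: min(candidates) for v, candidates in grouped.items()}


def connected_components(vertices, edges):
    """Repeat map/reduce rounds until no label changes; return (labels, rounds)."""
    labels = {v: v for v in vertices}
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        rounds += 1
        if new_labels == labels:
            return labels, rounds
        labels = new_labels


if __name__ == "__main__":
    labels, rounds = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
    print(labels, rounds)
```

In this naive scheme a label travels one hop per round, so the number of rounds grows with the longest path in a component, which shows why the number of iterations is a natural metric for comparing MapReduce approaches.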
Journal Article•
Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Mode

[...]

Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky
01 Jan 2012-Cluster Computing
TL;DR: In this article, the authors consider a hybrid platform as a heterogeneous distributed-memory system and apply the approach of functional performance models, which was originally designed for uniprocessor machines.
Abstract: Transition to hybrid CPU/GPU platforms in high performance computing is challenging in the aspect of efficient utilisation of the heterogeneous hardware and existing optimised software. During recent years, scientific software has been ported to multicore and GPU architectures and now should be reused on hybrid platforms. In this paper, we model the performance of such scientific applications in order to execute them efficiently on hybrid platforms. We consider a hybrid platform as a heterogeneous distributed-memory system and apply the approach of functional performance models, which was originally designed for uniprocessor machines. The functional performance model (FPM) represents the processor speed by a function of problem size and integrates many important features characterising the performance of the architecture and the application. We demonstrate that FPMs facilitate performance evaluation of scientific applications on hybrid platforms. FPM-based data partitioning algorithms have been proved to be accurate for load balancing on heterogeneous networks of uniprocessor computers. We apply FPM-based data partitioning to balance the load between cores and GPUs in the hybrid architecture. In our experiments with parallel matrix multiplication, we couple the existing software optimised for multicores and GPUs and achieve high performance of the whole hybrid system.
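A functional performance model treats each device's speed as a function of problem size, and a balanced partition gives every device a share that takes about the same time. The sketch below searches for that common finishing time by bisection. It is an illustration under stated assumptions, not the authors' partitioning algorithm: the function names are invented, speeds are assumed positive, execution time `x / speed(x)` is assumed to grow with `x`, and `n` is assumed to be at least 1.

```python
def max_units_within(speed, time_limit, n):
    """Largest x in [0, n] with x / speed(x) <= time_limit (time assumed to grow with x)."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid / speed(mid) <= time_limit:
            lo = mid
        else:
            hi = mid - 1
    return lo


def partition(speeds, n, iterations=60):
    """Split n >= 1 work units so that all devices finish at about the same time.

    speeds: list of functions mapping a problem size to units processed per second.
    """
    t_lo, t_hi = 0.0, max(n / s(n) for s in speeds)   # t_hi is always feasible
    for _ in range(iterations):
        t_mid = (t_lo + t_hi) / 2
        if sum(max_units_within(s, t_mid, n) for s in speeds) >= n:
            t_hi = t_mid
        else:
            t_lo = t_mid
    shares = [max_units_within(s, t_hi, n) for s in speeds]
    # Integer rounding can leave a small surplus; trim it from the largest shares.
    surplus = sum(shares) - n
    while surplus > 0:
        shares[shares.index(max(shares))] -= 1
        surplus -= 1
    return shares


if __name__ == "__main__":
    cpu = lambda x: 10.0                           # constant speed (hypothetical)
    gpu = lambda x: 40.0 if x <= 50 else 20.0      # slows once data exceeds memory (hypothetical)
    print(partition([cpu, gpu], 100))
```

With constant speeds this reduces to the familiar proportional split; the size-dependent case is where a single speed number per device would misjudge the balance.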
Book Chapter•10.4324/9780203052242-11•
A Distributed Memory Model of Context Effects in Word Identification

[...]

Michael E. J. Masson
12 Nov 2012
Journal Article•10.1109/TPDS.2012.19•
pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs

[...]

Changjun Wu1, Ananth Kalyanaraman2, William R. Cannon3•
Xerox1, Washington State University2, Pacific Northwest National Laboratory3
01 Oct 2012-IEEE Transactions on Parallel and Distributed Systems
TL;DR: The method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation.
Abstract: Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for detecting homology on large data sets using distributed memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048 processor distributed memory cluster for a wide range of inputs ranging from as small as 20,000 sequences to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.