Top 147 papers presented at Parallel Computing in 2015

Fast Greedy Algorithms in MapReduce and Streaming

Ravi Kumar¹, Benjamin Moseley², Sergei Vassilvitskii¹, Andrea Vattani³•Institutions (3)

Google¹, Washington University in St. Louis², University of California, San Diego³

8 Sep 2015

TL;DR: A powerful sampling technique that aids in parallelization of sequential algorithms and yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm.

Abstract: Greedy algorithms are practitioners’ best friends—they are intuitive, are simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. Armed with this primitive, we then adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraint problems. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints.
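For orientation, the "standard sequential greedy algorithm" that the abstract uses as its baseline can be sketched for maximum cover as follows. This is a minimal illustration only, not the paper's sampling technique or its MapReduce adaptation; the function name `greedy_max_cover` and the dictionary input format are assumptions made for this example.

```python
def greedy_max_cover(sets, k):
    """Sequential greedy baseline: pick up to k sets, each time taking the
    set that covers the most elements not yet covered.

    `sets` maps a (hypothetical) set name to a Python set of elements.
    """
    covered = set()
    chosen = []
    remaining = dict(sets)  # shallow copy so the caller's dict is untouched
    for _ in range(k):
        # The greedy choice depends on everything chosen so far, which is
        # what makes the procedure inherently sequential.
        best = max(remaining, key=lambda name: len(remaining[name] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


chosen, covered = greedy_max_cover(
    {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}, k=2
)
```

Each iteration needs the coverage produced by all earlier iterations, so a direct distributed implementation would take one round per chosen set; the abstract's claim is that sampling reduces this to a logarithmic number of rounds.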

294 citations

Journal Article•10.1016/J.PARCO.2015.03.003•

Parallel heuristics for scalable community detection

Hao Lu¹, Mahantesh Halappanavar², Ananth Kalyanaraman¹•Institutions (2)

Washington State University¹, Pacific Northwest National Laboratory²

1 Aug 2015

TL;DR: Compared to the serial Louvain implementation, the parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing absolute speedups of up to 16× using 32 threads.

Abstract: Novel parallel heuristics for community detection in large-scale graphs. Multi-threaded implementations using OpenMP. Thorough experimental evaluation of parallel heuristics on a platform with 32 cores. Community detection has become a fundamental operation in numerous graph-theoretic applications. It is used to reveal natural divisions that exist within real world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is a multi-phase, iterative heuristic for modularity optimization. Originally developed by Blondel et al. (2008), the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains (e.g., internet, citation, biological). Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing absolute speedups of up to 16× using 32 threads.

163 citations

Journal Article•10.1145/2780584•

Remote Memory Access Programming in MPI-3

Torsten Hoefler¹, James Dinan², Rajeev Thakur³, Brian W. Barrett⁴, Pavan Balaji³, William Gropp⁵, Keith D. Underwood²•Institutions (5)

ETH Zurich¹, Intel², Argonne National Laboratory³, Sandia National Laboratories⁴, University of Illinois at Urbana–Champaign⁵

29 Jun 2015

TL;DR: The new RMA interface is presented and formal axiomatic models for data consistency and access semantics are specified to help users reason about details of the semantics that are hard to extract from the English prose in the standard.

Abstract: The Message Passing Interface (MPI) 3.0 standard, introduced in September 2012, includes a significant update to the one-sided communication interface, also known as remote memory access (RMA). In particular, the interface has been extended to better support popular one-sided and global-address-space parallel programming models, to provide better access to hardware performance features, and to enable new data-access modes. We present the new RMA interface and specify formal axiomatic models for data consistency and access semantics. Such models can help users reason about details of the semantics that are hard to extract from the English prose in the standard. They also foster the development of tools and compilers, enabling them to automatically analyze, optimize, and debug RMA programs.

114 citations

Journal Article•10.1145/2686884•

Lock Cohorting: A General Technique for Designing NUMA Locks

David Dice¹, Virendra J. Marathe¹, Nir Shavit²•Institutions (2)

Oracle Corporation¹, Massachusetts Institute of Technology²

18 Feb 2015

TL;DR: This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful, and allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock.

Abstract: Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability. We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, the real-world key-value store application memcached, and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.

110 citations

Journal Article•10.1016/J.PARCO.2014.12.001•

ScaffCC: Scalable compilation and analysis of quantum programs

Ali Javadi-Abhari¹, Shruti Patil¹, Daniel Kudrow², Jeff Heckey², Alexey Lvov³, Frederic T. Chong², Margaret Martonosi¹•Institutions (3)

Princeton University¹, University of California, Santa Barbara², IBM³

1 Jun 2015

TL;DR: ScaffCC is a scalable compilation and analysis framework based on LLVM that can be used for compiling quantum computing applications at the logical level; it integrates a reversible-logic synthesis tool in the compiler to facilitate coding of quantum circuits.

Abstract: We present ScaffCC, a scalable compilation and analysis framework based on LLVM (Lattner and Adve, 2004), which can be used for compiling quantum computing applications at the logical level. Drawing upon mature compiler technologies, we discuss similarities and differences between compilation of classical and quantum programs, and adapt our methods to optimizing the compilation time and output for the quantum case. Our work also integrates a reversible-logic synthesis tool in the compiler to facilitate coding of quantum circuits. Lastly, we present some useful quantum program analysis scenarios and discuss their implications, specifically with an elaborate discussion of timing analysis for critical path estimation. Our work focuses on bridging the gap between high-level quantum algorithm specifications and low-level physical implementations, while providing good scalability to larger and more interesting problems.

104 citations

Journal Article•10.1145/2717511•

High-Performance and Scalable GPU Graph Traversal

Duane Merrill¹, Michael Garland¹, Andrew S. Grimshaw²•Institutions (2)

Nvidia¹, University of Virginia²

18 Feb 2015

TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) work complexity.

Abstract: Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
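The role of the prefix sum in frontier expansion can be illustrated with a small sequential emulation. This is a sketch under stated assumptions, not the authors' GPU implementation: the graph is assumed to be stored in CSR form (`row_offsets`, `col_indices`), and the function name `bfs_prefix_sum` is invented for this example. The exclusive prefix sum over frontier degrees gives every frontier vertex a private, non-overlapping slice of the output array, so on a GPU each vertex could write its neighbors without locks.

```python
from itertools import accumulate


def bfs_prefix_sum(row_offsets, col_indices, source):
    """Level-synchronous BFS over a CSR graph; returns hop distances (-1 if unreachable)."""
    n = len(row_offsets) - 1
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Number of outgoing edges of each frontier vertex.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # Exclusive prefix sum: offsets[i] is where vertex i starts writing.
        offsets = [0] + list(accumulate(degrees))
        gathered = [0] * offsets[-1]
        # Each outer iteration is independent and could be one parallel task.
        for i, v in enumerate(frontier):
            start = row_offsets[v]
            for j in range(degrees[i]):
                gathered[offsets[i] + j] = col_indices[start + j]
        # Filter out vertices that were already visited.
        level += 1
        next_frontier = []
        for u in gathered:
            if dist[u] == -1:
                dist[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return dist


# Undirected graph with edges 0-1, 0-2, 1-3, 2-3, 3-4 in CSR form.
row_offsets = [0, 2, 4, 6, 9, 10]
col_indices = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
distances = bfs_prefix_sum(row_offsets, col_indices, source=0)
```

Every edge is gathered at most once and every vertex enters the frontier at most once, which is where the O(|V| + |E|) work bound quoted in the abstract comes from.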

102 citations

Journal Article•10.1016/J.PARCO.2014.04.008•

A scalable parallel genetic algorithm for the Generalized Assignment Problem

Yan Liu¹, Shaowen Wang¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

1 Jul 2015

TL;DR: A scalable coarse-grained parallel genetic algorithm, PGAP, is developed for the NP-hard Generalized Assignment Problem (GAP); its asynchronous migration strategy enables efficient deme interactions and significantly improves the overlapping of computation and communication.

Abstract: A scalable parallel genetic algorithm (PGA) is developed for the NP-hard GAP problem. High scalability is achieved through a novel asynchronous migration strategy. Our algorithmic analysis resolves the buffer overflow issue in asynchronous PGAs. Strong and weak scaling tests demonstrate the superiority of our PGA approach. Our approach scales to 16,384 processors with super-linear speedups observed. Known as an effective heuristic for finding optimal or near-optimal solutions to difficult optimization problems, a genetic algorithm (GA) is inherently parallel for exploiting high performance and parallel computing resources for randomized iterative evolutionary computation. It remains a significant challenge, however, to devise parallel genetic algorithms (PGAs) that can scale to massively parallel computer architecture (also known as the mainstream supercomputer architecture), primarily because: (1) a common PGA design adopts synchronized migration, which becomes increasingly costly as more processor cores are involved in global synchronization; and (2) asynchronous PGA design and associated performance evaluation are intricate because a PGA is a type of stochastic algorithm and the amount of computation work needed to solve a problem is not simply dependent on the problem size. To address the challenge, this paper describes a scalable coarse-grained PGA, PGAP, for a well-known NP-hard optimization problem: the Generalized Assignment Problem (GAP). Specifically, an asynchronous migration strategy is developed to enable efficient deme interactions and significantly improve the overlapping of computation and communication. Buffer overflow and its relationship with migration parameters were investigated to resolve the issues of observed message buffer overflow and the loss of good solutions obtained from migration. Two algorithmic conditions were then established to detect these issues caused by communication delays and improper configuration of migration parameters and, thus, guide the dynamic tuning of PGA parameters to detect and avoid these issues. A set of computational experiments is designed to evaluate the scalability and numerical performance of PGAP. These experiments were conducted for large GAP instances on multiple supercomputers as part of the National Science Foundation Extreme Science and Engineering Discovery Environment (XSEDE). Results showed that PGAP exhibited desirable scalability by achieving low communication cost when using up to 16,384 processor cores. Near-linear and super-linear speedups on large GAP instances were obtained in strong scaling tests. Desirable scalability to both population size and the number of processor cores was observed in weak scaling tests. The design strategies applied in PGAP are applicable to general asynchronous PGA development.

81 citations

Journal Article•10.1145/2742343•

Near-Optimal Scheduling Mechanisms for Deadline-Sensitive Jobs in Large Computing Clusters

Navendu Jain¹, Ishai Menache¹, Joseph (Seffi) Naor², Jonathan Yaniv²•Institutions (2)

Microsoft¹, Technion – Israel Institute of Technology²

13 Apr 2015

TL;DR: This work considers a market-based resource allocation model for batch jobs in cloud computing clusters that incorporates the importance of the due date of a job rather than the number of servers allocated to it at any given time and constructs truthful allocation and pricing mechanisms.

Abstract: We consider a market-based resource allocation model for batch jobs in cloud computing clusters. In our model, we incorporate the importance of the due date of a job rather than the number of servers allocated to it at any given time. Each batch job is characterized by the work volume of total computing units (e.g., CPU hours) along with a bound on maximum degree of parallelism. Users specify, along with these job characteristics, their desired due date and a value for finishing the job by its deadline. Given this specification, the primary goal is to determine the scheduling of cloud computing instances under capacity constraints in order to maximize the social welfare (i.e., sum of values gained by allocated users). Our main result is a new (C/(C−k) · s/(s−1))-approximation algorithm for this objective, where C denotes cloud capacity, k is the maximal bound on parallelized execution (in practical settings, k ≪ C) and s is the slackness on the job completion time, that is, the minimal ratio between a specified deadline and the earliest finish time of a job. Our algorithm is based on utilizing dual fitting arguments over a strengthened linear program to the problem. Based on the new approximation algorithm, we construct truthful allocation and pricing mechanisms, in which reporting the true value and other properties of the job (deadline, work volume, and the parallelism bound) is a dominant strategy for all users. To that end, we extend known results for single-value settings to provide a general framework for transforming allocation algorithms into truthful mechanisms in domains of single-value and multi-properties. We then show that the basic mechanism can be extended under proper Bayesian assumptions to the objective of maximizing revenues, which is important for public clouds.
We empirically evaluate the benefits of our approach through simulations on data-center job traces, and show that the revenues obtained under our mechanism are comparable with an ideal fixed-price mechanism, which sets an on-demand price using oracle knowledge of users’ valuations. Finally, we discuss how our model can be extended to accommodate uncertainties in job work volumes, which is a practical challenge in cloud settings.

76 citations

Journal Article•10.1145/2809808•

On-the-Fly Pipeline Parallelism

I-Ting Angelina Lee¹, Charles E. Leiserson¹, Tao B. Schardl¹, Zhunping Zhang¹, Jim Sukha²•Institutions (2)

Massachusetts Institute of Technology¹, Intel²

8 Sep 2015

TL;DR: A provably efficient scheduling algorithm, the Piper algorithm, is described, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested and automatically throttles the parallelism, precluding “runaway” pipelines.

Abstract: Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this article investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with T_1 work and T_∞ span (critical-path length), Piper executes the computation on P processors in T_P ≤ T_1/P + O(T_∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as “lazy enabling” and “dependency folding.” We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P.
One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.

71 citations

Journal Article•10.1016/J.PARCO.2015.09.001•

Numerical reproducibility for the parallel reduction on multi- and many-core architectures

Sylvain Collange¹, David Defour², Stef Graillat³, Roman Iakymchuk⁴•Institutions (4)

French Institute for Research in Computer Science and Automation¹, University of Perpignan², Centre national de la recherche scientifique³, University of Paris⁴

1 Nov 2015

TL;DR: It is shown that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.

Abstract: A parallel algorithm to compute correctly-rounded floating-point sums. Highly-optimized implementations for modern CPUs, GPUs and Xeon Phi. As fast as memory bandwidth allows for large sums with moderate dynamic range. Scales well with the problem size and resources used on a cluster of compute nodes. On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and, therefore, non-reproducible mainly due to the non-associativity of floating-point operations. We introduce an approach to compute the correctly rounded sums of large floating-point vectors accurately and efficiently, achieving deterministic results by construction. Our multi-level algorithm consists of two main stages: first, a filtering stage that relies on fast vectorized floating-point expansion; second, an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, Intel Xeon Phi co-processors, and both AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
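The non-associativity that the abstract identifies as the source of non-reproducibility is easy to observe. The sketch below sums the same four numbers in two orders with a plain accumulation loop and gets two different answers, then shows that Python's standard `math.fsum` returns the same result for both orders. Note the assumption: `math.fsum` is used here only as a readily available stand-in for an order-independent sum; it is not the superaccumulator algorithm described in the paper. The helper `naive_sum` is written out explicitly because the built-in `sum` may apply compensated summation in recent Python versions.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [1e100, 1.0, -1e100, 1.0]

# Same multiset of inputs, two different reduction orders.
forward = naive_sum(values)
backward = naive_sum(reversed(values))

# An order-independent sum gives one answer regardless of order.
exact_forward = math.fsum(values)
exact_backward = math.fsum(reversed(values))
```

In a parallel reduction the order of partial sums depends on thread scheduling, so the two orders above correspond to two possible runs of the same program.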

67 citations

Journal Article•10.1016/J.PARCO.2015.04.003•

Enhancing the performance of malleable MPI applications by using performance-aware dynamic reconfiguration

Gonzalo Martín¹, David E. Singh¹, Maria-Cristina Marinescu², Jesus Carretero¹•Institutions (2)

Charles III University of Madrid¹, Barcelona Supercomputing Center²

1 Jul 2015

TL;DR: The design and implementation of Flex-MPI is described, an MPI library extension which can automatically monitor and predict the performance of applications, balance and redistribute the workload, and reconfigure the application at runtime by changing the number of processes.

Abstract: We present Flex-MPI, an MPI extension which supports malleability for MPI programs. We introduce a performance-aware dynamic reconfiguring technique for malleability. We implement a prediction model to estimate the performance of MPI programs. We present a simple API to access Flex-MPI functionalities from MPI applications. We demonstrate Flex-MPI capabilities to enhance performance via malleability. The work in this paper focuses on providing malleability to MPI applications by using a novel performance-aware dynamic reconfiguration technique. This paper describes the design and implementation of Flex-MPI, an MPI library extension which can automatically monitor and predict the performance of applications, balance and redistribute the workload, and reconfigure the application at runtime by changing the number of processes. Unlike existing approaches, our reconfiguring policy is guided by user-defined performance criteria. We focus on iterative SPMD programs, a class of applications with critical mass within the scientific community. Extensive experiments show that Flex-MPI can improve the performance, parallel efficiency, and cost-efficiency of MPI programs with minimal effort from the programmer.

Journal Article•10.1145/2742359•

The Shape of the Search Tree for the Maximum Clique Problem and the Implications for Parallel Branch and Bound

Ciaran McCreesh¹, Patrick Prosser¹•Institutions (1)

University of Glasgow¹

13 Apr 2015

TL;DR: In this paper, the authors compare two multicore thread-parallel adaptations of a state-of-the-art branch-and-bound algorithm for the maximum clique problem and provide a novel explanation as to why they are successful.

Abstract: Finding a maximum clique in a given graph is one of the fundamental NP-hard problems. We compare two multicore thread-parallel adaptations of a state-of-the-art branch-and-bound algorithm for the maximum clique problem and provide a novel explanation as to why they are successful. We show that load balance is sometimes a problem but that the interaction of parallel search order and the most likely location of solutions within the search space is often the dominating consideration. We use this explanation to propose a new low-overhead, scalable work-splitting mechanism. Our approach uses explicit early diversity to avoid strong commitment to the weakest heuristic advice and late resplitting for balance. More generally, we argue that, for branch-and-bound, parallel algorithm design should not be performed independently of the underlying sequential algorithm.

Journal Article•10.1016/J.PARCO.2015.04.004•

Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

Weifeng Liu¹, Brian Vinter¹•Institutions (1)

University of Copenhagen¹

1 Nov 2015

TL;DR: In this paper, a speculative segmented sum strategy for the CSR-based SpMV algorithm is proposed, where the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector.

Abstract: A speculative segmented sum strategy for the CSR-based SpMV. Utilizing both GPU cores and CPU cores in a heterogeneous processor. No format conversion or tuning overhead for input sparse matrices in the CSR format. High speedup over the CSR-vector algorithm running irregular matrices. No performance penalty for most regular matrices. Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores have attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.
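As background for the abstract, the sketch below shows how CSR-based SpMV decomposes into an elementwise product followed by a segmented sum, where the row pointer array delimits the segments. It covers only this baseline formulation; the speculative GPU execution and the CPU-side re-arrangement described in the abstract are not modeled. The function name `spmv_csr_segmented` and the array names are assumptions for this example.

```python
def spmv_csr_segmented(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a CSR matrix A as product + segmented sum."""
    # Step 1: one multiplication per nonzero; all are independent.
    products = [v * x[c] for v, c in zip(vals, col_idx)]
    # Step 2: segmented sum. Segment i is products[row_ptr[i]:row_ptr[i + 1]];
    # an empty row yields an empty segment and therefore 0.0.
    n_rows = len(row_ptr) - 1
    return [sum(products[row_ptr[i]:row_ptr[i + 1]], 0.0) for i in range(n_rows)]


# A = [[1, 0, 2],
#      [0, 0, 0],   <- empty row
#      [0, 3, 4]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0]
y = spmv_csr_segmented(row_ptr, col_idx, vals, [1.0, 2.0, 3.0])
```

The example deliberately includes an empty row: segment boundaries cannot be derived from the products alone, which is one reason a parallel segmented sum needs the row pointers (or a correction pass) to place partial sums in the right output entries.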

Journal Article•10.1016/J.PARCO.2015.03.004•

Spectral clustering for divide-and-conquer graph matching

Vince Lyzinski¹, Daniel L. Sussman², Donniell E. Fishkind¹, Henry Pao¹, Li Chen¹, Joshua T. Vogelstein¹, Youngser Park¹, Carey E. Priebe¹•Institutions (2)

Johns Hopkins University¹, Harvard University²

1 Aug 2015

TL;DR: A divide-and-conquer bijective graph matching algorithm was proposed in this article, which combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures.

Abstract: We present a novel divide-and-conquer bijective graph matching algorithm. The algorithm is fully parallelizable, and scales to match "big data" graphs. We demonstrate the effectiveness of the algorithm by matching DTMRI human connectomes. We present a parallelized bijective graph matching algorithm that leverages seeds and is designed to match very large graphs. Our algorithm combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures. We justify our approach by proving that modestly correlated, large stochastic block model random graphs are correctly matched utilizing very few seeds through our divide-and-conquer procedure. We also demonstrate the effectiveness of our approach in matching very large graphs in simulated and real data examples, showing up to a factor of 8 improvement in runtime with minimal sacrifice in accuracy.

Journal Article•10.1145/2742345•

Hedonic Clustering Games

Moran Feldman¹, Liane Lewin-Eytan², Joseph (Seffi) Naor¹•Institutions (2)

Technion – Israel Institute of Technology¹, Yahoo!²

21 May 2015

TL;DR: A thorough analysis of hedonic clustering games in two different models is provided, characterizing Nash equilibria and proving upper and lower bounds on the price of anarchy and price of stability.

Abstract: Clustering, the partitioning of objects with respect to a similarity measure, has been extensively studied as a global optimization problem. We investigate clustering from a game-theoretic approach, and consider the class of hedonic clustering games. Here, a self-organized clustering is obtained via decisions made by independent players, corresponding to the elements clustered. Being a hedonic setting, the utility of each player is determined by the identity of the other members of her cluster. This class of games seems to be quite robust, as it fits with rather different, yet commonly used, clustering criteria. Specifically, we investigate hedonic clustering games in two different models: fixed clustering, which subdivides into k-median and k-center, and correlation clustering. We provide a thorough analysis of these games, characterizing Nash equilibria, and proving upper and lower bounds on the price of anarchy and price of stability. For fixed clustering we focus on the existence of a Nash equilibrium, as it is a rather nontrivial issue in this setting. We study it both for general metrics and special cases, such as line and tree metrics. In the correlation clustering model, we study both minimization and maximization variants, and provide almost tight bounds on both the price of anarchy and price of stability.

Journal Article•10.1145/2686892•

Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy

Aurelien Bouteiller¹, Thomas Herault¹, George Bosilca¹, Peng Du¹, Jack Dongarra¹•Institutions (1)

University of Tennessee¹

18 Feb 2015

TL;DR: This article proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures and presents a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations.

Abstract: Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This article proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases inversely to the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors. Applicability to tolerate multiple failures and accuracy after multiple recoveries are also considered.
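The basic checksum idea that ABFT schemes build on can be shown on a tiny matrix. The sketch below appends one checksum row holding the column sums and uses it to rebuild a single lost row. It is a generic illustration under simplifying assumptions (one failure, integer data, checksum stored safely), not the hybrid algorithm of the article, which also has to keep checksums valid while the factorization updates the matrix. The function names are invented for this example.

```python
def add_checksum_row(matrix):
    """Return a copy of `matrix` with an extra last row of column sums."""
    checksum = [sum(col) for col in zip(*matrix)]
    return [row[:] for row in matrix] + [checksum]


def recover_row(encoded, lost):
    """Rebuild data row `lost` from the surviving rows and the checksum row."""
    checksum = encoded[-1]
    n_data_rows = len(encoded) - 1
    recovered = []
    for j in range(len(checksum)):
        # Sum of the surviving entries in column j.
        others = sum(encoded[i][j] for i in range(n_data_rows) if i != lost)
        recovered.append(checksum[j] - others)
    return recovered


encoded = add_checksum_row([[2, 1], [4, 3], [6, 5]])
encoded[1] = [None, None]          # simulate a fail-stop loss of row 1
restored = recover_row(encoded, lost=1)
```

With floating-point data the recovered values carry rounding error from the subtraction, which relates to the accuracy-after-recovery question the abstract mentions.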

Journal Article•10.1145/2779052•

Parallel Scheduling of Task Trees with Limited Memory

Lionel Eyraud-Dubois¹, Loris Marchal², Oliver Sinnen³, Frédéric Vivien²•Institutions (3)

University of Bordeaux¹, University of Lyon², University of Auckland³

29 Jun 2015

TL;DR: This article studies the execution of tree-shaped task graphs on multiple processors and designs a series of practical heuristics that achieve different trade-offs between minimizing peak memory usage and minimizing makespan.

Abstract: This article investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and data can only be removed from memory after the completion of the task that uses it as input. Such trees arise in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem has been well studied, and optimal polynomial algorithms have been proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective of minimizing the time needed to traverse the tree, that is, minimizing the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The heuristics are compared in an extensive experimental evaluation using realistic trees.

Proceedings Article•

Simulation of external aerodynamics of the DrivAer model with the LBM on GPGPUs.

Andrea Pasquali¹, Martin Schönherr¹, Martin Geier¹, Manfred Krafczyk¹•Institutions (1)

Braunschweig University of Technology¹

1 Jan 2015

Proceedings Article•10.3233/978-1-61499-621-7-489•

Static and Dynamic Big Data Partitioning on Apache Spark

Massimiliano Bertolucci¹, Emanuele Carlini², Patrizio Dazzi², Alessandro Lulli¹, Laura Ricci³•Institutions (3)

University of Pisa¹, Istituto di Scienza e Tecnologie dell'Informazione², University of Chieti-Pescara³

1 Jan 2015

TL;DR: This paper implements different graph algorithms and compares their performance using a naive partitioning solution against more elaborate strategies, both static and dynamic.

Abstract: Many of today’s large datasets are organized as a graph. Due to their size, it is often infeasible to process these graphs using a single machine. Therefore, many software frameworks and tools have been proposed to process graphs on top of distributed infrastructures. This software is often bundled with generic data decomposition strategies that are not optimised for specific algorithms. In this paper we study how a specific data partitioning strategy affects the performance of graph algorithms executing on Apache Spark. To this end, we implemented different graph algorithms and compared their performance using a naive partitioning solution against more elaborate strategies, both static and dynamic.

Journal Article•10.1016/J.PARCO.2015.02.005•

Evaluation of connected-component labeling algorithms for distributed-memory systems

Jeremy Iverson¹, Chandrika Kamath¹, George Karypis²•Institutions (2)

Lawrence Livermore National Laboratory¹, University of Minnesota²

1 May 2015

TL;DR: Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms; it uses significantly less memory than two of them and is of the same order in terms of memory as the other two.

Abstract: Introduces a graph contraction based distributed-memory connected component algorithm. Four alternative distributed-memory connected component algorithms are presented. Theoretical and experimental analysis is presented for the five algorithms. Classes of problems under which the algorithms are most applicable are identified. Novel algorithm shows better scalability across the range of scientific computing graphs used herein. Connected component labeling is a key step in a wide range of applications, such as community detection in social networks and coherent structure identification in massively-parallel scientific simulations. There have been several distributed-memory connected component algorithms described in the literature; however, little has been done regarding their scalability analysis. Theoretical and experimental results are presented for five algorithms: three that are direct implementations of previous approaches, one that is an implementation of a previous approach that is optimized to reduce communication, and one that is a novel approach based on graph contraction. Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms. Furthermore, it uses significantly less memory than two of the alternative methods and is of the same order in terms of memory as the other two.

Journal Article•10.1016/J.PARCO.2015.03.008•

A direct tridiagonal solver based on Givens rotations for GPU architectures

Ioannis E. Venetis¹, Alexandros Kouris¹, Alexandros Sobczyk¹, Efstratios Gallopoulos¹, Ahmed H. Sameh²•Institutions (2)

University of Patras¹, Purdue University²

1 Nov 2015

TL;DR: A parallel solver for general tridiagonal irreducible systems and its CUDA implementation are described, indicating that g-Spike is competitive in runtime with existing GPU methods, and can provide acceptable results when other methods cannot be applied or fail.

Abstract: A parallel solver for general tridiagonal irreducible systems is described. Solver based on Spike framework and Givens-QR with occasional low-rank modification. Modifications handle singularities exposed by QR in blocks of the parallel partition. The GPU implementation has similar performance to existing methods. Method returns accurate results when current GPU tridiagonal solvers fail. g-Spike, a parallel algorithm for solving general nonsymmetric tridiagonal systems for the GPU, and its CUDA implementation are described. The solver is based on the Spike framework, applying Givens rotations and QR factorization without pivoting. It also implements a low-rank modification strategy to compute the Spike DS decomposition even when the partitioning defines singular submatrices along the diagonal. The method is also used to solve the reduced system resulting from the Spike partitioning. Numerical experiments with problems of high order indicate that g-Spike is competitive in runtime with existing GPU methods, and can provide acceptable results when other methods cannot be applied or fail.
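
To make the Givens-QR building block concrete, here is a small sequential sketch of solving one tridiagonal system with Givens rotations and back substitution. It is not g-Spike: there is no Spike partitioning, no GPU code, and no low-rank modification for singular blocks, and the function names and band layout (`lower`, `diag`, `upper`) are assumptions made for illustration.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r


def solve_tridiagonal_givens(lower, diag, upper, rhs):
    """Solve A x = rhs, where A has sub-diagonal `lower` (A[i+1][i]),
    main diagonal `diag`, and super-diagonal `upper` (A[i][i+1])."""
    n = len(diag)
    d = [float(v) for v in diag]             # main diagonal of R
    u1 = [float(v) for v in upper] + [0.0]   # first super-diagonal of R
    u2 = [0.0] * n                           # second super-diagonal (fill-in)
    y = [float(v) for v in rhs]

    for i in range(n - 1):
        # Rotate rows i and i+1 to eliminate the sub-diagonal entry A[i+1][i].
        c, s = givens(d[i], lower[i])
        top = (d[i], u1[i], u2[i])               # row i at columns i, i+1, i+2
        bottom = (lower[i], d[i + 1], u1[i + 1])  # row i+1 at the same columns
        d[i] = c * top[0] + s * bottom[0]
        u1[i] = c * top[1] + s * bottom[1]
        u2[i] = c * top[2] + s * bottom[2]
        d[i + 1] = -s * top[1] + c * bottom[1]
        u1[i + 1] = -s * top[2] + c * bottom[2]
        y[i], y[i + 1] = c * y[i] + s * y[i + 1], -s * y[i] + c * y[i + 1]

    # Back substitution on the upper triangular factor (three bands wide).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = y[i]
        if i + 1 < n:
            acc -= u1[i] * x[i + 1]
        if i + 2 < n:
            acc -= u2[i] * x[i + 2]
        x[i] = acc / d[i]  # fails if A is singular; the sketch has no fallback
    return x


x_a = solve_tridiagonal_givens([1, 1], [2, 2, 2], [1, 1], [4, 8, 8])
# A zero leading diagonal entry needs no pivoting with rotations.
x_b = solve_tridiagonal_givens([1], [0, 0], [1], [2, 3])
```

The second call shows why rotations are attractive here: a zero on the diagonal would break elimination without pivoting, whereas the rotation simply swaps the roles of the two rows.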

Journal Article•10.1016/J.PARCO.2015.01.004•

A cost-optimal parallel algorithm for the 0-1 knapsack problem and its performance on multicore CPU and GPU implementations

Kenli Li, Jing Liu, Lanjun Wan, Shu Yin, Keqin Li¹•Institutions (1)

State University of New York System¹

1 Mar 2015

TL;DR: The experimental results show that COPA can significantly reduce execution time, achieving speedups of up to 10.26 on multicore CPU implementations and 17.53 on GPU implementations when the sequential dynamic programming algorithm for KP01 is considered as a baseline.

Abstract: A parallel algorithm COPA is proposed for the 0-1 knapsack problem. There are no memory conflicts for the proposed algorithm. The proposed algorithm has better scalability. The proposed algorithm can achieve good results on general GPU. The 0-1 knapsack problem has been extensively studied in the past years due to its immediate applications in industry and financial management, such as cargo loading, stock cutting, and budget control. Many algorithms have been proposed to solve this problem, most of which are heuristic, as the problem is well known to be NP-hard. Only a few optimal algorithms have been designed to solve this problem, and they have high time complexity. This paper proposes the cost-optimal parallel algorithm (COPA) on an EREW PRAM model with shared memory to solve this problem. COPA is scalable and yields optimal solutions while consuming less computational time. Furthermore, this paper implements COPA in two scenarios: multicore CPU based architectures using OpenMP and GPU based configurations using CUDA. A series of experiments are conducted to examine the performance of COPA on the two test platforms. The experimental results show that COPA can significantly reduce execution time. Our approach achieves speedups of up to 10.26 on multicore CPU implementations and 17.53 on GPU implementations when the sequential dynamic programming algorithm for KP01 is considered as a baseline. Notably, the GPU implementations stand out in the experimental results.
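
For orientation, the sequential dynamic programming baseline for KP01 can be sketched as below. This is the textbook recurrence, not COPA; the function name `knapsack_01` and the two-row layout are illustrative assumptions, and weights are assumed to be non-negative integers.

```python
def knapsack_01(values, weights, capacity):
    """Best total value of a subset of items whose total weight fits in `capacity`."""
    prev = [0] * (capacity + 1)  # prev[c]: best value with the items seen so far
    for value, weight in zip(values, weights):
        # Each entry of the new row reads only from the previous row, so the
        # entries of one row do not depend on each other.
        cur = [
            prev[c] if c < weight else max(prev[c], prev[c - weight] + value)
            for c in range(capacity + 1)
        ]
        prev = cur
    return prev[capacity]


best = knapsack_01([60, 100, 120], [10, 20, 30], 50)
```

Writing each row from a read-only previous row is one common way to expose the per-row independence that a parallel implementation can exploit; whether COPA organizes its work this way is not stated in the abstract.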

Journal Article•10.1145/2809810•

SybilCast: Broadcast on the Open Airwaves

Seth Gilbert¹, Chaodong Zheng¹•Institutions (1)

National University of Singapore¹

8 Sep 2015

TL;DR: SybilCast is a protocol for multichannel wireless networks that limits the number of fake identities and, in doing so, ensures that each honest user gets at least a constant fraction of his or her fair share of the bandwidth.

Abstract: Consider a scenario where many wireless users are attempting to download data from a single base station. While most of the users are honest, some users may be malicious and attempt to obtain more than their fair share of the bandwidth. One possible strategy for attacking the system is to simulate multiple fake identities, each of which is given its own equal share of the bandwidth. Such an attack is often referred to as a sybil attack. To counter such behavior, we propose SybilCast, a protocol for multichannel wireless networks that limits the number of fake identities and, in doing so, ensures that each honest user gets at least a constant fraction of his or her fair share of the bandwidth. As a result, each honest user can complete his or her data download in asymptotically optimal time. A key aspect of this protocol is balancing the rate at which new identities are admitted and the maximum number of fake identities that can coexist while keeping the overhead low. Besides sybil attacks, our protocol can also tolerate spoofing and jamming.

Proceedings Article•10.3233/978-1-61499-621-7-431•

Flexible and Generic Workflow Management

Sebastian Lührs, Daniel Rohe, Alexander Schnurpfeil, Kay Thust, Wolfgang Frings¹•Institutions (1)

Forschungszentrum Jülich¹

1 Jan 2015

Journal Article•10.1145/2742341•

Linear and Competitive Strategies for Continuous Robot Formation Problems

Bastian Degener¹, Barbara Kempkes¹, Peter Kling¹, Friedhelm Meyer auf der Heide¹•Institutions (1)

University of Paderborn¹

21 May 2015

TL;DR: This work focuses on a continuous time model, where n mobile robots continuously sense the positions of other robots within their viewing range and continuously adapt their speed and direction according to some simple, local rules, and shows that this idealized continuous sensing allows formation problems to be solved in linear time O(n), which, given the maximum speed of one, immediately yields a maximum traveled distance of O(n).

Abstract: We study a scenario in which n mobile robots with a limited viewing range are distributed in the Euclidean plane and have to solve a formation problem. The formation problems we consider are the Gathering problem and the Chain-Formation problem. In the Gathering problem, the robots have to gather in one (not predefined) point, while in the Chain-Formation problem they have to form a connected communication chain of minimal length between two stationary base stations. Each robot may base its decisions where to move only on the current relative positions of neighboring robots (that are within its viewing range); that is, besides having a limited viewing range, the robots are oblivious (they do not use information from the past), have none or only very limited identities, and they do not have a common sense of direction. Variants of these problems (especially for the Gathering problem) have been studied extensively in different discrete time models. In contrast, our work focuses on a continuous time model; that is, the robots continuously sense the positions of other robots within their viewing range and continuously adapt their speed and direction according to some simple, local rules. Hereby, we assume that the robots have a maximum movement speed of one. We show that this idealized idea of continuous sensing allows us to solve the mentioned formation problems in linear time O(n) (which, given the maximum speed of one, immediately yields a maximum traveled distance of O(n)). Note that in the more classical discrete time models, the best known strategies need at least Θ(n²) or even Θ(n² log n) timesteps to solve these problems. For the Gathering problem, our analysis solves a problem left open by Gordon et al. [2004], where the authors could prove that gathering in a continuous model is possible in finite time, but were not able to give runtime bounds.
Apart from these linear bounds, we also provide runtime bounds for both formation problems that relate the runtime of our strategies to the runtime of an optimal, global algorithm. Specifically, we show that our strategy for the Gathering problem is log OPT-competitive and the strategy for the Chain-Formation problem is log n-competitive. Here, by c-competitive, we mean that our (local) strategy is asymptotically by at most a factor of c slower than an optimal, global strategy.

Journal Article•10.1016/J.PARCO.2015.01.005•

Communication-aware process and thread mapping using online communication detection

Matthias Diener¹, Eduardo H. M. Cruz², Philippe O. A. Navaux², Anselm Busse¹, Hans-Ulrich Heiß¹•Institutions (2)

Technical University of Berlin¹, Universidade Federal do Rio Grande do Sul²

1 Mar 2015

TL;DR: This work proposes CDSM (Communication Detection in Shared Memory), a mechanism that detects communication from page faults and uses this information to perform a communication-aware mapping of parallel applications to the hardware topology, improving their performance and energy efficiency.

Abstract: We perform online detection of inter-process and inter-thread communication. The detected communication pattern is used to migrate processes and threads. Operating System-based mechanism, no changes to applications or runtime libraries. We reduce execution time and energy consumption. Evaluation on shared memory machines and a cluster shows substantial improvements. The rising complexity of memory hierarchies and interconnections in parallel shared memory architectures leads to differences in the communication performance. These differences can be exploited to perform a communication-aware mapping of parallel applications to the hardware topology, improving their performance and energy efficiency. To perform the mapping, it is necessary to determine the communication behavior of the processes and threads of the application. Previous methods rely on static communication traces to detect communication, require hardware changes or support only a subset of parallelization models. We propose CDSM, Communication Detection in Shared Memory, a mechanism that detects communication from page faults and uses this information to perform the mapping. CDSM works on the operating system level during the execution of the parallel application and supports all parallelization models that use shared memory for communication. It does not require modifications to the applications, previous knowledge about their behavior, or changes to the hardware and runtime libraries. Experiments with the MPI, MPI+OpenMP and OpenMP implementations of the NAS parallel benchmarks, the HPCC benchmark and the PARSEC benchmark suite on a shared memory machine show that CDSM has a high detection accuracy with a negligible overhead. Execution time and processor energy consumption were reduced by up to 35.9% and 18.9%, respectively (10.2% and 7.3%, on average). Experiments on a cluster system, where CDSM optimizes the communication within each node, showed an average execution time reduction of 10.4%.

Journal Article•10.1016/J.PARCO.2015.09.004•

On the scalability of inexact balancing domain decomposition by constraints with overlapped coarse/fine corrections

Santiago Badia¹, Alberto F. Martín¹, Javier Principe¹•Institutions (1)

Polytechnic University of Catalonia¹

1 Dec 2015

TL;DR: This work analyzes the scalability of inexact two-level balancing domain decomposition by constraints preconditioners for Krylov subspace iterative solvers using a highly scalable asynchronous parallel implementation where fine and coarse correction computations are overlapped in time.

Abstract: In this work, we analyze the scalability of inexact two-level balancing domain decomposition by constraints (BDDC) preconditioners for Krylov subspace iterative solvers, when using a highly scalable asynchronous parallel implementation where fine and coarse correction computations are overlapped in time. This way, the coarse-grid problem can be fully overlapped by fine-grid computations (which are embarrassingly parallel) in a wide range of cases. Further, we consider inexact solvers to reduce the computational cost/complexity and memory consumption of coarse and local problems and boost the scalability of the solver. From our numerical experiments, we conclude that the BDDC preconditioner is quite insensitive to inexact solvers. In particular, one cycle of algebraic multigrid (AMG) is enough to attain algorithmic scalability. Further, the clear reduction of computing time and memory requirements of inexact solvers compared to sparse direct ones makes it possible to scale far beyond state-of-the-art BDDC implementations. Excellent weak scalability results have been obtained with the proposed inexact/overlapped implementation of the two-level BDDC preconditioner, up to 93,312 cores and 20 billion unknowns on JUQUEEN. Further, we have also applied the proposed setting to unstructured meshes and partitions for the pressure Poisson solver in the backward-facing step benchmark domain.

Journal Article•10.1145/2687001•

Power Management of Extreme-Scale Networks with On/Off Links in Runtime Systems

Ehsan Totoni¹, Nikhil Jain¹, Laxmikant V. Kale¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

18 Feb 2015

TL;DR: This work proposes a runtime system based adaptive approach to turn off unused links, which has various advantages over the previously proposed hardware and compiler based approaches.

Abstract: Networks are among the major power consumers in large-scale parallel systems. During execution of common parallel applications, a sizeable fraction of the links in the high-radix interconnects are either never used or are underutilized. We propose a runtime system based adaptive approach to turn off unused links, which has various advantages over the previously proposed hardware and compiler based approaches. We discuss why the runtime system is the best system component to accomplish this task, and test the effectiveness of our approach using real applications (including NAMD, MILC), and application benchmarks (including NAS Parallel Benchmarks, Stencil). These codes are simulated on representative topologies such as 6-D Torus and multilevel directly connected network (similar to IBM PERCS in Power 775 and Dragonfly in Cray Aries). For common applications with a near-neighbor communication pattern, our approach can save up to 20% of the total machine's power and energy, without any performance penalty.

Journal Article•10.1016/J.PARCO.2014.12.002•

Extending the BSP model for multi-core and out-of-core computing

Alexandros V. Gerbessiotis¹•Institutions (1)

New Jersey Institute of Technology¹

1 Jan 2015

TL;DR: An extension of the bulk-synchronous parallel (BSP) model to abstract and model parallelism in the presence of multiple memory hierarchies and multiple cores is presented; the new model is called MBSP, for multi-memory BSP.

Abstract: Introduction of MBSP, a new model of multi-core and out-of-core computing. Comparison of MBSP to other models of computation, multi-core or out-of-core. Justification of design choices in selecting the parameter set of MBSP. Analysis of a parallel deterministic regular oversampling sorting algorithm. Design and analysis of a sorting algorithm under MBSP. We present an extension of the bulk-synchronous parallel (BSP) model to abstract and model parallelism in the presence of multiple memory hierarchies and multiple cores. We call the new model MBSP for multi-memory BSP. The BSP model has been used to model internal memory parallel computers; MBSP retains the properties of BSP and in addition can abstract not only traditional external memory-supported parallelism (e.g. that uses another level of slower memory) but also multi-level cache-based memory hierarchies such as those present in multi-core systems. Present day multi-core systems are limited parallelism architectures with fast inter-core communication but limited fast memory availability. Abstracting the programming requirements of such architectures in a useful and usable manner is the objective of introducing MBSP. We propose multi-core program and algorithm design that measures resource utilization through a septuplet (p, l, g, m, L, G, M) in which (p, l, g) are the BSP parameters for modeling processor component size and interprocessor communication through latency-based and throughput-based cost mechanisms, and (m, L, G, M) are the new parameters that abstract additional memory hierarchies. Each processor component is attached to a memory of size M, and there are also m memory-units accessing a slower memory of unlimited size with latency-based and throughput-based cost (L, G). A deterministic sorting algorithm is described on this model that is potentially both usable and useful.

Journal Article•10.1016/J.PARCO.2015.02.004•

Parallel color-coding

George M. Slota¹, Kamesh Madduri¹•Institutions (1)

Pennsylvania State University¹

1 Aug 2015

TL;DR: By efficiently parallelizing steps in color-coding, this work creates two new biological protein interaction network analysis tools: Fascia for subgraph counting and motif finding and FastPath for signaling pathway detection.

Abstract: New shared- and distributed-memory parallelizations of the color-coding method. Bioinformatics applications: motif detection, signaling pathway discovery in protein interaction networks. Parallel performance analysis of optimized color-coding implementations. We present new parallelization and memory-reducing strategies for the graph-theoretic color-coding approximation technique, with applications to biological network analysis. Color-coding is a technique that gives fixed parameter tractable algorithms for several well-known NP-hard optimization problems. In this work, by efficiently parallelizing steps in color-coding, we create two new biological protein interaction network analysis tools: Fascia for subgraph counting and motif finding and FastPath for signaling pathway detection. We demonstrate considerable speedup over prior work, and the optimizations introduced in this paper can also be used for other problems where color-coding is applicable.
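
As background, the basic sequential color-coding idea can be sketched for one simple task: deciding whether a graph contains a simple path on k vertices. This is a generic illustration, not the Fascia or FastPath implementation; the function names, the adjacency-list input, and the trial count are assumptions.

```python
import random


def has_colorful_path(adj, colors, k):
    """True if some path on k vertices uses k pairwise distinct colors."""
    n = len(adj)
    # paths[v]: color sets (as bitmasks) of colorful paths that end at v
    paths = [{1 << colors[v]} for v in range(n)]
    for _ in range(k - 1):
        longer = [set() for _ in range(n)]
        for u in range(n):
            for mask in paths[u]:
                for v in adj[u]:
                    bit = 1 << colors[v]
                    if not mask & bit:  # extend only with an unused color
                        longer[v].add(mask | bit)
        paths = longer
    return any(paths)


def has_simple_path(adj, k, trials=200, seed=0):
    """Randomized test for a simple path on k vertices. A True answer is
    always correct; a False answer can be wrong with small probability."""
    rng = random.Random(seed)
    for _ in range(trials):
        colors = [rng.randrange(k) for _ in adj]
        if has_colorful_path(adj, colors, k):
            return True
    return False


path_graph = [[1], [0, 2], [1, 3], [2]]   # 0 - 1 - 2 - 3
star_graph = [[1, 2, 3], [0], [0], [0]]   # center 0 with three leaves
```

A path whose vertices all have different colors cannot repeat a vertex, which is why the dynamic program only needs to track color sets rather than vertex sets. The independent random trials are also a natural unit of parallel work, although the abstract does not say how the authors distribute them.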
