TL;DR: A powerful sampling technique that aids in parallelization of sequential algorithms and yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm.
Abstract: Greedy algorithms are practitioners’ best friends—they are intuitive, are simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. Armed with this primitive, we then adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraint problems. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints.
TL;DR: Compared to the serial Louvain implementation, the parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing absolute speedups of up to 16 ?
Abstract: Novel parallel heuristics for community detection in large-scale graphs.Multi-threaded implementations using OpenMP.Thorough experimental evaluation of parallel heuristics on a platform with 32 cores. Community detection has become a fundamental operation in numerous graph-theoretic applications. It is used to reveal natural divisions that exist within real world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is a multi-phase, iterative heuristic for modularity optimization. Originally developed by Blondel et al. (2008), the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains (e.g., internet, citation, biological). Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing absolute speedups of up to 16 ? using 32 threads.
TL;DR: The new RMA interface is presented and formal axiomatic models for data consistency and access semantics are specified to help users reason about details of the semantics that are hard to extract from the English prose in the standard.
Abstract: The Message Passing Interface (MPI) 3.0 standard, introduced in September 2012, includes a significant update to the one-sided communication interface, also known as remote memory access (RMA). In particular, the interface has been extended to better support popular one-sided and global-address-space parallel programming models to provide better access to hardware performance features and enable new data-access modes. We present the new RMA interface and specify formal axiomatic models for data consistency and access semantics. Such models can help users reason about details of the semantics that are hard to extract from the English prose in the standard. It also fosters the development of tools and compilers, enabling them to automatically analyze, optimize, and debug RMA programs.
TL;DR: This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful, and allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUma-aware spin-locks.
Abstract: Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes,into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability. We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real world key-value store application memcached, as well as the libc memory allocator. Our results demonstrate that cohort locks perform as well or better than known locks when the load is low and significantly out-perform them as the load increases.
TL;DR: ScaffCC as mentioned in this paper is a scalable compilation and analysis framework based on LLVM which can be used for compiling quantum computing applications at the logical level, and integrates a reversible-logic synthesis tool in the compiler to facilitate coding of quantum circuits.
Abstract: We present ScaffCC, a scalable compilation and analysis framework based on LLVM (Lattner and Adve, 2004), which can be used for compiling quantum computing applications at the logical level. Drawing upon mature compiler technologies, we discuss similarities and differences between compilation of classical and quantum programs, and adapt our methods to optimizing the compilation time and output for the quantum case. Our work also integrates a reversible-logic synthesis tool in the compiler to facilitate coding of quantum circuits. Lastly, we present some useful quantum program analysis scenarios and discuss their implications, specifically with an elaborate discussion of timing analysis for critical path estimation. Our work focuses on bridging the gap between high-level quantum algorithm specifications and low-level physical implementations, while providing good scalability to larger and more interesting problems.
TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) gd work complexity.
Abstract: Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) gd work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
TL;DR: A scalable coarse-grained PGA-PGAP is developed to enable efficient deme interactions and significantly improve the overlapping of computation and communication for NP-hard optimization problem: Generalized Assignment Problem (GAP).
Abstract: A scalable parallel genetic algorithm (PGA) is developed for the NP-hard GAP problem.High scalability is achieved through a novel asynchronous migration strategy.Our algorithmic analysis resolves the buffer overflow issue in asynchronous PGAs.Strong and weak scaling tests demonstrate the superiority of our PGA approach.Our approach scales to 16,384 processors with super-linear speedups observed. Known as an effective heuristic for finding optimal or near-optimal solutions to difficult optimization problems, a genetic algorithm (GA) is inherently parallel for exploiting high performance and parallel computing resources for randomized iterative evolutionary computation. It remains to be a significant challenge, however, to devise parallel genetic algorithms (PGAs) that can scale to massively parallel computer architecture (also known as the mainstream supercomputer architecture) primarily because: (1) a common PGA design adopts synchronized migration, which becomes increasingly costly as more processor cores are involved in global synchronization; and (2) asynchronous PGA design and associated performance evaluation are intricate due to the fact that PGA is a type of stochastic algorithm and the amount of computation work needed to solve a problem is not simply dependent on the problem size. To address the challenge, this paper describes a scalable coarse-grained PGA-PGAP, for a well-known NP-hard optimization problem: Generalized Assignment Problem (GAP). Specifically, an asynchronous migration strategy is developed to enable efficient deme interactions and significantly improve the overlapping of computation and communication. Buffer overflow and its relationship with migration parameters were investigated to resolve the issues of observed message buffer overflow and the loss of good solutions obtained from migration. Two algorithmic conditions were then established to detect these issues caused by communication delays and improper configuration of migration parameters and, thus, guide the dynamic tuning of PGA parameters to detect and avoid these issues. A set of computational experiments is designed to evaluate the scalability and numerical performance of PGAP. These experiments were conducted for large GAP instances on multiple supercomputers as part of the National Science Foundation Extreme Science and Engineering Discovery Environment (XSEDE). Results showed that, PGAP exhibited desirable scalability by achieving low communication cost when using up to 16,384 processor cores. Near-linear and super-linear speedups on large GAP instances were obtained in strong scaling tests. Desirable scalability to both population size and the number of processor cores were observed in weak scaling tests. The design strategies applied in PGAP are applicable to general asynchronous PGA development.
TL;DR: This work considers a market-based resource allocation model for batch jobs in cloud computing clusters that incorporates the importance of the due date of a job rather than the number of servers allocated to it at any given time and constructs truthful allocation and pricing mechanisms.
Abstract: We consider a market-based resource allocation model for batch jobs in cloud computing clusters. In our model, we incorporate the importance of the due date of a job rather than the number of servers allocated to it at any given time. Each batch job is characterized by the work volume of total computing units (e.g., CPU hours) along with a bound on maximum degree of parallelism. Users specify, along with these job characteristics, their desired due date and a value for finishing the job by its deadline. Given this specification, the primary goal is to determine the scheduling of cloud computing instances under capacity constraints in order to maximize the social welfare (i.e., sum of values gained by allocated users). Our main result is a new (CC-kcss-1)-approximation algorithm for this objective, where C denotes cloud capacity, k is the maximal bound on parallelized execution (in practical settings, k C) and s is the slackness on the job completion time, that is, the minimal ratio between a specified deadline and the earliest finish time of a job. Our algorithm is based on utilizing dual fitting arguments over a strengthened linear program to the problem. Based on the new approximation algorithm, we construct truthful allocation and pricing mechanisms, in which reporting the true value and other properties of the job (deadline, work volume, and the parallelism bound) is a dominant strategy for all users. To that end, we extend known results for single-value settings to provide a general framework for transforming allocation algorithms into truthful mechanisms in domains of single-value and multi-properties. We then show that the basic mechanism can be extended under proper Bayesian assumptions to the objective of maximizing revenues, which is important for public clouds. We empirically evaluate the benefits of our approach through simulations on data-center job traces, and show that the revenues obtained under our mechanism are comparable with an ideal fixed-price mechanism, which sets an on-demand price using oracle knowledge of users’ valuations. Finally, we discuss how our model can be extended to accommodate uncertainties in job work volumes, which is a practical challenge in cloud settings.
TL;DR: A provably efficient scheduling algorithm, the Piper algorithm, is described, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested and automatically throttles the parallelism, precluding “runaway” pipelines.
Abstract: Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this article investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P+O(T∞+lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as “lazy enabling” and “dependency folding.” We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
TL;DR: It is shown that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
Abstract: A parallel algorithm to compute correctly-rounded floating-point sumsHighly-optimized implementations for modern CPUs, GPUs and Xeon PhiAs fast as memory bandwidth allows for large sums with moderate dynamic rangeScales well with the problem size and resources used on a cluster of compute nodes On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and, therefore, non-reproducible mainly due to the non-associativity of floating-point operations. We introduce an approach to compute the correctly rounded sums of large floating-point vectors accurately and efficiently, achieving deterministic results by construction. Our multi-level algorithm consists of two main stages: first, a filtering stage that relies on fast vectorized floating-point expansion; second, an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, Intel Xeon Phi co-processors, and both AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
TL;DR: The design and implementation of Flex-MPI is described, an MPI library extension which can automatically monitor and predict the performance of applications, balance and redistribute the workload, and reconfigure the application at runtime by changing the number of processes.
Abstract: We present Flex-MPI, an MPI extension which supports malleability for MPI programs.We introduce a performance-aware dynamic reconfiguring technique for malleability.We implement a prediction model to estimate the performance of MPI programs.We present a simple API to access Flex-MPI functionalities from MPI applications.We demonstrate Flex-MPI capabilities to enhance performance via malleability. The work in this paper focuses on providing malleability to MPI applications by using a novel performance-aware dynamic reconfiguration technique. This paper describes the design and implementation of Flex-MPI, an MPI library extension which can automatically monitor and predict the performance of applications, balance and redistribute the workload, and reconfigure the application at runtime by changing the number of processes. Unlike existent approaches, our reconfiguring policy is guided by user-defined performance criteria. We focus on iterative SPMD programs, a class of applications with critical mass within the scientific community. Extensive experiments show that Flex-MPI can improve the performance, parallel efficiency, and cost-efficiency of MPI programs with a minimal effort from the programmer.
TL;DR: In this paper, the authors compare two multicore thread-parallel adaptations of a state-of-the-art branch-and-bound algorithm for the maximum clique problem and provide a novel explanation as to why they are successful.
Abstract: Finding a maximum clique in a given graph is one of the fundamental NP-hard problems. We compare two multicore thread-parallel adaptations of a state-of-the-art branch-and-bound algorithm for the maximum clique problem and provide a novel explanation as to why they are successful. We show that load balance is sometimes a problem but that the interaction of parallel search order and the most likely location of solutions within the search space is often the dominating consideration. We use this explanation to propose a new low-overhead, scalable work-splitting mechanism. Our approach uses explicit early diversity to avoid strong commitment to the weakest heuristic advice and late resplitting for balance. More generally, we argue that, for branch-and-bound, parallel algorithm design should not be performed independently of the underlying sequential algorithm.
TL;DR: In this paper, a speculative segmented sum strategy for the CSR-based SpMV algorithm is proposed, where the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector.
Abstract: A speculative segmented sum strategy for the CSR-based SpMV.Utilizing both GPU cores and CPU cores in a heterogeneous processor.No format conversion or tuning overhead for input sparse matrices in the CSR format.High speedup over the CSR-vector algorithm running irregular matrices.No performance penalty for most regular matrices. Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.
TL;DR: A divide-and-conquer bijective graph matching algorithm was proposed in this article, which combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures.
Abstract: We present a novel divide-and-conquer bijective graph matching algorithm.The algorithm is fully parallelizable, and scales to match "big data" graphs.We demonstrate the effectiveness of the algorithm by matching DTMRI human connectomes. We present a parallelized bijective graph matching algorithm that leverages seeds and is designed to match very large graphs. Our algorithm combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures. We justify our approach by proving that modestly correlated, large stochastic block model random graphs are correctly matched utilizing very few seeds through our divide-and-conquer procedure. We also demonstrate the effectiveness of our approach in matching very large graphs in simulated and real data examples, showing up to a factor of 8 improvement in runtime with minimal sacrifice in accuracy.
TL;DR: A thorough analysis of hedonic clustering games in two different models, characterizing Nash equilibria, and proving upper and lower bounds on the price of anarchy and price of stability are provided.
Abstract: Clustering, the partitioning of objects with respect to a similarity measure, has been extensively studied as a global optimization problem. We investigate clustering from a game-theoretic approach, and consider the class of hedonic clustering games. Here, a self-organized clustering is obtained via decisions made by independent players, corresponding to the elements clustered. Being a hedonic setting, the utility of each player is determined by the identity of the other members of her cluster. This class of games seems to be quite robust, as it fits with rather different, yet commonly used, clustering criteria. Specifically, we investigate hedonic clustering games in two different models: fixed clustering, which subdivides into k-median and k-center, and correlation clustering. We provide a thorough analysis of these games, characterizing Nash equilibria, and proving upper and lower bounds on the price of anarchy and price of stability. For fixed clustering we focus on the existence of a Nash equilibrium, as it is a rather nontrivial issue in this setting. We study it both for general metrics and special cases, such as line and tree metrics. In the correlation clustering model, we study both minimization and maximization variants, and provide almost tight bounds on both the price of anarchy and price of stability.
TL;DR: This article proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorizations algorithms survive fail-stop failures and presents a generic solution for protecting the right factor, where the updates are applied, of all above mentioned factorizations.
Abstract: Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This article proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorizations algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We will present a generic solution for protecting the right factor, where the updates are applied, of all above mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features high degree of checkpointing parallelism and cooperatively utilizes the checksum storage leftover from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution is applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases inversely to the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with- and without-errors. Applicability to tolerate multiple failures and accuracy after multiple recovery is also considered.
TL;DR: In this article, the authors consider the problem of tree-shaped task graphs with multiple processors and design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan.
Abstract: This article investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only be removed from memory after the completion of the task that uses it as an input data. Such trees arise in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem was well studied, and optimal polynomial algorithms were proposed.Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree—that is, to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.
TL;DR: This paper implemented different graph algorithms and compared their performances using a naive partitioning solution against more elaborate strategies, both static and dynamic.
Abstract: Many of today’s large datasets are organized as a graph. Due to their size it is often infeasible to process these graphs using a single machine. Therefore, many software frameworks and tools have been proposed to process graph on top of distributed infrastructures. This software is often bundled with generic data decomposition strategies that are not optimised for specific algorithms. In this paper we study how a specific data partitioning strategy affects the performances of graph algorithms executing on Apache Spark. To this end, we implemented different graph algorithms and we compared their performances using a naive partitioning solution against more elaborate strategies, both static and dynamic.
TL;DR: Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms and is of the same order in terms of memory as the other two.
Abstract: Introduces a graph contraction based distributed-memory connected component algorithm.Four alternative distributed-memory connected component algorithms are presented.Theoretical and experimental analysis is presented for the five algorithms.Classes of problems under which the algorithms are most applicable are identified.Novel algorithm shows better scalability across the range of scientific computing graphs used herein. Connected component labeling is a key step in a wide-range of applications, such as community detection in social networks and coherent structure identification in massively-parallel scientific simulations. There have been several distributed-memory connected component algorithms described in literature; however, little has been done regarding their scalability analysis. Theoretical and experimental results are presented for five algorithms: three that are direct implementations of previous approaches, one that is an implementation of a previous approach that is optimized to reduce communication, and one that is a novel approach based on graph contraction. Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms. Furthermore, it uses significantly less memory than two of the alternative methods and is of the same order in terms of memory as the other two.
TL;DR: A parallel solver for general tridiagonal irreducible systems and its CUDA implementation are described, indicating that g-Spike is competitive in runtime with existing GPU methods, and can provide acceptable results when other methods cannot be applied or fail.
Abstract: A parallel solver for general tridiagonal irreducible systems is described.Solver based on Spike framework and Givens-QR with occasional low-rank modification.Modifications handle singularities exposed by QR in blocks of the parallel partition.The GPU implementation has similar performance to existing methods.Method returns accurate results when current GPU tridiagonal solvers fail. g-Spike, a parallel algorithm for solving general nonsymmetric tridiagonal systems for the GPU, and its CUDA implementation are described. The solver is based on the Spike framework, applying Givens rotations and QR factorization without pivoting. It also implements a low-rank modification strategy to compute the Spike DS decomposition even when the partitioning defines singular submatrices along the diagonal. The method is also used to solve the reduced system resulting from the Spike partitioning. Numerical experiments with problems of high order indicate that g-Spike is competitive in runtime with existing GPU methods, and can provide acceptable results when other methods cannot be applied or fail.
TL;DR: The experimental results show that COPA could reduce a significant amount of execution time, and the approach achieves the speedups of up to 10.26 on multicore CPU implementations and 17.53 on GPU implementations when the sequential dynamic programming algorithm for KP01 is considered as a baseline.
Abstract: A parallel algorithm COPA is proposed for the 0-1 knapsack problem.There are no memory conflicts for the proposed algorithm.The proposed algorithm has better scalability.The proposed algorithm can achieve good results on general GPU. The 0-1 knapsack problem has been extensively studied in the past years due to its immediate applications in industry and financial management, such as cargo loading, stock cutting, and budget control. Many algorithms have been proposed to solve this problem, most of which are heuristic, as the problem is well-known to be NP-hard. Only a few optimal algorithms have been designed to solve this problem but with high time complexity. This paper proposes the cost-optimal parallel algorithm (COPA) on an EREW PRAM model with shared memory to solve this problem. COPA is scalable and yields optimal solutions consuming less computational time. Furthermore, this paper implements COPA on two scenarios - multicore CPU based architectures using Open MP and GPU based configurations using CUDA. A series of experiments are conducted to examine the performance of COPA under two different test platforms. The experimental results show that COPA could reduce a significant amount of execution time. Our approach achieves the speedups of up to 10.26 on multicore CPU implementations and 17.53 on GPU implementations when the sequential dynamic programming algorithm for KP01 is considered as a baseline. Importantly, GPU implementations outstand themselves in the experimental results.
TL;DR: SybilCast is a protocol for multichannel wireless networks that limits the number of fake identities and, in doing so, ensures that each honest user gets at least a constant fraction of his or her fair share of the bandwidth.
Abstract: Consider a scenario where many wireless users are attempting to download data from a single base station. While most of the users are honest, some users may be malicious and attempt to obtain more than their fair share of the bandwidth. One possible strategy for attacking the system is to simulate multiple fake identities, each of which is given its own equal share of the bandwidth. Such an attack is often referred to as a sybil attack. To counter such behavior, we propose SybilCast, a protocol for multichannel wireless networks that limits the number of fake identities and, in doing so, ensures that each honest user gets at least a constant fraction of his or her fair share of the bandwidth. As a result, each honest user can complete his or her data download in asymptotically optimal time. A key aspect of this protocol is balancing the rate at which new identities are admitted and the maximum number of fake identities that can coexist while keeping the overhead low. Besides sybil attacks, our protocol can also tolerate spoofing and jamming.
TL;DR: This work focuses on a continuous time model, where n mobile robots continuously sense the positions of other robots within their viewing range and continuously adapt their speed and direction according to some simple, local rules, and shows that this idealized idea of continuous sensing allows this model to solve formation problems in linear time O(n) (which, given the maximum speed of one, immediately yields a maximum traveled distance of O( n).
Abstract: We study a scenario in which n mobile robots with a limited viewing range are distributed in the Euclidean plane and have to solve a formation problem. The formation problems we consider are the Gathering problem and the Chain-Formation problem. In the Gathering problem, the robots have to gather in one (not predefined) point, while in the Chain-Formation problem they have to form a connected communication chain of minimal length between two stationary base stations. Each robot may base its decisions where to move only on the current relative positions of neighboring robots (that are within its viewing range); that is, besides having a limited viewing range, the robots are oblivious (they do not use information from the past), have none or only very limited identities, and they do not have a common sense of direction. Variants of these problems (especially for the Gathering problem) have been studied extensively in different discrete time models. In contrast, our work focuses on a continuous time model; that is, the robots continuously sense the positions of other robots within their viewing range and continuously adapt their speed and direction according to some simple, local rules. Hereby, we assume that the robots have a maximum movement speed of one. We show that this idealized idea of continuous sensing allows us to solve the mentioned formation problems in linear time O(n) (which, given the maximum speed of one, immediately yields a maximum traveled distance of O(n)). Note that in the more classical discrete time models, the best known strategies need at least Θ(n2) or even Θ(n2logn) timesteps to solve these problems. For the Gathering problem, our analysis solves a problem left open by Gordon et al. [2004], where the authors could prove that gathering in a continuous model is possible in finite time, but were not able to give runtime bounds. Apart from these linear bounds, we also provide runtime bounds for both formation problems that relate the runtime of our strategies to the runtime of an optimal, global algorithm. Specifically, we show that our strategy for the Gathering problem is log OPT-competitive and the strategy for the Chain-Formation problem is log n-competitive. Here, by c-competitive, we mean that our (local) strategy is asymptotically by at most a factor of c slower than an optimal, global strategy.
TL;DR: CDSM, Communication Detection in Shared Memory, a mechanism that detects communication in from page faults and uses this information to perform the communication-aware mapping of parallel applications to the hardware topology, improving their performance and energy efficiency.
Abstract: We perform online detection of inter-process and inter-thread communication.Detected communication pattern is used to migrate processes and threads.Operating System-based mechanism, no changes to applications or runtime libraries.We reduce execution time and energy consumption.Evaluation on shared memory machines and a cluster show substantial improvements. The rising complexity of memory hierarchies and interconnections in parallel shared memory architectures leads to differences in the communication performance. These differences can be exploited to perform a communication-aware mapping of parallel applications to the hardware topology, improving their performance and energy efficiency. To perform the mapping, it is necessary to determine the communication behavior of the processes and threads of the application. Previous methods rely on static communication traces to detect communication, require hardware changes or support only a subset of parallelization models.We propose CDSM, Communication Detection in Shared Memory, a mechanism that detects communication in from page faults and uses this information to perform the mapping. CDSM works on the operating system level during the execution of the parallel application and supports all parallelization models that use shared memory for communication. It does not require modifications to the applications, previous knowledge about their behavior, or changes to the hardware and runtime libraries. Experiments with the MPI, MPI+OpenMP and OpenMP implementations of the NAS parallel benchmarks, the HPCC benchmark and the PARSEC benchmark suite on a shared memory machine show that CDSM has a high detection accuracy with a negligible overhead. Execution time and processor energy consumption were reduced by up to 35.9% and 18.9%, respectively (10.2% and 7.3%, on average). Experiments on a cluster system, where CDSM optimizes the communication within each node, showed an average execution time reduction of 10.4%.
TL;DR: This work analyzes the scalability of inexact two-level balancing domain decomposition by constraints preconditioners for Krylov subspace iterative solvers using a highly scalable asynchronous parallel implementation where fine and coarse correction computations are overlapped in time.
Abstract: In this work, we analyze the scalability of inexact two-level balancing domain decomposition by constraints (BDDC) preconditioners for Krylov subspace iterative solvers, when using a highly scalable asynchronous parallel implementation where fine and coarse correction computations are overlapped in time. This way, the coarse-grid problem can be fully overlapped by fine-grid computations (which are embarrassingly parallel) in a wide range of cases. Further, we consider inexact solvers to reduce the computational cost/complexity and memory consumption of coarse and local problems and boost the scalability of the solver. Out of our numerical experimentation, we conclude that the BDDC preconditioner is quite insensitive to inexact solvers. In particular, one cycle of algebraic multigrid (AMG) is enough to attain algorithmic scalability. Further, the clear reduction of computing time and memory requirements of inexact solvers compared to sparse direct ones makes possible to scale far beyond state-of-the-art BDDC implementations. Excellent weak scalability results have been obtained with the proposed inexact/overlapped implementation of the two-level BDDC preconditioner, up to 93,312 cores and 20 billion unknowns on JUQUEEN. Further, we have also applied the proposed setting to unstructured meshes and partitions for the pressure Poisson solver in the backward-facing step benchmark domain.
TL;DR: This work proposes a runtime system based adaptive approach to turn off unused links, which has various advantages over the previously proposed hardware and compiler based approaches.
Abstract: Networks are among major power consumers in large-scale parallel systems. During execution of common parallel applications, a sizeable fraction of the links in the high-radix interconnects are either never used or are underutilized. We propose a runtime system based adaptive approach to turn off unused links, which has various advantages over the previously proposed hardware and compiler based approaches. We discuss why the runtime system is the best system component to accomplish this task, and test the effectiveness of our approach using real applications (including NAMD, MILC), and application benchmarks (including NAS Parallel Benchmarks, Stencil). These codes are simulated on representative topologies such as 6-D Torus and multilevel directly connected network (similar to IBM PERCS in Power 775 and Dragonfly in Cray Aries). For common applications with near-neighbor communication pattern, our approach can save up to 20p of total machine's power and energy, without any performance penalty.
TL;DR: An extension of the bulk-synchronous parallel (BSP) model to abstract and model parallelism in the presence of multiple memory hierarchies and multiple cores is presented, called the new model MBSP for multi-memory BSP.
Abstract: Introduction of MBSP a new model of multi-core and out-of-core computing.Comparison of MBSP to other models of computation multi-core or out-of-core.Justification of design choices in selecting parameter set of MBSP.Analysis of a parallel deterministic regular oversampling sorting algorithm.Design and analysis of a sorting algorithm under MBSP. We present an extension of the bulk-synchronous parallel (BSP) model to abstract and model parallelism in the presence of multiple memory hierarchies and multiple cores. We call the new model MBSP for multi-memory BSP. The BSP model has been used to model internal memory parallel computers; MBSP retains the properties of BSP and in addition can abstract not only traditional external memory-supported parallelism (e.g. that uses another level of slower memory) but also multi-level cache-based memory hierarchies such as those present in multi-core systems. Present day multi-core systems are limited parallelism architectures with fast inter-core communication but limited fast memory availability. Abstracting the programming requirements of such architectures in a useful and usable manner is the objective of introducing MBSP. We propose multi-core program and algorithm design that measures resource utilization through a septuplet ( p , l , g , m , L , G , M ) in which ( p , l , g ) are the BSP parameters for modeling processor component size and interprocessor communication through latency-based and throughput-based cost mechanisms, and ( m , L , G , M ) are the new parameters that abstract additional memory hierarchies. Each processor component is attached to a memory of size M, and there are also m memory-units accessing a slower memory of unlimited size of latency-based and throughput-based cost ( L , G ) . A deterministic sorting algorithm is described on this model that is potentially both usable and useful.
TL;DR: By efficiently parallelizing steps in color-coding, this work creates two new biological protein interaction network analysis tools: Fascia for subgraph counting and motif finding and FastPath for signaling pathway detection.
Abstract: New shared- and distributed-memory parallelizations of the color-coding method.Bioinformatics applications: motif detection, signaling pathway discovery in protein interaction networks.Parallel performance analysis of optimized color-coding implementations. We present new parallelization and memory-reducing strategies for the graph-theoretic color-coding approximation technique, with applications to biological network analysis. Color-coding is a technique that gives fixed parameter tractable algorithms for several well-known NP-hard optimization problems. In this work, by efficiently parallelizing steps in color-coding, we create two new biological protein interaction network analysis tools: Fascia for subgraph counting and motif finding and FastPath for signaling pathway detection. We demonstrate considerable speedup over prior work, and the optimizations introduced in this paper can also be used for other problems where color-coding is applicable.