TL;DR: It is observed that for n/p ≥ p, which is usually the case in practice, there exists a very simple, deterministic, optimal coarse grained parallel integer sorting algorithm with 24 communication rounds and very good performance in practice.
Abstract: We observe that for n/p ≥ p, which is usually the case in practice, there exists a very simple, deterministic, optimal coarse grained parallel integer sorting algorithm with 24 communication rounds (6 n/p-relations and 18 p-relations), O(n/p) memory per processor, and O(n/p) local computation. Experimental data indicates that the algorithm has very good performance in practice.
TL;DR: This paper presents two simple all-to-all broadcasting algorithms on a honeycomb mesh, one using a personalized routing strategy at each node and one based on a Hamiltonian path with a 2n communication time complexity, and shows how they can be used to obtain parallel solutions to a class of problems on honeycomb networks.
Abstract: This paper presents two simple all-to-all broadcasting algorithms on a honeycomb mesh. For a network with n processors, the first uses a personalized routing strategy at each node and has a 3n communication time complexity. This communication time can be reduced to n because the computation time is always assumed to be much lower than the communication time. The second is based on a Hamiltonian path and has a 2n communication time complexity. We show how they can be used to obtain parallel solutions to a class of problems on honeycomb networks, among others Prefix Sums, Maximal Vectors, Maximal Sum Subsegment, Parenthesis Matching, Decoding Binary Tree, and Sorting. To our knowledge, these all-to-all broadcast algorithms are the only ones exhibited so far on a honeycomb.
TL;DR: This paper studies the solution of stable generalized Lyapunov matrix equations with large-scale, dense coefficient matrices with iterative algorithms, based on the matrix sign function, which only require scalable matrix algebra kernels which are highly efficient on parallel distributed architectures.
Abstract: In this paper we study the solution of stable generalized Lyapunov matrix equations with large-scale, dense coefficient matrices. Our iterative algorithms, based on the matrix sign function, only require scalable matrix algebra kernels which are highly efficient on parallel distributed architectures. This approach therefore avoids the difficult parallelization of direct methods based on the QZ algorithm. The experimental analysis reports a remarkable performance of our solvers on an IBM SP2 platform.
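As a rough illustration of the sign-function idea, the sketch below applies the classical Newton iteration for the matrix sign function to the standard Lyapunov equation A^T X + X A + Q = 0. This is the generalized equation with E = I, a simplification assumed here. The function names, the fixed iteration count, and the small pure-Python dense kernels are illustrative assumptions and not the authors' solver. The only kernels the iteration needs are matrix inversion, multiplication, and addition.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def average(a, b):
    return [[(x + y) / 2.0 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lyapunov_sign(a, q, iterations=30):
    """Solve A^T X + X A + Q = 0 for stable A (assumed) via the sign iteration.

    A_{k+1} = (A_k + A_k^{-1}) / 2
    Q_{k+1} = (Q_k + A_k^{-T} Q_k A_k^{-1}) / 2
    X       = Q_inf / 2
    """
    ak, qk = a, q
    for _ in range(iterations):
        ainv = inverse(ak)
        qk = average(qk, matmul(matmul(transpose(ainv), qk), ainv))
        ak = average(ak, ainv)
    return [[v / 2.0 for v in row] for row in qk]

def residual(a, q, x):
    """Largest absolute entry of A^T X + X A + Q."""
    at_x = matmul(transpose(a), x)
    x_a = matmul(x, a)
    n = len(a)
    return max(abs(at_x[i][j] + x_a[i][j] + q[i][j])
               for i in range(n) for j in range(n))
```

Calling `lyapunov_sign(a, q)` on a small stable matrix and checking `residual(a, q, x)` gives a quick sanity test. A production code would replace the fixed iteration count with a convergence test and add scaling.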
TL;DR: This paper introduces Normality, a consistency condition for concurrent objects, and shows that it is equivalent to Linearizability when each operation is on exactly one object but strictly weaker when operations can span multiple objects, i.e., a history may be normal but not linearizable.
Abstract: This paper is focused on concurrent objects (objects shared by concurrent processes). It introduces a consistency condition called Normality whose definition is based only on local orders of operations as perceived by processes and by objects. First we consider the model in which each operation is on exactly one object. In this model we show that a history is linearizable iff it is normal. However, the definition of Normality is less constraining in the sense that there are strictly more legal sequential histories which are considered equivalent to the given history when Normality is used. We next consider a more general model where operations can span multiple objects. In this model we show that Normality is strictly weaker than Linearizability, i.e., a history may be normal but not linearizable. As Normality refers only to local orders (process order and object order) it appears to be well-suited to objects supported by asynchronous distributed systems and accessed by RPC-like mechanisms.
TL;DR: Since matching-based heuristics are simple, and do not require intensive computation, they appear to be the best candidates to solve broadcasting and gossiping problems in multi-user mesh architectures.
Abstract: Broadcasting (one-to-all) and gossiping (all-to-all) are two major communication paradigms that were considered from both practical and theoretical points of view. Indeed, such communication patterns frequently appear in parallel programming, and therefore are included in most of the communication libraries (e.g., MPI or PVM). Also, broadcasting and gossiping times of graphs are important parameters yielding lots of fundamental results. For most of the communication models, the corresponding decision problems are NP-complete in general. Therefore, in this paper, we consider broadcasting and gossiping heuristics. We study the performances of several heuristics applied to partial meshes, that is to connected subgraphs of the mesh. This choice of topology is motivated by the fact that a regularly connected multicomputer can be shared by many users, each of them dealing with an irregular sub-topology of the original network. The result of our comparison is that, although they were defined for arbitrary topologies, the matching-based heuristics offer the best performances for partial meshes compared to sophisticated heuristics defined specifically for these networks (we derive a new heuristic for this class of topologies), or for graphs with -diameter. Since matching-based heuristics are simple, and do not require intensive computation, they appear to be the best candidates to solve broadcasting and gossiping problems in multi-user mesh architectures.
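To make the idea of a matching-based broadcast round concrete, here is a minimal sketch on a partial mesh. It assumes a model in which each informed node can inform at most one neighbor per round, and it uses a greedy maximal matching between informed and uninformed nodes as a stand-in for whatever matching the heuristics in the paper actually compute. The function names and the cell representation as (row, column) pairs are hypothetical.

```python
def neighbors(cell, cells):
    """Mesh neighbors of a cell that are present in the partial mesh."""
    r, c = cell
    candidates = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
    return [n for n in candidates if n in cells]

def broadcast_rounds(cells, source):
    """Count rounds needed to inform every cell, one greedy matching per round."""
    cells = set(cells)
    informed = {source}
    rounds = 0
    while informed != cells:
        taken = set()  # receivers already matched in this round
        for sender in sorted(informed):
            for receiver in neighbors(sender, cells):
                if receiver not in informed and receiver not in taken:
                    taken.add(receiver)  # match sender with receiver
                    break
        if not taken:
            raise ValueError("partial mesh is not connected")
        informed |= taken
        rounds += 1
    return rounds
```

For example, `broadcast_rounds({(0, 0), (0, 1), (1, 0), (1, 1)}, (0, 0))` runs the heuristic on a full 2x2 mesh, and removing a cell gives an irregular sub-topology of the kind the abstract describes.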
TL;DR: An infrastructure, the Active Data Repository (ADR), is described, which provides a framework for building databases that enables integration of storage, retrieval and processing of multi-dimensional datasets on a parallel machine.
Abstract: As computational power and storage capacity increase, processing and analyzing large volumes of multi-dimensional datasets play an increasingly important role in many domains of scientific research. Scientific applications that make use of very large scientific datasets have several important characteristics: datasets consist of complex data and are usually multi-dimensional; applications usually retrieve a subset of all the data available in the dataset; various application-specific operations are performed on the data items retrieved. Such applications can be supported by object-relational database management systems (OR-DBMSs). In addition to providing functionality to define new complex datatypes and user-defined functions, an OR-DBMS for scientific datasets should contain runtime support that will provide optimized storage for very large datasets and an execution environment for user-defined functions involving expensive operations. In this paper we describe an infrastructure, the Active Data Repository (ADR), which provides a framework for building databases that enables integration of storage, retrieval and processing of multi-dimensional datasets on a parallel machine. The system architecture of ADR provides the functionality required from runtime support for an OR-DBMS that stores and processes scientific multi-dimensional datasets. We present the system architecture of the ADR, and experimental performance results for three applications implemented using ADR.
TL;DR: It is established that there exists a Hamiltonian cycle in a honeycomb torus with two adjacent faulty nodes and that with a single fault a ring embedding with one less node than the fault-free torus can be found.
Abstract: Honeycomb torus networks have been recognised as an attractive alternative to existing torus interconnection networks in parallel and distributed applications. In this paper we establish that there exists a Hamiltonian cycle in a honeycomb torus with two adjacent faulty nodes and that with a single fault a ring embedding with one less node than the fault-free torus can be found.
TL;DR: This paper presents efficient parallel algorithms for finding a minimum weighted connected dominating set and a minimum weighted Steiner tree of a distance hereditary graph, which take O(log n) time using O(n + m) processors on a CRCW PRAM, where n and m are the number of vertices and edges of the given graph, respectively.
Abstract: Distance hereditary graphs are graphs in which every two vertices have the same distance in every connected induced subgraph containing them. In this paper, we study properties of distance hereditary graphs from the viewpoint of parallel computations. We present efficient parallel algorithms for finding a minimum weighted connected dominating set and a minimum weighted Steiner tree, which take O(log n) time using O(n + m) processors on a CRCW PRAM, where n and m are the number of vertices and edges of a given graph, respectively. We also find a maximum weighted clique of a distance hereditary graph in O(log^2 n) time using O(n + m) processors on a CREW PRAM.
TL;DR: A new algorithm for deterministic sorting on the Bulk-Synchronous Parallel (BSP) model of computation is presented that is one-optimal for a wide range of the BSP parameters in the sense that its speedup on p processors is asymptotically (1 - o(1))p.
Abstract: We present a new algorithm for deterministic sorting on the Bulk-Synchronous Parallel (BSP) model of computation. We sort n keys using a partitioning scheme that achieves the requirements of efficiency (one-optimality) and insensitivity against initial key distribution. Although we employ sampling to realize efficiency, we give a precise worst-case estimation of the maximum imbalance which might occur. The algorithm is one-optimal for a wide range of the BSP parameters in the sense that its speedup on p processors is asymptotically (1 - o(1))p.
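The abstract does not spell out its partitioning scheme, so the following is only a generic sketch of deterministic sorting by regular sampling, simulated sequentially. Each of the p simulated processors sorts its block, contributes p - 1 evenly spaced samples, and the sorted samples yield p - 1 splitters that define the buckets. The function name and the sampling rate are assumptions chosen for illustration, and no claim is made that this matches the paper's imbalance bound.

```python
import bisect

def regular_sample_sort(keys, p):
    """Return p sorted buckets whose concatenation is the sorted input."""
    n = len(keys)
    # Step 1: block distribution and local sort on each simulated processor.
    blocks = [sorted(keys[n * i // p: n * (i + 1) // p]) for i in range(p)]
    # Step 2: every processor contributes p - 1 evenly spaced samples.
    samples = []
    for block in blocks:
        m = len(block)
        if m > 0:
            samples.extend(block[m * j // p] for j in range(1, p))
    samples.sort()
    # Step 3: choose p - 1 splitters from the sorted sample.
    s = len(samples)
    splitters = [samples[s * j // p] for j in range(1, p)] if s else []
    # Step 4: route every key to the bucket selected by the splitters.
    buckets = [[] for _ in range(p)]
    for block in blocks:
        for key in block:
            buckets[bisect.bisect_right(splitters, key)].append(key)
    # Step 5: each processor sorts the keys it received.
    return [sorted(bucket) for bucket in buckets]
```

Inspecting `max(len(b) for b in buckets)` against n/p on different inputs shows how well a given sampling rate balances the load.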
TL;DR: This paper describes an analytic approach to model both regular and irregular access patterns for sparse matrix-dense matrix product algorithms and results are shown for several current microprocessor architectures.
Abstract: Nowadays the performance gap between processors and main memory makes an efficient usage of the memory hierarchy necessary for good program performance. Several techniques have been proposed for this purpose. Nevertheless most of them consider only regular access patterns, while many scientific and numerical applications give rise to irregular patterns. A typical case is that of indirect accesses due to the use of compressed storage formats for sparse matrices. This paper describes an analytic approach to model both regular and irregular access patterns. The application modeled is an optimized sparse matrix-dense matrix product algorithm with several levels of blocking. Our model can be directly applied to any memory hierarchy consisting of K-way associative caches. Results are shown for several current microprocessor architectures.
TL;DR: This paper describes a parameterized protocol applicable to directed acyclic graph (DAG) topologies whose two instantiations are self-stabilizing: they are resilient to transient failures and guarantee system recovery in a finite time linear in the network diameter.
Abstract: This paper describes a parameterized protocol applicable to directed acyclic graph (DAG) topologies. The function parameter of the protocol is instantiated twice to design two specific protocols: (i) the topological sorting of the successor list at every node, and (ii) a shortest path routing table construction. Both protocols are self-stabilizing and thus they are resilient to transient failures and guarantee system recovery in a finite time linear in the network diameter. From the fact that a DAG topology can be imposed on a more general topology through graph labeling protocols, the solutions presented in this paper are expected to be quite useful for a large class of distributed systems, where an optimal routing along with the robustness and fault tolerance are key factors.
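The self-stabilizing behaviour can be illustrated with a small synchronous simulation. In this sketch every node repeatedly recomputes its distance to a target from the values currently held by its successors, starting from an arbitrary (possibly corrupted) state. The round-based scheduler, the function name, and the dictionary encoding of the DAG are assumptions made for illustration, and the paper's actual protocol and its function parameter are not reproduced here.

```python
import math

def stabilize(successors, target, state, rounds):
    """Run synchronous rounds of the local shortest-path rule on a DAG.

    successors: {node: {successor: edge_weight}}
    state:      arbitrary initial distance estimate per node
    """
    for _ in range(rounds):
        new_state = {}
        for node, out_edges in successors.items():
            if node == target:
                new_state[node] = 0
            else:
                # Local rule: best successor estimate plus edge weight,
                # or infinity when the node has no successor.
                new_state[node] = min(
                    (w + state[succ] for succ, w in out_edges.items()),
                    default=math.inf,
                )
        state = new_state
    return state

dag = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {"d": 1}, "d": {}}
corrupted = {"a": -7, "b": 99, "c": 0, "d": 5}
recovered = stabilize(dag, "d", corrupted, rounds=len(dag))
```

In this toy setting, a node whose longest outgoing path has h edges holds a correct value after h + 1 rounds, so running as many rounds as there are nodes is enough regardless of the initial state.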
TL;DR: This paper presents a randomized parallel algorithm for the Maximal Independent Set problem that uses a BSP-like computer with p processors and requires that $\frac{n+m}{p} =\Omega(p)$ for...
Abstract: This paper presents a randomized parallel algorithm for the Maximal Independent Set problem. Our algorithm uses a BSP-like computer with p processors and requires that $\frac{n+m}{p} =\Omega(p)$ for...
TL;DR: This work presents the first efficient deterministic algorithm to solve the branch-and-bound problem for a tree T of constant degree on a p-processor parallel machine and matches the optimal performance of existing randomized strategies.
Abstract: The branch-and-bound problem involves determining the minimum cost leaf in a cost-labelled tree, subject to the constraint that only the root is known initially and that children are revealed only by visiting their parent. We present the first efficient deterministic algorithm to solve the branch-and-bound problem for a tree T of constant degree on a p-processor parallel machine. Let c* be the cost of the minimum-cost leaf in T, and let n and h be the number of nodes and the height, respectively, of the subtree T* ⊆ T of nodes of cost less than or equal to c*. Our algorithm runs in O(n/p + h log^2(np)) time on an EREW-PRAM. Moreover, the running time faithfully reflects both communication and computation costs, unlike most of the previous results where the cost of local computation is ignored. For large ranges of the parameters, our algorithm matches the optimal performance of existing randomized strategies. The algorithm can be ported to any architecture for which an efficient implementation of Parallel P...
TL;DR: The salient features of Opus and its implementation are described, emphasizing the issues related to the coordination of data-parallel HPF programs in a heterogeneous environment.
Abstract: The coordination language Opus is an object-based extension of High Performance Fortran (HPF) that supports the integration of coarse-grain task parallelism with HPF-style data parallelism. In this paper we discuss Opus in the context of multidisciplinary applications (MDAs) which execute in a heterogeneous environment. After outlining the major properties of such applications and a number of different approaches towards providing language and tool support for MDAs, we describe the salient features of Opus and its implementation, emphasizing the issues related to the coordination of data-parallel HPF programs in a heterogeneous environment.
TL;DR: Harness is an adaptable, reliable virtual machine environment being built as a follow-on to PVM, which provides the support framework for coordinating and applying parallel plugins that allow applications to customize or tune their operating environment on the fly.
Abstract: This paper describes ongoing work on the Harness system for next-generation heterogeneous distributed computing. Harness is an adaptable, reliable virtual machine environment being built as a follow-on to PVM. The three fundamental concepts presented here are parallel plugins, fault-tolerant distributed control, and dynamically merging and splitting virtual machines. The distributed control mechanisms provide the support framework necessary for coordinating and applying parallel plugins that allow applications to customize or tune their operating environment on-the-fly. In the spirit of CUMULVS, Harness applications can plug into each other to couple for collaborative computing. Virtual machines that merge and split can assist applications in dynamically utilizing different computing resources to suit changing computational needs.
TL;DR: This paper gives a parallel generating algorithm for k-ary trees represented by generalized P–sequences for execution on a linear array multicomputer.
Abstract: The only parallel generating algorithms for k-ary trees are those of Akl and Stojmenovic in 1996 and of Vajnovszki and Phillips in 1997. In the first of them, trees are represented by an inversion table and the processor model is a linear array multicomputer. In the second, trees are represented by bitstrings and the algorithm executes on a shared memory multiprocessor. In this paper we give a parallel generating algorithm for k-ary trees represented by generalized P–sequences for execution on a linear array multicomputer.
TL;DR: Two modifications of the n-process mutual exclusion algorithm by Peterson are proposed for the asynchronous multi-writer/reader shared memory model; both improve on the O(n^2 c + n^4 l) running time of the original n-process algorithm for the same model.
Abstract: We propose two modifications of the n-process mutual exclusion algorithm by Peterson for the asynchronous multi-writer/reader shared memory model. Either modification speeds up the original n-process algorithm. The running times for the trying regions of the first modified algorithm and the second modified algorithm are (2n - 3)c + O(n^3 l) and (n - 1)c + O(n^3 l), respectively, where n is the number of processes, l is an upper bound on the time between two steps, and c is an upper bound on the time that any user spends in the critical region. These running times are improvements on the running time, O(n^2 c + n^4 l), of the original n-process algorithm for the same asynchronous shared memory model.
TL;DR: A new method for the ILU(0) factorization of sparse systems in distributed memory multiprocessor architectures using a symbolic reordering technique, so the final system can be grouped in blocks where the rows are independent and the factorization can be carried out in parallel.
Abstract: In this paper we describe a new method for the ILU(0) factorization of sparse systems in distributed memory multiprocessor architectures. This method uses a symbolic reordering technique, so the final system can be grouped in blocks where the rows are independent and the factorization of these entries can be carried out in parallel. The parallel ILU(0) factorization has been tested on the Cray T3E multicomputer using the MPI communication library. The performance was analysed using matrices from the Harwell–Boeing collection.
TL;DR: This paper describes a constant time solution to the detection of repetitions (with overlapping) problem using the one criterion BSR model, the first constant time solution to this problem on any model of computation.
Abstract: Recently Akl et al. introduced a new model of parallel computation, called BSR (broadcasting with selective reduction), and showed that it is more powerful than any CRCW PRAM and yet requires no more resources (asymptotically) for implementation than even EREW PRAM [2,3,4]. The model allows constant time solutions to sorting, parallel prefix and other problems. In this paper, we describe a constant time solution to the detection of repetitions (with overlapping) problem using the one criterion BSR model. This is the first constant time solution to this problem on any model of computation. If the problem is only to detect the existence of any repetition then n processors suffice, where n is the length of the string. If all repetitions are to be found then $n [\frac{n}{2}]$ processors suffice in our algorithm.
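TL;DR: This paper gives constant time BSR parallel algorithms for the decoding and drawing of a binary tree from its i-p sequence, one of the most common encodings for a binary tree.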
Abstract: The i-p sequence is one of the most common encodings for a binary tree. This paper gives constant time BSR parallel algorithms for the decoding and drawing of a binary tree from its i-p sequence respectively.
TL;DR: This work shows that even when the algorithm is properly implemented with respect to the authors' first observation, a problem might still arise when a multiprocessor system is used, and offers a slight modification to the algorithm which eliminates the problem and requires little synchronization.
Abstract: Replication-based incremental garbage collection is one of the more appealing concurrent garbage collection algorithms known today. It allows continuous operation of the application (the mutator) with very short pauses for garbage collection. There is a growing need for such garbage collectors suitable for multithreaded environments such as the Java Virtual Machine. Furthermore, it is desirable to construct collectors that also work on multiprocessor computers. We begin by pointing out an important, yet subtle point, which arises when implementing the replication-based garbage collector for a multithreaded environment. We first show that a simple and natural implementation of the algorithm may lead to incorrect behavior of multithreaded applications. We then show that another simple and natural implementation eliminates the problem completely. Thus, the contribution of this part is in stressing this warning to future implementors. Next, we address the effects of the memory coherence model on this algorithm. We show that even when the algorithm is properly implemented with respect to our first observation, a problem might still arise when a multiprocessor system is used. Adopting a naive solution to this problem results in very frequent (and expensive) synchronization. We offer a slight modification to the algorithm which eliminates the problem and requires little synchronization.
TL;DR: This paper presents optimal sequential and parallel algorithms to compute a minimum rectilinear link path in a trapezoided region R and provides an optimal solution to the minimum nested polygon problem and the minimum polygon separation problem.
Abstract: The link metric, defined on a constrained region R of the plane, sets the distance between a pair of points in R to equal the minimum number of line segments or links needed to construct a path in R between the point pair. The minimum rectilinear link path problem considered here is to compute a rectilinear path consisting of the minimum number of links between two points in R, when R is inside an n-sided rectilinear simple polygon. In this paper we present optimal sequential and parallel algorithms to compute a minimum rectilinear link path in a trapezoided region R. Our parallel algorithm requires O(log n) time using a total of O(n) operations. The complexity of our algorithm matches that of the algorithm of McDonald and Peters [19]. By exploiting the dual structure of the trapezoidation of R, we obtain a conceptually simple and easy to implement algorithm. As applications of our techniques we provide an optimal solution to the minimum nested polygon problem and the minimum polygon separation problem. The minimum nested polygon problem asks for finding a rectilinear polygon, with minimum number of sides, that is nested between two given rectilinear polygons, one of which is contained in the other. The minimum polygon separation problem asks for computing a minimum number of orthogonal lines and line segments that separate two given non-intersecting simple rectilinear polygons. All parallel algorithms are deterministic, designed to run on the exclusive read exclusive write parallel random access machine (EREW PRAM), and are optimal.
TL;DR: Strategies for computing the continuous wavelet transform on massively parallel SIMD arrays are introduced and the results of implementations on a MasPar MP-2 are compared.
Abstract: Strategies for computing the continuous wavelet transform on massively parallel SIMD arrays are introduced and discussed. The different approaches are theoretically assessed and the results of implementations on a MasPar MP-2 are compared.
TL;DR: This paper introduces an observation relation as an abstraction of point-to-point communication in distributed architectures and considers the description and the validation of a distributed mutual exclusion algorithm.
Abstract: In this paper, we introduce an observation relation as an abstraction of point-to-point communication in distributed architectures. After showing how its semantics and syntax can be embedded within the UNITY approach, we state general observation properties. Finally, we consider the description and the validation of a distributed mutual exclusion algorithm. The relevant aspect of such a validation is the exclusive use of refinements and observation properties for the proof of these refinements.
TL;DR: This paper presents a compound method to construct sparse, time-relaxed, minimal broadcast networks (t-mbn), in which broadcasting can be accomplished in slightly more than the minimum time.
Abstract: In broadcasting, or one-to-all communication, a message originally held in one node of the network must be transmitted to all the other nodes. A minimal broadcast network is a communication network that can transmit a message originated at any node to all other nodes of the network in minimum time. In this paper, we present a compound method to construct sparse, time-relaxed, minimal broadcast networks (t-mbn), in which broadcasting can be accomplished in slightly more than the minimum time. The proposed method generates a new network by connecting a subset of nodes from several copies of a t1-mbn using the structure of another t2-mbn. The objective is to construct a network as sparse as possible satisfying the desired broadcasting time constraint. Computational results illustrate the effectiveness of the proposed method.
TL;DR: Provably fastest parallel algorithms for a number of problems on bipartite permutation graphs are presented here, including connectivity, recognition, isomorphism detection, Hamiltonian path, and shortest path.
Abstract: Provably fastest parallel algorithms for a number of problems on bipartite permutation graphs are presented here. These problems include, among others, connectivity, recognition, isomorphism detection, Hamiltonian path, and shortest path. The algorithms here all run in logarithmic time on CREW PRAM. The processor bound is the same as that for multiplying two matrices in logarithmic time on the model and is subcubic.
TL;DR: This work uses the Network Weather Service, a distributed system that monitors and forecasts resource performance in Computational Grid environments, to measure and predict CPU availability and shows that in the environments it has studied, the availability of clustered interactive resources is significantly more predictable than that of the batch system.
Abstract: In this paper, we focus on the problem of predicting CPU availability for Computational Grid settings in which individual machines may be either time-shared or batch-controlled. We use the Network Weather Service, a distributed system that monitors and forecasts resource performance in Computational Grid environments, to measure and predict CPU availability. We examine the accuracy with which CPU availability can be predicted in an interactive cluster computing environment under production load conditions, and compare these results with a similar study of a production batch system. Our work shows that in the environments we have studied, the availability of clustered interactive resources is significantly more predictable than that of the batch system.
TL;DR: This paper evaluates JPVM, with and without JIT compilation, using some well-known parallel processing benchmarks, and provides some performance indications of using Java in network-based parallel computing environments.
Abstract: PVM for Java (JPVM) is a parallel programming environment that provides a set of Parallel Virtual Machine (PVM)-like class libraries developed using Java. JPVM inherits the attributes of Java, especially Java's support for heterogeneous systems. However, it also has the disadvantage of poor performance when compared to native code. Java performance has been considerably improved with the recent introduction of Just in Time (JIT) compilers. This paper evaluates JPVM, with and without JIT compilation, using some well-known parallel processing benchmarks. It also highlights some of the areas that limit the use of Java in parallel processing on distributed systems. The objective of the paper is to provide some performance indications of using Java in network-based parallel computing environments.
TL;DR: This paper demonstrates a formal derivation of a correct but efficient homomorphic parallel algorithm for a simple language recognition problem known as bracket matching, leading to a novel divide-and-conquer parallel algorithm.
Abstract: It is widely recognized that a key problem of parallel computation is in the development of both efficient and correct parallel software. Although many advanced language features and compilation techniques have been proposed to alleviate the complexity of parallel programming, much effort is still required to develop parallelism in a formal and systematic way. In this paper, we intend to clarify this point by demonstrating a formal derivation of a correct but efficient homomorphic parallel algorithm for a simple language recognition problem known as bracket matching. To the best of our knowledge, our formal derivation leads to a novel divide-and-conquer parallel algorithm for bracket matching.
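One common way to phrase bracket matching as a divide-and-conquer computation is sketched below. It is not the derivation from the paper, and it assumes a single bracket type. Each segment is summarized by a pair (unmatched closing brackets, unmatched opening brackets), and two adjacent summaries are merged with an associative operator, so segments can be processed independently and combined in any grouping. The names `summarize`, `combine`, and the `parts` parameter are hypothetical.

```python
from functools import reduce

def summarize(ch):
    """Map one character to (unmatched_closing, unmatched_opening)."""
    if ch == "(":
        return (0, 1)
    if ch == ")":
        return (1, 0)
    return (0, 0)

def combine(left, right):
    """Associative merge: openings on the left cancel closings on the right."""
    c1, o1 = left
    c2, o2 = right
    matched = min(o1, c2)
    return (c1 + c2 - matched, o1 + o2 - matched)

def summarize_segment(segment):
    return reduce(combine, map(summarize, segment), (0, 0))

def is_balanced(text, parts=4):
    """Split text into parts, summarize each independently, then merge."""
    n = len(text)
    bounds = [n * i // parts for i in range(parts + 1)]
    partials = [summarize_segment(text[bounds[i]:bounds[i + 1]])
                for i in range(parts)]
    return reduce(combine, partials, (0, 0)) == (0, 0)
```

Because `combine` is associative with identity `(0, 0)`, the per-segment summaries could be computed on separate processors and merged in a tree, which is where the divide-and-conquer structure comes from.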