TL;DR: This paper presents a simple scheduling algorithm based on list-scheduling and task-duplication on a bounded number of heterogeneous machines, called Heterogeneous Critical Parents with Fast Duplicator (HCPFD), which outperforms on average all other higher complexity algorithms.
Abstract: Heterogeneous computing systems are an interesting computing platform because a single parallel architecture may not be adequate for exploiting all of a program's available parallelism. In some cases, heterogeneous systems have been shown to produce higher performance for lower cost than a single large machine. Task scheduling is the key issue when aiming at high performance in this kind of system. A large number of scheduling heuristics have been presented in the literature, but most of them target only homogeneous computing systems. In this paper we present a simple scheduling algorithm based on list-scheduling and task-duplication on a bounded number of heterogeneous machines, called Heterogeneous Critical Parents with Fast Duplicator (HCPFD). The analysis and experiments have shown that HCPFD outperforms on average all other higher complexity algorithms.
TL;DR: This article describes data replication on data grids as a static optimization problem, and shows that this problem is NP-hard and non-approximable.
Abstract: Data replication is a well-known technique used in distributed computing to improve access to data and/or system fault-tolerance. Recently, studies of its applications to grid computing have also been initiated. In this article we describe data replication on data grids as a static optimization problem. We show that this problem is NP-hard and non-approximable. We discuss two approaches to solving it, i.e. integer programming and simplifications.
TL;DR: It is proved that $AQ_n - F$ is hamiltonian if $|F| \le 2n-3$ and that $AQ_n - F$ is hamiltonian connected if $|F| \le 2n-4$, and that these bounds are tight.
Abstract: In this paper, we consider the fault hamiltonicity and the fault hamiltonian connectivity of the augmented cubes $AQ_n$. Assume that $F \subseteq V(AQ_n) \cup E(AQ_n)$ and $n \ge 4$. We prove that $AQ_n - F$ is hamiltonian if $|F| \le 2n-3$ and that $AQ_n - F$ is hamiltonian connected if $|F| \le 2n-4$. Moreover, these bounds are tight.
TL;DR: This paper analyzes major recent trends and changes in the High Performance Computing (HPC) market place and indicates renewed broad interest in the scientific HPC community for new hardware architectures and new programming paradigms.
Abstract: In this paper we analyze major recent trends and changes in the High Performance Computing (HPC) market place. The introduction of vector computers started the era of 'Supercomputing'. The initial success of vector computers in the seventies was driven by raw performance. Massive parallel systems (MPP) became successful in the early nineties due to their better price/performance ratios, which were enabled by the attack of the 'killer-micros'. The success of microprocessor-based systems built on the shared memory concept (referred to as symmetric multiprocessors (SMP)), even for the very high-end systems, was the basis for the emerging cluster concepts in the early 2000s. Within the first half of this decade, clusters of PCs and workstations have become the prevalent architecture for many HPC application areas on all ranges of performance. However, the Earth Simulator vector system demonstrated that many scientific applications could benefit greatly from other computer architectures. At the same time there is renewed broad interest in the scientific HPC community for new hardware architectures and new programming paradigms. The IBM BlueGene/L system is one early example of a shifting design focus for large-scale systems. The DARPA HPCS program has the declared goal of building a Petaflops computer system by the end of the decade using novel computer architectures.
TL;DR: It is found that the algorithm performs well only in environments with enough resources relative to the density of requests in the network, and that selecting the available matching resource whose attributes are closest to the required attributes can improve resource utilization.
Abstract: We present an alternative algorithm for fully decentralized resource discovery in Grid computing, which enables the sharing, selection, and aggregation of a wide variety of geographically distributed computational resources. Our algorithm is based on a simple unicast request transmission that can be easily implemented. The addition of a reservation algorithm enables the resource discovery mechanism to find more available matching resources. The deadline for resource discovery is set by a time-to-live value. With our algorithm, a single resource is automatically selected for each request when multiple available resources are found on the forward path of resource discovery, so the user need not manually select a resource from a large list of available matching resources. We evaluated the performance of our algorithms by comparing them with the first-found-first-served algorithm. The experimental results show that the percentages of requests that can be supported by the two approaches do not differ. However, our algorithm can improve either resource utilization or turnaround time, depending on how the resource is selected. The variant that selects the available matching resource whose attributes are closest to the required attributes can improve resource utilization, whereas the variant that selects the available matching resource with the highest performance can improve turnaround time. However, the performance of our algorithm depends on the density of resources in the network. Our algorithm seems to perform well only in environments with enough resources relative to the density of requests in the network.
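The two selection policies contrasted in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Resource` record, its `cpu` and `memory` attributes, the matching rule, and the "least surplus" distance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    cpu: float     # hypothetical attribute (e.g. GHz)
    memory: float  # hypothetical attribute (e.g. GB)


def matches(res, need):
    """Assumed matching rule: every required attribute is met or exceeded."""
    return res.cpu >= need.cpu and res.memory >= need.memory


def closest_match(candidates, need):
    """Pick the matching resource with the least surplus over the request."""
    ok = [r for r in candidates if matches(r, need)]
    return min(ok, key=lambda r: (r.cpu - need.cpu) + (r.memory - need.memory),
               default=None)


def fastest_match(candidates, need):
    """Pick the matching resource with the highest (assumed) performance."""
    ok = [r for r in candidates if matches(r, need)]
    return max(ok, key=lambda r: r.cpu, default=None)


# Resources found along the forward path of one discovery request.
found = [Resource("a", 2.0, 4), Resource("b", 3.5, 16), Resource("c", 1.0, 2)]
need = Resource("request", 1.8, 4)
print(closest_match(found, need).name)  # least surplus
print(fastest_match(found, need).name)  # highest cpu
```

Under these assumptions, the closest-match policy leaves larger resources free for more demanding requests, while the fastest-match policy favors the turnaround time of the current request.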
TL;DR: In PGSA, the entire population is divided into sub-populations, and in each sub-population the algorithm uses the local search ability of simulated annealing after crossover and mutation to optimize continuous problems.
Abstract: This paper presents a parallel genetic simulated annealing (PGSA) algorithm that has been developed and applied to optimize continuous problems. In PGSA, the entire population is divided into sub-populations, and in each sub-population the algorithm uses the local search ability of simulated annealing after crossover and mutation. The best individuals of each subpopulation are migrated to neighboring ones after a certain number of epochs. An implementation of the algorithm is discussed and the performance is evaluated against a standard set of test functions. PGSA shows some remarkable improvement in comparison with the conventional parallel genetic algorithm and the breeder genetic algorithm (BGA).
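The structure described above (sub-populations, a simulated-annealing step after crossover and mutation, and periodic migration of the best individuals to neighbors) might be sketched as below. This is only a sequential illustration under stated assumptions, not the PGSA implementation from the paper: the arithmetic crossover, Gaussian mutation, Metropolis acceptance rule, geometric cooling, ring migration topology, and all parameter names and defaults are choices made for the example.

```python
import math
import random


def sphere(x):
    """A standard continuous test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


def pgsa(f, dim, islands=4, size=10, epochs=5, gens=20,
         temp=1.0, cooling=0.9, seed=0):
    rng = random.Random(seed)
    # One sub-population ("island") per would-be processor.
    pops = [[[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(size)]
            for _ in range(islands)]
    best = min((ind for pop in pops for ind in pop), key=f)
    for _ in range(epochs):
        for pop in pops:  # islands run sequentially here; they are independent
            t = temp
            for _ in range(gens):
                for i in range(size):
                    mate = rng.choice(pop)
                    # Arithmetic crossover followed by Gaussian mutation.
                    child = [(a + b) / 2 + rng.gauss(0, 0.1)
                             for a, b in zip(pop[i], mate)]
                    delta = f(child) - f(pop[i])
                    # Simulated-annealing acceptance: always take improvements,
                    # take worse children with probability exp(-delta / t).
                    if delta <= 0 or rng.random() < math.exp(-delta / t):
                        pop[i] = child
                t *= cooling
            # Record the best individual seen at the end of this island's run.
            best = min([best] + pop, key=f)
        # Migration on an assumed ring: each island's best replaces the
        # worst individual of the next island.
        champions = [min(pop, key=f) for pop in pops]
        for k, pop in enumerate(pops):
            worst = max(range(size), key=lambda i: f(pop[i]))
            pop[worst] = list(champions[k - 1])
    return best, f(best)


best_x, best_f = pgsa(sphere, dim=3, seed=1)
print(best_x, best_f)
```

In a real parallel setting each island would run on its own process or node and only the migration step would require communication.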
TL;DR: The paper presents a two-level market scheme for grid resource pricing, an iterative algorithm used to perform optimal resource allocation, which outperforms a one-level market scheme in terms of task completion time and resource allocation efficiency.
Abstract: This paper investigates the interactions between agents representing users, services and resources to solve resource scheduling optimization in a computational grid. In order to reduce the computational complexity, we further decompose the grid resource allocation optimization into subproblems: grid user agent-grid service agent in the service market and grid service agent-grid resource agent in the resource market. The two-level market converges to its optimal points; a globally optimal point is achieved. Total user benefit of the computational grid is maximized when the equilibrium prices are obtained through the service market level optimization and resource market level optimization. It demonstrates a practical approach to market responsive resource pricing that can benefit grid providers and users alike. The paper presents two-level market grid resource pricing, an iterative algorithm used to perform optimal resource allocation. The experiment shows that the two-level market based resource pricing scheme outperforms a one-level market scheme in terms of task completion time and resource allocation efficiency.
TL;DR: This paper provides an algorithm for finding good distributions of threads to tasks and discusses how to implement nested parallelism in OpenMP.
Abstract: Many problems have multiple layers of parallelism. The outer-level may consist of few and coarse-grained tasks. Next, each of these tasks may also be rich in parallelism, and be split into a number of fine-grained tasks, which again may consist of even finer subtasks, and so on. Here we argue and demonstrate by examples that utilizing multiple layers of parallelism may give much better scaling than if one restricts oneself to only one level of parallelism. Two non-trivial issues for multi-level parallelism are load balancing and implementation. In this paper we provide an algorithm for finding good distributions of threads to tasks and discuss how to implement nested parallelism in OpenMP.
TL;DR: A parallel hierarchical clustering algorithm for gene expression data is presented; it addresses the inability of previous implementations to handle large data sets within reasonable time and memory resources.
Abstract: Identification of groups of genes that manifest similar expression patterns is a key step in the analysis of gene expression data. Hierarchical clustering was developed for that purpose. A fundamental problem with previous implementations of this clustering method is their limited ability to handle large data sets within reasonable time and memory resources. In this paper, we present a parallel approach for solving this problem. Implementation of the parallel algorithm is illustrated on data from high dimensional microarray experiments related to gene expression in cancerous disease and Arabidopsis seedling growth. The results show considerable reduction in computational time and inter-node communication overhead, especially for large data sets.
TL;DR: The results are presented of a series of tests, on different platforms, of simplified codes that reproduce non-trivial issues of the present hardware/software platforms for parallel processing and can be used as a starting point in the search for a possible solution.
Abstract: Agent-based models, an emerging paradigm for the simulation of complex systems, appear very suitable for parallel processing. However, during the parallelization of a simulator of financial markets, we found that some features of these codes highlight non-trivial issues of the present hardware/software platforms for parallel processing. Here we present the results of a series of tests, on different platforms, of simplified codes that reproduce such problems and can be used as a starting point in the search for a possible solution.
TL;DR: The results, expressed in terms of execution time, show that an adequate reordering of the data improves the efficiency of applying register blocking, thereby reducing the execution time for the sparse algebra code considered.
Abstract: The combination of techniques based on reordering data with classic code restructuring techniques for increasing the locality in the execution of sparse algebra codes is studied in this paper. The reordering techniques are based on first modeling the locality at run time and then applying a heuristic to increase it. After this, a code restructuring technique specially tuned for sparse algebra codes, called register blocking, is applied. The product of a sparse matrix by a dense vector (SpMxV) is the code studied on different monoprocessors and distributed memory multiprocessors. The combination of both techniques was tested on a broad set of matrices from real problems and known repositories. The results, expressed in terms of execution time, show that an adequate reordering of the data improves the efficiency of applying register blocking, thereby reducing the execution time for the sparse algebra code considered.
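For readers unfamiliar with the SpMxV kernel named above, a minimal sketch follows. It assumes the common compressed sparse row (CSR) layout, which the abstract does not specify, and it shows neither the reordering heuristic nor register blocking; the function and argument names are illustrative.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Multiply a CSR sparse matrix by a dense vector x.

    values  : nonzero entries, stored row by row
    col_idx : column index of each nonzero
    row_ptr : row i owns the nonzeros values[row_ptr[i]:row_ptr[i + 1]]
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx, so the access pattern on x depends
            # on where the nonzeros sit in the matrix.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# The 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 17.0]
```

The indirect reads of `x` are what make locality in this kernel sensitive to the ordering of rows and columns.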
TL;DR: Object- and page-based distributed shared memories are compared, and optimizations for efficient memory use are discussed.
Abstract: We present recent results in the application of distributed shared memory to image parallel ray tracing on clusters. Image parallel rendering is traditionally limited to scenes that are small enough to be replicated in the memory of each node, because any processor may require access to any piece of the scene. We solve this problem by making all of a cluster’s memory available through software distributed shared memory layers. With gigabit ethernet connections, this mechanism is sufficiently fast for interactive rendering of multi-gigabyte datasets. Object- and page-based distributed shared memories are compared, and optimizations for efficient memory use are discussed.
TL;DR: An alternative approach that translates OpenMP to Global Arrays (GA) is introduced, explaining the basic strategy and a new directive INVARIANT is proposed to provide information about the dynamic scope of data access patterns.
Abstract: This paper discusses a novel approach to implementing OpenMP on clusters. Traditional approaches to do so rely on Software Distributed Shared Memory systems to handle shared data. We discuss these and then introduce an alternative approach that translates OpenMP to Global Arrays (GA), explaining the basic strategy. GA requires a data distribution. We do not expect the user to supply this; rather, we show how we perform data distribution and work distribution according to the user-supplied OpenMP static loop schedules. An inspector-executor strategy is employed for irregular applications in order to gather information on accesses to potentially non-local data, group non-local data transfers and overlap communications with local computations. Furthermore, a new directive INVARIANT is proposed to provide information about the dynamic scope of data access patterns. This directive can help us generate efficient codes for irregular applications using the inspector-executor approach. We also illustrate how to deal with some hard cases containing reshaping and strided accesses during the translation. Our experiments show promising results for the corresponding regular and irregular GA codes.
TL;DR: The subject of this paper is to show the high efficiency of asynchronism for parallel iterative algorithms in the context of grid computing, that is to say, with machines scattered on a broad geographical scale.
Abstract: The subject of this paper is to show the high efficiency of asynchronism for parallel iterative algorithms in the context of grid computing, that is to say, with machines scattered on a broad geographical scale. The question is: does asynchronism help to reduce the communication penalty and the overall computation time of such a given algorithm? The asynchronous programming model is evaluated on two test problems representing two important classes of scientific applications: a stationary linear problem and a non-stationary non-linear problem. They are implemented with a multi-threaded environment and tested on a set of distant heterogeneous machines. Several experiments have been performed allowing us to compare the performances of such asynchronous algorithms and also to analyze their behavior and extract the main possible optimizations for their use in a grid computing context.
TL;DR: A new parallel multiresolution volume rendering algorithm that can reduce the run-time communication cost to a minimum and ensure a well-balanced workload among processors when visualizing gigabytes of data with arbitrary error tolerances is presented.
Abstract: We present a new parallel multiresolution volume rendering algorithm for visualizing large data sets. Using the wavelet transform, the raw data is first converted to a multiresolution wavelet tree. To eliminate the data dependency between processors at run-time, and achieve load-balanced rendering, we design a novel algorithm to partition the tree and distribute the data along a hierarchical space-filling curve with error-guided bucketization. Further optimization is achieved by storing reconstructed data at pre-selected tree nodes for each processor based on the available storage resources to reduce the overall wavelet reconstruction cost. At run time, the wavelet tree is first traversed according to the user-specified error tolerance. Data blocks of different resolutions that satisfy the error tolerance are then decompressed and rendered to compose the final image in parallel. Experimental results showed that our algorithm can reduce the run-time communication cost to a minimum and ensure a well-balanced workload among processors when visualizing gigabytes of data with arbitrary error tolerances.
TL;DR: JJPF (the Java/Jini Parallel Framework) is a framework that can run stream parallel applications on several parallel-distributed architectures and achieves almost perfect, fully automatic load balancing in the execution of this kind of application.
Abstract: JJPF (the Java/Jini Parallel Framework) is a framework that can run stream parallel applications on several parallel-distributed architectures. JJPF is actually a distributed execution server. It uses JINI to recruit the computational resources needed to compute parallel applications. Parallel applications can be run on JJPF provided they exploit parallelism according to an arbitrary nesting of task farm and pipeline skeletons/patterns. JJPF achieves almost perfect, fully automatic load balancing in the execution of this kind of application. It also transparently handles any number of node and network faults. Scalability and efficiency results are shown on workstation networks, both with a synthetic (embarrassingly parallel) image processing application and with a real (not embarrassingly parallel) page ranking application.
TL;DR: The primary goal of this work is to identify and abstract the commonalities present in various hierarchical methods using octrees, design efficient parallel algorithms for them, and encapsulate them in a software library.
Abstract: We describe the design and implementation of efficient parallel algorithms, and a software library for the parallel implementation of compressed octree data structures. Octrees are widely used in supporting hierarchical methods for scientific applications such as the N-body problem, molecular dynamics and smoothed particle hydrodynamics. The primary goal of our work is to identify and abstract the commonalities present in various hierarchical methods using octrees, design efficient parallel algorithms for them, and encapsulate them in a software library. We designed provably efficient parallel algorithms and implementation strategies that perform well irrespective of the spatial distribution of data in the computational domain. The library will enable rapid development of applications, allowing application developers to use efficient parallel algorithms developed for this purpose, without the necessity of having detailed knowledge of the algorithms or of implementing them. The software is developed in C using the Message Passing Interface (MPI). We report experimental results on an IBM xSeries parallel computer.
TL;DR: A new interactive parallel visualization method for large particle datasets by directly rendering individual particles based on a parallel rendering cluster that provides real time interaction and interactive exploration of large datasets, which has been a challenge for scientific visualization and other real time data mining applications.
Abstract: This paper presents a new interactive parallel visualization method for large particle datasets by directly rendering individual particles based on a parallel rendering cluster. A frame rate of 9 frames-per-second is achieved for 256^3 particles using 7 render nodes and a display node. This provides real time interaction and interactive exploration of large datasets, which has been a challenge for scientific visualization and other real time data mining applications. A dynamic data distribution technique is designed for highlighting a subset of the particle volume. It maintains load balance of the system and minimizes network traffic by reconfiguring the rendering chain. Experiments show that on a given subset, interactive manipulation of the subset usually requires less than 3% of the particles inside the subset to be redistributed among all render nodes. The method can be easily extended to other large datasets such as hydrodynamic turbulence, fluid dynamics, and so on.
TL;DR: Simulation results show that one of the four algorithms improves throughput by up to 92% over simple path selection algorithms, and that policies that remove paths crossing the bottleneck channels are more efficient than ones that keep paths crossing channels that are not crowded.
Abstract: System Area Networks (SANs), which usually accept irregular topologies, have been used to connect nodes in PC/WS clusters or high-performance storage systems. Although routing algorithms for SANs usually find alternative paths, SANs usually accept only deterministic routing. Thus, a path selection algorithm, which chooses a single path from the alternatives, becomes essential for advanced routing in SANs. However, only a few studies of path selection have been done, and only for SANs without virtual channels, so its impact is not well analyzed. In this paper, (1) we propose four path selection algorithms that use different concepts to distribute paths in SANs with virtual channels, and (2) we investigate the performance impact of various path selection algorithms through a flit-level simulation. Simulation results show that one of the four algorithms improves throughput by up to 92% over simple path selection algorithms, and that policies that remove paths crossing the bottleneck channels are more efficient than ones that keep paths crossing channels that are not crowded.
TL;DR: This paper presents and compares a series of strategies to search the optimal tile size guided by an analytical model of the whole memory hierarchy and the CPU behavior and shows that these strategies find better tile sizes than traditional heuristic approaches proposed in the literature while requiring a small compile-time overhead.
Abstract: As the memory bottleneck problem continues to grow, so does the relevance of the techniques that help improve program locality. A well-known technique in this category is tiling, which decomposes data sets to be used several times in a computation into a series of tiles that are reused before proceeding to process the next tile. This way, capacity misses are avoided. Finding the optimal tile size is a complex task. In this paper we present and compare a series of strategies to search the optimal tile size guided by an analytical model of the whole memory hierarchy and the CPU behavior. Our experiments show that our strategies find better tile sizes than traditional heuristic approaches proposed in the literature while requiring a small compile-time overhead. Iterative compilation can yield better results, but at the expense of very large overheads.
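The loop structure that tiling produces can be illustrated with a matrix product. This is a minimal sketch, not the code studied in the paper: the choice of matrix multiplication, the single square `tile` parameter, and the loop order are assumptions, and pure Python only shows the restructuring, not the cache benefit.

```python
def matmul_tiled(a, b, tile):
    """Compute a @ b for lists of lists, visiting the data tile by tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay
    # inside one tile, so its elements are reused before moving on.
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_tiled(a, b, tile=1))  # [[19.0, 22.0], [43.0, 50.0]]
```

The result does not depend on `tile`; only the order of memory accesses changes, which is why the tile size can be searched for purely as a performance parameter.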
TL;DR: The Grid-DBMS concept, a framework for dynamic data management in a grid environment, is introduced, highlighting its requirements, architecture, components and services, together with an overview of the Grid Relational Catalog Project (GRelC) developed at the CACT/ISUFI of the University of Lecce.
Abstract: Today many DataGrid applications need to manage and process a very large amount of data distributed across multiple grid nodes and stored in heterogeneous databases. Grids encourage and promote the publication, sharing and integration of scientific data (distributed across several Virtual Organizations) in a more open manner than is currently the case, and many e-Science projects have an urgent need to interconnect legacy and independently operated databases through a set of data access and integration services. The complexity of data management within a Computational Grid comes from the distribution, scale and heterogeneity of data sources. A set of dynamic and adaptive services could address specific issues related to automatic data management, providing high performance and transparency as well as fully exploiting a grid infrastructure. These services should involve data migration and integration, discovery of data sources and so on, providing a transparent and dynamic layer of data virtualization. In this paper we introduce the Grid-DBMS concept, a framework for dynamic data management in a grid environment, highlighting its requirements, architecture, components and services. We also present an overview of the Grid Relational Catalog Project (GRelC) developed at the CACT/ISUFI of the University of Lecce, which represents a partial implementation of a Grid-DBMS for the Globus Community.
TL;DR: An efficient parallel iterative method for the finite-element method has been developed for symmetric multiprocessor (SMP) cluster architectures with vector processors such as the Earth Simulator, and the effect of the number of colors in reordering has been evaluated on various types of computers.
Abstract: An efficient parallel iterative method for the finite-element method has been developed for symmetric multiprocessor (SMP) cluster architectures with vector processors such as the Earth Simulator. The method is based on a three-level hybrid parallel programming model, including message passing for inter-SMP node communication, loop directives by OpenMP for intra-SMP node parallelization and vectorization for each processing element (PE). Simple 3D linear elastic problems with more than 2.2x10^9 DOF have been solved using the 3x3 block ICCG(0) method with additive Schwarz domain decomposition and PDJDS/CM-RCM reordering on 176 nodes of the Earth Simulator, achieving performance of 3.80 TFLOPS. Furthermore, the effect of the number of colors in reordering has been evaluated on various types of computers.
TL;DR: The design and implementation of a library based on an integrated approach to dynamic load balancing are presented; the approach combines the advantages of optimizing data migration via novel dynamic loop scheduling strategies with the advances in resource management and task migration capabilities offered by a recently developed parallel runtime system.
Abstract: This paper presents the design and implementation of a library based on an integrated approach to dynamic load balancing. This approach combines the advantages of optimizing data migration via novel dynamic loop scheduling strategies with the advances in resource management and task migration capabilities offered by a recently developed parallel runtime system. The performance improvements obtained by the use of this library have been investigated by its use in three scientific applications: the N-body simulations, the profiling of automatic quadrature routines, and the heat solver in an unstructured grid. The experimental results obtained underscore the significance of using such an integrated approach, as well as the benefits of using the library especially in applications characterized by irregular and unpredictable behavior.
TL;DR: This paper discusses workflow management, meaning automatic workflow creation and coordinated workflow execution, and resource discovery in an intelligent grid environment.
Abstract: A computational grid provides coordinated and transparent access to computing resources for grid users. Workflow management and resource discovery are two important functions of an intelligent grid. Workflow management refers to automatic workflow creation and coordinated workflow execution, and resource discovery facilitates resource allocation and claiming. In this paper we discuss workflow management and resource discovery in an intelligent grid environment.
TL;DR: It is shown that low cost FPGA nodes interconnected through a standard Ethernet network may advantageously compete against high performance clusters, and it is proposed to replace PCs with re-configurable hardware closely connected to a hard disk.
Abstract: Genomic data are growing exponentially and are scanned daily by thousands of biologists. To reduce the scan time, efficient parallelism can be exploited by dispatching data among a cluster of processing units, each able to scan its own data locally and independently. While PC clusters are well suited to this type of parallelism, we propose to replace PCs with re-configurable hardware closely connected to a hard disk. We show that low cost FPGA nodes interconnected through a standard Ethernet network may advantageously compete against high performance clusters. A prototype of 48 re-configurable processing nodes has been tested on two genomic applications: a content-based similarity search and a pattern search.
TL;DR: Five resource allocation heuristics to derive near-optimal solutions to the problem of statically assigning computing resources to the subtasks of an application that has an execution time constraint are presented, evaluated, and compared.
Abstract: An ad hoc grid is a heterogeneous computing system composed of mobile devices. Each computing resource is constrained in battery energy. The problem being studied is to statically assign computing resources to the subtasks of an application that has an execution time constraint, when the resources are oversubscribed. All subtasks must be executed; to accommodate this in an oversubscribed environment, each subtask has two versions: the primary or full version, and the secondary or degraded version. The secondary version utilizes only 10% of the resources that the primary version requires, and produces only 10% of the data output for the subsequent children subtasks. Thus, the degraded version (secondary version) represents a reduced capability of lesser overall value, while consuming fewer resources. The goal is to assign resources so that the application meets an execution time constraint and the battery energy constraint while minimizing the number of degraded versions used. Five resource allocation heuristics to derive near-optimal solutions to this problem are presented, evaluated, and compared.
TL;DR: A comprehensive parallel approach based on the simulation of stochastic differential equations enabling the optimization of the global objective function, through an annealing process, that yields computation times compatible with clinical routine.
Abstract: The deformable matching of 3D medical images remains a difficult problem due to the high dimension of both geometric transformations and data. The matching problem is usually expressed as the minimization of a highly non-linear energy (objective) function, yielding a hard, computationally intensive optimization problem. This paper presents a comprehensive parallel approach that yields computation times compatible with clinical routine. The image matching is based on the simulation of stochastic differential equations, enabling the optimization of the global objective function through an annealing process. The resulting algorithm allows a fully parallel sampling of the parameters to be optimized. Due to the large number of parameters involved in deformable matching, this approach is naturally suited to massively parallel implementations. We present implementation issues and timing analysis on an MIMD parallel processing computer (SGI Origin 2000). The performance of the approach is assessed on real data, using 3D brain MR images from different individuals. Besides yielding accurate registrations, the parallel algorithm exhibits excellent relative speedups.