TL;DR: Two Data Grid services are described: GridFTP, a high-speed transport service that extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access, and a replica management service that integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas.
Abstract: An emerging class of data-intensive applications involves the geographically dispersed extraction of complex scientific information from very large collections of measured or computed data. Such applications arise, for example, in experimental physics, where the data in question is generated by accelerators, and in simulation science, where the data is generated by supercomputers. So-called Data Grids provide essential infrastructure for such applications, much as the Internet provides essential services for applications such as e-mail and the Web. We describe here two services that we believe are fundamental to any Data Grid: reliable, high-speed transport and replica management. Our high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access. Our replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. We present the design of both services and also preliminary performance results. Our implementations exploit security and other services provided by the Globus Toolkit.
TL;DR: The block partitioning and scheduling problem for sparse parallel factorization without pivoting is considered, with two aims: the scalability of the parallel solver and the compromise between memory overhead and efficiency.
Abstract: Solving large sparse symmetric positive definite systems of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. The block partitioning and scheduling problem for sparse parallel factorization without pivoting is considered. There are two major aims to this study: the scalability of the parallel solver, and the compromise between memory overhead and efficiency. Parallel experiments on a large collection of irregular industrial problems validate our approach.
TL;DR: By using multi-sampling in a new probabilistic algorithm, more accurate and efficient approximations of CVTs are obtained without the need to explicitly construct Voronoi diagrams.
Abstract: Centroidal Voronoi tessellations (CVTs) are Voronoi tessellations of a region such that the generating points of the tessellations are also the centroids of the corresponding Voronoi cells. In this paper, some probabilistic methods for determining CVTs and their parallel implementations on distributed memory systems are presented. By using multi-sampling in a new probabilistic algorithm we introduce, more accurate and efficient approximations of CVTs are obtained without the need to explicitly construct Voronoi diagrams. The new algorithm lends itself well to parallelization, i.e., near perfect linear speedup in the number of processors is achieved. The results of computational experiments performed on a CRAY T3E-600 system are provided which illustrate the superior sequential and parallel performance of the new algorithm when compared to existing algorithms. In particular, for the same amount of work, the new algorithms produce significantly more accurate CVTs.
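The idea of approximating a CVT by sampling, without ever building a Voronoi diagram, can be illustrated with a minimal sketch. This is a generic sample-based Lloyd-style iteration on the unit square, not the algorithm of the paper: the function names, the uniform density, the sample counts, and the plain "replace each generator by the mean of its samples" update are all assumptions made for illustration.

```python
import random

def nearest(generators, p):
    """Index of the generator closest to point p (squared Euclidean distance)."""
    return min(
        range(len(generators)),
        key=lambda i: (generators[i][0] - p[0]) ** 2 + (generators[i][1] - p[1]) ** 2,
    )

def sampled_cvt(k, iterations=50, samples=2000, seed=0):
    """Approximate a k-generator CVT of the unit square by random sampling.

    Hypothetical helper: each iteration draws `samples` uniform points,
    assigns every point to its nearest generator, and moves each generator
    to the mean of the points assigned to it (an estimate of the cell centroid).
    """
    rng = random.Random(seed)
    gens = [(rng.random(), rng.random()) for _ in range(k)]
    for _ in range(iterations):
        sums = [[0.0, 0.0, 0] for _ in range(k)]  # per generator: sum x, sum y, count
        for _ in range(samples):
            p = (rng.random(), rng.random())
            i = nearest(gens, p)
            sums[i][0] += p[0]
            sums[i][1] += p[1]
            sums[i][2] += 1
        # Generators that received no samples stay where they are.
        gens = [
            (sx / c, sy / c) if c else g
            for g, (sx, sy, c) in zip(gens, sums)
        ]
    return gens
```

Because the samples in one iteration are independent, the inner sampling loop could be divided among processors and only the per-generator sums combined afterwards, which is one plausible reason sampling-based schemes of this kind parallelize well.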
TL;DR: A new parallelization strategy based on a dynamic two-level scheduling scheme that aims at minimizing cache conflicts and interprocessor communication costs and, at the same time, maximizing processor load balance and Level-3 BLAS performance is explored.
Abstract: The PARDISO package is a mathematical library of OpenMP routines for the parallel direct solution of large sparse linear systems of equations. One objective of PARDISO is to achieve a high efficiency on shared memory multiprocessing systems. A new parallelization strategy based on a dynamic two-level scheduling scheme is therefore explored. The method aims at minimizing cache conflicts and interprocessor communication costs and, at the same time, maximizing processor load balance and Level-3 BLAS performance. The synchronization events are reduced by one order of magnitude compared with a one-level scheduling strategy. This results in an efficient parallel sparse LU decomposition method. An overview of the two-level scheduling algorithm and the key algorithmic features of the solver PARDISO is given. Finally, numerical results and a comparison with another software package demonstrate the performance.
TL;DR: Imperial College e-Science Networked Infrastructure (ICENI), a Grid middleware framework developed within the London e-Science Centre, is described, and the effectiveness of this architecture is demonstrated through the high-level specification and solution of a set of linear equations by automatic selection of optimal resources and implementations.
Abstract: Effective exploitation of Computational Grids can only be achieved when applications are fully integrated with the Grid middleware and the underlying computational resources. Fundamental to this exploitation is information. Information about the structure and behaviour of the application, the capability of the computational and networking resources, and the availability and access to these resources by an individual, a group or an organisation. In this paper we describe Imperial College e-Science Networked Infrastructure (ICENI), a Grid middleware framework developed within the London e-Science Centre. ICENI is a platform-independent framework that uses open and extensible XML derived protocols, within a framework built using Java and Jini, to explore effective application execution upon distributed federated resources. We match a high-level application specification, defined as a network of components, to an optimal combination of the currently available component implementations within our Grid environment, by using composite performance models. We demonstrate the effectiveness of this architecture through the high-level specification and solution of a set of linear equations by automatic selection of optimal resources and implementations.
TL;DR: The Parallel Design Patterns (PDP) process, the basis of the CO2P3S parallel programming system, combines these techniques in a layered development model, creating a new approach to parallel programming that addresses correctness and openness in a unique way.
Abstract: Object-oriented programming, design patterns, and frameworks are abstraction techniques that have been used to reduce the complexity of sequential programming. This paper describes our approach of applying these three techniques to the more difficult parallel programming domain. The Parallel Design Patterns (PDP) process, the basis of the CO2P3S parallel programming system, combines these techniques in a layered development model. The result is a new approach to parallel programming that addresses correctness and openness in a unique way. At the topmost development layer, a customized framework is generated from a design pattern specification of the parallel structure of the program. This framework encapsulates all of the structural details of the pattern, including communication and synchronization, to prevent programmer errors and ensure correctness. Lower layers are used only for performance tuning to make the code as efficient as necessary. This paper describes CO2P3S, based on the PDP process, and demonstrates it using an example application. We also provide results from a usability study of CO2P3S.
TL;DR: Results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.
Abstract: This paper describes a software architecture that allows image processing researchers to develop parallel applications in a transparent manner. The architecture's main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers. Since the library has an application programming interface identical to that of an existing sequential library, all parallelism is completely hidden from the user. The first part of the paper discusses implementation aspects of the parallel library, and shows how sequential as well as parallel operations are implemented on the basis of so-called parallelizable patterns. A library built in this manner is easily maintainable, as extensive code redundancy is avoided. The second part of the paper describes the application of performance models to ensure efficiency of execution on all target platforms. Experiments show that for a realistic application performance predictions are highly accurate. These results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.
TL;DR: An overview of the recent research in video compression using parallel processing is presented, outlining the basic philosophy of each approach and providing examples, and suggesting future research directions.
Abstract: Driven by the rapidly increasing demand for audio-visual applications, digital video compression technology has become a mature field, offering several available products based on both hardware and software implementations. Taking advantage of spatial, temporal, and statistical redundancies in video data, a video compression system aims to maximize the compression ratio while maintaining a high picture quality. Despite the tremendous progress in this area, video compression remains a challenging research problem due to its computational requirements and also because of the need for higher picture quality at lower data rates. Designing efficient coding algorithms continues to be a prolific area of research. To circumvent the computational requirement, researchers have resorted to parallel processing with a variety of approaches using dedicated parallel VLSI architectures as well as software on general-purpose available multiprocessor systems. Despite the availability of fast single processors, parallel processing helps to explore advanced algorithms and to build more sophisticated systems. This paper presents an overview of the recent research in video compression using parallel processing. The paper provides a discussion of the basic compression techniques, existing video coding standards, and various parallelization approaches. Since video compression is multi-step in nature using various algorithms, parallel processing can be exploited at an individual algorithm or at a complete system level. The paper covers a broad spectrum of such approaches, outlining the basic philosophy of each approach and providing examples. We contrast these approaches when possible, highlight their pros and cons, and suggest future research directions. While the emphasis of this paper is on software-based methods, a significant discussion of hardware and VLSI is also included.
TL;DR: This paper presents a compendium of frameworks and methods developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.
Abstract: Analysis of data is an important step in understanding and solving a scientific problem. Analysis involves extracting the data of interest from all the available raw data in a dataset and processing it into a data product. However, in many areas of science and engineering, a scientist's ability to analyze information is increasingly becoming hindered by dataset sizes. The vast amount of data in scientific datasets makes it a difficult task to efficiently access the data of interest, and manage potentially heterogeneous system resources to process the data. Subsetting and aggregation are common operations executed in a wide range of data-intensive applications. We argue that common runtime and programming support can be developed for applications that query and manipulate large datasets. This paper presents a compendium of frameworks and methods we have developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.
TL;DR: The development of Grid programming environments that would enable programmers to efficiently exploit this technology is an important and hot research issue and the most important approaches/projects conducted in this field worldwide are reviewed.
Abstract: Grid computing enables the development of large scientific applications on an unprecedented scale. Grid-aware applications, also called meta-applications or multi-disciplinary applications, make use of coupled computational resources that are not available at a single site. In this light, the Grids let scientists solve larger or new problems by pooling together resources that could not be coupled easily before. It is well known that designing and implementing efficient distributed/parallel applications on high-performance computers is still usually a very time-consuming task. Grid computing makes the situation worse. Consequently, the development of Grid programming environments that would enable programmers to efficiently exploit this technology is an important and hot research issue. After an introduction on the main Grid programming issues, this paper will review the most important approaches/projects conducted in this field worldwide.
TL;DR: The purpose of this introductory article is to provide an overview of the main issues in parallel data intensive computing in scientific and commercial applications and to encourage the reader to go into the more in-depth articles later in this special issue.
Abstract: Applications that explore, query, analyze, visualize, and, in general, process very large scale data sets are known as Data Intensive Applications. Large scale data intensive computing plays an increasingly important role in many scientific activities and commercial applications, whether it involves data mining of commercial transactions, experimental data analysis and visualization, or intensive simulation such as climate modeling. By combining high performance computation, very large data storage, high bandwidth access, and high-speed local and wide area networking, data intensive computing enhances the technical capabilities and usefulness of most systems. The integration of parallel and distributed computational environments will produce major improvements in performance for both computing intensive and data intensive applications in the future. The purpose of this introductory article is to provide an overview of the main issues in parallel data intensive computing in scientific and commercial applications and to encourage the reader to go into the more in-depth articles later in this special issue.
TL;DR: This paper presents an efficient linear-time sequential algorithm for constructing Hamiltonian paths between two given vertices in meshes with horizontal size m and vertical size n and shows that the algorithm can be optimally parallelized to obtain a constant-time parallel algorithm on the weakest parallel machine without need of inter-processor communication.
Abstract: This paper presents an efficient linear-time sequential algorithm for constructing Hamiltonian paths between two given vertices in meshes with horizontal size m and vertical size n. The algorithm first partitions the given mesh into a number of submeshes in constant steps, and then constructs a Hamiltonian cycle or path in each submesh and combines them together to become a complete Hamiltonian path in mn steps. Our algorithm has improved the previous algorithm [6] by reducing the number of partition steps from O(m + n) to only a constant. Moreover, we show that our algorithm can be optimally parallelized to obtain a constant-time parallel algorithm on the weakest parallel machine without need of inter-processor communication, while this cannot be achieved for the previous algorithm.
TL;DR: This paper discusses recent work on building component interfaces and implementations in parallel numerical toolkits for mesh manipulations, discretization, linear algebra, and optimization for high-performance simulations in computational science.
Abstract: High-performance simulations in computational science often involve the combined software contributions of multidisciplinary teams of scientists, engineers, mathematicians, and computer scientists. One goal of component-based software engineering in large-scale scientific simulations is to help manage such complexity by enabling better interoperability among codes developed by different groups. This paper discusses recent work on building component interfaces and implementations in parallel numerical toolkits for mesh manipulations, discretization, linear algebra, and optimization. We consider several motivating applications involving partial differential equations and unconstrained minimization to demonstrate this approach and evaluate performance.
TL;DR: A data and task parallel low-level image processing environment for distributed memory systems that is parallelized by data decomposition using algorithmic skeletons and validated on the multi-baseline stereo vision application.
Abstract: The paper presents a data and task parallel low-level image processing environment for distributed memory systems. Image processing operators are parallelized by data decomposition using algorithmic skeletons. Image processing applications are parallelized by task decomposition, based on the image application task graph. In this way, an image processing application can be parallelized both by data and task decomposition, and thus better speed-ups can be obtained. We validate our method on the multi-baseline stereo vision application.
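Data decomposition of a low-level image operator can be sketched as follows. This is only an illustration under stated assumptions, not the environment described in the paper: the image is a list of pixel rows, the operator is a pointwise inversion (so no border exchange between strips is needed), the helper names are hypothetical, and a thread pool stands in for the nodes of a distributed memory system.

```python
from concurrent.futures import ThreadPoolExecutor

def invert_strip(strip):
    """Pointwise operator applied to one horizontal strip of an 8-bit image."""
    return [[255 - px for px in row] for row in strip]

def split_rows(image, parts):
    """Split the image into `parts` contiguous row strips of near-equal height."""
    base, extra = divmod(len(image), parts)
    strips, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        strips.append(image[start:start + size])
        start += size
    return strips

def parallel_invert(image, workers=4):
    """Scatter row strips to workers, apply the operator, gather in order."""
    strips = split_rows(image, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(invert_strip, strips))  # map preserves strip order
    return [row for strip in results for row in strip]
```

A neighbourhood operator such as a convolution would additionally need each strip to carry overlapping border rows from its neighbours, which this sketch deliberately leaves out.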
TL;DR: This paper uses a non-oriented graph to model a grid, where resources can have different speeds of computation and communication, as well as different overlap capabilities, and shows how to determine the optimal steady-state scheduling strategy for each processor.
Abstract: In this paper, we consider the problem of allocating a large number of independent, equal-sized tasks to a heterogeneous "grid" computing platform. We use a non-oriented graph to model a grid, where resources can have different speeds of computation and communication, as well as different overlap capabilities. We show how to determine the optimal steady-state scheduling strategy for each processor. Because spanning trees are easier to deal with in practice, a natural question arises: how to extract the best spanning tree, i.e. the one with optimal steady-state throughput, out of a general interconnection graph? We show that this problem is NP-Complete. Still, we introduce and compare several low-complexity heuristics to determine a sub-optimal spanning tree.
TL;DR: A flexible, parallel version of MODTRAN is implemented on the Cray T3E, the HP SPP2000, and a Beowulf-class cluster computer using domain decomposition techniques and the Message Passing Interface (MPI) library.
Abstract: To enable efficient reduction of large data sets such as is done in the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) project at the Jet Propulsion Laboratory (JPL), a high performance version of MODTRAN is essential. One means to accomplish this is to apply the computational resources of parallel computer systems. In our present work, a flexible, parallel version of MODTRAN has been implemented on the Cray T3E, the HP SPP2000, and a Beowulf-class cluster computer using domain decomposition techniques and the Message Passing Interface (MPI) library. In this paper, porting the sequential MODTRAN to various platforms is discussed; strategies of designing a parallel version of MODTRAN are developed; detailed implementation for a parallel MODTRAN is reported, and performance data of the parallel code on various computers are presented. Near linear scaling performance of parallel MODTRAN has been obtained, and comparisons of wallclock time are made among various supercomputers and parallel computers. The parallel version of MODTRAN gives excellent speedup, which dramatically reduces total data processing time for many applications such as the AVIRIS project at JPL.
TL;DR: In this paper, the authors propose a model that implements one large logical disk allowing concurrent access to arbitrary sets of variable size blocks, which can be implemented efficiently on multiple independent disks even if zones with different speed, communication bottlenecks and failed disks are allowed.
Abstract: For the design and analysis of algorithms that process huge data sets, a machine model is needed that handles parallel disks. There seems to be a dilemma between simple and flexible use of such a model and accurate modeling of details of the hardware. This paper explains how many aspects of this problem can be resolved. The programming model implements one large logical disk allowing concurrent access to arbitrary sets of variable size blocks. This model can be implemented efficiently on multiple independent disks even if zones with different speed, communication bottlenecks and failed disks are allowed. These results not only provide useful algorithmic tools but also imply a theoretical justification for studying external memory algorithms using simple abstract models. The algorithmic approach is random redundant placement of data and optimal scheduling of accesses. The analysis generalizes a previous analysis for simple abstract external memory models in several ways (higher efficiency, variable block sizes, more detailed disk model).
TL;DR: MOSE (My Own Search Engine) is a scalable parallel and distributed engine for searching the web, specifically designed to efficiently exploit affordable parallel architectures, such as clusters of workstations; its modular and scalable architecture can be easily adjusted to fulfill the bandwidth requirements of the application at hand.
Abstract: This paper describes the architecture of MOSE (My Own Search Engine), a scalable parallel and distributed engine for searching the web. MOSE was specifically designed to efficiently exploit affordable parallel architectures, such as clusters of workstations. Its modular and scalable architecture can be easily adjusted to fulfill the bandwidth requirements of the application at hand. Both task-parallel and data-parallel approaches are exploited within MOSE in order to increase the throughput and efficiently use communication, storing and computational resources. We used a collection of html documents as a benchmark and conducted preliminary experiments on a cluster of three SMP Linux PCs.
TL;DR: The Logistical Computing and Internetworking project is a reflection of the way that the next generation internetworking fundamentally changes the authors' definition of high performance wide area computing, with a richer view of the use of storage in communication and information sharing.
Abstract: The Logistical Computing and Internetworking (LoCI) project is a reflection of the way that the next generation internetworking fundamentally changes our definition of high performance wide area computing. A key to achieving this aim is the development of middleware that can provide reliable, flexible, scalable, and cost-effective delivery of data with quality of service guarantees to support high performance applications of all types. The LoCI effort attacks this problem with a simple but innovative strategy. At the base of the LoCI project is a richer view of the use of storage in communication and information sharing.
TL;DR: Terminal, broadcast, and lower and upper bound network reliability expressions are derived for the extra-stage cube network; if the lower bound reliability provides sufficient assurance that the system will be operational at some specified time, the exact reliability expression need not be obtained.
Abstract: Deriving the exact reliability expressions for large Multi-stage Interconnection Networks (MINs) can become rather complex. As network size increases, the reliability bounds could be used to estimate the reliability of the networks. In this paper, terminal, broadcast, lower and upper bounds network reliability will be determined. Lower bound reliability is the minimum probability that the system will be operational for a specified time. Upper bound reliability presents an optimistic view of the probability that the system will work at some specified time, which is not the main concern from a reliability point of view. If the lower bound reliability provides sufficient assurance that the system will be operational at some specified time, then no further effort for obtaining the exact reliability expression is necessary. As examples, the derivation of terminal, broadcast, lower and upper bounds network reliability expressions of the extra-stage cube network will be demonstrated.
TL;DR: This paper presents and illustrates the approach to developing a general framework for FPGA based Image Processing based on a library of hardware skeletons, with optimised implementations specifically for Xilinx XC4000 FPGAs.
Abstract: In this paper, we present our approach to developing a general framework for FPGA based Image Processing. This framework is based on a library of hardware skeletons. A hardware skeleton is a parameterised description of a task-specific architecture. A skeleton's implementation will apply optimisations specific to the target hardware. The library normally contains a range of alternative skeletons for the same task, perhaps tailored for different data representations. The library also contains high level skeletons for compound operations, whose implementation can apply appropriate optimisations. Given a complete algorithm description in terms of skeletons, an efficient hardware configuration is generated automatically. We have developed a library of hardware skeletons for common image processing tasks, with optimised implementations specifically for Xilinx XC4000 FPGAs. This paper presents and illustrates our hardware skeleton approach in the context of some common image processing tasks. It demonstrates our approach to the broader problem of achieving optimised hardware configurations while retaining the convenience and rapid development cycle of an application-oriented, high level programming model.
TL;DR: The GLS preconditioning polynomial and its influence on the flexible generalized minimized residual (FGMRES) solver are discussed in this paper and experimental results using classical benchmark systems are presented.
Abstract: Polynomial preconditioners are frequently used in a parallel environment for the computation of the solution of large-scale sparse linear equations (Ax = b) because of their easy implementation and trivial parallelization. With respect to symmetrical indefinite (SID) linear systems, the use of generalized least-squares (GLS) polynomial preconditioning is preferable to other polynomial preconditioning methods because of the ability to use a three-term recurrence relationship and the low implementation costs. The GLS preconditioning polynomial and its influence on the flexible generalized minimized residual (FGMRES) solver are discussed in this paper. The orthogonal polynomials required in the solution of the least-squares approximation problem are constructed using the Stieltjes procedure in multiple disjoint intervals which exclude the origin. The time-consuming numerical integration associated with this procedure is computed efficiently using Chebyshev polynomials of the first kind and the GLS polynomial preconditioned FGMRES algorithm is implemented using MPI in a highly parallel IBM SP2 environment. Experimental results using classical benchmark systems are presented and compared with those obtained using the recently developed SPAI preconditioned Bi-CGSTAB iterative method. The performance of the GLS preconditioned FGMRES solver is critically assessed.
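The abstract mentions both a three-term recurrence and Chebyshev polynomials of the first kind. As a small illustration of what a three-term recurrence looks like, the sketch below evaluates T_n(x) with the standard relation T_0(x) = 1, T_1(x) = x, T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x). It does not reproduce the GLS polynomial or the Stieltjes procedure of the paper; the function name is an assumption.

```python
import math

def chebyshev_t(n, x):
    """Evaluate the Chebyshev polynomial of the first kind T_n(x).

    Uses the standard three-term recurrence; only the two most recent
    values are kept, so storage is constant in n.
    """
    if n == 0:
        return 1.0
    prev, curr = 1.0, x  # T_0(x), T_1(x)
    for _ in range(n - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr

# On [-1, 1] the result can be checked against T_n(x) = cos(n * arccos(x)).
x = 0.3
print(chebyshev_t(5, x), math.cos(5 * math.acos(x)))
```

The appeal of such recurrences in a preconditioner is that each new polynomial term costs one multiplication by the operator plus a vector update, with only two previous terms held in memory.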
TL;DR: Algorithm and implementation issues regarding the single-/double-shift QZ algorithm are discussed and multishift strategies to enhance the performance in blocked as well as in parallel variants of the QZ method are described.
Abstract: A parallel three-stage algorithm for reduction of a regular matrix pair (A, B) to generalized Schur form (S, T) is presented. The first two stages transform (A, B) to upper Hessenberg-triangular form (H, T) using orthogonal equivalence transformations. The third stage iteratively reduces the matrix pair in (H, T) form to generalized Schur form. Algorithm and implementation issues regarding the single-/double-shift QZ algorithm are discussed. We also describe multishift strategies to enhance the performance in blocked as well as in parallel variants of the QZ method.
TL;DR: Some results, which are obtained when several versions of a large-scale air pollution model are run on different parallel architectures, will be presented in this paper.
Abstract: Mathematical models for large-scale air pollution studies consist of systems of partial differential equations (PDEs). The number of equations in these systems of PDEs is equal to the number of chemical compounds (the number of chemical compounds involved in the current large-scale air pollution models varies from 20 to about 200). The space domain of the systems of PDEs is normally very large, because the models must be able to treat transboundary long-range transport of the harmful pollutants. The time-intervals are often very long (runs with meteorological data covering up to 10 years have sometimes to be carried out). Moreover, fine spatial and temporal resolution is as a rule required. This leads to very large computational tasks when the air pollution models are discretized. Therefore, it is necessary to use fast and sufficiently accurate numerical methods as well as to exploit efficiently the great potential power of the parallel computers. Some results, which are obtained when several versions of a large-scale air pollution model are run on different parallel architectures, will be presented in this paper.
TL;DR: Madeleine II is an adaptive and portable multiprotocol communication library for high-performance multithreaded applications, which has the ability to control multiple network protocols (BIP, SISCI, VIA) and multiple network adapters (ETHERNET, MYRINET, SCI).
Abstract: This paper introduces Madeleine II, an adaptive and portable multiprotocol communication library for high-performance multithreaded applications. Madeleine II has the ability to control multiple network protocols (BIP, SISCI, VIA) and multiple network adapters (ETHERNET, MYRINET, SCI). Moreover, it includes advanced mechanisms to dynamically select the most appropriate transfer method for a given network protocol according to various parameters such as data size or responsiveness user requirements. We report on performance measurements obtained using various protocols and we present preliminary results about porting the MPICH and the NEXUS communication libraries on top of Madeleine II.
TL;DR: This paper provides an overview showing why Grid research is difficult, and a number of management-related issues that must be addressed to make Grids a reality.
Abstract: In this paper we provide a motivation for Grid computing based on a vision to enable a collaborative research environment. Our vision goes beyond the connection of hardware resources. We argue that with an infrastructure such as the Grid, new modalities for collaborative research are enabled. We provide an overview showing why Grid research is difficult, and we present a number of management-related issues that must be addressed to make Grids a reality. We list projects that provide solutions to subsets of these issues.
TL;DR: A new Recursive Packed Inverse Calculation Algorithm for symmetric positive definite matrices has been developed and has nearly the same performance as the LAPACK full storage algorithm using n^2 memory words.
Abstract: A new Recursive Packed Inverse Calculation Algorithm for symmetric positive definite matrices has been developed. The new Recursive Inverse Calculation algorithm uses minimal storage, n(n + 1)/2, and has nearly the same performance as the LAPACK full storage algorithm using n^2 memory words. New recursive packed BLAS needed for this algorithm have also been developed. Two transformation routines, from the LAPACK packed storage data format to the recursive storage data format, were added to the package as well. We present performance measurements on several current architectures that demonstrate improvements over the traditional packed routines.
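To illustrate the n(n + 1)/2 storage figure above, the following minimal Python sketch packs the lower triangle of a symmetric matrix column by column, as in the conventional LAPACK packed storage format. It does not show the recursive packed format of the paper, whose layout is not described in the abstract. The function names packed_size, packed_index, and pack_lower are hypothetical and chosen for illustration only.

```python
def packed_size(n):
    """Number of words needed to store one triangle of an n x n symmetric matrix."""
    return n * (n + 1) // 2


def packed_index(i, j, n):
    """0-based position of element (i, j) in column-major lower-triangle packed storage.

    By symmetry, (i, j) and (j, i) map to the same stored word.
    """
    if i < j:
        i, j = j, i  # fold the upper triangle onto the lower one
    # Columns 0..j-1 hold n, n-1, ..., n-j+1 entries; row i then sits (i - j) entries into column j.
    return i + j * (2 * n - j - 1) // 2


def pack_lower(a):
    """Copy the lower triangle of a square matrix (list of lists) into a packed list."""
    n = len(a)
    ap = [0.0] * packed_size(n)
    for j in range(n):
        for i in range(j, n):
            ap[packed_index(i, j, n)] = a[i][j]
    return ap
```

For n = 1000, packed_size returns 500500 words, compared with 1000000 words for full storage, which is the saving the abstract refers to.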
TL;DR: A view of the design and development activity concerning advanced environments for parallel and distributed computing is provided, and these environments are classified into two main classes: programming environments and problem-solving environments.
Abstract: In this paper we provide a view of the design and development activity concerning advanced environments for parallel and distributed computing. We start by assessing the main issues driving this research track, in the areas of hardware and software technology and of applications. Then, we identify some key concepts that can be considered common guidelines and goals in the development of modern advanced environments, and we propose a "classification" of these environments into two main classes: programming environments and problem-solving environments. Both classes are discussed at length in light of the key concepts previously outlined, and several examples are provided to give a picture of the current status and trends.
TL;DR: An incremental parallelization method which carries out simultaneously both scheduling and mapping in co-operation with two incremental memory allocation heuristics (ParAd: parallelism degree adjustment, and MaCRelax: mapping clues relaxation) in a dynamic multi-user context is proposed.
Abstract: In order to re-adjust the parallel execution of SQL queries in case of metric estimation or discretization errors, we propose an incremental parallelization method which simultaneously carries out scheduling and mapping in co-operation with two incremental memory allocation heuristics (ParAd: parallelism degree adjustment, and MaCRelax: mapping clues relaxation) in a dynamic multi-user context. The two incremental memory allocation heuristics are integrated in the mapping method, which attempts to avoid time-consuming multibucket join execution that generates numerous additional I/O operations. A performance evaluation of the ParAd heuristic shows: (i) significant savings in join response time (from 16.11% to 35.62%), and (ii) with many complex queries, a more significant gain in response time (from 29% to 54%).
TL;DR: An analysis of the models shows that the parallel algorithms are highly scalable when the size of the experts grows from linear units to multi-layer perceptrons (MLPs), and experiments achieve near-linear speedups for HME-MLP.
Abstract: Efficient parallel learning algorithms are proposed for training a powerful modular neural network, the hierarchical mixture of experts (HME). Parallelizations are based on the concept of modular parallelism, i.e. parallel execution of network modules. From modeling the speed-up as a function of the number of processors and the number of training examples, several improvements are derived, such as pipelining the training examples by packets. The theoretical models agree well with experimental measurements. For regular topologies, an analysis of the models shows that the parallel algorithms are highly scalable when the size of the experts grows from linear units to multi-layer perceptrons (MLPs). These results are confirmed experimentally, with near-linear speedups achieved for HME-MLP. Although this work can be viewed as a case study in the parallelization of HME neural networks, both the algorithms and the theoretical models can be extended to different learning rules or less regular tree architectures.