TL;DR: This paper presents SKiPPER, a programming environment based on algorithmic skeletons and dedicated to the fast prototyping of parallel vision algorithms on MIMD-DM platforms; the initial specification can also be executed on any sequential platform to check the correctness of the parallel algorithm.
Abstract: This paper presents SKiPPER, a programming environment dedicated to the fast prototyping of parallel vision algorithms on MIMD-DM platforms. SKiPPER is based upon the concept of algorithmic skeletons, i.e. higher-order program constructs encapsulating recurring forms of parallel computations and hiding their low-level implementation details. Each skeleton is given an architecture-independent functional (but executable) specification and a portable implementation as a generic process template. The source program is a purely functional specification of the algorithm in which all parallelism is made explicit by composing instances of selected skeletons, each instance taking as parameters the application-specific sequential functions written in C. SKiPPER compiles this specification down to a process graph in which nodes correspond to sequential functions and/or skeleton control processes and edges to communications. This graph is then mapped onto the target topology using third-party CAD software (SynDEx). The result is a deadlock-free, optimized (but still portable) distributed executive, which SKiPPER finally turns into executable code for the target platform. The initial specification, written in ML, can also be executed on any sequential platform to check the correctness of the parallel algorithm. The applicability of SKiPPER concepts and tools has been demonstrated by parallelising several realistic real-time vision applications both on a multi-DSP platform and on a network of workstations. It is illustrated here with a real-time vehicle detection and tracking application.
TL;DR: Stochastic models are shown to be more precise in reflecting pure diffusion dynamics and heat distribution, while the deterministic ones model more complex phenomena displaying both diffusive and wavelike properties, inherent in gases and fluids.
Abstract: Cellular-automata diffusion models are studied by simulation and their characteristics are compared. The simulation results are obtained by process observation and by computing the concentration distribution along one of the axes of the array. To prove the validity of the models and assess their macroscopic parameters, the results are compared to those obtained from the corresponding PDE solution. Stochastic and deterministic models are investigated. Stochastic models are shown to be more precise in reflecting pure diffusion dynamics and heat distribution, while the deterministic ones model more complex phenomena displaying both diffusive and wavelike properties, inherent in gases and fluids.
TL;DR: In this article, the authors describe three primary parallel components to BLAST: the first works at the sequence-to-sequence comparison level, the second parallelizes a single query across a partitioned and distributed database, and the third partitions the set of queries across a set of servers with replicated or partitioned databases.
Abstract: This paper describes approaches to improving the performance of one of the most common and increasingly important aspects of the Human Genome Project (HGP): large-volume, batch comparison of DNA sequence data. This basic comparison operation, usually carried out by the well-known BLAST program on one subject sequence against the internationally-available databases of over 3 million target sequences, is already used hundreds of thousands of times each day by researchers around the world. At present, it is still used primarily in single query, or small batch query mode. As the entire sequence of the human genome nears completion, the area of functional genomics, and the use of microarrays of sets of genes, is coming to the fore. These developments will demand ever more efficient means of BLASTing sets of data that will make single processor implementation on powerful workstations infeasible. We describe the three primary parallel components to BLAST. The first is at the sequence-to-sequence comparison level. The second parallelizes a single query across a partitioned and distributed database. Finally, the set of queries itself is partitioned across a set of servers with replicated or partitioned databases. The three methods may be employed alone or in concert. Our current implementation, which parallelizes batch requests, is described, and our plans for implementation of the other levels are also described. The results will ultimately be applied to hardware assistance for this soon-to-be primitive computer operation.
TL;DR: This work proposes an environment which integrates a framework for algorithm transformation, called FAN, with two existing skeleton-based programming systems: the academic system P3L and its commercial counterpart SkIE.
Abstract: We present an integrated environment for the systematic development of parallel and distributed programs. Our approach allows the user to construct complex applications by composing and transforming skeletons, i.e., recurring patterns of task and data parallelism. First academic and commercial experience with skeleton-based systems has demonstrated the benefits of the approach but also the lack of a dedicated set of methods for algorithm design and performance prediction. We take a first step towards such a set of methods by proposing an environment which integrates a framework for algorithm transformation, called FAN, with two existing skeleton-based programming systems: the academic system P3L and its commercial counterpart SkIE.
TL;DR: The aim of the project was to implement on parallel machines codes for simulation of specific applicative models based on Cellular Automata for a variety of industrial applications like the design of new products in the coffee industry, the experimentation of elasticity properties of batches for tires and the monitoring of chemical contamination of soils.
Abstract: This paper presents the main ideas and results of the CAPP (Cellular Automata for Percolation Processes) project, funded by the European Union within the activity of the Technology Transfer Node NOTSOMAD. The aim of the project was to implement on parallel machines codes for simulation of specific applicative models based on Cellular Automata for a variety of industrial applications like the design of new products in the coffee industry, the experimentation of elasticity properties of batches for tires and the monitoring of chemical contamination of soils, all sharing the need to deal with percolation phenomena.
TL;DR: This paper proposes an environment helping the user to parallelize a FIIT application that is not only independent of the particular application considered, but also of the target parallel machine.
Abstract: The main goal of this paper is to propose an environment helping the user to parallelize a FIIT application. This object-oriented environment is independent not only of the particular application considered, but also of the target parallel machine. It makes programming easier: parallelism is managed by the environment and is thus completely transparent to the user. We experiment with this environment in the framework of parallel ray tracing and show its main advantages.
TL;DR: This work investigates the performance of a programming approach based on the MPI standard for inter-multiprocessor communications and the OpenMP standard for intra-multiprocessor exchanges, and presents a performance evaluation for the NAS parallel benchmarks.
Abstract: The availability of multiprocessors and high performance networks offers the opportunity to build CLUMPs (Clusters of Multiprocessors) and use them as parallel computing platforms. The main distinctive feature of the CLUMP architecture compared with the usual parallel computers is its hybrid memory model (message passing between the nodes and shared memory inside the nodes). To be widely used, CLUMPs must be able to execute existing programs with few modifications. We investigate the performance of a programming approach based on the MPI standard for inter-multiprocessor communications and the OpenMP standard for intra-multiprocessor exchanges. The approach consists of the intra-node parallelization of the MPI programs with an OpenMP directive-based parallel compiler. The paper details the approach in the context of biprocessor PC CLUMPs and presents a performance evaluation for the NAS parallel benchmarks.
TL;DR: A new model for objects which can move around on a cellular grid is introduced, which consists of two phases, the movement phase and the conflict resolution phase.
Abstract: We introduce a new model for objects which can move around on a cellular grid. The model consists of two phases, the movement phase and the conflict resolution phase. In the movement part of the description objects specify their desired direction. The conflict, which occurs when alternative objects want to move to the same free cell, is resolved in the conflict resolution part. The cellular description language CDL was extended to CDL++ in order to describe moving objects. This extension is automatically converted into a two-phased CDL program.
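The two-phase idea in the abstract above can be sketched in a few lines of Python. This is only an illustration, not CDL++ syntax or the authors' generated CDL program: the direction encoding `MOVES`, the function name `step`, the wrap-around grid, and the tie-breaking rule (lowest object id wins) are all assumptions made for the example.

```python
# Hypothetical direction encoding; CDL++ itself is not reproduced here.
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}


def step(occupied, wishes, rows, cols):
    """Perform one synchronous two-phase step.

    occupied : dict mapping (row, col) -> object id (occupied cells only)
    wishes   : dict mapping object id -> key of MOVES (missing = stay put)
    rows, cols : grid size; the grid wraps around (an assumption)
    """
    # Phase 1 (movement): each object names the cell it would like to enter.
    requests = {}  # free target cell -> list of (object id, source cell)
    for cell, obj in occupied.items():
        direction = wishes.get(obj)
        if direction is None:
            continue
        dr, dc = MOVES[direction]
        target = ((cell[0] + dr) % rows, (cell[1] + dc) % cols)
        if target not in occupied:  # only cells that are free now can be entered
            requests.setdefault(target, []).append((obj, cell))

    # Phase 2 (conflict resolution): exactly one winner per contested cell.
    # Assumption: the lowest object id wins; losers stay where they are.
    result = dict(occupied)
    for target, candidates in requests.items():
        obj, source = min(candidates)
        del result[source]
        result[target] = obj
    return result


if __name__ == "__main__":
    # Objects 1 and 2 both want the free cell (0, 1); object 1 wins.
    print(step({(0, 0): 1, (0, 2): 2}, {1: "E", 2: "W"}, rows=3, cols=3))
```

Separating the wish from its resolution keeps every cell update dependent only on the previous state of its neighbourhood, which is what makes the scheme expressible as a cellular automaton.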
TL;DR: An algorithm for timing behaviour analysis of concurrent and real time systems is developed that allows 'mutual adjustment' of timing specifications of both the system and the property via a single execution of verification procedure.
Abstract: The intention of the paper is to develop an algorithm for timing behaviour analysis of concurrent and real time systems. To this end we introduce the notion of a parametric time net, a modification of the time Petri net [4,7] that uses parameter variables in the specification of timing constraints on transition firings. A property of the system is given as a formula of Parametric TCTL (PTCTL), a real time branching time temporal logic with timing parameter variables in its operators [6]. Timing behaviour analysis consists in finding necessary and sufficient conditions on parameter values under which the checked PTCTL-formula is valid in the given system. Thus the approach allows 'mutual adjustment' of timing specifications of both the system and the property via a single execution of the verification procedure. The correctness of the proposed algorithm is further shown and its complexity evaluated.
TL;DR: A parallel model based on Cellular Automata is presented, that has been applied to the percolation of pesticides in the soil, to reproduce the process that causes pesticides to be released into water flowing through the soil and to be carried to the groundwater layer, polluting it.
Abstract: We present a parallel model based on Cellular Automata for the simulation of reaction-diffusion processes, that has been applied to the percolation of pesticides in the soil. The main contribution of our approach consists of a model where chemical reactions and the movement of fluid particles in a porous medium can be explicitly described and simulated. The model has been used to reproduce the process that causes pesticides, contained in the soil after their application to crops, to be released into water flowing through the soil and to be carried to the groundwater layer, polluting it. The model has been successfully implemented on Cray T3E and SGI Origin 2000 parallel computers.
TL;DR: A parallel iterative solver for large sparse symmetric positive definite (SPD) linear systems based on a new theory describing the convergence of the Preconditioned Conjugate Gradient method and a proper combination of preconditioning strategies is presented.
Abstract: We present a parallel iterative solver for large sparse symmetric positive definite (SPD) linear systems based on a new theory describing the convergence of the Preconditioned Conjugate Gradient (PCG) method and a proper combination of advanced preconditioning strategies. Formally, the preconditioning can be interpreted as a special (nearly optimum from the viewpoint of the new PCG theory) version of overlapping domain decomposition with incomplete Cholesky solutions over subdomains. The estimates of parallel efficiency are given as well as the results of numerical experiments for the serial and parallel versions of the solver.
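For orientation, the textbook serial PCG iteration that such a solver builds on can be sketched as follows. This is not the authors' solver: the diagonal (Jacobi) preconditioner is a deliberately simple stand-in for the overlapping domain decomposition with incomplete Cholesky described in the abstract, the matrix is stored densely as nested lists, and the name `pcg` and its parameters are assumptions for the example.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (list of lists).

    Uses the standard preconditioned conjugate gradient recurrence with a
    Jacobi preconditioner M = diag(A) (an assumption of this sketch).
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]

    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)

    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:              # stop on small residual norm
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    print(pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

In a parallel setting the preconditioner application (the two lines computing `z`) is where the choice of strategy matters most, since the matrix-vector product and dot products parallelize in a standard way.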
TL;DR: A new, parallel, mathematical library suite for sparse matrices, which brings object-oriented programming techniques and C++ to the task of providing linear and non-linear algebraic-oriented algorithms for scientists and engineers.
Abstract: In this paper, we present a new, parallel, mathematical library suite for sparse matrices. The Parallel Mathematical Libraries Project (PMLP), a joint effort of Intel, Lawrence Livermore National Laboratory, the Russian Federal Nuclear Laboratory (VNIIEF), and Mississippi State University (MSU), constitutes a concerted effort to create a supportable, comprehensive "Sparse Object-oriented Mathematical Library Suite." With overall design and software validation work at MSU, most software development and testing at VNIIEF, and logistics and other miscellaneous support provided by LLNL and Intel, this international collaboration brings object-oriented programming techniques and C++ to the task of providing linear and non-linear algebraic-oriented algorithms for scientists and engineers. Language bindings for C, Fortran-77, and C++ are provided.
TL;DR: The architectural foundations and a description of the WinALT simulation system are given; the system is suitable for representing versatile classes of fine-grain algorithms and structures and provides a comprehensive set of tools for user extensions.
Abstract: The architectural foundations and a description of the WinALT simulation system are given in the paper. The main purpose of the graphical WinALT interface is to visualize model construction and execution. The WinALT language is suitable for representing versatile classes of fine-grain algorithms and structures. The system has a comprehensive set of tools for user extensions.
TL;DR: A program restructuring technique for on-the-fly race detection is presented, which results in a serializable program preserving the semantics of original program, eliminating one component of the space complexity.
Abstract: Detecting races is important for debugging explicit shared-memory parallel programs, because races result in unintended nondeterministic executions of the programs. Previous on-the-fly techniques to detect races in parallel programs with inter-thread coordination show serious space overhead in two components of complexity, and cannot guarantee that, in an execution instance, the race detected first is not preceded by accesses that also participate in a race. This paper presents a program restructuring technique for on-the-fly race detection, which results in a serializable program preserving the semantics of the original program. Monitoring an execution of the restructured program can detect the first races in the original program, eliminating one component of the space complexity.
TL;DR: This work investigates to what extent the idea of parallelism across the method can be practically exploited for large-scale initial value problems for ordinary differential equations, which often cannot be solved in a reasonable time on a sequential computer.
Abstract: The semidiscretization of a time-dependent nonlinear partial differential equation leads to a large-scale initial value problem for ordinary differential equations which often cannot be solved in a reasonable time on a sequential computer. We investigate to what extent the idea of parallelism across the method can be practically exploited for such large problems using a distributed computational system.
TL;DR: This paper explores Logically Instantaneous (LI) communication and provides a simple and efficient protocol that implements LI on top of asynchronous distributed systems, which makes the following approach possible: first design a distributed application assuming Rendezvous communication, and then run it on top of an asynchronous distributed system providing only LI communication.
Abstract: Communication is Logically Instantaneous (LI) if it is possible to timestamp communication events with integers in such a way that (1) timestamps increase within each process and (2) the sending and the delivery events associated with each message have the same timestamp. So, there is a logical time frame in which, for each message, the send event and the corresponding delivery events occur simultaneously. LI is stronger than Causally Ordered (CO) communication, but weaker than Rendezvous (RDV) communication. This paper explores Logically Instantaneous communication and provides a simple and efficient protocol that implements LI on top of asynchronous distributed systems. LI is attractive as it includes CO and provides more concurrency than RDV. Moreover, it makes the following approach possible: first design a distributed application assuming Rendezvous communication, and then run it on top of an asynchronous distributed system providing only LI communication.
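The two timestamping conditions in this definition are easy to state as a check. The sketch below only validates a timestamping that is already given; it neither searches for one nor implements the protocol from the paper. The event layout `(kind, message id, timestamp)`, the function name, and the reading of "increase" as strictly increasing are assumptions of the example.

```python
def is_li_timestamping(histories):
    """Check conditions (1) and (2) of the LI definition for given timestamps.

    histories: dict mapping process name -> list of events in local order,
               each event being (kind, msg_id, timestamp) with kind in
               {"send", "deliver"} (hypothetical layout for this sketch).
    """
    stamp_of = {}  # message id -> timestamp seen on its first event
    for events in histories.values():
        previous = None
        for _kind, msg, ts in events:
            # (1) timestamps strictly increase within each process
            if previous is not None and ts <= previous:
                return False
            previous = ts
            # (2) send and delivery events of a message share one timestamp
            if stamp_of.setdefault(msg, ts) != ts:
                return False
    return True


if __name__ == "__main__":
    in_order = {
        "p": [("send", "m1", 1), ("send", "m2", 2)],
        "q": [("deliver", "m1", 1), ("deliver", "m2", 2)],
    }
    crossing = {
        "p": [("send", "m1", 1), ("deliver", "m2", 2)],
        "q": [("send", "m2", 2), ("deliver", "m1", 1)],
    }
    print(is_li_timestamping(in_order), is_li_timestamping(crossing))
```

The second history shows two messages that cross each other: condition (2) forces each message to carry one timestamp, and condition (1) then demands opposite orders at the two processes, so no valid timestamping can exist for that exchange.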
TL;DR: This paper presents a method to perform I/O optimization automatically, based on a combination of a blackboard system and an A* algorithm, which achieves (near) optimal performance in reasonable time.
Abstract: The performance of parallel I/O operations is highly dependent on various parameters like disk transfer rates, speed of processor (network) interconnections, size of available memory for data buffers and so forth. Tuning parallel I/O to achieve optimum performance is a very complex task for application programmers. This paper presents a method to perform I/O optimization automatically. The approach is based on a combination of a blackboard system and an A* algorithm, which achieves (near) optimal performance in reasonable time. The architecture of the blackboard is described in detail and illustrated with an example based on a simple cost model.
TL;DR: A software environment in which students learn concurrency by programming the behaviour of a set of interacting agents by putting together the turtle primitives of the Logo language, the classic sequential imperative language constructs and the concurrent ones.
Abstract: We present a software environment in which students learn concurrency by programming the behaviour of a set of interacting agents. The language defined puts together the turtle primitives of the Logo language, the classic sequential imperative language constructs and the concurrent ones. It is possible to program a dynamic world in which independent agents interact with one another through the exchange of messages.
TL;DR: A formal method to schedule algorithms for the special case of 3D → 1D spatial mapping is proposed, based on a technique of two-dimensional scheduling.
Abstract: A formal method to schedule algorithms for the special case of 3D → 1D spatial mapping is proposed. The method is based on a technique of two-dimensional scheduling. Initial 3D algorithms should be represented as a system of uniform recurrence equations or as a uniform loop nest. The method can be generalized for the case of 4D → 2D, 5D → 3D spatial mapping or for the case of affine scheduling with the same linear part.
TL;DR: An Associative memory architecture is presented which is the basis of a machine APPLES (Associative Parallel Processor for Logic Event Simulation), specifically designed for parallel discrete event logic simulation.
Abstract: As circuits increase in size and complexity, there is an ever more demanding requirement to accelerate the processing speed of logic simulation. Parallel processing has been perceived as an obvious candidate to assist in this goal and numerous parallel processing systems have been investigated. Unfortunately, large speedup figures have eluded these approaches. A large communication overhead, due to basic passing of values between processors, elaborate measures to avoid or recover from deadlock, and load balancing techniques, is the principal barrier to achieving high speedup. This paper presents an associative memory architecture which is the basis of a machine, APPLES (Associative Parallel Processor for Logic Event Simulation), specifically designed for parallel discrete event logic simulation. A scan mechanism replaces inter-process communication. This mechanism is well disposed to parallelisation. The machine has been evaluated theoretically and empirically.
TL;DR: Virtual shared files (VSF) are presented as a paradigm of interaction between parallel components. VSF are based on the ordinary I/O notion and look like matrices and ordinary files to application programmers: all operations are applied to a file as a whole, operations remotely changing the content of a file are prohibited, and memory is explicitly allocated by the user, which is essential for massively parallel computers.
Abstract: This paper presents the concept of virtual shared files (VSF) as a paradigm of interaction between parallel components. The metaphor of a virtual shared file space ensures a compromise between the flexibility of explicit message passing and the transparency of the shared memory model. VSF are based on the ordinary I/O notion and look like matrices and ordinary files to application programmers. The most essential design issues are: all operations are applied to a file as a whole; operations remotely changing the content of a file are prohibited; memory is explicitly allocated by the user, which is essential for massively parallel computers.
TL;DR: This paper develops a model of cellular automata for massive parallel arithmetic computations with complex numbers on a bit level, shows that complex numbers should be represented in the second order negabinary coding system, and suggests a system of automaton rules for evaluating complex arithmetic expressions.
Abstract: In this paper we develop a model of cellular automata for massive parallel arithmetic computations with complex numbers on a bit level, show that complex numbers should be represented in the second order negabinary coding system, and suggest a system of automaton rules for evaluating complex arithmetic expressions.
TL;DR: In this note, the RT-MEC tool is presented, including general unique features, and the development and usage experience is summarized.
Abstract: RT-MEC is a tool box for validation (via graphical simulation) and verification (via model checking and equivalence checking) of real time systems based on partial order reduction [11] and on-the-fly technique [10]. It is appropriate for systems that can be modelled as Petri nets with real (dense) time. The tool is available within the system PEP (Programming Environment based on Petri nets) [4]. In this note, we present the RT-MEC tool, including general unique features, and summarize our development and usage experience.
TL;DR: This paper proposes a generic solution for object mobility in CORBA in the framework of the lifecycle service using a multi-agent autoorganizational group mechanism so as to reduce the administration task for a large system.
Abstract: Standards such as CORBA are spreading in the development of large scale projects. However, CORBA lacks a mobility mechanism which is an interesting feature to deal with the system's dynamics. In this paper, we propose a generic solution for object mobility in CORBA in the framework of the lifecycle service. Implementation at the object level handles the migration process using intermediary objects. A group mechanism is used to manage the object creation infrastructure so as to allow scalability. We have chosen a multi-agent autoorganizational group mechanism so as to reduce the administration task for a large system. The performance tests show that reasonable performance can be achieved using a high level generic and portable implementation.
TL;DR: Various methods of CNAM learning (synthesis) are compared in order to find their common features, to transfer important characteristics among the methods, and to make some assumptions about their capabilities.
Abstract: In this paper various methods of CNAM learning (synthesis) are compared in order to find their common features. This makes it possible to transfer important characteristics among the methods and to make some assumptions about their capabilities. The influence of learning parameters in some methods on CNAM stability is also investigated, and recommendations on their choice are given.
TL;DR: The paper focuses on preconditioners such as the (in)complete Gaussian algorithm, which are being implemented in LINSOL, an iterative linear solver package tuned to massively parallel systems with distributed memory.
Abstract: In order to provide generally applicable iterative linear solvers for the community of scientific computing, the LINSOL program package has been designed. The focus of this package is on portability, robustness and an efficient implementation on massively parallel systems. LINSOL uses state-of-the-art iterative solvers as basic methods. Different normalization methods can be used to improve the convergence rates of the iterative solvers. Preconditioners like the (in)complete Gaussian algorithm are now being implemented. The paper focuses on this type of algorithm. LINSOL is tuned to massively parallel systems with distributed memory. Therefore, the message passing programming style is used. LINSOL supports many matrix formats for the convenience of the users. Moreover, adaptive method selection schemes called polyalgorithms are implemented.
TL;DR: In this article, the authors describe the design of a multiprocessor COOL system to be implemented with superconductor Rapid Single-Flux-Quantum (RSFQ) technology that is being developed at SUNY (Stony Brook, USA).
Abstract: This paper describes the design of a multiprocessor COOL system to be implemented with superconductor Rapid Single-Flux-Quantum (RSFQ) technology that is being developed at SUNY (Stony Brook, USA) within the framework of the Hybrid Technology MultiThreaded architecture (HTMT) project. The objective of the current phase of the project is the proof-of-concept study of a computer that could be built with novel technologies such as RSFQ, optical networks, processors-in-memory, and holographic memory in order to achieve petaflops-level performance within a reasonable hardware and power budget by 2007. The COOL system design is based on a new multithreaded COOL-I architecture which supports two-level multithreading to hide latencies associated with memory and arithmetic operations in superconductor SPELL processors. Preliminary simulation results show that a COOL system with 4096 66-GHz processors can achieve petaflops-level performance on computationally-intensive parallel program kernels.
TL;DR: A parallelisation scheme for arc-consistency, a technique that removes inconsistent values from the set of values that can be assigned to a variable (its domain) and thus reduces the search space, is developed for MIMD multiprocessors.
Abstract: Many problems from artificial intelligence can be described as constraint satisfaction problems over finite domains (CSP(FD)), that is, a solution is an assignment of a value to each problem variable such that a set of constraints is satisfied. Arc-consistency algorithms remove inconsistent values from the set of values that can be assigned to a variable (its domain), thus reducing the search space. We have developed a parallelisation scheme of arc-consistency to be run on a MIMD multiprocessor. The set of constraints is divided into N partitions, which are executed in parallel on N processors. The parallelisation scheme has been implemented on a CRAY T3E multiprocessor with up to thirty-four processors. Empirical results on speedup and behaviour are reported and discussed.
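To make the domain-pruning step concrete, here is a sketch of the classic sequential AC-3 algorithm for binary constraints. It is not the parallel scheme of the paper (no partitioning of constraints across processors is shown), and the data layout, with `domains` as a dict of sets and `constraints` as a dict of directed arcs to predicates, is an assumption chosen for brevity.

```python
from collections import deque


def ac3(domains, constraints):
    """Enforce arc-consistency in place with the sequential AC-3 algorithm.

    domains     : dict mapping variable -> set of candidate values
    constraints : dict mapping a directed arc (x, y) -> predicate(vx, vy);
                  each binary constraint must be listed in both directions
    Returns False if some domain becomes empty, True otherwise.
    """
    queue = deque(constraints)  # start with every directed arc
    while queue:
        x, y = queue.popleft()
        allowed = constraints[(x, y)]
        # Values of x with no supporting value left in the domain of y.
        unsupported = {
            vx for vx in domains[x]
            if not any(allowed(vx, vy) for vy in domains[y])
        }
        if unsupported:
            domains[x] -= unsupported
            if not domains[x]:
                return False
            # Re-examine arcs pointing at x, except the one coming from y.
            queue.extend(
                (z, w) for (z, w) in constraints if w == x and z != y
            )
    return True


if __name__ == "__main__":
    doms = {"x": {1, 2, 3}, "y": {1, 2, 3}}
    cons = {
        ("x", "y"): lambda a, b: a < b,   # x < y seen from x
        ("y", "x"): lambda a, b: a > b,   # the same constraint seen from y
    }
    print(ac3(doms, cons), doms)
```

The work queue of arcs is the natural unit to distribute: a scheme like the one in the abstract assigns subsets of the constraints to different processors, which then have to exchange domain reductions.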