TL;DR: The novelty of this paper is a type system which prevents nesting of parallel vectors and is correct w.r.t. the dynamic semantics which is also presented.
Abstract: The BSMLlib library is a library for Bulk Synchronous Parallel (BSP) programming with the functional language Objective Caml. It is based on an extension of the λ-calculus by parallel operations on a data structure named parallel vector, which is given by intention. In order to have an execution that follows the BSP model, and to have a simple cost model, nesting of parallel vectors is not allowed. The novelty of this paper is a type system which prevents such nesting. This system is correct w.r.t. the dynamic semantics which is also presented.
TL;DR: This paper compares a high-level implementation of the NAS benchmark FT in the functional array language SaC with traditional solutions based on Fortran-77 and C and quantifies the associated impact on runtime performance.
Abstract: This paper compares a high-level implementation of the NAS benchmark FT in the functional array language SaC with traditional solutions based on Fortran-77 and C. The impact of abstraction on expressiveness, readability, and maintainability of code as well as on clarity of underlying mathematical concepts is discussed. The associated impact on runtime performance is quantified both in a uniprocessor environment and in a multiprocessor environment based on automatic parallelization and on OpenMP.
TL;DR: This paper presents a new scheme for parallel computations on cluster systems for time-consuming problems of globally optimal decision making, based on the idea of multidimensional problem reduction using some new multiple mappings of the Peano curve type.
Abstract: This paper presents a new scheme for parallel computations on cluster systems for time-consuming problems of globally optimal decision making. This uniform scheme (without any centralized control processor) is based on the idea of multidimensional problem reduction. Using some new multiple mappings (of the Peano curve type), a multidimensional problem is reduced to a family of univariate problems which can be solved in parallel in such a way that each processor shares the information obtained by the other processors.
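The reduction idea can be illustrated with a small sketch. It is not the paper's scheme: it assumes a single discrete Hilbert curve (one curve of the Peano type) on a 2D grid instead of the multiple mappings the abstract describes, and it scans the curve sequentially rather than in parallel. The names `hilbert_d2xy` and `minimize_along_curve` are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map a curve index d in [0, 4**order) to a cell (x, y) of a
    2**order x 2**order grid, following a Hilbert curve."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so consecutive indices stay adjacent.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def minimize_along_curve(f, order):
    """Reduce a 2D grid search to a 1D search over the curve index.

    Assumption for illustration: an exhaustive scan stands in for the
    univariate global optimization method a real solver would use.
    """
    best_d = min(range(4 ** order), key=lambda d: f(*hilbert_d2xy(order, d)))
    return best_d, hilbert_d2xy(order, best_d)
```

Because the curve visits every grid cell exactly once, the minimum over the one-dimensional index equals the minimum over the grid. Each of several such curves would give a different univariate problem, which is what makes a family of them a candidate for parallel solution.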
TL;DR: It is shown that a combination of today’s processors, today’s compilers, and algorithms written in C using a data representation which exposes parallelism, is able to outperform the reference GMP library which is partially written in assembler.
Abstract: This paper is a practical study of the performance impact of avoiding data-dependencies at the algorithm level, when targeting recent deeply pipelined, superscalar processors. We are interested in multiple-precision libraries offering the equivalent of quad-double precision. We show that a combination of today’s processors, today’s compilers, and algorithms written in C using a data representation which exposes parallelism, is able to outperform the reference GMP library which is partially written in assembler. We observe that the gain is related to a better use of the processor’s instruction parallelism.
TL;DR: This work describes a reference implementation of a multi-threaded run-time system for a core programming language based on a process calculus; the parallelism implicit in the syntax of programs is extracted by the language compiler and exploited by the run-time system.
Abstract: We describe a reference implementation of a multi-threaded run-time system for a core programming language based on a process calculus. The core language features processes running in parallel and communicating through asynchronous messages as the fundamental abstractions. The programming style is fully declarative, focusing on the interaction patterns between processes. The parallelism, implicit in the syntax of the programs, is effectively extracted by the language compiler and exploited by the run-time system.
TL;DR: It is proved that the resource similarity is a semilinear relation and can be represented as a finite union of linear combinations over a finite set of base conditional resource similarities.
Abstract: Resources are defined as submultisets of Petri net markings. Two resources are called similar if replacing one by the other does not change the net’s behavior. Two resources are called similar under a certain condition if one of them can be replaced by the other without changing the observable behavior, provided that the containing marking also includes some additional resources. The paper studies conditional similarity of Petri net resources, of which (unconditional) similarity is a special case. It is proved that the resource similarity is a semilinear relation and can be represented as a finite union of linear combinations over a finite set of base conditional resource similarities. An algorithm for computing a finite approximation of the conditional resource similarity relation is also presented.
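The basic objects in this abstract can be sketched with multisets. The code below only models the definitions (a resource as a submultiset of a marking, and replacement of one resource by another); it does not decide similarity, which depends on the net's behavior. The function names and place names are hypothetical.

```python
from collections import Counter


def is_resource_of(resource, marking):
    """True if `resource` is a submultiset of `marking` (place -> token count)."""
    return all(marking[place] >= count for place, count in resource.items())


def replace_resource(marking, old, new):
    """Return the marking obtained by replacing resource `old` with `new`."""
    if not is_resource_of(old, marking):
        raise ValueError("old resource is not contained in the marking")
    result = Counter(marking)
    result.subtract(old)   # remove the tokens of the old resource
    result.update(new)     # add the tokens of the new resource
    return +result         # unary plus drops places left with zero tokens
```

Two resources would be similar in the sense above if, for every marking containing `old`, the net behaves the same before and after `replace_resource`.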
TL;DR: A numerical 3D-model for investigation of non-stationary processes in a gravitating system with gas is created based on the solution of the Poisson equation for gravitational field, the Vlasov-Liouville equation for solids and equations of gas dynamics.
Abstract: A numerical 3D-model for investigation of non-stationary processes in a gravitating system with gas is created. The model is based on the solution of the Poisson equation for gravitational field, the Vlasov-Liouville equation for solids and equations of gas dynamics. For solution of the Poisson equation at each timestep an efficient iterative solver is created with extrapolation of the evolutionary processes under study. It provides fast convergence at high precision.
TL;DR: Two associative parallel algorithms for the edge update of a minimum spanning tree when an edge is deleted or inserted in the underlying graph are proposed and their time complexity is evaluated.
Abstract: In this paper we propose two associative parallel algorithms for the edge update of a minimum spanning tree when an edge is deleted or inserted in the underlying graph. These algorithms are represented as the corresponding procedures implemented on a model of associative parallel systems of the SIMD type with vertical data processing (the STAR–machine). We justify correctness of these procedures and evaluate their time complexity.
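For readers unfamiliar with the update problem, here is a sequential sketch of the insertion case. It is the textbook cycle rule (the new edge closes a cycle in the tree, and the heaviest edge on that cycle is dropped), not the associative parallel procedure for the STAR-machine that the paper proposes. The function name and the edge representation `(u, v, weight)` are assumptions.

```python
def insert_edge(tree_edges, new_edge):
    """Update a minimum spanning tree after inserting `new_edge`.

    tree_edges: list of (u, v, weight) forming a spanning tree.
    Returns a new list of tree edges.
    """
    u, v, w = new_edge
    adjacency = {}
    for a, b, weight in tree_edges:
        adjacency.setdefault(a, []).append((b, weight))
        adjacency.setdefault(b, []).append((a, weight))

    # Depth-first search from u, remembering how each vertex was reached.
    parent = {u: None}
    stack = [u]
    while stack:
        node = stack.pop()
        for neighbour, weight in adjacency.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = (node, weight)
                stack.append(neighbour)
    if v not in parent:
        # v was not reachable, so the new edge joins two components.
        return list(tree_edges) + [new_edge]

    # Walk the tree path from v back to u and find its heaviest edge.
    heaviest = None
    node = v
    while parent[node] is not None:
        previous, weight = parent[node]
        if heaviest is None or weight > heaviest[2]:
            heaviest = (previous, node, weight)
        node = previous

    if heaviest is None or w >= heaviest[2]:
        return list(tree_edges)  # the new edge does not improve the tree
    a, b, _ = heaviest
    kept = [e for e in tree_edges if {e[0], e[1]} != {a, b}]
    return kept + [new_edge]
```

The deletion case is symmetric in spirit: removing a tree edge splits the tree into two parts, and the lightest remaining graph edge connecting them is added.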
TL;DR: This paper presents a novel parallel extension to an EST clustering program, UIcluster4, that incorporates alternative splicing information and a new parallelization strategy; the results are compared to other parallelized EST clustering systems.
Abstract: One of the fundamental components of large-scale gene discovery projects is that of clustering of Expressed Sequence Tags (ESTs) from complementary DNA (cDNA) clone libraries. Clustering is used to create non-redundant catalogs and indices of these sequences. In particular, clustering of ESTs is frequently used to estimate the number of genes derived from cDNA-based gene discovery efforts. This paper presents a novel parallel extension to an EST clustering program, UIcluster4, that incorporates alternative splicing information and a new parallelization strategy. The results are compared to other parallelized EST clustering systems in terms of overall processing time and in accuracy of the resulting clustering.
TL;DR: By way of experiments in MC#, a series of parallel programs was written, such as a computation of Fibonacci numbers, traversal of a binary tree, computation of primes by the sieve of Eratosthenes, calculation of the Mandelbrot set, modeling Conway’s game “Life”, etc.
Abstract: MC# is a programming language for cluster and GRID architectures based on the asynchronous parallel programming model adopted in the Polyphonic C# language (N.Benton, L.Cardelli, C.Fournet; Microsoft Research, Cambridge, UK). Asynchronous methods of Polyphonic C# play two major roles in MC#: 1) as autonomous methods executed on remote machines, and 2) as methods used for delivering messages. The former are identified in MC# as the “movable methods”, and the latter form a special syntactic class with the elements named “channels”. Similar to Polyphonic C#, chords are used for defining the channels and as a synchronization mechanism. The MC# channels are generalised naturally to “bidirectional channels”, which may be used both for sending and receiving messages in the movable methods. The basic operation of the MC# runtime system is copying the object that is scheduled for execution on a remote machine. This copy is “dead” after the movable method has finished its work, and changes to the remote copy are not transferred to the original object. Arguments of the movable method are copied together with the original object, but bidirectional channels are passed by transferring proxies for such channels. By way of experiments in MC#, we have written a series of parallel programs such as a computation of Fibonacci numbers, traversal of a binary tree, computation of primes by the sieve of Eratosthenes, calculation of the Mandelbrot set, modeling Conway’s game “Life”, etc. In all these cases, we obtained easily readable and compact code. We also have an experimental implementation in which the compiler is written in SML.NET, and the execution of movable methods on remote machines is based on the Reflection library of the .NET platform.
TL;DR: This paper presents a Grid-based Parallel and Distributed Simulation environment (GPDS), a 3-tier architecture which consists of clients at the front end, an interaction server in the middle, and a network of computing resources at the back end.
Abstract: Although parallel and distributed computing for large-scale simulation has many advantages in speed and efficiency, it is difficult for a parallel and distributed application to achieve its expected performance because of obstacles such as insufficient computing power and weaknesses in fault tolerance and security. Motivated by these concerns, we present a Grid-based Parallel and Distributed Simulation environment (GPDS) which not only addresses these problems but also supports transparency and scalability using Grid technologies. GPDS supports a 3-tier architecture which consists of clients at the front end, an interaction server in the middle, and a network of computing resources at the back end. Grid and simulation agents in the interaction server enable a client to transparently perform a large-scale object-oriented simulation by automatically distributing the relevant simulation objects among the computing resources, while supporting scalability and fault tolerance through load balancing and dynamic migration schemes.
TL;DR: This paper presents a parallel version of a recently proposed algorithm that has the ability to scale very well in parallel environments.
Abstract: Clustering can be defined as the process of partitioning a set of patterns into disjoint and homogeneous meaningful groups (clusters). There is a growing need for parallel algorithms in this field since databases of huge size are common nowadays. This paper presents a parallel version of a recently proposed algorithm that has the ability to scale very well in parallel environments.
TL;DR: The Intel® Debugger achieves better startup time and user response time than conventional parallel debuggers by setting up a tree-like debugger network, which has a higher degree of parallelism and scalability than a flat network.
Abstract: In addition to being a quality symbolic debugger for serial IA32 and IPF Linux applications written in C, C++, and Fortran, the Intel® Debugger is also capable of debugging parallel applications that use Pthreads, OpenMP, and MPI. When debugging an MPI application, the Intel® Debugger achieves better startup time and user response time than conventional parallel debuggers by (1) setting up a tree-like debugger network, which has a higher degree of parallelism and scalability than a flat network, and (2) employing a message aggregation mechanism to reduce the amount of data flowing in the network. This parallel debugging architecture can be further enhanced to support the debugging of mixed-mode and heterogeneous parallel applications. Moreover, a generalized version of this architecture can be applied in areas other than debugging, such as performance profiling of parallel applications.
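The two mechanisms named in the abstract can be illustrated with a toy model. The abstract does not say how the Intel® Debugger builds its tree or aggregates messages, so everything below is an assumption: a balanced tree with a fixed fan-out, and aggregation that merges identical replies into one message tagged with the ranks that sent it. Both function names are hypothetical.

```python
from collections import defaultdict


def tree_levels(num_leaves, fanout):
    """Number of forwarding levels a balanced tree needs to reach all leaves.

    In a flat network the root contacts every leaf itself, so the work at
    the root grows linearly with num_leaves; in a tree it grows with
    fanout times the number of levels.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    levels, capacity = 0, 1
    while capacity < num_leaves:
        capacity *= fanout
        levels += 1
    return levels


def aggregate(messages):
    """Merge identical replies: [(rank, text), ...] -> {text: [ranks]}."""
    grouped = defaultdict(list)
    for rank, text in messages:
        grouped[text].append(rank)
    return {text: sorted(ranks) for text, ranks in grouped.items()}
```

With 1024 processes and a fan-out of 8, the root talks to 8 children over 4 levels instead of 1024 processes directly, and replies such as "stopped at the same breakpoint" collapse into one entry per distinct text.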
TL;DR: The focus is on providing a benchmark suite which is flexible and easy to port on (novel) systems, yet complete enough to expose the main difficulties which are encountered when dealing with sparse matrices.
Abstract: In this paper we present the Delft Sparse Architecture Benchmark (D-SAB) Suite for evaluating sparse matrix architectures. The focus is on providing a benchmark suite which is flexible and easy to port to (novel) systems, yet complete enough to expose the main difficulties which are encountered when dealing with sparse matrices. The novelty compared to previous benchmarks is that it is not limited by the need for a compiler. The D-SAB comprises two parts: (1) the benchmark algorithms and (2) the sparse matrix set. The benchmark algorithms (operations) are categorized into (a) value related operations and (b) position related operations.
TL;DR: This work investigates this set of conditions from a complexity perspective, and shows that it actually consists of a hierarchy of classes of conditions, \(\mathcal{E}^{[\delta]}_{f_c,f_e}\), where \(\delta\) is the degree of the condition (\(0 \le \delta \le f_c\)), each class being contained in the previous one.
Abstract: The condition based approach consists in identifying sets of input vectors, called conditions, for which it is possible to design a protocol solving a distributed computing problem despite failures. In a recent work we have applied the condition based approach to the interactive consistency (IC) problem (the agreement problem where the processes have to agree on the vector of proposed values), and provided a characterization of the conditions that allow us to solve it in presence of up to \(f_c\) process crashes and \(f_e\) erroneous proposals. We have shown that these conditions correspond exactly to error correcting codes, where the errors can be erasures or modified values. Here, we investigate this set of conditions from a complexity perspective, and show that it actually consists of a hierarchy of classes of conditions, \(\mathcal{E}^{[\delta]}_{f_c,f_e}\), where \(\delta\) is the degree of the condition (\(0 \le \delta \le f_c\)), each class being contained in the previous one (intuitively, the value \(f_c - \delta\) represents the “difficulty” of a class).
TL;DR: This work describes an experimental Grid programming system, focusing on improving the Java RMI mechanism and the predictability of Java performance in a Grid environment, using reusable algorithmic patterns, called skeletons.
Abstract: Grids aim to combine different kinds of computational resources connected by the Internet and make them easily available to a wide user community. While initial research focused on creating the enabling infrastructure, the challenge of programming the Grid has recently become increasingly important. The difficulties for application programmers lie in the highly heterogeneous and dynamic nature of Grid environments. We address this problem by employing reusable algorithmic patterns, called skeletons. Skeletons are used, in addition to the usual library functions, as generic algorithmic building blocks, customizable for particular applications. We describe an experimental Grid programming system, focusing on improving the Java RMI mechanism and the predictability of Java performance in a Grid environment.
TL;DR: The CPN model can be easily configured and used as a basis for constructing formal models of future TCP modifications, and it is shown how the model can be extended to represent the Adaptive Rate Transmission Control Protocol (ARTCP).
Abstract: The paper deals with modelling and analysis of the Transmission Control Protocol (TCP) by means of Coloured Petri Nets (CPN). We present our CPN model and examples of how correctness and performance issues of the TCP protocol can be studied. We show how this model can be extended to represent the Adaptive Rate Transmission Control Protocol (ARTCP). Our model can be easily configured and used as a basis for constructing formal models of future TCP modifications.
TL;DR: Accuracy and stability properties of fine-grained parallel computations, based on modeling spatial dynamics by cellular automata (CA) evolution, are studied and methods of providing admissible accuracy are proposed.
Abstract: Accuracy and stability properties of fine-grained parallel computations, based on modeling spatial dynamics by cellular automata (CA) evolution, are studied. The problem arises when phenomena under simulation are represented as a composition of a CA and a function given in real numbers, and the whole computation process is transferred into a Boolean domain. To approach the problem, the accuracy of approximating real spatial functions by Boolean arrays, as well as of some operations on cellular arrays with different data types, is determined and approximation errors are assessed. Some methods of providing admissible accuracy are proposed. Stability is shown to depend only on the nonlinear terms in hybrid methods, the use of CA-diffusion instead of the Laplace operator having no effect on it. Some experimental results supporting the theoretical conclusions are presented.
TL;DR: An efficient (un)marshaling framework is presented, designed for distributed applications implemented in languages such as C++, that is actually used in Mozart, a distributed programming system that implements Oz, a multi-paradigm concurrent language.
Abstract: An efficient (un)marshaling framework is presented. It is designed for distributed applications implemented in languages such as C++. A marshaler/unmarshaler pair converts arbitrary structured data between its host and network representations. This technology can also be used for persistent storage. Our framework simplifies the design of efficient and flexible marshalers. The network latency is reduced by concurrent execution of (un)marshaling and network operations. The framework is actually used in Mozart, a distributed programming system that implements Oz, a multi-paradigm concurrent language. Mozart, including the implementation of the framework, is available at www.mozart-oz.org.
TL;DR: The impact of Disk I/O workload on the CPU workload is examined using the Storage Weather Service (SWS), a performance monitoring system with on-line performance forecast functions that provides forecast information to the Grid.
Abstract: From investigations on the impact of Disk I/O load on CPU load, we have found that the immanent Disk I/O load could affect the resource scheduler’s decision on assigning an appropriate storage resource to a job in which the Disk I/O operation is dominant. A possible but improper assignment can prolong the execution time of a task due to the contention for Disk I/O when the Disk I/O load in the machine is higher than the CPU load. Because the scheduler uses CPU load only for computing schedules, it does not even know the potential Disk I/O contention that could occur at the assigned resource. To avoid or at least alleviate these effects, we have developed a performance monitoring system and on-line performance forecast functions for providing forecast information to the Grid. In this paper, we examine the impact of Disk I/O workload on the CPU workload using our system, hereinafter referred to as Storage Weather Service(SWS). We evaluate several prediction methods in order to get an insight on varying Disk I/O workload.
TL;DR: An ongoing project whose aim is to develop a set of tools to support the development and execution of SCA applications is outlined, and the algorithm designed and implemented to manage field diffusion throughout structurally non-uniform environments is described.
Abstract: This paper presents Situated Cellular Agents (SCA), a special class of Multilayered Multi Agent Situated Systems (MMASS). Situated Cellular Agents are systems of reactive agents that are heterogeneous (i.e. characterized by different behavior and perceptive capabilities), and populate a single layered structured environment. The structure of this environment is defined as a non-uniform network of sites in which the agents are situated. The behavior of Situated Cellular Agents (i.e. change of state and position) is influenced by states and types of agents that are situated in adjacent and at-a-distance sites. The paper outlines an ongoing project whose aim is to develop a set of tools to support the development and execution of SCA applications. In particular, it describes the algorithm designed and implemented to manage field diffusion throughout structurally non-uniform environments.
TL;DR: This paper adopts the internal structure of a typical Internet e-mail system for a single server, called the MTA-MDA structure, to the proposed system architecture for the low cost requirements.
Abstract: A large-scale e-mail service provider requires a highly scalable and available e-mail system to accommodate the increasing volume of e-mail traffic as well as the increasing number of e-mail users. To reduce the system development and maintenance cost, the system should be modular and built from off-the-shelf components. In this paper, we propose a cluster-based e-mail system architecture to achieve the goals of high scalability and availability, and low development and maintenance cost. We adopt the internal structure of a typical Internet e-mail system for a single server, called the MTA-MDA structure, to the proposed system architecture for the low cost requirements. We have implemented four different system configurations with the MTA-MDA structure and compared their performance. Experimental results show that the proposed system architecture achieves all the design objectives.
TL;DR: This paper demonstrates the simplicity of integrating LMC (Load Managed CORBA) into a real-world application and how it enhances the performance and scalability of DAxML.
Abstract: High performance computing in bioinformatics has led to important progress in the field of genome analysis. Due to the huge amount of data and the complexity of the underlying algorithms many problems can only be solved by using supercomputers. In this paper we present DAxML, a program for the distributed computation of evolutionary trees. In contrast to prior approaches DAxML runs on a cluster of workstations instead of an expensive supercomputer. For this purpose we transformed PAxML, a fast parallel phylogeny program incorporating novel algorithmic optimizations, into a distributed application. DAxML uses modern object-oriented middleware instead of message-passing communication in order to reduce the development and maintenance costs. Our goal is to provide DAxML to a broad range of users, in particular those who do not have supercomputers at their disposal. We ensure high performance and scalability by applying a high-level load management service called LMC (Load Managed CORBA). LMC provides transparent system level load management by integrating the load management functionality directly into the ORB. In this paper we demonstrate the simplicity of integrating LMC into a real-world application and how it enhances the performance and scalability of DAxML.
TL;DR: This work proposes a two-level architecture of a hierarchical workflow management system modelled and developed in such a state-of-the-art coordination language.
Abstract: We argue for the need to use control-based, event-driven and state-defined coordination models and associated languages in modelling and automating business processes (workflows). We propose a two-level architecture of a hierarchical workflow management system modelled and developed in such a state-of-the-art coordination language. The main advantage of a hierarchical, coordination-based architecture is that individual workflow entities can be easily replaced with others, without disrupting the overall workflow process. Each individual workflow entity exhibits a certain degree of flexibility and autonomy. This makes possible the construction of workflow systems that bring further improvements to process automation and dynamic management, such as dynamic (re-) allocation of activities to actors, reusability of coordination (collaboration) patterns, etc. A case study is presented to demonstrate the use of our approach.
TL;DR: An algorithm to find an efficient scheduling and distribution of data and operations to virtual processors is presented; it reduces the scheduling and alignment problems to solving linear algebraic equations.
Abstract: The paper is devoted to the problem of mapping affine loop nests onto distributed memory parallel computers. An algorithm to find an efficient scheduling and distribution of data and operations to virtual processors is presented. It reduces the scheduling and alignment problems to solving linear algebraic equations. The algorithm finds the maximal degree of pipelined parallelism and tries to minimize the number of nonlocal communications.
TL;DR: This paper presents a formal specification of a real-time scheduler based on SyncCharts, specifying a scheduler whose policies are rate monotonic scheduling and the Priority Ceiling Protocol, and performs schedulability analysis by formal verification.
Abstract: The reliability of safety-critical embedded real-time systems depends partly on that of the system design. Because of this, formal methods have been adopted in the design phase of developing such systems, and various kinds of formal methods have been introduced and used in practice. Many successful results have been published for application systems and software. However, studies on formal specification of embedded kernel components, such as schedulers, are relatively few due to the complexity of the software. In this paper, we present a formal specification of a real-time scheduler based on SyncCharts. We specify a scheduler whose policies are rate monotonic scheduling and the Priority Ceiling Protocol, and perform schedulability analysis by formal verification. Once the requirements of the real-time scheduler and the timing properties of the given tasks are satisfied, real code can be generated automatically and, we believe, ported to a real target platform.
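As background on the rate monotonic policy mentioned above, the sketch below assigns priorities by period and applies the classical Liu and Layland utilization bound, U ≤ n(2^(1/n) − 1). This is a different, much coarser technique than the SyncCharts-based formal verification the paper performs, and it ignores blocking under the Priority Ceiling Protocol. The task representation `(execution_time, period)` and the function names are assumptions.

```python
def rm_priority_order(tasks):
    """Rate monotonic order: the shorter the period, the higher the priority.

    tasks: list of (execution_time, period) pairs.
    """
    return sorted(tasks, key=lambda task: task[1])


def rm_utilization_test(tasks):
    """Sufficient (not necessary) schedulability test for independent
    periodic tasks under rate monotonic scheduling.

    Returns (utilization, bound, passes). A failed test is inconclusive:
    the task set may still be schedulable.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return utilization, bound, utilization <= bound
```

For example, three tasks `(1, 4)`, `(1, 5)`, `(2, 10)` have utilization 0.65, which is below the three-task bound of about 0.78, so the set passes the test.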
TL;DR: The paper presents an approach to the performance analysis of heterogeneous parallel algorithms: compare the heterogeneous algorithm with its homogeneous prototype, and assess the heterogeneous modification rather than analyse the algorithm as an isolated entity.
Abstract: The paper presents an approach to the performance analysis of heterogeneous parallel algorithms. As a typical heterogeneous parallel algorithm is just a modification of some homogeneous one, the idea is to compare the heterogeneous algorithm with its homogeneous prototype, and to assess the heterogeneous modification rather than to analyse the algorithm as an isolated entity. A criterion of optimality of heterogeneous parallel algorithms is suggested. A parallel algorithm of matrix multiplication on heterogeneous clusters is used to demonstrate the proposed approach.
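The comparison between a heterogeneous algorithm and its homogeneous prototype can be made concrete with a toy model. The sketch assumes a row-wise distribution of the matrix, a time model in which a processor of speed s handles r rows in r / s time units, and no communication cost; none of this is taken from the paper, and the function names are hypothetical.

```python
def even_rows(n_rows, count):
    """Homogeneous prototype: every processor gets (almost) the same share."""
    base, extra = divmod(n_rows, count)
    return [base + (1 if i < extra else 0) for i in range(count)]


def proportional_rows(n_rows, speeds):
    """Heterogeneous modification: shares proportional to processor speed."""
    total = sum(speeds)
    shares = [int(n_rows * s / total) for s in speeds]
    # Hand out the rows lost to rounding, one at a time, to the processor
    # that would still finish earliest after receiving one more row.
    for _ in range(n_rows - sum(shares)):
        i = min(range(len(speeds)), key=lambda j: (shares[j] + 1) / speeds[j])
        shares[i] += 1
    return shares


def finish_time(shares, speeds):
    """Parallel time under the assumed model: the slowest processor decides."""
    return max(rows / speed for rows, speed in zip(shares, speeds))
```

With 100 rows and speeds 1, 2 and 5, the even split finishes when the slowest processor is done with 34 rows, while the proportional split keeps all three processors busy for roughly the same time.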
TL;DR: This paper presents a novel single system image (SSI) architecture for cluster systems, called Glosim, implemented in the kernel layer of the operating system, which modifies the system calls related to IPC objects and process signals.
Abstract: This paper presents a novel single system image (SSI) architecture for cluster systems, called Glosim. It is implemented in the kernel layer of the operating system, and modifies the system calls related to IPC objects and process signals. This system supports not only global IPC objects, including message queues, semaphores and shared memory, but also a new concept of global working process. Combined with Linux Virtual Server and a single IO space, it constructs a complete high performance cluster network server with SSI.
TL;DR: Two authentication primitives proposed in a calculus for distributed systems offer a way of abstracting from various specifications of authentication and obtaining idealized protocols “secure by construction”.
Abstract: We advocate here the use of two authentication primitives we recently proposed in a calculus for distributed systems, as a further instrument for programmers interested in authentication. These primitives offer a way of abstracting from various specifications of authentication and obtaining idealized protocols “secure by construction”. We can consequently prove that a cryptographic protocol is the correct implementation of the corresponding abstract protocol; when the proof fails, reasoning on the abstract specification may lead to the correct implementation.
TL;DR: The architecture and implementation of the distributed simulation system DSS, realized for the parallel computer RM600-E30, are described, and directions for the further development of this system are defined.
Abstract: Problems in the development of distributed simulation systems are discussed in this paper. The architecture and implementation of the distributed simulation system DSS, realized for the parallel computer RM600-E30, are described. Directions for the further development of this system are defined.