TL;DR: An examination of OpenMP-based auto-parallelizers on the NAS Parallel Benchmarks underlines the need for a user-interactive environment that highlights the problems arising during parallelization, and for keeping manual code changes to a minimum when making problematic code sections amenable to parallelization.
Abstract: This paper examines OpenMP-based auto-parallelizers and the limitations they encounter when parallelizing the NAS Parallel Benchmarks. It also elucidates the issues the parallelizers face during parallelization and the resolutions that overcome them. Compute-intensive loops are pinpointed using Gprof, and the problematic loops within the hotspot areas are identified. Our work concentrates on identifying the pitfalls within the located hotspots and providing solutions in such cases. The measured speedup and the reasons behind it are analyzed. This paper underlines the need for a user-interactive environment that highlights the problems arising during parallelization. It also underscores that resolving the problematic code sections and making them amenable to parallelization should require only minimal manual code changes.
TL;DR: This article presents a signature-free distributed algorithm which builds an atomic read/write shared memory on top of a fully connected peer-to-peer n-process asynchronous message-passing system in which up to t
Abstract: This article presents a signature-free distributed algorithm which builds an atomic read/write shared memory on top of a fully connected peer-to-peer n-process asynchronous message-passing system in which up to t
TL;DR: Describes the architecture and functional capabilities of an automated toolkit for creating service-oriented applications from applied program packages, and the multi-agent control of their parallel execution in a heterogeneous distributed computing environment (HDCE).
Abstract: High-performance computing technologies for solving scientific problems are actively improving. The purpose of our research is to develop a toolkit for constructing and executing scientific service-oriented applications in a heterogeneous distributed computing environment (HDCE). These tools give subject-domain experts access to high-capacity computing resources, let them use these resources without extensive knowledge of computing architecture and low-level software, and support parallel execution of the user application on the basis of service-oriented technology and multi-agent control. We describe the architecture and functional capabilities of an automated toolkit for creating service-oriented applications based on an applied program package, and for multi-agent control of the parallel execution of such applications in the HDCE. We demonstrate the creation of a web application for parametric feedback synthesis of a linear dynamic object with these tools. The proposed technology simplifies service creation and provides qualitatively new opportunities for controlling parallel high-performance computations.
TL;DR: The Gauss-Seidel algorithm optimized by a modified hyperplane method is 2.5 times faster than the non-optimized version; parallelized by data placement with overlaps, it achieves a 28-fold speedup on 16 processors over the non-optimized sequential algorithm.
Abstract: This paper presents a modification of L. Lamport's hyperplane method that improves temporal data locality. The Gauss-Seidel algorithm optimized by the modified hyperplane method is 2.5 times faster than the non-optimized version. This algorithm was parallelized using the technique of data placement with overlaps, and we obtained a 28-fold speedup on 16 processors in comparison with the non-optimized sequential algorithm.
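To illustrate the idea behind hyperplane ordering, here is a minimal Python sketch of the classic (unmodified) scheme, not the locality-optimized variant the paper proposes. It assumes a 2D grid and a 5-point averaging stencil; the function names are hypothetical. In a Gauss-Seidel sweep, every interior point on the hyperplane i + j = d reads already-updated values from hyperplane d - 1 and not-yet-updated values from hyperplane d + 1, so all points on one hyperplane can be updated independently.

```python
def gauss_seidel_lexicographic(u):
    """One in-place Gauss-Seidel sweep over interior points in row-major order."""
    n, m = len(u), len(u[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return u


def gauss_seidel_hyperplane(u):
    """The same sweep, reordered by hyperplanes i + j = d."""
    n, m = len(u), len(u[0])
    for d in range(2, (n - 2) + (m - 2) + 1):
        # Points on this hyperplane do not depend on each other,
        # so this inner loop is the part that could run in parallel.
        for i in range(max(1, d - (m - 2)), min(n - 2, d - 1) + 1):
            j = d - i
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return u


# Arbitrary test grid (hypothetical data): both orderings give the same result.
grid_a = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(5)]
grid_b = [row[:] for row in grid_a]
same = gauss_seidel_lexicographic(grid_a) == gauss_seidel_hyperplane(grid_b)
```

Because each point sees exactly the same neighbor values in both orderings, the reordered sweep reproduces the sequential result while exposing one parallel loop per hyperplane.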
TL;DR: A probabilistic but efficient causal broadcast mechanism for large systems with changing membership that uses few integer timestamps is proposed.
Abstract: Causal broadcast is a classical communication primitive that has been studied for more than three decades, and several implementations have been proposed. The implementation of such a primitive has a non-negligible cost, either in terms of the extra information messages have to carry or in the time delays needed for the delivery of messages. It has been proved that messages need to carry control information whose size is linear in the size of the system. This problem has gained more interest with new application domains such as collaborative applications, which are widely used and are becoming massive, and the social semantic web and linked data, whose implementation needs causal ordering of messages. This paper proposes a probabilistic but efficient causal broadcast mechanism for large systems with changing membership that uses few integer timestamps.
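For context, the linear-size control information mentioned above can be seen in the classical vector-clock implementation of causal broadcast. The sketch below shows that baseline only, not the probabilistic mechanism the paper proposes; the class and method names are hypothetical, and message transport is assumed to be handled elsewhere.

```python
class CausalProcess:
    """Classical causal delivery with one vector-clock entry per process."""

    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n   # clock[k] = number of messages from k delivered here
        self.pending = []      # received but not yet deliverable
        self.delivered = []    # payloads in delivery order

    def broadcast(self, payload):
        # The sender delivers its own message immediately.
        self.clock[self.pid] += 1
        self.delivered.append(payload)
        return (self.pid, list(self.clock), payload)  # stamp has n entries

    def _deliverable(self, sender, stamp):
        # Next message from the sender, and every message it causally
        # depends on has already been delivered locally.
        return stamp[sender] == self.clock[sender] + 1 and all(
            stamp[k] <= self.clock[k] for k in range(len(stamp)) if k != sender
        )

    def receive(self, message):
        self.pending.append(message)
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                sender, stamp, payload = msg
                if self._deliverable(sender, stamp):
                    self.pending.remove(msg)
                    self.clock[sender] += 1
                    self.delivered.append(payload)
                    progress = True


# m2 causally follows m1, but reaches process 2 first.
p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
m1 = p0.broadcast("m1")
p1.receive(m1)
m2 = p1.broadcast("m2")
p2.receive(m2)   # buffered: m1 has not been delivered yet
p2.receive(m1)   # delivers m1, then the buffered m2
```

Every message carries a full n-entry stamp here, which is exactly the cost that motivates mechanisms using only a few integer timestamps.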
TL;DR: The scalability of the GeRa code is analyzed on computer platforms ranging from a multicore laptop to the Lomonosov supercomputer, and the parallel efficiency of different linear solvers in the INMOST framework is compared.
Abstract: The hydrogeological modeling code GeRa is based on the INMOST software platform, which operates on distributed mesh data and supports assembling and solving systems of linear equations. A set of groundwater flow models with filtration, transport, and chemical processes is considered. The parallel efficiency of different linear solvers in the INMOST framework is compared. The scalability of the GeRa code is analyzed on computer platforms ranging from a multicore laptop to the Lomonosov supercomputer.
TL;DR: A 2D cellular automaton with moving agents is considered, with the objective of finding agents controlled by a Finite State Program (FSP) that can form domino patterns.
Abstract: Considered is a 2D cellular automaton with moving agents. The objective is to find agents controlled by a Finite State Program (FSP) that can form domino patterns. The quality of a formed pattern is measured by the degree of order, computed by counting matching \(3 \times 3\) patterns (templates). The class of domino patterns is defined by four templates. An agent reacts to its own color, the color in front, and whether it is blocked or not. It can change the color, move or not, and turn in any direction. Four FSPs were evolved for multi-agent systems with 1, 2, or 4 agents initially placed in the corners of the field. For a \(12 \times 12\) training field, the target pattern could be formed with a 100% degree of order. The performance was also high with other field sizes. Livelocks are avoided by using three different variants of the evolved FSP. The degree of order usually fluctuates after reaching a certain threshold, but it can also be stable, and the agents may signal termination by running in a cycle or by stopping their activity.
TL;DR: The accuracy of the used network model helped to reveal drawbacks of simpler models commonly used for studying scheduling algorithms and developed open source simulation framework based on SimGrid toolkit allowed us to perform a large number of experiments in a reasonable amount of time and to ensure reproducible results.
Abstract: The paper studies the efficiency of nine state-of-the-art algorithms for scheduling workflow applications in heterogeneous computing systems (HCS). The algorithms are compared on the basis of discrete-event simulation for a wide range of workflow and system configurations. The developed open source simulation framework, based on the SimGrid toolkit, allowed us to perform a large number of experiments in a reasonable amount of time and to ensure reproducible results. The accuracy of the network model used helped to reveal drawbacks of simpler models commonly used for studying scheduling algorithms.
TL;DR: Parallel methods for computing diameter-constrained network reliability are introduced, based on the well-known factoring method and on its modification proposed by H. Cancela and L. Petingi; analysis of numerical experiments made it possible to set some important parameters of the parallel algorithm to speed up calculations.
Abstract: The problem of calculating network reliability under a diameter constraint is studied. Computing this characteristic is known to be NP-hard. We introduce parallel methods based on the well-known factoring method and on the modification of the factoring method proposed by H. Cancela and L. Petingi. The analysis of the numerical experiments has allowed us to set some important parameters of the parallel algorithm for speeding up calculations.
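As a rough illustration of the basic factoring idea (not the Cancela-Petingi modification and not the paper's parallel algorithm), the sketch below computes the probability that two terminals s and t are joined by a path of at most d operational edges. It conditions on one undecided edge at a time: R = p_e * R(e works) + (1 - p_e) * R(e fails). The function names and the example graph are assumptions made for this sketch.

```python
from collections import deque


def shortest_path_len(n, edges, s, t):
    """Breadth-first search; returns the hop count from s to t, or None."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def dc_reliability(n, edges, probs, s, t, d, up=(), undecided=None):
    """Diameter-constrained s-t reliability by plain factoring."""
    if undecided is None:
        undecided = tuple(range(len(edges)))
    up_edges = [edges[i] for i in up]
    # Success is already certain: working edges alone give a short enough path.
    length = shortest_path_len(n, up_edges, s, t)
    if length is not None and length <= d:
        return 1.0
    # Failure is already certain: even with all undecided edges working,
    # no path of length <= d exists.
    length = shortest_path_len(n, up_edges + [edges[i] for i in undecided], s, t)
    if length is None or length > d:
        return 0.0
    e, rest = undecided[0], undecided[1:]
    # The two conditioned subproblems are independent of each other.
    works = dc_reliability(n, edges, probs, s, t, d, up + (e,), rest)
    fails = dc_reliability(n, edges, probs, s, t, d, up, rest)
    return probs[e] * works + (1 - probs[e]) * fails


# Triangle with edge reliability 0.9 (hypothetical example data).
triangle = [(0, 1), (1, 2), (0, 2)]
r_d1 = dc_reliability(3, triangle, [0.9] * 3, s=0, t=2, d=1)
r_d2 = dc_reliability(3, triangle, [0.9] * 3, s=0, t=2, d=2)
```

The recursion tree grows exponentially in the worst case, consistent with the NP-hardness noted above; since the two branches of each factoring step do not interact, they are natural units of parallel work.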
TL;DR: A model of a geographically distributed computing system with absolute job priorities is described, and a decentralized scheduling algorithm based on auction methods is designed; the first-price sealed-bid auction and the English auction are studied and compared experimentally.
Abstract: A model of a geographically distributed computing system with absolute job priorities is described in the paper. The authors designed a decentralized scheduling algorithm using auction methods. Two auction methods were studied and compared: the first-price sealed-bid auction and the English auction. The paper includes the results of an experimental comparison of the two auction methods.
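The two auction formats differ mainly in what the winner ends up paying. The sketch below shows the generic textbook rules only; how bids are derived from job priorities in the paper's scheduler is not specified here, so the bidder names, valuations, and the fixed price step are assumptions.

```python
def first_price_sealed_bid(bids):
    """Each bidder submits one hidden bid; the highest bid wins and pays that bid."""
    winner = max(bids, key=bids.get)
    return winner, bids[winner]


def english_auction(valuations, start=0, step=1):
    """Open ascending auction: the price rises by `step` until one bidder remains."""
    price = start
    active = [b for b in valuations if valuations[b] >= price]
    if not active:
        return None, None
    while len(active) > 1:
        next_price = price + step
        staying = [b for b in active if valuations[b] >= next_price]
        if not staying:
            # All remaining bidders drop out together: break the tie
            # in favor of the first one listed, at the current price.
            return active[0], price
        price, active = next_price, staying
    return active[0], price


# Hypothetical valuations; in the sealed-bid case bidders bid their valuation.
values = {"A": 10, "B": 7, "C": 4}
sealed = first_price_sealed_bid(values)
ascending = english_auction(values)
```

With these inputs the same bidder wins in both formats, but the English auction stops just above the second-highest valuation, whereas the sealed-bid winner pays its full bid.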
TL;DR: The Globalizer software system is described, which solves global optimization problems using a block multistage scheme of dimension reduction that combines Peano-curve-type evolvents with the multistage reduction scheme.
Abstract: In this paper, we describe the Globalizer software system for solving global optimization problems. The system implements an approach to solving global optimization problems using the block multistage scheme of dimension reduction, which combines the use of Peano-curve-type evolvents and the multistage reduction scheme. The scheme allows efficient parallelization of the computations and increases many times over the number of processors that can be employed in the parallel solution of global optimization problems.
TL;DR: A new software library for multiple-precision (integer and floating-point) and extended-range computations is considered; it targets heterogeneous CPU-GPU architectures, and its multiple-precision modules are built on the residue number system (RNS).
Abstract: In this paper, a new software library for multiple-precision (integer and floating-point) and extended-range computations is considered. The library is targeted at heterogeneous CPU-GPU architectures. The residue number system (RNS), which enables effective parallelization of arithmetic operations, underlies the library's multiple-precision modules. The paper deals with the supported number formats and the library features. An algorithm for selecting an RNS moduli set for a given precision of computations is also presented.
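To show why RNS arithmetic parallelizes well, here is a small Python sketch of the general technique; it is not the library's API, and the moduli set and function names are chosen only for illustration. A number is stored as its residues modulo pairwise coprime moduli, addition and multiplication act on each residue independently, and the Chinese Remainder Theorem recovers the result.

```python
from math import prod

# Pairwise coprime moduli (illustrative choice); dynamic range is 7*11*13*15 = 15015.
MODULI = (7, 11, 13, 15)


def to_rns(x, moduli=MODULI):
    """Represent x by its residue modulo each modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    # Each digit is computed independently: no carries between digits.
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction (needs Python 3.8+ for pow(x, -1, m))."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)
    return total % big_m


product = from_rns(rns_mul(to_rns(123), to_rns(45)))
total = from_rns(rns_add(to_rns(123), to_rns(45)))
```

Results are only correct while they stay below the product of the moduli, which is why selecting a moduli set for a required precision matters.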
TL;DR: This work estimates speedup by running the loops in scalar and vector modes for different data types and determines the loop classes that the compilers used in the study fail to vectorize.
Abstract: This paper evaluates the auto-vectorizing capabilities of the modern optimizing compilers Intel C/C++, GCC C/C++, LLVM/Clang, and PGI C/C++ on Intel 64 and Intel Xeon Phi architectures. We use the Extended Test Suite for Vectorizing Compilers, consisting of 151 loops. In this work, we estimate speedup by running the loops in scalar and vector modes for different data types and determine the loop classes that the compilers used in the study fail to vectorize. We use a dual-CPU system (NUMA, 2 x Intel Xeon E5-2620v4, Intel Broadwell microarchitecture) with an Intel Xeon Phi 3120A co-processor for our experiments.
TL;DR: DiamondTetris, an algorithm from the LRnLA family for stencil computation, is constructed for Many-Integrated-Core processors of the Xeon Phi family; its strong points are locality, efficient use of the memory hierarchy, and, most importantly, seamless vectorization.
Abstract: An algorithm from the LRnLA family, DiamondTetris, for stencil computation is constructed. It is aimed at Many-Integrated-Core processors of the Xeon Phi family. The algorithm and its implementation are described for a simulation based on the wave equation. Its strong points are locality, efficient use of the memory hierarchy, and, most importantly, seamless vectorization. Specifically, only 1 vector rearrange operation is necessary per cell value update. The performance is estimated with the roofline model. The algorithm is implemented in code and tested on Xeon and Xeon Phi machines.
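The roofline estimate mentioned above bounds attainable performance by the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. A minimal Python sketch follows; the machine numbers are hypothetical placeholders, not measurements from the paper.

```python
def roofline(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Attainable GFLOP/s under the roofline model."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
memory_bound = roofline(1000.0, 100.0, 2.0)    # low intensity: limited by bandwidth
compute_bound = roofline(1000.0, 100.0, 20.0)  # high intensity: limited by peak
```

Improving locality raises the arithmetic intensity seen by main memory, which moves a stencil code from the bandwidth-limited slope toward the compute ceiling.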
TL;DR: An experimental assessment of parallel algorithm efficiency was conducted by finding the numeric solution to several hundred randomly generated multidimensional multiextremal problems with non-convex constraints.
Abstract: This work considers a parallel algorithm for solving multiextremal problems with non-convex constraints. The distinctive feature of this algorithm, which does not use penalty functions, is the separate consideration of each problem constraint. The search process can be conducted by reducing the original multidimensional problem to a number of related one-dimensional problems and solving this set of problems in parallel. An experimental assessment of parallel algorithm efficiency was conducted by finding the numeric solution to several hundred randomly generated multidimensional multiextremal problems with non-convex constraints.
TL;DR: An efficient method is proposed for the parallel solution of time-consuming multicriterial optimization problems in which the optimality criteria can be multiextremal and computing the criteria values can require a large amount of computation.
Abstract: In the present paper, an efficient method is proposed for the parallel solution of time-consuming multicriterial optimization problems in which the optimality criteria can be multiextremal and computing the criteria values can require a large amount of computation. The proposed scheme of parallel computations allows obtaining several efficient decisions of a multicriterial problem. During the computations, maximum use is made of the search information. The results of the numerical experiments have demonstrated that this approach reduces the computational cost of solving multicriterial optimization problems substantially, by tens to hundreds of times.
TL;DR: This paper faces the challenge of harnessing the heterogeneous nature of SMP/NUMA communications for a complex scientific application which implements the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), consisting of a set of heterogeneous stencil computations.
Abstract: SMP/NUMA systems are powerful HPC platforms that can be applied to a wide range of real-life applications. These systems provide a large shared-memory capacity and allow using the shared-variable programming model to take advantage of shared memory for inter-process communication and synchronization. However, as data can be physically dispersed over many nodes, access to different data items may take significantly different times. In this paper, we face the challenge of harnessing the heterogeneous nature of SMP/NUMA communications for a complex scientific application which implements the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), consisting of a set of heterogeneous stencil computations.
TL;DR: A model of algal bloom causing suffocation in shallow waters is presented that takes into account the transport of the aquatic environment; microturbulent diffusion; gravitational sedimentation of pollutants and plankton; nonlinear interaction of plankton populations; biogenic, temperature, and oxygen regimes; and the influence of salinity.
Abstract: The model of algal bloom causing suffocation in shallow waters takes into account the following: the transport of the aquatic environment; microturbulent diffusion; gravitational sedimentation of pollutants and plankton; nonlinear interaction of plankton populations; biogenic, temperature, and oxygen regimes; and the influence of salinity. Using high-order accurate schemes to discretize the model significantly increases computational accuracy and decreases computation time. The practical significance lies in the software implementation of the proposed model; the limits and prospects of its practical use are defined. Experimental software was developed for a multiprocessor computer system and is intended for mathematical modeling of possible development scenarios of shallow-water ecosystems, using the Azov Sea in the case of suffocation as an example. In the parallel implementation, we used grid-domain decomposition methods for computationally laborious convection-diffusion problems, taking into account the architecture and parameters of the multiprocessor computer system. A further advantage of the developed software is the use of a hydrodynamical model that includes the equations of motion in all three coordinate directions.
TL;DR: The results of the hybrid metaheuristics approach (ACO+SA) are very near to the global optimal solutions, but they are produced much faster than using the deterministic Branch-and-Bound approach.
Abstract: We propose a metaheuristics-based approach to the optimal design of multi-product batch plants, with a particular application example of chemical-engineering systems. Our hybrid approach combines two metaheuristics: Ant Colony Optimization (ACO) and Simulated Annealing (SA). We develop a sequential implementation of the proposed method and we parallelize it on Graphics Processing Units (GPU) using the CUDA programming environment. We experimentally demonstrate that the results of our hybrid metaheuristic approach (ACO+SA) are very near to the global optimal solutions, but they are produced much faster than using the deterministic Branch-and-Bound approach.
TL;DR: A simulation model is developed for traffic in the software-defined network segments of virtual data centers that process user requests to cloud applications and services; it makes it possible to implement a traffic management algorithm for cloud applications and to optimize access to storage systems through effective use of data transmission channels.
Abstract: One current optimization problem is the control of traffic for cloud applications and services in the network environment of a virtual data center. Given the multitier architecture of modern data centers, this task needs special attention. An advantage of modern infrastructure virtualization is the possibility of using software-defined networks and software-defined data storage. However, existing algorithmic optimization solutions do not take into account the specific features of routing heterogeneous network traffic with multiple application types. The task of optimizing traffic distribution for cloud applications and services can be solved by using the software-defined infrastructure of virtual data centers. We have developed a simulation model for traffic in the software-defined network segments of virtual data centers involved in processing user requests to cloud applications and services within a network environment. Our model makes it possible to implement a traffic management algorithm for cloud applications and to optimize access to storage systems through effective use of data transmission channels. In experimental studies, we found that our algorithm decreases the response time of cloud applications and services and, therefore, increases the throughput of user request processing and reduces the number of refusals.
TL;DR: In the years since its introduction, MapReduce has proved to be a very effective parallel programming technique for processing large volumes of data.
Abstract: In the years since its introduction, MapReduce has proved to be a very effective parallel programming technique for processing large volumes of data. The most prevalent implementations of MapReduce are the Hadoop framework and Google's proprietary MapReduce system.
TL;DR: An accelerated spam filtering mechanism that uses GPUs is presented; it utilizes an enhanced version of the Knuth-Morris-Pratt pattern matching algorithm that outperforms the serial versions by up to 12x and also performs more efficiently than other parallel versions.
Abstract: Spam filtering is one of the most important applications in email services and has become increasingly sophisticated due to the enormous usage of the Internet. Traditionally, spam filters have been implemented on the CPU with a pattern matching algorithm. In this paper, an accelerated spam filtering mechanism that uses GPUs is presented. The filtering process utilizes an enhanced version of the Knuth-Morris-Pratt pattern matching algorithm that outperforms the serial versions by up to 12x and also performs more efficiently than other parallel versions. The parallel algorithm is also used to develop an advanced keyword-based Naive Bayesian classifier, which speeds up spam filtering by up to 2 times compared to the CPU.
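For reference, the standard serial Knuth-Morris-Pratt algorithm that such filters start from can be sketched as follows. This is the textbook version, not the enhanced GPU variant the paper describes, and the sample message is invented for illustration.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence of pattern."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern without re-reading the text.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches


# Keyword lookup in a made-up message.
hits = kmp_search("win free money now, free offer", "free")
```

The text is scanned once, left to right, so the search runs in time linear in the combined length of text and pattern.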
TL;DR: The application of DVM and SAPFOR to automate the mapping of a 3D elastic wave simulation method onto high-performance heterogeneous clusters is considered; the efficiency and acceleration of the parallel program are estimated, and the performance of the DVMH-based program is compared with a program parallelized manually using MPI.
Abstract: The paper considers the application of DVM and SAPFOR to automate the mapping of a 3D elastic wave simulation method onto high-performance heterogeneous clusters. A distinctive feature of the proposed method is the use of a curved three-dimensional grid that is consistent with the geometry of the free surface. The use of curved grids considerably complicates both manual and automated parallelization. To solve this problem, a technique for mapping a curved grid onto a structured grid is presented. The sequential program, based on the finite difference method on a structured grid, has been parallelized using the Fortran-DVMH language. The SAPFOR analysis tools simplified this parallelization process. Features of the automated parallelization are described. The authors estimate the efficiency and acceleration of the parallel program and compare the performance of the DVMH-based program with a program obtained by manual parallelization using MPI.
TL;DR: The developed model takes into account the complex geometry of the steam turbine, does not require significant changes when design features are processed, and can be used to calculate thermal processes in other constructions such as turbines.
Abstract: The aim of this paper is to develop a mathematical model of thermal processes in a steam turbine, based on modern information technologies and computational methods, with the help of which the accuracy of thermal mode calculations can be improved. The practical significance of the paper is as follows: a model of thermal processes in a steam turbine is proposed and implemented, information about the temperature modes of the steam turbine is derived, and the limits and prospects of the proposed mathematical model are defined. The thermal processes in the turbine are characterized by a strong non-uniformity of the heat flow, which significantly influences the reliability and efficiency of the facility. As a rule, the influence of these parameters on the geometry is not considered when designing the system, which results in premature wear of the machine. The developed model takes into account the complex geometry of the steam turbine, does not require significant changes when design features are processed, and can be used to calculate thermal processes in other constructions such as turbines. A software solution was developed for two-dimensional simulation of thermal processes in a steam turbine that takes into account the occupancy of control volumes.
TL;DR: A cellular automaton model of the electrochemical oxidation of carbon is presented, simulating a two-dimensional sample of the electro-conductive carbon black “Ketjenblack ES DJ 600”, and the efficiency of its parallel implementation is analyzed.
Abstract: In this paper, we present a cellular automaton model of the electrochemical oxidation of carbon. A two-dimensional sample of the electro-conductive carbon black “Ketjenblack ES DJ 600” is simulated. In the model, the sample consists of ring-shaped carbon granules. Under the influence of electrochemical oxidation, the carbon granules are destroyed through a few successive stages. The rates of these oxidation stages are chosen to fit the simulation results to the experiment. As a result of the computer simulation of carbon electrochemical oxidation, the fractions of surface atoms and of atoms with different degrees of oxidation were calculated and compared with the experimental data. In addition, a parallel implementation of the cellular automaton simulating carbon corrosion is developed, and the efficiency of the parallel code is analyzed.
TL;DR: A fine-grained parallelization strategy is proposed that performs the work for each problem dimension in parallel; it is useful in computationally demanding optimization problems whose objective function has a very large number of dimensions.
Abstract: Particle Swarm Optimization (PSO) is a stochastic yet very robust metaheuristic. Real-world optimizations require a high computational effort to converge to a viable solution. In general, parallel PSO implementations provide good performance, but this depends on the parallelization strategy as well as on the number and/or characteristics of the exploited processors. In this paper, we propose a fine-grained parallelization strategy that focuses on the work done w.r.t. each of the problem dimensions and does it in parallel. Moreover, all particles act in parallel. This strategy is useful in computationally demanding optimization problems wherein the objective function has a very large number of dimensions. We map the computation onto three different parallel high-performance multiprocessor architectures, which are based on many-core and multi-core architectures. The performance of the proposed strategy is evaluated on four well-known high-dimensional benchmarks of different complexity. The obtained speedups are very promising.
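The per-dimension granularity can be seen in the standard PSO update rule, where each (particle, dimension) pair is updated from the previous iteration's state only. The sequential Python sketch below marks where that independence lies; it is an illustration of textbook synchronous PSO, not the paper's implementation, and the coefficients, swarm size, search range, and sphere benchmark are assumptions.

```python
import random


def sphere(x):
    """Simple benchmark objective with its minimum at the origin."""
    return sum(v * v for v in x)


def pso(objective, dim, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(particles), key=best_val.__getitem__)
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]
    for _ in range(iters):
        # Phase 1: every (particle, dimension) update reads only state from
        # the previous iteration, so these two loops could be flattened and
        # run in parallel (each worker would then need its own random stream).
        for p in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[p][d] = (w * vel[p][d]
                             + c1 * r1 * (best_pos[p][d] - pos[p][d])
                             + c2 * r2 * (gbest_pos[d] - pos[p][d]))
                pos[p][d] += vel[p][d]
        # Phase 2: evaluate and update personal and global bests.
        for p in range(particles):
            val = objective(pos[p])
            if val < best_val[p]:
                best_val[p], best_pos[p] = val, pos[p][:]
            if val < gbest_val:
                gbest_val, gbest_pos = val, pos[p][:]
    return gbest_pos, gbest_val


solution, value = pso(sphere, dim=5)
```

Separating the update phase from the evaluation phase keeps the global best fixed during the updates, which is what makes the per-dimension work independent.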
TL;DR: A fragmented numerical algorithm for the IADE method is designed in terms of a data-flow graph, and a performance comparison of different implementations of the algorithm, including LuNA and Message Passing Interface, is given.
Abstract: The fragmented programming system LuNA is based on the Fragmented Programming Technology. LuNA is a platform for building automatically tunable, portable libraries of parallel numerical subroutines. This paper focuses on the parallel implementation of the IADE method for solving a 1D partial differential equation (PDE) of parabolic type using the LuNA programming system. A fragmented numerical algorithm for the IADE method is designed in terms of a data-flow graph. A performance comparison of different implementations of the algorithm, including LuNA and Message Passing Interface, is given.
TL;DR: It will be shown how applications and context interactions can be better specified, analysed and controlled, with the help of some experiments done with a preliminary implementation of \(\text {ML}_\text {CoDa}\).
Abstract: Contextual information plays an increasingly crucial role in concurrent applications in the times of mobility and pervasiveness of computing. Context-Oriented Programming languages explicitly treat this kind of information. They provide primitive constructs to adapt the behaviour of a program, depending on the evolution of its operational environment, which is affected by other programs hosted therein independently and unpredictably. We discuss these issues and the challenges they pose, reporting on our recent work on \(\text {ML}_\text {CoDa}\), a language specifically designed for adaptation and equipped with a clear formal semantics and analysis tools. We will show how applications and context interactions can be better specified, analysed and controlled, with the help of some experiments done with a preliminary implementation of \(\text {ML}_\text {CoDa}\).
TL;DR: A conversion to a weighted string representation is proposed in this paper, together with a novel string kernel function called Kast Spectrum Kernel, which can be promisingly applied to other similarity problems involving tree-like structured data.
Abstract: Parallel I/O access patterns act as fingerprints of a parallel program. In order to extract meaningful information from these patterns, they have to be represented appropriately. Because string objects can be easily compared using kernel methods, a conversion to a weighted string representation is proposed in this paper, together with a novel string kernel function called the Kast Spectrum Kernel. The similarity matrices, obtained by applying this kernel to a set of examples from a real application, were analyzed using Kernel Principal Component Analysis (Kernel PCA) and Hierarchical Clustering. The evaluation showed that 2 out of 4 I/O access pattern groups were completely identified, while the other 2 formed a single cluster due to the intrinsic similarity of their members. The proposed strategy is a promising candidate for other similarity problems involving tree-like structured data.
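As background on string kernels, the sketch below implements the standard k-spectrum kernel, which counts the length-k substrings two strings share. It is not the Kast Spectrum Kernel introduced in the paper (whose weighting scheme is not reproduced here); the function names and toy strings are assumptions.

```python
from collections import Counter


def spectrum(s, k):
    """Count every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(a, b, k):
    """Inner product of the two k-spectrum count vectors."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    return sum(count * sb[sub] for sub, count in sa.items())


# Toy strings standing in for encoded access patterns.
self_similarity = spectrum_kernel("abab", "abab", 2)
cross_similarity = spectrum_kernel("abab", "baba", 2)
unrelated = spectrum_kernel("aaaa", "bbbb", 2)
```

A matrix of such pairwise values is the kind of similarity matrix that Kernel PCA or hierarchical clustering can then consume.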