TL;DR: This work presents a cost-efficient task-scheduling algorithm using two heuristic strategies that dynamically maps tasks to the most cost-efficient VMs based on the concept of Pareto dominance and reduces the monetary costs of non-critical tasks.
Abstract: Executing a large program using clouds is a promising approach, as this class of programs may be decomposed into multiple sequences of tasks that can be executed on multiple virtual machines (VMs) in a cloud. Such sequences of tasks can be represented as a directed acyclic graph (DAG), where nodes are tasks and edges are precedence constraints between tasks. Cloud users pay for what their programs actually use according to the pricing models of the cloud providers. Early task scheduling algorithms are focused on minimizing makespan, without mechanisms to reduce the monetary cost incurred in the setting of clouds. We present a cost-efficient task-scheduling algorithm using two heuristic strategies. The first strategy dynamically maps tasks to the most cost-efficient VMs based on the concept of Pareto dominance. The second strategy, a complement to the first strategy, reduces the monetary costs of non-critical tasks. We carry out extensive numerical experiments on large DAGs generated at random as well as on real applications. The simulation results show that our algorithm can substantially reduce monetary costs while producing makespan as good as the best known task-scheduling algorithm can provide.
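To make the notion of Pareto dominance concrete, the following is a minimal sketch, not the authors' algorithm. It assumes each candidate VM for a task is described by a hypothetical (name, finish time, monetary cost) triple with made-up values, and it keeps only the candidates that no other candidate dominates.

```python
from typing import List, Tuple

# Hypothetical candidate: (vm_name, finish_time, monetary_cost). Values are made up.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b in time and cost, and strictly better in one."""
    return a[1] <= b[1] and a[2] <= b[2] and (a[1] < b[1] or a[2] < b[2])


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


candidates = [
    ("vm_small", 10.0, 1.0),
    ("vm_medium", 6.0, 2.0),
    ("vm_large", 4.0, 5.0),
    ("vm_bad", 7.0, 3.0),  # slower and more expensive than vm_medium
]
front = pareto_front(candidates)
print([name for name, _, _ in front])
```

A scheduler would still need a rule for choosing among the non-dominated VMs (for example, the cheapest one that does not delay the critical path); that rule is outside this sketch.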
TL;DR: In this study, through analysis, a comprehensive survey for describing resource allocation in various HPCs is reported and the system classification is used to identify approaches followed by the implementation of existing resource allocation strategies that are widely presented in the literature.
Abstract: Classification of high performance computing (HPC) systems is provided. Current HPC paradigms and industrial application suites are discussed. State of the art in HPC resource allocation is reported. Hardware and software solutions are discussed for optimized HPC systems. An efficient resource allocation is a fundamental requirement in high performance computing (HPC) systems. Many projects are dedicated to large-scale distributed computing systems that have designed and developed resource allocation mechanisms with a variety of architectures and services. In our study, through analysis, a comprehensive survey for describing resource allocation in various HPCs is reported. The aim of the work is to aggregate under a joint framework, the existing solutions for HPC to provide a thorough analysis and characteristics of the resource management and allocation strategies. Resource allocation mechanisms and strategies play a vital role towards the performance improvement of all the HPC classifications. Therefore, a comprehensive discussion of widely used resource allocation strategies deployed in HPC environment is required, which is one of the motivations of this survey. Moreover, we have classified the HPC systems into three broad categories, namely: (a) cluster, (b) grid, and (c) cloud systems and define the characteristics of each class by extracting sets of common attributes. All of the aforementioned systems are cataloged into pure software and hybrid/hardware solutions. The system classification is used to identify approaches followed by the implementation of existing resource allocation strategies that are widely presented in the literature.
TL;DR: The results for strong and weak scaling analysis of incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial benefits in performance.
Abstract: We investigate multi-level parallelism on GPU clusters with MPI-CUDA and hybrid MPI-OpenMP-CUDA parallel implementations, in which all computations are done on the GPU using CUDA. We explore efficiency and scalability of incompressible flow computations using up to 256 GPUs on a problem with approximately 17.2 billion cells. Our work addresses some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that uses either MPI or MPI-OpenMP for communications. We present three different strategies to overlap computations with communications, and systematically assess their impact on parallel performance on two different GPU clusters. Our results for strong and weak scaling analysis of incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial benefits in performance. We also find that our tri-level MPI-OpenMP-CUDA parallel implementation does not offer a significant advantage in performance over the dual-level implementation on GPU clusters with two GPUs per node, but on clusters with higher GPU counts per node or with different domain decomposition strategies a tri-level implementation may exhibit higher efficiency than a dual-level implementation and needs to be investigated further.
TL;DR: A new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations is presented and compared to existing formats of the NVIDIA Cusp library using large sparse matrices.
Abstract: We propose the Sliced Coordinate Format (SCOO) for Sparse Matrix-Vector Multiplication on GPUs. An associated CUDA implementation which takes advantage of atomic operations is presented. We propose partitioning methods to transform a given sparse matrix into SCOO format. An efficient Dual-GPU implementation which overlaps computation and communication is described. Extensive performance comparisons of SCOO compared to other formats on GPUs and CPUs are provided. Existing formats for Sparse Matrix-Vector Multiplication (SpMV) on the GPU are outperforming their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR format for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers, the SCOO implementation for double-precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy-Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL Library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. Source code is available at https://github.com/danghvu/cudaSpmv.
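As background for why atomic operations matter here, the following is a minimal serial sketch of SpMV over the plain coordinate (COO) format. It is not the SCOO format or the authors' CUDA kernel: the slicing and partitioning steps are not shown, and the function name and the small test matrix are illustrative assumptions.

```python
from typing import List


def spmv_coo(rows: List[int], cols: List[int], vals: List[float],
             x: List[float], n_rows: int) -> List[float]:
    """Compute y = A @ x where A is stored as (row, col, value) triples."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        # Several nonzeros can share a row, so on a GPU with one thread per
        # nonzero this accumulation would have to be an atomic add.
        y[r] += v * x[c]
    return y


# Illustrative 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]].
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_coo(rows, cols, vals, x, n_rows=3)
print(y)
```

The accumulation into `y[r]` is the step where concurrent writes collide, which is consistent with the abstract's remark that double-precision performance depends on how fast the hardware executes atomic operations.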
TL;DR: This work develops and evaluates strategies for efficient computation and propagation of wavefronts using a multi-level queue structure that improves the utilization of fast memories in a GPU and reduces synchronization overheads and develops a tile-based parallelization strategy to support execution on multiple CPUs and GPUs.
Abstract: We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50× and 85× with respect to single core CPU executions for morphological reconstruction and euclidean distance transform, respectively.
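The propagation rule described above can be illustrated with a serial, single-queue sketch of grayscale morphological reconstruction. This is an assumption-laden simplification: it uses one FIFO queue on the CPU rather than the multi-level GPU queue structure from the paper, a 4-neighborhood, and a toy one-row image.

```python
from collections import deque
from typing import List

Grid = List[List[int]]


def reconstruct(marker: Grid, mask: Grid) -> Grid:
    """Propagate marker values to neighbors, never exceeding the mask."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in marker]
    # Simplification: every pixel starts in the wavefront.
    queue = deque((r, c) for r in range(h) for c in range(w))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                # Propagation condition: the neighbor can be raised.
                new_val = min(out[r][c], mask[nr][nc])
                if new_val > out[nr][nc]:
                    out[nr][nc] = new_val
                    queue.append((nr, nc))  # neighbor joins the wavefront
    return out


mask = [[5, 5, 0, 7]]
marker = [[5, 0, 0, 0]]
result = reconstruct(marker, mask)
print(result)
```

Which pixels enter the queue depends on the image contents, which is what makes the data accesses irregular; the zero in the mask blocks the wave, so the last pixel is never raised.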
TL;DR: The implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
Abstract: This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, "communication-avoiding"), it is natural to consider hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low-level" for decoupled highly parallel inter-node reductions, (2) "domino level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (ii) build insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
TL;DR: An MPI-parallelized fault tolerant MLMC version of an existing parallel MLMC code (ALSVID-UQ) is implemented, based on the User Level Failure Mitigation, a fault tolerant extension of MPI.
Abstract: The theory behind fault tolerant multi-level Monte Carlo (FT-MLMC) methods was recently developed and tested. These tests were made without a real fault tolerant implementation. We implemented an MPI-parallelized fault tolerant MLMC version of an existing parallel MLMC code (ALSVID-UQ). It is based on the User Level Failure Mitigation, a fault tolerant extension of MPI. We confirm our FT-MLMC theory by means of simulations of the two-dimensional stochastic Euler equations of gas dynamics.
TL;DR: In this paper, a fixed spike transfer latency ring topology interconnect for spike communication between neural tiles, using a novel timestamped spike broadcast flow control scheme, is proposed.
Abstract: Information in a Spiking Neural Network (SNN) is encoded as the relative timing between spikes. Distortion in spike timings can impact the accuracy of SNN operation by modifying the precise firing time of neurons within the SNN. Maintaining the integrity of spike timings is crucial for reliable operation of SNN applications. A packet switched Network on Chip (NoC) infrastructure offers scalable connectivity for spike communication in hardware SNN architectures. However, shared resources in NoC architectures can result in unwanted variation in spike packet transfer latency. This packet latency jitter distorts the timing information conveyed on the synaptic connections in the SNN, resulting in unreliable application behaviour. This paper presents a SystemC simulation based analysis of the synaptic information distortion in NoC based hardware SNNs. The paper proposes a fixed spike transfer latency ring topology interconnect for spike communication between neural tiles, using a novel timestamped spike broadcast flow control scheme. The proposed architectural technique is evaluated using spike rates employed in previously reported mesh topology NoC based hardware SNN applications, which exhibited spike latency jitter over NoC paths. Results indicate that the proposed interconnect offers fixed spike transfer latency and eliminates the associated information distortion. The paper presents the micro-architecture of the proposed ring router. The FPGA validated ring interconnect architecture has been synthesised using 65nm low-power CMOS technology. Silicon area comparisons for various ring sizes are presented. Scalability of the proposed architecture has been addressed by employing a hierarchical NoC architecture.
TL;DR: This implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing.
Abstract: MapReduce is a programming model proposed to simplify large-scale data processing. In contrast, the message passing interface (MPI) standard is extensively used for algorithmic parallelization, as it accommodates an efficient communication infrastructure. In the original implementation of MapReduce, the reduce function can only start processing following termination of the map function. If the map function is slow for any reason, this will affect the whole running time. In this paper, we propose MapReduce overlapping using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing. Our implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI. At the same time, we maintain the usability and the simplicity of MapReduce. Experimental results based on three different applications (WordCount, Distributed Inverted Indexing and Distributed Approximate Similarity Search) show a good speedup compared to the earlier versions of MapReduce such as Hadoop and the available MPI-MapReduce implementations.
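The idea of overlapping map and reduce can be sketched on a single machine. The example below is an analogy only: it uses Python threads and a `queue.Queue` in place of MPI processes and messages, and the function names are hypothetical. The reducer starts consuming partial (word, 1) pairs while the mapper is still producing them.

```python
import queue
import threading
from collections import Counter
from typing import Dict, List

SENTINEL = None  # marks the end of the mapper's output stream


def mapper(chunks: List[str], out_q: queue.Queue) -> None:
    """Emit (word, 1) pairs as soon as each word is seen."""
    for chunk in chunks:
        for word in chunk.split():
            out_q.put((word.lower(), 1))
    out_q.put(SENTINEL)


def reducer(in_q: queue.Queue, totals: Counter) -> None:
    """Accumulate counts while the mapper is still running."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        word, count = item
        totals[word] += count


def pipelined_word_count(chunks: List[str]) -> Dict[str, int]:
    q: queue.Queue = queue.Queue()
    totals: Counter = Counter()
    m = threading.Thread(target=mapper, args=(chunks, q))
    r = threading.Thread(target=reducer, args=(q, totals))
    r.start()
    m.start()
    m.join()
    r.join()
    return dict(totals)


counts = pipelined_word_count(["a b a", "b c"])
print(counts)
```

With several mappers, each would send its own end-of-stream marker and the reducer would stop after receiving all of them; that bookkeeping is omitted here.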
TL;DR: In this article, a combination of the time-parallel PFASST with a parallel multigrid method (PMG) in space and time is presented, resulting in a mesh-based solver for the three-dimensional heat equation with a uniquely high degree of efficient concurrency.
Abstract: The paper presents a combination of the time-parallel “parallel full approximation scheme in space and time” (PFASST) with a parallel multigrid method (PMG) in space, resulting in a mesh-based solver for the three-dimensional heat equation with a uniquely high degree of efficient concurrency. Parallel scaling tests are reported on the Cray XE6 machine “Monte Rosa” on up to 16,384 cores and on the IBM Blue Gene/Q system “JUQUEEN” on up to 65,536 cores. The efficacy of the combined spatial- and temporal parallelization is shown by demonstrating that using PFASST in addition to PMG significantly extends the strong-scaling limit. Implications of using spatial coarsening strategies in PFASST’s multi-level hierarchy in large-scale parallel simulations are discussed.
TL;DR: A framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications and simplifies the semantics of the performance counters into metrics that refer to processor functional units is presented.
Abstract: Modern supercomputers deliver large computational power, but it is difficult for an application to exploit such power. One factor that limits the application performance is the single node performance. While many performance tools use the microprocessor performance counters to provide insights on serial node performance issues, the complex semantics of these counters pose an obstacle to an inexperienced developer. We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. The output of the framework is precise and it is capable of correlating performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With such information the developer can focus on the identified code and improve it by knowing which processor execution unit is degrading the performance. To demonstrate the usefulness of the framework we apply it to three already optimized applications using realistic inputs and, according to the results, modify their source code. By doing modifications that require little effort, we successfully increase the applications' performance from 10% to 30% and thus shorten the time required to reach the solution and/or allow facing increased problem sizes.
TL;DR: A load model for linear initial value runs with GENE is introduced for effective load balancing for the sparse grid combination technique, which will equip GENE for exascale computing.
Abstract: Massively parallel simulations of plasma microturbulence using GENE are facing the curse of dimensionality, since the discretization of the five-dimensional gyrokinetic equations requires a large amount of grid points even for only moderate resolutions. The sparse grid combination technique can be used to tackle the curse of dimensionality. Being based on a superposition of anisotropic full grid solutions that can be computed independently of each other, it introduces a second layer of parallelism that will equip GENE for exascale computing. Since the anisotropy of the discretizations of the partial solutions results in massive load imbalances, effective scheduling is crucial in order to exploit this parallelism. In this paper a load model for linear initial value runs with GENE is introduced for effective load balancing for the combination technique.
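To show how a load model can feed a scheduler, here is a minimal sketch using the standard longest-processing-time-first (LPT) greedy heuristic. This is a swapped-in textbook technique, not the scheduling method of the paper; the grid names and estimated runtimes are made-up stand-ins for what a load model would predict.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_schedule(est_runtimes: Dict[str, float],
                 n_groups: int) -> Tuple[Dict[int, List[str]], Dict[int, float]]:
    """Assign each task to the currently least-loaded process group."""
    heap = [(0.0, g) for g in range(n_groups)]  # (current load, group id)
    heapq.heapify(heap)
    assignment: Dict[int, List[str]] = {g: [] for g in range(n_groups)}
    # Place the most expensive partial solutions first.
    for task, t in sorted(est_runtimes.items(), key=lambda kv: -kv[1]):
        load, g = heapq.heappop(heap)
        assignment[g].append(task)
        heapq.heappush(heap, (load + t, g))
    loads = {g: sum(est_runtimes[t] for t in tasks)
             for g, tasks in assignment.items()}
    return assignment, loads


# Hypothetical estimated runtimes for five anisotropic grids.
est = {"grid_a": 8.0, "grid_b": 7.0, "grid_c": 6.0, "grid_d": 5.0, "grid_e": 4.0}
assignment, loads = lpt_schedule(est, n_groups=2)
print(assignment, loads)
```

The quality of such a schedule depends on how well the estimates match the real runtimes, which is the role the abstract assigns to the load model.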
TL;DR: Develop performance models and present numerical results of solving large-scale eigenvalue problems arising from simulations of modeling accelerator cavities and identify the crossover point, where SSEig becomes faster than TRLan.
Abstract: We study the performance of a parallel nonlinear eigensolver SSEig which is based on a contour integral method. We focus on symmetric generalized eigenvalue problems (GEPs) of computing interior eigenvalues. We chose to focus on GEPs because we can then compare the performance of SSEig with that of a publicly-available software package TRLan, which is based on a thick restart Lanczos method. To solve this type of problems, SSEig requires the solution of independent linear systems with different shifts, while TRLan solves a sequence of linear systems with a single shift. Therefore, while SSEig typically has a computational cost greater than that of TRLan, it also has greater parallel scalability. To compare the performance of these two solvers, in this paper, we develop performance models and present numerical results of solving large-scale eigenvalue problems arising from simulations of modeling accelerator cavities. In particular, we identify the crossover point, where SSEig becomes faster than TRLan. The parallel performance of SSEig solving nonlinear eigenvalue problems is also studied.
TL;DR: In this paper, a parallel algorithm for solving block-tridiagonal systems of equations is presented, which is an effective and simple set of procedures for solving engineering tasks on a supercomputer.
Abstract: In this study, we develop a new parallel algorithm for solving systems of linear algebraic equations with the same block-tridiagonal matrix but with different right-hand sides. The method is a generalization of the parallel dichotomy algorithm for solving systems of linear equations with tridiagonal matrices [1]. Using this approach, we propose a parallel realization of the domain decomposition method (the Schur complement method). The calculation of acoustic wave fields using the spectral-difference technique improves the efficiency of the parallel algorithms. A near-linear dependence of the speedup with the number of processors is attained using both several and several thousands of processors. This study is innovative because the parallel algorithm developed for solving block-tridiagonal systems of equations is an effective and simple set of procedures for solving engineering tasks on a supercomputer.
TL;DR: A Lagrangian relaxation based heuristic for the multi-dimensional assignment problem with decomposable costs that can be largely implemented in a map-reduce framework and thus easily distributed across a cluster of computers is described.
Abstract: Data Association framed as multidimensional assignment with decomposable costs. Distribution of a Lagrangian relaxation heuristic using a map-reduce framework. Parallel computation of Lagrange multipliers. New feasibility procedure for the relaxed multidimensional assignment problem. High quality and scalable solutions to large data association problems. Data association is the problem of identifying when multiple data sources have observed the same entity. Central to this effort is the multidimensional assignment problem, which is often used to formulate data association problems. The nature of data association problems dictates that solution methods for the multidimensional assignment problem must return results promptly, and work on large data sets. The contribution of this work is to describe a Lagrangian relaxation based heuristic for the multi-dimensional assignment problem with decomposable costs that can be largely implemented in a map-reduce framework and thus easily distributed across a cluster of computers. Distribution allows the heuristic to address run time and large data requirements of data association. The developed algorithm is tested on a synthesized dataset, and shown to achieve an optimality gap ranging from 0.00008% to 0.6% for dense (no filtering) problems having 10,000 observations. Distribution of the algorithm was found to offer a significant reduction in run time on 30,000 observation problems for an 8-node computing cluster with 96 processors over a single node with 12 processors.
TL;DR: This paper presents a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user and is the first generic approach to express parallelism that requires neither explicit synchronizations nor a detail of the dependencies of the parallel tasks.
Abstract: Synchronization in parallel applications can be achieved either implicitly or explicitly. Implicit synchronization is typical of programming environments that provide predefined, and often simple, patterns of parallelism such as data-parallel libraries and languages and skeletal operations. Nevertheless, more flexible approaches that allow expressing arbitrary task-level parallel computations without a predefined structure require in turn that the user explicitly specify the synchronization needed among the parallel tasks. In this paper we present a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user. Our proposal is the first generic approach to express parallelism we know of that requires neither explicit synchronizations nor a detail of the dependencies of the parallel tasks. Our strategy relies on expressing the parallel tasks as functions that convey their dependencies implicitly by means of their arguments. These function arguments are analyzed by our library, called DepSpawn, when a parallel task is spawned in order to enforce its dependencies. Our experiments indicate that DepSpawn is very competitive, both in terms of performance and programmability, with respect to a widespread high-level approach like OpenMP.
TL;DR: This paper describes the lightweight infrastructure-bootstrapping infrastructure (LIBI), both a bootstrapping API specification and a reference implementation and presents a performance evaluation of different process launching schemes based on the LIBI prototype.
Abstract: As the sizes of high-end computing systems continue to grow to massive scales, efficient bootstrapping for distributed software infrastructures is becoming a greater challenge. Distributed software infrastructure bootstrapping is the procedure of instantiating all processes of the distributed system on the appropriate hardware nodes and disseminating to these processes the information that they need to complete the infrastructure's start-up phase. In this paper, we describe the lightweight infrastructure-bootstrapping infrastructure (LIBI), both a bootstrapping API specification and a reference implementation. We describe a classification system for process launching mechanism and then present a performance evaluation of different process launching schemes based on our LIBI prototype.
TL;DR: It is shown that the final parallel code gave a substantial acceleration of Trubal, and an average speedup of 4.69 in computational time was obtained.
Abstract: Particulate flows are commonly encountered in both engineering and environmental applications. The discrete element method (DEM) has attracted considerable attention since it can predict the whole motion of the particulate flow by monitoring every single particle. However, the computational capability of the method relies strongly on the numerical scheme as well as the hardware environment. In this study, a parallelization of a DEM based code titled Trubal was implemented. Numerical simulations were carried out to show the benefits of this research. It is shown that the final parallel code gave a substantial acceleration of Trubal. By simulating 6,000 particles using an NVIDIA Tesla C2050 card together with Intel Core-Dual 2.93 GHz CPU, an average speedup of 4.69 in computational time was obtained.
TL;DR: Experimental results show that the proposed architecture significantly improves the performance up to 75% by replacing 2D static routers with adaptive 2D routers in heterogeneous 3D NoCs, while keeping the maximum clock frequency, power and energy consumption of the adaptive router nearly at the same level as the static router.
Abstract: Three-dimensional Networks-on-Chips (3D NoCs) have recently been proposed to address the on-chip communication demands of future highly dense 3D multi-core systems. Homogeneous 3D NoC topologies have many Through Silicon Vias (TSVs) which have a costly and complex manufacturing process. Also, 3D routers use more memory and are more power hungry than conventional 2D routers. Alternatively, heterogeneous 3D NoCs combine both the area and performance benefits of 2D and 3D static router architectures by using a limited number of TSVs. To improve the performance of heterogeneous 3D NoCs, we propose an adaptive router architecture which balances the traffic in such NoCs. Particularly, experimental results show that our proposed architecture significantly improves the performance up to 75% by replacing 2D static routers with adaptive 2D routers in heterogeneous 3D NoCs, while keeping the maximum clock frequency, power and energy consumption of the adaptive router nearly at the same level as the static router.
TL;DR: This paper proposes and evaluates a Divide and Conquer, D&C, algorithm to efficiently parallelize the FEM assembly, and compares this hybrid approach to the pure domain decomposition and to a state-of-the-art hybrid approach using mesh coloring.
Abstract: Relying solely on domain decomposition and distributed memory parallelism can limit the performance on current supercomputers. At scale, a larger number of smaller domains can lead to an increased communication volume and to load balancing issues. Moreover, the decreasing memory per core is not compatible with the memory overhead of a finer domain decomposition. A popular alternative is to use shared memory parallelism in addition to the domain decomposition. In the context of Finite Element Method, FEM, one of the challenging steps to parallelize in shared memory is the matrix assembly. In this paper, we propose and evaluate a Divide and Conquer, D&C, algorithm to efficiently parallelize the FEM assembly. We compare this hybrid approach using D&C to the pure domain decomposition and to a state-of-the-art hybrid approach using mesh coloring. Our target application is an industrial fluid dynamics code, developed by Dassault Aviation and parallelized with MPI domain decomposition. The original Fortran code has been modified with minimum intrusion. Our D&C approach uses task parallelism with Intel Cilk+. Preliminary results show a good data locality and a 14% performance improvement on a 12 cores 2 sockets Westmere-EP node.
TL;DR: The integration and combined use of Score-P and the CAPS compilers are presented as one approach to efficiently parallelize and optimize codes, and the PHMPP profiling interface, its implementation in Score-P, and the presentation of preliminary results in CUBE are described.
Abstract: In heterogeneous environments with multi-core systems and accelerators, programming and optimizing large parallel applications turns into a time-intensive and hardware-dependent challenge. To assist application developers in this process, a number of tools and high-level compilers have been developed. Directive-based programming models such as HMPP and OpenACC provide abstractions over low-level GPU programming models, such as CUDA or OpenCL. The compilers developed by CAPS automatically transform the pragma-annotated application code into low-level code, thereby allowing the parallelization and optimization for a given accelerator hardware. To analyze the performance of parallel applications, multiple partners in Germany and the US jointly develop the community measurement infrastructure Score-P. Score-P gathers performance execution profiles, which can be presented and analyzed within the CUBE result browser, and collects detailed event traces to be processed by post-mortem analysis tools such as Scalasca and Vampir. In this paper we present the integration and combined use of Score-P and the CAPS compilers as one approach to efficiently parallelize and optimize codes. Specifically, we describe the PHMPP profiling interface, its implementation in Score-P, and the presentation of preliminary results in CUBE.
TL;DR: A deadlock detection-based scheduling (DDS) algorithm that can achieve high performance by making the best use of the available storage resources and achieve higher performance than some deadlock avoidance methods in synthetic and real workflow computations.
Abstract: Workflow-based workloads usually consist of multiple instances of the same workflow, which are jobs with control or data dependencies, to carry out a well-defined scientific computation task, with each instance acting on its own input data. To maximize throughput performance, a high degree of concurrency is achievable by running multiple instances simultaneously. However, deadlock is a potential problem when storage is constrained. To address this problem, we design and evaluate a deadlock detection-based scheduling (DDS) algorithm that can achieve high performance by making the best use of the available storage resources. Our algorithm takes advantage of the dataflow information of the workflow to speculatively schedule each instance if the instant storage is sufficient for some constituent jobs, but not necessarily for the whole workflow instance. Whenever deadlock or a performance anomaly is detected, some selected in-progress workflow instances are required to be rolled back to release storage for other blocked jobs. We develop a suite of strategies to select the victims and beneficiaries (instances or jobs) and evaluate their performance via a simulation-based study. Our results show that the DDS algorithm can adapt the job concurrency to the available storage resources and achieve higher performance than some deadlock avoidance methods in our synthetic and real workflow computations.
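The detect-then-roll-back idea in the abstract above can be illustrated with a deliberately simplified sketch. This is not the DDS algorithm from the paper: the storage model (a single free-storage counter), the function names, and the "largest holder first" victim policy are all assumptions chosen for brevity, whereas the paper evaluates a suite of victim and beneficiary selection strategies.

```python
def is_deadlocked(free_storage, running, blocked):
    """Return True when no job can make progress.

    running: set of job ids currently executing (assumed to release storage later).
    blocked: dict mapping job id -> storage the job needs before it can start.
    """
    if running or not blocked:
        return False
    # Deadlock in this toy model: nothing is running and no blocked job fits.
    return all(need > free_storage for need in blocked.values())


def pick_victim(held):
    """Hypothetical policy: roll back the instance holding the most storage."""
    return max(held, key=held.get)


def resolve(free_storage, running, blocked, held):
    """Roll back victims until some blocked job fits.

    held: dict mapping in-progress instance id -> storage it currently holds.
    Returns the new free storage and the list of rolled-back instances.
    """
    held = dict(held)  # work on a copy so the caller's dict is untouched
    victims = []
    while held and is_deadlocked(free_storage, running, blocked):
        victim = pick_victim(held)
        free_storage += held.pop(victim)
        victims.append(victim)
    return free_storage, victims


free, victims = resolve(
    free_storage=2,
    running=set(),
    blocked={"j1": 5, "j2": 8},
    held={"w1": 1, "w2": 4},
)
print(free, victims)
```

In this example, rolling back the hypothetical instance `w2` frees enough storage for job `j1` to start, so the loop stops after one victim.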
TL;DR: The experimental results on a set of benchmarks show the proposed Greedy Simulated Annealing Algorithm (GSAA) can improve the performance by 34.96% and 18.85% on average when compared with a pure greedy algorithm and a pure simulated annealing algorithm, respectively.
Abstract: The hardware/software co-design technique is traditionally used to design embedded systems. Hardware/software partitioning is a key problem in hardware/software co-design. In this paper, we propose a Greedy Simulated Annealing Algorithm (GSAA) to implement an approximately optimal or optimal partition on a reconfigurable System-on-Chip (SoC) in embedded systems. The experimental results on a set of benchmarks show the proposed GSAA can improve the performance by 34.96% and 18.85% on average when compared with a pure greedy algorithm and a pure simulated annealing algorithm, respectively. Our algorithm is therefore an effective hardware/software partitioning algorithm.
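The abstract does not say how GSAA combines its greedy and annealing phases. One common way to combine the two, shown below purely as an illustrative sketch, is to use a greedy solution as the starting point for simulated annealing. The cost model (per-task software time, hardware time, and hardware area under an area budget), the task values, and all function names are assumptions and are not taken from the paper.

```python
import math
import random


def total_time(tasks, in_hw):
    """Sum of execution times for a given hardware/software assignment."""
    return sum(t["hw"] if h else t["sw"] for t, h in zip(tasks, in_hw))


def area_used(tasks, in_hw):
    """Hardware area consumed by the tasks mapped to hardware."""
    return sum(t["area"] for t, h in zip(tasks, in_hw) if h)


def greedy_partition(tasks, area_budget):
    """Move tasks to hardware in order of time saved per unit of area."""
    order = sorted(
        range(len(tasks)),
        key=lambda i: (tasks[i]["sw"] - tasks[i]["hw"]) / tasks[i]["area"],
        reverse=True,
    )
    in_hw = [False] * len(tasks)
    used = 0
    for i in order:
        saves_time = tasks[i]["sw"] > tasks[i]["hw"]
        if saves_time and used + tasks[i]["area"] <= area_budget:
            in_hw[i] = True
            used += tasks[i]["area"]
    return in_hw


def anneal(tasks, area_budget, start, steps=2000, temp=10.0, cooling=0.995, seed=0):
    """Simulated annealing seeded with `start`; never returns worse than `start`."""
    rng = random.Random(seed)
    current, best = list(start), list(start)
    for _ in range(steps):
        candidate = list(current)
        i = rng.randrange(len(tasks))
        candidate[i] = not candidate[i]  # flip one task between HW and SW
        if area_used(tasks, candidate) <= area_budget:
            delta = total_time(tasks, candidate) - total_time(tasks, current)
            # Always accept improvements; accept worse moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = candidate
                if total_time(tasks, current) < total_time(tasks, best):
                    best = list(current)
        temp *= cooling
    return best


# Hypothetical task set: times and areas are made-up illustration values.
tasks = [
    {"sw": 10, "hw": 3, "area": 6},
    {"sw": 8, "hw": 3, "area": 5},
    {"sw": 8, "hw": 3, "area": 5},
]
budget = 10
seed_solution = greedy_partition(tasks, budget)
refined = anneal(tasks, budget, seed_solution)
print(total_time(tasks, seed_solution), total_time(tasks, refined))
```

Because the annealing phase only records feasible assignments that strictly improve on the best one seen so far, the refined partition in this sketch is never slower than the greedy seed and never exceeds the area budget.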
TL;DR: An automatic parallelization approach for Modelica models using Transmission Line Modeling (TLM), which re-uses the dependency analysis from the sequential translation step of OMC to introduce parallelism into the system.
Abstract: In today’s world of high tech manufacturing and computer-aided design, simulation of models is at the heart of the whole manufacturing process. Trying to represent and study the variables of real world models using simulation computer programs can turn out to be a very expensive and time consuming task. On the other hand, advancements in modern multi-core CPUs promise remarkable computational power. Modern modeling environments provide different optimization and parallelization options to take advantage of the available computational power. Some of these parallelization approaches are based on automatically extracting parallelism with the help of the model compiler or translator. Another approach is to provide the model programmers with the necessary language constructs to express any potential parallelism in their models. In this paper we present an automatic parallelization approach for Modelica models using Transmission Line Modeling (TLM). TLM is suitable for parallel simulations because larger models can be partitioned into smaller independent sub-models. TLM introduces parallelism into the system by decoupling subsystems using delays greater than the step size of the numerical solver. A prototype has been implemented in the OpenModelica Compiler (OMC) framework. Our approach re-uses the dependency analysis from the sequential translation step of OMC. With the help of the dependency analysis information, the set of equations for a model is partitioned into a number of sub-systems. The resulting independent sub-systems are scheduled and executed in parallel. The run-time system for OMC has been improved to provide thread safety and handle parallelism while keeping the introduced overhead to a minimum for normal sequential operation and maintaining portability.
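As a rough illustration of the partitioning step described in this abstract, the sketch below groups equations into independent sub-systems by taking connected components of a "shares a variable" graph. It is not the OMC implementation. The equation names, the `partition_equations` helper, and the modelling of a TLM delay as a renamed variable (`c_delayed`) that removes the direct coupling are assumptions made for illustration.

```python
from collections import defaultdict


def partition_equations(equations):
    """Group equations into independent sub-systems.

    `equations` maps a (hypothetical) equation name to the set of variables it
    uses. Two equations end up in the same sub-system when they share a
    variable, directly or through a chain of other equations.
    """
    # Index equations by the variables they mention.
    by_variable = defaultdict(list)
    for name, variables in equations.items():
        for var in variables:
            by_variable[var].append(name)

    seen = set()
    subsystems = []
    for start in equations:
        if start in seen:
            continue
        # Depth-first search over the "shares a variable" relation.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            current = stack.pop()
            component.append(current)
            for var in equations[current]:
                for neighbour in by_variable[var]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        subsystems.append(sorted(component))
    return subsystems


# Without a TLM connection, e2 and e3 are coupled through the variable "c".
coupled = {"e1": {"a", "b"}, "e2": {"b", "c"}, "e3": {"c", "d"}}

# With a TLM connection, e3 reads a delayed copy of "c" from an earlier time
# step, so it no longer depends on the current value computed alongside e2.
decoupled = {"e1": {"a", "b"}, "e2": {"b", "c"}, "e3": {"c_delayed", "d"}}

print(partition_equations(coupled))
print(partition_equations(decoupled))
```

In the coupled model all three equations form one sub-system, while the delayed variable splits the decoupled model into two sub-systems that could be handed to separate threads.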
TL;DR: A new technique is presented for discovering resources in grids upon users' requests, which can be used for multi-attribute and range queries.
Abstract: A key point for the efficient use of large grid systems is the discovery of resources, and this task becomes more complicated as the size of the system grows. In this case, large amounts of information on the available resources must be stored and kept up-to-date along the system so that it can be queried by users to find resources meeting specific requirements (e.g. a given operating system or available memory). Thus, three tasks must be performed: (1) information on resources must be gathered and processed, (2) such processed information has to be disseminated over the system, and (3) upon users' requests, the system must be able to discover resources meeting some requirements using the processed information. This paper presents a new technique for the discovery of resources in grids which can be used in the case of multi-attribute (e.g. {OS=Linux & memory=4GB}) and range queries (e.g. {50GB
TL;DR: This is the first analytical model that tackles the behavior of multithreaded applications on realistic shared caches without requiring profiling and the experimental results show that the model predictions are precise and very fast and that it can help a compiler or programmer choose the best parallelization strategy.
Abstract: Multicores are the norm nowadays and in many of them there are cores that share one or several levels of cache. The theoretical performance gain expected when several cores cooperate in the parallel execution of an application can be reduced in some cases by a cache access bottleneck, as the data accessed by them can interfere in the shared cache levels. In other cases the performance gain can be increased due to a greater reuse of the data loaded in the cache. This paper presents an analytical model that can predict the behavior of shared caches when executing applications parallelized at loop level. To the best of our knowledge, this is the first analytical model that tackles the behavior of multithreaded applications on realistic shared caches without requiring profiling. The experimental results show that the model predictions are precise and very fast and that the model can help a compiler or programmer choose the best parallelization strategy.
TL;DR: The NoC protocol stack is explored to determine the best layer for implementing the off-chip bridge, a generic hardware architecture for the bridge is proposed, and a new software architecture is developed enabling seamless configuration and communication of multi-chip NoC-based SoCs.
Abstract: Recent embedded systems integrate a growing number of intellectual property cores into increasingly large designs. Implementation, prototyping, and verification of such large systems have become very challenging. One of the reasons is that chip/FPGA resources are limited and therefore it is not always possible to implement the whole design in a traditional system-on-a-chip solution. The state-of-the-art is to partition such systems into smaller sub-systems and implement each on a separate chip. Consequently, the separate chips/FPGAs must be interconnected. Since Networks-on-Chip (NoCs) have become common interconnection solutions in embedded designs, we propose to bridge NoC-based SoCs, enabling a generic multi-chip system interconnection. In this context, the contribution of this paper is threefold: (i) we explore the NoC protocol stack to determine the best layer for implementing the off-chip bridge, (ii) we propose a generic hardware architecture for the bridge, and (iii) we develop a new software architecture enabling seamless configuration and communication of multi-chip NoC-based SoCs. Finally, we demonstrate the performance, i.e., bandwidth and latency, of the bridge in a multi-FPGA platform, while the bridge guarantees QoS of traffic. The synthesis results indicate the implementation area cost of the bridge is only 1% of a Xilinx Virtex6 FPGA.