TL;DR: The need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components possesses is motivated, and the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems is envisioned.
Abstract: We highlight the trends leading to the increased appeal of using hybrid multicore+GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.
TL;DR: It is shown how different algorithmic variations can be used to best effect depending upon the cluster structure of the graph being labelled and how features of the GPU architectures and host CPUs can be combined to best effect into a cluster component labelling algorithm for use in high performance simulations.
Abstract: Graph component labelling, which is a subset of the general graph colouring problem, is a computationally expensive operation that is of importance in many applications and simulations. A number of data-parallel algorithmic variations to the component labelling problem are possible and we explore their use with general purpose graphical processing units (GPGPUs) and with the CUDA GPU programming language. We discuss implementation issues and performance results on GPUs using CUDA. We present results for regular mesh graphs as well as arbitrary structured and topical graphs such as small-world and scale-free structures. We show how different algorithmic variations can be used to best effect depending upon the cluster structure of the graph being labelled and consider how features of the GPU architectures and host CPUs can be combined to best effect into a cluster component labelling algorithm for use in high performance simulations.
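As an illustration of one data-parallel variation, the following is a minimal sequential sketch of minimum-label propagation: every vertex starts with its own index as its label, and each sweep over the edges pulls both endpoints to the smaller label until nothing changes. The function name and the edge-list input are assumptions made for this sketch; it is not the CUDA implementation evaluated in the paper.

```python
def label_components(num_vertices, edges):
    """Label connected components by iterated minimum-label propagation."""
    labels = list(range(num_vertices))  # every vertex starts as its own component
    changed = True
    while changed:
        changed = False
        # One sweep over all edges. In a data-parallel setting each edge
        # (or vertex) would typically be handled by its own thread.
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low:
                labels[u] = low
                changed = True
            if labels[v] != low:
                labels[v] = low
                changed = True
    return labels


# Two components {0, 1, 2} and {3, 4}, plus the isolated vertex 5.
edges = [(0, 1), (1, 2), (3, 4)]
labels = label_components(6, edges)
```

Each component ends up labelled with its smallest vertex index. The number of sweeps needed depends on the graph's structure, which is one reason different variations suit different graphs.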
TL;DR: In this paper, the authors analyze the impact of the migration topology on the performance of a parallel global optimization algorithm using the island model, in particular parallel Differential Evolution and Simulated Annealing with Adaptive Neighborhood.
Abstract: Parallel Global Optimization Algorithms (PGOA) provide an efficient way of dealing with hard optimization problems. One method of parallelization of GOAs that is frequently applied and commonly found in the contemporary literature is the so-called Island Model (IM). In this paper, we analyze the impact of the migration topology on the performance of a PGOA which uses the Island Model. In particular we consider parallel Differential Evolution and Simulated Annealing with Adaptive Neighborhood and draw first conclusions that emerge from the conducted experiments.
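To make the notion of a migration topology concrete, here is a minimal sketch in which each island is a list of fitness values (lower is better) and a topology maps each island to the islands it sends to. The two topologies, the function names and the replace-the-worst migration policy are assumptions chosen for illustration, not the configurations studied in the paper.

```python
def ring_topology(n):
    """Each island sends only to its successor on a ring."""
    return {i: [(i + 1) % n] for i in range(n)}


def fully_connected_topology(n):
    """Each island sends to every other island."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def migrate(islands, topology):
    """Copy each island's best value to its neighbours, replacing their worst.

    Emigrants are chosen from a snapshot taken before any replacement, so the
    result does not depend on the order in which islands are visited.
    """
    emigrants = [min(population) for population in islands]
    for source, destinations in topology.items():
        for dest in destinations:
            population = islands[dest]
            worst = max(range(len(population)), key=population.__getitem__)
            if emigrants[source] < population[worst]:
                population[worst] = emigrants[source]


islands = [[5, 9], [3, 8], [7, 6]]
migrate(islands, ring_topology(3))
```

With the ring, the best value of island 1 reaches only island 2 in one migration step; with the fully connected topology it would reach every island at once. That difference in how quickly good solutions spread is the trade-off a topology study examines.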
TL;DR: A Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous multicore with GPU accelerators that can exceed 25x the performance of the corresponding LAPACK algorithm running on current homogeneous multicores is presented.
Abstract: We present a Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous multicore with GPU accelerators that can exceed 25x the performance of the corresponding LAPACK algorithm running on current homogeneous multicores. This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the system's hybrid components. The results described in this paper are significant because the HR has not been properly accelerated before on homogeneous multicore architectures, and it plays a significant role in solving non-symmetric eigenvalue problems. Moreover, the ideas from the hybrid HR are used to develop a hybrid tridiagonal reduction algorithm (for symmetric eigenvalue problems) and a bidiagonal reduction algorithm (for singular value decomposition problems). Our approach demonstrates a methodology that streamlines the development of a large and important class of algorithms on modern computer architectures of multicore and GPUs. The new algorithms can be directly used in the software stack that relies on LAPACK.
TL;DR: The feasibility of large-scale modeling based on an implicit numerical scheme and a nonlinear constitutive model for soil nonlinearity is demonstrated and an application to regional scale modeling in the French Riviera is illustrated.
Abstract: We present finite-element numerical simulations of seismic wave propagation in nonlinear inelastic geological media. We demonstrate the feasibility of large-scale modeling based on an implicit numerical scheme and a nonlinear constitutive model. We illustrate our methodology with an application to regional scale modeling in the French Riviera, which is prone to earthquakes. The PaStiX direct solver is used to handle large matrix numerical factorizations based on hybrid parallelism to reduce memory overhead. A specific methodology is introduced for the parallel assembly in the context of soil nonlinearity. We analyse the scaling of the parallel algorithms on large-scale configurations and we discuss the physical results.
TL;DR: The approach considers the full lifecycle of service-based systems including service engineering, Service Level Agreement (SLA) negotiation and management, service provisioning and monitoring, and QoS parameters at application, platform and infrastructure levels.
Abstract: The advancements in distributed computing have driven the emergence of service-based infrastructures that allow for on-demand provision of ICT assets. Taking into consideration the complexity of distributed environments, significant challenges exist in providing and managing the offered on-demand resources with the required level of Quality of Service (QoS), especially for real-time interactive and streaming applications. In this paper we propose an approach for providing real-time QoS guarantees by enhancing service oriented infrastructures with coherent and consistent real-time attributes at various levels (application, network, storage, processing). The approach considers the full lifecycle of service-based systems including service engineering, Service Level Agreement (SLA) negotiation and management, service provisioning and monitoring. QoS parameters at application, platform and infrastructure levels are given specific attention as the basis for provisioning policies in the context of temporal constraints.
TL;DR: A massively parallel implementation of symmetric sparse matrix-vector product for modern clusters with scalar multi-core CPUs is presented, and it is shown that reducing the storage requirements for symmetric matrices results in roughly two times speedup.
Abstract: We present a massively parallel implementation of symmetric sparse matrix-vector product for modern clusters with scalar multi-core CPUs. Matrices with highly variable structure and density arising from unstructured three-dimensional FEM discretizations of mechanical and diffusion problems are studied. A metric of the effective memory bandwidth is introduced to analyze the impact on performance of a set of simple, well-known optimizations: matrix reordering, manual prefetching, and blocking. A modification to the CRS storage improving the performance on multi-core Opterons is shown. The performance of an entire SMP blade rather than the per-core performance is optimized. Even for the simplest 4 node mechanical element our code utilizes close to 100% of the per-blade available memory bandwidth. We show that reducing the storage requirements for symmetric matrices results in roughly two times speedup. Blocking brings further storage savings and a proportional performance increase. Our results are compared to existing state-of-the-art implementations of SpMV, and to the dense BLAS2 performance. Parallel efficiency on 5400 Opteron cores of the Cray XT4 cluster is around 80-90% for problems with approximately 25^3 mesh nodes per core. For a problem with 820 million degrees of freedom the code runs with a sustained performance of 5.2 TeraFLOPs, over 20% of the theoretical peak.
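The storage reduction for symmetric matrices can be sketched as follows: only the upper triangle (diagonal included) is kept in CRS form, and each stored off-diagonal entry contributes to two entries of the result. This is a plain sequential sketch under that assumption; the array names are hypothetical, and the paper's modified CRS layout, prefetching and blocking are not reproduced here.

```python
def sym_spmv_upper_crs(row_ptr, col_idx, values, x):
    """Compute y = A x for symmetric A stored as its upper triangle in CRS."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = values[k]
            y[i] += a * x[j]       # contribution of A[i][j]
            if j != i:
                y[j] += a * x[i]   # mirrored contribution of A[j][i]
    return y


# A = [[4, 1, 0],
#      [1, 3, 2],
#      [0, 2, 5]], stored as 5 values instead of 7.
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 1, 2, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y = sym_spmv_upper_crs(row_ptr, col_idx, values, x)
```

Because the kernel is limited by memory bandwidth, reading roughly half as many matrix entries is what makes a speedup of about two plausible. The scattered update to `y[j]` is the price: in a parallel setting it has to be protected or reorganised.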
TL;DR: A profiling tool for discovering thread-level parallelism is presented, and the belief that programs with complex control and data flow contain significant amounts of exploitable coarse-grain pipeline parallelism in the program's outer loops is presented and validated.
Abstract: Traditional static analysis fails to auto-parallelize programs with a complex control and data flow. Furthermore, thread-level parallelism in such programs is often restricted to pipeline parallelism, which can be hard to discover by a programmer. In this paper we propose a tool that, based on profiling information, helps the programmer to discover parallelism. The programmer hand-picks the code transformations from among the proposed candidates which are then applied by automatic code transformation techniques. This paper contributes to the literature by presenting a profiling tool for discovering thread-level parallelism. We track dependencies at the whole-data structure level rather than at the element level or byte level in order to limit the profiling overhead. We perform a thorough analysis of the needs and costs of this technique. Furthermore, we present and validate the belief that programs with complex control and data flow contain significant amounts of exploitable coarse-grain pipeline parallelism in the program's outer loops. This observation validates our approach to whole-data structure dependencies. As state-of-the-art compilers focus on loops iterating over data structure members, this observation also explains why our approach finds coarse-grain pipeline parallelism in cases that have remained out of reach for state-of-the-art compilers. In cases where traditional compilation techniques do find parallelism, our approach allows the discovery of higher degrees of parallelism, allowing a 40% speedup over traditional compilation techniques. Moreover, we demonstrate real speedups on multiple hardware platforms.
TL;DR: This work moves towards a general framework for creating auto-tuned data-parallel algorithms, applying its techniques to common algorithms with varying characteristics and tuning a set of algorithms with a variety of computational patterns.
Abstract: We introduce a variety of techniques toward autotuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independent of hardware architecture, and attempt to select near-optimum parameters. We work towards a general framework for creating auto-tuned data-parallel algorithms, using these techniques for common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal in mind of building a general framework from these results. Our tuning strategy focuses first on identifying the computational patterns an algorithm shows, and then reducing our tuning model based on these observed patterns.
TL;DR: This paper outlines how to address the two most important scalability limitations of Scalasca, namely the unification of local identifiers at measurement finalization and the collation and display of analysis reports.
Abstract: Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. Target applications include simulation codes from science and engineering based on the parallel programming interfaces MPI and/or OpenMP. Scalasca, which has been specifically designed for use on large-scale machines such as IBM Blue Gene and Cray XT, integrates runtime summaries suitable to obtain a performance overview with in-depth studies of concurrent behavior via event tracing. Although Scalasca was already successfully used with codes running with 294,912 cores on a 72-rack Blue Gene/P system, the current software design shows scalability limitations that adversely affect user experience and that will present a serious obstacle on the way to mastering larger scales in the future. In this paper, we outline how to address the two most important ones, namely the unification of local identifiers at measurement finalization as well as collating and displaying analysis reports.
TL;DR: The execution semantics of CnC are formally described, and it is proved that this model guarantees deterministic computation.
Abstract: We introduce the Concurrent Collections (CnC) programming model. In this model, programs are written in terms of high-level operations. These operations are partially ordered according to only their semantic constraints. These partial orderings correspond to data dependences and control dependences. The role of the domain expert, whose interest and expertise is in the application domain, and the role of the tuning expert, whose interest and expertise is in performance on a specific architecture, can be viewed as separate concerns. The CnC programming model provides a high-level specification that can be used as a common language between the two experts, raising the level of their discourse. The model facilitates a significant degree of separation, which simplifies the task of the domain expert, who can focus on the application rather than scheduling concerns and mapping to the target architecture. This separation also simplifies the work of the tuning expert, who is given the maximum possible freedom to map the computation onto the target architecture and is not required to understand the details of the domain. However, the domain and tuning expert may still be the same person. We formally describe the execution semantics of CnC and prove that this model guarantees deterministic computation. We evaluate the performance of CnC implementations on several applications and show that CnC can effectively exploit several different kinds of parallelism and offer performance and scalability that is equivalent to or better than that offered by the current low-level parallel programming models. Further, with respect to ease of programming, we discuss the tradeoffs between CnC and other parallel programming models on these applications.
TL;DR: A state-of-the-art shallow water simulator running on multiple GPUs is presented; it is based on an explicit high-resolution finite volume scheme suitable for modeling dam breaks and flooding, and it shows near perfect weak and strong scaling.
Abstract: We present a state-of-the-art shallow water simulator running on multiple GPUs. Our implementation is based on an explicit high-resolution finite volume scheme suitable for modeling dam breaks and flooding. We use row domain decomposition to enable multi-GPU computations, and perform traditional CUDA block decomposition within each GPU for further parallelism. Our implementation shows near perfect weak and strong scaling, and enables simulation of domains consisting of up to 235 million cells at a rate of over 1.2 gigacells per second using four Fermi-generation GPUs. The code is thoroughly benchmarked using three different systems, both high-performance and commodity-level systems.
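A row domain decomposition can be sketched as assigning each device a contiguous band of grid rows. The sketch below also extends each band by a one-row overlap towards its neighbours, since a finite volume stencil needs neighbouring cells. The overlap width, the function name and the dictionary layout are assumptions made for this illustration and are not taken from the paper.

```python
def split_rows(num_rows, num_devices, ghost=1):
    """Split grid rows into contiguous bands, one per device.

    Returns, for each device, the half-open range of rows it owns and the
    same range extended by `ghost` overlap rows (clipped at the grid edges).
    """
    base, extra = divmod(num_rows, num_devices)
    parts = []
    start = 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)  # spread the remainder
        end = start + size
        parts.append({
            "owned": (start, end),
            "with_ghosts": (max(0, start - ghost), min(num_rows, end + ghost)),
        })
        start = end
    return parts


parts = split_rows(10, 4)
```

With bands of whole rows, each device exchanges data with at most two neighbours, and the exchanged rows are contiguous in a row-major layout.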
TL;DR: The issues involved when solving the sparse systems on parallel computers are looked at and the design of an efficient solution phase is discussed, in particular their solution in a limited memory environment when out-of-core working is required.
Abstract: We study the forward and backward substitution phases of a sparse multifrontal factorization. These phases are often neglected in papers on sparse direct factorization but, in many applications, they can be the bottleneck so it is crucial to implement them efficiently. In this work, we assume that the factors have been written on disk during the factorization phase, and we discuss the design of an efficient solution phase. We will look at the issues involved when we are solving the sparse systems on parallel computers and will consider in particular their solution in a limited memory environment when out-of-core working is required. Two different approaches are presented to read data from the disk, with a discussion on the advantages and the drawbacks of each. We present some experiments on realistic test problems using an out-of-core version of a sparse multifrontal code called MUltifrontal Massively Parallel Solver (MUMPS).
TL;DR: A parallel query processing method is proposed for the design and construction of web search engines; it efficiently deals with dynamic variations in query traffic and automatically re-organizes parallel computations to benefit from the advantages of two mixed modes of operation.
Abstract: A parallel query processing method is proposed for the design and construction of web search engines to efficiently deal with dynamic variations in query traffic. The method allows for the efficient use of different distributed indexing and query processing strategies in server clusters consisting of multiple computational/storage nodes. It also enables a better utilization of local and distributed hardware resources as it automatically re-organizes parallel computations to benefit from the advantages of two mixed modes of operation, namely: a newly proposed synchronous mode and the standard asynchronous computing mode. Switching between modes is facilitated by a round-robin strategy devised to grant each query a fair share of the hardware resources and properly predict query throughput. Performance is evaluated by experimental methods and two case studies serve to show how to develop efficient parallel query processing algorithms for large-scale search engines based on the proposed paradigm.
TL;DR: This monograph presents examples of best practices when combining bioinspired algorithms with parallel architectures and offers a map with the main paths already explored and new ways towards the future.
Abstract: This monograph presents examples of best practices when combining bioinspired algorithms with parallel architectures. The book includes recent work by leading researchers in the field and offers a map with the main paths already explored and new ways towards the future. Parallel Architectures and Bioinspired Algorithms will be of value to both specialists in Bioinspired Algorithms, Parallel and Distributed Computing, as well as computer science students trying to understand the present and the future of Parallel Architectures and Bioinspired Algorithms.
TL;DR: A parallel algorithm for solving a series of matrix equations with a constant tridiagonal matrix and different right-hand sides, together with an original algorithm for calculating share components of the solution vector, is proposed and studied.
Abstract: A parallel algorithm for solving a series of matrix equations with a constant tridiagonal matrix and different right-hand sides is proposed and studied. The process of solving the problem is represented in two steps. The first preliminary step is calculating some rows of the inverse matrix of the system of linear algebraic equations. The second step consists in calculating solutions for all right-hand sides. To reduce the communication interactions, we formulate and prove the main Gaussian Parallel Elimination Theorem for tridiagonal systems of equations and, based on it, propose an original algorithm for calculating share components of the solution vector. Theoretical estimates validating the efficiency of the approach for both the common- and distributed-memory supercomputers are obtained. Direct and iterative methods of solving a 2D Poisson equation, which include procedures of tridiagonal matrix inversion, are realized using the MPI paradigm. Results of computational experiments on a multicomputer demonstrate a high efficiency and scalability of the parallel Dichotomy Algorithm.
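The two-step idea of doing the matrix-dependent work once and then reusing it for every right-hand side can be illustrated with the classical sequential Thomas algorithm, split into a factorisation step and a solve step. This is a substituted, well-known technique used only to show the structure; it is not the parallel Dichotomy Algorithm, which works with rows of the inverse matrix instead. The function names and the storage convention are assumptions, and no pivoting is performed, so the sketch presumes a matrix for which that is safe (for example, a diagonally dominant one).

```python
def factor_tridiagonal(lower, diag, upper):
    """Step 1 (once per matrix): eliminate the sub-diagonal.

    Convention: lower[i] = A[i][i-1] (lower[0] unused),
                upper[i] = A[i][i+1] (upper[n-1] unused).
    """
    n = len(diag)
    mult = [0.0] * n    # elimination multipliers
    pivot = [0.0] * n   # diagonal after elimination
    pivot[0] = diag[0]
    for i in range(1, n):
        mult[i] = lower[i] / pivot[i - 1]
        pivot[i] = diag[i] - mult[i] * upper[i - 1]
    return mult, pivot


def solve_tridiagonal(mult, pivot, upper, rhs):
    """Step 2 (once per right-hand side): forward sweep, then back substitution."""
    n = len(pivot)
    y = list(rhs)
    for i in range(1, n):
        y[i] -= mult[i] * y[i - 1]
    x = [0.0] * n
    x[n - 1] = y[n - 1] / pivot[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - upper[i] * x[i + 1]) / pivot[i]
    return x


# A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] with two right-hand sides.
lower, diag, upper = [0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0]
mult, pivot = factor_tridiagonal(lower, diag, upper)
solutions = [solve_tridiagonal(mult, pivot, upper, rhs)
             for rhs in ([1.0, 0.0, 1.0], [0.0, 0.0, 4.0])]
```

Both sweeps are inherently sequential along the system, which is the obstacle a parallel algorithm for this problem has to work around.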
TL;DR: This paper proposes not one, but several classes of heuristics for scheduling divisible loads in systems with limited memory, and demonstrates advantages and weaknesses of certain heuristic types for a wide range of changing system parameters.
Abstract: In this paper scheduling divisible loads in systems with limited memory is examined. Divisible loads are parallel computations which can be arbitrarily divided into parts independently processed on remote processors. The scheduling problem consists in distributing the load, taking into account communication and computation time, and limited memory buffers, so that the total processing time is as short as possible. The amount of memory available on the remote processors is too small to hold the whole load at once. Hence, the load must be distributed in many small messages. Since the problem is computationally hard, we propose not one, but several classes of heuristics. In a series of computational experiments we demonstrate advantages and weaknesses of certain heuristic types for a wide range of changing system parameters.
TL;DR: The results indicate that aggregation-based algorithms have advantages compared to algorithms based on the concept of C-F-splitting in terms of computing time.
Abstract: We examine the performance in terms of computing time of different parallel AMG algorithms that are applied within the context of industrial computational fluid dynamics (CFD) problems. We give an overview over the most important classes of algorithms described in literature, pick out four fundamentally different algorithms and perform numerical experiments on up to 16 processors with two benchmarks representing an important class of CFD-problems. The results indicate that aggregation-based algorithms have advantages compared to algorithms based on the concept of C-F-splitting.
TL;DR: An efficient implementation of the Modified SParse Approximate Inverse (MSPAI) preconditioner is presented, introducing two new optimization techniques that reduce the runtime spent on the most demanding numerical parts as the code switches to sparse QR decomposition methods wherever profitable.
Abstract: We present an efficient implementation of the Modified SParse Approximate Inverse (MSPAI) preconditioner. MSPAI generalizes the class of preconditioners based on Frobenius norm minimizations, the class of modified preconditioners such as MILU, as well as interface probing techniques in domain decomposition: it adds probing constraints to the basic SPAI formulation, and one can thus optimize the preconditioner relative to certain subspaces. We demonstrate MSPAI's qualities for iterative regularization problems arising from image deblurring. Such applications demand a fast and parallel preconditioner realization. We present such an implementation introducing two new optimization techniques: First, we avoid redundant calculations using a dictionary. Second, our implementation reduces the runtime spent on the most demanding numerical parts as the code switches to sparse QR decomposition methods wherever profitable. The optimized code runs in parallel with dynamic load balancing.
TL;DR: A way of efficiently implementing the tile QR factorization on a system with a powerful GPU and many multicore CPUs is presented.
Abstract: The tile QR factorization provides an efficient and scalable way for factoring a dense matrix in parallel on multicore processors. This article presents a way of efficiently implementing the algorithm on a system with a powerful GPU and many multicore CPUs.
TL;DR: This paper explores stencil operations in CUDA to optimize on GPUs the Jacobi method for solving Laplace's differential equation, showing great scalability when moving towards a more sophisticated GPU architecture and/or more demanding problem sizes.
Abstract: We are witnessing the consolidation of the GPU streaming paradigm in parallel computing. This paper explores stencil operations in CUDA to optimize on GPUs the Jacobi method for solving Laplace's differential equation. The code keeps the access pattern constant through a large number of loop iterations, making it representative of a wide set of iterative linear algebra algorithms. Optimizations are focused on data parallelism, thread deployment and the GPU memory hierarchy, which the CUDA programmer manages explicitly. Experimental results are shown on Nvidia Tesla C870 and C1060 GPUs and compared to a counterpart version optimized on a quadcore Intel CPU. The speed-up factor for our set of GPU optimizations reaches 3-4x and the execution times beat those of the CPU by a wide margin, also showing great scalability when moving towards a more sophisticated GPU architecture and/or more demanding problem sizes.
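The stencil at the heart of the Jacobi method for Laplace's equation replaces every interior grid value by the average of its four neighbours, always reading from the previous iterate. The following is a minimal sequential sketch of that update on a 2D grid with fixed boundary values; the function name and the list-of-lists grid are assumptions made for illustration, and none of the paper's CUDA optimizations are represented.

```python
def jacobi_laplace(grid, iterations):
    """Run Jacobi sweeps on a 2D grid; boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for _ in range(iterations):
        # Write into a separate buffer so every update reads old values only.
        nxt = [row[:] for row in current]
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                nxt[i][j] = 0.25 * (current[i - 1][j] + current[i + 1][j]
                                    + current[i][j - 1] + current[i][j + 1])
        current = nxt
    return current


# 5x5 grid: top edge held at 1.0, all other boundary values at 0.0.
grid = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
result = jacobi_laplace(grid, 200)
```

Because every point in a sweep depends only on the previous iterate, all interior updates are independent, which is what makes the method a natural fit for one-thread-per-cell execution.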
TL;DR: This paper proposes a mechanism to reduce interconnect power consumption that combines two alternative techniques: dynamically switching on and off network links as a function of traffic and dynamically reducing the available network bandwidth when traffic becomes low.
Abstract: The high level of computing power required for some applications can only be achieved by multiprocessor systems. These systems consist of several processors that communicate by means of an interconnection network. The huge increase both in size and complexity of high-end multiprocessor systems has driven up their power consumption. Complex cooling systems are needed, which, in turn, increase power consumption. Power consumption reduction techniques are being applied everywhere in computer systems and the interconnection network is not an exception, as its contribution is not negligible. In this paper, we propose a mechanism to reduce interconnect power consumption that combines two alternative techniques: (i) dynamically switching on and off network links as a function of traffic (any link can be switched off, provided that network connectivity is guaranteed), (ii) dynamically reducing the available network bandwidth when traffic becomes low. In both cases, the topology of the network is not modified. Therefore, the same routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that the network power consumption can be greatly reduced, at the expense of some increase in latency. However, the achieved power reduction is always higher than the latency penalty.
TL;DR: It is shown that pure meta-heuristics are inefficient for large-scale, non-linear inverse problems, and that the 'no-free-lunch' theorem holds; a consequence of this theorem is that algorithms adapted to the mathematical structure of the problem perform more efficiently than pure meta-heuristics.
Abstract: For non-linear inverse problems, the mathematical structure of the mapping from model parameters to data is usually unknown or partly unknown. Absence of information about the mathematical structure of this function prevents us from presenting an analytical solution, so our solution depends on our ability to produce efficient search algorithms. Such algorithms may be completely problem-independent (which is the case for the so-called 'meta-heuristics' or 'blind-search' algorithms), or they may be designed with the structure of the concrete problem in mind.
We show that pure meta-heuristics are inefficient for large-scale, non-linear inverse problems, and that the 'no-free-lunch' theorem holds. We discuss typical objections to the relevance of this theorem.
A consequence of the no-free-lunch theorem is that algorithms adapted to the mathematical structure of the problem perform more efficiently than pure meta-heuristics. We study problem-adapted inversion algorithms that exploit the knowledge of the smoothness of the misfit function of the problem. Optimal sampling strategies exist for such problems, but many of these problems remain hard.
TL;DR: In this paper, a finite-difference technique based on locally refined time-and-space grids is used to simulate the interaction of seismic waves with cavernous/fractured reservoirs.
Abstract: In order to simulate the interaction of seismic waves with cavernous/fractured reservoirs, a finite-difference technique based on locally refined time-and-space grids is used. The need to use these grids is due primarily to the differing scale of heterogeneities in the reference medium and the reservoir. Domain Decomposition methods allow for the separation of the target area into subdomains containing the reference medium (coarse grid) and reservoir (fine grid). Computations for each subdomain can be carried out in parallel. The data exchange between each subdomain within a group is done using MPI through nonblocking iSend/iReceive commands. The data exchange between the two groups is done simultaneously by coupling the coarse and fine grids.
The results of a numerical simulation of a carbonate reservoir are presented and discussed.
TL;DR: This work introduces and validates a model for an optimal execution on a public cloud infrastructure by means of time, cost and a metric involving both.
Abstract: Until now, several heuristics for scheduling parameter sweep applications in environments such as cluster and grid have been introduced. Cloud computing has revolutionized the way applications are executed in distributed environments, as now it is the infrastructure which is adapted to the application and not vice versa. In the present contribution an astronomy application from the next Finnish-Russian-Spanish mission to Mars is ported to a cloud environment, resulting in a parameter sweep profile. The number of executions needed and the deadline imposed required a large quantity of computing resources for short periods and at specific times. For this reason, we introduce and validate a model for an optimal execution on a public cloud infrastructure by means of time, cost and a metric involving both.
TL;DR: In this paper, the authors investigate portfolio selection with an alternative objective function in a distributed computing environment and propose solving the problem with a heuristic method called Threshold Accepting (TA), which is a flexible technique as it requires no simplifications of the problem and allows for a straightforward implementation of all kinds of constraints.
Abstract: We investigate portfolio selection with an alternative objective function in a distributed computing environment. More specifically, we optimise a portfolio's 'Omega', which is the ratio of two partial moments of the portfolio's return distribution. Since finding optimal portfolios under such a performance measure and realistic constraints is a non-convex problem, we suggest solving the problem with a heuristic method called Threshold Accepting (TA). TA is a very flexible technique as it requires no simplifications of the problem and allows for a straightforward implementation of all kinds of constraints. Applying the algorithm to actual data, we find that TA is well capable of solving this type of problem. Furthermore, we show that the computations can easily be distributed, which leads to considerable speedups.
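A minimal sketch of the two ingredients follows. Omega is taken here in its usual form, the sum of gains above a return threshold divided by the sum of shortfalls below it. Threshold Accepting is a local search that accepts a neighbouring solution unless it is worse than the current one by more than a threshold that shrinks over time. The scenario data, the long-only fully invested constraint, the neighbourhood move, the threshold sequence and all function names are assumptions made for illustration; the paper's constraints, data and distributed implementation are not reproduced.

```python
import random


def omega(returns, threshold=0.0):
    """Ratio of gains above the threshold to shortfalls below it."""
    gains = sum(max(r - threshold, 0.0) for r in returns)
    losses = sum(max(threshold - r, 0.0) for r in returns)
    return float("inf") if losses == 0.0 else gains / losses


def portfolio_returns(weights, scenarios):
    """Portfolio return in each scenario (one row of asset returns each)."""
    return [sum(w * r for w, r in zip(weights, row)) for row in scenarios]


def neighbour(weights, step, rng):
    """Move a little weight from one asset to another; weights stay >= 0."""
    new = list(weights)
    i, j = rng.sample(range(len(new)), 2)
    delta = min(step, new[i])
    new[i] -= delta
    new[j] += delta
    return new


def threshold_accepting(scenarios, thresholds, steps=200, step=0.05, seed=0):
    """Maximise Omega over long-only, fully invested weights."""
    rng = random.Random(seed)
    n = len(scenarios[0])
    current = [1.0 / n] * n
    current_val = omega(portfolio_returns(current, scenarios))
    best, best_val = current, current_val
    for tau in thresholds:  # assumed decreasing, ending at 0
        for _ in range(steps):
            cand = neighbour(current, step, rng)
            cand_val = omega(portfolio_returns(cand, scenarios))
            # Accept unless the candidate is worse by more than tau.
            if cand_val >= current_val - tau:
                current, current_val = cand, cand_val
                if current_val > best_val:
                    best, best_val = current, current_val
    return best, best_val


# Hypothetical scenario returns for three assets.
scenarios = [[0.03, -0.02, 0.01], [-0.04, 0.01, -0.01], [0.02, 0.02, -0.03],
             [-0.01, -0.03, 0.02], [0.01, 0.04, 0.00]]
weights, value = threshold_accepting(scenarios, thresholds=[0.5, 0.1, 0.0])
```

Constraints enter only through the neighbourhood move and the objective, which illustrates why the method needs no simplification of the problem. Independent restarts with different seeds are one simple way such a search could be distributed.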
TL;DR: Experimental results show that the ReSHAPE framework can significantly improve individual job turn-around time and overall system throughput, even with very simple application scheduling policies.
Abstract: This paper describes the design and initial implementation of a software framework for exploiting resizability in distributed-memory parallel applications. By ''resizable'' we mean the ability at run-time to expand or contract the number of processes participating in a parallel application. The ReSHAPE framework described here includes a cluster scheduler, a library supporting data redistribution and process remapping, and an application programming interface (API) which allows applications to interact with the scheduler and resizing library with only minor code modifications. Parallel applications executed using the ReSHAPE framework can expand to take advantage of additional free processors or contract to accommodate a high priority application without being suspended. Experimental results show that the ReSHAPE framework can significantly improve individual job turn-around time and overall system throughput, even with very simple application scheduling policies. In addition, the framework serves as a convenient platform for research into much more sophisticated cluster scheduling policies and methods.
TL;DR: This work proposes a novel, multilevel, 2D coarsening-based 2D matrix partitioning method and implements it using the PaToH MATLAB Matrix Partitioning Interface, finding that the new method can compete with the fine-grain method while also suggesting new research directions.
Abstract: We present the PaToH MATLAB Matrix Partitioning Interface. The interface provides support for hypergraph-based sparse matrix partitioning methods which are used for efficient parallelization of sparse matrix-vector multiplication operations. The interface also offers tools for visualizing and measuring the quality of a given matrix partition. We propose a novel, multilevel, 2D coarsening-based 2D matrix partitioning method and implement it using the interface. We have performed extensive comparison of the proposed method against our implementation of orthogonal recursive bisection and fine-grain methods on a large set of publicly available test matrices. The conclusion of the experiments is that the new method can compete with the fine-grain method while also suggesting new research directions.
TL;DR: This work implements two parameterization methods, both based on randomized quasi-Monte Carlo, and applies them to pricing digital options and collateralized mortgage obligations.
Abstract: We present a theoretical framework where any randomized quasi-Monte Carlo method can be viewed and analyzed as a parameterization method for parallel quasi-Monte Carlo. We present deterministic and stochastic error bounds when different processors of the computing environment run at different speeds. We implement two parameterization methods, both based on randomized quasi-Monte Carlo, and apply them to pricing digital options and collateralized mortgage obligations. Numerical results are used to compare the parameterization methods by their parallel performance as well as their Monte Carlo efficiency.
TL;DR: This paper introduces a change of numeraire in the stochastic processes for risk sources, thus providing estimates under the forward risk-neutral measure that result in a gain in accuracy, and parallelizes the Monte Carlo method to speed up the simulation process.
Abstract: In this paper we discuss the development of a valuation system of asset-liability management of portfolios of life insurance policies on advanced architectures. According to the new rules of the Solvency II project, numerical simulations must provide reliable estimates of the relevant quantities involved in the contracts; therefore, valuation processes have to rely on accurate algorithms able to provide solutions in a suitable turnaround time. Our target is to develop effective valuation software. To this end we first introduce a change of numeraire in the stochastic processes for risk sources, thus providing estimates under the forward risk-neutral measure that result in a gain in accuracy. We then parallelize the Monte Carlo method to speed up the simulation process.