Top 22 papers presented at Languages and Compilers for Parallel Computing in 2012

Book Chapter•10.1007/978-3-642-37658-0_6•

Parallel clustered low-rank approximation of graphs and its application to link prediction

Xin Sui¹, Tsung-Hsien Lee¹, Joyce Jiyoung Whang¹, Berkant Savas², Saral Jain¹, Keshav Pingali¹, Inderjit S. Dhillon¹•Institutions (2)

University of Texas at Austin¹, Linköping University²

11 Sep 2012

TL;DR: This paper describes the first parallel implementation of a clustered low-rank approximation algorithm for large social network graphs, uses it to perform link prediction in parallel, and shows that this implementation scales well on large distributed-memory machines.

Abstract: Social network analysis has become a major research area that has impact in diverse applications ranging from search engines to product recommendation systems. A major problem in implementing social network analysis algorithms is the sheer size of many social networks; for example, the Facebook graph has more than 900 million vertices and even small networks may have tens of millions of vertices. One solution to dealing with these large graphs is dimensionality reduction using spectral or SVD analysis of the adjacency matrix of the network, but these global techniques do not necessarily take into account local structures or clusters of the network that are critical in network analysis. A more promising approach is clustered low-rank approximation: instead of computing a global low-rank approximation, the adjacency matrix is first clustered, and then a low-rank approximation of each cluster (i.e., diagonal block) is computed. The resulting algorithm is challenging to parallelize not only because of the large size of the data sets in social network analysis, but also because it requires computing with very diverse data structures ranging from extremely sparse matrices to dense matrices. In this paper, we describe the first parallel implementation of a clustered low-rank approximation algorithm for large social network graphs, and use it to perform link prediction in parallel. Experimental results show that this implementation scales well on large distributed-memory machines; for example, on a Twitter graph with roughly 11 million vertices and 63 million edges, our implementation scales by a factor of 86 on 128 processes and takes less than 2300 seconds, while on a much larger Twitter graph with 41 million vertices and 1.2 billion edges, our implementation scales by a factor of 203 on 256 processes with a running time of about 4800 seconds.
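
The cluster-then-approximate idea described in this abstract can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not the authors' parallel implementation: the cluster labels are taken as given, the matrix is dense, and only the diagonal blocks are approximated (here by truncated SVD). The names `clustered_low_rank` and `link_scores` and their parameters are hypothetical.

```python
import numpy as np

def clustered_low_rank(adj, labels, rank):
    """Approximate each diagonal block of `adj` by a rank-`rank` truncated SVD.

    adj    : dense (n, n) adjacency matrix
    labels : length-n array of cluster ids (assumed to be precomputed)
    rank   : number of singular triplets kept per cluster
    """
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    approx = np.zeros_like(adj)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)          # vertices in cluster c
        block = adj[np.ix_(idx, idx)]              # diagonal block for cluster c
        u, s, vt = np.linalg.svd(block, full_matrices=False)
        k = min(rank, len(s))
        approx[np.ix_(idx, idx)] = (u[:, :k] * s[:k]) @ vt[:k, :]
    return approx

def link_scores(approx, adj, node):
    """Rank the current non-neighbours of `node` by their approximated entry."""
    adj = np.asarray(adj)
    candidates = [j for j in range(adj.shape[0])
                  if j != node and adj[node, j] == 0]
    return sorted(((approx[node, j], j) for j in candidates), reverse=True)
```

In this sketch, a vertex pair that is not connected but sits inside a densely connected cluster receives a positive score from the block approximation, which is the intuition behind using the approximation for link prediction.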

22 citations

Book Chapter•10.1007/978-3-642-37658-0_7•

OmpSs-OpenCL Programming Model for Heterogeneous Systems

Vinoth Krishnan Elangovan¹,², Rosa M. Badia¹,³, Eduard Ayguadé Parra¹,²•Institutions (3)

Barcelona Supercomputing Center¹, Polytechnic University of Catalonia², Spanish National Research Council³

11 Sep 2012

TL;DR: This paper integrates the OpenCL framework with the OmpSs task-based programming model using the Nanos run-time infrastructure to address shortcomings of OpenCL, enabling the programmer to skip cumbersome OpenCL constructs and write a sequential program with annotated pragmas.

Abstract: The advent of heterogeneous computing has forced programmers to use platform-specific programming paradigms in order to achieve maximum performance. This approach has a steep learning curve for programmers and also has a detrimental influence on productivity and code re-usability. To help with this situation, OpenCL, an open-source parallel computing API for cross-platform computations, was conceived. OpenCL provides a homogeneous view of the computational resources (CPU and GPU), thereby enabling software portability across different platforms. Although OpenCL resolves software portability issues, the programming paradigm presents low programmability and additionally falls short in performance. In this paper we focus on integrating the OpenCL framework with the OmpSs task-based programming model using the Nanos run-time infrastructure to address these shortcomings. This would enable the programmer to skip cumbersome OpenCL constructs, including OpenCL platform creation, compilation, kernel building, kernel argument setting and memory transfers, and instead write a sequential program with annotated pragmas. Our proposal mainly focuses on how to exploit the best of the underlying hardware platform with greater ease in programming and to gain significant performance using the data parallelism offered by the OpenCL run time for GPUs and multicore architectures. We have evaluated the platform with important benchmarks and have noticed substantial ease in programming with comparable performance.

20 citations

Book Chapter•10.1007/978-3-642-37658-0_10•

A Study on the Impact of Compiler Optimizations on High-Level Synthesis

Jason Cong¹, Bin Liu¹, Raghu Prabhakar¹, Peng Zhang¹•Institutions (1)

University of California¹

11 Sep 2012

TL;DR: This paper explores the effects of both source-level and IR optimizations and phase ordering on high-level synthesis; three commonly used source-level optimizations are studied in isolation, and simple yet effective heuristics are proposed for applying them to obtain a reasonable latency-area tradeoff.

Abstract: High-level synthesis is a design process that takes an untimed, behavioral description in a high-level language like C and produces register-transfer-level (RTL) code that implements the same behavior in hardware. In this design flow, the quality of the generated RTL is greatly influenced by the high-level description of the language. Hence it follows that both source-level and IR-level compiler optimizations could either improve or hurt the quality of the generated RTL. The problem of ordering compiler optimization passes, also known as the phase-ordering problem, has been an area of active research over the past decade. In this paper, we explore the effects of both source-level and IR optimizations and phase ordering on high-level synthesis. The parameters of the generated RTL are very sensitive to high-level optimizations. We study three commonly used source-level optimizations in isolation and then propose simple yet effective heuristics to apply them to obtain a reasonable latency-area tradeoff. We also study the phase-ordering problem for IR-level optimizations from a HLS perspective and compare it to a CPU-based setting. Our initial results show that an input-specific order can achieve a significant reduction in the latency of the generated RTL, and opens up this technology for future research.

19 citations

Book Chapter•10.1007/978-3-642-37658-0_17•

Beyond Do Loops: Data Transfer Generation with Convex Array Regions

Serge Guelton¹, Mehdi Amini², Béatrice Creusillet•Institutions (2)

École nationale supérieure des télécommunications de Bretagne¹, Mines ParisTech²

11 Sep 2012

TL;DR: The scope of this study ranges from inter-procedural analysis in simple loop nests with function calls, to inter-iteration data reuse optimization and arbitrary control flow in loop bodies.

Abstract: Automatic data transfer generation is a critical step for guided or automatic code generation for accelerators using distributed memories. Although good results have been achieved for loop nests, more complex control flows such as switches or while loops are generally not handled. This paper shows how to leverage the convex array regions abstraction to generate data transfers. The scope of this study ranges from inter-procedural analysis in simple loop nests with function calls, to inter-iteration data reuse optimization and arbitrary control flow in loop bodies. Generated transfers are approximated when an exact solution cannot be found. Array regions are also used to extend redundant load store elimination to array variables. The approach has been successfully applied to GPUs and domain-specific hardware accelerators.

18 citations

Book Chapter•10.1007/978-3-642-37658-0_12•

Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages

Dounia Khaldi¹, Pierre Jouvelot¹, Corinne Ancourt¹, François Irigoin¹•Institutions (1)

Mines ParisTech¹

11 Sep 2012

TL;DR: This study surveys six popular parallel language designs and suggests that, even though there are many keywords and notions introduced by these languages, they boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity.

Abstract: Efficiently programming parallel computers would ideally require a language that provides high-level programming constructs to avoid the programming errors frequent when expressing parallelism. Since task parallelism is considered more error-prone than data parallelism, we survey six popular parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP and OpenCL. Using the parallel computation of the Mandelbrot set as a running example, this paper describes how the fundamentals of task parallel programming are dealt with in these languages. Our study suggests that, even though there are many keywords and notions introduced by these languages, they boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. These languages adopt one of three memory models: shared, message passing and Partitioned Global Address Space. The paper is designed to give users and language and compiler designers an up-to-date comparative overview of current parallel languages.
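
The three task concepts named in this abstract (creation, synchronization and atomicity) can be illustrated on the same Mandelbrot example. The sketch below uses Python threads rather than any of the six surveyed languages, so it only shows where each concept appears; the function names, image size and coordinate window are assumptions chosen for illustration, and CPython threads will not speed up this CPU-bound loop.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def escape_time(c, max_iter=50):
    """Iterations of z -> z*z + c before |z| exceeds 2 (max_iter if it never does)."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def mandelbrot(width=16, height=8, max_iter=50, workers=4):
    image = [[0] * width for _ in range(height)]
    inside = 0                      # shared counter of points that never escaped
    lock = threading.Lock()

    def row_task(y):
        nonlocal inside
        local_count = 0
        for x in range(width):
            c = complex(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height)
            image[y][x] = escape_time(c, max_iter)
            if image[y][x] == max_iter:
                local_count += 1
        with lock:                  # atomicity: guarded update of shared state
            inside += local_count

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # creation: one task per image row
        futures = [pool.submit(row_task, y) for y in range(height)]
        for f in futures:
            f.result()              # synchronization: wait for every task
    return image, inside
```

Each row is written by exactly one task, so only the shared counter needs the lock.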

17 citations

Book Chapter•10.1007/978-3-642-37658-0_16•

Compiler Automatic Discovery of OmpSs Task Dependencies

Sara Royuela¹, Alejandro Duran¹,², Xavier Martorell¹•Institutions (2)

Barcelona Supercomputing Center¹, Intel²

11 Sep 2012

TL;DR: An algorithm based on the discovery of code concurrent to a task and liveness analysis is developed that enables the compiler to automatically determine the dependencies of OmpSs tasks, thus releasing users from the task of manually defining these dependencies.

Abstract: Dependence analysis is an essential step for many compiler optimizations, from simple loop transformations to automatic parallelization. Parallel programming models require specific dependence analyses that take into account multi-threaded execution. Furthermore, asynchronous parallelism introduced by OpenMP tasks has promoted the development of new dependency analysis techniques. In these terms, the OmpSs parallel programming model extends OpenMP tasks with the definition of inter-task dependencies. This extension allows run-time dependency detection, which potentially improves performance when load balancing or locality rules the execution time. On the other hand, the extension requires the user to figure out the data-sharing attributes and the type of access to each data item in all tasks in order to correctly specify the dependencies. We aim to enhance the programmability of OmpSs with a new methodology that enables the compiler to automatically determine the dependencies of OmpSs tasks, thus releasing users from the task of manually defining these dependencies. In this context, we have developed an algorithm based on the discovery of code concurrent to a task and liveness analysis. The algorithm first finds all code concurrent with a given task. Then, it computes the data-sharing attributes of the variables appearing in the task. Finally, it analyzes the liveness properties of the task’s shared variables. With this information, the algorithm figures out the proper dependencies of the task. We have implemented this algorithm in the Mercurium source-to-source compiler. We have tested the results with several benchmarks, showing that the algorithm is able to correctly find a large number of dependency expressions.

12 citations

Book Chapter•10.1007/978-3-642-37658-0_3•

Compiler Optimizations: Machine Learning versus O3

Yuriy Kashnikov¹, Jean Christophe Beyler², William Jalby¹•Institutions (2)

Versailles Saint-Quentin-en-Yvelines University¹, Intel²

11 Sep 2012

TL;DR: This paper extensively tests the other performance options available and concludes that, although old compiler versions could benefit from compiler flag combinations, modern compilers perform admirably at the commonly used -O3 level.

Abstract: Software engineers are highly dependent on compiler technology to create efficient programs. Optimal execution time is currently the most important criterion in the HPC field; to achieve this, the user applies the common compiler option -O3. The following paper extensively tests the other performance options available and concludes that, although old compiler versions could benefit from compiler flag combinations, modern compilers perform admirably at the commonly used -O3 level.

10 citations

Book Chapter•10.1007/978-3-642-37658-0_1•

Just in Time Load Balancing

Rosario Cammarota¹, Alexandru Nicolau¹, Alexander V. Veidenbaum¹•Institutions (1)

University of California¹

11 Sep 2012

TL;DR: A rapid increase in the number of on-chip cores and in the ways such cores share on-chip resources, such as the pipeline and memory hierarchy, leads to an increase in the number of possible high-performance configurations, which makes attaining peak performance through the exploitation of loop-level parallelism (LLP) an increasingly complex problem.

Abstract: Leveraging Loop Level Parallelism (LLP) is one of the most attractive techniques for improving program performance on emerging multi-cores. Ordinary programs contain a large number of parallel and DOALL loops; however, emerging multi-core designs feature a rapid increase in the number of on-chip cores, and the ways such cores share on-chip resources, such as pipeline and memory hierarchy, lead to an increase in the number of possible high-performance configurations. This trend in emerging multi-core design makes attaining peak performance through the exploitation of LLP an increasingly complex problem.

5 citations

Book Chapter•10.1007/978-3-642-37658-0_18•

Finish Accumulators: An Efficient Reduction Construct for Dynamic Task Parallelism

Jun Shirako¹, Vincent Cavé¹, Jisheng Zhao¹, Vivek Sarkar¹•Institutions (1)

Rice University¹

11 Sep 2012

TL;DR: Experimental results demonstrate that the Java-based implementation of finish accumulators delivers comparable or better performance for computing reductions relative to Java’s atomic variables and concurrent collections.

Abstract: Parallel reductions represent a common pattern for computing the aggregation of an associative and commutative operation, such as summation, across multiple pieces of data supplied by parallel tasks. In this poster, we introduce finish accumulators, a unified construct that supports predefined and user-defined parallel reductions for dynamic task parallelism. Finish accumulators are designed to be integrated into structured task parallelism constructs, such as the async and finish constructs found in the X10 and Habanero-Java (HJ) languages, so as to guarantee determinism for accumulation and to avoid any possible race conditions in referring to intermediate results. In contrast to lower-level reduction constructs such as atomic variables, the high-level semantics of finish accumulators allows for a wide range of implementations with different accumulation policies, e.g., eager-computation vs. lazy-computation. The best implementation can thus be selected based on a given application and target platform. We have integrated finish accumulators into the Habanero-Java task parallel language, and used them for research and teaching. In addition to their higher-level semantics, experimental results demonstrate that our Java-based implementation of finish accumulators delivers comparable or better performance for computing reductions relative to Java’s atomic variables and concurrent collections.
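
The accumulator semantics described in this abstract can be mimicked in a small Python sketch. This is not the Habanero-Java API: the class `FinishAccumulator`, its `put`/`get` methods and the `finish` helper are hypothetical names, and the sketch assumes an associative and commutative operator with an identity value. Tasks contribute values with `put`, per-thread partial results are combined only when the finish scope ends (a lazy-style policy), and `get` refuses to expose intermediate results.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from operator import add

class FinishAccumulator:
    """Reduction whose result is readable only after the enclosing finish ends."""

    def __init__(self, op, identity):
        self._op = op
        self._identity = identity
        self._partials = {}              # thread id -> partial result
        self._lock = threading.Lock()    # protects the dictionary of partials
        self._result = None
        self._closed = False

    def put(self, value):
        tid = threading.get_ident()
        with self._lock:
            current = self._partials.get(tid, self._identity)
            self._partials[tid] = self._op(current, value)

    def _close(self):
        result = self._identity
        for partial in self._partials.values():
            result = self._op(result, partial)
        self._result = result
        self._closed = True

    def get(self):
        if not self._closed:
            raise RuntimeError("result is not available inside the finish scope")
        return self._result

def finish(acc, tasks, workers=4):
    """Run every task with the accumulator, wait for all, then combine partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for future in [pool.submit(task, acc) for task in tasks]:
            future.result()
    acc._close()
```

Because the combined value is produced once, after all tasks have completed, the result does not depend on task scheduling as long as the operator is associative and commutative.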

4 citations

Book Chapter•10.1007/978-3-642-37658-0_19•

FlashbackSTM: Improving STM Performance by Remembering the Past

Hugo Rito¹, João Cachopo¹•Institutions (1)

Instituto Superior Técnico¹

11 Sep 2012

TL;DR: Software Transactional Memory (STM) is a promising abstraction for simplifying the task of building highly parallel applications, because programmers using an STM may ignore low-level synchronization details and simply specify which operations must execute atomically inside transactions.

Abstract: As multicore machines become pervasive, an ever-growing number of programmers face the challenge of building highly parallel applications that take full advantage of modern parallel hardware architectures. Software Transactional Memory (STM) [3] is one promising abstraction to simplify this task because, when using an STM, programmers may ignore low-level synchronization details and simply specify which operations must execute atomically inside transactions. It is then the STM’s responsibility to preserve the program’s semantics, while maintaining as much parallelism and concurrency as possible.

3 citations

Book Chapter•10.1007/978-3-642-37658-0_14•

A Software-Based Method-Level Speculation Framework for the Java Platform

Ivo Anjo¹, João Cachopo¹•Institutions (1)

Technical University of Lisbon¹

11 Sep 2012

TL;DR: With multicore processors becoming ubiquitous, the need both to parallelize existing sequential applications and to design new parallel applications has intensified; this work tackles the former, parallelizing existing sequential applications.

Abstract: With multicore processors becoming ubiquitous on computing devices, the need for both parallelizing existing sequential applications and designing new parallel applications is greatly intensified. With our work, we intend to tackle the former issue.

Book Chapter•10.1007/978-3-642-37658-0_9•

UCIFF: Unified Cluster Assignment Instruction Scheduling and Fast Frequency Selection for Heterogeneous Clustered VLIW Cores

Vasileios Porpodas¹, Marcelo Cintra¹•Institutions (1)

University of Edinburgh¹

11 Sep 2012

TL;DR: Heterogeneous clustered VLIW processors support dynamic voltage and frequency scaling (DVFS) independently per cluster; effectively controlling DVFS to selectively decrease the frequency of clusters with a lot of slack in their schedule can lead to significant energy savings.

Abstract: Clustered VLIW processors are scalable wide-issue statically scheduled processors. Their design is based on physically partitioning the otherwise shared hardware resources, a design which leads to both high performance and low energy consumption. In traditional clustered VLIW processors, all clusters operate at the same frequency. Heterogeneous clustered VLIW processors, however, support dynamic voltage and frequency scaling (DVFS) independently per cluster. Effectively controlling DVFS to selectively decrease the frequency of clusters with a lot of slack in their schedule can lead to significant energy savings.

Book Chapter•10.1007/978-3-642-37658-0_13•

A Fast Parallel Graph Partitioner for Shared-Memory Inspector/Executor Strategies

Christopher D. Krieger¹, Michelle Mills Strout¹•Institutions (1)

Colorado State University¹

11 Sep 2012

TL;DR: This paper presents a shared memory parallel graph partitioner, ParCubed, for use in the context of sparse tiling run-time data and computation reordering and compares the presented hierarchical clustering partitioner with GPart and METIS in terms of partitioning speed, partitioning quality, and the effect the generated seed partitions have on executor speed.

Abstract: Graph partitioners play an important role in many parallel work distribution and locality optimization approaches. Surprisingly, however, to our knowledge there is no freely available parallel graph partitioner designed for execution on a shared memory multicore system. This paper presents a shared memory parallel graph partitioner, ParCubed, for use in the context of sparse tiling run-time data and computation reordering. Sparse tiling is a run-time scheduling technique that schedules groups of iterations across loops together when they access the same data and one or more of the loops contains indirect array accesses. For sparse tiling, which is implemented with an inspector/executor strategy, the inspector needs to find an initial seed partitioning of adequate quality very quickly. We compare our presented hierarchical clustering partitioner, ParCubed, with GPart and METIS in terms of partitioning speed, partitioning quality, and the effect the generated seed partitions have on executor speed. We find that the presented partitioner is 25 to 100 times faster than METIS on a 16 core machine. The total edge cut of the partitioning generated by ParCubed was found not to exceed 1.27x that of the partitioning found by METIS.
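
The partitioning-quality metric used in this comparison, total edge cut, is straightforward to compute. The sketch below is a minimal illustration, not code from ParCubed, GPart or METIS; the function names and the toy graph are assumptions. A ratio such as the 1.27x bound quoted above would be obtained by dividing the edge cut of one partitioning by that of another.

```python
def edge_cut(edges, part):
    """Count edges whose two endpoints are assigned to different partitions.

    edges : iterable of (u, v) vertex pairs, each undirected edge listed once
    part  : mapping (list or dict) from vertex to partition id
    """
    return sum(1 for u, v in edges if part[u] != part[v])

def cut_ratio(edges, part_a, part_b):
    """Edge cut of partitioning A relative to partitioning B."""
    return edge_cut(edges, part_a) / edge_cut(edges, part_b)

# Toy example: a 6-vertex cycle split into two halves in two different ways.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
contiguous = [0, 0, 0, 1, 1, 1]     # cuts only edges (2, 3) and (5, 0)
alternating = [0, 1, 0, 1, 0, 1]    # cuts every edge
```

Here the contiguous split has an edge cut of 2 and the alternating split an edge cut of 6, so the alternating partitioning is three times worse by this metric.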

Book Chapter•10.1007/978-3-642-37658-0_15•

Ant: A Debugging Framework for MPI Parallel Programs

Jae-Woo Lee¹, Leonardo R. Bachega¹, Samuel P. Midkiff¹, Yu Charlie Hu¹•Institutions (1)

Purdue University¹

11 Sep 2012

TL;DR: Ant’s instrumentation strategy reduces the overhead of monitoring by over 14 times with less impact on accuracy than a scheme that simply distributes monitoring over all processes executing the program.

Abstract: This paper describes Ant, a debugging framework targeting MPI parallel programs. The Ant framework statically analyzes programs, marking code regions as being executed by all processes or executed by only some of the processes. The analyzed program is then instrumented with calls to an invariant violation monitoring and detection library. The analysis allows regions to be instrumented based on whether all, or less than all, processes execute the region. Ant’s instrumentation strategy allows sampled monitoring across processes in regions executed by all processes. We present a case study using Ant with C-DIDUCE (a variant of DIDUCE for C) to find violations of value invariants in parallel C/MPI programs. Ant’s instrumentation strategy reduces the overhead of monitoring by over 14 times with less impact on accuracy than a scheme that simply distributes monitoring over all processes executing the program.

Book Chapter•10.1007/978-3-642-37658-0_21•

Language and Architecture Independent Software Thread-Level Speculation

Zhen Cao¹, Clark Verbrugge¹•Institutions (1)

McGill University¹

11 Sep 2012

TL;DR: Pure software designs for thread-level speculation (TLS) have relatively recently become of interest, trading increased overhead concerns for the potential of providing new and user-friendly approaches to extracting parallelism and of making use of commodity multiprocessors without the need for new hardware.

Abstract: Thread-level speculation (TLS) has historically been investigated in the context of novel hardware designs (Chen and Olukotun, 2003; Steffan et al., 2005; Quinones et al., 2005). Pure software designs for TLS, however, have relatively recently become of interest, trading increased overhead concerns for the potential of providing new and user-friendly approaches to extracting parallelism, and making use of commodity multiprocessors without the need for new hardware (Pickett and Verbrugge, 2005; Oancea and Mycroft, 2008). Investigation of such approaches, however, tends to be hampered by the need for such systems to build on specific language or execution contexts with implicit source-level requirements, and lack of integration with a realistic compiler infrastructure.

Book Chapter•10.1007/978-3-642-37658-0_20•

Kaira: Generating Parallel Libraries and Their Usage with Octave

Stanislav Böhm¹, Marek Běhálek¹, Ondřej Meca¹•Institutions (1)

Technical University of Ostrava¹

11 Sep 2012

TL;DR: The main development goal is to create a practically usable general-purpose high-level visual programming tool for the area of High Performance Computing (HPC), especially for distributed memory systems.

Abstract: We are developing a tool, Kaira [1,2]. Our main development goal is to create a practically usable general-purpose high-level visual programming tool for the area of High Performance Computing (HPC), especially for distributed memory systems. We feel that there is space for this research. Tools used by practitioners in this area are usually low-level ones (like Message Passing Interface – MPI) or domain-specific tools.

Book Chapter•10.1007/978-3-642-37658-0_22•

Abstractions for Defining Semi-Regular Grids Orthogonally from Stencils

Andrew Stone¹, Michelle Mills Strout¹•Institutions (1)

Colorado State University¹

11 Sep 2012

TL;DR: This work introduces abstractions, implemented in the GridLib library, that separate grid, algorithm, and parallelization for semi-regular grids, where sub-domains of the grid are regular (e.g., can be stored in an array) but boundaries between sub-domains connect in an irregular fashion.

Abstract: In various applications including atmospheric and ocean simulation programs, stencil computations occur on grids where sub-domains of the grid are regular (e.g., can be stored in an array) but boundaries between sub-domains connect in an irregular fashion. We call this class of grids semi-regular. Implementations of stencils on semi-regular grids often have grid-structure details tangled with the stencil computation code. This tangling of details requires programmers to have full knowledge of the current grid structure to make changes to the stencil computations and makes changing the grid structure extremely expensive. Existing libraries and tools [1-7] for stencil computations have not focused on this class of grid, focusing instead on purely regular or irregular grids. In this poster we introduce abstractions for the class of semi-regular grids and describe the GridLib library where we have implemented these abstractions. These abstractions enable a separation of grid, algorithm, and parallelization for semi-regular grids.

Book Chapter•10.1007/978-3-642-37658-0_2•

AlphaZ: A System for Design Space Exploration in the Polyhedral Model

Tomofumi Yuki¹, Gautam Gupta, DaeGon Kim, Tanveer Pathan, Sanjay Rajopadhye¹•Institutions (1)

Colorado State University¹

11 Sep 2012

TL;DR: The polyhedral model is now a well-established and effective formalism for program optimization and parallelization; however, finding optimal transformations is a long-standing open problem, so tools are needed that allow practitioners to explore different choices through script-driven or user-guided transformations.

Abstract: The polyhedral model is now a well-established and effective formalism for program optimization and parallelization. However, finding optimal transformations is a long-standing open problem. It is therefore important to develop tools that, rather than following predefined optimization criteria, allow practitioners to explore different choices through script-driven or user-guided transformations. Such flexibility is even more important for compiler researchers and auto-tuner developers than for practitioners. In addition, tools must also raise the level of abstraction by representing and manipulating reductions and scans explicitly. Third, the tools must also be able to explore transformation choices that consider memory (re)-allocation.

Book Chapter•10.1007/978-3-642-37658-0_11•

FlowPools: A Lock-Free Deterministic Concurrent Dataflow Abstraction

Aleksandar Prokopec¹, Heather Miller¹, Tobias Schlatter¹, Philipp Haller, Martin Odersky¹•Institutions (1)

École Polytechnique Fédérale de Lausanne¹

11 Sep 2012

TL;DR: This paper presents the design and implementation of a fundamental data structure for composable deterministic parallel dataflow computation through the use of functional programming abstractions, and provides a correctness proof, showing that the implementation is linearizable, lock-free, and deterministic.

Abstract: Implementing correct and deterministic parallel programs is challenging. Even though concurrency constructs exist in popular programming languages to facilitate the task of deterministic parallel programming, they are often too low level, or do not compose well due to underlying blocking mechanisms. In this paper, we present the design and implementation of a fundamental data structure for composable deterministic parallel dataflow computation through the use of functional programming abstractions. Additionally, we provide a correctness proof, showing that the implementation is linearizable, lock-free, and deterministic. Finally, we show experimental results which compare our FlowPool against corresponding operations on other concurrent data structures, and show that in addition to offering new capabilities, FlowPools reduce insertion time by 49–54% on a 4-core i7 machine with respect to comparable concurrent queue data structures in the Java standard library.

Book Chapter•10.1007/978-3-642-37658-0_4•

The stapl Parallel Graph Library

Harshvardhan¹, Adam Fidel¹, Nancy M. Amato¹, Lawrence Rauchwerger¹•Institutions (1)

Texas A&M University¹

11 Sep 2012

TL;DR: The library introduces pGraph pViews that separate algorithm design from the container implementation, and supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them.

Abstract: This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges.

Book Chapter•10.1007/978-3-642-37658-0_8•

Compiler Optimizations for Industrial Unstructured Mesh CFD Applications on GPUs

Carlo Bertolli¹, Adam Betts¹, Nicolas Loriant¹, Gihan R. Mudalige², David Radford³, David A. Ham¹, Michael B. Giles², Paul H. J. Kelly¹•Institutions (3)

Imperial College London¹, University of Oxford², Rolls-Royce Holdings³

11 Sep 2012

TL;DR: This paper describes three techniques for GPU optimization of unstructured mesh applications (splitting a highly complex loop into simpler loops, kernel-specific alternative code synthesis, and configuration parameter tuning) and shows that applying them systematically to HYDRA improves GPU performance relative to the multicore CPU.

Abstract: Graphical Processing Units (GPUs) have shown acceleration factors over multicores for structured mesh-based Computational Fluid Dynamics (CFD). However, the value remains unclear for dynamic and irregular applications. Our motivating example is HYDRA, an unstructured mesh application used in production at Rolls-Royce for the simulation of turbomachinery components of jet engines. We describe three techniques for GPU optimization of unstructured mesh applications: a technique able to split a highly complex loop into simpler loops, a kernel specific alternative code synthesis, and configuration parameter tuning. Using these optimizations systematically on HYDRA improves the GPU performance relative to the multicore CPU. We show how these optimizations can be automated in a compiler, through user annotations. Performance analysis of a large number of complex loops enables us to study the relationship between optimizations and resource requirements of loops, in terms of registers and shared memory, which directly affect the loop performance.

Book Chapter•10.1007/978-3-642-37658-0_5•

Set and Relation Manipulation for the Sparse Polyhedral Framework

Michelle Mills Strout¹, Geri Georg¹, Catherine Olschanowsky¹•Institutions (1)

Colorado State University¹

11 Sep 2012

TL;DR: Algorithms for manipulating sets and relations with uninterpreted function symbols to enable the Sparse Polyhedral Framework are presented and implemented.

Abstract: The Sparse Polyhedral Framework (SPF) extends the Polyhedral Model by using the uninterpreted function call abstraction for the compile-time specification of run-time reordering transformations such as loop and data reordering and sparse tiling approaches that schedule irregular sets of iterations across loops. The Polyhedral Model represents sets of iteration points in imperfectly nested loops with unions of polyhedra and represents loop transformations with affine functions applied to such polyhedral sets. Existing tools such as ISL, Cloog, and Omega manipulate polyhedral sets and affine functions; however, the ability to represent the sets and functions where some of the constraints include uninterpreted function calls such as those needed in the SPF is non-existent or severely restricted. This paper presents algorithms for manipulating sets and relations with uninterpreted function symbols to enable the Sparse Polyhedral Framework. The algorithms have been implemented in an open-source C++ library called IEGenLib (The Inspector/Executor Generator Library).
