Scispace (Formerly Typeset)
Showing papers presented at "Languages and Compilers for Parallel Computing in 2012"
Book Chapter•10.1007/978-3-642-37658-0_6•
Parallel clustered low-rank approximation of graphs and its application to link prediction

Xin Sui1, Tsung-Hsien Lee1, Joyce Jiyoung Whang1, Berkant Savas2, Saral Jain1, Keshav Pingali1, Inderjit S. Dhillon1 •
University of Texas at Austin1, Linköping University2
11 Sep 2012
TL;DR: This paper describes the first parallel implementation of a clustered low-rank approximation algorithm for large social network graphs, and uses it to perform link prediction in parallel and shows that this implementation scales well on large distributed-memory machines.
Abstract: Social network analysis has become a major research area that has impact in diverse applications ranging from search engines to product recommendation systems. A major problem in implementing social network analysis algorithms is the sheer size of many social networks, for example, the Facebook graph has more than 900 million vertices and even small networks may have tens of millions of vertices. One solution to dealing with these large graphs is dimensionality reduction using spectral or SVD analysis of the adjacency matrix of the network, but these global techniques do not necessarily take into account local structures or clusters of the network that are critical in network analysis. A more promising approach is clustered low-rank approximation: instead of computing a global low-rank approximation, the adjacency matrix is first clustered, and then a low-rank approximation of each cluster (i.e., diagonal block) is computed. The resulting algorithm is challenging to parallelize not only because of the large size of the data sets in social network analysis, but also because it requires computing with very diverse data structures ranging from extremely sparse matrices to dense matrices. In this paper, we describe the first parallel implementation of a clustered low-rank approximation algorithm for large social network graphs, and use it to perform link prediction in parallel. Experimental results show that this implementation scales well on large distributed-memory machines; for example, on a Twitter graph with roughly 11 million vertices and 63 million edges, our implementation scales by a factor of 86 on 128 processes and takes less than 2300 seconds, while on a much larger Twitter graph with 41 million vertices and 1.2 billion edges, our implementation scales by a factor of 203 on 256 processes with a running time about 4800 seconds.

22 citations
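
The scaling figures quoted in the abstract above can be turned into parallel efficiency with a line of arithmetic. The sketch below assumes that "scales by a factor of N" means speedup relative to a single-process baseline, which the abstract does not state explicitly; the function name `parallel_efficiency` is illustrative only and does not come from the paper.

```python
def parallel_efficiency(speedup: float, processes: int) -> float:
    """Fraction of ideal linear speedup achieved (speedup / process count)."""
    if processes <= 0:
        raise ValueError("processes must be positive")
    return speedup / processes


# (speedup, processes) pairs quoted in the abstract; the single-process
# baseline is an assumption, not something the abstract confirms.
runs = {
    "Twitter graph, 11M vertices / 63M edges": (86, 128),
    "Twitter graph, 41M vertices / 1.2B edges": (203, 256),
}

for name, (speedup, procs) in runs.items():
    print(f"{name}: {parallel_efficiency(speedup, procs):.1%} efficiency")
```

Under that assumption the two runs correspond to roughly 67% and 79% efficiency, so the larger graph uses the additional processes more effectively.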

Book Chapter•10.1007/978-3-642-37658-0_7•
OmpSs-OpenCL Programming Model for Heterogeneous Systems

Vinoth Krishnan Elangovan1,2, Rosa M. Badia1,3, Eduard Ayguadé Parra1,2 •
Barcelona Supercomputing Center1, Polytechnic University of Catalonia2, Spanish National Research Council3
11 Sep 2012
TL;DR: This paper focuses on integrating OpenCL framework with the OmpSs task based programming model using Nanos run time infrastructure to address shortcomings of OpenCL, and would enable the programmer to skip cumbersome OpenCL constructs and write a sequential program with annotated pragmas.
Abstract: The advent of heterogeneous computing has forced programmers to use platform-specific programming paradigms in order to achieve maximum performance. This approach has a steep learning curve for programmers and also has a detrimental influence on productivity and code re-usability. To help with this situation, OpenCL, an open-source, parallel computing API for cross-platform computations, was conceived. OpenCL provides a homogeneous view of the computational resources (CPU and GPU), thereby enabling software portability across different platforms. Although OpenCL resolves software portability issues, the programming paradigm presents low programmability and additionally falls short in performance. In this paper we focus on integrating the OpenCL framework with the OmpSs task-based programming model using the Nanos run time infrastructure to address these shortcomings. This would enable the programmer to skip cumbersome OpenCL constructs including OpenCL platform creation, compilation, kernel building, kernel argument setting and memory transfers, and instead write a sequential program with annotated pragmas. Our proposal mainly focuses on how to exploit the best of the underlying hardware platform with greater ease in programming and to gain significant performance using the data parallelism offered by the OpenCL run time for GPUs and multicore architectures. We have evaluated the platform with important benchmarks and have noticed substantial ease in programming with comparable performance.

20 citations

Book Chapter•10.1007/978-3-642-37658-0_10•
A Study on the Impact of Compiler Optimizations on High-Level Synthesis

Jason Cong1, Bin Liu1, Raghu Prabhakar1, Peng Zhang1•
University of California1
11 Sep 2012
TL;DR: This paper explores the effects of both source-level and IR optimizations and phase ordering on high-level synthesis; it studies three commonly used source-level optimizations in isolation and then proposes simple yet effective heuristics for applying them to obtain a reasonable latency-area tradeoff.
Abstract: High-level synthesis is a design process that takes an untimed, behavioral description in a high-level language like C and produces register-transfer-level (RTL) code that implements the same behavior in hardware. In this design flow, the quality of the generated RTL is greatly influenced by the high-level description of the language. Hence it follows that both source-level and IR-level compiler optimizations could either improve or hurt the quality of the generated RTL. The problem of ordering compiler optimization passes, also known as the phase-ordering problem, has been an area of active research over the past decade. In this paper, we explore the effects of both source-level and IR optimizations and phase ordering on high-level synthesis. The parameters of the generated RTL are very sensitive to high-level optimizations. We study three commonly used source-level optimizations in isolation and then propose simple yet effective heuristics to apply them to obtain a reasonable latency-area tradeoff. We also study the phase-ordering problem for IR-level optimizations from a HLS perspective and compare it to a CPU-based setting. Our initial results show that an input-specific order can achieve a significant reduction in the latency of the generated RTL, and opens up this technology for future research.

19 citations

Book Chapter•10.1007/978-3-642-37658-0_17•
Beyond Do Loops: Data Transfer Generation with Convex Array Regions

Serge Guelton1, Mehdi Amini2, Béatrice Creusillet•
École nationale supérieure des télécommunications de Bretagne1, Mines ParisTech2
11 Sep 2012
TL;DR: The scope of this study ranges from inter-procedural analysis in simple loop nests with function calls, to inter-iteration data reuse optimization and arbitrary control flow in loop bodies.
Abstract: Automatic data transfer generation is a critical step for guided or automatic code generation for accelerators using distributed memories. Although good results have been achieved for loop nests, more complex control flows such as switches or while loops are generally not handled. This paper shows how to leverage the convex array regions abstraction to generate data transfers. The scope of this study ranges from inter-procedural analysis in simple loop nests with function calls, to inter-iteration data reuse optimization and arbitrary control flow in loop bodies. Generated transfers are approximated when an exact solution cannot be found. Array regions are also used to extend redundant load store elimination to array variables. The approach has been successfully applied to GPUs and domain-specific hardware accelerators.

18 citations

Book Chapter•10.1007/978-3-642-37658-0_12•
Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages

Dounia Khaldi1, Pierre Jouvelot1, Corinne Ancourt1, François Irigoin1•
Mines ParisTech1
11 Sep 2012
TL;DR: This study surveys six popular parallel language designs and suggests that, even though there are many keywords and notions introduced by these languages, they boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity.
Abstract: Efficiently programming parallel computers would ideally require a language that provides high-level programming constructs to avoid the programming errors frequent when expressing parallelism. Since task parallelism is considered more error-prone than data parallelism, we survey six popular parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP and OpenCL. Using the parallel computation of the Mandelbrot set as running example, this paper describes how the fundamentals of task parallel programming are dealt with in these languages. Our study suggests that, even though there are many keywords and notions introduced by these languages, they boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. These languages adopt one of three memory models: shared, message passing and Partitioned Global Address Space. The paper is designed to give users and language and compiler designers an up-to-date comparative overview of current parallel languages.

17 citations
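
The three task concepts named in the abstract above (creation, synchronization and atomicity) can be illustrated on the same Mandelbrot running example. The sketch below uses Python's standard thread pool rather than any of the six surveyed languages; the grid size, the iteration limit and the names `escapes_after` and `mandelbrot_rows` are assumptions made for illustration and are not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from threading import Lock

MAX_ITER = 50  # illustrative iteration limit


def escapes_after(c: complex, max_iter: int = MAX_ITER) -> int:
    """Iterations of z = z*z + c before |z| exceeds 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def mandelbrot_rows(width: int = 32, height: int = 16, workers: int = 4):
    image = [[0] * width for _ in range(height)]
    inside = 0          # shared counter of points that never escaped
    lock = Lock()

    def render_row(y: int) -> None:
        nonlocal inside
        count = 0
        for x in range(width):
            c = complex(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height)
            n = escapes_after(c)
            image[y][x] = n          # each task writes only its own row
            count += n == MAX_ITER
        with lock:                   # atomicity: protected update of shared state
            inside += count

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(render_row, y) for y in range(height)]  # creation
        wait(futures)                                                  # synchronization
        for f in futures:
            f.result()               # re-raise any exception from a task
    return image, inside
```

Calling `mandelbrot_rows()` returns the iteration-count grid and the number of points that stayed bounded; only the shared counter needs a lock because rows are disjoint.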

Book Chapter•10.1007/978-3-642-37658-0_16•
Compiler Automatic Discovery of OmpSs Task Dependencies

Sara Royuela1, Alejandro Duran1,2, Xavier Martorell1•
Barcelona Supercomputing Center1, Intel2
11 Sep 2012
TL;DR: An algorithm based on the discovery of code concurrent to a task and liveness analysis is developed that enables the compiler to automatically determine the dependencies of OmpSs tasks, thus releasing users from the task of manually defining these dependencies.
Abstract: Dependence analysis is an essential step for many compiler optimizations, from simple loop transformations to automatic parallelization. Parallel programming models require specific dependence analyses that take into account multi-threaded execution. Furthermore, asynchronous parallelism introduced by OpenMP tasks has promoted the development of new dependency analysis techniques. In these terms, the OmpSs parallel programming model extends OpenMP tasks with the definition of intertask dependencies. This extension allows run-time dependency detection, which potentially improves the performance when load balancing or locality rule the execution time. On the other hand, the extension requires the user to figure out data-sharing attributes and the type of access to each data item in all tasks in order to correctly specify the dependencies. We aim to enhance the programmability of OmpSs with a new methodology that enables the compiler to automatically determine the dependencies of OmpSs tasks, thus releasing users from the task of manually defining these dependencies. In this context, we have developed an algorithm based on the discovery of code concurrent to a task and liveness analysis. The algorithm first finds out all code concurrent with a given task. Then, it computes the data-sharing attributes of the variables appearing in the task. Finally, it analyzes the liveness properties of the task’s shared variables. With this information, the algorithm figures out the proper dependencies of the task. We have implemented this algorithm in the Mercurium source-to-source compiler. We have tested the results with several benchmarks, proving that the algorithm is able to correctly find a large number of dependency expressions.

12 citations

Book Chapter•10.1007/978-3-642-37658-0_3•
Compiler Optimizations: Machine Learning versus O3

Yuriy Kashnikov1, Jean Christophe Beyler2, William Jalby1•
Versailles Saint-Quentin-en-Yvelines University1, Intel2
11 Sep 2012
TL;DR: This paper extensively tests the other performance options available and concludes that, although old compiler versions could benefit from compiler flag combinations, modern compilers perform admirably at the commonly used -O3 level.
Abstract: Software engineers are highly dependent on compiler technology to create efficient programs. Optimal execution time is currently the most important criterion in the HPC field; to achieve this, the user applies the common compiler option -O3. This paper extensively tests the other performance options available and concludes that, although old compiler versions could benefit from compiler flag combinations, modern compilers perform admirably at the commonly used -O3 level.

10 citations

Book Chapter•10.1007/978-3-642-37658-0_1•
Just in Time Load Balancing

Rosario Cammarota1, Alexandru Nicolau1, Alexander V. Veidenbaum1•
University of California1
11 Sep 2012
TL;DR: The rapid increase in the number of on-chip cores, and in the ways such cores share on-chip resources such as the pipeline and memory hierarchy, increases the number of possible high-performance configurations, which makes attaining peak performance through the exploitation of LLP an increasingly complex problem.
Abstract: Leveraging Loop Level Parallelism (LLP) is one of the most attractive techniques for improving program performance on emerging multi-cores. Ordinary programs contain a large amount of parallel and DOALL loops; however, emerging multi-core designs feature a rapid increase in the number of on-chip cores and in the ways such cores share on-chip resources, such as the pipeline and memory hierarchy, which leads to an increase in the number of possible high-performance configurations. This trend in emerging multi-core design makes attaining peak performance through the exploitation of LLP an increasingly complex problem.

5 citations

Book Chapter•10.1007/978-3-642-37658-0_18•
Finish Accumulators: An Efficient Reduction Construct for Dynamic Task Parallelism

Jun Shirako1, Vincent Cavé1, Jisheng Zhao1, Vivek Sarkar1•
Rice University1
11 Sep 2012
TL;DR: Experimental results demonstrate that the Java-based implementation of finish accumulators delivers comparable or better performance for computing reductions relative to Java’s atomic variables and concurrent collections.
Abstract: Parallel reductions represent a common pattern for computing the aggregation of an associative and commutative operation, such as summation, across multiple pieces of data supplied by parallel tasks. In this poster, we introduce finish accumulators, a unified construct that supports predefined and user-defined parallel reductions for dynamic task parallelism. Finish accumulators are designed to be integrated into structured task parallelism constructs, such as the async and finish constructs found in the X10 and Habanero-Java (HJ) languages, so as to guarantee determinism for accumulation and to avoid any possible race conditions in referring to intermediate results. In contrast to lower-level reduction constructs such as atomic variables, the high-level semantics of finish accumulators allows for a wide range of implementations with different accumulation policies, e.g., eager-computation vs. lazy-computation. The best implementation can thus be selected based on a given application and target platform. We have integrated finish accumulators into the Habanero-Java task parallel language, and used them for research and teaching. In addition to their higher-level semantics, experimental results demonstrate that our Java-based implementation of finish accumulators delivers comparable or better performance for computing reductions relative to Java’s atomic variables and concurrent collections.

4 citations
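
The idea described in the abstract above, a reduction whose result can only be read once the enclosing finish scope has joined all of its tasks, can be sketched as follows. This is not the Habanero-Java API: the classes `FinishAccumulator` and `Finish`, the method `spawn`, and the choice to raise an error on an early `get()` are all assumptions made for illustration, and only the eager accumulation policy is shown.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class FinishAccumulator:
    """Reduction cell: tasks put() values; get() is allowed only outside a finish scope."""

    def __init__(self, op, identity):
        self._op = op            # assumed associative and commutative
        self._value = identity
        self._lock = Lock()
        self._open = False       # True while an owning finish scope is active

    def put(self, x):
        # Eager policy: combine each contribution immediately under a lock.
        with self._lock:
            self._value = self._op(self._value, x)

    def get(self):
        with self._lock:
            if self._open:
                raise RuntimeError("result is only available after the finish scope ends")
            return self._value


class Finish:
    """Hypothetical finish scope: joins every spawned task before the accumulator is readable."""

    def __init__(self, acc, workers=4):
        self._acc = acc
        self._workers = workers

    def __enter__(self):
        self._pool = ThreadPoolExecutor(max_workers=self._workers)
        self._futures = []
        with self._acc._lock:
            self._acc._open = True
        return self

    def spawn(self, fn, *args):
        self._futures.append(self._pool.submit(fn, *args))

    def __exit__(self, exc_type, exc, tb):
        self._pool.shutdown(wait=True)   # wait for all spawned tasks
        with self._acc._lock:
            self._acc._open = False
        for f in self._futures:
            f.result()                   # re-raise any exception from a task
        return False


acc = FinishAccumulator(operator.add, 0)
with Finish(acc) as scope:
    for i in range(1, 101):
        scope.spawn(acc.put, i * i)
print(acc.get())  # 338350, the sum of squares 1..100
```

Because the result is hidden until the scope closes, no task can observe an intermediate value, which is the source of the determinism the abstract mentions. A lazy policy could instead buffer contributions per task and combine them when the scope exits, without changing the user-visible behaviour.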

Book Chapter•10.1007/978-3-642-37658-0_19•
FlashbackSTM: Improving STM Performance by Remembering the Past

Hugo Rito1, João Cachopo1•
Instituto Superior Técnico1
11 Sep 2012
TL;DR: Software Transactional Memory (STM) is one promising abstraction to simplify the task of building highly parallel applications, because when using an STM programmers may ignore low-level synchronization details and simply specify which operations must execute atomically inside transactions.
Abstract: As multicore machines become pervasive, an ever growing number of programmers face the challenge of building highly parallel applications that take full advantage of modern parallel hardware architectures. Software Transactional Memory (STM) [3] is one promising abstraction to simplify this task because when using an STM programmers may ignore low-level synchronization details and simply specify which operations must execute atomically inside transactions. It is then the STM’s responsibility to preserve the program’s semantics, while maintaining as much parallelism and concurrency as possible.

3 citations

Book Chapter•10.1007/978-3-642-37658-0_14•
A Software-Based Method-Level Speculation Framework for the Java Platform

Ivo Anjo1, João Cachopo1•
Technical University of Lisbon1
11 Sep 2012
TL;DR: With multicore processors becoming ubiquitous, this work intends to tackle the issue of parallelizing existing sequential applications.
Abstract: With multicore processors becoming ubiquitous on computing devices, the need for both parallelizing existing sequential applications and designing new parallel applications is greatly intensified. With our work, we intend to tackle the former issue.
Book Chapter•10.1007/978-3-642-37658-0_9•
UCIFF: Unified Cluster Assignment Instruction Scheduling and Fast Frequency Selection for Heterogeneous Clustered VLIW Cores

Vasileios Porpodas1, Marcelo Cintra1•
University of Edinburgh1
11 Sep 2012
TL;DR: Heterogeneous clustered VLIW processors support dynamic voltage and frequency scaling (DVFS) independently per cluster; effectively controlling DVFS to selectively decrease the frequency of clusters with a lot of slack in their schedule can lead to significant energy savings.
Abstract: Clustered VLIW processors are scalable wide-issue statically scheduled processors. Their design is based on physically partitioning the otherwise shared hardware resources, a design which leads to both high performance and low energy consumption. In traditional clustered VLIW processors, all clusters operate at the same frequency. Heterogeneous clustered VLIW processors, however, support dynamic voltage and frequency scaling (DVFS) independently per cluster. Effectively controlling DVFS to selectively decrease the frequency of clusters with a lot of slack in their schedule can lead to significant energy savings.
Book Chapter•10.1007/978-3-642-37658-0_13•
A Fast Parallel Graph Partitioner for Shared-Memory Inspector/Executor Strategies

Christopher D. Krieger1, Michelle Mills Strout1•
Colorado State University1
11 Sep 2012
TL;DR: This paper presents a shared memory parallel graph partitioner, ParCubed, for use in the context of sparse tiling run-time data and computation reordering and compares the presented hierarchical clustering partitioner with GPart and METIS in terms of partitioning speed, partitioning quality, and the effect the generated seed partitions have on executor speed.
Abstract: Graph partitioners play an important role in many parallel work distribution and locality optimization approaches. Surprisingly, however, to our knowledge there is no freely available parallel graph partitioner designed for execution on a shared memory multicore system. This paper presents a shared memory parallel graph partitioner, ParCubed, for use in the context of sparse tiling run-time data and computation reordering. Sparse tiling is a run-time scheduling technique that schedules groups of iterations across loops together when they access the same data and one or more of the loops contains indirect array accesses. For sparse tiling, which is implemented with an inspector/executor strategy, the inspector needs to find an initial seed partitioning of adequate quality very quickly. We compare our presented hierarchical clustering partitioner, ParCubed, with GPart and METIS in terms of partitioning speed, partitioning quality, and the effect the generated seed partitions have on executor speed. We find that the presented partitioner is 25 to 100 times faster than METIS on a 16 core machine. The total edge cut of the partitioning generated by ParCubed was found not to exceed 1.27x that of the partitioning found by METIS.
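
The partitioning-quality metric in the abstract above, total edge cut, counts the edges whose endpoints land in different parts. The sketch below computes it for two hand-written partitions of a toy graph and reports their ratio, in the spirit of the "1.27x" comparison; the graph, the partitions and the function name `edge_cut` are made up for illustration and have no connection to ParCubed, GPart or METIS output.

```python
def edge_cut(edges, part):
    """Number of edges whose two endpoints are assigned to different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Two hypothetical 2-way partitions, given as vertex -> part id.
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # splits along the bridge edge
poor = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}   # breaks up both triangles

ratio = edge_cut(edges, poor) / edge_cut(edges, good)
print(edge_cut(edges, good), edge_cut(edges, poor), ratio)  # 1 5 5.0
```

A ratio close to 1 means the faster partitioner gives up little quality relative to the reference partitioner.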
Book Chapter•10.1007/978-3-642-37658-0_15•
Ant: A Debugging Framework for MPI Parallel Programs

Jae-Woo Lee1, Leonardo R. Bachega1, Samuel P. Midkiff1, Yu Charlie Hu1•
Purdue University1
11 Sep 2012
TL;DR: Ant’s instrumentation strategy reduces the overhead of monitoring by over 14 times with less impact on accuracy than a scheme that simply distributes monitoring over all processes executing the program.
Abstract: This paper describes Ant, a debugging framework targeting MPI parallel programs. The Ant framework statically analyzes programs, marking code regions as being executed by all processes or executed by only some of the processes. The analyzed program is then instrumented with calls to an invariant violation monitoring and detection library. The analysis allows regions to be instrumented based on whether all, or less than all, processes execute the region. Ant’s instrumentation strategy allows sampled monitoring across processes in regions executed by all processes. We present a case study using Ant with C-DIDUCE (a variant of DIDUCE for C) to find violations of value invariants in parallel C/MPI programs. Ant’s instrumentation strategy reduces the overhead of monitoring by over 14 times with less impact on accuracy than a scheme that simply distributes monitoring over all processes executing the program.
Book Chapter•10.1007/978-3-642-37658-0_21•
Language and Architecture Independent Software Thread-Level Speculation

Zhen Cao1, Clark Verbrugge1•
McGill University1
11 Sep 2012
TL;DR: Pure software designs to TLS have relatively recently become of interest, trading increased overhead concerns for the potential of providing new and user-friendly approaches to extracting parallelism, and making use of commodity multiprocessors without the need for new hardware.
Abstract: Thread-level speculation (TLS) has historically been investigated in the context of novel hardware designs (Chen and Olukotun, 2003; Steffan et al., 2005; Quinones et al., 2005). Pure software designs to TLS, however, have relatively recently become of interest, trading increased overhead concerns for the potential of providing new and user-friendly approaches to extracting parallelism, and making use of commodity multiprocessors without the need for new hardware (Pickett and Verbrugge, 2005; Oancea and Mycroft, 2008). Investigation of such approaches, however, tends to be hampered by the need for such systems to build on specific language or execution contexts with implicit source-level requirements, and lack of integration with a realistic compiler infrastructure.
Book Chapter•10.1007/978-3-642-37658-0_20•
Kaira: Generating Parallel Libraries and Their Usage with Octave

Stanislav Böhm1, Marek Běhálek1, Ondřej Meca1•
Technical University of Ostrava1
11 Sep 2012
TL;DR: The main development goal is to create a practically usable general-purpose high-level visual programming tool for the area of High Performance Computing (HPC), especially for distributed memory systems.
Abstract: We are developing a tool, Kaira [1,2]. Our main development goal is to create a practically usable general-purpose high-level visual programming tool for the area of High Performance Computing (HPC), especially for distributed memory systems. We feel that there is a space for this research. Tools used by practitioners in this area are usually low-level ones (like Message Passing Interface – MPI) or domain-specific tools.
Book Chapter•10.1007/978-3-642-37658-0_22•
Abstractions for Defining Semi-Regular Grids Orthogonally from Stencils

Andrew Stone1, Michelle Mills Strout1•
Colorado State University1
11 Sep 2012
TL;DR: This poster introduces abstractions, implemented in the GridLib library, that enable a separation of grid, algorithm, and parallelization for semi-regular grids, where sub-domains of the grid are regular (e.g., can be stored in an array) but boundaries between sub-domains connect in an irregular fashion.
Abstract: In various applications including atmospheric and ocean simulation programs, stencil computations occur on grids where sub-domains of the grid are regular (e.g., can be stored in an array) but boundaries between sub-domains connect in an irregular fashion. We call this class of grids semi-regular. Implementations of stencils on semi-regular grids often have grid-structure details tangled with the stencil computation code. This tangling of details requires programmers to have full knowledge of the current grid structure to make changes to the stencil computations and makes changing the grid structure extremely expensive. Existing libraries and tools [1-7] for stencil computations have not focused on this class of grid, focusing instead on purely regular or irregular grids. In this poster we introduce abstractions for the class of semi-regular grids and describe the GridLib library where we have implemented these abstractions. These abstractions enable a separation of grid, algorithm, and parallelization for semi-regular grids.
Book Chapter•10.1007/978-3-642-37658-0_2•
AlphaZ: A System for Design Space Exploration in the Polyhedral Model

Tomofumi Yuki1, Gautam Gupta, DaeGon Kim, Tanveer Pathan, Sanjay Rajopadhye1 •
Colorado State University1
11 Sep 2012
TL;DR: The polyhedral model is now a well established and effective formalism for program optimization and parallelization. However, finding optimal transformations is a long-standing open problem, so tools that allow practitioners to explore different choices through script-driven or user-guided transformations are needed.
Abstract: The polyhedral model is now a well established and effective formalism for program optimization and parallelization. However, finding optimal transformations is a long-standing open problem. It is therefore important to develop tools that, rather than following predefined optimization criteria, allow practitioners to explore different choices through script-driven or user-guided transformations. Such flexibility is even more important for compiler researchers and auto-tuner developers than for practitioners. In addition, tools must also raise the level of abstraction by representing and manipulating reductions and scans explicitly. Third, the tools must also be able to explore transformation choices that consider memory (re)-allocation.
Book Chapter•10.1007/978-3-642-37658-0_11•
FlowPools: A Lock-Free Deterministic Concurrent Dataflow Abstraction

Aleksandar Prokopec1, Heather Miller1, Tobias Schlatter1, Philipp Haller, Martin Odersky1 •
École Polytechnique Fédérale de Lausanne1
11 Sep 2012
TL;DR: This paper presents the design and implementation of a fundamental data structure for composable deterministic parallel dataflow computation through the use of functional programming abstractions, and provides a correctness proof, showing that the implementation is linearizable, lock-free, and deterministic.
Abstract: Implementing correct and deterministic parallel programs is challenging. Even though concurrency constructs exist in popular programming languages to facilitate the task of deterministic parallel programming, they are often too low level, or do not compose well due to underlying blocking mechanisms. In this paper, we present the design and implementation of a fundamental data structure for composable deterministic parallel dataflow computation through the use of functional programming abstractions. Additionally, we provide a correctness proof, showing that the implementation is linearizable, lock-free, and deterministic. Finally, we show experimental results which compare our FlowPool against corresponding operations on other concurrent data structures, and show that in addition to offering new capabilities, FlowPools reduce insertion time by 49-54% on a 4-core i7 machine with respect to comparable concurrent queue data structures in the Java standard library.
Book Chapter•10.1007/978-3-642-37658-0_4•
The stapl Parallel Graph Library

Harshvardhan1, Adam Fidel1, Nancy M. Amato1, Lawrence Rauchwerger1•
Texas A&M University1
11 Sep 2012
TL;DR: The library introduces pGraph pViews that separate algorithm design from the container implementation, and supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them.
Abstract: This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges.
Book Chapter•10.1007/978-3-642-37658-0_8•
Compiler Optimizations for Industrial Unstructured Mesh CFD Applications on GPUs

Carlo Bertolli1, Adam Betts1, Nicolas Loriant1, Gihan R. Mudalige2, David Radford3, David A. Ham1, Michael B. Giles2, Paul H. J. Kelly1 •
Imperial College London1, University of Oxford2, Rolls-Royce Holdings3
11 Sep 2012
TL;DR: This paper describes three techniques for GPU optimization of unstructured mesh applications: a technique able to split a highly complex loop into simpler loops, a kernel-specific alternative code synthesis, and configuration parameter tuning. Applying them systematically to HYDRA improves the GPU performance relative to the multicore CPU.
Abstract: Graphical Processing Units (GPUs) have shown acceleration factors over multicores for structured mesh-based Computational Fluid Dynamics (CFD). However, the value remains unclear for dynamic and irregular applications. Our motivating example is HYDRA, an unstructured mesh application used in production at Rolls-Royce for the simulation of turbomachinery components of jet engines. We describe three techniques for GPU optimization of unstructured mesh applications: a technique able to split a highly complex loop into simpler loops, a kernel specific alternative code synthesis, and configuration parameter tuning. Using these optimizations systematically on HYDRA improves the GPU performance relative to the multicore CPU. We show how these optimizations can be automated in a compiler, through user annotations. Performance analysis of a large number of complex loops enables us to study the relationship between optimizations and resource requirements of loops, in terms of registers and shared memory, which directly affect the loop performance.
Book Chapter•10.1007/978-3-642-37658-0_5•
Set and Relation Manipulation for the Sparse Polyhedral Framework

Michelle Mills Strout1, Geri Georg1, Catherine Olschanowsky1•
Colorado State University1
11 Sep 2012
TL;DR: Algorithms for manipulating sets and relations with uninterpreted function symbols to enable the Sparse Polyhedral Framework are presented and implemented.
Abstract: The Sparse Polyhedral Framework (SPF) extends the Polyhedral Model by using the uninterpreted function call abstraction for the compile-time specification of run-time reordering transformations such as loop and data reordering and sparse tiling approaches that schedule irregular sets of iterations across loops. The Polyhedral Model represents sets of iteration points in imperfectly nested loops with unions of polyhedra and represents loop transformations with affine functions applied to such polyhedral sets. Existing tools such as ISL, Cloog, and Omega manipulate polyhedral sets and affine functions; however, the ability to represent the sets and functions where some of the constraints include uninterpreted function calls, such as those needed in the SPF, is non-existent or severely restricted. This paper presents algorithms for manipulating sets and relations with uninterpreted function symbols to enable the Sparse Polyhedral Framework. The algorithms have been implemented in an open source C++ library called IEGenLib (The Inspector/Executor Generator Library).
