Top 29 papers presented at Languages and Compilers for Parallel Computing in 2000

Improving Locality for Adaptive Irregular Scientific Codes

Hwansoo Han¹, Chau-Wen Tseng¹•Institutions (1)

10 Aug 2000

TL;DR: A cost model is developed that can be used to calculate an efficient optimization frequency; it may be applied dynamically by instrumenting the program to measure execution time per time-step iteration. The paper also shows that locality optimization may be used to improve performance even for adaptive codes.

Abstract: Irregular scientific codes experience poor cache performance due to their memory access patterns. In this paper, we examine two issues for locality optimizations for irregular computations. First, we experimentally find locality optimization can improve performance for parallel codes, but is dependent on the parallelization techniques used. Second, we show locality optimization may be used to improve performance even for adaptive codes. We develop a cost model which can be employed to calculate an efficient optimization frequency; it may be applied dynamically instrumenting the program to measure execution time per time-step iteration. Our results are validated through experiments on three representative irregular scientific codes.
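
The abstract does not state the cost model itself. As a rough illustration only, the sketch below assumes a hypothetical model: a locality reordering costs a fixed `reorder_cost`, and each time step after a reordering takes `degradation` longer than the previous one as the access pattern drifts. Under that assumption the average cost per step for an interval of n steps is reorder_cost/n + base_time + degradation*(n-1)/2, which is minimized near n = sqrt(2*reorder_cost/degradation). All function and parameter names are invented for this example and are not taken from the paper.

```python
import math


def average_step_cost(n, reorder_cost, base_time, degradation):
    """Average time per step when reordering every n steps (assumed linear-drift model)."""
    # Step i after a reordering (i = 0..n-1) is assumed to take base_time + degradation * i.
    compute = n * base_time + degradation * n * (n - 1) / 2
    return (reorder_cost + compute) / n


def best_interval(reorder_cost, base_time, degradation):
    """Integer reordering interval that minimizes the average step cost."""
    if degradation <= 0:
        raise ValueError("the model needs a positive per-step degradation")
    n_real = math.sqrt(2 * reorder_cost / degradation)  # continuous optimum
    # The average cost is convex in n, so only the two neighbouring integers matter.
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    return min(
        candidates,
        key=lambda n: average_step_cost(n, reorder_cost, base_time, degradation),
    )
```

In a running program, `base_time` and `degradation` could be estimated from measured per-time-step execution times, which is the kind of instrumentation the abstract mentions.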

46 citations

Book Chapter•10.1007/3-540-45574-4_11•

Improving Offset Assignment for Embedded Processors

Sunil Atri¹, J. Ramanujam¹, Mahmut Kandemir²•Institutions (2)

Louisiana State University¹, Pennsylvania State University²

10 Aug 2000

TL;DR: This paper presents new code optimization techniques for embedded fixed point DSP processors which have limited on-chip program ROM and include indirect addressing modes using post-increment and decrement operations, and presents a heuristic to reduce code size by taking advantage of these addressing modes.

Abstract: Embedded systems consisting of the application program ROM, RAM, the embedded processor core, and any custom hardware on a single wafer are becoming increasingly common in application domains such as signal processing. Given the rapid deployment of these systems, programming on such systems has shifted from assembly language to high-level languages such as C, C++, and Java. The processors used in such systems are usually targeted toward specific application domains, e.g., digital signal processing (DSP). As a result, these embedded processors include application-specific instruction sets, complex and irregular data paths, etc., thereby rendering code generation for these processors difficult. In this paper, we present new code optimization techniques for embedded fixed point DSP processors which have limited on-chip program ROM and include indirect addressing modes using post-increment and decrement operations. We present a heuristic to reduce code size by taking advantage of these addressing modes. Our solution aims at improving the offset assignment produced by Liao et al.'s solution. It finds a layout of variables in RAM, so that it is possible to subsume explicit address register manipulation instructions into other instructions as a post-increment or post-decrement operation. Experimental results show the effectiveness of our solution.
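
To make the offset assignment objective concrete, here is a minimal sketch that scores a memory layout by counting the accesses that would need an explicit address-register update. It assumes a single address register, treats a move of one location up or down as free (post-increment or post-decrement), and ignores the initial register load. The brute-force search is only for tiny examples; it is not the heuristic of this paper or of Liao et al., and the function names are invented for illustration.

```python
from itertools import permutations


def address_update_cost(layout, accesses):
    """Count accesses that need an explicit address-register update.

    layout   -- variables in the order they are placed in memory
    accesses -- sequence of variable names in access order
    """
    position = {var: i for i, var in enumerate(layout)}
    cost = 0
    for prev, cur in zip(accesses, accesses[1:]):
        # A step of -1, 0, or +1 is assumed to be covered by post-decrement,
        # no change, or post-increment on the previous access.
        if abs(position[cur] - position[prev]) > 1:
            cost += 1
    return cost


def best_layout(accesses):
    """Exhaustively try every layout; feasible only for a handful of variables."""
    variables = sorted(set(accesses))
    return min(
        permutations(variables),
        key=lambda layout: address_update_cost(layout, accesses),
    )
```

For the access sequence `a b c d a c`, the layout `a b c d` leaves two explicit updates (for `d` to `a` and `a` to `c`), while `a c b d` leaves three.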

37 citations

Book Chapter•10.1007/3-540-45574-4_15•

Optimizing the Use of High Performance Software Libraries

Samuel Z. Guyer¹, Calvin Lin¹•Institutions (1)

University of Texas at Austin¹

10 Aug 2000

TL;DR: This paper describes how the use of software libraries, which is prevalent in high performance computing, can benefit from compiler optimizations in much the same way that conventional programming languages do.

Abstract: This paper describes how the use of software libraries, which is prevalent in high performance computing, can benefit from compiler optimizations in much the same way that conventional programming languages do. We explain how the compilation of these informal languages differs from the compilation of more conventional languages. In particular, such compilation requires precise pointer analysis, domain-specific information about the library's semantics, and a configurable compilation scheme. We describe a solution that combines dataflow analysis and pattern matching to perform configurable optimizations.

31 citations

Book Chapter•10.1007/3-540-45574-4_4•

An Empirical Study of Selective Optimization

Matthew Arnold¹,², Michael Hind², Barbara G. Ryder¹•Institutions (2)

Rutgers University¹, IBM²

10 Aug 2000

TL;DR: The results show that selective optimization can offer substantial improvement over an optimize-all-methods strategy for short-running applications, and for longer-running applications there is a significant range of methods that can be selectively optimized to achieve close to optimal performance.

Abstract: This paper describes an empirical study of selective optimization using the Jalapeno Java virtual machine. The goal of the study is to provide insight into the design and implementation of an adaptive system by investigating the performance potential of selective optimization and identifying the classes of applications for which this performance can be expected. Two types of offline profiling information are used to guide selective optimization, and several strategies for selecting the methods to optimize are compared. The results show that selective optimization can offer substantial improvement over an optimize-all-methods strategy for short-running applications, and for longer-running applications there is a significant range of methods that can be selectively optimized to achieve close to optimal performance. The results also show that a coarse-grained sampling system can provide enough accuracy to successfully guide selective optimization.

30 citations

Book Chapter•10.1007/3-540-45574-4_21•

OpenMP Extensions for Thread Groups and Their Run-Time Support

Marc Gonzalez, José Luis Hervás Oliver, Xavier Martorell, Eduard Ayguadé, Jesús Labarta, Nacho Navarro

10 Aug 2000

TL;DR: A set of proposals for the OpenMP shared-memory programming model oriented towards the definition of thread groups in the framework of nested parallelism and the additional functionalities required in the runtime library supporting the parallel execution are presented.

Abstract: This paper presents a set of proposals for the OpenMP shared-memory programming model oriented towards the definition of thread groups in the framework of nested parallelism. The paper also describes the additional functionalities required in the runtime library supporting the parallel execution. The extensions have been implemented in the OpenMP NanosCompiler and evaluated in a set of real applications and benchmarks. In this paper we present experimental results for one of these applications.

29 citations

Book Chapter•10.1007/3-540-45574-4_14•

Compiler Synthesis of Task Graphs for Parallel Program Performance Prediction

Vikram Adve¹,², Rizos Sakellariou³•Institutions (3)

National Center for Supercomputing Applications¹, University of Illinois at Urbana–Champaign², University of Manchester³

10 Aug 2000

TL;DR: This paper focuses on the use of task graphs in parallel programming systems, which have used them as a programming notation for expressing parallelism, as an internal representation in the compiler for computation partitioning and communication generation, and as a runtime representation for scheduling and execution of parallel programs.

Abstract: Task graphs and their equivalents have proved to be a valuable abstraction for representing the execution of parallel programs in a number of different applications. Perhaps the most widespread use of task graphs has been for performance modeling of parallel programs, including quantitative analytical models [3],[19],[25],[26],[27], theoretical and abstract analytical models [14], and program simulation [5],[13]. A second important use of task graphs is in parallel programming systems. Parallel programming environments such as PYRROS [28], CODE [24], HENCE [24], and Jade [20] have used task graphs at three different levels: as a programming notation for expressing parallelism, as an internal representation in the compiler for computation partitioning and communication generation, and as a runtime representation for scheduling and execution of parallel programs. Although the task graphs used in these systems differ in representation and semantics (e.g., whether task graph edges capture purely precedence constraints or also dataflow requirements), there are close similarities. Perhaps most importantly, they all capture the parallel structure of a program separately from the sequential computations, by breaking down the program into computational “tasks”, precedence relations between tasks, and (in some cases) explicit communication or synchronization operations between tasks.

22 citations

Book Chapter•10.1007/3-540-45574-4_8•

Searching for the Best FFT Formulas with the SPL Compiler

Jeremy Johnson¹, Robert W. Johnson, David Padua², Jianxin Xiong²•Institutions (2)

Drexel University¹, University of Illinois at Urbana–Champaign²

10 Aug 2000

TL;DR: This paper applies an approach to implementing and optimizing fast signal transforms, based on a domain-specific computer language called SPL, to the implementation of the FFT.

Abstract: This paper discusses an approach to implementing and optimizing fast signal transforms based on a domain-specific computer language, called SPL. SPL programs, which are essentially mathematical formulas, represent matrix factorizations, which provide fast algorithms for computing many important signal transforms. A special purpose compiler translates SPL programs into efficient FORTRAN programs. Since there are many formulas for a given transform, a fast implementation can be obtained by generating alternative formulas and searching for the one with the fastest execution time. This paper presents an application of this methodology to the implementation of the FFT.
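
The generate-and-search idea can be sketched in a few lines. The example below is not SPL and does not generate FORTRAN; it assumes a toy representation in which a "formula" is a nested Cooley-Tukey factorization plan, either `None` for a direct DFT or `(r, plan_r, plan_s)` for a split n = r * s. It enumerates every plan for a given size, executes each one, and keeps the fastest. The plan encoding and the function names are hypothetical.

```python
import cmath
import timeit


def dft(x):
    """Direct O(n^2) DFT, used at the leaves of a plan."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def plans(n):
    """All plans for size n: None (direct DFT) or (r, plan_r, plan_s) with n = r * s."""
    result = [None]
    for r in range(2, n):
        if n % r == 0:
            result += [(r, pr, ps) for pr in plans(r) for ps in plans(n // r)]
    return result


def fft(x, plan):
    """Evaluate the DFT of x by following a factorization plan."""
    if plan is None:
        return dft(x)
    n = len(x)
    r, plan_r, plan_s = plan
    s = n // r
    # r transforms of size s over the strided subsequences x[j1], x[j1 + r], ...
    cols = [fft(x[j1::r], plan_s) for j1 in range(r)]
    out = [0j] * n
    for k2 in range(s):
        # Twiddle factors, then one transform of size r across the columns.
        z = [cols[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n) for j1 in range(r)]
        zz = fft(z, plan_r)
        for k1 in range(r):
            out[k1 * s + k2] = zz[k1]
    return out


def fastest_plan(n, repeats=3):
    """Time every plan on a fixed input and return the one with the best time."""
    x = [complex(i, -i) for i in range(n)]
    return min(
        plans(n),
        key=lambda p: min(timeit.repeat(lambda: fft(x, p), number=1, repeat=repeats)),
    )
```

For n = 8 this enumerates five plans. The number of plans grows quickly with n, so a practical search would need pruning or dynamic programming rather than exhaustive timing.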

19 citations

Book Chapter•10.1007/3-540-45574-4_16•

Compiler Techniques for Flat Neighborhood Networks

Henry G. Dietz¹, Timothy I. Mattox¹•Institutions (1)

University of Kentucky¹

10 Aug 2000

TL;DR: This paper centers on the use of a set of genetic search algorithms to compile the network wiring pattern, basic routing tables, and code for specific communication patterns that will use an optimized schedule rather than simply applying the basic routing.

Abstract: A Flat Neighborhood Network (FNN) is a new interconnection network architecture that can provide very low latency and high bisection bandwidth at a minimal cost for large clusters. However, unlike more traditional designs, FNNs generally are not symmetric. Thus, although an FNN by definition offers a certain base level of performance for random communication patterns, both the network design and communication (routing) schedules can be optimized to make specific communication patterns achieve significantly more than the basic performance. The primary mechanism for design of both the network and communication schedules is a set of genetic search algorithms (GAs) that derive good designs from specifications of particular communication patterns. This paper centers on the use of these GAs to compile the network wiring pattern, basic routing tables, and code for specific communication patterns that will use an optimized schedule rather than simply applying the basic routing.

18 citations

Book Chapter•10.1007/3-540-45574-4_10•

Experimental Evaluation of Energy Behavior of Iteration Space Tiling

Mahmut Kandemir¹, Narayanan Vijaykrishnan¹, Mary Jane Irwin¹, H. S. Kim¹•Institutions (1)

Pennsylvania State University¹

10 Aug 2000

TL;DR: The results show that the choice of tile size and input size critically impacts the system energy consumption and reveal that tiling should be applied more or less aggressively based on whether the low power objective is to prolong the battery life or to limit the energy dissipated within a package.

Abstract: Optimizing compilers have traditionally focused on enhancing the performance of a given piece of code. With the proliferation of embedded software, it is becoming important to identify the energy impact of these traditional performance-oriented optimizations and to develop new energy-aware schemes. Towards this goal, this paper explores the energy consumption behavior of one of the widely-used loop-level compiler optimizations, iteration space tiling, by varying a set of software and hardware parameters. Our results show that the choice of tile size and input size critically impacts the system energy consumption. Specifically, we find that the best tile size for the least energy consumed is different from that for the best performance. Also, tailoring tile size to the input size generates better energy results than working with a fixed tile size. Our results also reveal that tiling should be applied more or less aggressively based on whether the low power objective is to prolong the battery life or to limit the energy dissipated within a package.
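
For readers unfamiliar with the transformation, the following sketch shows what iteration space tiling does to a loop nest, using matrix multiplication as an assumed example kernel. The `tile` argument is the tile size whose choice the abstract discusses. The code only illustrates the loop structure and measures neither energy nor performance; the function name is invented for this example.

```python
def matmul_tiled(a, b, tile):
    """Multiply matrices a (n x m) and b (m x p) with a tiled loop nest."""
    if tile < 1:
        raise ValueError("tile size must be at least 1")
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops step over tiles; inner loops stay inside one tile, so the
    # same small blocks of a, b, and c are reused before moving on.
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

The result is the same for every tile size; only the order in which iterations run changes, which is what alters cache behavior.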

13 citations

Book Chapter•10.1007/3-540-45574-4_7•

Extending Scalar Optimizations for Arrays

David Wonnacott¹•Institutions (1)

Haverford College¹

10 Aug 2000

TL;DR: It is shown that the value-based dependence relations produced by the Omega Test can be used as a basis for generalizations of several scalar optimizations.

Abstract: Traditional techniques for array analysis do not provide dataflow information, and thus traditional dataflow-based scalar optimizations have not been applied to array elements. A number of techniques have recently been developed for producing information about array dataflow, raising the possibility that dataflow-based optimizations could be applied to array elements. In this paper, we show that the value-based dependence relations produced by the Omega Test can be used as a basis for generalizations of several scalar optimizations.

12 citations

Proceedings Article•

MaJIC: A Matlab Just-In-time Compiler

George Almási, David Padua

10 Aug 2000

TL;DR: This paper describes the experience with MaJIC, a just-in-time compiler for MATLAB, and indicates large speedups when compared to the interpreter, and reasonable performance when compared to static compilers.

Abstract: This paper describes our experience with MaJIC, a just-in-time compiler for MATLAB. In the recent past, several compiler projects claimed large performance improvements when processing MATLAB code. Most of these projects are static compilers suited for batch processing; MaJIC is a just-in-time compiler. The compilation process is transparent to the user. This impacts the modus operandi of the compiler, resulting in a few interesting analysis techniques. Our experiments with MaJIC indicate large speedups when compared to the interpreter, and reasonable performance when compared to static compilers.

Book Chapter•10.1007/3-540-45574-4_1•

Accurate Shape Analysis for Recursive Data Structures

Francisco Corbera¹, Rafael Asenjo¹, Emilio L. Zapata¹•Institutions (1)

University of Málaga¹

10 Aug 2000

TL;DR: This paper describes the framework and the compiler implemented to capture complex data structures generated, traversed, and modified in C codes. The method assigns a Reduced Set of Reference Shape Graphs (RSRSG) to each sentence to approximate the shape of the data structure after that sentence executes.

Abstract: Automatic parallelization of codes which use dynamic data structures is still a challenge. One of the first steps in such parallelization is the automatic detection of the dynamic data structure used in the code. In this paper we describe the framework and the compiler we have implemented to capture complex data structures generated, traversed, and modified in C codes. Our method assigns a Reduced Set of Reference Shape Graphs (RSRSG) to each sentence to approximate the shape of the data structure after the execution of such a sentence. With the properties and operations that define the behavior of our RSRSG, the method can accurately detect complex recursive data structures such as a doubly linked list of pointers to trees where the leaves point to additional lists. Other experiments are carried out with real codes to validate the capabilities of our compiler.

Proceedings Article•

Improving variable placement for embedded processors

Sunil Atri, J. Ramanujam, Mahmut Kandemir

1 Jan 2000

Book Chapter•10.1007/3-540-45574-4_22•

Compiling Data Intensive Applications with Spatial Coordinates

Renato Ferreira¹, Gagan Agrawal², Ruoming Jin², Joel H. Saltz¹•Institutions (2)

University of Maryland, College Park¹, University of Delaware²

10 Aug 2000

TL;DR: Presents a general compilation and execution strategy that achieves high locality in disk accesses for data intensive applications whose data elements have spatial coordinates, and a technique for hoisting conditionals that further improves the execution efficiency of the compiled code.

Abstract: Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We are developing a compiler which processes data intensive applications written in a dialect of Java and compiles them for efficient execution on clusters of workstations or distributed memory machines. In this paper, we focus on data intensive applications with two important properties: 1) data elements have spatial coordinates associated with them and the distribution of the data is not regular with respect to these coordinates, and 2) the application processes only a subset of the available data on the basis of spatial coordinates. These applications arise in many domains like satellite data-processing and medical imaging. We present a general compilation and execution strategy for this class of applications which achieves high locality in disk accesses. We then present a technique for hoisting conditionals which further improves efficiency in execution of such compiled codes. Our preliminary experimental results show that the performance from our proposed execution strategy is nearly two orders of magnitude better than a naive strategy. Further, up to 30% improvement in performance is observed by applying the technique for hoisting conditionals.

Book Chapter•10.1007/3-540-45574-4_19•

A Comparative Analysis of Dependence Testing Mechanisms

Jay Hoeflinger¹, Yunheung Paek²•Institutions (2)

University of Illinois at Urbana–Champaign¹, KAIST²

10 Aug 2000

TL;DR: This paper briefly describes a descriptor for representing memory accesses, the Linear Memory Access Descriptor (LMAD), and the Access Region Test (ART), and compares and contrasts the mechanisms of the LMAD intersection algorithm with the internal mechanisms of a number of prior dependence tests.

Abstract: The internal mechanism used for a dependence test constrains its accuracy and determines its speed. So does the form in which it represents array subscript expressions. The internal mechanism and representational form used for our Access Region Test (ART) is different from that used in any other dependence test, and therefore its constraints and characteristics are likewise different. In this paper, we briefly describe our descriptor for representing memory accesses, the Linear Memory Access Descriptor (LMAD) and the ART. We then describe the LMAD intersection algorithm in some detail. Finally, we compare and contrast the mechanisms of the LMAD intersection algorithm with the internal mechanisms for a number of prior dependence tests.

Book Chapter•10.1007/3-540-45574-4_18•

A Performance Advisor Tool for Shared-Memory Parallel Programming

Seon Wook Kim¹, Insung Park¹, Rudolf Eigenmann¹•Institutions (1)

Purdue University¹

10 Aug 2000

TL;DR: A framework is developed that helps inexperienced programmers, who lack the knowledge and intuition of advanced parallel programmers, by automating the analysis of static program information and performance data and offering active suggestions.

Abstract: Optimizing a parallel program is often difficult. This is true, in particular, for inexperienced programmers who lack the knowledge and intuition of advanced parallel programmers. We have developed a framework that addresses this problem by automating the analysis of static program information and performance data, and offering active suggestions to programmers. Our tool enables experts to transfer programming experience to new users. It complements today's parallelizing compilers in that it helps to tune the performance of a compiler-optimized parallel program. To show its applicability, we present two case studies that utilize this system. By simply following the suggestions of our system, we were able to reduce the execution time of benchmark programs by as much as 39%.

Book Chapter•10.1007/3-540-45574-4_24•

Issues of the Automatic Generation of HPF Loop Programs

Peter Faber¹, Martin Griebl¹, Christian Lengauer¹•Institutions (1)

University of Passau¹

10 Aug 2000

TL;DR: This work reports on problems met during code generation for HPF, and existing methods that can be used to reduce some of these problems.

Abstract: Writing correct and efficient programs for parallel computers remains a challenging task, even after some decades of research in this area. One way to generate parallel programs is to write sequential programs and let the compiler handle the details of extracting parallelism. LooPo is an automatic parallelizer that extracts parallelism from sequential loop nests by transformations in the polyhedron model. The generation of code from these transformed programs is an important step. We report on problems met during code generation for HPF, and existing methods that can be used to reduce some of these problems.

Book Chapter•10.1007/3-540-45574-4_9•

On Materializations of Array-Valued Temporaries

Daniel J. Rosenkrantz¹, Lenore R. Mullin¹, Harry B. Hunt¹•Institutions (1)

University at Albany, SUNY¹

10 Aug 2000

TL;DR: Results are presented demonstrating the usefulness of monolithic program analysis and optimization prior to scalarization and models are developed for studying nonmaterialization in basic blocks consisting of a sequence of assignment statements involving array-valued variables.

Abstract: We present results demonstrating the usefulness of monolithic program analysis and optimization prior to scalarization. In particular, models are developed for studying nonmaterialization in basic blocks consisting of a sequence of assignment statements involving array-valued variables. We use these models to analyze the problem of minimizing the number of materializations in a basic block, and to develop an efficient algorithm for minimizing the number of materializations in certain cases.

Book Chapter•10.1007/3-540-45574-4_20•

Safe Approximation of Data Dependencies in Pointer-Based Structures

D. K. Arvind¹, T. A. Lewis¹•Institutions (1)

University of Edinburgh¹

10 Aug 2000

TL;DR: A new approach to the analysis of dependencies in complex, pointer-based data structures using two-variable finite state automata (2FSA) to produce approximate, yet safe information.

Abstract: This paper describes a new approach to the analysis of dependencies in complex, pointer-based data structures. Structural information is provided by the programmer in the form of two-variable finite state automata (2FSA). Our method extracts data dependencies. For restricted forms of recursion, the data dependencies can be exact; however in general, we produce approximate, yet safe (i.e. overestimates dependencies) information. The analysis method has been automated and results are presented in this paper.

Book Chapter•10.1007/3-540-45574-4_28•

A Bytecode Optimizer to Engineer Bytecodes for Performance

Jian-Zhi Wu¹, Jenq Kuen Lee¹•Institutions (1)

National Tsing Hua University¹

10 Aug 2000

TL;DR: This work focuses on the ability of a bytecode-to-bytecode optimizing system to optimize the performance of hardware stack machines, and proposes a mechanism to report an allocation scheme for a given size of stack allocation according to a cost model.

Abstract: We are interested in the issues of bytecode transformation for performance improvements on programs. In this work, we focus on the ability of our bytecode-to-bytecode optimizing system to optimize the performance of hardware stack machines. Two categories of the problem are considered. First, we consider the stack allocations for intra-procedural cases with a family of Java processors. We propose a mechanism to report an allocation scheme for a given size of stack allocation according to our cost model. Second, we also extend our framework for stack allocations to deal with inter-procedural cases. Our initial experimental test-bed is based on an ITRI-made Java processor and Kaffe VM simulator [2]. Early experiments indicate our proposed methods are promising in speeding up Java programs on Java processors with a fixed size of stack caches.

Book Chapter•10.1007/3-540-45574-4_17•

Exploiting Ownership Sets in HPF

Pramod G. Joisha, Prithviraj Banerjee

10 Aug 2000

TL;DR: This paper arrives at a refined system that enables us to efficiently solve for the ownership set using the Fourier-Motzkin Elimination technique, and which requires the course vector as the only auxiliary vector.

Abstract: Ownership sets are fundamental to the partitioning of program computations across processors by the owner-computes rule. These sets arise due to the mapping of data arrays onto processors. In this paper, we focus on how ownership sets can be efficiently determined in the context of the HPF language, and show how the structure of these sets can be symbolically characterized in the presence of arbitrary data alignment and data distribution directives. Our starting point is a system of equalities and inequalities due to Ancourt et al. that captures the array mapping problem in HPF. We arrive at a refined system that enables us to efficiently solve for the ownership set using the Fourier-Motzkin Elimination technique, and which requires the course vector as the only auxiliary vector. We develop important and general properties pertaining to HPF alignments and distributions, and show how they can be used to eliminate redundant communication due to array replication. We also show how the generation of communication code can be avoided when pairs of array references are ultimately mapped onto the same processors. Experimental data demonstrating the improved code performance that the latter optimization enables is presented and discussed.
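
As background for the elimination step named in the abstract, here is a minimal textbook sketch of Fourier-Motzkin elimination on a system of linear inequalities of the form sum(c_i * x_i) <= bound. It is a generic version of the technique, not the refined system derived in the paper, and it makes no attempt to remove redundant constraints. The constraint encoding and the function name are assumptions made for this example.

```python
from fractions import Fraction


def eliminate(constraints, j):
    """Project variable j out of a system of inequalities.

    constraints -- list of (coeffs, bound), each meaning sum(c_i * x_i) <= bound
    Returns an equivalent system over the remaining variables; the
    coefficient of variable j is 0 in every returned constraint.
    """
    pos, neg, rest = [], [], []
    for coeffs, bound in constraints:
        c = coeffs[j]
        (pos if c > 0 else neg if c < 0 else rest).append((coeffs, bound))
    for pc, pb in pos:        # upper bounds on x_j
        for nc, nb in neg:    # lower bounds on x_j
            a, b = Fraction(pc[j]), Fraction(-nc[j])  # both strictly positive
            # Scaling by 1/a and 1/b makes the x_j terms cancel when added.
            coeffs = [Fraction(p) / a + Fraction(n) / b for p, n in zip(pc, nc)]
            rest.append((coeffs, Fraction(pb) / a + Fraction(nb) / b))
    return rest
```

For example, with variables (x, y) and the constraints x >= 0, x <= y, and y <= 5, eliminating y leaves 0 <= x <= 5.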

Book Chapter•10.1007/3-540-45574-4_25•

Run-Time Fusion of MPI Calls in a Parallel C++ Library

A. J. Field¹, Thomas L. Hansen¹, Paul H. J. Kelly•Institutions (1)

Imperial College London¹

10 Aug 2000

TL;DR: The results demonstrate the software engineering benefits that accrue from the CFL abstraction and show that performance close to that of manually optimised code can be achieved automatically in many cases.

Abstract: CFL (Communication Fusion Library) is a C++ library for MPI programmers. It uses overloading to distinguish private variables from replicated, shared variables, and automatically introduces MPI communication to keep such replicated data consistent. This paper concerns a simple but surprisingly effective technique which improves performance substantially: CFL operators are executed lazily in order to expose opportunities for run-time, context-dependent, optimisation such as message aggregation and operator fusion. We evaluate the idea in the context of a large-scale simulation of oceanic plankton ecology. The results demonstrate the software engineering benefits that accrue from the CFL abstraction and show that performance close to that of manually optimised code can be achieved automatically in many cases.
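
The lazy-execution idea can be illustrated with a small single-process simulation. The sketch below is not CFL and uses no MPI; it assumes a toy `FusionContext` that defers global sum reductions and, when any result is first read, completes all pending reductions with one aggregated "message", counted rather than sent. All class and method names are hypothetical.

```python
class LazyValue:
    """Handle for a reduction result that is computed only when first read."""

    def __init__(self, ctx):
        self._ctx = ctx
        self._value = None
        self._ready = False

    def get(self):
        if not self._ready:
            self._ctx.flush()  # forces every pending reduction at once
        return self._value


class FusionContext:
    """Toy stand-in for a communication layer; counts messages instead of sending."""

    def __init__(self, num_ranks):
        self.num_ranks = num_ranks
        self.pending = []       # deferred (handle, per-rank values) pairs
        self.messages_sent = 0

    def allreduce_sum(self, per_rank_values):
        """Record a global sum without communicating yet."""
        values = list(per_rank_values)
        if len(values) != self.num_ranks:
            raise ValueError("expected one value per rank")
        handle = LazyValue(self)
        self.pending.append((handle, values))
        return handle

    def flush(self):
        """Complete all pending reductions with a single aggregated message."""
        if not self.pending:
            return
        self.messages_sent += 1
        for handle, values in self.pending:
            handle._value = sum(values)
            handle._ready = True
        self.pending.clear()
```

Two reductions issued back to back cost one message in this model instead of two, because nothing is communicated until the first result is actually needed.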

Book Chapter•10.1007/3-540-45574-4_2•

Cost Hierarchies for Abstract Parallel Machines

John T. O'Donnell¹, Thomas Rauber², Gudula Rünger³•Institutions (3)

University of Glasgow¹, Martin Luther University of Halle-Wittenberg², Chemnitz University of Technology³

10 Aug 2000

TL;DR: In this paper, the authors add explicit cost models as the third component of an Abstract Parallel Machine (APM) system, which can be obtained either by analyzing a parallel operation definition, or by measuring performance on a real machine.

Abstract: The Abstract Parallel Machine (APM) model separates the definitions of parallel operations from the application algorithm, which defines the sequence of parallel operations to be executed. An APM contains a set of parallel operation definitions, which specify how the computation is organized into independent sites of computation and what data exchanges are required. This paper adds explicit cost models as the third component of an APM system. The costs of parallel operations can be obtained either by analyzing a parallel operation definition, or by measuring performance on a real machine. Costs with monotonicity constraints allow the cost of an algorithm to be transformed automatically as the algorithm itself is transformed.

Book Chapter•10.1007/3-540-45574-4_23•

Efficient Dynamic Local Enumeration for HPF

Will Denissen¹, Henk Sips¹•Institutions (1)

Delft University of Technology¹

10 Aug 2000

TL;DR: This paper presents an efficient dynamic local enumeration method, which always selects the optimal solution at run-time and has no need for code duplication; the method is compared with the PGI and the Adaptor compilers.

Abstract: In translating HPF programs, a compiler has to generate local iteration and communication sets. Apart from local enumeration, local storage compression is an issue, because in HPF array alignment functions can introduce local storage inefficiencies. Storage compression, however, may not lead to serious performance penalties. A problem in semi-automatic translation is that a compiler should generate efficient code in all cases where the user may expect efficient translation (no surprises). However, in current compilers this turns out to be not always true. A major cause of these inefficiencies is that compilers use the same fixed enumeration scheme in all cases. In this paper, we present an efficient dynamic local enumeration method, which always selects the optimal solution at run-time and has no need for code duplication. The method is compared with the PGI and the Adaptor compilers.

Book Chapter•10.1007/3-540-45574-4_26•

Set Operations for Orthogonal Processor Groups

Thomas Rauber, Robert Reilein, Gudula Rünger

10 Aug 2000

TL;DR: A generalization of the SPMD programming model for distributed memory machines based on orthogonal processor groups, in which set operations, implemented in MPI, express group-SPMD computations on different partitions of the processors.

Abstract: We consider a generalization of the SPMD programming model for distributed memory machines based on orthogonal processor groups. In this model different partitions of the processors into disjoint processor groups exist and can be used simultaneously in a single parallel implementation. Set operations on orthogonal groups are used to express group-SPMD computations on different partitions of the processors. The set operations are implemented in MPI.
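
The idea of several simultaneous partitions of one processor set can be illustrated with a small sketch. It assumes, purely for illustration, that the processors are arranged in a two-dimensional grid whose rows and columns form the two partitions; the abstract does not fix a layout. The function names `row_groups`, `col_groups`, and `is_partition` are hypothetical, and plain Python sets stand in for the MPI implementation mentioned above.

```python
def row_groups(rows, cols):
    """Partition ranks 0..rows*cols-1 into one group per grid row (assumed layout)."""
    return [frozenset(r * cols + c for c in range(cols)) for r in range(rows)]


def col_groups(rows, cols):
    """Partition the same ranks into one group per grid column (assumed layout)."""
    return [frozenset(r * cols + c for r in range(rows)) for c in range(cols)]


def is_partition(groups, n):
    """Check that the groups are pairwise disjoint and together cover ranks 0..n-1."""
    seen = set()
    for g in groups:
        if seen & g:  # a rank appears in two groups
            return False
        seen |= g
    return seen == set(range(n))


if __name__ == "__main__":
    rows, cols = 2, 3
    r_groups = row_groups(rows, cols)
    c_groups = col_groups(rows, cols)
    print("row partition valid:", is_partition(r_groups, rows * cols))
    print("column partition valid:", is_partition(c_groups, rows * cols))
    # In this grid layout, every row group meets every column group
    # in exactly one rank.
    for rg in r_groups:
        for cg in c_groups:
            print(sorted(rg), sorted(cg), "->", sorted(rg & cg))
```

Both partitions cover the same ranks, so a program could address a processor through its row group in one phase and through its column group in another, which is the kind of simultaneous use the abstract describes.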

Book Chapter•10.1007/3-540-45574-4_27•

Compiler Based Scheduling of Java Mobile Agents

Srivatsan Narasimhan¹, Santosh Pande¹•Institutions (1)

University of Cincinnati¹

10 Aug 2000

TL;DR: This work presents compiler-based scheduling strategies for Java Mobile Agents; the program is analyzed using annotations and data sizes, and for each strategy the compiler produces the best schedule, taking dependence information into account.

Abstract: This work presents compiler-based scheduling strategies for Java Mobile Agents. We analyze the program using annotations and data sizes. For the different strategies, the compiler produces the best schedule, taking dependence information into account.

Book Chapter•10.1007/3-540-45574-4_6•

SmartApps: An Application Centric Approach to High Performance Computing

Lawrence Rauchwerger¹, Nancy M. Amato¹, Josep Torrellas²•Institutions (2)

Texas A&M University¹, University of Illinois at Urbana–Champaign²

10 Aug 2000

TL;DR: The overall architecture of SmartApps is described and the achievements to date are presented: run-time optimizations, performance modeling, and moderately reconfigurable hardware.

Abstract: State-of-the-art run-time systems are a poor match to diverse, dynamic distributed applications because they are designed to provide support to a wide variety of applications, without much customization to individual specific requirements. Little or no guiding information flows directly from the application to the run-time system to allow the latter to fully tailor its services to the application. As a result, the performance is disappointing. To address this problem, we propose application-centric computing, or SMART APPLICATIONS. In the executable of smart applications, the compiler embeds most run-time system services, and a performance-optimizing feedback loop that monitors the application's performance and adaptively reconfigures the application and the OS/hardware platform. At run-time, after incorporating the code's input and the system's resources and state, the SmartApp performs a global optimization. This optimization is instance specific and thus much more tractable than a global generic optimization between application, OS and hardware. The resulting code and resource customization should lead to major speedups. In this paper, we first describe the overall architecture of SmartApps and then present the achievements to date: run-time optimizations, performance modeling, and moderately reconfigurable hardware.

Book Chapter•10.1007/3-540-45574-4_13•

Automatic Coarse Grain Task Parallel Processing on SMP Using OpenMP

Hironori Kasahara¹, Motoki Obata¹, Kazuhisa Ishizaka¹•Institutions (1)

Waseda University¹

10 Aug 2000

TL;DR: The proposed scheme decomposes a Fortran program into coarse grain tasks, analyzes parallelism among tasks by "Earliest Executable Condition Analysis" considering control and data dependencies, either statically schedules the coarse grain tasks to threads or generates dynamic task scheduling code to assign the tasks to threads, and generates OpenMP Fortran source code for an SMP machine.

Abstract: This paper proposes a simple and efficient implementation method for a hierarchical coarse grain task parallel processing scheme on an SMP machine. The OSCAR multigrain parallelizing compiler automatically generates parallelized code including OpenMP directives, and its performance is evaluated on a commercial SMP machine. Coarse grain task parallel processing is important for improving the effective performance of a wide range of multiprocessor systems, from a single chip multiprocessor to a high performance computer, beyond the limit of loop parallelism. The proposed scheme decomposes a Fortran program into coarse grain tasks, analyzes parallelism among tasks by "Earliest Executable Condition Analysis" considering control and data dependencies, either statically schedules the coarse grain tasks to threads or generates dynamic task scheduling code to assign the tasks to threads, and generates OpenMP Fortran source code for an SMP machine. The thread parallel code using OpenMP generated by the OSCAR compiler forks threads only once at the beginning of the program and joins only once at the end, even though the program is processed in parallel based on the hierarchical coarse grain task parallel processing concept. The performance of the scheme is evaluated on an 8-processor SMP machine, an IBM RS6000 SP 604e High Node, using a newly developed OpenMP backend of the OSCAR multigrain compiler. The evaluation shows that the OSCAR compiler with the IBM XL Fortran compiler version 5.1 gives 1.5 to 3 times larger speedup than the native XL Fortran compiler for SPEC 95fp SWIM, TOMCATV, HYDRO2D, MGRID and Perfect Benchmarks ARC2D.

Book Chapter•10.1007/3-540-45574-4_3•

Recursion Unrolling for Divide and Conquer Programs

Radu Rugina¹, Martin Rinard¹•Institutions (1)

Massachusetts Institute of Technology¹

10 Aug 2000

TL;DR: Recursion unrolling inlines recursive calls to reduce control flow overhead and increase the size of the basic blocks in the computation, which in turn increases the effectiveness of standard compiler optimizations such as register allocation and instruction scheduling.

Abstract: This paper presents recursion unrolling, a technique for improving the performance of recursive computations. Conceptually, recursion unrolling inlines recursive calls to reduce control flow overhead and increase the size of the basic blocks in the computation, which in turn increases the effectiveness of standard compiler optimizations such as register allocation and instruction scheduling. We have identified two transformations that significantly improve the effectiveness of the basic recursion unrolling technique. Conditional fusion merges conditionals with identical expressions, considerably simplifying the control flow in unrolled procedures. Recursion re-rolling rolls back the recursive part of the procedure to ensure that a large unrolled base case is always executed, regardless of the input problem size. We have implemented our techniques and applied them to an important class of recursive programs, divide and conquer programs. Our experimental results show that recursion unrolling can improve the performance of our programs by a factor of between 3.6 and 10.8, depending on the combination of the program and the architecture.
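
The basic inlining step can be shown with a hand-written sketch. This is an illustration only, not the authors' compiler transformation or its output: the divide and conquer array sum and the names `dc_sum` and `dc_sum_unrolled` are hypothetical examples, and conditional fusion and recursion re-rolling are not shown.

```python
def dc_sum(a, lo, hi):
    """Original divide and conquer: sum of a[lo:hi]."""
    n = hi - lo
    if n == 0:
        return 0
    if n == 1:
        return a[lo]
    mid = lo + n // 2
    return dc_sum(a, lo, mid) + dc_sum(a, mid, hi)


def dc_sum_unrolled(a, lo, hi):
    """Same computation with both recursive calls inlined once by hand."""
    n = hi - lo
    if n == 0:
        return 0
    if n == 1:
        return a[lo]
    mid = lo + n // 2

    # Inlined body of the call on the left half a[lo:mid].
    # Since n >= 2 here, the left half holds at least one element.
    left_n = mid - lo
    if left_n == 1:
        left = a[lo]
    else:
        left_mid = lo + left_n // 2
        left = dc_sum_unrolled(a, lo, left_mid) + dc_sum_unrolled(a, left_mid, mid)

    # Inlined body of the call on the right half a[mid:hi].
    right_n = hi - mid
    if right_n == 1:
        right = a[mid]
    else:
        right_mid = mid + right_n // 2
        right = dc_sum_unrolled(a, mid, right_mid) + dc_sum_unrolled(a, right_mid, hi)

    return left + right


if __name__ == "__main__":
    data = list(range(1, 11))
    print(dc_sum(data, 0, len(data)), dc_sum_unrolled(data, 0, len(data)))
```

Each invocation of the unrolled version covers two levels of the original recursion, so fewer calls are made and each call body is larger, which is the effect the abstract attributes to the technique.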
