Scispace (Formerly Typeset)
Showing papers presented at "Languages and Compilers for Parallel Computing in 2000"
Book Chapter•10.1007/3-540-45574-4_12•
Improving Locality for Adaptive Irregular Scientific Codes

Hwansoo Han1, Chau-Wen Tseng1•
University of Maryland, College Park1
10 Aug 2000
TL;DR: A cost model is developed to calculate an efficient optimization frequency; it may be applied dynamically by instrumenting the program to measure execution time per time-step iteration. The paper also shows that locality optimization may be used to improve performance even for adaptive codes.
Abstract: Irregular scientific codes experience poor cache performance due to their memory access patterns. In this paper, we examine two issues for locality optimizations for irregular computations. First, we experimentally find locality optimization can improve performance for parallel codes, but is dependent on the parallelization techniques used. Second, we show locality optimization may be used to improve performance even for adaptive codes. We develop a cost model which can be employed to calculate an efficient optimization frequency; it may be applied dynamically by instrumenting the program to measure execution time per time-step iteration. Our results are validated through experiments on three representative irregular scientific codes.

46 citations

Book Chapter•10.1007/3-540-45574-4_11•
Improving Offset Assignment for Embedded Processors

Sunil Atri1, J. Ramanujam1, Mahmut Kandemir2•
Louisiana State University1, Pennsylvania State University2
10 Aug 2000
TL;DR: This paper presents new code optimization techniques for embedded fixed point DSP processors which have limited on-chip program ROM and include indirect addressing modes using post-increment and decrement operations, and presents a heuristic to reduce code size by taking advantage of these addressing modes.
Abstract: Embedded systems consisting of the application program ROM, RAM, the embedded processor core, and any custom hardware on a single wafer are becoming increasingly common in application domains such as signal processing. Given the rapid deployment of these systems, programming on such systems has shifted from assembly language to high-level languages such as C, C++, and Java. The processors used in such systems are usually targeted toward specific application domains, e.g., digital signal processing (DSP). As a result, these embedded processors include application-specific instruction sets, complex and irregular data paths, etc., thereby rendering code generation for these processors difficult. In this paper, we present new code optimization techniques for embedded fixed point DSP processors which have limited on-chip program ROM and include indirect addressing modes using post-increment and decrement operations. We present a heuristic to reduce code size by taking advantage of these addressing modes. Our solution aims at improving the offset assignment produced by Liao et al.'s solution. It finds a layout of variables in RAM, so that it is possible to subsume explicit address register manipulation instructions into other instructions as a post-increment or post-decrement operation. Experimental results show the effectiveness of our solution.

37 citations
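
As a rough illustration of the offset assignment problem described in the abstract above, the sketch below scores a variable layout under a simplified model with a single address register: moving between variables at adjacent addresses is assumed to be free (folded into a post-increment or post-decrement), while any larger jump costs one explicit address-register instruction. The exhaustive search is only a stand-in; it is neither the authors' heuristic nor Liao et al.'s algorithm. The names `layout_cost` and `best_layout` and the access sequence are hypothetical.

```python
from itertools import permutations


def layout_cost(layout, accesses):
    """Count explicit address-register updates needed for `accesses`.

    Assumed model: one address register; a move between variables whose
    addresses differ by at most 1 is free (post-increment/decrement),
    and any larger jump costs one extra instruction.
    """
    addr = {var: i for i, var in enumerate(layout)}
    return sum(
        1
        for prev, cur in zip(accesses, accesses[1:])
        if abs(addr[prev] - addr[cur]) > 1
    )


def best_layout(accesses):
    """Try every layout exhaustively; feasible only for a few variables."""
    variables = sorted(set(accesses))
    return min(permutations(variables), key=lambda p: layout_cost(p, accesses))


accesses = list("acacbdbd")  # hypothetical access sequence
print(layout_cost("abcd", accesses))                 # cost in declaration order
print(layout_cost(best_layout(accesses), accesses))  # cost of the best layout found
```

Because the search is factorial in the number of variables, practical approaches rely on heuristics, which is the setting the paper addresses.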

Book Chapter•10.1007/3-540-45574-4_15•
Optimizing the Use of High Performance Software Libraries

Samuel Z. Guyer1, Calvin Lin1•
University of Texas at Austin1
10 Aug 2000
TL;DR: This paper describes how the use of software libraries, which is prevalent in high performance computing, can benefit from compiler optimizations in much the same way that conventional programming languages do.
Abstract: This paper describes how the use of software libraries, which is prevalent in high performance computing, can benefit from compiler optimizations in much the same way that conventional programming languages do. We explain how the compilation of these informal languages differs from the compilation of more conventional languages. In particular, such compilation requires precise pointer analysis, domain-specific information about the library's semantics, and a configurable compilation scheme. We describe a solution that combines dataflow analysis and pattern matching to perform configurable optimizations.

31 citations

Book Chapter•10.1007/3-540-45574-4_4•
An Empirical Study of Selective Optimization

Matthew Arnold1,2, Michael Hind2, Barbara G. Ryder1•
Rutgers University1, IBM2
10 Aug 2000
TL;DR: The results show that selective optimization can offer substantial improvement over an optimize-all-methods strategy for short-running applications, and for longer-running applications there is a significant range of methods that can be selectively optimized to achieve close to optimal performance.
Abstract: This paper describes an empirical study of selective optimization using the Jalapeno Java virtual machine. The goal of the study is to provide insight into the design and implementation of an adaptive system by investigating the performance potential of selective optimization and identifying the classes of applications for which this performance can be expected. Two types of offline profiling information are used to guide selective optimization, and several strategies for selecting the methods to optimize are compared. The results show that selective optimization can offer substantial improvement over an optimize-all-methods strategy for short-running applications, and for longer-running applications there is a significant range of methods that can be selectively optimized to achieve close to optimal performance. The results also show that a coarse-grained sampling system can provide enough accuracy to successfully guide selective optimization.

30 citations

Book Chapter•10.1007/3-540-45574-4_21•
OpenMP Extensions for Thread Groups and Their Run-Time Support

Marc Gonzalez, José Luis Hervás Oliver, Xavier Martorell, Eduard Ayguadé, Jesús Labarta, Nacho Navarro 
10 Aug 2000
TL;DR: A set of proposals for the OpenMP shared-memory programming model oriented towards the definition of thread groups in the framework of nested parallelism and the additional functionalities required in the runtime library supporting the parallel execution are presented.
Abstract: This paper presents a set of proposals for the OpenMP shared-memory programming model oriented towards the definition of thread groups in the framework of nested parallelism. The paper also describes the additional functionalities required in the runtime library supporting the parallel execution. The extensions have been implemented in the OpenMP NanosCompiler and evaluated in a set of real applications and benchmarks. In this paper we present experimental results for one of these applications.

29 citations

Book Chapter•10.1007/3-540-45574-4_14•
Compiler Synthesis of Task Graphs for Parallel Program Performance Prediction

Vikram Adve1,2, Rizos Sakellariou3•
National Center for Supercomputing Applications1, University of Illinois at Urbana–Champaign2, University of Manchester3
10 Aug 2000
TL;DR: This paper focuses on the use of task graphs in parallel programming systems, which have used them as a programming notation for expressing parallelism, as an internal representation in the compiler for computation partitioning and communication generation, and as a runtime representation for scheduling and execution of parallel programs.
Abstract: Task graphs and their equivalents have proved to be a valuable abstraction for representing the execution of parallel programs in a number of different applications. Perhaps the most widespread use of task graphs has been for performance modeling of parallel programs, including quantitative analytical models [3],[19],[25],[26],[27], theoretical and abstract analytical models [14], and program simulation [5],[13]. A second important use of task graphs is in parallel programming systems. Parallel programming environments such as PYRROS [28], CODE [24], HENCE [24], and Jade [20] have used task graphs at three different levels: as a programming notation for expressing parallelism, as an internal representation in the compiler for computation partitioning and communication generation, and as a runtime representation for scheduling and execution of parallel programs. Although the task graphs used in these systems differ in representation and semantics (e.g., whether task graph edges capture purely precedence constraints or also dataflow requirements), there are close similarities. Perhaps most importantly, they all capture the parallel structure of a program separately from the sequential computations, by breaking down the program into computational “tasks”, precedence relations between tasks, and (in some cases) explicit communication or synchronization operations between tasks.

22 citations

Book Chapter•10.1007/3-540-45574-4_8•
Searching for the Best FFT Formulas with the SPL Compiler

Jeremy Johnson1, Robert W. Johnson, David Padua2, Jianxin Xiong2•
Drexel University1, University of Illinois at Urbana–Champaign2
10 Aug 2000
TL;DR: This paper presents an application of an approach to implementing and optimizing fast signal transforms, based on a domain-specific computer language called SPL, to the implementation of the FFT.
Abstract: This paper discusses an approach to implementing and optimizing fast signal transforms based on a domain-specific computer language, called SPL. SPL programs, which are essentially mathematical formulas, represent matrix factorizations, which provide fast algorithms for computing many important signal transforms. A special purpose compiler translates SPL programs into efficient FORTRAN programs. Since there are many formulas for a given transform, a fast implementation can be obtained by generating alternative formulas and searching for the one with the fastest execution time. This paper presents an application of this methodology to the implementation of the FFT.

19 citations

Book Chapter•10.1007/3-540-45574-4_16•
Compiler Techniques for Flat Neighborhood Networks

Henry G. Dietz1, Timothy I. Mattox1•
University of Kentucky1
10 Aug 2000
TL;DR: This paper centers on the use of a set of genetic search algorithms to compile the network wiring pattern, basic routing tables, and code for specific communication patterns that will use an optimized schedule rather than simply applying the basic routing.
Abstract: A Flat Neighborhood Network (FNN) is a new interconnection network architecture that can provide very low latency and high bisection bandwidth at a minimal cost for large clusters. However, unlike more traditional designs, FNNs generally are not symmetric. Thus, although an FNN by definition offers a certain base level of performance for random communication patterns, both the network design and communication (routing) schedules can be optimized to make specific communication patterns achieve significantly more than the basic performance. The primary mechanism for design of both the network and communication schedules is a set of genetic search algorithms (GAs) that derive good designs from specifications of particular communication patterns. This paper centers on the use of these GAs to compile the network wiring pattern, basic routing tables, and code for specific communication patterns that will use an optimized schedule rather than simply applying the basic routing.

18 citations

Book Chapter•10.1007/3-540-45574-4_10•
Experimental Evaluation of Energy Behavior of Iteration Space Tiling

Mahmut Kandemir1, Narayanan Vijaykrishnan1, Mary Jane Irwin1, H. S. Kim1•
Pennsylvania State University1
10 Aug 2000
TL;DR: The results show that the choice of tile size and input size critically impacts the system energy consumption and reveal that tiling should be applied more or less aggressively based on whether the low power objective is to prolong the battery life or to limit the energy dissipated within a package.
Abstract: Optimizing compilers have traditionally focused on enhancing the performance of a given piece of code. With the proliferation of embedded software, it is becoming important to identify the energy impact of these traditional performance-oriented optimizations and to develop new energy-aware schemes. Towards this goal, this paper explores the energy consumption behavior of one of the widely-used loop-level compiler optimizations, iteration space tiling, by varying a set of software and hardware parameters. Our results show that the choice of tile size and input size critically impacts the system energy consumption. Specifically, we find that the best tile size for the least energy consumed is different from that for the best performance. Also, tailoring tile size to the input size generates better energy results than working with a fixed tile size. Our results also reveal that tiling should be applied more or less aggressively based on whether the low power objective is to prolong the battery life or to limit the energy dissipated within a package.

13 citations
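
For readers unfamiliar with iteration space tiling, the transformation studied in the abstract above, the following sketch shows the loop restructuring on a square matrix multiply. It illustrates only the transformation and the tile-size parameter; it does not model energy, and the function names and the sample tile sizes are assumptions rather than anything taken from the paper.

```python
def matmul_naive(a, b):
    """Reference triple loop over an n x n iteration space."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile):
    """Same computation, visiting the iteration space tile by tile.

    The outer three loops step over tile origins; the inner three loops
    sweep one tile, clipped at the matrix boundary when `tile` does not
    divide n evenly.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
for tile in (1, 2, 3, 5):  # tile sizes chosen arbitrarily for the demo
    assert matmul_tiled(a, b, tile) == matmul_naive(a, b)
```

The tile size changes only the order in which iterations run, not the result, which is why it can be tuned separately for performance or for energy.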

Book Chapter•10.1007/3-540-45574-4_7•
Extending Scalar Optimizations for Arrays

David Wonnacott1•
Haverford College1
10 Aug 2000
TL;DR: It is shown that the value-based dependence relations produced by the Omega Test can be used as a basis for generalizations of several scalar optimizations.
Abstract: Traditional techniques for array analysis do not provide dataflow information, and thus traditional dataflow-based scalar optimizations have not been applied to array elements. A number of techniques have recently been developed for producing information about array dataflow, raising the possibility that dataflow-based optimizations could be applied to array elements. In this paper, we show that the value-based dependence relations produced by the Omega Test can be used as a basis for generalizations of several scalar optimizations.

12 citations

Proceedings Article•
A Matlab Just-In-time Compiler

George Almási, David Padua
10 Aug 2000
TL;DR: This paper describes experience with MaJIC, a just-in-time compiler for MATLAB; experiments indicate large speedups when compared to the interpreter, and reasonable performance when compared to static compilers.
Abstract: This paper describes our experience with MaJIC, a just-in-time compiler for MATLAB. In the recent past, several compiler projects claimed large performance improvements when processing MATLAB code. Most of these projects are static compilers suited for batch processing; MaJIC is a just-in-time compiler. The compilation process is transparent to the user. This impacts the modus operandi of the compiler, resulting in a few interesting analysis techniques. Our experiments with MaJIC indicate large speedups when compared to the interpreter, and reasonable performance when compared to static compilers.
Book Chapter•10.1007/3-540-45574-4_1•
Accurate Shape Analysis for Recursive Data Structures

Francisco Corbera1, Rafael Asenjo1, Emilio L. Zapata1•
University of Málaga1
10 Aug 2000
TL;DR: This paper describes the framework and the compiler implemented to capture complex data structures generated, traversed, and modified in C codes; the method assigns a Reduced Set of Reference Shape Graphs (RSRSG) to each sentence to approximate the shape of the data structure after that sentence executes.
Abstract: Automatic parallelization of codes which use dynamic data structures is still a challenge. One of the first steps in such parallelization is the automatic detection of the dynamic data structure used in the code. In this paper we describe the framework and the compiler we have implemented to capture complex data structures generated, traversed, and modified in C codes. Our method assigns a Reduced Set of Reference Shape Graphs (RSRSG) to each sentence to approximate the shape of the data structure after the execution of such a sentence. With the properties and operations that define the behavior of our RSRSG, the method can accurately detect complex recursive data structures such as a doubly linked list of pointers to trees where the leaves point to additional lists. Other experiments are carried out with real codes to validate the capabilities of our compiler.
Proceedings Article•
Improving variable placement for embedded processors

Sunil Atri, J. Ramanujam, Mahmut Kandemir
1 Jan 2000
Book Chapter•10.1007/3-540-45574-4_22•
Compiling Data Intensive Applications with Spatial Coordinates

Renato Ferreira1, Gagan Agrawal2, Ruoming Jin2, Joel H. Saltz1•
University of Maryland, College Park1, University of Delaware2
10 Aug 2000
TL;DR: A general compilation and execution strategy that achieves high locality in disk accesses is presented for data intensive applications with two important properties, together with a technique for hoisting conditionals that further improves the execution efficiency of the compiled codes.
Abstract: Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We are developing a compiler which processes data intensive applications written in a dialect of Java and compiles them for efficient execution on a cluster of workstations or distributed memory machines. In this paper, we focus on data intensive applications with two important properties: 1) data elements have spatial coordinates associated with them and the distribution of the data is not regular with respect to these coordinates, and 2) the application processes only a subset of the available data on the basis of spatial coordinates. These applications arise in many domains like satellite data-processing and medical imaging. We present a general compilation and execution strategy for this class of applications which achieves high locality in disk accesses. We then present a technique for hoisting conditionals which further improves efficiency in execution of such compiled codes. Our preliminary experimental results show that the performance from our proposed execution strategy is nearly two orders of magnitude better than a naive strategy. Further, up to 30% improvement in performance is observed by applying the technique for hoisting conditionals.
Book Chapter•10.1007/3-540-45574-4_19•
A Comparative Analysis of Dependence Testing Mechanisms

Jay Hoeflinger1, Yunheung Paek2•
University of Illinois at Urbana–Champaign1, KAIST2
10 Aug 2000
TL;DR: This paper briefly describes the descriptor for representing memory accesses, the Linear Memory Access Descriptor (LMAD), and the Access Region Test (ART), and compares and contrasts the mechanisms of the LMAD intersection algorithm with the internal mechanisms of a number of prior dependence tests.
Abstract: The internal mechanism used for a dependence test constrains its accuracy and determines its speed. So does the form in which it represents array subscript expressions. The internal mechanism and representational form used for our Access Region Test (ART) is different from that used in any other dependence test, and therefore its constraints and characteristics are likewise different. In this paper, we briefly describe our descriptor for representing memory accesses, the Linear Memory Access Descriptor (LMAD) and the ART. We then describe the LMAD intersection algorithm in some detail. Finally, we compare and contrast the mechanisms of the LMAD intersection algorithm with the internal mechanisms for a number of prior dependence tests.
Book Chapter•10.1007/3-540-45574-4_18•
A Performance Advisor Tool for Shared-Memory Parallel Programming

Seon Wook Kim1, Insung Park1, Rudolf Eigenmann1•
Purdue University1
10 Aug 2000
TL;DR: A framework is developed to help inexperienced programmers, who lack the knowledge and intuition of advanced parallel programmers, by automating the analysis of static program information and performance data and offering active suggestions.
Abstract: Optimizing a parallel program is often difficult. This is true, in particular, for inexperienced programmers who lack the knowledge and intuition of advanced parallel programmers. We have developed a framework that addresses this problem by automating the analysis of static program information and performance data, and offering active suggestions to programmers. Our tool enables experts to transfer programming experience to new users. It complements today's parallelizing compilers in that it helps to tune the performance of a compiler-optimized parallel program. To show its applicability, we present two case studies that utilize this system. By simply following the suggestions of our system, we were able to reduce the execution time of benchmark programs by as much as 39%.
Book Chapter•10.1007/3-540-45574-4_24•
Issues of the Automatic Generation of HPF Loop Programs

Peter Faber1, Martin Griebl1, Christian Lengauer1•
University of Passau1
10 Aug 2000
TL;DR: This work reports on problems met during code generation for HPF, and existing methods that can be used to reduce some of these problems.
Abstract: Writing correct and efficient programs for parallel computers remains a challenging task, even after some decades of research in this area. One way to generate parallel programs is to write sequential programs and let the compiler handle the details of extracting parallelism. LooPo is an automatic parallelizer that extracts parallelism from sequential loop nests by transformations in the polyhedron model. The generation of code from these transformed programs is an important step. We report on problems met during code generation for HPF, and existing methods that can be used to reduce some of these problems.
Book Chapter•10.1007/3-540-45574-4_9•
On Materializations of Array-Valued Temporaries

Daniel J. Rosenkrantz1, Lenore R. Mullin1, Harry B. Hunt1•
University at Albany, SUNY1
10 Aug 2000
TL;DR: Results are presented demonstrating the usefulness of monolithic program analysis and optimization prior to scalarization and models are developed for studying nonmaterialization in basic blocks consisting of a sequence of assignment statements involving array-valued variables.
Abstract: We present results demonstrating the usefulness of monolithic program analysis and optimization prior to scalarization. In particular, models are developed for studying nonmaterialization in basic blocks consisting of a sequence of assignment statements involving array-valued variables. We use these models to analyze the problem of minimizing the number of materializations in a basic block, and to develop an efficient algorithm for minimizing the number of materializations in certain cases.
Book Chapter•10.1007/3-540-45574-4_20•
Safe Approximation of Data Dependencies in Pointer-Based Structures

D. K. Arvind1, T. A. Lewis1•
University of Edinburgh1
10 Aug 2000
TL;DR: A new approach to the analysis of dependencies in complex, pointer-based data structures using two-variable finite state automata (2FSA) to produce approximate, yet safe information.
Abstract: This paper describes a new approach to the analysis of dependencies in complex, pointer-based data structures. Structural information is provided by the programmer in the form of two-variable finite state automata (2FSA). Our method extracts data dependencies. For restricted forms of recursion, the data dependencies can be exact; however in general, we produce approximate, yet safe (i.e. overestimates dependencies) information. The analysis method has been automated and results are presented in this paper.
Book Chapter•10.1007/3-540-45574-4_28•
A Bytecode Optimizer to Engineer Bytecodes for Performance

Jian-Zhi Wu1, Jenq Kuen Lee1•
National Tsing Hua University1
10 Aug 2000
TL;DR: This work focuses on the ability of a bytecode-to-bytecode optimizing system to optimize the performance of hardware stack machines, and proposes a mechanism to report an allocation scheme for a given size of stack allocation according to the cost model.
Abstract: We are interested in the issues of bytecode transformation for performance improvements on programs. In this work, we focus on the ability of our bytecode-to-bytecode optimizing system to optimize the performance of hardware stack machines. Two categories of the problem are considered. First, we consider the stack allocations for intra-procedural cases with a family of Java processors. We propose a mechanism to report an allocation scheme for a given size of stack allocation according to our cost model. Second, we also extend our framework for stack allocations to deal with inter-procedural cases. Our initial experimental test-bed is based on an ITRI-made Java processor and Kaffe VM simulator [2]. Early experiments indicate our proposed methods are promising in speeding up Java programs on Java processors with a fixed size of stack caches.
Book Chapter•10.1007/3-540-45574-4_17•
Exploiting Ownership Sets in HPF

Pramod G. Joisha, Prithviraj Banerjee
10 Aug 2000
TL;DR: This paper arrives at a refined system that enables us to efficiently solve for the ownership set using the Fourier-Motzkin Elimination technique, and which requires the course vector as the only auxiliary vector.
Abstract: Ownership sets are fundamental to the partitioning of program computations across processors by the owner-computes rule. These sets arise due to the mapping of data arrays onto processors. In this paper, we focus on how ownership sets can be efficiently determined in the context of the HPF language, and show how the structure of these sets can be symbolically characterized in the presence of arbitrary data alignment and data distribution directives. Our starting point is a system of equalities and inequalities due to Ancourt et al. that captures the array mapping problem in HPF. We arrive at a refined system that enables us to efficiently solve for the ownership set using the Fourier-Motzkin Elimination technique, and which requires the course vector as the only auxiliary vector. We develop important and general properties pertaining to HPF alignments and distributions, and show how they can be used to eliminate redundant communication due to array replication. We also show how the generation of communication code can be avoided when pairs of array references are ultimately mapped onto the same processors. Experimental data demonstrating the improved code performance that the latter optimization enables is presented and discussed.
Book Chapter•10.1007/3-540-45574-4_25•
Run-Time Fusion of MPI Calls in a Parallel C++ Library

A. J. Field1, Thomas L. Hansen1, Paul H. J. Kelly•
Imperial College London1
10 Aug 2000
TL;DR: The results demonstrate the software engineering benefits that accrue from the CFL abstraction and show that performance close to that of manually optimised code can be achieved automatically in many cases.
Abstract: CFL (Communication Fusion Library) is a C++ library for MPI programmers. It uses overloading to distinguish private variables from replicated, shared variables, and automatically introduces MPI communication to keep such replicated data consistent. This paper concerns a simple but surprisingly effective technique which improves performance substantially: CFL operators are executed lazily in order to expose opportunities for run-time, context-dependent, optimisation such as message aggregation and operator fusion. We evaluate the idea in the context of a large-scale simulation of oceanic plankton ecology. The results demonstrate the software engineering benefits that accrue from the CFL abstraction and show that performance close to that of manually optimised code can be achieved automatically in many cases.
Book Chapter•10.1007/3-540-45574-4_2•
Cost Hierarchies for Abstract Parallel Machines

John T. O'Donnell1, Thomas Rauber2, Gudula Rünger3•
University of Glasgow1, Martin Luther University of Halle-Wittenberg2, Chemnitz University of Technology3
10 Aug 2000
TL;DR: In this paper, the authors add explicit cost models as the third component of an Abstract Parallel Machine (APM) system, which can be obtained either by analyzing a parallel operation definition, or by measuring performance on a real machine.
Abstract: The Abstract Parallel Machine (APM) model separates the definitions of parallel operations from the application algorithm, which defines the sequence of parallel operations to be executed. An APM contains a set of parallel operation definitions, which specify how the computation is organized into independent sites of computation and what data exchanges are required. This paper adds explicit cost models as the third component of an APM system. The costs of parallel operations can be obtained either by analyzing a parallel operation definition, or by measuring performance on a real machine. Costs with monotonicity constraints allow the cost of an algorithm to be transformed automatically as the algorithm itself is transformed.
Book Chapter•10.1007/3-540-45574-4_23•
Efficient Dynamic Local Enumeration for HPF

Will Denissen1, Henk Sips1•
Delft University of Technology1
10 Aug 2000
TL;DR: This paper presents an efficient dynamic local enumeration method, which always selects the optimal solution at run-time and has no need for code duplication; the method is compared with the PGI and the Adaptor compilers.
Abstract: In translating HPF programs, a compiler has to generate local iteration and communication sets. Apart from local enumeration, local storage compression is an issue, because in HPF array alignment functions can introduce local storage inefficiencies. Storage compression, however, may not lead to serious performance penalties. A problem in semi-automatic translation is that a compiler should generate efficient code in all cases the user may expect efficient translation (no surprises). However, in current compilers this turns out to be not always true. A major cause of these inefficiencies is that compilers use the same fixed enumeration scheme in all cases. In this paper, we present an efficient dynamic local enumeration method, which always selects the optimal solution at run-time and has no need for code duplication. The method is compared with the PGI and the Adaptor compiler.
Book Chapter•10.1007/3-540-45574-4_26•
Set Operations for Orthogonal Processor Groups

Thomas Rauber, Robert Reilein, Gudula Rünger
10 Aug 2000
TL;DR: A generalization of the SPMD programming model for distributed memory machines based on orthogonal processor groups is considered, in which set operations, implemented in MPI, express group-SPMD computations on different partitions of the processors.
Abstract: We consider a generalization of the SPMD programming model for distributed memory machines based on orthogonal processor groups. In this model different partitions of the processors into disjoint processor groups exist and can be used simultaneously in a single parallel implementation. Set operations on orthogonal groups are used to express group-SPMD computations on different partitions of the processors. The set operations are implemented in MPI.
Book Chapter•10.1007/3-540-45574-4_27•
Compiler Based Scheduling of Java Mobile Agents

Srivatsan Narasimhan1, Santosh Pande1•
University of Cincinnati1
10 Aug 2000
TL;DR: This work presents compiler-based scheduling strategies for Java Mobile Agents; the program is analyzed using annotations and data sizes, and for each strategy the compiler produces the best schedule, taking dependence information into account.
Abstract: This work presents compiler-based scheduling strategies for Java Mobile Agents. We analyze the program using annotations and data sizes. For the different strategies, the compiler produces the best schedule, taking dependence information into account.
Book Chapter•10.1007/3-540-45574-4_6•
SmartApps: An Application Centric Approach to High Performance Computing

[...]

Lawrence Rauchwerger1, Nancy M. Amato1, Josep Torrellas2•
Texas A&M University1, University of Illinois at Urbana–Champaign2
10 Aug 2000
TL;DR: The overall architecture of Smartapps is described and the achievements to date are presented: Run-time optimizations, performance modeling, and moderately reconfigurable hardware.
Abstract: State-of-the-art run-time systems are a poor match to diverse, dynamic distributed applications because they are designed to provide support to a wide variety of applications, without much customization to individual specific requirements. Little or no guiding information flows directly from the application to the run-time system to allow the latter to fully tailor its services to the application. As a result, the performance is disappointing. To address this problem, we propose application-centric computing, or SMART APPLICATIONS. In the executable of smart applications, the compiler embeds most run-time system services, and a performance-optimizing feedback loop that monitors the application's performance and adaptively reconfigures the application and the OS/hardware platform. At run-time, after incorporating the code's input and the system's resources and state, the SmartApp performs a global optimization. This optimization is instance specific and thus much more tractable than a global generic optimization between application, OS and hardware. The resulting code and resource customization should lead to major speedups. In this paper, we first describe the overall architecture of Smartapps and then present the achievements to date: Run-time optimizations, performance modeling, and moderately reconfigurable hardware.
Book Chapter•10.1007/3-540-45574-4_13•
Automatic Coarse Grain Task Parallel Processing on SMP Using OpenMP

[...]

Hironori Kasahara1, Motoki Obata1, Kazuhisa Ishizaka1•
Waseda University1
10 Aug 2000
TL;DR: The proposed scheme decomposes a Fortran program into coarse grain tasks, analyzes parallelism among tasks by "Earliest Executable Condition Analysis" considering control and data dependencies, statically schedules the coarse grain tasks to threads or generates dynamic task scheduling codes to assign the tasks to threads and generates OpenMP Fortran source code for an SMP machine.
Abstract: This paper proposes a simple and efficient implementation method for a hierarchical coarse grain task parallel processing scheme on an SMP machine. The OSCAR multigrain parallelizing compiler automatically generates parallelized code including OpenMP directives, and its performance is evaluated on a commercial SMP machine. Coarse grain task parallel processing is important for improving the effective performance of a wide range of multiprocessor systems, from a single chip multiprocessor to a high performance computer, beyond the limit of loop parallelism. The proposed scheme decomposes a Fortran program into coarse grain tasks, analyzes parallelism among tasks by "Earliest Executable Condition Analysis" considering control and data dependencies, statically schedules the coarse grain tasks to threads or generates dynamic task scheduling codes to assign the tasks to threads and generates OpenMP Fortran source code for an SMP machine. The thread parallel code using OpenMP generated by the OSCAR compiler forks threads only once at the beginning of the program and joins only once at the end, even though the program is processed in parallel based on the hierarchical coarse grain task parallel processing concept. The performance of the scheme is evaluated on an 8-processor SMP machine, IBM RS6000 SP 604e High Node, using a newly developed OpenMP backend of the OSCAR multigrain compiler. The evaluation shows that the OSCAR compiler with IBM XL Fortran compiler version 5.1 gives 1.5 to 3 times larger speedup than the native XL Fortran compiler for SPEC 95fp SWIM, TOMCATV, HYDRO2D, MGRID and Perfect Benchmarks ARC2D.
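The static scheduling step can be pictured with a small greedy list scheduler over a task graph. This is a generic sketch, not OSCAR's Earliest Executable Condition Analysis: it handles data dependencies only, assumes task costs are known in advance, and the task names, costs and function name are hypothetical.

```python
def list_schedule(costs, deps, num_threads):
    """Greedily assign tasks to threads.

    costs: dict mapping task -> execution time
    deps:  dict mapping task -> iterable of tasks it must wait for
    Returns ({task: (thread, start_time)}, makespan).
    """
    finish = {}                       # task -> finish time
    thread_free = [0] * num_threads   # time at which each thread becomes idle
    placement = {}
    remaining = set(costs)

    def data_ready(task):
        # Earliest time at which all predecessors of `task` have finished.
        return max((finish[d] for d in deps.get(task, ())), default=0)

    while remaining:
        ready = [t for t in remaining
                 if all(d in finish for d in deps.get(t, ()))]
        if not ready:
            raise ValueError("dependency cycle or unknown task in deps")
        # Pick the ready task whose inputs are available first (ties by name).
        task = min(ready, key=lambda t: (data_ready(t), t))
        thread = min(range(num_threads), key=lambda i: thread_free[i])
        start = max(data_ready(task), thread_free[thread])
        finish[task] = start + costs[task]
        thread_free[thread] = finish[task]
        placement[task] = (thread, start)
        remaining.remove(task)

    return placement, max(finish.values(), default=0)


if __name__ == "__main__":
    costs = {"A": 2, "B": 3, "C": 1, "D": 2}
    deps = {"C": ["A", "B"], "D": ["A"]}
    print(list_schedule(costs, deps, num_threads=2))
```

Because the whole placement is computed before execution, the threads can be created once and simply follow their assigned task lists, in the spirit of the fork-once, join-once code described in the abstract.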
Book Chapter•10.1007/3-540-45574-4_3•
Recursion Unrolling for Divide and Conquer Programs

[...]

Radu Rugina1, Martin Rinard1•
Massachusetts Institute of Technology1
10 Aug 2000
TL;DR: Recursion unrolling inlines recursive calls to reduce control flow overhead and increase the size of the basic blocks in the computation, which in turn increases the effectiveness of standard compiler optimizations such as register allocation and instruction scheduling.
Abstract: This paper presents recursion unrolling, a technique for improving the performance of recursive computations. Conceptually, recursion unrolling inlines recursive calls to reduce control flow overhead and increase the size of the basic blocks in the computation, which in turn increases the effectiveness of standard compiler optimizations such as register allocation and instruction scheduling. We have identified two transformations that significantly improve the effectiveness of the basic recursion unrolling technique. Conditional fusion merges conditionals with identical expressions, considerably simplifying the control flow in unrolled procedures. Recursion re-rolling rolls back the recursive part of the procedure to ensure that a large unrolled base case is always executed, regardless of the input problem size. We have implemented our techniques and applied them to an important class of recursive programs, divide and conquer programs. Our experimental results show that recursion unrolling can improve the performance of our programs by a factor of between 3.6 and 10.8, depending on the combination of the program and the architecture.
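The effect of the transformation can be sketched by hand on a toy divide and conquer sum. The second function below is what inlining two levels of recursive calls into the base case might look like; it is written manually for illustration, is not output of the authors' compiler, and both function names are hypothetical.

```python
def sum_range(a, lo, hi):
    """Plain divide and conquer sum of a[lo:hi], recursing down to one element."""
    n = hi - lo
    if n == 0:
        return 0
    if n == 1:
        return a[lo]
    mid = lo + n // 2
    return sum_range(a, lo, mid) + sum_range(a, mid, hi)


def sum_range_unrolled(a, lo, hi):
    """Same result, with a larger base case that handles up to 4 elements."""
    n = hi - lo
    if n <= 4:
        # Enlarged base case: no recursive calls, one bigger block of code.
        total = 0
        if n >= 1:
            total += a[lo]
        if n >= 2:
            total += a[lo + 1]
        if n >= 3:
            total += a[lo + 2]
        if n == 4:
            total += a[lo + 3]
        return total
    mid = lo + n // 2
    return sum_range_unrolled(a, lo, mid) + sum_range_unrolled(a, mid, hi)


if __name__ == "__main__":
    data = list(range(10))
    print(sum_range(data, 0, len(data)) == sum_range_unrolled(data, 0, len(data)))
```

The unrolled version makes far fewer calls near the leaves of the recursion, which is where a divide and conquer program spends most of its call overhead.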
