TL;DR: A compiler algorithm, designed for use with both distributed and shared address space machines, that automatically finds computation and data decompositions optimizing both parallelism and locality.
Abstract: Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key issue in compiling programs for these architectures. This paper describes a compiler algorithm that automatically finds computation and data decompositions that optimize both parallelism and locality. This algorithm is designed for use with both distributed and shared address space machines. The scope of our algorithm is dense matrix computations where the array accesses are affine functions of the loop indices. Our algorithm can handle programs with general nestings of parallel and sequential loops. We present a mathematical framework that enables us to systematically derive the decompositions. Our algorithm can exploit parallelism in both fully parallelizable loops as well as loops that require explicit synchronization. The algorithm will trade off extra degrees of parallelism to eliminate communication. If communication is needed, the algorithm will try to introduce the least expensive forms of communication into those parts of the program that are least frequently executed.
TL;DR: It is shown that the problems of communication code generation, local memory management, message aggregation and redundant data communication elimination can all be solved by projecting polyhedra represented by sets of inequalities onto lower dimensional spaces.
Abstract: This paper presents several algorithms to solve code generation and optimization problems specific to machines with distributed address spaces. Given a description of how the computation is to be partitioned across the processors in a machine, our algorithms produce an SPMD (single program multiple data) program to be run on each processor. Our compiler generates the necessary receive and send instructions, optimizes the communication by eliminating redundant communication and aggregating small messages into large messages, allocates space locally on each processor, and translates global data addresses to local addresses. Our techniques are based on an exact data-flow analysis on individual array element accesses. Unlike data dependence analysis, this analysis determines if two dynamic instances refer to the same value, and not just to the same location. Using this information, our compiler can handle more flexible data decompositions and find more opportunities for communication optimization than systems based on data dependence analysis. Our technique is based on a uniform framework, where data decompositions, computation decompositions and the data flow information are all represented as systems of linear inequalities. We show that the problems of communication code generation, local memory management, message aggregation and redundant data communication elimination can all be solved by projecting polyhedra represented by sets of inequalities onto lower dimensional spaces.
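The projection step named in this abstract can be illustrated with Fourier-Motzkin elimination, a standard way to project a system of linear inequalities onto fewer variables. The sketch below is not the paper's implementation: it works over the rationals rather than the integers, and the function names (`eliminate`, `bounds`) and the example loop are assumptions made for illustration.

```python
from fractions import Fraction


def eliminate(rows, k):
    """Project out variable k. Each row is (coeffs, bound), meaning coeffs . x <= bound."""
    pos, neg, rest = [], [], []
    for coeffs, bound in rows:
        if coeffs[k] > 0:
            pos.append((coeffs, bound))
        elif coeffs[k] < 0:
            neg.append((coeffs, bound))
        else:
            rest.append((coeffs, bound))
    for p_coeffs, p_bound in pos:
        for n_coeffs, n_bound in neg:
            # Scale both rows so x_k has coefficients +1 and -1, then add them.
            sp = 1 / Fraction(p_coeffs[k])
            sn = 1 / Fraction(-n_coeffs[k])
            combined = [sp * a + sn * b for a, b in zip(p_coeffs, n_coeffs)]
            rest.append((combined, sp * p_bound + sn * n_bound))
    return rest


def bounds(rows, k):
    """Tightest (lower, upper) bounds on variable k from rows that mention only k."""
    lower, upper = None, None
    for coeffs, bound in rows:
        others = any(c != 0 for j, c in enumerate(coeffs) if j != k)
        if others or coeffs[k] == 0:
            continue
        limit = Fraction(bound) / coeffs[k]
        if coeffs[k] > 0:
            upper = limit if upper is None else min(upper, limit)
        else:
            lower = limit if lower is None else max(lower, limit)
    return lower, upper


# Hypothetical example: iterations 0 <= i <= 9 each touch elements j with i <= j <= i + 2.
# Variables are ordered (i, j).
SYSTEM = [
    ([-1, 0], 0),   # -i <= 0
    ([1, 0], 9),    #  i <= 9
    ([1, -1], 0),   #  i - j <= 0
    ([-1, 1], 2),   # -i + j <= 2
]

# Projecting out i leaves the range of array elements touched by the whole loop.
J_RANGE = bounds(eliminate(SYSTEM, 0), 1)
```

Here `J_RANGE` describes the elements `j` accessed by any iteration, which is the kind of question a compiler asks when it sizes a local buffer or a message. Exact integer projection needs additional care that this sketch omits.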
TL;DR: A new method for performing mutation analysis that uses program schemata to encode all mutants for a program into one metaprogram, which is subsequently compiled and run at speeds substantially higher than achieved by previous interpretive systems.
Abstract: Mutation analysis is a powerful technique for assessing and improving the quality of test data used to unit test software. Unfortunately, current automated mutation analysis systems suffer from severe performance problems. This paper presents a new method for performing mutation analysis that uses program schemata to encode all mutants for a program into one metaprogram, which is subsequently compiled and run at speeds substantially higher than achieved by previous interpretive systems. Preliminary performance improvements of over 300% are reported. This method has the additional advantages of being easier to implement than interpretive systems, being simpler to port across a wide range of hardware and software platforms, and using the same compiler and run-time support system that is used during development and/or deployment.
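A minimal sketch of the schema idea, assuming a single kind of mutation (arithmetic operator replacement): every operator in the program is routed through a metaoperator, so one compiled metaprogram can behave as the original or as any mutant. The names `arith`, `ACTIVE_MUTANT`, and `run_mutants` are hypothetical and do not come from the paper.

```python
import operator

# Replacement operators available at each mutation site (illustrative set).
ARITH_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# The mutant currently selected: (site_id, replacement_op), or None for the original program.
ACTIVE_MUTANT = None


def arith(site_id, original_op, a, b):
    """Metaoperator: applies original_op unless the active mutant targets this site."""
    if ACTIVE_MUTANT is not None and ACTIVE_MUTANT[0] == site_id:
        return ARITH_OPS[ACTIVE_MUTANT[1]](a, b)
    return ARITH_OPS[original_op](a, b)


def area_plus_border(w, h):
    """Metaprogram form of `w * h + 2`; site 0 is the `*`, site 1 is the `+`."""
    return arith(1, "+", arith(0, "*", w, h), 2)


def run_mutants(func, mutants, tests):
    """Return the mutants killed by the tests, i.e. those producing a wrong output."""
    global ACTIVE_MUTANT
    killed = []
    for mutant in mutants:
        ACTIVE_MUTANT = mutant
        try:
            if any(func(*args) != expected for args, expected in tests):
                killed.append(mutant)
        finally:
            ACTIVE_MUTANT = None
    return killed


MUTANTS = [(0, "+"), (0, "-"), (1, "-"), (1, "*")]
TESTS = [((2, 2), 6)]  # (arguments, expected output of the original program)
KILLED = run_mutants(area_plus_border, MUTANTS, TESTS)
```

With the single test `(2, 2)`, the mutant that replaces `*` with `+` survives because `2 + 2` equals `2 * 2`. A surviving mutant like this is the signal mutation analysis uses to flag weak test data.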
TL;DR: A human oriented object programming system provides an interactive and dynamic modeling system to assist in the incremental building of computer programs, which facilitates the development of complex computer programs such as operating systems and large applications with graphic user interfaces (GUIs).
Abstract: A human oriented object programming system provides an interactive and dynamic modeling system to assist in the incremental building of computer programs which facilitates the development of complex computer programs such as operating systems and large applications with graphic user interfaces (GUIs). A program is modeled as a collection of units called components. A component represents a single compilable language element such as a class or a function. The three major functionalities are the database, the compiler and the build mechanism. The database stores the components and properties. The compiler, along with compiling the source code of a property, is responsible for calculating the dependencies associated with a component. The build mechanism uses properties of components along with the compiler generated dependencies to correctly and efficiently sequence the compilation of components during a build process.
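One plausible reading of the build mechanism is a dependency-ordered, incremental rebuild: recompile only the changed components and their dependents, in an order that respects the recorded dependencies. The sketch below assumes the dependencies are available as a plain mapping. The component names and the functions `dirty_components` and `rebuild_sequence` are hypothetical, not taken from the patent.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def dirty_components(dependencies, changed):
    """Changed components plus everything that depends on them, directly or transitively."""
    dirty = set(changed)
    grew = True
    while grew:
        grew = False
        for component, deps in dependencies.items():
            if component not in dirty and dirty & set(deps):
                dirty.add(component)
                grew = True
    return dirty


def rebuild_sequence(dependencies, changed):
    """Dirty components in an order where each one follows the components it depends on."""
    dirty = dirty_components(dependencies, changed)
    full_order = TopologicalSorter(dependencies).static_order()
    return [component for component in full_order if component in dirty]


# Hypothetical component database: component -> components it depends on.
DEPENDENCIES = {
    "Shape": set(),
    "Circle": {"Shape"},
    "draw": {"Circle"},
    "Logger": set(),
}

SEQUENCE = rebuild_sequence(DEPENDENCIES, {"Shape"})
```

Editing `Shape` forces `Circle` and `draw` to be recompiled after it, while the unrelated `Logger` is left alone.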
TL;DR: A unified approach to exploiting both task and data parallelism in a single framework with an existing language is taken, and a parallelizing Fortran compiler for the iWarp system is implemented based on this approach.
Abstract: For many applications, achieving good performance on a private memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different tradeoffs between task and data parallelism are appropriate for a parallel system. Most existing compilers focus on only one of data parallelism and task parallelism. Therefore, to achieve the desired results, the programmer must separately program the data and task parallelism. We have taken a unified approach to exploiting both kinds of parallelism in a single framework with an existing language. This approach eases the task of programming and exposes the tradeoffs between data and task parallelism to the compiler. We have implemented a parallelizing Fortran compiler for the iWarp system based on this approach. We discuss the design of our compiler, and present performance results to validate our approach.
TL;DR: The Cydra 5 is a VLIW minisupercomputer with hardware designed to accelerate a broad class of inner loops, presenting unique challenges to its compilers.
Abstract: The Cydra 5 is a VLIW minisupercomputer with hardware designed to accelerate a broad class of inner loops, presenting unique challenges to its compilers. We discuss the organization of its Fortran/77 compiler and several of the key approaches developed to fully exploit the hardware. These include the intermediate representation used; the preparation, overlapped scheduling, and register allocation of inner loops; the speculative execution model used to control global code motion; and the machine model and local instruction scheduling approach.
TL;DR: This research makes programmable syntax macros practical for syntactically rich languages by having the macro language be a minimal extension of the programming language, by introducing explicit code template operators into the macro language, and by using a type system to guarantee, at macro definition time, that all macros and macro functions only produce syntactically valid program fragments.
Abstract: Lisp has shown that a programmable syntax macro system acts as an adjunct to the compiler that gives the programmer important and powerful abstraction facilities not provided by the language. Unlike simple token substitution macros, such as are provided by CPP (the C preprocessor), syntax macros operate on Abstract Syntax Trees (ASTs). Programmable syntax macro systems have not yet been developed for syntactically rich languages such as C because rich concrete syntax requires the manual construction of syntactically valid program fragments, which is a tedious, difficult, and error prone process. Also, using two languages, one for writing the program, and one for writing macros, is another source of complexity. This research solves these problems by having the macro language be a minimal extension of the programming language, by introducing explicit code template operators into the macro language, and by using a type system to guarantee, at macro definition time, that all macros and macro functions only produce syntactically valid program fragments. The code template operators make the language context sensitive, which requires changes to the parser. The parser must perform type analysis in order to parse macro definitions, or to parse user code that invokes macros.
TL;DR: The Threaded Abstract Machine (TAM) as discussed by the authors is a self-scheduled machine language of parallel threads, which provides a path from data-flow-graph program representations to conventional control flow.
TL;DR: In this article, a development system of the present invention includes a compiler, a linker, and an interface; the compiler compiles source listings into object modules (which are initially stored in .OBJ files), and a librarian is provided for combining desired ones of the .OBJ files into one or more library files.
Abstract: A development system of the present invention includes a compiler, a linker, and an interface. The compiler serves to compile source listings into object modules (which are initially stored in .OBJ files). A librarian is provided for combining desired ones of the .OBJ files into one or more library files. For each library file, the librarian provides an Extended Dictionary of the present invention, which includes a Dependency List and an Unresolved Externals List for each module of the library. Methods are described for linking object modules from .OBJ files and library files, where library object modules which are not needed for the link may be determined before the libraries are scanned during the first pass of the linker. In this manner, library object modules which are not needed during subsequent linking operations can be skipped.
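The abstract does not give the layout of the Extended Dictionary, so the following is only a sketch of how a per-module Dependency List could let a linker decide which library modules are needed before scanning the library. The mappings `symbol_to_module` and `dependency_list`, the module names, and the function `needed_modules` are all hypothetical.

```python
def needed_modules(undefined_symbols, symbol_to_module, dependency_list):
    """Library modules required to resolve the given symbols, following dependency lists."""
    needed = set()
    # Start from the modules that define the symbols the program leaves undefined.
    worklist = [symbol_to_module[s] for s in undefined_symbols if s in symbol_to_module]
    while worklist:
        module = worklist.pop()
        if module in needed:
            continue
        needed.add(module)
        # Pull in every module this one depends on.
        worklist.extend(dependency_list.get(module, ()))
    return needed


# Hypothetical library contents.
SYMBOL_TO_MODULE = {"printf": "stdio", "malloc": "heap", "sin": "trig"}
DEPENDENCY_LIST = {"stdio": ["heap"], "heap": [], "trig": ["fpu"], "fpu": []}

NEEDED = needed_modules(["printf"], SYMBOL_TO_MODULE, DEPENDENCY_LIST)
SKIPPED = set(DEPENDENCY_LIST) - NEEDED
```

A program that only calls `printf` needs `stdio` and, through its dependency list, `heap`. The modules in `SKIPPED` never have to be read during the link.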
TL;DR: Two mechanisms are described by which an HPF compiler can deal with irregular computations effectively: one invokes a user specified mapping procedure via a set of compiler directives, and the other is a simple conservative method for recognizing when previously computed inspector results can be reused.
Abstract: The authors describe ways in which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements and computational load. The second is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors (e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). The authors present performance results for these mechanisms from a Fortran 90D compiler implementation.
TL;DR: A human oriented object programming system provides an interactive and dynamic process for the incremental building of computer programs which facilitates the development of complex computer programs such as operating systems and large applications with graphic user interfaces (GUIs) as discussed by the authors.
Abstract: A human oriented object programming system provides an interactive and dynamic process for the incremental building of computer programs which facilitates the development of complex computer programs such as operating systems and large applications with graphic user interfaces (GUIs). The program is modeled as a collection of units called components. A component represents a single compilable language element such as a class or a function. The major functionalities are the database, the compiler, and the build and link mechanisms. The database stores the components and properties. The compiler, along with compiling the source code of a property and generating object code, is responsible for calculating the dependencies associated with a component. The build mechanism uses properties of components along with the compiler generated dependencies to correctly and efficiently sequence the compilation of components during a build process. The link mechanism links all object code as the component stores it in the component database. Only updated components require linking operations.
TL;DR: In this article, a three-phase algorithm for deciding when to clone a procedure is presented, which aims to avoid unnecessary code growth by considering how the information exposed by cloning will be used during optimization.
TL;DR: An overview of a parallelizing compiler to automatically generate efficient code for large-scale parallel architectures from sequential input programs based on loop-level parallelism in dense matrix computations is presented.
Abstract: This paper presents an overview of a parallelizing compiler to automatically generate efficient code for large-scale parallel architectures from sequential input programs. This research focuses on loop-level parallelism in dense matrix computations. We illustrate the basic techniques the compiler uses by describing the entire compilation process for a simple example.
TL;DR: In this article, a hybrid compiler-interpreter comprising a compiler for "compiling" source program code, and an interpreter for interpreting the "compiled" code, is provided to a computer system.
Abstract: A hybrid compiler-interpreter comprising a compiler for "compiling" source program code, and an interpreter for interpreting the "compiled" code, is provided to a computer system. The compiler comprises a code generator that generates code in intermediate form with data references made on a symbolic basis. The interpreter comprises a main interpretation routine, and two data reference handling routines, a dynamic field reference routine for handling symbolic references, and a static field reference routine for handling numeric references. The dynamic field reference routine, when invoked, resolves a symbolic reference and rewrites the symbolic reference into a numeric reference. After rewriting, the dynamic field reference routine returns to the main interpretation routine without advancing program execution to the next instruction, thereby allowing the rewritten instruction with numeric reference to be reexecuted. The static field reference routine, when invoked, obtains data for the program from a data object based on the numeric reference. After obtaining data, the static field reference routine advances program execution to the next instruction before returning to the interpretation routine. The main interpretation routine selectively invokes the two data reference handling routines depending on whether the data reference in an instruction is a symbolic or a numeric reference.
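The control flow described here, where a symbolic reference is rewritten in place and the same instruction is then re-executed, can be sketched as a tiny interpreter loop. The instruction format and the opcode names `load_sym` and `load_num` are assumptions for illustration and are not taken from the patent.

```python
def run(code, field_names, data_object):
    """Interpret a list of [opcode, operand] instructions and return the loaded values.

    `code` is mutated: symbolic references are rewritten into numeric ones.
    """
    slots = {name: index for index, name in enumerate(field_names)}
    loaded = []
    pc = 0
    while pc < len(code):
        instruction = code[pc]
        if instruction[0] == "load_sym":
            # Dynamic field reference routine: resolve the name, rewrite the
            # instruction, and return WITHOUT advancing pc so it is re-executed.
            instruction[0], instruction[1] = "load_num", slots[instruction[1]]
        elif instruction[0] == "load_num":
            # Static field reference routine: fetch by slot number, then advance.
            loaded.append(data_object[instruction[1]])
            pc += 1
        else:
            raise ValueError(f"unknown opcode: {instruction[0]}")
    return loaded


CODE = [["load_sym", "y"], ["load_sym", "x"]]
RESULT = run(CODE, ["x", "y"], [10, 20])
```

After the first run, `CODE` holds only numeric references, so a second run over the same instructions skips name resolution entirely.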
TL;DR: This chapter discusses Macros, the Extensible Language, and its applications in Functional Programming and Object-Oriented Lisp, which is a very good introduction to this kind of programming.
Abstract: 1. The Extensible Language. 2. Functions. 3. Functional Programming. 4. Utility Functions. 5. Returning Functions. 6. Functions as Representation. 7. Macros. 8. When to Use Macros. 9. Variable Capture. 10. Other Macro Pitfalls. 11. Classic Macros. 12. Generalized Variables. 13. Computation at Compile-Time. 14. Anaphoric Macros. 15. Macros Returning Functions. 16. Macro-Defining Macros. 17. Read Macros. 18. Destructuring. 19. A Query Compiler. 20. Continuations. 21. Multiple Processes. 22. Nondeterminism. 23. Parsing with ATNs. 24. Prolog. 25. Object-Oriented Lisp. Appendix: Packages. Notes. Index.
TL;DR: Mint generates memory reference traces that can be used to drive simulations of multiprocessor systems and is a fast interpreter that slows down a simulated program by a factor of 20 to 70 compared to its native execution time.
Abstract: This document describes Mint, a MIPS code interpreter for parallel programs. Mint generates memory reference traces that can be used to drive simulations of multiprocessor systems. Mint executes in a single address space and interprets MIPS R3000 object code programs. For faster interpretation, blocks of straight-line code in the object program are executed natively by creating functions at run-time. Unlike other memory tracers that compile the memory tracing calls into the simulated program, Mint does not require recompiling the simulated program. Interpreting the object program has the advantage that no source is needed, the simulator is independent of the object program, and a total program trace, including library references, is easily generated. Mint is a fast interpreter. When generating events for every memory reference, the overhead of Mint typically slows down a simulated program by a factor of 20 to 70 compared to its native execution time.
TL;DR: A set of isomorphic control transformations that allow the compiler to apply local scheduling techniques to acyclic subgraphs of the control flow graph are presented, and the code motion complexities of global scheduling are eliminated.
Abstract: In this paper we present a set of isomorphic control transformations that allow the compiler to apply local scheduling techniques to acyclic subgraphs of the control flow graph. Thus, the code motion complexities of global scheduling are eliminated. This approach relies on a new technique, Reverse If-Conversion (RIC), that transforms scheduled If-Converted code back to the control flow graph representation. This paper presents the predicate internal representation, the algorithms for RIC, and the correctness of RIC. In addition, the scheduling issues are addressed and an application to software pipelining is presented.
TL;DR: The authors describe the implementation of the runtime system, which provides the concurrency and communication primitives between objects in a distributed collection and includes a description and performance results on four benchmark programs.
Abstract: pC++ is a language extension to C++ designed to allow programmers to compose concurrent aggregate collection classes which can be aligned and distributed over the memory hierarchy of a parallel machine in a manner modeled on the High Performance Fortran Forum (HPFF) directives for Fortran 90. pC++ allows the user to write portable and efficient code which will run on a wide range of scalable parallel computer systems. The first version of the compiler is a preprocessor which generates Single Program Multiple Data (SPMD) C++ code. Currently, it runs on the Thinking Machine CM-5, the Intel Paragon, the BBN TC2000, the Kendall Square Research KSR-1, and the Sequent Symmetry. The authors describe the implementation of the runtime system, which provides the concurrency and communication primitives between objects in a distributed collection. To illustrate the behavior of the runtime system, they include a description and performance results on four benchmark programs.
TL;DR: It is believed that the methodology to process data distribution, computation partitioning, communication system design and the overall compiler design can be used by the implementors of HPF compilers.
Abstract: Fortran 90D/HPF is a data parallel language with special directives to enable users to specify data alignment and distributions. The authors describe the design and implementation of a Fortran 90D/HPF compiler. Techniques for data and computation partitioning, communication detection and generation, and the run-time support for the compiler are discussed. Initial performance results for the compiler are presented. It is believed that the methodology to process data distribution, computation partitioning, communication system design and the overall compiler design can be used by the implementors of HPF compilers.
TL;DR: The design of Scalable Software Libraries for Distributed Memory Concurrent Computers and Methodologies for Parallel Program Development are presented.
Abstract: Part 1: Communications and Computations Libraries for Multicomputers. The Design of Scalable Software Libraries for Distributed Memory Concurrent Computers (J. Choi, J.J. Dongarra, D.W. Walker). Level 3 BLAS for Distributed Memory Concurrent Computers (J. Choi, J.J. Dongarra, D.W. Walker). Two Dimensional Basic Linear Algebra Communication Subprograms (J.J. Dongarra, R. van de Geijn, C. Whaley). Implementation of Linear Algebra and Communications Libraries on the T.node Reconfigurable Machine (C. Bonello, F. Desprez). Part 2: Post-Mortem Visualization of Parallel Programs. Process and Processor Interaction: Architecture Independent Visualisation Schema (E. Zabala, R. Taylor). Program Visualization by Integration of Advanced Compiler Technology with Configurable Views (D. Kimelman, G. Sang'udi). Distributed Monitoring for Scalable Massively Parallel Machines (S. Poinson, B. Tourancheau, X. Vigouroux). Standardization of Event Traces Considered Harmful or Is An Implementation of Object-Independent Event Trace Monitoring and Analysis Systems Possible? (B. Mohr). Programming Tools for Massively Parallel Supercomputers (T. Bemmerl). Part 3: Tools for Parallel Program Development. PVM and HENCE: Tools for Heterogeneous Network Computing (A. Beguelin et al.). ParaRex: A Programming Environment Integrating Execution Replay and Visualization (E. Leu, A. Schiper). Visage - Visualization of Attribute Graphs: A Foundation for a Parallel Programming Environment (A. Rudich, D. Zemik, G. Zodik). A Programming Environment Dedicated to a Model of Explicit Parallelism (M. Alabau et al.). ALPES: A Tool for the Performance Evaluation of Parallel Programs (J.P. Kitajima, C. Tron, B. Plateau). Part 4: Methodologies for Parallel Program Development. Libraries and Tools for Object Parallel Programming (D. Gannon). Stream-Based Interface in C++ for Programming Heterogeneous Systems (R. Pozo). Building Blocks for Iterative Solution of Linear Systems (R. Barrett et al.). Allocating Communication Channels to Parallel Tasks (D. Barthou, F. Gasperoni, U. Schwiegelshohn). Part 5: Automatic Parallelization. Compiling Sequential Programs for Distributed Memory Parallel Computers with Pandore II (F. Andre, O. Cheron, J.-L. Pazat). Loop Nest Scheduling and Transformations (A. Darte, T. Risset, Y. Robert). Interprocedural Analyses for Programming Environments (F. Irigoin). Array Redistributions on Boolean Cubes (M. Loi).
TL;DR: This work has developed techniques to more precisely determine which compilations have actually been invalidated by a change to the program's source.
Abstract: While efficient new algorithms for interprocedural data-flow analysis have made these techniques practical for use in production compilation systems, a new problem has arisen: collecting and using interprocedural information in a compiler introduces subtle dependences among the procedures of a program. If the compiler depends on interprocedural information to optimize a given module, a subsequent editing change to another module in the program may change the interprocedural information and necessitate recompilation. To avoid having to recompile every module in a program in response to a single editing change to one module, we have developed techniques to more precisely determine which compilations have actually been invalidated by a change to the program's source.
TL;DR: The ParaScope Editor is a new kind of program construction tool; one that not only manages text, but also presents the user with insights into the semantic structure of the program being constructed.
Abstract: The ParaScope Editor is an interactive parallel programming tool that assists knowledgeable users in developing scientific Fortran programs. It displays the results of sophisticated program analyses, provides a set of powerful interactive transformations, and supports program editing. This paper summarizes experiences of scientific programmers and tool designers using the ParaScope Editor. We evaluate existing features and describe enhancements in three key areas: user interface, analysis, and transformation. Many existing features prove crucial to successful program parallelization. They include interprocedural array side-effect analysis and program and dependence view filtering. Desirable functionality includes improved program navigation based on performance estimation, incorporating user assertions in analysis and more guidance in selecting transformations. These results offer insights for the authors of a variety of programming tools and parallelizing compilers.
TL;DR: A human oriented object programming system provides an interactive and dynamic process for the incremental building of computer programs which facilitates the development of complex computer programs such as operating systems and large applications with graphic user interfaces (GUIs) as mentioned in this paper.
Abstract: A human oriented object programming system provides an interactive and dynamic process for the incremental building of computer programs which facilitates the development of complex computer programs such as operating systems and large applications with graphic user interfaces (GUIs). The program is modeled as a collection of units called components. A component represents a single compilable language element such as a class or a function. The three major functionalities are the database, the compiler and the build mechanism. The database stores the components and properties. The compiler, along with compiling the source code of a property, is responsible for calculating the dependencies associated with a component. The build mechanism uses properties of components along with the compiler generated dependencies to correctly and efficiently sequence the compilation of components during a build process.
TL;DR: Fiat is a framework that provides parameterized templates and common drivers to support interprocedural data-flow analysis and procedure cloning and is suitable for use in systems with distinct intermediate code representations and enables sharing of system software across research platforms.
Abstract: The fiat system is a compiler-building tool that enables rapid prototyping of interprocedural analysis and compilation systems. Fiat is a framework because it provides parameterized templates and common drivers to support interprocedural data-flow analysis and procedure cloning. Further, fiat provides the complex underlying support required to collect and manage information about the procedures in the program. Fiat's reliance on system-independent abstractions makes it suitable for use in systems with distinct intermediate code representations and enables sharing of system software across research platforms. Demand-driven analysis maintains a clean separation between interprocedural analysis problems, enabling tools built upon fiat to solve only the data-flow problems of immediate interest. Fiat drives interprocedural optimization in the ParaScope programming tools at Rice University and the SUIF compiler at Stanford University. Fiat has proven to be a valuable aid in development of a large number of interprocedural tools, including a data race detection system, a static performance estimation tool, a distributed-memory compiler for Fortran D, an interactive parallelizing tool and an automatic parallelizer in the SUIF compiler.
TL;DR: GPMB (Global Pipelining with Multiple Branches) is presented which is based on architectures supporting multi-way branching and multiple control flows and performs as well as modulo scheduling, and for branch-intensive loops, GPMB performs much better than software pipelining assuming the constraint of one two-way branch per cycle.
Abstract: Compile-time code transformations which expose instruction-level parallelism (ILP) typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase ILP along some execution sequences if the constraints from alternative execution sequences can be ignored. Traditionally, profile information has been used to identify important execution sequences for aggressive compiler optimization and scheduling. The paper presents a set of static program analysis heuristics used in the IMPACT compiler to identify execution sequences for aggressive optimization. The authors show that the static program analysis heuristics identify execution sequences without hazardous conditions that tend to prohibit compiler optimizations. As a result, the static program analysis approach often achieves optimization results comparable to profile information in spite of its inferior branch prediction accuracies. This observation makes a strong case for using static program analysis with or without profile information to facilitate aggressive compiler optimization and scheduling.
TL;DR: This article develops a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers and demonstrates the power of the techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library.
Abstract: In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. Additionally, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather than to use many small messages. To run well on such machines, software must exploit these features. We believe it is too onerous for a programmer to do this by hand, so we have been exploring the use of restructuring compiler technology for this purpose. In this article, we start with a language like HPF-Fortran with user-specified data distribution and develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. We demonstrate the power of our techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library. An important feature of our approach is that we model loop transformation using invertible matrices and integer lattice theory.
TL;DR: Better analysis, run-time support, and flexibility are required for the prototype Fortran D compiler to be useful for a wider range of programs.
Abstract: Fortran D is a version of Fortran enhanced with data decomposition specifications. Case studies illustrate strengths and weaknesses of the prototype Fortran D compiler when compiling linear algebra codes and whole programs. Statement groups, execution conditions, inter-loop communication optimizations, multi-reductions, and array kills for replicated arrays are identified as new compilation issues. On the Intel iPSC/860, the output of the prototype Fortran D compiler approaches the performance of hand-optimized code for parallel computations, but needs improvement for linear algebra and pipelined codes. The Fortran D compiler outperforms the CM Fortran compiler (2.1 beta) by a factor of four or more on the TMC CM-5 when not using vector units. Better analysis, run-time support, and flexibility are required for the prototype compiler to be useful for a wider range of programs.
TL;DR: The evolution of C++ is traced from C with Classes to the current ANSI and ISO standards work and the explosion of use, interest, commercial activity, compilers, tools, environments, and libraries.
Abstract: This paper outlines the history of the C++ programming language. The emphasis is on the ideas, constraints, and people that shaped the language, rather than the minutiae of language features. Key design decisions relating to language features are discussed, but the focus is on the overall design goals and practical constraints. The evolution of C++ is traced from C with Classes to the current ANSI and ISO standards work and the explosion of use, interest, commercial activity, compilers, tools, environments, and libraries.