TL;DR: Novel strategies are presented that significantly reduce the memory required by a large class of data-intensive multimedia applications; they are equally well suited to parallel and mono-processing applications.
Abstract: In this paper, we present novel strategies that are capable of significantly reducing the required memory sizes for a large class of data-intensive multimedia applications. The size reduction is obtained by reusing memory locations for arrays as much as possible through optimization of the storage order. These strategies are equally well suited for parallel and mono-processing applications, and are particularly useful in an embedded application context, where memory size is usually one of the main cost factors. Their feasibility and effectiveness are demonstrated by experimental results for some real-life multimedia applications, for which a considerable size reduction has been obtained.
TL;DR: The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M and B.
Abstract: We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the ‘right’ block from any disk, and using the technique of flushing, it evicts, without any I/O overhead, just the ‘right’ blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs. The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M and B. Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.
TL;DR: A class of cellular automata for reaction-diffusion systems is presented, based on a local average for the diffusive dynamics, and closely related to finite difference schemes.
Abstract: A class of cellular automata for reaction-diffusion systems is presented. It is based on a local average for the diffusive dynamics, and closely related to finite difference schemes. The reactive dynamics is implemented as a lookup-table with probabilistic rules. The rules are derived directly and systematically from the given differential equations, using probabilistic rounding to enforce the discretization of the concentration variables. For quantitatively correct modeling, such probabilistic rules are usually necessary, but in some cases a deterministic version proves sufficient.
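The probabilistic rounding mentioned above can be illustrated with a minimal sketch. It is not the authors' automaton: the 1D periodic lattice, the three-cell average, and the names `probabilistic_round` and `diffuse_step` are assumptions chosen for illustration. The one property it is meant to show is that rounding up with probability equal to the fractional part keeps the expected value equal to the unrounded local average.

```python
import math
import random


def probabilistic_round(x, rng=random):
    """Round x down or up so that the expected result equals x.

    x is rounded up with probability equal to its fractional part.
    """
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)


def diffuse_step(cells, rng=random):
    """One diffusion step on a periodic 1D lattice (hypothetical layout).

    Each cell takes the average of itself and its two neighbours, and the
    result is forced back to an integer by probabilistic rounding.
    """
    n = len(cells)
    return [
        probabilistic_round(
            (cells[i - 1] + cells[i] + cells[(i + 1) % n]) / 3.0, rng
        )
        for i in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    state = [0, 0, 9, 0, 0, 0]
    for _ in range(3):
        state = diffuse_step(state, rng)
        print(state)
```

A deterministic variant would replace `probabilistic_round` with ordinary rounding; the abstract notes that this is sufficient only in some cases.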
TL;DR: A distributed debugging engine (DDBG) assists the user in debugging GRAPNEL programs on distributed memory computer architectures and Tape/PVM and PROVE support the performance monitoring and visualization of parallel programs developed in the GRADE environment.
Abstract: To provide high-level graphical support for PVM (Parallel Virtual Machine) based program development, a complex programming environment (GRADE) is being developed. GRADE currently provides tools to construct, execute, debug, monitor and visualize message-passing parallel programs. It offers a high-level graphical programming abstraction mechanism to construct parallel applications by introducing a new graphical language called GRAPNEL. GRADE also provides the programmer with the same graphical user interface during the program design and debugging stages. A distributed debugging engine (DDBG) assists the user in debugging GRAPNEL programs on distributed memory computer architectures. Tape/PVM and PROVE support the performance monitoring and visualization of parallel programs developed in the GRADE environment.
TL;DR: A scalable parallel implementation of the classical Benders decomposition algorithm for two-stage stochastic linear programs using a primal-dual, path-following algorithm for solving the scenario subproblems is developed that alleviates the difficulties of load balancing.
Abstract: We develop a scalable parallel implementation of the classical Benders decomposition algorithm for two-stage stochastic linear programs. Using a primal-dual, path-following algorithm for solving the scenario subproblems, we develop a parallel implementation that alleviates the difficulties of load balancing. Furthermore, the dual and primal step calculations can be implemented using a data-parallel programming paradigm. With this approach the code effectively utilizes both the multiple, independent processors and the vector units of the target architecture, the Connection Machine CM-5. The master program, which is usually the limiting step, is solved very efficiently using the interior point code LoQo on the front-end workstation. The implementation scales almost perfectly with problem and machine size. Extensive computational testing is reported with several large problems with up to 2 million constraints and 13.8 million variables.
TL;DR: A general methodology for the design of adaptive routing algorithms for networks with irregular topology, based on the application of the theory of deadlock avoidance proposed in [14], which increases routing flexibility by allowing cyclic dependencies between channels.
Abstract: Networks of workstations are rapidly emerging as a cost-effective alternative to parallel computers. Switch-based interconnects with irregular topologies allow the wiring flexibility, scalability and incremental expansion capability required in this environment. The irregularity also makes routing and deadlock avoidance on such systems quite complicated. Current proposals avoid deadlock by removing cyclic dependencies between channels. As a consequence, many messages are routed following non-minimal paths, increasing latency and wasting resources. In this paper, we propose a general methodology for the design of adaptive routing algorithms for networks with irregular topology. These routing algorithms allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. The methodology is based on the application of the theory of deadlock avoidance proposed in [14], which increases routing flexibility by allowing cyclic dependencies between channels. As an example of application, we propose an adaptive routing algorithm for Autonet. It can be implemented either by duplicating physical channels or by splitting each physical channel into two virtual channels. In the former case, the implementation does not require a new switch design. It only requires changing the routing tables and adding links in parallel with existing ones, taking advantage of spare switch ports. In the latter case, a new switch design is required but the network topology is not changed. Preliminary evaluation results show that the new routing algorithm is able to increase throughput for random traffic by a factor of up to 2.8 with respect to the original algorithm, also reducing latency.
TL;DR: The proposed algorithm guarantees an optimal response time while ensuring data consistency at system quiescence, and is compared with two well-known optimistic concurrency control protocols for groupware applications: dOPT and ORESTE.
Abstract: Replicated data consistency is a key issue in the design of distributed real-time groupware applications. In this paper, we propose a new protocol to cope with this problem. The proposed algorithm guarantees an optimal response time while ensuring data consistency at system quiescence. The originality of our proposal lies in the fact that neither locks nor clocks nor global information are required to establish data consistency. Instead, direct dependency relations between generated operations, together with an operation transformation mechanism, are used. The coupling of these two mechanisms is shown to realize a good trade-off between the different requirements of groupware applications. Advantages of our approach are illustrated by comparing the algorithm to two well-known optimistic concurrency control protocols for groupware applications: dOPT and ORESTE.
TL;DR: A hybrid scheduling approach is presented that combines demand driven and data parallel techniques, which leads to a scalable and efficient parallel implementation of the ray tracing algorithm with little restriction on the size of the model data base to be rendered.
Abstract: Ray tracing is a powerful technique to generate realistic images of 3D scenes. However, rendering complex scenes may easily exceed the processing and memory capabilities of a single workstation. Distributed processing offers a solution if the algorithm can be parallelized in an efficient way. In this paper a hybrid scheduling approach is presented that combines demand driven and data parallel techniques. Whether a task is processed demand driven or data parallel is decided by the data intensity of the task and the amount of data locality (coherence) that will be present in the task. By combining demand driven and data driven tasks, a better load balance may be achieved, while at the same time the communication is spread evenly across the network. This leads to a scalable and efficient parallel implementation of the ray tracing algorithm with little restriction on the size of the model data base to be rendered.
TL;DR: This article employs stabilized finite element formulations to solve the Navier-Stokes equations on a Thinking Machines CM-5 supercomputer, and the coupled, nonlinear equations generated every time step are solved iteratively, with an element-vector based evaluation technique.
Abstract: In this article we present parallel 3D finite element computation of unsteady incompressible flows around circular cylinders. We employ stabilized finite element formulations to solve the Navier-Stokes equations on a Thinking Machines CM-5 supercomputer. The time integration is based on an implicit method, and the coupled, nonlinear equations generated every time step are solved iteratively, with an element-vector based evaluation technique. This strategy enables us to carry out these computations with millions of coupled, nonlinear equations, and thus resolve the flow features in great detail. At Reynolds numbers 300 and 800, our results indicate strong 3D features arising from the instability of the columnar vortices forming the Karman street. At Re = 10 000 we employ a large eddy simulation (LES) turbulence model.
TL;DR: RAMA uses hashing to pseudo-randomly distribute data to all of its disks, ensuring high bandwidth regardless of access pattern and eliminating bottlenecks in file block accesses.
Abstract: Modern massively parallel file systems provide high bandwidth file access by striping files across arrays of disks attached to a few specialized I/O nodes. However, these file systems are hard to use and difficult to integrate with workstations and tertiary storage. RAMA addresses these problems by providing a high-performance massively parallel file system with a simple interface. RAMA uses hashing to pseudo-randomly distribute data to all of its disks, ensuring high bandwidth regardless of access pattern and eliminating bottlenecks in file block accesses. This flexibility does not cause a large loss of performance: RAMA's simulated performance is within 10–15% of the optimum performance of a similarly-sized striped file system, and is a factor of 4 or more better than a striped file system with poorly laid out data.
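A minimal sketch of hash-based block placement, in the spirit of the pseudo-random distribution described above. The abstract does not specify RAMA's hash function or key layout, so everything here is an assumption for illustration: the function name `disk_for_block`, the `file_id:block_no` key, and the use of SHA-256.

```python
import hashlib


def disk_for_block(file_id, block_no, num_disks):
    """Map a (file, block) pair to a disk index in [0, num_disks).

    Hypothetical placement rule: hash the pair and reduce modulo the number
    of disks, so that consecutive blocks of one file scatter across disks
    regardless of the order in which they are accessed.
    """
    key = f"{file_id}:{block_no}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_disks


if __name__ == "__main__":
    num_disks = 8
    load = [0] * num_disks
    for block in range(1000):
        load[disk_for_block("results.dat", block, num_disks)] += 1
    print(load)  # blocks per disk; roughly even for a good hash
```

Because placement depends only on the key, any node can locate a block without consulting a central layout table, which is one plausible reading of how hashing removes the bottleneck.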
TL;DR: Several classes of optoelectronic architectures are presented that can support a high-bandwidth data firehose and mention applications in switching, FFT, sorting, matrix-vector processing, database search, and processor-to-memory interconnect.
Abstract: Free-space optical interconnects will soon be able to provide input/output bandwidths to a VLSI chip in excess of a terabit per second. The successful application of this technology to parallel distributed processing systems depends on the development of high-bandwidth interconnects and the ability of the architecture to sustain a stream of data at these bandwidths to the processing elements. We review examples of computational tasks that require scalable input/output, that is, computations where the I/O bandwidth of a processing element must grow in proportion to its computational bandwidth. We present several classes of optoelectronic architectures that can support a high-bandwidth data firehose and mention applications in switching, FFT, sorting, matrix-vector processing, database search, and processor-to-memory interconnect.
TL;DR: The sequence of steps by which MPICH, a high-performance, portable implementation of the Message-Passing Interface (MPI) standard, was ported to the NEC SX-4 raised issues that are important for shared-memory programming in general and shed light on both MPICH and the SX-4.
Abstract: In this article we recount the sequence of steps by which MPICH, a high-performance, portable implementation of the Message-Passing Interface (MPI) standard, was ported to the NEC SX-4, a high-performance parallel supercomputer. Each step in the sequence raised issues that are important for shared-memory programming in general and shed light on both MPICH and the SX-4. The result is a low-latency, very high bandwidth implementation of MPI for the NEC SX-4. In the process, MPICH was also improved in several general ways.
TL;DR: Consideration is given to the techniques required to support adaptive analysis of automatically generated unstructured meshes on distributed memory MIMD parallel computers, and procedures to automatically generate and adaptively refine meshes in parallel are given.
Abstract: Consideration is given to the techniques required to support adaptive analysis of automatically generated unstructured meshes on distributed memory MIMD parallel computers. Emphasis is placed on the structures needed to support effective parallel computations when the numerical discretization, the mesh, is defined and evolves during the computation. The key base structures are a distributed mesh based on a topological hierarchy, and a parallel distributed octree. Parallel control of the mesh and octree structures is done through a set of partition communication operations and entity migration routines. Load balance is maintained through iterative load balance, or distributed repartitioning. Building on these structures, procedures to automatically generate and adaptively refine meshes in parallel, starting from CAD geometric models, are given. Finally, the combination of these techniques to produce a parallel automated analysis procedure is demonstrated.
TL;DR: A fault-tolerant wormhole routing algorithm on mesh-connected processors that allows the fault polygons to overlap and uses the position of the fault region relative to the current channel to route a message around overlapped fault polygons.
Abstract: A fault-tolerant wormhole routing algorithm on mesh-connected processors is proposed. The proposed algorithm is based on the solid fault model and allows the fault polygons to overlap. The algorithm uses the position of the fault region relative to the current channel to route a message around overlapped fault polygons. A node deactivating algorithm to convert a non-solid fault region into a solid fault region is also proposed. The proposed routing algorithm uses four virtual channels and is deadlock- and livelock-free.
TL;DR: The goal of the Remote Computation System (RCS) is to provide easy access to modern parallel algorithms on supercomputers for the inexperienced user, through an easy-to-use mechanism for using computational resources remotely.
Abstract: Today many high performance computers are reachable over some network. However, the access and use of these computers is often complicated. This prevents many users from working on such machines. The goal of the Remote Computation System (RCS) is to provide easy access to modern parallel algorithms on supercomputers for the inexperienced user. RCS has an easy-to-use mechanism for using computational resources remotely. The computational resources available are used as efficiently as possible in order to minimize the response time.
TL;DR: All the tools in the MAD environment follow an extensible and modular debugging strategy based on a graphical user interface that helps the user in monitoring and analyzing message passing programs.
Abstract: Debugging parallel programs can be tedious and difficult. Therefore the programmer needs support from tools that provide features for error detection and performance analysis. The MAD environment is such a toolset. It helps the user in monitoring and analyzing message passing programs. Communication errors and performance bottlenecks are visualized based on an event graph. Source code connection provides a link between visualized events and the original lines of code or a control and data flow representation. A main part of the environment is dedicated to race conditions. After evaluation of events, which might be reordered during successive program runs, localization of message races can be performed by means of trace driven simulation. All the tools in the MAD environment follow an extensible and modular debugging strategy based on a graphical user interface.
TL;DR: A parallel unstructured finite element reacting flow solver designed for message passing MIMD computers is described, indicating that very high computational rates and high scaled efficiencies can be achieved for large problems despite the use of sparse matrix data structures and the required unstructured data communication.
Abstract: A parallel unstructured finite element (FE) reacting flow solver designed for message passing MIMD computers is described. This implementation employs automated partitioning algorithms for load balancing unstructured grids, a distributed sparse matrix representation of the global FE equations, and parallel Krylov subspace iterative solvers. In this paper, a number of issues related to the efficient implementation of parallel unstructured mesh applications are presented. These issues include the differences between structured and unstructured mesh parallel applications, major communication kernels for unstructured Krylov iterative solvers, automatic mesh partitioning algorithms, and the influence of mesh partitioning metrics and single-node CPU performance on parallel performance. Results are presented for example FE heat transfer, fluid flow and full reacting flow applications on a 1024 processor nCUBE 2 hypercube and a 1904 processor Intel Paragon. Results indicate that very high computational rates and high scaled efficiencies can be achieved for large problems despite the use of sparse matrix data structures and the required unstructured data communication.
TL;DR: The portable scalable implementation of the NRL Layered Ocean Model is described, based primarily on the tiled data parallel programming paradigm, which is sufficiently general that the actual technique used on a given machine to obtain scalability can be selected at compile time from: data parallel, SPMD message passing, autotasking, or SPMD message passing between multi-processor autotasked systems.
Abstract: We describe the portable scalable implementation of the NRL Layered Ocean Model (NLOM). Scalability is based primarily on the tiled data parallel programming paradigm. This is sufficiently general that the actual technique used on a given machine to obtain scalability can be selected at compile time from: (i) data parallel, (ii) SPMD message passing, (iii) autotasking, or (iv) SPMD message passing between multi-processor autotasked systems. The code is thus portable onto all machine types likely to be used by ocean modelers.
TL;DR: The main objective of this paper is to present a characterization for the existence of the WZ factorization and prove uniqueness of the factorization.
Abstract: Evans [D.J. Evans, Implicit matrix elimination schemes, Int. J. Computer Math. 48 (1993) 229–237.] introduced the parallel implicit elimination (PIE) method for solving large order linear equations. The quadrant interlocking factorization (QIF) method of Evans and Hatzopoulos [D.J. Evans, M. Hatzopoulos, A parallel linear system solver, Int. J. Computer Math. 7 (1979) 227–238.] can be viewed as a factorization version of PIE method. The main objective of this paper is to present a characterization for the existence of the WZ factorization. We also prove uniqueness of the factorization.
TL;DR: The data distribution and the communication scheme for the sparse matrix operations of the preconditioned CG are based on the analysis of the indices of the non-zero elements, and a fully local incomplete Cholesky preconditioner is presented.
Abstract: Conjugate gradient (CG) methods to solve sparse systems of linear equations play an important role in numerical methods for solving discretized partial differential equations. The large size and the condition of many technical or physical applications in this area result in the need for efficient parallelization and preconditioning techniques of the CG method, in particular on massively parallel machines. Here, the data distribution and the communication scheme for the sparse matrix operations of the preconditioned CG are based on the analysis of the indices of the non-zero elements. Polynomial preconditioning is shown to reduce global synchronizations considerably, and a fully local incomplete Cholesky preconditioner is presented. On a PARAGON XP/S 10 with 138 processors, the developed parallel methods outperform diagonally scaled CG markedly with respect to both scaling behavior and execution time for many matrices from real finite element applications.
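For orientation, here is a minimal serial sketch of diagonally scaled (Jacobi-preconditioned) CG, the baseline the abstract compares against. It is not the paper's parallel implementation: the dense list-of-lists matrix, the function names, and the stopping rule are assumptions for illustration. The two dot products per iteration are the points where a distributed version would need a global reduction, which is the synchronization cost that polynomial preconditioning is said to reduce.

```python
def dot(u, v):
    """Inner product; in a distributed code this is a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product (a sparse format would be used in practice)."""
    return [dot(row, v) for row in A]


def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Uses conjugate gradients preconditioned with the inverse of diag(A).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x for the initial guess x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(jacobi_pcg(A, b))
```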
TL;DR: A practical experience is presented in adding a dynamic-load-balancing capability to an existing large parallel application, multiresolution molecular dynamics (MRMD), which is based on uniform mesh decomposition; the new scheme uses adaptive curvilinear coordinates to represent partition boundaries.
Abstract: We present a practical experience in adding a dynamic-load-balancing capability to an existing large parallel application — multiresolution molecular dynamics (MRMD) — which is based on uniform mesh decomposition. The new load-balancing scheme uses adaptive curvilinear coordinates to represent partition boundaries. Workloads are partitioned with a uniform 3-dimensional mesh in the curvilinear coordinate system. Simulated annealing is used to determine the optimal coordinate system which minimizes load imbalance and communication costs. The number of messages for performing simulations is minimal because of the underlying regular mesh topology. Periodic boundary conditions are naturally incorporated in the new scheme. Performance of the MRMD algorithm with the new load balancer has been tested for nonuniform multimillion-atom systems.
TL;DR: The structure of the Dimemas simulator and the workload used are presented, and the effects of processor scheduling policies, sequential independent processes, and communication parameters on both global system throughput and the individual response time of each parallel application are compared.
Abstract: Dimemas is a simulator that allows the study of message passing applications on distributed memory machines. Currently, we are using Dimemas to analyze the effects of different processor scheduling policies when several parallel applications share machine resources such as processors, interconnection network, etc. The effect of sequential independent processes on the parallel applications is also being studied with the simulator. This situation is fairly frequent on clusters of workstations running a mixture of parallel and sequential workloads. We also study the influence of communication parameters (network bandwidth and conflicts) on the system performance. The paper presents the structure of the simulator and the workload used. This workload is a mixture of jobs from the NAS parallel benchmarks. We finally compare the effects of the above mentioned factors on both global system throughput and individual response time on each of the different parallel applications.
TL;DR: A neural network approach to the on-line solution of linear inequality systems is considered; for each technique a novel neural network implementation is proposed, and the validity and performance of the proposed algorithms have been verified by computer simulation experiments.
Abstract: In this paper a neural network approach to the on-line solution of linear inequality systems is considered. Three different techniques are discussed and for each technique a novel neural network implementation is proposed. The first technique is a standard penalty method implemented as an analog neural network. The second technique is based on the transformation of inequality constraints into equality constraints with simple bounds on the variables. The transformed problem is then solved using least squares (LS) and least absolute values (LAV) optimisation criteria. The third technique makes use of the regularised total least squares criterion (RTLS). For each technique a suitable neural network architecture and associated algorithm in the form of nonlinear differential equations has been developed. The validity and performance of the proposed algorithms have been verified by computer simulation experiments. The analog neural networks are deemed to be particularly well suited for high throughput, real time applications.
TL;DR: This two stage algorithm has been implemented on a cluster of SUN workstations using the Parallel Virtual Machine (PVM) harness and the results of the investigation for a range of well known test problems taken from the MIPLIB set and others from the literature are reported.
Abstract: The sequential branch and bound algorithm is the most established method for solving mixed integer and discrete programming problems. It is based on the tree search of the possible subproblems of the original problem. There are two goals in carrying out a tree search, namely, (i) finding a good and ultimately the best integer solution, and (ii) proving that the best solution has been found or that no integer feasible solution exists. We call these stage 1 and stage 2 of the tree search. In general it is extremely difficult to choose the ideal search strategy in stage 1 or stage 2 for a given integer programming (IP) problem. On the other hand, by investigating a number of different strategies (and hence different search trees) a good solution can be reached quickly for many practical IP problems. Starting from this observation a parallel branch and bound algorithm has been designed which exploits this two stage approach. In the first stage we investigate a number of alternative search trees (forest search) in the hope of finding a good solution quickly. This we call the multiple heuristic search (MHS). In this approach the best integer solution is broadcast to other processors involved in MHS tree development. In the second stage we reorganise the search to investigate branches of a chosen tree in parallel. This two stage algorithm has been implemented on a cluster of SUN workstations using the Parallel Virtual Machine (PVM) harness [12]. The results of our investigation for a range of well known test problems taken from the MIPLIB set and others from the literature are reported here.
TL;DR: The proposed parallel algorithm is pragmatically feasible now that multiprocessor systems are becoming more and more popular, and has a cost of O(2^{n/2}), an improvement over previous literature.
Abstract: An n-element knapsack problem has 2^n possible solutions to search over, so an exhaustive search needs 2^n trials. Because of this exponential time, the problem is considered to be very hard. In the past decade, much effort has been made to find techniques which could lead to practical algorithms with reasonable running time. In 1994, Chang et al. proposed a brilliant parallel algorithm, which needs O(2^{n/8}) processors to solve the knapsack problem in O(2^{n/2}) time; that is, the cost of Chang et al.'s parallel algorithm is O(2^{5n/8}). In this paper, we propose a parallel algorithm that improves on Chang et al.'s parallel algorithm by reducing the time complexity to O(2^{3n/8}) with the same O(2^{n/8}) processors. Thus, the proposed parallel algorithm has a cost of O(2^{n/2}). It is an improvement over previous literature. We believe that the proposed parallel algorithm is pragmatically feasible now that multiprocessor systems are becoming more and more popular.
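The cost figures quoted in this abstract follow from the usual definition of the cost of a parallel algorithm as the number of processors multiplied by the running time. The arithmetic, written out with only the bounds stated in the abstract:

```latex
\begin{align*}
\text{cost} &= \text{processors} \times \text{time} \\
\text{Chang et al.:}\quad
  & O\!\left(2^{n/8}\right) \cdot O\!\left(2^{n/2}\right)
    = O\!\left(2^{n/8 + 4n/8}\right) = O\!\left(2^{5n/8}\right) \\
\text{Proposed:}\quad
  & O\!\left(2^{n/8}\right) \cdot O\!\left(2^{3n/8}\right)
    = O\!\left(2^{4n/8}\right) = O\!\left(2^{n/2}\right)
\end{align*}
```

With the processor count unchanged, the whole improvement comes from the running time, which drops by a factor of 2^{n/8}.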
TL;DR: A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations, and an iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest.
Abstract: A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations. In this framework, an imperfect loop nest is converted to a perfect loop nest using Abu-Sufah's Non-Basic-to-Basic-Loop transformation. Conditions for the legality of this transformation and techniques for their verification are discussed. An iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest. Since the converted loop nest is a perfect loop nest, data dependences can be extracted and optimal transformations can be selected for parallelism and/or locality in the normal manner. To generate the restructured code for a unimodular transformation, a code generation method is provided that produces the restructured code that is free of if statements by construction.
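Once a loop nest is perfect, the "normal manner" of selecting a transformation rests on a standard legality test: an integer matrix with determinant ±1 is a legal reordering if every dependence distance vector stays lexicographically positive after the transformation. The sketch below shows that test for depth-2 nests only. It is not the paper's framework or its code generation method, and the function names and example dependence vectors are assumptions for illustration.

```python
def det2(T):
    """Determinant of a 2x2 matrix given as nested tuples or lists."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]


def is_unimodular(T):
    """True if T has integer entries and determinant +1 or -1."""
    entries_are_int = all(isinstance(e, int) for row in T for e in row)
    return entries_are_int and det2(T) in (1, -1)


def apply(T, v):
    """Map an iteration or dependence vector v through T."""
    return (
        T[0][0] * v[0] + T[0][1] * v[1],
        T[1][0] * v[0] + T[1][1] * v[1],
    )


def lex_positive(v):
    """True if the first non-zero component of v is positive."""
    for c in v:
        if c > 0:
            return True
        if c < 0:
            return False
    return False


def is_legal(T, dependences):
    """Legality test for a unimodular transformation of a depth-2 nest."""
    return is_unimodular(T) and all(
        lex_positive(apply(T, d)) for d in dependences
    )


if __name__ == "__main__":
    interchange = ((0, 1), (1, 0))
    skew = ((1, 0), (1, 1))
    print(is_legal(interchange, [(1, 0), (0, 1)]))
    print(is_legal(interchange, [(1, -1)]))
    print(is_legal(skew, [(1, -1)]))
```

Loop interchange is rejected for the dependence (1, -1) because the transformed vector (-1, 1) would run the sink before the source, while skewing keeps that dependence lexicographically positive.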
TL;DR: It is shown that this block variant of the GMRES method for solving general unsymmetric linear systems with block size s, denoted by BVGMRES(s, m), is theoretically equivalent to the GMRES(s, m) method.
Abstract: This paper presents a block variant of the GMRES method for solving general unsymmetric linear systems. This algorithm generates a transformed Hessenberg matrix by solely using block matrix operations and block data communications. It is shown that this algorithm with block size s, denoted by BVGMRES(s, m), is theoretically equivalent to the GMRES(s, m) method. The numerical results demonstrate that this algorithm can be more efficient than the standard GMRES method on a cache based single CPU computer with optimized BLAS kernels. Furthermore, the gain in efficiency is more significant on MPPs due to both efficient block operations and efficient block data communications. Preliminary numerical results on some real-world problems also show that this algorithm may be stable up to some reasonable block size.
TL;DR: Devise, a generic visualization tool designed to allow an arbitrary number of different but related data streams to be integrated and explored visually in a flexible manner, is demonstrated in three case studies that display performance data from a variety of sources side by side.
Abstract: Performance tuning a parallel application involves integrating performance data from many components of the system, including the message passing library, performance monitoring tool, resource manager, operating system, and the application itself. The current practice of visualizing these data streams using a separate, customized tool for each source is inconvenient from a usability perspective, and there is no easy way to visualize the data in an integrated fashion. We demonstrate a solution to this problem using Devise, a generic visualization tool which is designed to allow an arbitrary number of different but related data streams to be integrated and explored visually in a flexible manner. We display data emanating from a variety of sources side by side in three case studies. First we interface the Paradyn parallel performance tool and Devise, using two simple data export modules and Paradyn's simple visualization interface. We show several Devise/Paradyn visualizations which are useful for performance tuning parallel codes, and which incorporate data from Unix utilities and application output. Next we describe the visualization of trace data from a parallel application running in a Condor cluster of workstations. Finally we demonstrate the utility of Devise visualizations in a study of Condor cluster activity.
TL;DR: This paper presents a strategy to handle very complex scenes for radiosity computation by splitting the problem into subproblems, using virtual interface and visibility masks, and is able to achieve better data locality than other standard solutions.
Abstract: This paper presents a strategy to handle very complex scenes for radiosity computation. Compared to other radiosity algorithms, our solution focuses on the ability to compute the radiosity in local environments instead of solving the problem for the whole environment. By splitting the problem into subproblems, using virtual interface and visibility masks, our technique is able to achieve better data locality than other standard solutions. We present an implementation of visibility masks on a distributed memory parallel computer (Intel Paragon XP/S).
TL;DR: This paper discusses the limits of scalability in massively parallel systems and addresses a simpler and, in the authors' view, more promising way of doing parallel computing.
Abstract: Scalability. Few would disagree that the future of high-performance computing lies with massively parallel systems (MPS), since there are major physical limitations to the clock rate of a single processor. Massively parallel systems are required to be scalable in the sense that their performance should be proportional to the number of processors. However, a feasible architecture for a scalable massively parallel system is still wanting, as true, i.e., unlimited, scalability is not only theoretically impossible but even in the practical sense cannot be achieved on a range of more than an order of magnitude in the number of processors. Whatever a system’s architecture, interconnect, or programming model, something will not scale: the throughput or latency of the interconnect, its cost, or the synchronization overheads. Since all these components contribute to performance in different ways, the issue of scalability is a very complex one indeed. So in what sense can one argue for scalability of massively parallel systems? There have been quite a few attempts to define it (see [1]) on the basis of some strong assumptions regarding the nature of parallel computation. The most common assumption that is being made in such analyses is that the processors run some predominantly local processes which require little external communication (it manifests itself in the parameter “communication to computation ratio” assumed to be small and used in all but a few performance models). By migrating these self-contained processes and placing several of them per processor to balance out computational work across the system, it is believed that scalability may be achieved without requiring physically unfeasible networking and/or dramatically different computational models. In reality, however, process concurrency of such kind continues to face various fundamental limitations from data mapping to dynamic load balancing to program paradigm issues.
Data parallelism. We would like to address a simpler, and in our view more promising, way of doing parallel computing