Top 223 papers presented at Parallel Computing in 1997

Journal Article•10.1016/S0167-8191(97)00089-6•

Memory size reduction through storage order optimization for embedded parallel multimedia applications

Eddy De Greef¹, F. Catthoor², F. Catthoor¹, Hugo De Man¹, Hugo De Man²•Institutions (2)

Katholieke Universiteit Leuven¹, The Catholic University of America²

1 Dec 1997

TL;DR: Novel strategies that are capable of significantly reducing the required memory sizes for a large class of data-intensive multimedia applications are presented, and are equally well suited for parallel and mono-processing applications.

Abstract: In this paper, we present novel strategies that are capable of significantly reducing the required memory sizes for a large class of data-intensive multimedia applications. The size reduction is obtained by reusing memory locations for arrays as much as possible through optimization of the storage order. These strategies are equally well suited for parallel and mono-processing applications, and are particularly useful in an embedded application context, where memory size is usually one of the main cost factors. Their feasibility and effectiveness are demonstrated by experimental results for some real-life multimedia applications, for which a considerable size reduction has been obtained.

123 citations

Journal Article•10.1016/S0167-8191(97)00015-X•

Simple randomized mergesort on parallel disks

Rakesh D. Barve¹, Edward F. Grove¹, Jeffrey Scott Vitter¹•Institutions (1)

Duke University¹

1 Jun 1997

TL;DR: The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M and B.

Abstract: We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the ‘right’ block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the ‘right’ blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs. The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M and B. Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.

100 citations

Journal Article•10.1016/S0167-8191(97)00081-1•

Cellular automata for reaction-diffusion systems

Jörg R. Weimar¹•Institutions (1)

Braunschweig University of Technology¹

15 Nov 1997

TL;DR: A class of cellular automata for reaction-diffusion systems is presented, based on a local average for the diffusive dynamics, and closely related to finite difference schemes.

Abstract: A class of cellular automata for reaction-diffusion systems is presented. It is based on a local average for the diffusive dynamics, and closely related to finite difference schemes. The reactive dynamics is implemented as a lookup-table with probabilistic rules. The rules are derived directly and systematically from the given differential equations, using probabilistic rounding to enforce the discretization of the concentration variables. For quantitatively correct modeling, such probabilistic rules are usually necessary, but in some cases a deterministic version proves sufficient.
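
The probabilistic rounding mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the paper's construction: the 1D periodic lattice, the three-cell average, and the linear decay term standing in for the reaction are all assumptions chosen for brevity, and the names `probabilistic_round` and `ca_step` are hypothetical. The paper describes a lookup table with probabilistic rules, which this sketch replaces with direct arithmetic.

```python
import math
import random


def probabilistic_round(x, rng=random):
    """Round x down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    frac = x - lower
    # Round up with probability equal to the fractional part.
    return lower + (1 if rng.random() < frac else 0)


def ca_step(cells, rate, rng=random):
    """One update of a periodic 1D lattice of integer concentrations.

    Diffusion: local average over a cell and its two neighbours.
    Reaction: linear decay with the given rate (a placeholder reaction).
    Discretization: probabilistic rounding back to integers.
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        avg = (cells[i - 1] + cells[i] + cells[(i + 1) % n]) / 3.0
        new_cells.append(probabilistic_round(avg - rate * avg, rng))
    return new_cells
```

Because the rounding is unbiased, the mean concentration follows the continuous update on average even though every cell holds an integer.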

75 citations

Journal Article•10.1016/S0167-8191(96)00075-0•

A graphical development and debugging environment for parallel programs

Péter Kacsuk¹, José C. Cunha², Gábor Dózsa¹, João Lourenço², Tibor Fadgyas¹, Tiago Antao²•Institutions (2)

Hungarian Academy of Sciences¹, Universidade Nova de Lisboa²

28 Feb 1997

TL;DR: A distributed debugging engine (DDBG) assists the user in debugging GRAPNEL programs on distributed memory computer architectures and Tape/PVM and PROVE support the performance monitoring and visualization of parallel programs developed in the GRADE environment.

Abstract: To provide high-level graphical support for PVM (Parallel Virtual Machine) based program development, a complex programming environment (GRADE) is being developed. GRADE currently provides tools to construct, execute, debug, monitor and visualize message-passing parallel programs. It offers a high-level graphical programming abstraction mechanism to construct parallel applications by introducing a new graphical language called GRAPNEL. GRADE also provides the programmer with the same graphical user interface during the program design and debugging stages. A distributed debugging engine (DDBG) assists the user in debugging GRAPNEL programs on distributed memory computer architectures. Tape/PVM and PROVE support the performance monitoring and visualization of parallel programs developed in the GRADE environment.

71 citations

Journal Article•10.1016/S0167-8191(97)00044-6•

Scalable parallel Benders decomposition for stochastic linear programming

Soren S. Nielsen¹, Stavros A. Zenios²•Institutions (2)

University of Texas at Austin¹, University of Cyprus²

1 Aug 1997

TL;DR: A scalable parallel implementation of the classical Benders decomposition algorithm for two-stage stochastic linear programs using a primal-dual, path-following algorithm for solving the scenario subproblems is developed that alleviates the difficulties of load balancing.

Abstract: We develop a scalable parallel implementation of the classical Benders decomposition algorithm for two-stage stochastic linear programs. Using a primal-dual, path-following algorithm for solving the scenario subproblems, we develop a parallel implementation that alleviates the difficulties of load balancing. Furthermore, the dual and primal step calculations can be implemented using a data-parallel programming paradigm. With this approach the code effectively utilizes both the multiple, independent processors and the vector units of the target architecture, the Connection Machine CM-5. The usually limiting master program is solved very efficiently using the interior point code LoQo on the front-end workstation. The implementation scales almost perfectly with problem and machine size. Extensive computational testing is reported with several large problems with up to 2 million constraints and 13.8 million variables.

64 citations

Book Chapter•10.1007/3-540-62573-9_4•

Efficient Adaptive Routing in Networks of Workstations with Irregular Topology

Federico Silla¹, Manuel P. Malumbres¹, Antonio Robles¹, Pedro López¹, José Duato¹•Institutions (1)

Polytechnic University of Valencia¹

1 Feb 1997

TL;DR: A general methodology for the design of adaptive routing algorithms for networks with irregular topology, based on the application of the theory of deadlock avoidance proposed in [14], which increases routing flexibility by allowing cyclic dependencies between channels.

Abstract: Networks of workstations are rapidly emerging as a cost-effective alternative to parallel computers. Switch-based interconnects with irregular topologies allow the wiring flexibility, scalability and incremental expansion capability required in this environment. The irregularity also makes routing and deadlock avoidance on such systems quite complicated. Current proposals avoid deadlock by removing cyclic dependencies between channels. As a consequence, many messages are routed following non-minimal paths, increasing latency and wasting resources. In this paper, we propose a general methodology for the design of adaptive routing algorithms for networks with irregular topology. These routing algorithms allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. The methodology is based on the application of the theory of deadlock avoidance proposed in [14], which increases routing flexibility by allowing cyclic dependencies between channels. As an example of application, we propose an adaptive routing algorithm for Autonet. It can be implemented either by duplicating physical channels or by splitting each physical channel into two virtual channels. In the former case, the implementation does not require a new switch design. It only requires changing the routing tables and adding links in parallel with existing ones, taking advantage of spare switch ports. In the latter case, a new switch design is required but the network topology is not changed. Preliminary evaluation results show that the new routing algorithm is able to increase throughput for random traffic by a factor of up to 2.8 with respect to the original algorithm, also reducing latency.

62 citations

Journal Article•10.1016/S0167-8191(96)00074-9•

LICRA: a replicated-data management algorithm for distributed synchronous groupware applications

Rushed Kanawati¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

28 Feb 1997

TL;DR: The proposed algorithm guarantees an optimal response time while ensuring data consistency at system quiescence, and is compared with two well-known optimistic concurrency control protocols for groupware applications: dOPT and ORESTE.

Abstract: Replicated data consistency is a key issue in the design of distributed real-time groupware applications. In this paper, we propose a new protocol to cope with this problem. The proposed algorithm guarantees an optimal response time while ensuring data consistency at system quiescence. The originality of our proposition relies on the fact that neither locks nor clocks nor global information are required to establish data consistency. Instead, direct dependency relations between generated operations as well as an operation transformation mechanism are used. The coupling of these two mechanisms is shown to realize a good trade-off between the different requirements of groupware applications. Advantages of our approach are illustrated by comparing the algorithm to two well-known optimistic concurrency control protocols for groupware applications: dOPT and ORESTE.
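
The idea of operation transformation referred to in this abstract can be sketched generically in Python. This is not the LICRA protocol, whose dependency tracking is not described here; it is the textbook transformation of two concurrent text insertions, and the `Insert` type, the `site` tie-breaker, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text is inserted
    text: str   # inserted text
    site: int   # originating site id, used only to break position ties


def transform(op, other):
    """Return op adjusted so it can be applied after `other` was applied."""
    shifts = other.pos < op.pos or (other.pos == op.pos and other.site < op.site)
    if shifts:
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


def apply(doc, op):
    """Apply an insertion to a string and return the new string."""
    return doc[:op.pos] + op.text + doc[op.pos:]
```

For example, with `doc = "abc"`, `a = Insert(1, "X", 1)` and `b = Insert(2, "Y", 2)`, one replica computes `apply(apply(doc, a), transform(b, a))` and the other `apply(apply(doc, b), transform(a, b))`; both yield `"aXbYc"`, so the replicas converge without locks.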

54 citations

Journal Article•10.1016/S0167-8191(97)00031-8•

Rendering large scenes using parallel ray tracing

Erik Reinhard¹, Frederik W. Jansen¹•Institutions (1)

Delft University of Technology¹

1 Jul 1997

TL;DR: A hybrid scheduling approach is presented that combines demand driven and data parallel techniques, which leads to a scalable and efficient parallel implementation of the ray tracing algorithm with little restriction on the size of the model data base to be rendered.

Abstract: Ray tracing is a powerful technique to generate realistic images of 3D scenes. However, rendering complex scenes may easily exceed the processing and memory capabilities of a single workstation. Distributed processing offers a solution if the algorithm can be parallelized in an efficient way. In this paper a hybrid scheduling approach is presented that combines demand driven and data parallel techniques. Whether a task is processed demand driven or data parallel is decided by the data intensity of the task and the amount of data locality (coherence) that will be present in the task. By combining demand driven and data driven tasks, a better load balance may be achieved, while at the same time the communication is spread evenly across the network. This leads to a scalable and efficient parallel implementation of the ray tracing algorithm with little restriction on the size of the model data base to be rendered.

54 citations

Journal Article•10.1016/S0167-8191(97)00050-1•

Parallel 3D computation of unsteady flows around circular cylinders

V. Kalro¹, Tayfun E. Tezduyar¹•Institutions (1)

University of Minnesota¹

1 Sep 1997

TL;DR: This article employs stabilized finite element formulations to solve the Navier-Stokes equations on a Thinking Machines CM-5 supercomputer, and the coupled, nonlinear equations generated every time step are solved iteratively, with an element-vector based evaluation technique.

Abstract: In this article we present parallel 3D finite element computation of unsteady incompressible flows around circular cylinders. We employ stabilized finite element formulations to solve the Navier-Stokes equations on a Thinking Machines CM-5 supercomputer. The time integration is based on an implicit method, and the coupled, nonlinear equations generated every time step are solved iteratively, with an element-vector based evaluation technique. This strategy enables us to carry out these computations with millions of coupled, nonlinear equations, and thus resolve the flow features in great detail. At Reynolds numbers 300 and 800, our results indicate strong 3D features arising from the instability of the columnar vortices forming the Karman street. At Re = 10 000 we employ a large eddy simulation (LES) turbulence model.

53 citations

Journal Article•10.1016/S0167-8191(97)00008-2•

RAMA: an easy-to-use, high-performance parallel file system

Ethan L. Miller¹, Randy H. Katz²•Institutions (2)

University of Maryland, Baltimore County¹, University of California, Berkeley²

1 Jun 1997

TL;DR: RAMA uses hashing to pseudo-randomly distribute data to all of its disks, ensuring high bandwidth regardless of access pattern and eliminating bottlenecks in file block accesses.

Abstract: Modern massively parallel file systems provide high bandwidth file access by striping files across arrays of disks attached to a few specialized I/O nodes. However, these file systems are hard to use and difficult to integrate with workstations and tertiary storage. RAMA addresses these problems by providing a high-performance massively parallel file system with a simple interface. RAMA uses hashing to pseudo-randomly distribute data to all of its disks, ensuring high bandwidth regardless of access pattern and eliminating bottlenecks in file block accesses. This flexibility does not cause a large loss of performance: RAMA's simulated performance is within 10–15% of the optimum performance of a similarly-sized striped file system, and is a factor of 4 or more better than a striped file system with poorly laid out data.
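
Hash-based block placement of the kind this abstract describes can be sketched as follows. The sketch is only an illustration of the general idea: the key format, the choice of SHA-256, and the function name `disk_for_block` are assumptions, not details of RAMA itself.

```python
import hashlib


def disk_for_block(file_id, block_no, num_disks):
    """Map a (file, block) pair to a disk index in a pseudo-random way.

    The mapping is deterministic, so any node can locate a block without
    consulting a central table, and consecutive blocks of one file are
    scattered over the disks regardless of the access pattern.
    """
    key = f"{file_id}:{block_no}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_disks
```

With a reasonable hash, sequential or strided reads of a large file touch all disks roughly equally, which is why no single disk becomes a bottleneck for a particular layout.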

52 citations

Journal Article•10.1006/JPDC.1996.1292•

Firehose Architectures for Free-Space Optically Interconnected VLSI Circuits

Ashok V. Krishnamoorthy¹, David A. B. Miller²•Institutions (2)

Alcatel-Lucent¹, Stanford University²

25 Feb 1997

TL;DR: Several classes of optoelectronic architectures are presented that can support a high-bandwidth data firehose and mention applications in switching, FFT, sorting, matrix-vector processing, database search, and processor-to-memory interconnect.

Abstract: Free-space optical interconnects will soon be able to provide input/output bandwidths to a VLSI chip in excess of a terabit per second. The successful application of this technology to parallel distributed processing systems depends on the development of high-bandwidth interconnects and the ability of the architecture to sustain a stream of data at these bandwidths to the processing elements. We review examples of computational tasks that require scalable input/output, that is, computations where the I/O bandwidth of a processing element must grow in proportion to its computational bandwidth. We present several classes of optoelectronic architectures that can support a high-bandwidth data firehose and mention applications in switching, FFT, sorting, matrix-vector processing, database search, and processor-to-memory interconnect.

Journal Article•10.1016/S0167-8191(96)00062-2•

A high-performance MPI implementation on a shared-memory vector supercomputer

William Gropp¹, Ewing Lusk¹•Institutions (1)

Argonne National Laboratory¹

1 Jan 1997

TL;DR: The sequence of steps by which MPICH, a high-performance, portable implementation of the Message-Passing Interface (MPI) standard, was ported to the NEC SX-4 raised issues that are important for shared-memory programming in general and shed light on both MPICH and the SX-4.

Abstract: In this article we recount the sequence of steps by which MPICH, a high-performance, portable implementation of the Message-Passing Interface (MPI) standard, was ported to the NEC SX-4, a high-performance parallel supercomputer. Each step in the sequence raised issues that are important for shared-memory programming in general and shed light on both MPICH and the SX-4. The result is a low-latency, very high bandwidth implementation of MPI for the NEC SX-4. In the process, MPICH was also improved in several general ways.

Journal Article•10.1016/S0167-8191(97)00056-2•

Parallel automatic adaptive analysis

Mark S. Shephard¹, Joseph E. Flaherty¹, Carlo L. Bottasso¹, H. L. de Cougny¹, Can Özturan¹, M. L. Simone¹•Institutions (1)

Rensselaer Polytechnic Institute¹

1 Sep 1997

TL;DR: Consideration is given to the techniques required to support adaptive analysis of automatically generated unstructured meshes on distributed memory MIMD parallel computers, and procedures to automatically generate and adaptively refine meshes in parallel are given.

Abstract: Consideration is given to the techniques required to support adaptive analysis of automatically generated unstructured meshes on distributed memory MIMD parallel computers. Emphasis is placed on the structures needed to support effective parallel computations when the numerical discretization, the mesh, is defined and evolves during the computation. The key base structures are a distributed mesh based on a topological hierarchy, and a parallel distributed octree. Parallel control of the mesh and octree structures is done through a set of partition communication operations and entity migration routines. Load balance is maintained through iterative load balance, or distributed repartitioning. Building on these structures, procedures to automatically generate and adaptively refine meshes in parallel, starting from CAD geometric models, are given. Finally, the combination of these techniques to produce a parallel automated analysis procedure is demonstrated.

Journal Article•10.1016/S0167-8191(97)00093-8•

Fault-tolerant wormhole routing in mesh with overlapped solid fault regions

Seong-Pyo Kim¹, Taisook Han¹•Institutions (1)

KAIST¹

15 Dec 1997

TL;DR: A fault-tolerant wormhole routing algorithm on mesh-connected processors that allows the fault polygons to be overlapped and utilizes the position of the fault region relative to the current channel to route a message around overlapped fault polygons.

Abstract: A fault-tolerant wormhole routing algorithm on mesh-connected processors is proposed. The proposed algorithm is based on the solid fault model and allows the fault polygons to be overlapped. The algorithm utilizes the position of fault region relative to the current channel to route a message around overlapped fault polygons. A node deactivating algorithm to convert a non-solid fault region into a solid fault region is also proposed. The proposed routing algorithm uses four virtual channels and is deadlock- and livelock-free.

Journal Article•10.1016/S0167-8191(97)00061-6•

The remote computation system

Peter Arbenz¹, Walter Gander¹, Michael Oettli¹•Institutions (1)

ETH Zurich¹

1 Oct 1997

TL;DR: The goal of the remote computation system (RCS) is to provide easy access to modern parallel algorithms on supercomputers for the inexperienced user with an easy-to-use mechanism for using computational resources remotely.

Abstract: Today many high performance computers are reachable over some network. However, the access and use of these computers is often complicated. This prevents many users from working on such machines. The goal of the Remote Computation System (RCS) is to provide easy access to modern parallel algorithms on supercomputers for the inexperienced user. RCS has an easy-to-use mechanism for using computational resources remotely. The computational resources available are used as efficiently as possible in order to minimize the response time.

Journal Article•10.1016/S0167-8191(96)00105-6•

Debugging with the MAD environment

Dieter Kranzlmüller¹, Siegfried Grabner¹, Jens Volkert¹•Institutions (1)

Johannes Kepler University of Linz¹

1 Apr 1997

TL;DR: All the tools in the MAD environment follow an extensible and modular debugging strategy based on a graphical user interface that helps the user in monitoring and analyzing message passing programs.

Abstract: Debugging parallel programs can be tedious and difficult. Therefore the programmer needs support from tools that provide features for error detection and performance analysis. The MAD environment is such a toolset. It helps the user in monitoring and analyzing message passing programs. Communication errors and performance bottlenecks are visualized based on an event graph. Source code connection provides a link between visualized events and the original lines of code or a control and data flow representation. A main part of the environment is dedicated to race conditions. After evaluation of events, which might be reordered during successive program runs, localization of message races can be performed by means of trace driven simulation. All the tools in the MAD environment follow an extensible and modular debugging strategy based on a graphical user interface.

Journal Article•10.1016/S0167-8191(97)00055-0•

Efficient parallel computation of unstructured finite element reacting flow solutions

John N. Shadid¹, Harry K. Moffat¹, Scott A. Hutchinson¹, Karen D. Devine¹, Gary L. Hennigan¹, Andrew G. Salinger¹•Institutions (1)

Sandia National Laboratories¹

1 Sep 1997

TL;DR: A parallel unstructured finite element reacting flow solver designed for message passing MIMD computers is described, indicating that very high computational rates and high scaled efficiencies can be achieved for large problems despite the use of sparse matrix data structures and the required unstructured data communication.

Abstract: A parallel unstructured finite element (FE) reacting flow solver designed for message passing MIMD computers is described. This implementation employs automated partitioning algorithms for load balancing unstructured grids, a distributed sparse matrix representation of the global FE equations, and parallel Krylov subspace iterative solvers. In this paper, a number of issues related to the efficient implementation of parallel unstructured mesh applications are presented. These issues include the differences between structured and unstructured mesh parallel applications, major communication kernels for unstructured Krylov iterative solvers, automatic mesh partitioning algorithms, and the influence of mesh partitioning metrics and single-node CPU performance on parallel performance. Results are presented for example FE heat transfer, fluid flow and full reacting flow applications on a 1024 processor nCUBE 2 hypercube and a 1904 processor Intel Paragon. Results indicate that very high computational rates and high scaled efficiencies can be achieved for large problems despite the use of sparse matrix data structures and the required unstructured data communication.

Journal Article•10.1016/S0167-8191(97)00111-7•

The NRL layered ocean model

Alan J. Wallcraft¹, D. R. Moore¹•Institutions (1)

Stennis Space Center¹

30 Dec 1997

TL;DR: The portable scalable implementation of the NRL Layered Ocean Model is described, based primarily on the tiled data parallel programming paradigm, which is sufficiently general that the actual technique used on a given machine to obtain scalability can be selected at compile time from: data parallel, SPMD message passing, autotasking, or SPMD message passing between multi-processor autotasked systems.

Abstract: We describe the portable scalable implementation of the NRL Layered Ocean Model (NLOM). Scalability is based primarily on the tiled data parallel programming paradigm. This is sufficiently general that the actual technique used on a given machine to obtain scalability can be selected at compile time from: (i) data parallel, (ii) SPMD message passing, (iii) autotasking, or (iv) SPMD message passing between multi-processor autotasked systems. The code is thus portable onto all machine types likely to be used by ocean modelers.

Journal Article•10.1016/S0167-8191(97)00042-2•

Existence and uniqueness of WZ factorization

S. Chandra Sekhara Rao¹•Institutions (1)

Indian Institute of Technology Kanpur¹

1 Aug 1997

TL;DR: The main objective of this paper is to present a characterization for the existence of the WZ factorization and prove uniqueness of the factorization.

Abstract: Evans [D.J. Evans, Implicit matrix elimination schemes, Int. J. Computer Math. 48 (1993) 229–237.] introduced the parallel implicit elimination (PIE) method for solving large order linear equations. The quadrant interlocking factorization (QIF) method of Evans and Hatzopoulos [D.J. Evans, M. Hatzopoulos, A parallel linear system solver, Int. J. Computer Math. 7 (1979) 227–238.] can be viewed as a factorization version of PIE method. The main objective of this paper is to present a characterization for the existence of the WZ factorization. We also prove uniqueness of the factorization.

Journal Article•10.1016/S0167-8191(97)00005-7•

Preconditioned CG methods for sparse matrices on massively parallel machines

Achim Basermann, B. Reichel, C. Schelthoff

1 May 1997

TL;DR: The data distribution and the communication scheme for the sparse matrix operations of the preconditioned CG are based on the analysis of the indices of the non-zero elements, and a fully local incomplete Cholesky preconditioner is presented.

Abstract: Conjugate gradient (CG) methods to solve sparse systems of linear equations play an important role in numerical methods for solving discretized partial differential equations. The large size and the condition of many technical or physical applications in this area result in the need for efficient parallelization and preconditioning techniques of the CG method, in particular on massively parallel machines. Here, the data distribution and the communication scheme for the sparse matrix operations of the preconditioned CG are based on the analysis of the indices of the non-zero elements. Polynomial preconditioning is shown to reduce global synchronizations considerably, and a fully local incomplete Cholesky preconditioner is presented. On a PARAGON XP/S 10 with 138 processors, the developed parallel methods outperform diagonally scaled CG markedly with respect to both scaling behavior and execution time for many matrices from real finite element applications.
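
The diagonally scaled CG that this abstract uses as its baseline can be sketched serially in a few lines of Python. This is a minimal single-process illustration, not the paper's parallel method and not its polynomial or incomplete Cholesky preconditioners; the row-wise sparse format (a list of `(column, value)` pairs per row) and the function names are assumptions.

```python
def matvec(rows, x):
    """Sparse matrix-vector product; rows[i] is a list of (column, value)."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def jacobi_pcg(rows, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A with CG,
    preconditioned by the inverse of the diagonal of A."""
    n = len(b)
    inv_diag = [1.0 / next(v for j, v in rows[i] if j == i) for i in range(n)]
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        ap = matvec(rows, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

In a distributed setting each `dot` call is a global reduction, which is the synchronization cost the abstract says polynomial preconditioning helps reduce.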

Journal Article•10.1016/S0167-8191(97)00060-4•

An adaptive curvilinear-coordinate approach to dynamic load balancing of parallel multiresolution molecular dynamics

Aiichiro Nakano¹, Timothy J. Campbell¹•Institutions (1)

Louisiana State University¹

1 Oct 1997

TL;DR: A practical experience in adding a dynamic-load-balancing capability to an existing large parallel application, multiresolution molecular dynamics (MRMD), which is based on uniform mesh decomposition; the new scheme uses adaptive curvilinear coordinates to represent partition boundaries.

Abstract: We present a practical experience in adding a dynamic-load-balancing capability to an existing large parallel application — multiresolution molecular dynamics (MRMD) — which is based on uniform mesh decomposition. The new load-balancing scheme uses adaptive curvilinear coordinates to represent partition boundaries. Workloads are partitioned with a uniform 3-dimensional mesh in the curvilinear coordinate system. Simulated annealing is used to determine the optimal coordinate system which minimizes load imbalance and communication costs. The number of messages for performing simulations is minimal because of the underlying regular mesh topology. Periodic boundary conditions are naturally incorporated in the new scheme. Performance of the MRMD algorithm with the new load balancer has been tested for nonuniform multimillion-atom systems.

Journal Article•10.1016/S0167-8191(96)00094-4•

Analyzing scheduling policies using Dimemas

Jesús Labarta¹, S. Girona¹, Toni Cortes¹•Institutions (1)

Polytechnic University of Catalonia¹

1 Apr 1997

TL;DR: The structure of the simulator and the workload used are presented, and the effects of the above mentioned factors on both global system throughput and individual response time on each of the different parallel applications are compared.

Abstract: Dimemas is a simulator that allows the study of message passing applications on distributed memory machines. Currently, we are using Dimemas to analyze the effects of different processor scheduling policies when several parallel applications share machine resources such as processors and the interconnection network. The effect of sequential independent processes on the parallel applications is also being studied with the simulator. This situation is fairly frequent on clusters of workstations running a mixture of parallel and sequential workloads. We also study the influence of communication parameters (network bandwidth and conflicts) on the system performance. The paper presents the structure of the simulator and the workload used. This workload is a mixture of jobs from the NAS parallel benchmarks. We finally compare the effects of the above-mentioned factors on both global system throughput and individual response time for each of the different parallel applications.

Journal Article•10.1016/S0167-8191(96)00065-8•

Neural networks for solving linear inequality systems

Andrzej Cichocki¹, A. Bargiela¹•Institutions (1)

Nottingham Trent University¹

1 Jan 1997

TL;DR: A neural network approach to the on-line solution of linear inequality systems is considered; for each technique a novel neural network implementation is proposed, and the validity and performance of the proposed algorithms have been verified by computer simulation experiments.

Abstract: In this paper a neural network approach to the on-line solution of linear inequality systems is considered. Three different techniques are discussed and for each technique a novel neural network implementation is proposed. The first technique is a standard penalty method implemented as an analog neural network. The second technique is based on the transformation of inequality constraints into equality constraints with simple bounds on the variables. The transformed problem is then solved using least squares (LS) and least absolute values (LAV) optimisation criteria. The third technique makes use of the regularised total least squares criterion (RTLS). For each technique a suitable neural network architecture and an associated algorithm in the form of nonlinear differential equations have been developed. The validity and performance of the proposed algorithms have been verified by computer simulation experiments. The analog neural networks are deemed to be particularly well suited for high throughput, real time applications.
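
The first technique named in the abstract, a penalty method, can be illustrated with a small digital simulation. The sketch below is not the paper's analog network: it assumes the system has the form A x <= b, uses the textbook quadratic penalty E(x) = 1/2 * sum(max(0, a_i . x - b_i)^2), and integrates the gradient flow dx/dt = -A^T r with forward Euler steps. The function name `solve_inequalities` and the parameters `step`, `iters` and `tol` are hypothetical choices for illustration, and the step size is assumed to be small enough for the iteration to be stable.

```python
def solve_inequalities(A, b, x0, step=0.05, iters=5000, tol=1e-9):
    """Seek x with A x <= b by descending a quadratic penalty (illustrative sketch)."""
    x = list(x0)
    for _ in range(iters):
        # Violation of each constraint: r_i = max(0, a_i . x - b_i)
        r = [max(0.0, sum(a * v for a, v in zip(row, x)) - bi)
             for row, bi in zip(A, b)]
        if max(r) <= tol:
            break  # every constraint is satisfied to within tol
        # Gradient of the penalty energy is A^T r
        grad = [sum(A[i][j] * r[i] for i in range(len(A)))
                for j in range(len(x))]
        # Forward Euler step of the gradient flow dx/dt = -A^T r
        x = [v - step * g for v, g in zip(x, grad)]
    return x


# Example system: x + y <= 4, x >= 1, y >= 1 (the last two written as -x <= -1, -y <= -1)
A = [[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b = [4.0, -1.0, -1.0]
x = solve_inequalities(A, b, [0.0, 0.0])
```

Because the penalty only pushes on violated constraints, the iterate approaches the feasible region from outside and stops once the largest violation falls below `tol`.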

Journal Article•10.1016/S0167-8191(97)00016-1•

A distributed processing algorithm for solving integer programs using a cluster of workstations

Gautam Mitra¹, I. Hai¹, M. T. Hajian²•Institutions (2)

Brunel University London¹, Imperial College London²

29 Jun 1997

TL;DR: This two stage algorithm has been implemented on a cluster of SUN workstations using the Parallel Virtual Machine (PVM) harness and the results of the investigation for a range of well known test problems taken from the MIPLIB set and others from the literature are reported.

Abstract: The sequential branch and bound algorithm is the most established method for solving mixed integer and discrete programming problems. It is based on the tree search of the possible subproblems of the original problem. There are two goals in carrying out a tree search, namely, (i) finding a good and ultimately the best integer solution, and (ii) proving that the best solution has been found or that no integer feasible solution exists. We call these stage 1 and stage 2 of the tree search. In general it is extremely difficult to choose the ideal search strategy in stage 1 or stage 2 for a given integer programming (IP) problem. On the other hand, by investigating a number of different strategies (and hence different search trees) a good solution can be reached quickly for many practical IP problems. Starting from this observation, a parallel branch and bound algorithm has been designed which exploits this two stage approach. In the first stage we investigate a number of alternative search trees (forest search) in the hope of finding a good solution quickly. This we call the multiple heuristic search (MHS). In this approach the best integer solution is broadcast to other processors involved in MHS tree development. In the second stage we reorganise the search to investigate branches of a chosen tree in parallel. This two stage algorithm has been implemented on a cluster of SUN workstations using the Parallel Virtual Machine (PVM) harness [12]. The results of our investigation for a range of well known test problems taken from the MIPLIB set and others from the literature are reported here.

Journal Article•10.1016/S0167-8191(96)00085-3•

A parallel two-list algorithm for the knapsack problem

Der-Chyuan Lou¹, Chin-Chen Chang¹•Institutions (1)

National Chung Cheng University¹

1 Mar 1997

TL;DR: The proposed parallel algorithm is pragmatically feasible at the moment when multiprocessor systems become more and more popular and has a cost of O(2^{n/2}), an improvement over previous literature.

Abstract: An n-element knapsack problem has 2^n possible solutions to search over, so the task can be accomplished in 2^n trials if an exhaustive search is used. Because of the exponential time needed to solve it, the knapsack problem is considered to be very hard. In the past decade, much effort has been made to find techniques which could lead to practical algorithms with reasonable running time. In 1994, Chang et al. proposed a brilliant parallel algorithm, which needs O(2^{n/8}) processors to solve the knapsack problem in O(2^{n/2}) time; that is, the cost of Chang et al.'s parallel algorithm is O(2^{5n/8}). In this paper, we propose a parallel algorithm that improves on Chang et al.'s parallel algorithm by reducing the time complexity to O(2^{3n/8}) with the same O(2^{n/8}) processors available. Thus, the proposed parallel algorithm has a cost of O(2^{n/2}). It is an improvement over previous literature. We believe that the proposed parallel algorithm is pragmatically feasible at the moment when multiprocessor systems become more and more popular.
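
For orientation, the sequential two-list idea that this line of work parallelizes can be sketched as follows. This is not the parallel algorithm of the paper. It assumes the subset-sum form of the knapsack problem (find a subset of the weights that sums exactly to a target), which the abstract does not spell out, and the function names `subset_sums` and `two_list_knapsack` are hypothetical. The item set is split in half, all 2^{n/2} subset sums of each half are listed, one list is sorted ascending and the other descending, and a single two-pointer scan looks for a pair that hits the target.

```python
def subset_sums(items):
    """Return (sum, bitmask) for every subset of items."""
    sums = [(0, 0)]
    for i, w in enumerate(items):
        # Each existing subset is extended by item i
        sums += [(s + w, mask | (1 << i)) for s, mask in sums]
    return sums


def two_list_knapsack(weights, target):
    """Return indices of a subset summing to target, or None (sequential sketch)."""
    half = len(weights) // 2
    left, right = weights[:half], weights[half:]
    a = sorted(subset_sums(left))                 # ascending by sum
    b = sorted(subset_sums(right), reverse=True)  # descending by sum
    i = j = 0
    while i < len(a) and j < len(b):
        total = a[i][0] + b[j][0]
        if total == target:
            lmask, rmask = a[i][1], b[j][1]
            chosen = [k for k in range(half) if (lmask >> k) & 1]
            chosen += [half + k for k in range(len(right)) if (rmask >> k) & 1]
            return chosen
        if total < target:
            i += 1  # a[i] is too small even with the largest remaining b
        else:
            j += 1  # b[j] is too large even with the smallest remaining a
    return None


weights = [3, 34, 4, 12, 5, 2]
picked = two_list_knapsack(weights, 9)
```

Each list holds 2^{n/2} entries, so the scan takes O(2^{n/2}) steps after sorting, compared with 2^n trials for exhaustive search.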

Journal Article•10.1016/S0167-8191(96)00063-4•

Unimodular transformations of non-perfectly nested loops

Jingling Xue¹•Institutions (1)

University of New England (Australia)¹

15 Feb 1997

TL;DR: A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations, and an iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest.

Abstract: A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations. In this framework, an imperfect loop nest is converted to a perfect loop nest using Abu-Sufah's Non-Basic-to-Basic-Loop transformation. Conditions for the legality of this transformation and techniques for their verification are discussed. An iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest. Since the converted loop nest is a perfect loop nest, data dependences can be extracted and optimal transformations can be selected for parallelism and/or locality in the normal manner. To generate the restructured code for a unimodular transformation, a code generation method is provided that produces the restructured code that is free of if statements by construction.
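
Once a nest is perfect, the abstract says transformations can be selected "in the normal manner". A minimal sketch of that normal manner for a depth-2 nest is given below, under the standard assumption that a unimodular transformation T (integer matrix with determinant +1 or -1) is legal when every transformed dependence distance vector stays lexicographically positive. The helper names and the example distance vector are hypothetical, and nothing here reproduces the paper's handling of imperfect nests.

```python
def det2(T):
    """Determinant of a 2x2 integer matrix."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]


def apply(T, v):
    """Matrix-vector product T v for a 2x2 matrix."""
    return tuple(sum(t * x for t, x in zip(row, v)) for row in T)


def lex_positive(v):
    """True if the first nonzero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not lexicographically positive


def is_legal(T, distances):
    """Unimodular and dependence-preserving for the given distance vectors."""
    return abs(det2(T)) == 1 and all(lex_positive(apply(T, d)) for d in distances)


distances = [(1, -1)]            # hypothetical dependence distance vector
interchange = [[0, 1], [1, 0]]   # swap the two loops
skew = [[1, 0], [1, 1]]          # skew the inner loop by the outer loop

interchange_ok = is_legal(interchange, distances)  # (1, -1) maps to (-1, 1)
skew_ok = is_legal(skew, distances)                # (1, -1) maps to (1, 0)
```

In this example interchange alone reverses the dependence and is rejected, while skewing keeps it lexicographically positive.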

Journal Article•10.1016/S0167-8191(97)00004-5•

A block variant of the GMRES method on massively parallel processors

Guangye Li¹•Institutions (1)

Cray¹

1 Aug 1997

TL;DR: It is shown that this block variant of the GMRES method for solving general unsymmetric linear systems with block size s, denoted by BVGMRES(s, m), is theoretically equivalent to the GMRES method.

Abstract: This paper presents a block variant of the GMRES method for solving general unsymmetric linear systems. This algorithm generates a transformed Hessenberg matrix by solely using block matrix operations and block data communications. It is shown that this algorithm with block size s, denoted by BVGMRES(s, m), is theoretically equivalent to the GMRES(s, m) method. The numerical results demonstrate that this algorithm can be more efficient than the standard GMRES method on a cache based single CPU computer with optimized BLAS kernels. Furthermore, the gain in efficiency is more significant on MPPs due to both efficient block operations and efficient block data communications. Preliminary numerical results on some real-world problems also show that this algorithm may be stable up to some reasonable block size.

Journal Article•10.1016/S0167-8191(96)00104-4•

Integrated visualization of parallel program performance data

Karen L. Karavanic¹, Jussi Myllymaki¹, Miron Livny¹, Barton P. Miller¹•Institutions (1)

University of Wisconsin-Madison¹

1 Apr 1997

TL;DR: Devise, a generic visualization tool designed to allow an arbitrary number of different but related data streams to be integrated and explored visually in a flexible manner, is demonstrated in three case studies, including visualizations that incorporate data from Unix utilities and application output.

Abstract: Performance tuning a parallel application involves integrating performance data from many components of the system, including the message passing library, performance monitoring tool, resource manager, operating system, and the application itself. The current practice of visualizing these data streams using a separate, customized tool for each source is inconvenient from a usability perspective, and there is no easy way to visualize the data in an integrated fashion. We demonstrate a solution to this problem using Devise, a generic visualization tool which is designed to allow an arbitrary number of different but related data streams to be integrated and explored visually in a flexible manner. We display data emanating from a variety of sources side by side in three case studies. First we interface the Paradyn parallel performance tool and Devise, using two simple data export modules and Paradyn's simple visualization interface. We show several Devise/Paradyn visualizations which are useful for performance tuning parallel codes, and which incorporate data from Unix utilities and application output. Next we describe the visualization of trace data from a parallel application running in a Condor cluster of workstations. Finally we demonstrate the utility of Devise visualizations in a study of Condor cluster activity.

Journal Article•10.1016/S0167-8191(97)00032-X•

Visibility masks for solving complex radiosity computations on multiprocessors

Bruno Arnaldi, Thierry Priol, Luc Renambot, Xavier Pueyo

1 Jul 1997

TL;DR: This paper presents a strategy to handle very complex scenes for radiosity computation by splitting the problem into subproblems, using virtual interface and visibility masks, and is able to achieve better data locality than other standard solutions.

Abstract: This paper presents a strategy to handle very complex scenes for radiosity computation. Compared to other radiosity algorithms, our solution focuses on the ability to compute the radiosity in local environments instead of solving the problem for the whole environment. By splitting the problem into subproblems, using virtual interface and visibility masks, our technique is able to achieve better data locality than other standard solutions. We present an implementation of visibility masks on a distributed memory parallel computer (Intel Paragon XP/S).

Journal Article•10.1006/JPDC.1996.1290•

Optical Interconnectivity in a Scalable Data-Parallel System

J.A.B. Dines¹, John Fraser Snowdon¹, Marc P.Y. Desmulliez¹, D.B. Barsky¹, A.V. Shafarenko¹, C.R. Jesshope¹•Institutions (1)

Heriot-Watt University¹

25 Feb 1997

TL;DR: This paper discusses the issue of scalability of massively parallel systems and addresses a simpler, and in the authors' view more promising, way of doing parallel computing.

Abstract: Scalability. Few would disagree that the future of high-performance computing lies with massively parallel systems (MPS), since there are major physical limitations to the clock rate of a single processor. Massively parallel systems are required to be scalable in the sense that their performance should be proportional to the number of processors. However, a feasible architecture for a scalable massively parallel system is still wanting, as true, i.e., unlimited scalability is not only theoretically impossible but even in the practical sense cannot be achieved on a range of more than an order of magnitude in the number of processors. Whatever a system’s architecture, interconnect, or programming model, something will not scale: the throughput or latency of the interconnect, its cost, or the synchronization overheads. Since all these components contribute to performance in different ways, the issue of scalability is a very complex one indeed. So in what sense can one argue for scalability of massively parallel systems? There have been quite a few attempts to define it (see [1]) on the basis of some strong assumptions regarding the nature of parallel computation. The most common assumption that is being made in such analyses is that the processors run some predominantly local processes which require little external communication (it manifests itself in the parameter “communication to computation ratio” assumed to be small and used in all but a few performance models). By migrating these self-contained processes and placing several of them per processor to balance out computational work across the system, it is believed that scalability may be achieved without requiring physically unfeasible networking and/or dramatically different computational models. In reality, however, process concurrency of such kind continues to face various fundamental limitations from data mapping to dynamic load balancing to program paradigm issues.
Data parallelism. We would like to address a simpler, and in our view more promising, way of doing parallel computing
