Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2003
Patent•
Method and system for optimally sharing memory between a host processor and graphic processor

David Wyatt1•
Intel1
24 Apr 2003
TL;DR: In this article, the authors present a method and system for sharing memory between applications running on one or more CPUs, and acceleration co-processors, such as graphics processors, of a computer system in which the memory may retain its optimal caching and access attributes favorable to the maximum performance of both CPU and graphics processor.
Abstract: A method and system according to the present invention provide for sharing memory between applications running on one or more CPUs, and acceleration co-processors, such as graphics processors, of a computer system in which the memory may retain its optimal caching and access attributes favorable to the maximum performance of both CPU and graphics processor. The method involves a division of ownership within which the shared memory is made coherent with respect to the previous owner, prior to placing the shared memory in the view of the next owner. This arbitration may involve interfaces within which ownership is transitioned from one client to another. Within such transition of ownership the memory may be changed from one view to another by actively altering the processor caching attributes of the shared memory as well as via the use of processor low-level cache control instructions, and/or graphics processor render flush algorithms which serve to enforce data coherency. Embodiments of the invention facilitate maximal access efficiency for all clients which share the data, without incurring the overhead of moving or replicating the data between the clients.

172 citations

Book Chapter•10.1007/978-3-540-45209-6_105•
THROOM — Supporting POSIX Multithreaded Binaries on a Cluster

Henrik Löf1, Zoran Radovic1, Erik Hagersten1•
Uppsala University1
26 Aug 2003
TL;DR: This paper presents a runtime system concept that enables unmodified POSIX (Pthreads) binaries to run transparently on clustered hardware and presents THROOM, a proof-of-concept implementation that runs unmodified Pthread binaries on a virtual cluster modeled as standard UNIX processes.
Abstract: Today, most software distributed shared memory systems (SW-DSMs) lack industry-standard programming interfaces, which limits their applicability to a small set of shared-memory applications. In order to gain general acceptance, SW-DSMs should support the same look-and-feel of shared memory as hardware DSMs. This paper presents a runtime system concept that enables unmodified POSIX (Pthreads) binaries to run transparently on clustered hardware. The key idea is to extend the single process model of multi-threading to a multi-process model where threads are distributed to processes executing in remote nodes. The distributed threads execute in a global shared address space made coherent by a fine-grain SW-DSM layer. We also present THROOM, a proof-of-concept implementation that runs unmodified Pthread binaries on a virtual cluster modeled as standard UNIX processes. THROOM runs on top of the DSZOOM fine-grain SW-DSM system with limited OS support.

142 citations

Proceedings Article•10.1137/1.9781611972733.11•
Communication and Memory Efficient Parallel Decision Tree Construction.

Ruoming Jin, Gagan Agrawal
1 Jan 2003
TL;DR: A new approach to decision tree construction is reported, which is referred to as SPIES (Statistical Pruning of Intervals for Enhanced Scalability) and can be efficiently parallelized using the same high-level interface and runtime support that was previously used to parallelize association mining and clustering algorithms.
Abstract: Decision tree construction is an important data mining problem. In this paper, we revisit this problem with a new goal: can we develop an efficient parallel algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks? We report a new approach to decision tree construction, which we refer to as SPIES (Statistical Pruning of Intervals for Enhanced Scalability). This approach combines RainForest based AVC groups with sampling to achieve memory efficient processing of numerical attributes. Overall, this algorithm has the following properties: 1) no preprocessing or sorting of input data is required, 2) the size of the data-structure required in the main memory is very small, 3) the only disk-traffic required is one pass for splitting nodes for each level of the tree, and no writing-back of data, 4) very low communication volume when this algorithm is parallelized, and 5) the same level of accuracy as an algorithm that does not use sampling or pruning. We show that this algorithm can be efficiently parallelized using the same high-level interface and runtime support that was previously used to parallelize association mining and clustering algorithms. This, we believe, is an important step towards offering high-level interfaces for parallel data mining. Moreover, we have efficiently parallelized this algorithm on a cluster of SMPs, i.e. combining shared memory and distributed memory parallelism, and over disk-resident datasets. Keywords: Decision tree construction, parallelization, cluster of SMPs, sampling, algorithms for streaming data

127 citations
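
The low communication volume claimed above follows from the fact that AVC (attribute-value, class-label) counts are additive: each node can count its own partition and only the counts need to be combined. The sketch below illustrates that idea only, for a single categorical attribute. It is not SPIES itself (no interval pruning or sampling for numerical attributes), and the function names, the record layout, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def avc_set(records, attr):
    """Count (attribute value, class label) pairs for one attribute."""
    return Counter((rec[attr], rec["label"]) for rec in records)


def merge(avc_a, avc_b):
    """AVC counts are additive, so partial results combine by summation."""
    return avc_a + avc_b


def gini_of_split(avc):
    """Weighted Gini impurity of splitting on every distinct attribute value."""
    total = sum(avc.values())
    by_value = {}
    for (value, label), n in avc.items():
        by_value.setdefault(value, Counter())[label] += n
    impurity = 0.0
    for counts in by_value.values():
        size = sum(counts.values())
        purity = sum((c / size) ** 2 for c in counts.values())
        impurity += (size / total) * (1.0 - purity)
    return impurity


# Hypothetical toy data, split across two "nodes".
data = [
    {"outlook": "sunny", "label": "no"},
    {"outlook": "sunny", "label": "no"},
    {"outlook": "rain", "label": "yes"},
    {"outlook": "rain", "label": "yes"},
]
part_a, part_b = data[:2], data[2:]

# Each partition is counted locally; only the small count tables are merged.
merged = merge(avc_set(part_a, "outlook"), avc_set(part_b, "outlook"))
```

Because only the count tables travel between nodes, the data exchanged depends on the number of distinct (value, label) pairs rather than on the number of records.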

Patent•
Pass through circuit for reduced memory latency in a multiprocessor system

John Howard Barker1, Beeman Noel Fallwell1•
IBM1
24 Jun 2003
TL;DR: In this article, a bypass or pass-through device is used to connect two HyperTransport links together where each of the links is connected to a processor at the other end.
Abstract: A technique and mechanism for reducing memory latency asymmetry in a multiprocessor system by replacing one (or more) processors with a bypass or pass-through device. Using the pass-through mechanism, the reduced number of processors in the system enables all of the remaining processors to connect to each other directly using the interconnect links. The reduction in processor count improves symmetry and reduces overall latency thereby potentially improving performance of certain applications despite having fewer processors. In one specific implementation, the pass through device is used to connect two HyperTransport links together where each of the links is connected to a processor at the other end.

110 citations

Proceedings Article•10.1145/782814.782854•
Automatic fence insertion for shared memory multiprocessing

Xing Fang1, Jaejin Lee2, Samuel P. Midkiff1•
Purdue University1, Seoul National University2
23 Jun 2003
TL;DR: This work presents the different fence insertion optimization techniques that were used in this Pensieve compiler project to guarantee sequential consistency at the language level, and compares them using performance data.
Abstract: In general, the hardware memory consistency model in a multiprocessor system is not identical to the memory model at the programming language level. Consequently, the programming language memory model must be mapped onto the hardware memory model. Memory fence instructions can be inserted by the compiler where needed to accomplish this mapping. We have developed and implemented several fence insertion and optimization algorithms in our Pensieve compiler project. We present the different fence insertion optimization techniques that were used in this system to guarantee sequential consistency at the language level, and compare them using performance data. Our techniques target two hardware relaxed memory consistency models provided by SMPs based on IBM Power 3 and Intel Pentium 4. Our fence insertion optimization shows up to 17.2% and 32.7% performance improvement on average, with the IBM PowerPC and Intel Pentium 4 (Xeon) multiprocessors respectively.

97 citations

Patent•
Asynchronous, independent and multiple process shared memory system in an adaptive computing architecture

Ric Howard, Ramana V. Katragadda
19 Aug 2003
TL;DR: In this article, the authors present a system and method for implementation and use of a shared memory, which allows data to be streamed to multiple processes and nodes without regard to synchronization of the plurality of processes.
Abstract: The present invention provides a system and method for implementation and use of a shared memory. The shared memory may be accessed both independently and asynchronously by one or more processes at corresponding nodes, allowing data to be streamed to multiple processes and nodes without regard to synchronization of the plurality of processes. The various nodes may be adaptive computing nodes, kernel or controller nodes, or one or more host processor nodes. The present invention maintains memory integrity, not allowing memory overruns, underruns, or deadlocks. The present invention also provides for “push back” after a memory read, for applications in which it is desirable to “unread” some elements previously read from the memory.

88 citations
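
The "push back" behaviour described in this abstract can be pictured with a bounded buffer that refuses overruns and underruns and lets a reader rewind over elements it has already consumed. The following is a minimal single-process sketch under that reading; it is not the patented mechanism, and the class name `PushBackBuffer` and its methods are hypothetical.

```python
class PushBackBuffer:
    """Bounded FIFO that rejects overruns/underruns and supports "unread"."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._capacity = capacity
        self._written = 0  # total number of elements ever written
        self._read = 0     # total number of elements currently consumed

    def write(self, item):
        """Store an item; return False instead of overwriting unread data."""
        if self._written - self._read >= self._capacity:
            return False
        self._slots[self._written % self._capacity] = item
        self._written += 1
        return True

    def read(self):
        """Return the next item, or None when nothing is left (no underrun)."""
        if self._read == self._written:
            return None
        item = self._slots[self._read % self._capacity]
        self._read += 1
        return item

    def push_back(self, count):
        """Rewind the read position by `count` elements if they still exist."""
        target = self._read - count
        # The slot for `target` survives only while fewer than `capacity`
        # elements have been written after it.
        if count < 0 or target < 0 or self._written - target > self._capacity:
            return False
        self._read = target
        return True


buf = PushBackBuffer(4)
for value in (1, 2, 3):
    buf.write(value)
first = buf.read()
second = buf.read()
rewound = buf.push_back(1)  # "unread" the element just consumed
again = buf.read()
```

A real multi-process version would also need atomic updates of the two counters; that part is deliberately left out here.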

Hybrid Parallel Programming on HPC Platforms

Rolf Rabenseifner
1 Jan 2003
TL;DR: This paper analyzes the strength and weakness of several parallel programming models on clusters of SMP nodes and analyses strategies to overcome typical drawbacks of this easily usable programming scheme on systems with weaker interconnects.
Abstract: Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. This paper analyzes the strength and weakness of several parallel programming models on clusters of SMP nodes. Benchmark results on a Myrinet cluster and on recent Cray, NEC, IBM, Hitachi, SUN and SGI platforms show that the hybrid-masteronly programming model can be used more efficiently on some vector-type systems, but also on clusters of dual-CPUs. On other systems, one CPU is not able to saturate the inter-node network and the commonly used masteronly programming model suffers from insufficient inter-node bandwidth. This paper analyses strategies to overcome typical drawbacks of this easily usable programming scheme on systems with weaker interconnects. Best performance can be achieved with overlapping communication and computation, but this scheme is lacking in ease of use.

73 citations

Journal Article•10.1177/1094342003017001005•
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Rolf Rabenseifner1, Gerhard Wellein•
University of Stuttgart1
1 Feb 2003
TL;DR: The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming models on hybrid architectures, with a focus on bandwidth and latency and on whether programming paradigms can separate the optimization of communication and computation.
Abstract: Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.

72 citations

Patent•
System and method for sharing memory by heterogeneous processors

Harm Peter Hofstee1, Charles Ray Johns1, James Allan Kahle1•
IBM1
30 Oct 2003
TL;DR: In this paper, a system for sharing memory by heterogeneous processors, each of which is adapted to process its own instruction set, is presented, where a common bus is used to couple the common memory to the various processors.
Abstract: A system for sharing memory by heterogeneous processors, each of which is adapted to process its own instruction set, is presented. A common bus is used to couple the common memory to the various processors. In one embodiment, a cache for more than one of the processors is stored in the shared memory. In another embodiment, some of the processors include a local memory area that is mapped to the shared memory pool. In yet another embodiment, local memory included on one or more of the processors is partially shared so that some of the local memory is mapped to the shared memory area, while remaining memory in the local memory is private to the particular processor.

65 citations

Patent•
Scalable directory based cache coherence protocol

Richard E. Kessler1, Kourosh Gharachorloo1, David H. Asher1•
Hewlett-Packard1
31 Mar 2003
TL;DR: In this article, a system and method for maintaining the coherence of shared data in cache and memory contained in the nodes of a multiprocessing computer system is described and discussed.
Abstract: A system and method is disclosed to maintain the coherence of shared data in cache and memory contained in the nodes of a multiprocessing computer system. The distributed multiprocessing computer system contains a number of processors each connected to main memory. A processor in the distributed multiprocessing computer system is identified as a Home processor for a memory block if it includes the original memory block and a coherence directory for the memory block in its main memory. An Owner processor is another processor in the multiprocessing computer system that includes a copy of the Home processor memory block in a cache connected to its main memory. Whenever an Owner processor is present for a memory block, it is the only processor in the distributed multiprocessing computer system to contain a copy of the Home processor memory block. Eviction of a memory block copy held by an Owner processor in its cache requires a write of the memory block copy to its Home and update of the corresponding coherence directory. No reads of the Home processor directory or modification of other processor cache and main memory is required. The coherence controller in each processor is able to send and receive messages out of order to maintain the coherence of the shared data in cache and main memory. If an out of order message causes an incorrect next program state, the coherence controller is able to restore the prior correct saved program state and resume execution.

65 citations

Proceedings Article•10.1109/HPCA.2003.1183540•
Memory system behavior of Java-based middleware

Magnus Karlsson1, Kevin E. Moore1, Erik Hagersten1, Darien Wood1•
Uppsala University1
8 Feb 2003
TL;DR: A detailed characterization of the memory system behavior of ECperf and SPECjbb is presented using both commercial server hardware and Simics full-system simulation, finding that the memory footprint and primary working sets of these workloads are small compared to other commercial workloads.
Abstract: In this paper, we present a detailed characterization of the memory system behavior of ECperf and SPECjbb using both commercial server hardware and Simics full-system simulation. We find that the memory footprint and primary working sets of these workloads are small compared to other commercial workloads (e.g. on-line transaction processing), and that a large fraction of the working sets are shared between processors. We observed two key differences between ECperf and SPECjbb that highlight the importance of isolating the behavior of the middle tier. First, ECperf has a larger instruction footprint, resulting in much higher miss rates for intermediate-size instruction caches. Second, SPECjbb's data set size increases linearly as the benchmark scales up, while ECperf's remains roughly constant. This difference can lead to opposite conclusions on the design of multiprocessor memory systems, such as the utility of moderate sized (i.e. 1 MB) shared caches in a chip multiprocessor.
Proceedings Article•10.1109/IPDPS.2003.1213486•
Partitioning with space-filling curves on the cubed-sphere

John M. Dennis
22 Apr 2003
TL;DR: Hilbert, Peano, and nested Hilbert m-Peano space filling curves are investigated as the basis of alternative partitioning algorithms that allow a maximum 22% increase in the sustained floating point execution rate versus METIS on O(1000) processors, when running a relatively high resolution climate simulation.
Abstract: Numerical methods for solving the systems of partial differential equations arising in geophysical fluid dynamics rely on a variety of spatial discretization schemes (e.g. finite difference, finite element). For parallel execution on distributed memory computers, the computational domain must be partitioned. The choice of partitioning algorithm can have a significant impact on the sustained floating point execution rate of an atmospheric model. The NCAR spectral element atmospheric model employs a gnomonic projection of a cube onto the surface of the sphere. The six cube faces are each subdivided into an array of quadrilateral spectral elements. When the cubed-sphere is partitioned using METIS, both computational load imbalance and communication requirements can lead to sub-optimal performance. In this paper, Hilbert, Peano, and nested Hilbert m-Peano space filling curves are investigated as the basis of alternative partitioning algorithms. The resulting partitions allow a maximum 22% increase in the sustained floating point execution rate versus METIS on O(1000) processors, when running a relatively high resolution climate simulation.
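
The general idea behind space-filling-curve partitioning can be sketched as follows: order the elements along the curve, then cut the ordered list into equal contiguous pieces, one per processor. The sketch below does this for a single square face using the standard Hilbert index computation. It assumes a power-of-two face size and ignores the Peano and nested variants and the stitching of the six cube faces discussed in the paper; `hilbert_index` and `partition` are illustrative names.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid.

    Assumes n is a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has a standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition(n, parts):
    """Split the n x n elements into `parts` contiguous runs along the curve."""
    cells = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda cell: hilbert_index(n, cell[0], cell[1]),
    )
    base, extra = divmod(len(cells), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)  # spread any remainder evenly
        chunks.append(cells[start:start + size])
        start += size
    return chunks


# Hypothetical example: one 8 x 8 face distributed over 4 processors.
chunks = partition(8, 4)
```

Because consecutive cells on the curve are always grid neighbours, each chunk is a connected region, which keeps the load balanced while limiting the boundary that has to be communicated.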
Patent•
Methods and systems for inter-process copy sharing of data objects

David Whitlock, Robert Bretl
20 Oct 2003
TL;DR: In this paper, the state of a data object of interest is flushed from a memory location local to one of the multiple processes to a shared memory location where the flushed state is visible to the rest of the concurrently executing processes.
Abstract: Data sharing between multiple computer processes is made possible by brokering the sharing of the state of data objects of interest between the multiple processes via a shared memory location. A state of a data object of interest is flushed from a memory location local to one of the multiple processes to a shared memory location wherein the flushed state is visible to the rest of the concurrently executing processes. The instruction to flush may be explicit or implicit via data references. Similarly, a state of a data object in a memory location local to a process may be refreshed with an updated state available in the shared memory location. The state of a data object in a shared memory location or in a local memory location may be determined via data reflection or, if so specified, by serialization methods. The flush and refresh operations may be implemented as function calls exposed to the processes requesting data sharing.
Patent•
Logic multiprocessor for FPGA implementation

Michael R. Butts
23 Sep 2003
TL;DR: In this article, a design verification system utilizing programmable logic devices having varying numbers of logic processors, macro processors, memory processors and general purpose processors programmed therein is disclosed, which can execute Boolean functions, macro operations, memory operations, and other computer instructions.
Abstract: A design verification system utilizing programmable logic devices having varying numbers of logic processors, macro processors, memory processors and general purpose processors programmed therein is disclosed. These various processors can execute Boolean functions, macro operations, memory operations, and other computer instructions. This avoids either the need to implement logic or the need to compile the design into many gate-level Boolean logic operations for logic processors. Improved efficiency in the form of lower cost, lower power and/or higher speeds is the result when verifying certain types of designs.
Patent•
Efficient memory allocation in a wireless transmit/receiver unit

Dick Stephen G1, Terry Stephen E1, Yi-Ju Chao1•
InterDigital, Inc.1
8 Aug 2003
TL;DR: In this article, the authors proposed an effective sharing of the hardware memory of a wireless transmit receive unit (WTRU), where memory will be shared among various buffers of different entities, such as MAC reordering buffers and the RLC reception buffers.
Abstract: The present invention allows for effective sharing of the hardware memory of a wireless transmit receive unit (WTRU). The memory will be shared among various buffers of different entities. More particularly, memory will be shared among the MAC reordering buffers and the RLC reception buffers.
Journal Article•10.1016/S1389-1286(02)00450-4•
A highly flexible, distributed multiprocessor architecture for network processing

Muthu Venkatachalam1, Prashant R. Chandra1, Raj Yavatkar1•
Intel1
05 Apr 2003-Computer Networks
TL;DR: This work describes an architecture that uses a combination of distributed memory architecture and one or more multithreaded processors to achieve the necessary performance and presents a programming model for generic network applications that uses software pipelines.
Shared Memory vs Message Passing

Carole Delporte-Gallet, Hugues Fauconnier, Rachid Guerraoui
1 Jan 2003
TL;DR: Various fundamental distributed computing abstractions, including shared memory, consensus, as well as terminating reliable broadcast, are, in a precise sense, equivalent.
Abstract: This paper determines the computational strength of the shared memory abstraction (a register) emulated over a message passing system, and compares it with fundamental message passing abstractions like consensus and various forms of reliable broadcast. We introduce the notion of Quorum failure detectors and show that this notion captures the exact amount of information about failures needed to emulate a shared memory in a distributed message passing system where processes can fail by crashing. As a corollary of our result, we determine the weakest failure detector to implement consensus in any environment, including those where half of the processes can crash. We also use our result to precisely compare the strength of the shared memory abstraction with other distributed computing abstractions. In particular, we show that, in the general environment where up to $n-1$ processes can crash (out of the total number of processes $n$), various fundamental distributed computing abstractions, including shared memory, consensus, as well as terminating reliable broadcast, are, in a precise sense, equivalent.
Journal Article•10.1109/TPDS.2003.1199067•
Information-flow models for shared memory with an application to the PowerPC architecture

Allon Adir1, Hagit Attiya2, G. Shurek1•
University of Haifa1, Technion – Israel Institute of Technology2
01 May 2003-IEEE Transactions on Parallel and Distributed Systems
TL;DR: A generic framework for defining instructions, programs, and the semantics of their instantiation by operations in a multiprocessor environment that allows an architect to reveal the programming view induced by a shared-memory architecture and guides architecture-level verification.
Abstract: This paper introduces a generic framework for defining instructions, programs, and the semantics of their instantiation by operations in a multiprocessor environment. The framework captures information flow between operations in a multiprocessor program by means of a reads-from mapping from read operations to write operations. Two fundamental relations are defined on the operations: a program order between operations which instantiate the program of some processor and view orders which are specific to each shared memory model. An operation cannot read from the "hidden" past or from the future; the future and the past causality can be examined either relative to the program order or relative to the view orders. A shared memory model specifies, for a given program, the permissible transformation of resource states. The memory model should reflect the programmer's view by citing the guaranteed behavior of the multiprocessor in the interface visible to the programmer. The model should refrain from dictating the design practices that should be followed by the implementation. Our framework allows an architect to reveal the programming view induced by a shared-memory architecture; it serves programmers exploring the limits of the programming interface and guides architecture-level verification. The framework is applicable for complex, commercial architectures as it can capture subtle programming-interface details, exposing the underlying aggressive microarchitecture mechanisms. As an illustration, we define the shared memory model supported by the PowerPC architecture, within our framework.
Proceedings Article•10.1145/781498.781505•
Programming the FlexRAM parallel intelligent memory system

Basilio B. Fraguela, Jose Renau1, Paul Feautrier2, David Padua1, Josep Torrellas1 •
University of Illinois at Urbana–Champaign1, École normale supérieure de Lyon2
11 Jun 2003
TL;DR: This paper proposes a family of high-level compiler directives inspired by OpenMP called CFlex and proposes libraries of highly-optimized functions called Intelligent Memory Operations (IMOs), which program the processors in memory through CFlex, but make them completely transparent to the programmer.
Abstract: In an intelligent memory architecture, the main memory of a computer is enhanced with many simple processors. The result is a highly-parallel, heterogeneous machine that is able to exploit computation in the main memory. While several instantiations of this architecture have been proposed, the question of how to effectively program them with little effort has remained a major challenge. In this paper, we show how to effectively hand-program an intelligent memory architecture at a high level and with very modest effort. We use FlexRAM as a prototype architecture. To program it, we propose a family of high-level compiler directives inspired by OpenMP called CFlex. Such directives enable the processors in memory to execute the program in cooperation with the main processor. In addition, we propose libraries of highly-optimized functions called Intelligent Memory Operations (IMOs). These functions program the processors in memory through CFlex, but make them completely transparent to the programmer. Simulation results show that, with CFlex and IMOs, a server with 64 simple processors in memory runs on average 10 times faster than a conventional server. Moreover, a set of conventional programs with 240 lines on average are transformed into CFlex parallel form with only 7 CFlex directives and 2 additional statements on average.
Journal Article•
The Earth Simulator system

Shinichi Habata, Mitsuo Yokokawa, Shigemune Kitawaki
01 Jan 2003-Nec Research & Development
TL;DR: The Earth Simulator achieved 35.86TFLOPS, or 87.5% of peak performance of the system, in LINPACK benchmark, and has been proven as the most powerful supercomputer in the world.
Abstract: The Earth Simulator, developed under the Japanese government's Earth Simulator Project initiative, is a highly parallel vector supercomputer system that consists of 640 processor nodes and an interconnection network. The processor node is a shared memory parallel vector supercomputer, in which 8 vector processors, each able to deliver 8GFLOPS, are tightly connected to a shared memory, for a peak performance of 64GFLOPS. The interconnection network is a huge non-blocking crossbar switch linking 640 processor nodes and supports global addressing and synchronization. The aggregate peak vector performance of the Earth Simulator is 40TFLOPS, and the intercommunication bandwidth between every two processor nodes is 12.3GB/s in each direction. The aggregate switching capacity of the interconnection network is 7.87TB/s. To realize a high-performance and high-efficiency computer system, three architectural features are applied in the Earth Simulator: vector processor, shared-memory and high-bandwidth non-blocking interconnection crossbar network. The Earth Simulator achieved 35.86TFLOPS, or 87.5% of peak performance of the system, in the LINPACK benchmark, and has been proven as the most powerful supercomputer in the world. It also achieved 26.58TFLOPS, or 64.9% of peak performance of the system, for a global atmospheric circulation model with the spectral method. This record-breaking sustained performance makes this innovative system a very effective scientific tool for providing solutions to the sustainable development of humankind and its symbiosis with the planet Earth.
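
The performance figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The short script below uses only numbers from the abstract; the variable names are illustrative. It shows that the quoted 40TFLOPS is a rounding of 40.96 TFLOPS, and that the two efficiency percentages are consistent with that unrounded peak.

```python
NODES = 640
PROCESSORS_PER_NODE = 8
GFLOPS_PER_PROCESSOR = 8

# Peak per node: 8 processors x 8 GFLOPS each.
node_peak_gflops = PROCESSORS_PER_NODE * GFLOPS_PER_PROCESSOR

# Peak for the whole machine, converted from GFLOPS to TFLOPS.
system_peak_tflops = NODES * node_peak_gflops / 1000

# Sustained results quoted in the abstract, as a percentage of peak.
linpack_efficiency = 100 * 35.86 / system_peak_tflops
atmosphere_efficiency = 100 * 26.58 / system_peak_tflops

print(f"node peak:   {node_peak_gflops} GFLOPS")
print(f"system peak: {system_peak_tflops:.2f} TFLOPS")
print(f"LINPACK:     {linpack_efficiency:.1f}% of peak")
print(f"atmosphere:  {atmosphere_efficiency:.1f}% of peak")
```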
Book Chapter•10.1007/978-3-642-56103-0_12•
A Particle-Partition of Unity Method-Part IV: Parallelization

Michael Griebel1, Marc Alexander Schweitzer1•
University of Bonn1
1 Jan 2003
TL;DR: This chapter parallelizes a multilevel partition of unity method for distributed memory computers; numerical results show optimal scaling of the algorithm in both the discretization and the solution phase.
Abstract: In this sequel to [7, 8, 9] we focus on the parallelization of our multilevel partition of unity method for distributed memory computers. The presented parallelization is based on a data decomposition approach which utilizes a key-based tree implementation and a weighted space filling curve ordering scheme for the load balancing problem. We present numerical results in two and three dimensions with up to 128 processors and 42 million degrees of freedom. These results show the optimal scaling behavior of our algorithm in the discretization as well as the solution phase.
Patent•
Scalable runtime system for global address space languages on shared and distributed memory machines

Roch Georges Archambault1, Anthony Bolmarcich1, G. Calin Cascaval1, Siddhartha Chatterjee1, Maria Eleftheriou1, Raymond Ying Chau Mak1 •
IBM1
12 Dec 2003
TL;DR: In this article, an improved scalability runtime system for a global address space language running on a distributed or shared memory machine uses a directory of shared variables having a data structure for tracking shared variable information that is shared by program threads.
Abstract: An improved scalability runtime system for a global address space language running on a distributed or shared memory machine uses a directory of shared variables having a data structure for tracking shared variable information that is shared by a plurality of program threads. Allocation and de-allocation routines are used to allocate and de-allocate shared variable entries in the directory of shared variables. Different routines can be used to access different types of shared data. A control structure is used to control access to the shared data such that all threads can access the data at any time. Since all threads see the same objects, synchronization issues are eliminated. In addition, the improved efficiency of the data sharing allows the number of program threads to be vastly increased.
Proceedings Article•10.1145/781498.781532•
Using generative design patterns to generate parallel code for a distributed memory environment

Kai Tan1, Duane Szafron1, Jonathan Schaeffer1, John Anvik1, Steve MacDonald2 •
University of Alberta1, University of Waterloo2
11 Jun 2003
TL;DR: This paper describes generalizing CO2P3S to generate distributed-memory parallel solutions and uses generative design patterns to take an application specification and use it to generate parallel application code that achieves good performance in shared memory and distributed memory environments.
Abstract: A design pattern is a mechanism for encapsulating the knowledge of experienced designers into a re-usable artifact. Parallel design patterns reflect commonly occurring parallel communication and synchronization structures. Our tools, CO2P3S (Correct Object-Oriented Pattern-based Parallel Programming System) and MetaCO2P3S, use generative design patterns. A programmer selects the parallel design patterns that are appropriate for an application, and then adapts the patterns for that specific application by selecting from a small set of code-configuration options. CO2P3S then generates a custom framework for the application that includes all of the structural code necessary for the application to run in parallel. The programmer is only required to write simple code that launches the application and to fill in some application-specific sequential hook routines. We use generative design patterns to take an application specification (parallel design patterns + sequential user code) and use it to generate parallel application code that achieves good performance in shared memory and distributed memory environments. Although our implementations are for Java, the approach we describe is tool and language independent. This paper describes generalizing CO2P3S to generate distributed-memory parallel solutions.
Patent•
A portable communication apparatus and methods for match-making with distributed memory

Stephen J. Carlton, Olle Bliding, Jonas Runesson, Anna Emanuelsson, Lars Knutsson 
17 Apr 2003
TL;DR: A transceiver is adapted for wireless communication with a first remote communication apparatus, so as to transmit the local match-making data to that apparatus and to receive first remote match-making data associated with it.
Abstract: A portable communication apparatus for match-making with a plurality of remote communication apparatuses has memory means adapted to store match-making data, including local match-making data associated with the portable communication apparatus. A transceiver is adapted for wireless communication with a first remote communication apparatus, so as to transmit the local match-making data to the first remote communication apparatus, as well as to receive first remote match-making data associated with the first remote communication apparatus. A processing device is adapted to perform a correlation analysis between the local match-making data and the first remote match-making data and—in case a result of the correlation analysis indicates a match between the portable communication apparatus and the first remote communication apparatus—provide an alert to a user of the portable communication apparatus. The memory means is adapted to store the first remote match-making data, and the transceiver is adapted to transmit the first remote match-making data, in addition to the local match-making data, to a second remote communication apparatus.
Proceedings Article•10.1109/IPDPS.2003.1213137•
Quantifying locality effect in data access delay: memory logP

Kirk W. Cameron1, Xian-He Sun2•
University of South Carolina1, Illinois Institute of Technology2
22 Apr 2003
TL;DR: This work presents a simple and useful model of point-to-point memory communication to predict and analyze the latency of memory copy, pack and unpack and uses the model to isolate contributions of hardware, middleware, and software to data transfers on Intel- and MIPS-based platforms.
Abstract: The application of hardware-parameterized models to distributed systems can result in omission of key bottlenecks such as the full cost of inter-node communication in a shared memory cluster. However, inclusion in the model of message characteristics and complex memory hierarchies may result in impractical models. Nonetheless, the growing gap between memory and CPU performance combined with the trend toward large scale clustered shared memory platforms implies an increased need to consider the impact of local memory communication on parallel processing in distributed systems. We present a simple and useful model of point-to-point memory communication to predict and analyze the latency of memory copy, pack and unpack. We use the model to isolate contributions of hardware, middleware, and software to data transfers on Intel- and MIPS-based platforms.
Journal Article•10.1007/S10009-002-0093-2•
Scalable distributed on-the-fly symbolic model checking

Shoham Ben-David1, Orna Grumberg2, Tamir Heyman2, Tamir Heyman1, Assaf Schuster2 •
IBM1, Technion – Israel Institute of Technology2
01 Aug 2003-International Journal on Software Tools for Technology Transfer
TL;DR: A scalable method for parallel symbolic on-the-fly model checking in a distributed memory environment that can check models that are too large to fit in the memory of a single node and an efficient, BDD-based algorithm for a distributed construction of a counterexample is suggested.
Abstract: This paper presents a scalable method for parallel symbolic on-the-fly model checking in a distributed memory environment. Our method combines a scheme for on-the-fly model checking for safety properties with a scheme for scalable reachability analysis. We suggest an efficient, BDD-based algorithm for a distributed construction of a counterexample. The extra memory requirement for counterexample generation is evenly distributed among the processes by a memory balancing procedure. At no point during computation does the memory of a single process contain all the data. This enhances scalability. Collaboration between the parallel processes during counterexample generation reduces memory utilization for the backward step. We implemented our method on a standard, loosely-connected environment of workstations, using a high-performance model checker. Our initial performance evaluation, carried out on several large circuits, shows that our method can check models that are too large to fit in the memory of a single node. Our on-the-fly approach may find counterexamples even when the model is too large to fit in the memory of the parallel system.
Book Chapter•10.1007/978-3-540-24644-2_13•
Evaluating the impact of programming language features on the performance of parallel applications on cluster architectures

Konstantin Berlin1, Jun Huan2, Mary Jacob3, Garima Kochhar3, Jan F. Prins2, Bill Pugh1, P. Sadayappan3, Jaime Spacco1, Chau-Wen Tseng1 •
University of Maryland, College Park1, University of North Carolina at Chapel Hill2, Ohio State University3
2 Oct 2003
TL;DR: It is found that language features can make parallel programs easier to write, but cannot hide the underlying communication costs for the target parallel architecture.
Abstract: We evaluate the impact of programming language features on the performance of parallel applications on modern parallel architectures, particularly for the demanding case of sparse integer codes. We compare a number of programming languages (Pthreads, OpenMP, MPI, UPC) on both shared and distributed-memory architectures. We find that language features can make parallel programs easier to write, but cannot hide the underlying communication costs for the target parallel architecture. Powerful compiler analysis and optimization can help reduce software overhead, but features such as fine-grain remote accesses are inherently expensive on clusters. To avoid large reductions in performance, language features must avoid degrading the performance of local computations.
Journal Article•10.1016/J.PARCO.2003.05.015•
An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors

Akira Takeuchi1, Fumihiko Ino1, Kenichi Hagihara1•
Osaka University1
1 Nov 2003
TL;DR: An improvement on the binary-swap (BS) method, an efficient image compositing algorithm for sort-last parallel rendering, that adds three acceleration techniques to the original BS method: interleaved splitting, multiple bounding rectangle, and run-length encoding.
Abstract: Sort-last parallel rendering is a good rendering scheme on distributed memory multiprocessors. This paper presents an improvement on the binary-swap (BS) method, which is an efficient image compositing algorithm for sort-last parallel rendering. Our compositing method uses three acceleration techniques, compared to the original BS method: (1) the interleaved splitting, (2) multiple bounding rectangle, and (3) run-length encoding. Through the use of the three techniques, our method balances the compositing workload among processors, exploits more sparsity of the image, and reduces the cost of communication. We also show some experimental results on a PC cluster. The results show that our method completes the image compositing faster than the original BS method, and its speedup over the original increases with the number of processors.
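For readers unfamiliar with the baseline, the following sketch simulates the original binary-swap scheme in a single process. It does not include the paper's three accelerations. It assumes a power-of-two number of ranks, pixels stored as `(depth, color)` tuples, and a depth-test composite (nearest pixel wins, implemented with `min`). The function name `binary_swap` and this pixel representation are choices made for this example, not taken from the paper.

```python
def binary_swap(images):
    """Simulate binary-swap compositing over P = 2^k ranks.

    images -- list of P equal-length lists of (depth, color) pixels,
              one full-size partial image per rank
    Returns one (start, end, pixels) triple per rank: the slice of the
    final image that the rank owns after log2(P) exchange stages.
    """
    p = len(images)
    n = len(images[0])
    assert p > 0 and p & (p - 1) == 0, "rank count must be a power of two"
    spans = [(0, n)] * p                  # every rank starts with the whole image
    bufs = [list(img) for img in images]
    stride = 1
    while stride < p:
        new_bufs = [list(b) for b in bufs]
        new_spans = list(spans)
        for rank in range(p):
            partner = rank ^ stride       # partner differs in exactly one bit
            lo, hi = spans[rank]          # partner holds the same span
            mid = (lo + hi) // 2
            # The lower rank of the pair keeps the first half, the higher the second;
            # the other half is what would be sent to the partner.
            keep = (lo, mid) if rank < partner else (mid, hi)
            for i in range(*keep):
                # Depth-test composite: keep the pixel with the smaller depth.
                new_bufs[rank][i] = min(bufs[rank][i], bufs[partner][i])
            new_spans[rank] = keep
        bufs, spans = new_bufs, new_spans
        stride *= 2
    return [(lo, hi, bufs[r][lo:hi]) for r, (lo, hi) in enumerate(spans)]
```

Each stage halves the region a rank is responsible for, so after log2(P) stages every rank holds a fully composited 1/P of the image. An order-dependent operator such as alpha blending would need the pairings to respect depth order, which this simplified depth-test version sidesteps.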
Patent•
Method and apparatus for an energy efficient operation of multiple processors in a memory

Duncan G. Elliott, W. Martin Snelgrove
6 May 2003
TL;DR: A method of operating a digital computer in which a row of data is read from memory, the same computational instruction is provided simultaneously to each of a plurality of processor elements, each selectively coupled to a corresponding bit position of the row, and the result is written back to the same row the data was read from.
Abstract: A method of operating a digital computer includes the steps of addressing a memory, reading a row of data from the memory, providing the same computational instruction simultaneously to each processor element of a plurality of processor elements, where each of the processor elements is selectively coupled to a corresponding bit position of the memory row of data, performing the same computational operation on a selected plurality of data bits in parallel, and writing the result into the memory at the same row as the data was read from.
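The read, operate, and write-back sequence in this abstract can be mimicked in software. The sketch below is only an analogy: a Python list comprehension stands in for the hardware processor elements, and the function name `row_operation`, the bit-list memory model, and the boolean `select` mask are assumptions for this example rather than details from the patent.

```python
def row_operation(memory, row, op, select):
    """Apply one instruction to every selected bit position of a memory row.

    memory -- list of rows, each a list of 0/1 bits
    row    -- index of the row to address
    op     -- function bit -> bit; the same instruction for every element
    select -- list of booleans marking which processor elements take part
    """
    data = memory[row]                    # read the addressed row
    result = [op(bit) if on else bit      # one processor element per bit position
              for bit, on in zip(data, select)]
    memory[row] = result                  # write back to the same row
    return result


# Usage: invert only the first two bit positions of row 0.
memory = [[1, 0, 1, 1], [0, 0, 0, 0]]
updated = row_operation(memory, 0, lambda b: b ^ 1, [True, True, False, False])
```

In hardware all selected elements would operate in the same cycle; the loop here is sequential and only shows the data flow.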
Journal Article•10.1145/773365.773367•
Fresh Breeze: a multiprocessor chip architecture guided by modular programming principles

Jack B. Dennis1•
Massachusetts Institute of Technology1
01 Mar 2003-ACM Sigarch Computer Architecture News
TL;DR: The Fresh Breeze project concerns the architecture and design of a multiprocessor chip that can achieve superior performance while honoring six important principles for supporting modular software construction that are often violated by the architectures proposed for multiprocessor computer systems.
Abstract: It is well-known that multiprocessor systems are vastly more difficult to program than systems that support sequential programming models. In a 1998 paper [11] this author argued that six important principles for supporting modular software construction are often violated by the architectures proposed for multiprocessor computer systems. The Fresh Breeze project concerns the architecture and design of a multiprocessor chip that can achieve superior performance while honoring these six principles. The envisioned multiprocessor chip will incorporate three ideas that are significant departures from mainstream thinking about multiprocessor architecture: (1) Simultaneous multithreading has been shown to have performance advantages relative to contemporary superscalar designs. This advantage can be exploited through use of a programming model that exposes parallelism in the form of multiple threads of computation. (2) The value of a shared address space is widely appreciated. Through the use of 64-bit pointers, the conventional distinction between "memory" and the file system can be abolished. This can provide a superior execution environment in support of program modularity and software reuse, as well as supporting multi-user data protection and security that is consistent with modular software structure. (3) No memory update; cycle-free heap. Data items are created, used, and released, but never modified once created. The allocation, release, and garbage collection of fixed-size chunks of memory will be implemented by efficient hardware mechanisms. A major benefit of this choice is that the multiprocessor cache coherence problem vanishes: any object retrieved from the memory system is immutable. In addition, it is easy to prevent the formation of pointer cycles, simplifying the design of memory management support.