Top 360 papers published in the topic of Distributed memory in 1998

Showing papers on "Distributed memory" published in 1998

Proceedings Article•10.1145/278008.278027•

GloMoSim: a library for parallel simulation of large-scale wireless networks

Xiang Zeng¹, Rajive Bagrodia¹, Mario Gerla¹•Institutions (1)

1 Jul 1998

TL;DR: The paper describes the GloMoSim library, addresses a number of issues relevant to its parallelization, and presents a set of experimental results on the IBM 9076 SP, a distributed memory multicomputer.

Abstract: A number of library based parallel and sequential network simulators have been designed. The paper describes a library, called GloMoSim (Global Mobile system Simulator), for parallel simulation of wireless networks. GloMoSim has been designed to be extensible and composable: the communication protocol stack for wireless networks is divided into a set of layers, each with its own API. Models of protocols at one layer interact with those at a lower (or higher) layer only via these APIs. The modular implementation enables consistent comparison of multiple protocols at a given layer. The parallel implementation of GloMoSim can be executed using a variety of conservative synchronization protocols, which include the null message and conditional event algorithms. The paper describes the GloMoSim library, addresses a number of issues relevant to its parallelization, and presents a set of experimental results on the IBM 9076 SP, a distributed memory multicomputer. These experiments use models constructed from the library modules.

1,549 citations

Journal Article•

Multifrontal parallel distributed symmetric and unsymmetric solvers

Patrick R. Amestoy, Iain S. Duff, Jean-Yves L'Excellent

01 Jan 1998-Rutherford Appleton Laboratory Technical Reports

TL;DR: A new parallel distributed memory multifrontal approach with dynamic scheduling of the computing tasks has been developed to handle numerical pivoting efficiently, and performance analysis shows the efficiency and the potential of the method.

Abstract: We consider the solution of both symmetric and unsymmetric systems of sparse linear equations. A new parallel distributed memory multifrontal approach is described. To handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed. We discuss some of the main algorithmic choices and compare both implementation issues and the performance of the LDLᵀ and LU factorizations. Performance analysis on an IBM SP2 shows the efficiency and the potential of the method. The test problems used are from the Rutherford–Boeing collection and from the PARASOL end users.

655 citations

Patent•

Data processing system and method

John Iselin Woodfill¹, H. Harlyn Baker¹, Brian Von Herzen¹, Robert Dale Alkire¹•Institutions (1)

Interval Research Corporation¹

2 Apr 1998

TL;DR: In this article, a reconfigurable image processing system with a toroidal topology, distributed memory, and wide bandwidth I/O is described, which is capable of solving real applications at real-time speeds.

Abstract: A powerful, scaleable, and reconfigurable image processing system and method of processing data therein is described. This general purpose, reconfigurable engine with toroidal topology, distributed memory, and wide bandwidth I/O is capable of solving real applications at real-time speeds. The reconfigurable image processing system can be optimized to efficiently perform specialized computations, such as real-time video and audio processing. This reconfigurable image processing system provides high performance via high computational density, high memory bandwidth, and high I/O bandwidth. Generally, the reconfigurable image processing system and its control structure include a homogeneous array of 16 field programmable gate arrays (FPGA) and 16 static random access memories (SRAM) arranged in a partial torus configuration. The reconfigurable image processing system also includes a PCI bus interface chip, a clock control chip, and a datapath chip. It can be implemented in a single board. It receives data from its external environment, computes correspondence, and uses the results of the correspondence computations for various post-processing industrial applications. The reconfigurable image processing system determines correspondence by using non-parametric local transforms followed by correlation. These non-parametric local transforms include the census and rank transforms. Other embodiments involve a combination of correspondence, rectification, a left-right consistency check, and the application of an interest operator.

537 citations

Journal Article•10.1016/S0167-8191(98)00093-3•

BSPlib: The BSP programming library

Jonathan M. D. Hill¹, Bill McColl¹, Dan C. Stefanescu²,³, Mark W. Goudreau⁴, Kevin J. Lang⁵, Satish Rao⁵, Torsten Suel⁶, Thanasis Tsantilas⁷, Rob H. Bisseling⁸•Institutions (8)

University of Oxford¹, Suffolk University², Harvard University³, University of Central Florida⁴, Princeton University⁵, Alcatel-Lucent⁶, Columbia University⁷, Utrecht University⁸

1 Dec 1998

TL;DR: This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use, and briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics.

Abstract: BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access (DRMA) using put or get operations, and bulk synchronous message passing (BSMP). Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms (FFTs), sorting, and molecular dynamics.
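
The DRMA style mentioned in the abstract can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions: the class `ToyBSP` and the function `shift_right` are hypothetical names invented for illustration and are not part of the BSPlib API. It models one property of bulk synchronous programs: a put issued during a superstep becomes visible at the destination only after the next synchronization.

```python
# Toy, single-process model of BSP direct remote memory access (DRMA).
# All names here are hypothetical; this is not the BSPlib API.

class ToyBSP:
    def __init__(self, nprocs, words_per_proc):
        # One private memory array per simulated processor.
        self.mem = [[0] * words_per_proc for _ in range(nprocs)]
        self.pending = []  # puts issued in the current superstep

    def put(self, dest, offset, value):
        # A put is only queued; it is not visible until the next sync.
        self.pending.append((dest, offset, value))

    def sync(self):
        # Barrier: deliver all queued puts, then start a new superstep.
        for dest, offset, value in self.pending:
            self.mem[dest][offset] = value
        self.pending.clear()


def shift_right(nprocs):
    """Each simulated processor puts its id into its right neighbour."""
    bsp = ToyBSP(nprocs, 1)
    for pid in range(nprocs):
        bsp.put((pid + 1) % nprocs, 0, pid)
    before = [m[0] for m in bsp.mem]  # memory state before the barrier
    bsp.sync()
    after = [m[0] for m in bsp.mem]   # memory state after the barrier
    return before, after
```

Calling `shift_right(4)` returns the memory contents before and after the barrier, which makes the superstep boundary explicit.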

365 citations

Journal Article•10.1016/S0045-7825(97)00183-7•

Large-scale simulation of elastic wave propagation in heterogeneous media on parallel computers

Hesheng Bao¹, Jacobo Bielak¹, Omar Ghattas¹, Loukas F. Kallivokas, David R. O'Hallaron¹, Jonathan Richard Shewchuk¹, Jifeng Xu¹•Institutions (1)

Carnegie Mellon University¹

22 Jan 1998-Computer Methods in Applied Mechanics and Engineering

TL;DR: A system that helps automate the task of writing efficient, portable unstructured mesh solvers for distributed memory parallel supercomputers; results indicate that, despite the highly irregular structure of the problem, excellent performance and scalability are achieved.

345 citations

Patent•

Software partitioned multi-processor system with flexible resource sharing levels

Stephen H. Zalewski, Andrew H. Mason, Gregory H. Jordan¹, Karen L. Noel•Institutions (1)

Hewlett-Packard¹

10 Jun 1998

TL;DR: In this article, a configuration tree is used to partition a single multiprocessor into multiple partitions, each running a distinct copy, or instance, of an operating system; each of the partitions has access to its own physical resources plus resources designated as shared.

Abstract: Multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each running a distinct copy, or instance, of an operating system. Each of the partitions has access to its own physical resources plus resources designated as shared. The partitioning is performed by assigning all resources with a configuration tree. None, some, or all, resources may be designated as shared among multiple partitions. Each individual operating instance will generally be assigned the resources it needs to execute independently and these resources will be designated as “private.” Other resources, particularly memory, can be assigned to more than one instance and shared. Shared memory is cache coherent so that instances may be tightly coupled, and may share resources that are normally allocated to a single instance. This allows previously distributed user or operating system applications which usually must pass messages via an external interconnect to operate cooperatively in the shared memory without the need for either an external interconnect or message passing. Examples of applications that could take advantage of this capability include distributed lock managers and cluster interconnects. Newly-added resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration.

269 citations

Journal Article•10.1109/71.655248•

Optimal scheduling algorithm for distributed-memory machines

Sekhar Darbha¹, Dharma P. Agrawal²•Institutions (2)

Rutgers University¹, North Carolina State University²

01 Jan 1998-IEEE Transactions on Parallel and Distributed Systems

TL;DR: A Task Duplication based Scheduling (TDS) algorithm which can schedule directed acyclic graphs (DAGs) with a complexity of O(|V|²), where |V| is the number of tasks in the DAG.

Abstract: Task Scheduling is one of the key elements in any distributed-memory machine (DMM), and an efficient algorithm can help reduce the interprocessor communication time. As optimal scheduling of tasks to DMMs is a strong NP-hard problem, many heuristic algorithms have been introduced in the literature. This paper presents a Task Duplication based Scheduling (TDS) algorithm which can schedule directed acyclic graphs (DAGs) with a complexity of O(|V|²), where |V| is the number of tasks in the DAG. This algorithm generates an optimal schedule for a class of DAGs which satisfy a simple cost relationship. The performance of the algorithm has been observed by its application to some practical DAGs, and by comparing it with other existing scheduling schemes in terms of the schedule length and algorithm complexity.

223 citations

Proceedings Article•10.1145/277650.277748•

Memory management with explicit regions

Alex Aiken¹•Institutions (1)

University of California, Berkeley¹

1 May 1998

TL;DR: It is shown that on a suite of allocation-intensive C programs, regions are competitive with malloc/free and sometimes substantially faster and that regions support safe memory management with low overhead.

Abstract: Much research has been devoted to studies of and algorithms for memory management based on garbage collection or explicit allocation and deallocation. An alternative approach, region-based memory management, has been known for decades, but has not been well-studied. In a region-based system each allocation specifies a region, and memory is reclaimed by destroying a region, freeing all the storage allocated therein. We show that on a suite of allocation-intensive C programs, regions are competitive with malloc/free and sometimes substantially faster. We also show that regions support safe memory management with low overhead. Experience with our benchmarks suggests that modifying many existing programs to use regions is not difficult.
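
The region idea described in the abstract (each allocation specifies a region, and destroying the region frees everything allocated in it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper works with C programs, while the `Region` class and its methods below are hypothetical names chosen for this sketch.

```python
class Region:
    """Toy region: every allocation belongs to the region and dies with it."""

    def __init__(self):
        self._objects = []
        self._alive = True

    def alloc(self, value):
        # Each allocation names its region (here, the receiver of the call).
        if not self._alive:
            raise RuntimeError("allocation in a destroyed region")
        self._objects.append(value)
        return len(self._objects) - 1  # handle into this region

    def get(self, handle):
        if not self._alive:
            raise RuntimeError("access to a destroyed region")
        return self._objects[handle]

    def destroy(self):
        # Reclaim everything at once; no per-object free is needed.
        freed = len(self._objects)
        self._objects.clear()
        self._alive = False
        return freed
```

The point of the design is that deallocation cost is paid once per region rather than once per object, and that a use after `destroy` is detectable at the region level.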

220 citations

Patent•

Method and apparatus for dynamically sharing memory in a multiprocessor system

Karen L. Noel, Gregory H. Jordan, Paul K. Harter, Thomas Benson

10 Jun 1998

TL;DR: In this article, the authors describe a system where multiple instances of operating systems execute cooperatively in a single multiprocessor computer where all processors and resources are electrically connected together.

Abstract: Multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each with the ability to run a distinct copy, or instance, of an operating system. At different times, different operating system instances may be loaded on a given partition. Resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration. The partitions themselves can also be changed without rebooting the system by modifying the configuration tree. A grouping of partitions, a community, shares memory. Memory may be private to a particular partition or may be shared by partitions within a community. When a community is formed the creating instance reads a configuration tree and builds management structures for the shared memory owned by the community. A single system may have one or more communities, each with their own representation within the configuration tree.

196 citations

Journal Article•10.1109/2.707614•

Multiprocessors should support simple memory consistency models

Mark D. Hill¹•Institutions (1)

University of Wisconsin-Madison¹

01 Aug 1998-IEEE Computer

TL;DR: The author argues that multiprocessors should support SC because, with speculative execution, relaxed models do not provide sufficient additional performance to justify exposing their complexity to the authors of low level software.

Abstract: In the future, many computers will contain multiple processors, in part because the marginal cost of adding a few additional processors is so low that only minimal performance gain is needed to make the additional processors cost effective. Intel, for example, now makes cards containing four Pentium Pro processors that can easily be incorporated into a system. Multiple processor cards like Intel's will help multiprocessing spread from servers to the desktop. But how will these multiprocessors be programmed? The evolution of the programming model is already under way. One important function of the programming model is to describe how memory operates. For a multiprocessor, a reasonable model is sequential consistency (SC), which makes a multiprocessor behave like a multitasking uniprocessor. Nevertheless, many commercial multiprocessors support more relaxed memory models. The author argues that multiprocessors should support SC because, with speculative execution, relaxed models do not provide sufficient additional performance to justify exposing their complexity to the authors of low level software.

171 citations

Patent•

Shared memory initialization method for system having multiple processor capability

David L. Collins

18 Sep 1998

TL;DR: In this paper, the authors propose an approach to parallel execution of system initialization tasks by at least two processors to reduce system boot-up time in a multiprocessor computer system.

Abstract: A multiprocessor computer system is provided with a BIOS that allows parallel execution of system initialization tasks by at least two processors to reduce system boot-up time. At power-on, one of the processors is designated as a bootstrap processor and the remaining processors are designated as application processors. The processors are coupled to a shared memory module by a shared processor bus. The bootstrap processor is configured to instruct the application processor to test and initialize memory locations in the shared memory module while the bootstrap processor proceeds with other system initialization tasks which may include determining the system configuration, initializing peripheral devices, testing the keyboard, and setting up the BIOS data area with configuration information. After completing its tasks, the bootstrap processor determines whether the application processor has completed the memory test, and if so, the bootstrap processor proceeds to locate and execute an operating system. It is expected that testing and initializing memory in parallel with other system initialization tasks will advantageously reduce system boot-up time in multiprocessor systems having large memories (e.g. 1-4 gigabytes).

Patent•

Multiprocessor computer architecture incorporating a plurality of memory algorithm processors in the memory subsystem

Jon M. Huppenthal, Paul A. Leskar

3 Dec 1998

Abstract: A multiprocessor computer architecture incorporating a plurality of programmable hardware memory algorithm processors (“MAP”) in the memory subsystem. The MAP may comprise one or more field programmable gate arrays (“FPGAs”) which function to perform identified algorithms in conjunction with, and tightly coupled to, a microprocessor and each MAP is globally accessible by all of the system processors for the purpose of executing user definable algorithms. A circuit within the MAP signals when the last operand has completed its flow thereby allowing a given process to be interrupted and thereafter restarted. Through the use of read only memory (“ROM”) located adjacent the FPGA, a user program may use a single command to select one of several possible pre-loaded algorithms thereby decreasing system reconfiguration time. A computer system memory structure MAP disclosed herein may function in normal or direct memory access (“DMA”) modes of operation and, in the latter mode, one device may feed results directly to another thereby allowing pipelining or parallelizing execution of a user defined algorithm. The system of the present invention also provides a user programmable performance monitoring capability and utilizes parallelizer software to automatically detect parallel regions of user applications containing algorithms that can be executed in the programmable hardware.

Proceedings Article•10.5555/509058.509092•

Making Sparse Gaussian Elimination Scalable by Static Pivoting

Xiaoye S. Li¹, James Demmel²•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

7 Nov 1998

TL;DR: The main advantage of the new method over partial pivoting is that it permits a priori determination of data structures and communication pattern for Gaussian elimination, which makes it more scalable on distributed memory machines.

Abstract: We propose several techniques as alternatives to partial pivoting to stabilize sparse Gaussian elimination. From numerical experiments we demonstrate that for a wide range of problems the new method is as stable as partial pivoting. The main advantage of the new method over partial pivoting is that it permits a priori determination of data structures and communication pattern for Gaussian elimination, which makes it more scalable on distributed memory machines. Based on this a priori knowledge, we design highly parallel algorithms for both sparse Gaussian elimination and triangular solve and we show that they are suitable for large-scale distributed memory machines.

Patent•

Circuit and method for maintaining order of memory access requests initiated by devices coupled to a multiprocessor system

James B. Keller¹, Dale E. Gulick¹, Larry D. Hewitt¹, Geoffrey S. Strongin¹•Institutions (1)

Advanced Micro Devices¹

23 Dec 1998

TL;DR: In this article, a method for preserving the order for memory requests originating from I/O devices coupled to a multiprocessor computer system is presented, in which the pipe identifications and target node numbers of the first and second non-coherent memory access transactions are compared.

Abstract: A circuit and method is disclosed for preserving the order for memory requests originating from I/O devices coupled to a multiprocessor computer system. The multiprocessor computer system includes a plurality of circuit nodes and a plurality of memories. Each circuit node includes at least one microprocessor coupled to a memory controller which in turn is coupled to one of the plurality of memories. The circuit nodes are in data communication with each other, each circuit node being uniquely identified by a node number. At least one of the circuit nodes is coupled to an I/O bridge which in turn is coupled directly or indirectly to one or more I/O devices. The I/O bridge generates non-coherent memory access transactions in response to memory access requests originating with one of the I/O devices. The circuit node coupled to the I/O bridge receives the non-coherent memory access transactions. For example, the circuit node coupled to the I/O bridge receives first and second non-coherent memory access transactions. The first and second non-coherent memory access transactions include first and second memory addresses, respectively. The first and second non-coherent memory access transactions further include first and second pipe identifications, respectively. The node circuit maps the first and second memory addresses to first and second node numbers, respectively. The first and second pipe identifications are compared. If the first and second pipe identifications compare equally, then the first and second node numbers are compared. First and second coherent memory access transactions are generated by the node coupled to the I/O bridge wherein the first and second coherent memory access transactions correspond to the first and second non-coherent memory access transactions, respectively. The first coherent memory access transaction is transmitted to one of the nodes of the multiprocessor computer system. However, the second coherent memory access transaction is not transmitted unless the first and second pipe identifications do not compare equally or if the first and second node numbers compare equally.
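
The transmission condition stated at the end of the abstract reduces to a small predicate. The sketch below is only an illustration of that stated condition; the function name `may_send_second` and its parameters are hypothetical, and the abstract does not say when a held-back transaction is eventually released, so that part is not modeled.

```python
def may_send_second(pipe1, node1, pipe2, node2):
    """Return True if the second coherent transaction may be transmitted
    without waiting, per the condition given in the abstract.

    pipe1/pipe2: pipe identifications of the two transactions.
    node1/node2: node numbers their memory addresses map to.
    """
    # Different pipes: no ordering constraint between the two requests.
    # Same pipe, same target node: the second may also be transmitted.
    # Same pipe, different target nodes: the second is held back.
    return pipe1 != pipe2 or node1 == node2
```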

Journal Article•10.1109/2.683005•

Virtual memory: issues of implementation

Bruce Jacob¹, Trevor Mudge²•Institutions (2)

University of Maryland, College Park¹, University of Michigan²

01 Jun 1998-IEEE Computer

TL;DR: The authors present the software mechanisms of virtual memory from a hardware perspective and then describe several hardware examples and how they support virtual memory software.

Abstract: Virtual memory was developed to automate the movement of program code and data between main memory and secondary storage to give the appearance of a single large store. This technique greatly simplified the programmer's job, particularly when program code and data exceeded the main memory's size. Virtual memory has now become widely used, and most modern processors have hardware to support it. Unfortunately, there has not been much agreement on the form that this support should take. The result of this lack of agreement is that hardware mechanisms are often completely incompatible. Thus, designers and porters of system level software have two somewhat unattractive choices: they can write software to fit many different architectures or they can insert layers of software to emulate a particular hardware interface. The authors present the software mechanisms of virtual memory from a hardware perspective and then describe several hardware examples and how they support virtual memory software. Their focus is to show the diversity of virtual memory support and, by implication, how this diversity complicates the design and porting of OSs. The authors introduce basic virtual memory technologies and then compare memory management designs in three commercial microarchitectures. They show the diversity of virtual memory support and, by implication, how this diversity can complicate and compromise system operations.

Patent•

A snoop filter for use in multiprocessor computer systems

Andrew E. Phelps¹•Institutions (1)

Sun Microsystems¹

29 Jun 1998

TL;DR: In this article, an improved method and apparatus for distributing transactions among a plurality of groups of processors in a multiprocessor computer system is described, and an embodiment of the invention includes the following operations.

Abstract: An improved method and apparatus for distributing transactions among a plurality of groups of processors in a multiprocessor computer system are disclosed. An embodiment of the invention includes the following operations. First, receiving an address request at a first group of processors. The address request is associated with a memory address corresponding to a requested memory page. Next, identifying those of the groups of processors that are interested in the address request and identifying those of the groups of processors that are uninterested in the address request. Thereafter, substantially simultaneously broadcasting the address request to the interested groups of processors and not to the uninterested groups of processors.

Proceedings Article•10.1109/HIPC.1998.737991•

Memory bank disambiguation using modulo unrolling for Raw machines

Rajeev Barua¹, Whay S. Lee¹, Saman Amarasinghe¹, Anant Agarwal¹•Institutions (1)

Massachusetts Institute of Technology¹

17 Dec 1998

TL;DR: Modulo unrolling is a code transformation technique that enables array references to be accessed through the fast static network on a Raw machine, allowing static communication for a large class of array accesses.

Abstract: We present modulo unrolling, a code transformation technique for enabling array references to be accessed through the fast static network on a Raw machine. A Raw machine comprises a mesh of simple, replicated tiles connected by an interconnect which supports fast, static near-neighbor communication. Like all other resources, memory is distributed across the tiles. Management of the memory can be performed by well known techniques which generate the requisite communication code on distributed address-space architectures. On the other hand, the fast, static network provides the compiler with a simple interface to optimize such communication. This paper addresses the problem of taking advantage of such static communication for memory accesses. The requirement for static memory communication is the compile-time knowledge of the exact communication required for each memory reference. This knowledge, in turn, can be obtained if a memory reference refers exclusively to memory residing on a single processing tile. We introduce modulo unrolling as a technique which allows the static communication of a large class of array accesses. We show how this technique achieves the goal of static communication by using a relatively small unroll factor. For a set of dense matrix scientific applications, we are able to access all the array references on the static network, enabling scalable speedups on the Raw machine.
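
The effect of unrolling on bank locality can be illustrated with a small simulation. This is a sketch under an assumption the abstract does not state: that array element i is stored on bank (tile) i mod n_banks. The function name and parameters are hypothetical. It records, for each textual array reference in the unrolled loop body, which banks that reference touches over the whole loop.

```python
def banks_touched_per_reference(n_elems, n_banks, unroll):
    """Simulate `for i in range(0, n_elems, unroll)` with `unroll` references
    A[i], A[i+1], ..., A[i+unroll-1] in the body.

    Assumes element i lives on bank i % n_banks (an assumption of this
    sketch). Returns one set of banks per reference in the loop body.
    """
    touched = [set() for _ in range(unroll)]
    for base in range(0, n_elems - unroll + 1, unroll):
        for k in range(unroll):
            touched[k].add((base + k) % n_banks)
    return touched
```

With no unrolling, the single reference touches every bank, so its destination is not known at compile time. With an unroll factor equal to the number of banks, each reference in the body touches exactly one bank, which is the property the abstract says is needed for static communication.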

Journal Article•10.1016/S0167-6423(97)00012-9•

A distributed arc-consistency algorithm

Thang Trung Nguyen¹, Yves Deville¹•Institutions (1)

Université catholique de Louvain¹

01 Jan 1998-Science of Computer Programming

TL;DR: DisAC-4 is a new distributed arc-consistency algorithm based on AC-4. It is a coarse-grained parallel algorithm designed for distributed memory computers using message passing communication, and it shows linear speedup with respect to the number of processors.

Patent•

Information streaming in a multi-process system using shared memory

Michael C. Woodring¹, Aaron Cohen¹, Rama R. Menon¹•Institutions (1)

Intel¹

5 Jan 1998

TL;DR: In this article, the authors present a method and system for streaming an information stream from a producer to N consumers in a multi-process environment, where an inter-process communication (IPC) channel containing a shared memory is provided between the producer and at least one of N consumers.

Abstract: The present invention discloses a method and system for streaming an information stream from a producer to N consumers in a multi-process environment. An inter-process communication (IPC) channel containing a shared memory is provided between the producer and at least one of N consumers. The information stream is written into the shared memory by way of a producer-side interface. The information stream is read from the shared memory by way of a consumer-side interface.

Patent•

Partitioning memory shared by multiple processors of a distributed processing system

David J. Wisler, Yu-Cheung Cheung, Charles W. Johnson

30 Sep 1998

TL;DR: A symmetrical processing system includes a number of processor units sharing a memory element; at least a portion of the memory element is partitioned so that separate memory partitions are made exclusively available to some if not all the processor units.

Abstract: A symmetrical processing system includes a number of processor units sharing a memory element. At least a portion of the memory element is partitioned so that separate memory partitions are made exclusively available to some if not all the processor units.

Patent•

Method and system for clustering data in parallel in a distributed-memory multiprocessor system

Inderjit S. Dhillon¹, Dharmendra S. Modha¹•Institutions (1)

IBM¹

26 Oct 1998

TL;DR: In this paper, a method and memory structure for clustering data points in parallel on a distributed-memory multi-processor system, using a k-means computation for data mining, is described.

Abstract: A method, apparatus, article of manufacture, and a memory structure for clustering data points in parallel using a distributed-memory multi-processor system is disclosed. The disclosed system has particularly advantageous application to a rapid and flexible k-means computation for data mining. The method comprises the steps of dividing a set of data points into a plurality of data blocks, initializing a set of k global centroid values in each of the data blocks to k initial global centroid values, performing a plurality of asynchronous processes on the data blocks, each asynchronous process assigning each data point in each data block to the closest global centroid value within each data block, computing a set of k block accumulation values from the data points assigned to the k global centroid values, and recomputing the k global centroid values from the k block accumulation values.
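
The steps listed in the abstract can be sketched as one k-means iteration over data blocks. This is a minimal, sequential illustration rather than the patented method: the function names are hypothetical, squared Euclidean distance is an assumption, keeping an empty cluster's old centroid is an assumption, and the per-block work that would run as separate processes on a distributed-memory machine is run here in a plain loop.

```python
def block_accumulate(block, centroids):
    """Assign each point in one block to its closest centroid and return
    per-centroid coordinate sums and counts (the block accumulation values)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in block:
        # Squared Euclidean distance is an assumption of this sketch.
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def recompute(centroids, partials):
    """Combine the block accumulation values into new global centroids."""
    k, dim = len(centroids), len(centroids[0])
    new = []
    for j in range(k):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            new.append(list(centroids[j]))  # keep an empty cluster's centroid
            continue
        new.append([sum(sums[j][d] for sums, _ in partials) / total
                    for d in range(dim)])
    return new


def kmeans_step(blocks, centroids):
    # Each block would be handled by its own process; here it is a loop.
    partials = [block_accumulate(b, centroids) for b in blocks]
    return recompute(centroids, partials)
```

Only the k sums and counts per block need to be exchanged between processors, not the data points themselves, which is what makes the block decomposition attractive on distributed memory.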

Patent•

Preemptive timer multiplexed shared memory access

Timothy J. Van Hook¹, Gulbin Ezer¹•Institutions (1)

Microsoft¹

5 Mar 1998

TL;DR: In this article, the authors present a method and system for sharing a data memory among a plurality of processors in a computer system, where a processor requesting access to the data memory asserts an active signal.

Abstract: The present invention is directed to a method and system for sharing a data memory among a plurality of processors in a computer system. In the system and method of the present invention, a plurality of processors are coupled to a data memory for accessing the data memory in N-bit bandwidth. The present invention receives an active signal for accessing the data memory from the plurality of processors. A processor requesting access to the data memory asserts an active signal. Among the processors asserting active signals, a processor is selected as a memory master to the data memory. The present invention then transfers the N-bit wide data between the selected processor and the data memory in a time slot defined by a clock cycle. Only one processor is allowed access to the data memory during a given time slot. In the preferred embodiment of the present invention, the N-bit bandwidth is large enough to accommodate the data requirements of all the processors.

Patent•

Cache coherence unit with integrated message passing and memory protection for a distributed, shared memory multiprocessor system

Wolf-Dietrich Weber¹•Institutions (1)

Fujitsu¹

7 Jan 1998

TL;DR: In this patent, the authors present a message protocol to extend cache coherence management of scalable shared memory multiprocessing computer systems having a plurality of processors connected to an interconnection over which the plurality of processors communicate with each other.

Abstract: The present invention generally relates to a system and method for a message protocol to extend cache coherence management of scalable shared memory multiprocessing computer systems having a plurality of processors connected to an interconnection over which the plurality of processors communicate with each other. Each processor communicates with other interconnection processors by sending and receiving messages on the interconnection by means of a messaging protocol which can be used for shared-memory computer systems, shared nothing computer systems, and hybrid computer systems in which some processors are sharing memory while others are not. With this invention a processor node is able to tell whether an incoming message is from within the same coherence group (in which case it is completely unprotected) or whether it is from outside the coherence group (in which case the shared-nothing protections apply). This allows processor nodes sharing memory and processor nodes sharing nothing to co-exist on the same interconnection.

Patent•

Method and apparatus for reconfiguring memory in a multiprocessor system with shared memory

Karen L. Noel, Gregory H. Jordan, Paul K. Harter¹, Thomas Benson¹•Institutions (1)

Hewlett-Packard¹

10 Jun 1998

TL;DR: In this patent, the authors describe a multiprocessor computer that software subdivides into multiple partitions, each able to run a distinct copy, or instance, of an operating system, with memory reconfigurable into or out of a partition under software control.

Abstract: Multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each with the ability to run a distinct copy, or instance, of an operating system. At different times, different operating system instances may be loaded on a given partition. Resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration. The partitions themselves can also be changed without rebooting the system by modifying the configuration tree. Memory may be reconfigured into or out of a partition or community under software control. In general, memory may be in one of three states: private, shared, or unowned. Memory is private if it is “owned” by a single system partition. Memory is shared if it is owned by partitions which are members of a community. Otherwise, it is unowned. Memory is placed in an “unowned” state by an operating system instance and console software, reflected in the system configuration tree, before the memory can be reconfigured into or out of a system.

Patent•

Dynamic bus arbitration priority and task switching based on shared memory fullness in a multi-processor system

Eric S. Collins¹, Brett Louis Lindsley², Reginald J. Hill¹•Institutions (2)

Motorola¹, Freescale Semiconductor²

3 Jun 1998

TL;DR: In this paper, a bus request mechanism in each of the processors dynamically calculates normalized priority values based on relative needs across the system, where the normalized priority calculation is based on monitoring the fullness of memory segments of the shared memory associated with each processor.

Abstract: An arbitration and task switching technique in a real-time multiprocessor data processing system (20) having a common bus (32) and a segmented shared memory (30), where fullness of memory segments of the shared memory (30) is used as a measurement for arbitration and task switching priorities. A bus request mechanism in each of the processors dynamically calculates normalized priority values based on relative needs across the system (20). The normalized priority calculation is based on monitoring the fullness of memory segments of the shared memory (30) associated with each processor (24, 26, 28) of the system (20). Using this normalized priority calculation, the bus access order and bus bandwidth are optimally allocated according to tasks executed by the processors (24, 26, 28). Also, the normalized priority calculation and a preprogrammed threshold are used to control task switching in the multi-processor system (20).
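
A fullness-based normalized priority can be sketched as below. The abstract gives no formula, so several things here are assumptions made only for illustration: each priority is a processor's segment fullness divided by the total fullness, a fuller segment means a more urgent need, and a task switch is triggered when the priority reaches the threshold. The function names are hypothetical.

```python
def normalized_priorities(fullness):
    """Map each processor to its share of the total segment fullness (assumed formula)."""
    total = sum(fullness.values())
    if total == 0:
        return {proc: 0.0 for proc in fullness}
    return {proc: level / total for proc, level in fullness.items()}


def next_bus_owner(fullness):
    """Grant the bus to the processor with the highest normalized priority."""
    priorities = normalized_priorities(fullness)
    return max(priorities, key=priorities.get)


def should_switch_task(fullness, processor, threshold):
    """Signal a task switch once a processor's priority reaches the preprogrammed threshold."""
    return normalized_priorities(fullness)[processor] >= threshold


# Segment fullness in percent for three processors (labels follow the reference numerals).
fullness = {"P24": 20, "P26": 60, "P28": 20}
priorities = normalized_priorities(fullness)
owner = next_bus_owner(fullness)
```

Normalizing against the system-wide total makes each value express a relative need, so the same threshold can be applied regardless of how full the shared memory is overall.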

Journal Article•10.1016/S0965-9978(97)00062-8•

Parallel processing neural networks and genetic algorithms

Barry Hilary Valentine Topping¹, J Sziveri¹, A. Bahreinejad¹, J P B Leite¹, B. Cheng¹•Institutions (1)

Heriot-Watt University¹

01 Dec 1998-Advances in Engineering Software

TL;DR: In this article, the authors concentrate on parallel implementations of neural networks and genetic algorithms and show how a parallel finite element analysis may be undertaken in an efficient manner by preprocessing of the finite element model using a genetic algorithm utilizing a neural network predictor.

Journal Article•10.1016/S0304-3975(97)00197-7•

The bulk-synchronous parallel random access machine

Alexandre Tiskin¹•Institutions (1)

University of Oxford¹

6 Apr 1998

TL;DR: This paper presents a new BSP-type model, called BSPRAM, which reconciles shared-memory style programming with efficient exploitation of data locality, and identifies some characteristic properties of such algorithms: obliviousness, slackness, granularity.

Abstract: The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. Originally, BSP was defined as a distributed memory model. Shared-memory style BSP programming had to be provided by PRAM simulation. However, this approach destroys data locality and therefore may prove inefficient for many practical problems. In this paper we present a new BSP-type model, called BSPRAM, which reconciles shared-memory style programming with efficient exploitation of data locality. BSPRAM can be optimally simulated by BSP for a broad range of algorithms. We identify some characteristic properties of such algorithms: obliviousness, slackness, granularity. Finally, we illustrate these concepts by presenting BSPRAM algorithms for butterfly dag computation, cube dag computation, dense matrix multiplication and sorting.

Patent•

Multi-processor system for transferring data without incurring deadlock using hierarchical virtual channels

Madhumitra Sharma

23 Dec 1998

TL;DR: In this article, a hierarchical switch structure is proposed for a large SMP computer system, which allows for a number of multi-processor nodes to be coupled to the switch to operate at an optimum performance.

Abstract: An architecture and coherency protocol for use in a large SMP computer system includes a hierarchical switch structure which allows for a number of multi-processor nodes to be coupled to the switch to operate at an optimum performance. Within each multi-processor node, a simultaneous buffering system is provided that allows all of the processors of the multi-processor node to operate at peak performance. A memory is shared among the nodes, with a portion of the memory resident at each of the multi-processor nodes. Each of the multi-processor nodes includes a number of elements for maintaining memory coherency, including a victim cache, a directory and a transaction tracking table. The victim cache allows for selective updates of victim data destined for memory stored at a remote multi-processing node, thereby improving the overall performance of memory. Memory performance is additionally improved by including, at each memory, a delayed write buffer which is used in conjunction with the directory to identify victims that are to be written into memory. An arb bus coupled to the output of the directory of each node provides a central ordering point for all messages that are transferred through the SMP. The messages comprise a number of transactions, and each transaction is assigned to a number of different virtual channels, depending upon the processing stage of the message. The use of virtual channels thus helps to maintain data coherency by providing a straightforward method for maintaining system order. Using the virtual channels and the directory structure, cache coherency problems that would previously result in deadlock may be avoided.

Book Chapter•10.1007/3-540-64359-1_722•

COMPaS: A Pentium Pro PC-based SMP cluster and its experience

Yoshio Tanaka, Motohiko Matsuda, Makoto Ando, Kazuto Kubota, Mitsuhisa Sato

30 Mar 1998

TL;DR: A hybrid shared memory/distributed memory programming model is designed in order to take advantage of locality in each SMP node in an eight node SMP cluster called COMPaS and its preliminary evaluation is reported.

Abstract: We have built an eight-node SMP cluster called COMPaS (Cluster Of Multi-Processor Systems), each node of which is a quad-processor Pentium Pro PC. We have designed and implemented a remote memory based user-level communication layer which provides low overhead and high bandwidth using Myrinet. We designed a hybrid programming model in order to take advantage of locality in each SMP node. Intra-node computations utilize a multi-threaded programming style (Solaris threads) and inter-node programming is based on message passing and remote memory operations. In this paper we report on this hybrid shared memory/distributed memory programming on COMPaS and its preliminary evaluation. The performance of COMPaS is affected by data size and access patterns, and the proportion of inter-node communication. If the data size is small enough to fit entirely in the cache, parallel efficiency exceeds 1.0 using the hybrid programming model on COMPaS. But the performance is limited by the low memory bus bandwidth of PC-based SMP nodes for some memory intensive workloads.

Proceedings Article•10.1109/HIPS.1998.665138•

Making distributed shared memory simple, yet efficient

M. Swanson¹, Leigh Stoller¹, John B. Carter¹•Institutions (1)

University of Utah¹

30 Mar 1998

TL;DR: This work chose to implement a simple DSM system, Quarks, with an eye towards hiding most computation overhead while using a very low latency transport layer to reduce the effect of communication overhead.

Abstract: Recent research on distributed shared memory (DSM) has focussed on improving performance by reducing the communication overhead of DSM. Features added include lazy release consistency based coherence protocols and new interfaces that give programmers the ability to hand tune communication. These features have increased DSM performance at the expense of requiring increasingly complex DSM systems or increasingly cumbersome programming. They have also increased the computation overhead of DSM, which has partially offset the communication related performance gains. We chose to implement a simple DSM system, Quarks, with an eye towards hiding most computation overhead while using a very low latency transport layer to reduce the effect of communication overhead. The resulting performance is comparable to that of far more complex DSM systems, such as Treadmarks and Cashmere.
