Top 306 papers published in the topic of Distributed memory in 2006

Showing papers on "Distributed memory" published in 2006

Journal Article•10.1109/TIT.2006.874535•

Decentralized erasure codes for distributed networked storage

Alexandros G. Dimakis¹, Vinod M. Prabhakaran¹, Kannan Ramchandran¹•Institutions (1)

01 Jun 2006-IEEE Transactions on Information Theory

TL;DR: In this article, the problem of constructing an erasure code for storage over a network when the data sources are distributed is considered, and it is shown that decentralized erasure codes are optimally sparse, and lead to reduced communication, storage and computation cost over random linear coding.

Abstract: In this correspondence, we consider the problem of constructing an erasure code for storage over a network when the data sources are distributed. Specifically, we assume that there are n storage nodes with limited memory and k < n sources generating the data. We want a data collector, who can appear anywhere in the network, to query any k storage nodes and be able to retrieve the data. We introduce decentralized erasure codes, which are linear codes with a specific randomized structure inspired by network coding on random bipartite graphs. We show that decentralized erasure codes are optimally sparse, and lead to reduced communication, storage and computation cost over random linear coding.
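
Illustrative sketch (not taken from the paper): the abstract's setting can be mimicked in a few lines of Python. Each of the k sources sends its symbol to d randomly chosen storage nodes, each node stores one random linear combination of what it received, and a collector solves a k x k linear system from any k nodes it queries. The prime field P, the fan-out parameter d, and the function names encode, decode and recoverable_fraction are assumptions made for this example; the paper's actual construction and parameter choices may differ.

```python
import itertools
import random

P = 2_147_483_647  # prime modulus (2**31 - 1); the field choice is an assumption


def encode(data, n, d, rng):
    """Each of the k sources sends its symbol to d random storage nodes.

    Every node keeps one symbol: a random linear combination (mod P) of the
    symbols it received. Returns (rows, stored), where rows[j] is node j's
    coefficient vector over the k sources. Data symbols must be in [0, P).
    """
    k = len(data)
    rows = [[0] * k for _ in range(n)]
    for src in range(k):
        for node in rng.sample(range(n), d):
            rows[node][src] = rng.randrange(1, P)
    stored = [sum(c * x for c, x in zip(row, data)) % P for row in rows]
    return rows, stored


def decode(rows, stored, nodes):
    """Recover the k source symbols from exactly k queried nodes.

    Runs Gauss-Jordan elimination mod P and returns None if the queried
    nodes' coefficient matrix is singular.
    """
    k = len(nodes)
    m = [rows[j][:] + [stored[j]] for j in nodes]  # augmented matrix
    for col in range(k):
        pivot = next((r for r in range(col, k) if m[r][col]), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        inv = pow(m[col][col], -1, P)  # modular inverse (Python 3.8+)
        m[col] = [v * inv % P for v in m[col]]
        for r in range(k):
            if r != col and m[r][col]:
                f = m[r][col]
                m[r] = [(a - f * b) % P for a, b in zip(m[r], m[col])]
    return [m[i][k] for i in range(k)]


def recoverable_fraction(rows, stored, data):
    """Fraction of k-node subsets from which the collector recovers the data."""
    k, n = len(data), len(rows)
    subsets = list(itertools.combinations(range(n), k))
    good = sum(decode(rows, stored, s) == data for s in subsets)
    return good / len(subsets)
```

Lowering d makes the code sparser (less communication per source) but raises the chance that a queried subset is not decodable; recoverable_fraction gives a quick way to observe that trade-off on small instances.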

297 citations

Journal Article•10.1039/B515355G•

Distributed memory parallel implementation of energies and gradients for second-order Møller–Plesset perturbation theory with the resolution-of-the-identity approximation

Christof Hättig, Arnim Hellweg, Andreas Köhn

02 Mar 2006-Physical Chemistry Chemical Physics

TL;DR: The parallel implementation of second-order Møller-Plesset perturbation theory with the resolution-of-the-identity approximation (RI-MP2) extends the applicability of canonical MP2 to considerably larger systems.

Abstract: We present a parallel implementation of second-order Møller–Plesset perturbation theory with the resolution-of-the-identity approximation (RI-MP2). The implementation is based on a recent improved sequential implementation of RI-MP2 within the Turbomole program package and employs the message passing interface (MPI) standard for communication between distributed memory nodes. The parallel implementation extends the applicability of canonical MP2 to considerably larger systems. Examples are presented for full geometry optimizations with up to 60 atoms and 3300 basis functions and MP2 energy calculations with more than 200 atoms and 7000 basis functions.

233 citations

Journal Article•10.1109/TNN.2006.875989•

Parallel sequential minimal optimization for the training of support vector machines

L. J. Cao¹, S. Sathiya Keerthi, Chong Jin Ong, J. Q. Zhang, U. Periyathamby, Xiu Ju Fu, Heow Pueh Lee•Institutions (1)

Fudan University¹

01 Jul 2006-IEEE Transactions on Neural Networks

TL;DR: The parallel SMO is developed using message passing interface (MPI) and shows great speedup on the adult data set and the Modified National Institute of Standards and Technology (MNIST) data set when many processors are used.

Abstract: Sequential minimal optimization (SMO) is a popular algorithm for training support vector machines (SVMs), but it still requires a large amount of computation time for solving large problems. This paper proposes a parallel implementation of SMO for training SVMs. The parallel SMO is developed using message passing interface (MPI). Specifically, the parallel SMO first partitions the entire training data set into smaller subsets and then runs multiple CPU processors simultaneously, each dealing with one of the partitioned data sets. Experiments show that there is great speedup on the adult data set and the Modified National Institute of Standards and Technology (MNIST) data set when many processors are used. There are also satisfactory results on the Web data set.

198 citations

Journal Article•10.1109/TPDS.2006.112•

ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

Christopher S. Oehmen¹, J. Nieplocha¹•Institutions (1)

Pacific Northwest National Laboratory¹

01 Aug 2006-IEEE Transactions on Parallel and Distributed Systems

TL;DR: ScalaBLAST is developed, which accommodates very large databases and which scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state-of-the-art in high-performance sequence alignment with scaling and portability.

Abstract: Genes in an organism's DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify protein components (proteome) being actively used for a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. But the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and which scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state-of-the-art in high-performance sequence alignment with scaling and portability. ScalaBLAST relies on a collection of techniques (distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching) to achieve high performance and scalability. This demonstrated approach of database sharing combined with effective task scheduling should have broad ranging applications to other informatics-driven sciences.

148 citations

Proceedings Article•10.1145/1133956.1133967•

McRT-Malloc: a scalable transactional memory allocator

Richard L. Hudson¹, Bratin Saha¹, Ali-Reza Adl-Tabatabai¹, Benjamin C. Hertzberg²•Institutions (2)

Intel¹, Stanford University²

10 Jun 2006

TL;DR: This paper is the first to integrate a software transactional memory system with a malloc/free based memory allocator and presents the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup.

Abstract: Emerging multi-core processors promise to provide an exponentially increasing number of hardware threads with every generation. Applications will need to be highly concurrent to fully use the power of these processors. To enable maximum concurrency, libraries (such as malloc-free packages) would therefore need to use non-blocking algorithms. But lock-free algorithms are notoriously difficult to reason about and inappropriate for average programmers. Transactional memory promises to significantly ease concurrent programming for the average programmer. This paper describes a highly efficient non-blocking malloc/free algorithm that supports memory allocation and deallocation inside transactional code blocks. Thus this paper describes a memory allocator that is suitable for emerging multi-core applications, while supporting modern concurrency constructs. This paper makes several novel contributions. It is the first to integrate a software transactional memory system with a malloc/free based memory allocator. We present the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup. Unlike previous lock-free malloc packages, our algorithm avoids atomic operations on typical code paths, making our algorithm substantially more efficient.

121 citations

Proceedings Article•10.5555/1898953.1899021•

Detecting phases in parallel applications on shared memory architectures

Erez Perelman¹, Marzia Polito², Jean-Yves Bouguet², Jack Sampson¹, Brad Calder¹, Carole Dulong²•Institutions (2)

University of California, San Diego¹, Intel²

25 Apr 2006

TL;DR: This paper examines applying phase analysis algorithms and how to adapt them to parallel applications running on shared memory processors, and examines using the phase analysis to pick simulation points to guide multithreaded simulation.

Abstract: Most programs are repetitive, where similar behavior can be seen at different execution times. Algorithms have been proposed that automatically group similar portions of a program's execution into phases, where samples of execution in the same phase have homogeneous behavior and similar resource requirements. In this paper, we examine applying these phase analysis algorithms and how to adapt them to parallel applications running on shared memory processors. Our approach relies on a separate representation of each thread's activity. We first focus on showing its ability to identify similar intervals of execution across threads for a single run. We then show that it is effective at identifying similar behavior of a program when the number of threads is varied between runs. This can be used by developers to examine how different phases scale across different numbers of threads. Finally, we examine using the phase analysis to pick simulation points to guide multithreaded simulation.

86 citations

Proceedings Article•10.5555/2386124.2386151•

A scalable, hybrid scheme for volume rendering massive data sets

Hank Childs¹, Mark A. Duchaineau², Kwan-Liu Ma¹•Institutions (2)

University of California, Davis¹, Lawrence Livermore National Laboratory²

11 May 2006

TL;DR: A parallel, distributed memory algorithm for volume rendering massive data sets and two techniques for calculating the sample points: a 3D rasterization technique and a kernel-based technique, which trade off between speed and generality.

Abstract: We introduce a parallel, distributed memory algorithm for volume rendering massive data sets. The algorithm's scalability has been demonstrated up to 400 processors, rendering one hundred million unstructured elements in under one second. The heart of the algorithm is a hybrid approach that parallelizes over both the elements of the input data and over the pixels of the output image. At each stage of the algorithm, there are strong limits on how much work each processor performs, ensuring good parallel efficiency. The algorithm is sample-based. We present two techniques for calculating the sample points: a 3D rasterization technique and a kernel-based technique, which trade off between speed and generality. Finally, the algorithm is very flexible. It can be deployed in general purpose visualization tools and can also support diverse mesh types, ranging from structured grids to curvilinear and unstructured meshes to point clouds.

76 citations

Book Chapter•10.1007/11846802_18•

High performance RDMA protocols in HPC

Tim S. Woodall¹, Galen M. Shipman¹, George Bosilca², Richard L. Graham¹, Arthur B. Maccabe³•Institutions (3)

Los Alamos National Laboratory¹, University of Tennessee², University of New Mexico³

17 Sep 2006-Lecture Notes in Computer Science

TL;DR: This paper describes a unique long-message MPI library ‘pipeline’ protocol that addresses constraints while avoiding some of the pitfalls of existing techniques by using portable send/receive semantics to hide the cost of initializing the pipeline algorithm.

Abstract: Modern network communication libraries that leverage Remote Direct Memory Access (RDMA) and OS bypass protocols, such as InfiniBand [2] and Myrinet [10], can offer significant performance advantages over conventional send/receive protocols. However, this performance often comes with hidden per-buffer setup costs [4]. This paper describes a unique long-message MPI [9] library ‘pipeline’ protocol that addresses these constraints while avoiding some of the pitfalls of existing techniques. By using portable send/receive semantics to hide the cost of initializing the pipeline algorithm, and then effectively overlapping the cost of memory registration with RDMA operations, this protocol provides very good performance for any large-memory usage pattern. This approach avoids the use of non-portable memory hooks or keeping registered memory from being returned to the OS. Through this approach, bandwidth may be increased up to 67% when memory buffers are not effectively reused while providing superior performance in the effective bandwidth benchmark. Several user-level protocols are explored using Open MPI's PML (point-to-point messaging layer) and compared and contrasted with this ‘pipeline’ protocol.
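
Illustrative sketch (a toy timing model, not the protocol from the paper): the benefit of overlapping memory registration with RDMA transfers can be seen by treating a long message as n equal fragments, each needing one registration step and one transfer step. The function names and the assumption of constant per-fragment costs are hypothetical and only serve to show why a two-stage pipeline finishes earlier than doing all registration first.

```python
def serial_time(n, reg, xfer):
    """Register every fragment first, then transfer every fragment."""
    return n * reg + n * xfer


def pipelined_time(n, reg, xfer):
    """Register fragment i+1 while fragment i is being transferred.

    A fragment's transfer starts once it is registered and the previous
    transfer has finished.
    """
    reg_done = 0
    xfer_done = 0
    for _ in range(n):
        reg_done += reg
        xfer_done = max(xfer_done, reg_done) + xfer
    return xfer_done
```

In this model the pipelined time approaches n times the slower of the two stages, so the faster stage is almost entirely hidden for long messages.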

75 citations

Patent•

Sub-page-granular cache coherency using shared virtual memory mechanism

Roy E. Clark¹, David W. Desroches¹•Institutions (1)

EMC Corporation¹

29 Mar 2006

TL;DR: In this article, the authors propose a technique for providing access to shared data based on enhanced standard virtual memory mechanisms. The technique can be configured to handle shared data at the sub-page-granular level using sparse virtual pages to minimize memory access contention and thus improve the likelihood of quick re-hits.

Abstract: A technique involves providing access to shared data based on enhanced standard virtual memory mechanisms. Once data from a shared area of memory is moved into primary memory of a first computerized device from a second computerized device, the first computerized device can retain that data in order to shorten the latency of subsequent accesses. Such a technique can be configured to handle shared data at the sub-page-granular level using sparse virtual pages to minimize memory access contention and thus improve the likelihood of quick re-hits. Furthermore, such a technique can be conveniently accomplished through an enhancement to a common page fault handler of an operating system and utilizing atomic remote access support from a standard communications protocol thus alleviating the need to employ more costly and complicated solutions such as inflexible hardware implementations or independent programs that could pose additional design burdens and reliability concerns.

70 citations

Proceedings Article•10.1109/ICSAMOS.2006.300821•

Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors

Matteo Monchiero¹, Gianluca Palermo¹, Cristina Silvano¹, Oreste Villa¹•Institutions (1)

Polytechnic University of Milan¹

1 Jul 2006

TL;DR: A distributed shared memory architecture suitable for low-power on-chip multiprocessors based on NoC is explored, with a focus on the energy/delay exploration of an on-chip physically distributed and logically shared memory address space for MP-SoCs based on a parameterizable NoC.

Abstract: Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoC), have been recently proposed. The shared memory represents one of the key elements in designing MP-SoCs, since its function is to provide data exchange and synchronization support. In this paper, a distributed shared memory architecture is explored that is suitable for low-power on-chip multiprocessors based on NoC. In particular, the paper focuses on the energy/delay exploration of an on-chip physically distributed and logically shared memory address space for MP-SoCs based on a parameterizable NoC. The data allocation on the physically distributed shared memory space is dynamically managed by an on-chip Hardware Memory Management Unit. Experimental results show the impact of different NoC topologies and distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective.

70 citations

Journal Article•10.1145/1324177.1324179•

Arithmetic on a Distributed-Memory Quantum Multicomputer

Rodney Van Meter¹, William J. Munro², Kae Nemoto³, Kohei M. Itoh¹•Institutions (3)

Keio University¹, Hewlett-Packard², National Institute of Informatics³

24 Jul 2006-arXiv: Quantum Physics

TL;DR: In this article, the authors evaluate the performance of quantum arithmetic algorithms run on a distributed quantum computer (a quantum multicomputer), and show that the teledata approach performs better, and that carry-ripple adders perform well when the teleportation block is decomposed so that the key quantum operations can be parallelized.

Abstract: We evaluate the performance of quantum arithmetic algorithms run on a distributed quantum computer (a quantum multicomputer). We vary the node capacity and I/O capabilities, and the network topology. The tradeoff of choosing between gates executed remotely, through "teleported gates" on entangled pairs of qubits (telegate), versus exchanging the relevant qubits via quantum teleportation, then executing the algorithm using local gates (teledata), is examined. We show that the teledata approach performs better, and that carry-ripple adders perform well when the teleportation block is decomposed so that the key quantum operations can be parallelized. A node size of only a few logical qubits performs adequately provided that the nodes have two transceiver qubits. A linear network topology performs acceptably for a broad range of system sizes and performance parameters. We therefore recommend pursuing small, high-I/O bandwidth nodes and a simple network. Such a machine will run Shor's algorithm for factoring large numbers efficiently.

Journal Article•10.1145/1133255.1133995•

Shared memory programming for large scale machines

Christopher Barton¹, Călin Caşcaval², George Almási², Yili Zheng³, Montse Farreras⁴, Siddhartha Chatterjee², José Nelson Amaral¹•Institutions (4)

University of Alberta¹, IBM², Purdue University³, Polytechnic University of Catalonia⁴

11 Jun 2006

TL;DR: An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors.

Abstract: This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors. Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, eliminates several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks (HPC RandomAccess, HPC STREAM and NAS CG) to obtain scaling and absolute performance numbers for these benchmarks on up to 131072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.
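
Illustrative sketch (Python rather than UPC, and not the paper's compiler): "elimination of affinity tests" for a parallel loop can be pictured as follows. The sketch assumes a cyclic layout in which element i has affinity to thread i mod THREADS; the names owner, forall_with_affinity_test and forall_strided are hypothetical. The naive lowering visits every index and tests ownership, while the optimized lowering strides directly over the indices a thread owns.

```python
THREADS = 4  # hypothetical thread count


def owner(i, threads=THREADS):
    """Thread with affinity to element i under a cyclic (block size 1) layout."""
    return i % threads


def forall_with_affinity_test(n, me, threads=THREADS):
    """Naive lowering: visit every index, keep only those this thread owns."""
    return [i for i in range(n) if owner(i, threads) == me]


def forall_strided(n, me, threads=THREADS):
    """Lowering without a per-iteration test: stride over the owned indices."""
    return list(range(me, n, threads))
```

Both functions yield the same iterations for every thread, but the strided form performs n/THREADS loop steps instead of n and no ownership comparisons.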

Book Chapter•10.1007/11609773_16•

Parallel external directed model checking with linear I/O

Shahid Jabbar, Stefan Edelkamp

8 Jan 2006

TL;DR: This paper presents Parallel External A*, a parallel variant of external memory directed model checking; assuming a sufficient number of processors and number of open file pointers per process, the I/O complexity is reduced to linear by exploiting a hash-function based state space partition scheme.

Abstract: In this paper we present Parallel External A*, a parallel variant of external memory directed model checking. As a model scales up, its successor generation becomes complex and, in turn, starts to impact the running time of the model checker. Probing our external memory model checker IO-HSF-SPIN revealed that in some cases about 70% of the whole running time was consumed by internal processing. Employing a multiprocessor machine or a cluster of workstations, we can distribute the internal workload of the algorithm over multiple processors. Moreover, assuming a sufficient number of processors and number of open file pointers per process, the I/O complexity is reduced to linear by exploiting a hash-function based state space partition scheme.
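
Illustrative sketch (an assumption about the general idea, not the authors' scheme): a hash-function based state space partition assigns every state to exactly one processor, so duplicates of a state always meet in the same place and can be removed locally. The state encoding as strings, the CRC32 hash, and the names owner and partition are hypothetical choices for this example.

```python
import zlib
from collections import defaultdict


def owner(state, num_procs):
    """Map a state (here a string) to a processor with a deterministic hash."""
    return zlib.crc32(state.encode("utf-8")) % num_procs


def partition(states, num_procs):
    """Group states into one bucket per processor; empty buckets are omitted."""
    buckets = defaultdict(list)
    for s in states:
        buckets[owner(s, num_procs)].append(s)
    return dict(buckets)
```

A deterministic hash is used instead of Python's built-in hash because the latter is randomized per process for strings, which would send the same state to different owners on different machines.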

Book Chapter•10.1007/11846802_12•

Scalable parallel suffix array construction

Fabian Kulla¹, Peter Sanders¹•Institutions (1)

Karlsruhe Institute of Technology¹

17 Sep 2006-Lecture Notes in Computer Science

TL;DR: The first implementation and experimental evaluation of a scalable parallel algorithm for suffix array construction that works on distributed memory computers using MPI and makes it possible to build suffix arrays for huge inputs very quickly.

Abstract: Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular in bioinformatics. We describe the first implementation and experimental evaluation of a scalable parallel algorithm for suffix array construction. The implementation works on distributed memory computers using MPI. Experiments with up to 128 processors show good constant factors and make it look likely that the algorithm would also scale to considerably larger systems. This makes it possible to build suffix arrays for huge inputs very quickly. Our algorithm is a parallelization of the linear time DC3 algorithm.
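
Illustrative sketch (a naive sequential construction, explicitly not DC3 and not the parallel algorithm of the paper): for readers unfamiliar with the data structure, a suffix array lists the start positions of all suffixes of a text in lexicographic order, which then supports pattern search by binary search. The function names are hypothetical.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text.

    This costs far more than the linear-time DC3 algorithm; it only serves
    to define the output.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, pattern):
    """Return sorted start positions of pattern, found by binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

For the text "banana" the suffix array is [5, 3, 1, 0, 4, 2], and searching for "ana" returns positions 1 and 3.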

Journal Article•10.15388/INFORMATICA.2006.134•

Parallel DEM Software for Simulation of Granular Media

Algirdas Maknickas¹, Arnas Kačeniauskas¹, Rimantas Kačianauskas¹, Robertas Balevičius¹, Algis Džiugys²•Institutions (2)

Vilnius Gediminas Technical University¹, Energy Institute²

01 Apr 2006-Informatica (Lithuanian Academy of Sciences)

TL;DR: The paper describes the development and performance of parallel algorithms for the discrete element method (DEM) software; a novel algorithm combining link-cells for contact detection, static domain decomposition for parallelization, and MPI data transfer for processors exchanging particles has been developed for distributed memory PC clusters.

Abstract: The paper describes the development and performance of parallel algorithms for the discrete element method (DEM) software. A spatial domain decomposition strategy and message passing inter-processor communication have been implemented in the DEMMAT code for simulation of visco-elastic frictional granular media. A novel algorithm combining link-cells for contact detection, static domain decomposition for parallelization, and MPI data transfer for processors exchanging particles has been developed for distributed memory PC clusters. The parallel software DEMMAT_PAR has been applied to model the compacting of spherical particles in a rectangular box. Two benchmark problems with different numbers of particles have been solved in order to measure the parallel efficiency of the code. The inter-processor communication has been examined in order to improve the domain decomposition topology and to achieve better load balancing. A speed-up of 11 has been obtained on 16 processors. The parallel performance study has been performed on the PC cluster VILKAS of Vilnius Gediminas Technical University, Lithuania.
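
Illustrative sketch (a generic link-cell search, not the DEMMAT code): link-cell contact detection bins particles into a grid whose cell edge is at least the contact distance, so each particle only needs to be tested against particles in its own and adjacent cells. The sketch is two-dimensional, assumes equal-radius particles, and leaves out domain decomposition and MPI entirely; the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def contacts_link_cell(points, radius):
    """Return index pairs (i, j), i < j, whose centres are closer than 2*radius."""
    cell = 2.0 * radius  # cell edge equals the contact distance
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(idx)

    found = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells while iterating the grid
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) < cell:
                            found.add((i, j))
    return found


def contacts_brute_force(points, radius):
    """O(n^2) reference used to check the link-cell result."""
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) < 2.0 * radius}
```

The reported speed-up of 11 on 16 processors corresponds to a parallel efficiency of 11/16, or about 69%.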

Book Chapter•10.1007/11846802_10•

Hybrid MPI and OpenMP parallel programming

Rolf Rabenseifner, Georg Hager¹, Gabriele Jost², Rainer Keller•Institutions (2)

University of Erlangen-Nuremberg¹, Sun Microsystems²

17 Sep 2006

TL;DR: Most HPC systems are clusters of shared memory nodes, ranging from PC clusters to “constellation” type systems with large SMP nodes, so parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node.

Abstract: Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards, but also “constellation” type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside of each node.

Proceedings Article•10.1109/DATE.2006.243733•

Distributed Loop Controller Architecture for Multi-threading in Uni-threaded VLIW Processors

P. Raghavan¹, A. Lambrechts¹, M. Jayapala¹, F. Catthoor¹, D. Verkest¹•Institutions (1)

Katholieke Universiteit Leuven¹

6 Mar 2006

TL;DR: The proposed distributed instruction memory organization requires minimal hardware overhead and allows execution of multiple loops in parallel in a uni-processor system and can reduce the energy consumed in the instruction and data memory hierarchy.

Abstract: Reduced energy consumption is one of the most important design goals for embedded application domains like wireless, multimedia and biomedical. The instruction memory hierarchy has been proven to be one of the most power-hungry parts of the system. This paper introduces an architectural enhancement for the instruction memory to reduce energy and improve performance. The proposed distributed instruction memory organization requires minimal hardware overhead and allows execution of multiple loops in parallel in a uni-processor system. This architecture enhancement can reduce the energy consumed in the instruction and data memory hierarchy by 70.01% and improve the performance by 32.89% compared to enhanced SMT based architectures.

Proceedings Article•10.1109/DATE.2006.244066•

COSMECA: Application Specific Co-Synthesis of Memory and Communication Architectures for MPSoC

Sudeep Pasricha¹, Nikil Dutt¹•Institutions (1)

University of California, Irvine¹

6 Mar 2006

TL;DR: An automated application specific co-synthesis methodology for memory and communication architectures (COSMECA) in MPSoC designs is proposed to design a communication architecture having the least number of busses, which satisfies performance and memory area constraints, while the secondary objective is to reduce the memory area cost.

Abstract: Memory and communication architectures have a significant impact on the cost, performance, and time-to-market of complex multi-processor system-on-chip (MPSoC) designs. The memory architecture dictates most of the data traffic flow in a design, which in turn influences the design of the communication architecture. Thus there is a need to co-synthesize the memory and communication architectures to avoid making sub-optimal design decisions. This is in contrast to traditional platform-based design approaches where memory and communication architectures are synthesized separately. In this paper, we propose an automated application specific co-synthesis methodology for memory and communication architectures (COSMECA) in MPSoC designs. The primary objective is to design a communication architecture having the least number of busses, which satisfies performance and memory area constraints, while the secondary objective is to reduce the memory area cost. Results of applying COSMECA to several industrial strength MPSoC applications from the networking domain indicate a saving of as much as 40% in number of busses and 29% in memory area compared to the traditional approach.

Journal Article•10.1016/J.PARCO.2006.08.004•

An improved two-step algorithm for task and data parallel scheduling in distributed memory machines

Savina Bansal, Padam Kumar¹, Kuldip Singh¹•Institutions (1)

Indian Institute of Technology Roorkee¹

1 Nov 2006

TL;DR: A two-step Modified Critical Path and Area-based (MCPA) scheduling heuristic is developed that aims to improve the processor allocation phase of an existing Critical Path and Area-based (CPA) scheduling algorithm; it turns out to be much better than the parent CPA algorithm and comparable to the high-complexity Critical Path Reduction (CPR) algorithm.

Abstract: Scheduling most parallel scientific applications demands simultaneous exploitation of task and data parallelism for efficient and effective utilization of system and other resources. Traditional optimization techniques, like optimal control-theoretic approaches, convex programming, and bin-packing, have been suggested in the literature for dealing with the most critical processor allocation phase. However, their application to real-world problems is not straightforward, which moves the solutions away from optimality. Heuristic-based approaches, in contrast, work in the integer domain for the number of processors throughout, and perform appreciably well. A two-step Modified Critical Path and Area-based (MCPA) scheduling heuristic is developed which aims to improve the processor allocation phase of an existing Critical Path and Area-based (CPA) scheduling algorithm. The strength of the suggested algorithm lies in bridging the gap between the processor allocation and task assignment phases of scheduling. It helps in making better processor allocations for data parallel tasks without sacrificing the essential task parallelism available in the application program. The performance of the MCPA algorithm, in terms of normalized schedule length and speedup, is evaluated for random and real application task graph suites. It turns out to be much better than the parent CPA algorithm and comparable to the high-complexity Critical Path Reduction (CPR) algorithm.

Proceedings Article•10.1145/1146909.1146925•

Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies

Ilya Issenin¹, Erik Brockmeyer², Bart Durinck², Nikil Dutt¹•Institutions (2)

University of California, Irvine¹, Katholieke Universiteit Leuven²

24 Jul 2006

TL;DR: In this paper, the authors present a data reuse analysis technique that allows the system designer to explore a wide range of customized memory hierarchy organizations with different size and energy profiles, which enables the system designers to explore feasible memory subsystem solutions that meet power and area constraints while maintaining the necessary performance level.

Abstract: The increasing use of multiprocessor systems-on-chip (MPSoCs) for high performance demands of embedded applications results in high power dissipation. The memory subsystem is a large and critical contributor to both energy and performance, requiring system designers to perform exploration of low power memory organizations. In this paper we present a novel multiprocessor data reuse analysis technique that allows the system designer to explore a wide range of customized memory hierarchy organizations with different size and energy profiles. Our technique enables the system designer to explore feasible memory subsystem solutions that meet power and area constraints while maintaining the necessary performance level. Our experiments on the complex QSDPCM benchmark illustrate the exploration of a wide range of customized memory hierarchies for an MPSoC implementation.

Patent•

Task scheduling policy for limited memory systems

Xiaohan Qin¹•Institutions (1)

IBM¹

18 Jan 2006

TL;DR: In this article, a kernel scheduler is utilized to schedule a next application for execution in accordance with the memory availability and the determined memory requirements of the one or more monitored applications.

Abstract: A system, method, and program product for scheduling applications in a data processing system. The method of the invention includes a step of monitoring memory availability within the data processing system during execution of one or more applications. The method further includes recording a memory usage history for one or more applications from which a memory requirement may be determined. A kernel scheduler is utilized to schedule a next application for execution in accordance with the memory availability and the determined memory requirements of the one or more monitored applications.
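
The scheduling idea in this abstract can be illustrated with a minimal sketch. It is not the patented method: the class name `MemoryAwareScheduler`, the use of peak recorded usage as the memory requirement, and the first-fit selection rule are all assumptions made for illustration.

```python
from collections import defaultdict


class MemoryAwareScheduler:
    """Toy scheduler (hypothetical): picks the next application that fits in free memory."""

    def __init__(self):
        # Application name -> list of recorded memory usage samples, in MB.
        self.history = defaultdict(list)

    def record_usage(self, app, used_mb):
        """Record one memory usage sample for an application."""
        self.history[app].append(used_mb)

    def requirement(self, app):
        """Assumption: the requirement is the peak usage seen so far (0 if unknown)."""
        samples = self.history[app]
        return max(samples) if samples else 0

    def pick_next(self, ready_queue, available_mb):
        """Return the first ready application whose requirement fits, else None."""
        for app in ready_queue:
            if self.requirement(app) <= available_mb:
                return app
        return None


scheduler = MemoryAwareScheduler()
scheduler.record_usage("render", 500)
scheduler.record_usage("render", 700)
scheduler.record_usage("index", 200)
next_app = scheduler.pick_next(["render", "index"], available_mb=600)
```

With 600 MB free, `render` (peak 700 MB) is skipped and `index` is chosen. A real kernel scheduler would also have to weigh priority and fairness, which this sketch ignores.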

Patent•

Single interconnect providing read and write access to a memory shared by concurrent threads

John R. Nickolls¹, Brett W. Coon¹, Ming Y. Siu¹, Stuart F. Oberman¹, Samuel Liu¹•Institutions (1)

Nvidia¹

30 Oct 2006

TL;DR: In this paper, a shared memory is used by concurrent threads in a multithreaded processor, with any addressable storage location in the shared memory being readable and writeable by any of the threads.

Abstract: A shared memory is usable by concurrent threads in a multithreaded processor, with any addressable storage location in the shared memory being readable and writeable by any of the threads. Processing engines that execute the threads are coupled to the shared memory via an interconnect that transfers data in only one direction (e.g., from the shared memory to the processing engines); the same interconnect supports both read and write operations. The interconnect advantageously supports multiple parallel read or write operations.

Patent•

Apparatus, system, and method for coalescing parallel memory requests

Bryon S. Nordquist¹, Stephen D. Lew¹•Institutions (1)

Nvidia¹

24 Jan 2006

TL;DR: A controller receives memory requests from the parallel threads of a multiprocessor system and coalesces them to improve memory transfer efficiency.

Abstract: A multiprocessor system executes parallel threads. A controller receives memory requests from the parallel threads and coalesces the memory requests to improve memory transfer efficiency.
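
The abstract does not say how requests are coalesced. As a rough illustration only, the sketch below assumes one common approach: per-thread byte addresses that fall in the same aligned segment are merged into a single transfer. The function name `coalesce` and the 64-byte segment size are hypothetical.

```python
def coalesce(addresses, segment_bytes=64):
    """Group per-thread byte addresses by aligned segment.

    addresses[i] is the address requested by thread i. The result maps each
    segment base address to the list of thread ids it serves, so the number
    of keys is the number of memory transfers actually issued.
    """
    segments = {}
    for thread_id, addr in enumerate(addresses):
        base = (addr // segment_bytes) * segment_bytes  # round down to segment start
        segments.setdefault(base, []).append(thread_id)
    return segments


# Six threads issue requests; neighbouring addresses share a segment.
transfers = coalesce([0, 4, 8, 64, 68, 200])
```

Here six requests collapse into three transfers (segments starting at 0, 64, and 192), which is the kind of efficiency gain the abstract refers to.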

Patent•

Programmable network protocol handler architecture

Christos John Georgiou¹, Monty M. Denneau¹•Institutions (1)

IBM¹

24 Mar 2006

TL;DR: In this paper, a high-speed protocol handler combines parallelism and pipelining in multiple programmable processors, along with specialized front-end logic at the network interface that handles time critical protocol operations.

Abstract: An architecture that achieves high-speed performance in a network protocol handler combines parallelism and pipelining in multiple programmable processors, along with specialized front-end logic at the network interface that handles time-critical protocol operations. The multiple processors are interconnected via a high-speed interconnect, and each processor's memory is globally accessible by other processors. Each processor has multiple threads, each capable of fully executing programs. Each processor contains embedded dynamic random access memory (DRAM). Threads within a processor are assigned the processing of various protocol functions in a parallel/pipelined fashion. Data frame processing is done by one or more of the threads to identify related frames. Related frames are dispatched to the same thread so as to minimize the overhead associated with memory accesses and general protocol processing. The high-speed protocol handler may also provide built-in monitors for examining the activity of its hardware resources and reallocating the workload to the resources that are not heavily used, thus balancing the resource utilization and increasing the workload throughput.

Journal Article•10.1109/TVLSI.2006.871759•

A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck

Minas Dasygenis¹, E. Brockmeyer², B. Durinck², Francky Catthoor², Dimitrios Soudris¹, Antonios Thanailakis¹•Institutions (2)

Democritus University of Thrace¹, IMEC²

01 Mar 2006-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: The findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism, and a customized on-chip memory hierarchy mapping are presented.

Abstract: Memory latency has always been a major issue in embedded systems that execute memory-intensive applications. This is even more true as the gap between processor and memory speed continues to grow. Hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in large off-chip memories; however, both types of prefetching have their shortcomings. Hardware schemes are more complex and require extra circuitry to compute data access strides, while software schemes generate prefetch instructions, which, if not computed carefully, may hamper performance. On the other hand, some application domains (such as multimedia) have a uniform memory access pattern that is known a priori and that, if exploited, could yield significant application performance improvement. With this characteristic in mind, we present our findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism and a customized on-chip memory hierarchy mapping. Compared to previous approaches, we are able to estimate the performance and power metrics without actually implementing the embedded system. Experimental results on nine well-known multimedia and imaging applications prove the efficiency of our technique. Finally, we verify the performance estimations by implementing and simulating the algorithms on the TI C6201 processor.

Proceedings Article•10.5555/1898953.1899020•

A distributed paging RAM grid system for wide-area memory sharing

Rui Chu, Nong Xiao, Yongzhen Zhuang¹, Yunhao Liu¹, Xicheng Lu•Institutions (1)

Hong Kong University of Science and Technology¹

25 Apr 2006

TL;DR: This work proposes a service-oriented grid memory sharing scheme, distributed paging RAM grid (DPRG), and designs major operations and optimizations to fit the usage of grid systems.

Abstract: Memory-intensive applications often suffer from the poor performance of disk swapping when memory is inadequate. Remote memory sharing schemes, which provide a remote memory that is faster than the local hard disk, are able to improve the performance of such applications. Due to the limitation of being applicable within single clusters only, however, most of the previous remote memory mechanisms, such as the network memory scheme, fail to be extendable into a large scale, distributed, heterogeneous, and dynamic environment. In this work, we propose a service-oriented grid memory sharing scheme, distributed paging RAM grid (DPRG). We study the properties and criteria of large scale memory sharing, and then design major operations and optimizations to fit the usage of grid systems. We collect trace from our grid environment, and evaluate DPRG through comprehensive trace-driven simulations. Results show that DPRG significantly outperforms existing remote memory sharing schemes and supports grid computing applications effectively.

Patent•

Method and apparatus for profiling execution of code using multiple processors

Wen-Tzer Thomas Chen¹, Jimmie Earl Dewitt¹, Frank Eliot Levine¹, Enio Manuel Pineda¹•Institutions (1)

IBM¹

16 Mar 2006

TL;DR: A method for gathering performance-related data in a multiprocessing environment is described, in which instrumentation code is executed on a processor in a way that minimizes distortion to the processor resources used to execute the profiled program, and writes event data to a shared memory from which a set of profiling processors collects it.

Abstract: A computer implemented method, apparatus, and computer usable medium for gathering performance related data in a multiprocessing environment. Instrumentation code is executed on a processor that minimizes the distortion to the processor resources used to execute the program to be profiled. Data is written by the instrumentation code to a shared memory in response to an event occurring during execution of the program. The data is generated during execution of the program on the processor and the instrumentation code uses shared memory to convey the data to a profiling application running on a set of profiling processors. The data is collected by the set of profiling processors in the shared memory written by the instrumentation code.

Proceedings Article•10.5555/1182635.1164261•

Query co-processing on commodity processors

Anastassia Ailamaki¹, Naga K. Govindaraju², Stavros Harizopoulos³, Dinesh Manocha²•Institutions (3)

Carnegie Mellon University¹, University of North Carolina at Chapel Hill², Massachusetts Institute of Technology³

01 Sep 2006

TL;DR: This tutorial will survey the micro-architectural and architectural differences across these processors with data management in mind, and will present previous work and future opportunities for expanding query processing algorithms to other hardware than general-purpose processors.

Abstract: The rapid increase in the data volumes for the past few decades has intensified the need for high processing power for database and data mining applications. Researchers have actively sought to design and develop new architectures for improving the performance. Recent research shows that the performance can be significantly improved using either (a) effective utilization of architectural features and memory hierarchies used by the conventional processors, or (b) the high computational power and memory bandwidth in commodity hardware such as network processing units (NPUs), Cell processors and graphics processing units (GPUs). This tutorial will survey the micro-architectural and architectural differences across these processors with data management in mind, and will present previous work and future opportunities for expanding query processing algorithms to other hardware than general-purpose processors. In addition to the database community, we intend to increase awareness in the computer architecture scene about opportunities to construct heterogeneous chips.

Journal Article•10.1109/MM.2006.1•

Adaptive History-Based Memory Schedulers for Modern Processors

Ibrahim Hur¹, Calvin Lin²•Institutions (2)

IBM¹, University of Texas at Austin²

01 Jan 2006-IEEE Micro

TL;DR: A new memory scheduler is presented that makes decisions based on the history of recently scheduled operations, providing two advantages: it can better reason about the delays associated with complex DRAM structure, and it can adapt to different observed workloads.

Abstract: Careful memory scheduling can increase memory bandwidth and overall system performance. We present a new memory scheduler that makes decisions based on the history of recently scheduled operations, providing two advantages: it can better reason about the delays associated with complex DRAM structure, and it can adapt to different observed workloads.

Patent•

Synchronized objects for software transactional memory

Victor Luchangco¹, Virendra J. Marathe²•Institutions (2)

Sun Microsystems¹, Oracle Corporation²

14 Feb 2006

TL;DR: A transactional-memory manager coordinates memory access requests from a plurality of transactions in a software transactional memory system and, in response to a commit request from a given transaction in a set that has requested synchronization on an object, determines whether to commit or abort that transaction based at least in part on the transactional states of the other transactions in the set.

Abstract: A system for implementing synchronized objects for software transactional memory may include one or more processors and a memory storing program instructions executable by the processor to implement a transactional-memory manager configured to coordinate memory access requests directed at the memory from a plurality of transactions. The transactional-memory manager records, within a collaborator record for a shared data object in the memory, identifications of a set of two or more transactions that have requested synchronization on the object. In response to a commit request from a given transaction of the set, the transactional-memory manager determines whether to commit or abort the given transaction based at least in part on the transactional states of other transactions in the set, examining the collaborator record to identify the other transactions.
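
The commit decision described above can be sketched as follows. The abstract only says the decision depends "at least in part" on the states of the other transactions; the specific policy used here (abort if any collaborator has aborted, otherwise commit) is an assumption, as are the names `State`, `CollaboratorRecord`, and `decide`.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    COMMITTED = "committed"
    ABORTED = "aborted"


class CollaboratorRecord:
    """Per-object record of transactions that requested synchronization (hypothetical)."""

    def __init__(self):
        self.members = set()

    def request_sync(self, txn_id):
        self.members.add(txn_id)


def decide(txn_id, record, states):
    """Return the outcome for txn_id's commit request.

    Assumed policy, not taken from the patent: abort if any other
    collaborator has aborted, otherwise commit.
    """
    others = record.members - {txn_id}
    if any(states[t] is State.ABORTED for t in others):
        return State.ABORTED
    return State.COMMITTED


record = CollaboratorRecord()
for txn in ("t1", "t2", "t3"):
    record.request_sync(txn)

states = {"t1": State.ACTIVE, "t2": State.ACTIVE, "t3": State.ABORTED}
outcome = decide("t1", record, states)
```

In this run `t1` aborts because collaborator `t3` has aborted. A real transactional-memory manager would also need to make the decision atomic with respect to concurrent state changes, which the sketch leaves out.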
