Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2010
Journal Article•10.1109/TIT.2010.2054295•
Network Coding for Distributed Storage Systems

Alexandros G. Dimakis1, P B Godfrey2, Yunnan Wu3, Martin J. Wainwright4, Kannan Ramchandran4 •
University of Southern California1, University of Illinois at Urbana–Champaign2, Microsoft3, University of California, Berkeley4
01 Sep 2010-IEEE Transactions on Information Theory
TL;DR: It is shown that there is a fundamental tradeoff between storage and repair bandwidth, which is characterized theoretically using flow arguments on an appropriately constructed graph, and regenerating codes are introduced that can achieve any point on this optimal tradeoff.
Abstract: Distributed storage systems provide reliable access to data through redundancy spread over individually unreliable nodes. Application scenarios include data centers, peer-to-peer storage systems, and storage in wireless networks. Storing data using an erasure code, in fragments spread across nodes, requires less redundancy than simple replication for the same level of reliability. However, since fragments must be periodically replaced as nodes fail, a key question is how to generate encoded fragments in a distributed way while transferring as little data as possible across the network. For an erasure coded system, a common practice to repair from a single node failure is for a new node to reconstruct the whole encoded data object to generate just one encoded block. We show that this procedure is sub-optimal. We introduce the notion of regenerating codes, which allow a new node to communicate functions of the stored data from the surviving nodes. We show that regenerating codes can significantly reduce the repair bandwidth. Further, we show that there is a fundamental tradeoff between storage and repair bandwidth which we theoretically characterize using flow arguments on an appropriately constructed graph. By invoking constructive results in network coding, we introduce regenerating codes that can achieve any point in this optimal tradeoff.

2,171 citations
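The storage versus repair-bandwidth tradeoff described in this abstract can be made concrete at its two extreme points. The sketch below uses the closed forms usually quoted for the minimum-storage (MSR) and minimum-bandwidth (MBR) regenerating points. The function names and the symbols M (file size), k (number of nodes sufficient to reconstruct the file), and d (number of surviving nodes contacted during a repair) are illustrative notation that does not appear in this listing, so treat it as a sketch under those assumptions rather than the authors' own presentation.

```python
from fractions import Fraction


def msr_point(M, k, d):
    """Minimum-storage regenerating point.

    Returns (alpha, gamma): per-node storage and total repair bandwidth.
    Assumed notation: M = file size, k = nodes needed to rebuild the file,
    d = helper nodes contacted during one repair (k <= d).
    """
    M = Fraction(M)
    alpha = M / k
    gamma = M * d / (k * (d - k + 1))
    return alpha, gamma


def mbr_point(M, k, d):
    """Minimum-bandwidth regenerating point (storage equals repair traffic)."""
    M = Fraction(M)
    gamma = 2 * M * d / (2 * k * d - k * k + k)
    return gamma, gamma


def naive_repair_bandwidth(M):
    """Conventional erasure-code repair: download the whole object."""
    return Fraction(M)


if __name__ == "__main__":
    M, k, d = 1, 5, 9  # hypothetical example: 10 nodes, any 5 rebuild the file
    print("naive repair traffic:", naive_repair_bandwidth(M))
    print("MSR (alpha, gamma):  ", msr_point(M, k, d))
    print("MBR (alpha, gamma):  ", mbr_point(M, k, d))
```

With these hypothetical parameters, the naive repair moves the whole object (1 unit) to rebuild a fragment of size 1/5, while the MSR point moves 9/25 of a unit and the MBR point 9/35, at the price of storing 9/35 instead of 1/5 per node.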

Proceedings Article•10.1109/SC.2010.53•
The 48-core SCC Processor: the Programmer's View

Timothy G. Mattson1, Michael Riepen1, Thomas Lehnig, Paul Brett1, Werner Haas1, Patrick Kennedy, Jason Howard1, Sriram R. Vangal1, Nitin Borkar1, Greg Ruhl1, Saurabh Dighe1 •
Intel1
13 Nov 2010
TL;DR: This paper describes the programmer's view of the 48-core SCC processor, an intermediate case sharing traits of message passing and shared memory architectures, and RCCE, the native message passing model created for it.
Abstract: The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular we describe RCCE: the native message passing model created for the SCC processor.

283 citations

Patent•
Malware investigation by analyzing computer memory

Marcus Peinado1, Weidong Cui1•
Microsoft1
27 Apr 2010
TL;DR: Technology is described for malware investigation by analyzing computer memory in a computing device: static analysis is performed on code for a software environment to form an extended type graph, and a raw memory snapshot of the computer memory is obtained at runtime.
Abstract: Technology is described for malware investigation by analyzing computer memory in a computing device. The method can include performing static analysis on code for a software environment to form an extended type graph. A raw memory snapshot of the computer memory can be obtained at runtime. The raw memory snapshot may include the software environment executing on the computing device. Dynamic data structures can be found in the raw memory snapshot using the extended type graph to form an object graph. An authorized memory area can be defined having executable code, static data structures, and dynamic data structures. Implicit and explicit function pointers can be identified. The function pointers can be checked to validate that the function pointers reference a valid memory location in the authorized memory area and whether the computer memory is uncompromised.

217 citations

Proceedings Article•10.1145/1854273.1854350•
A case for NUMA-aware contention management on multicore systems

Sergey Blagodurov1, Alexandra Fedorova1, Sergey Zhuravlev1, Ali Kamali1•
Simon Fraser University1
11 Sep 2010
TL;DR: In this paper, a contention-aware scheduling algorithm for NUMA systems is proposed, in which the scheduler must consider not only the placement of threads but also the placement of their memory.
Abstract: On multicore systems contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers. We discovered that contention management is a lot more difficult on NUMA systems, because the scheduler must not only consider the placement of threads, but also the placement of their memory. This is mostly required to eliminate contention for memory controllers contrary to the popular belief that remote access latency is the dominant concern. In this work we quantify the effects on performance imposed by resource contention and remote access latency. This analysis inspires the design of a contention-aware scheduling algorithm for NUMA systems. This algorithm significantly outperforms a NUMA-unaware algorithm proposed before as well as the default Linux scheduler. We also investigate memory migration strategies, which are the necessary part of the NUMA contention-aware scheduling algorithm. Finally, we propose and evaluate a new contention management algorithm that is priority-aware.

209 citations

Patent•
System and Method for Processor-Based Security

Ruby B. Lee, David Champagne
19 Jan 2010
TL;DR: A system and method for processor-based security is presented, providing on-chip security and trusted computing services for software applications. The processor has a processor core, a cache memory, a plurality of registers for storing at least one hash value and at least one encryption key, a memory interface, and at least one on-chip instruction for creating a secure memory area in a memory external to the processor, together with a hypervisor program executed by the processor.
Abstract: A system and method for processor-based security is provided, for on-chip security and trusted computing services for software applications. A processor is provided having a processor core, a cache memory, a plurality of registers for storing at least one hash value and at least one encryption key, a memory interface, and at least one on-chip instruction for creating a secure memory area in a memory external to the processor, and a hypervisor program executed by the processor. The hypervisor program instructs the processor to execute the at least one on-chip instruction to create a secure memory area for a software module, and the processor encrypts data written to, and decrypts data read from, the external memory using the at least one encryption key and verifies data read from the external memory using the at least one hash value. Secure module interactions are provided, as well as the generation of a power-on key which can be used to protect memory in the event of a re-boot event. Lightweight, run-time attestation reports are generated which include selected information about software modules executed by the processors, for use in determining whether the processor is trusted to provide secure services.

203 citations

Posted Content•
A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence

Pablo D. Mininni1,2, Duane Rosenberg2, Raghu Reddy3, Annick Pouquet2 •
Facultad de Ciencias Exactas y Naturales1, National Center for Atmospheric Research2, Pittsburgh Supercomputing Center3
23 Mar 2010-arXiv: Computational Physics
TL;DR: In this paper, a hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared-memory parallelism is presented, motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence.
Abstract: A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from and augments a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretizations. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a maximum mean efficiency of 83%. Data are presented that demonstrate how to choose the optimal number of MPI processes and OpenMP threads in order to optimize code performance on two different platforms.

186 citations

Journal Article•10.1016/J.ADVWATRES.2010.07.007•
ParBreZo: A parallel, unstructured grid, Godunov-type, shallow-water code for high-resolution flood inundation modeling at the regional scale

Brett F. Sanders1, Jochen E. Schubert1, Russell L. Detwiler1•
University of California, Irvine1
01 Dec 2010-Advances in Water Resources
TL;DR: The Single Process Multiple Data (SPMD) paradigm of distributed-memory parallelism is implemented so the code can be executed on computer clusters with distributed memory, shared memory, or some combination of the two (now common with multi-core architectures).

156 citations

Proceedings Article•10.1145/1837274.1837363•
Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation

Jingtong Hu1, Chun Jason Xue2, Wei-Che Tseng1, Yi He1, Meikang Qiu3, Edwin H.-M. Sha1 •
University of Texas at Dallas1, City University of Hong Kong2, University of Kentucky3
13 Jun 2010
TL;DR: This paper targets embedded Chip Multiprocessors with Scratch Pad Memory (SPM) and non-volatile main memory and introduces data migration and recomputation techniques to reduce the number of write activities on non-volatile memories.
Abstract: Recent advances in circuit and process technologies have pushed non-volatile memory technologies into a new era. These technologies exhibit appealing properties such as low power consumption, non-volatility, shock-resistivity, and high density. However, there are challenges to which we need answers in the road of applying non-volatile memories as main memory in computer systems. First, non-volatile memories have a limited number of write/erase cycles compared with DRAM memory. Second, write activities on non-volatile memory are more expensive than DRAM memory in terms of energy consumption and access latency. Both challenges will benefit from reduction of the write activities on the non-volatile memory. In this paper, we target embedded Chip Multiprocessors (CMPs) with Scratch Pad Memory (SPM) and non-volatile main memory. We introduce data migration and recomputation techniques to reduce the number of write activities on non-volatile memories. Experimental results show that the proposed methods can reduce the number of writes by 59.41% on average, which means that the non-volatile memory can last 2.8 times as long as before. Meanwhile, the finish time of programs is reduced by 31.81% on average.

128 citations

Proceedings Article•10.5555/1883978.1884011•
Reasoning about the implementation of concurrency abstractions on x86-TSO

Scott Owens1•
University of Cambridge1
21 Jun 2010
TL;DR: A novel principle for reasoning about assembly programs on the authors' earlier x86-TSO memory model is developed and used to analyze five concurrency abstraction implementations: two spinlocks; a non-blocking write protocol; the double-checked locking idiom; and java.util.concurrent's Parker.
Abstract: With the rise of multi-core processors, shared-memory concurrency has become a widespread feature of computation, from hardware, to operating systems, to programming languages such as C++ and Java. However, none of these provide sequentially consistent shared memory; instead they have relaxed memory models, which make concurrent programs even more challenging to understand. Programming language implementations run on hardware memory models, so VM and run-time system implementors must reason at both levels. Of particular interest are the low-level implementations of the abstractions that support language-level concurrency, especially because they invariably contain data races. In this paper, we develop a novel principle for reasoning about assembly programs on our previous x86-TSO memory model, and we use it to analyze five concurrency abstraction implementations: two spinlocks (from Linux); a non-blocking write protocol; the double-checked locking idiom; and java.util.concurrent's Parker. Our principle, called triangular-race freedom, strengthens the usual data-race freedom style of reasoning.

108 citations

Journal Article•10.1145/1814539.1814548•
A real-time Java chip-multiprocessor

Christof Pitter1, Martin Schoeberl1•
Vienna University of Technology1
27 Aug 2010-ACM Transactions on Embedded Computing Systems
TL;DR: In this article, three different arbitration policies are presented, evaluated, and compared with respect to their real-time and average-case performance: a fixed priority, a fair-based, and a time-sliced arbiter.
Abstract: Chip-multiprocessors are an emerging trend for embedded systems. In this article, we introduce a real-time Java multiprocessor called JopCMP. It is a symmetric shared-memory multiprocessor, and consists of up to eight Java Optimized Processor (JOP) cores, an arbitration control device, and a shared memory. All components are interconnected via a system on chip bus. The arbiter synchronizes the access of multiple CPUs to the shared main memory. In this article, three different arbitration policies are presented, evaluated, and compared with respect to their real-time and average-case performance: a fixed priority, a fair-based, and a time-sliced arbiter. Tasks running on different CPUs of a chip-multiprocessor (CMP) influence each other's execution times when accessing a shared memory. Therefore, the system needs an arbiter that is able to limit the worst-case execution time of a task running on a CPU, even though tasks executing simultaneously on other CPUs access the main memory. Our research shows that timing analysis is in fact possible for homogeneous multiprocessor systems with a shared memory. The timing analysis of tasks, executing on the CMP using time-sliced memory arbitration, leads to viable worst-case execution time bounds. The time-sliced arbiter divides the memory access time into equal time slots, one time slot for each CPU. This memory arbitration scheme allows for a calculation of upper bounds of Java application worst-case execution times, depending on the number of CPUs, the time slot size, and the memory access time. Examples of worst-case execution time calculation are presented, and the analyzed results of a real-world application task are compared to measured execution time results. Finally, we evaluate the tradeoffs when using a time-predictable solution compared to using average-case optimized chip-multiprocessors, applying three different benchmarks. These experiments are carried out by executing the programs on the CMP prototype.

81 citations
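The abstract above says that a time-sliced arbiter lets an upper bound be computed from the number of CPUs, the slot size, and the memory access time. The following sketch illustrates that idea on a deliberately simplified model and is not the authors' analysis: it assumes (hypothetically) that each CPU owns one slot per round, that an access may only start at the first cycle of the CPU's own slot, and that an access fits inside one slot. All names are illustrative.

```python
def worst_case_access_cycles(n_cpus, slot_cycles, access_cycles):
    """Closed-form bound for one memory access under the simplified model.

    Assumptions (not from the paper): an access may start only at the first
    cycle of the requesting CPU's slot and must fit inside a single slot.
    """
    if access_cycles > slot_cycles:
        raise ValueError("access must fit inside one slot in this model")
    period = n_cpus * slot_cycles  # length of one full arbitration round
    # Worst case: the request arrives one cycle after the slot began,
    # so it waits period - 1 cycles and then performs the access.
    return (period - 1) + access_cycles


def brute_force_worst_case(n_cpus, slot_cycles, access_cycles):
    """Check the closed form by trying every arrival cycle in one round."""
    period = n_cpus * slot_cycles
    # CPU 0 owns the slot that starts at cycle 0 of every round.
    return max((-arrival) % period + access_cycles for arrival in range(period))


if __name__ == "__main__":
    # Hypothetical configuration: 8 cores, 6-cycle slots, 3-cycle accesses.
    print(worst_case_access_cycles(8, 6, 3))
    print(brute_force_worst_case(8, 6, 3))
```

The point of the model is that the bound depends only on the three parameters named in the abstract and not on what the other CPUs are doing, which is what makes the arbitration scheme analyzable.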

Proceedings Article•10.1145/1706299.1706304•
Coarse-grained transactions

Eric Koskinen1, Matthew Parkinson1, Maurice Herlihy2•
University of Cambridge1, Brown University2
17 Jan 2010
TL;DR: This paper defines a generalization of transactional memory in which a transaction consists of coarse-grained data-type operations rather than simple memory read/write operations, and discusses how the semantics applies to numerous TM implementation details discussed widely in the literature.
Abstract: Traditional transactional memory systems suffer from overly conservative conflict detection, yielding so-called false conflicts, because they are based on fine-grained, low-level read/write conflicts. In response, the recent trend has been toward integrating various abstract data-type libraries using ad-hoc methods of high-level conflict detection. These proposals have led to improved performance but a lack of a unified theory has led to confusion in the literature. We clarify these recent proposals by defining a generalization of transactional memory in which a transaction consists of coarse-grained (abstract data-type) operations rather than simple memory read/write operations. We provide semantics for both pessimistic (e.g. transactional boosting) and optimistic (e.g. traditional TMs and recent alternatives) execution. We show that both are included in the standard atomic semantics, yet find that the choice imposes different requirements on the coarse-grained operations: pessimistic requires operations be left-movers, optimistic requires right-movers. Finally, we discuss how the semantics applies to numerous TM implementation details discussed widely in the literature.
Proceedings Article•10.5555/1921479.1921489•
Large data visualization on distributed memory multi-GPU clusters

Thomas Fogal1, Hank Childs2, Siddharth Shankar3, Jens Krüger4, R. Daniel Bergeron1, Philip J. Hatcher1 •
University of New Hampshire1, Lawrence Berkeley National Laboratory2, University of Utah3, Analysis Group4
25 Jun 2010
TL;DR: In this paper, the authors study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
Abstract: Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualized on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory, and more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
Journal Article•10.1016/J.ADVENGSOFT.2008.12.004•
Parallel discrete element simulation of poly-dispersed granular material

Rimantas Kačianauskas1, Algirdas Maknickas1, Arnas Kačeniauskas1, Darius Markauskas1, Robertas Balevičius1 •
Vilnius Gediminas Technical University1
01 Jan 2010-Advances in Engineering Software
TL;DR: It has been shown that a drastic increase of computational expenses of simulation for the poly-dispersed material in terms of CPU time is associated with the increase of its heterogeneity.
Proceedings Article•10.1109/HIPC.2010.5713180•
A space-efficient parallel algorithm for computing betweenness centrality in distributed memory

Nicholas Edmonds1, Torsten Hoefler2, Andrew Lumsdaine1•
Indiana University1, University of Illinois at Urbana–Champaign2
1 Dec 2010
TL;DR: This paper presents a new parallelizable algorithm with low spatial complexity that is based on the best known sequential algorithm and is especially well suited to distributed memory processing because it can be implemented using coarse-grained parallelism.
Abstract: Betweenness centrality is a measure based on shortest paths that attempts to quantify the relative importance of nodes in a network. As computation of betweenness centrality becomes increasingly important in areas such as social network analysis, networks of interest are becoming too large to fit in the memory of a single processing unit, making parallel execution a necessity. Parallelization over the vertex set of the standard algorithm, with a final reduction of the centrality for each vertex, is straightforward but requires Ω(|V|^2) storage. In this paper we present a new parallelizable algorithm with low spatial complexity that is based on the best known sequential algorithm. Our algorithm requires O(|V| + |E|) storage and enables efficient parallel execution. Our algorithm is especially well suited to distributed memory processing because it can be implemented using coarse-grained parallelism. The presented time bounds for parallel execution of our algorithm on CRCW PRAM and on distributed memory systems both show good asymptotic performance. Experimental results with a distributed memory computer show the practical applicability of our algorithm.
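For readers unfamiliar with the measure, the sketch below computes betweenness centrality sequentially on a small unweighted, undirected graph using Brandes' algorithm, which is assumed here to be the kind of sequential algorithm the abstract builds on. It is not the distributed algorithm from the paper; the function name and the adjacency-dictionary input format are illustrative choices.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality of every vertex of an unweighted, undirected graph.

    adj maps each vertex to a list of its neighbours. One breadth-first search
    is run per source vertex, followed by a reverse sweep that accumulates
    each vertex's share of the shortest paths starting at that source.
    """
    centrality = {v: 0.0 for v in adj}
    for source in adj:
        order = []                       # vertices in order of discovery
        preds = {v: [] for v in adj}     # shortest-path predecessors
        sigma = {v: 0 for v in adj}      # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:          # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}    # dependency of source on each vertex
        for w in reversed(order):        # farthest vertices first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1]}   # 0 - 1 - 2
    print(betweenness(path))
```

Only the per-source working arrays and the graph itself are held at any time, which is the O(|V| + |E|) space behaviour that makes this formulation attractive as a starting point; running all sources at once and reducing afterwards is what leads to the quadratic storage mentioned in the abstract.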
Proceedings Article•10.1145/1810479.1810498•
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Michela Becchi1, Surendra Byna1, Srihari Cadambi1, Srimat Chakradhar1•
Princeton University1
13 Jun 2010
TL;DR: This paper describes a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core (CPU) and a throughput-oriented many-core (GPU), and shows that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
Abstract: In this paper, we describe a runtime to automatically enhance the performance of applications running on heterogeneous platforms consisting of a multi-core (CPU) and a throughput-oriented many-core (GPU). The CPU and GPU are connected by a non-coherent interconnect such as PCI-E, and as such do not have shared memory. Heterogeneous platforms available today such as [9] are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on CPU or GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems. We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32] and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice-versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel data is already located on the GPU memory due to prior decisions.
Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
Proceedings Article•10.5555/1870926.1870952•
An efficient distributed memory interface for many-core platform with 3D stacked DRAM

Igor Loi1, Luca Benini1•
University of Bologna1
8 Mar 2010
TL;DR: This paper presents an efficient and flexible distributed memory interface for 3D-stacked DRAM that ensures ultra-low-latency access to the memory modules on top of each processing element (vertically local memory neighborhoods) and takes full advantage of the lower latency of vertical interconnect.
Abstract: Historically, processor performance has increased at a much faster rate than that of main memory and upcoming NoC-based many-core architectures are further tightening the memory bottleneck. 3D integration based on TSV technology may provide a solution, as it enables stacking of multiple memory layers, with orders-of-magnitude increase in memory interface bandwidth, speed and energy efficiency. To fully exploit this potential, the architectural interface to vertically stacked memory must be streamlined. In this paper we present an efficient and flexible distributed memory interface for 3D-stacked DRAM. Our interface ensures ultra-low-latency access to the memory modules on top of each processing element (vertically local memory neighborhoods). Communication to these local modules does not travel through the NoC and takes full advantage of the lower latency of vertical interconnect, thus speeding up significantly the common case. The interface still supports a convenient global address space abstraction with high-latency remote access, due to the slower horizontal interconnect. Experimental results demonstrate significant bandwidth improvement that ranges from 1.44x to 7.40x as compared to the JEDEC standard, with peaks of 4.53GB/s for direct memory access, and 850MB/s for remote access through the NoC.
Proceedings Article•10.1109/HPCC.2010.23•
Enhancing Muesli's Data Parallel Skeletons for Multi-core Computer Architectures

Philipp Ciechanowicz1, Herbert Kuchen1•
University of Münster1
1 Sep 2010
TL;DR: This work presents an extension of the skeleton library Muesli that allows the same application to be used without modifications on a variety of parallel machines ranging from multi-processor distributed memory to many-core shared memory machines and combinations of those such as clusters of multi-core nodes.
Abstract: Algorithmic skeletons encapsulate typical parallel programming patterns such that they can be easily applied by users. Existing skeleton libraries usually work on distributed memory machines. We present an extension of our skeleton library Muesli which now allows to use the same application without modifications on a variety of parallel machines ranging from multi-processor distributed memory to many-core shared memory machines and combinations of those such as clusters of multi-core nodes. Internally, the skeletons are based on MPI and OpenMP. We demonstrate the efficiency of our approach by providing experimental results.
Proceedings Article•10.1145/1810479.1810510•
Towards optimizing energy costs of algorithms for shared memory architectures

Vijay Anand Korthikanti1, Gul Agha1•
University of Illinois at Urbana–Champaign1
13 Jun 2010
TL;DR: This paper develops a methodology to evaluate how the energy consumption of a given parallel algorithm changes as the number of cores and their frequency are varied, and establishes the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement.
Abstract: Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone: it must be modeled explicitly. This paper analyzes energy consumption of parallel algorithms executed on shared memory multicore processors. Specifically, we develop a methodology to evaluate how energy consumption of a given parallel algorithm changes as the number of cores and their frequency is varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step versus the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected for a wide range of these parameters.
Proceedings Article•10.5555/1870926.1871199•
Partitioning and allocation of scratch-pad memory for priority-based preemptive multi-task systems

Hideki Takase1, Hiroyuki Tomiyama1, Hiroaki Takada1•
Nagoya University1
8 Mar 2010
TL;DR: The three proposed methods (spatial, temporal, and hybrid) make effective use of the scratch-pad memory space, achieve energy reduction in the instruction memory subsystems, and are applicable to a real-time environment.
Abstract: Scratch-pad memory has been employed as a partial or entire replacement for cache memory due to its better energy efficiency. In this paper, we propose scratch-pad memory management techniques for priority-based preemptive multi-task systems. Our techniques are applicable to a real-time environment. The three methods which we propose, i.e., spatial, temporal, and hybrid methods, bring about effective usage of the scratch-pad memory space, and achieve energy reduction in the instruction memory subsystems. We formulate each method as an integer programming problem that simultaneously determines (1) partitioning of scratch-pad memory space for the tasks, and (2) allocation of program code to scratch-pad memory space for each task. It is remarkable that periods and priorities of tasks are considered in the formulas. Additionally, we implement an RTOS-hardware cooperative support mechanism for a runtime code allocation to the scratch-pad memory space. We have made the experiments with the fully functional real-time operating system. The experimental results with four task sets have demonstrated the effectiveness of our techniques. Up to 73 % energy reduction compared to a standard method was achieved.
Proceedings Article•10.1109/ICPPW.2010.62•
Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems

[...]

Jinpil Lee1, Mitsuhisa Sato1•
University of Tsukuba1
13 Sep 2010
TL;DR: This paper introduces XcalableMP and its compiler implementation, and presents a performance evaluation showing that users can describe parallelization for distributed memory systems with small modifications to the original sequential code.
Abstract: Although MPI is a de-facto standard for parallel programming on distributed memory systems, writing MPI programs is often a time-consuming and complicated process. XcalableMP is a language extension of C and Fortran for parallel programming on distributed memory systems that helps users to reduce those programming efforts. XcalableMP provides two programming models. The first one is the global view model, which supports typical parallelization based on the data and task parallel paradigm, and enables parallelizing the original sequential code using minimal modification with simple, OpenMP-like directives. The other one is the local view model, which allows using CAF-like expressions to describe inter-node communication. Users can even use MPI and OpenMP explicitly in our language to optimize performance. In this paper, we introduce XcalableMP, the implementation of the compiler, and the performance evaluation results. For the performance evaluation, we parallelized the HPCC Benchmark in XcalableMP. The evaluation shows that users can describe the parallelization for distributed memory systems with a small modification to the original sequential code.
Patent•
Distributed symmetric multiprocessing computing architecture

[...]

Richard S. Anderson
15 Nov 2010
TL;DR: In this article, the authors present a scalable symmetric multiprocessing (shared memory) computer architecture using a network of homogeneous multi-core servers, which combines new operating system extensions with a high speed network that supports remote direct memory access to achieve an effective global distributed shared memory.
Abstract: Example embodiments of the present invention include systems and methods for implementing a scalable symmetric multiprocessing (shared memory) computer architecture using a network of homogeneous multi-core servers. The level of processor and memory performance achieved is suitable for running applications that currently require cache coherent shared memory mainframes and supercomputers. The architecture combines new operating system extensions with a high-speed network that supports remote direct memory access to achieve an effective global distributed shared memory. A distributed thread model allows a process running in a head node to fork threads in other (worker) nodes that run in the same global address space. Thread synchronization is supported by a distributed mutex implementation. A transactional memory model allows a multi-threaded program to maintain global memory page consistency across the distributed architecture. A distributed file access implementation supports non-contentious file I/O for threads. These and other functions provide a symmetric multiprocessing programming model consistent with standards such as Portable Operating System Interface for Unix (POSIX).
Proceedings Article•10.1109/MICRO.2010.39•
Memory Latency Reduction via Thread Throttling

[...]

Hsiang-Yun Cheng1, Chung-Hsiang Lin1, Jian Li2, Chia-Lin Yang1•
National Taiwan University1, IBM2
4 Dec 2010
TL;DR: A memory thread throttling mechanism that tunes the allowable memory threads dynamically under workload variation to improve system performance and leads to a geometric mean of 12% performance improvement for real-world applications on the same hardware.
Abstract: The Memory Wall is a well-known obstacle to processor performance improvement. The popularity of multi-core architecture will further exacerbate the problem since the memory resource is shared by all cores. Interference among requests from different cores may prolong the latency of memory accesses, thereby degrading the system performance. To tackle the problem, this paper proposes to decouple application threads into compute and memory tasks, and restrict the number of concurrent memory tasks to avoid the interference among memory requests. Yet with this scheduling restriction, a CPU core may unnecessarily stay idle, which has an adverse impact on the overall performance. Therefore, we develop a memory thread throttling mechanism that tunes the allowable memory threads dynamically under workload variation to improve system performance. The proposed run-time mechanism monitors memory and computation ratios of a program for phase detection. It then decides the memory thread constraint for the next program phase based on an analytical model that can estimate system performance under different constraint values. To prove the concept, we prototype the mechanism in some real-world applications as well as synthetic workloads. We evaluate their performance on real machines. The experimental results demonstrate up to 20% speedup with a pool of synthetic workloads on an Intel i7 (Nehalem) machine and match the speedup estimated by the proposed analytical model. Furthermore, the intelligent run-time scheduling leads to a geometric mean of 12% performance improvement for real-world applications on the same hardware.
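The core restriction described in this abstract, capping how many memory tasks run at once while compute tasks run freely, can be illustrated with a counting semaphore. The following is a minimal sketch and not the authors' mechanism: the class `MemoryTaskThrottle`, the function `run_workload`, and the fixed `limit` are hypothetical names chosen for illustration, and the dynamic, model-driven tuning of the limit described in the abstract is not shown.

```python
import threading


class MemoryTaskThrottle:
    """Caps the number of concurrently running memory tasks (hypothetical helper)."""

    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)  # static limit; the paper tunes it dynamically
        self._lock = threading.Lock()
        self.active = 0   # memory tasks currently running
        self.peak = 0     # highest concurrency observed

    def run_memory_task(self, fn):
        with self._sem:  # blocks while `limit` memory tasks are already running
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return fn()
            finally:
                with self._lock:
                    self.active -= 1


def run_workload(num_threads, limit):
    """Each thread runs one throttled memory task, then one unthrottled compute task."""
    throttle = MemoryTaskThrottle(limit)
    data = list(range(10000))
    results = [0] * num_threads

    def worker(k):
        # Memory task: gather this thread's strided slice of the shared data.
        chunk = throttle.run_memory_task(lambda: data[k::num_threads])
        # Compute task: not throttled.
        results[k] = sum(x * x for x in chunk)

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return throttle.peak, sum(results)
```

Calling `run_workload(8, 2)` runs eight threads but never more than two memory tasks at a time. A dynamic variant would adjust the limit between program phases, which is the part the paper addresses with its analytical model.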
Proceedings Article•10.1109/IPDPSW.2010.5470813•
Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

[...]

Markus Wittmann1, Georg Hager1, Gerhard Wellein1•
University of Erlangen-Nuremberg1
19 Apr 2010
TL;DR: A new pipelined approach is introduced that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead and demonstrates how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
Abstract: New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
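To make the idea of temporal blocking concrete, the sketch below applies several time steps of a 1D three-point averaging stencil to one spatial block at a time, so each block is read from main memory once per group of steps rather than once per step. It uses the simple overlapped (halo) variant, which is an assumption made for illustration and is not the pipelined shared-cache approach the paper introduces. The function names and the fixed-boundary stencil are likewise hypothetical.

```python
def jacobi_sweeps(u, steps):
    """Apply `steps` three-point averaging sweeps; the two end values stay fixed."""
    cur = list(u)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_time_blocked(u, steps, block):
    """Same result as jacobi_sweeps, computed block by block with halos of width `steps`."""
    n = len(u)
    out = list(u)
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        # Extend the block by `steps` cells per side (clamped to the domain).
        # Stale values at an artificial halo edge move inward one cell per sweep,
        # so after `steps` sweeps they have not reached the cells in [lo, hi).
        a = max(0, lo - steps)
        b = min(n, hi + steps)
        local = jacobi_sweeps(u[a:b], steps)
        out[lo:hi] = local[lo - a:hi - a]
    return out
```

The overlapped variant trades redundant halo computation for fewer passes over memory; larger `steps` means more reuse per block but wider halos.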
Patent•
System and Method for Executing a Transaction Using Parallel Co-Transactions

[...]

Mark S. Moir1, Robert E. Cypher1, Daniel S. Nussbaum1•
Business International Corporation1
31 Mar 2010
TL;DR: The transactional memory system described in this paper implements parallel co-transactions that access a shared memory such that at most one of the transactions in a set will succeed and all others will fail.
Abstract: The transactional memory system described herein may implement parallel co-transactions that access a shared memory such that at most one of the co-transactions in a set will succeed and all others will fail (e.g., be aborted). Co-transactions may improve the performance of programs that use transactional memory by attempting to perform the same high-level operation using multiple algorithmic approaches, transactional memory implementations and/or speculation options in parallel, and allowing only the first to complete to commit its results. If none of the co-transactions succeeds, one or more may be retried, possibly using a different approach and/or transactional memory implementation. The at-most-one property may be managed through the use of a shared “done” flag. Conflicts between co-transactions in a set and accesses made by transactions or activities outside the set may be managed using lazy write ownership acquisition and/or a priority-based approach. Each co-transaction may execute on a different processor resource.
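The at-most-one property enforced through a shared "done" flag can be sketched with ordinary threads. This is only an illustration under stated assumptions: a lock-protected flag stands in for the transactional memory system, and `CoTransactionSet`, `try_commit`, and `run_co_transactions` are hypothetical names that do not come from the patent.

```python
import threading


class CoTransactionSet:
    """Shared state for a set of co-transactions; at most one may commit."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for transactional atomicity
        self.done = False              # the shared "done" flag
        self.winner = None
        self.result = None

    def try_commit(self, name, value):
        """Commit `value` if no other co-transaction has; otherwise report failure."""
        with self._lock:
            if self.done:
                return False  # another co-transaction already committed: abort
            self.done = True
            self.winner = name
            self.result = value
            return True


def run_co_transactions(approaches):
    """Run every approach in parallel and let only the first finisher commit."""
    cts = CoTransactionSet()
    outcomes = {}

    def attempt(name, fn):
        value = fn()  # speculative work, not yet visible to others
        outcomes[name] = cts.try_commit(name, value)

    threads = [threading.Thread(target=attempt, args=(name, fn))
               for name, fn in approaches.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cts, outcomes
```

For example, passing two different sorting strategies as `approaches` yields exactly one `True` in `outcomes`, and `cts.result` holds the output of whichever strategy committed first.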
Proceedings Article•
User-defined distributions and layouts in chapel: philosophy and framework

[...]

Bradford L. Chamberlain1, Steven J. Deitz1, David Iten1, Sung-Eun Choi1•
Cray1
14 Jun 2010
TL;DR: This paper introduces user-defined domain maps, a novel concept for implementing distributions and memory layouts for parallel data aggregates in Chapel; they are designed for generality and do not rely on hard-coding a fixed set of distributions into the compiler and runtime.
Abstract: This paper introduces user-defined domain maps, a novel concept for implementing distributions and memory layouts for parallel data aggregates. Our domain maps implement parallel arrays for the Chapel programming language and are themselves implemented using standard Chapel features. Domain maps export a functional interface that our compiler targets as it maps from a user's global-view algorithm down to the task-level operations required to implement the computation for multicore processors, GPUs, and distributed memory architectures. Unlike distributions in HPF and ZPL, Chapel's domain maps are designed for generality and do not rely on hard-coding a fixed set of distributions into the compiler and runtime. The chief contributions of this paper are its description of our motivating principles and an overview of our framework.
Journal Article•10.1016/J.JPDC.2010.02.003•
Embedded-TM: Energy and complexity-effective hardware transactional memory for embedded multicore systems

[...]

Cesare Ferri1, Samantha Wood2, Tali Moreshet3, R. Iris Bahar1, Maurice Herlihy1 •
Brown University1, Bryn Mawr College2, Swarthmore College3
01 Oct 2010-Journal of Parallel and Distributed Computing
TL;DR: It is found that ignoring energy considerations can lead to poor design choices, particularly for resource-constrained embedded platforms, and with the right balance of energy efficiency and simplicity, HTM will become an attractive choice for future embedded system designs.
Proceedings Article•10.1145/2020373.2020376•
Hybrid PGAS runtime support for multicore nodes

[...]

Filip Blagojevic1, Paul Hargrove1, Costin Iancu1, Katherine Yelick1•
Lawrence Berkeley National Laboratory1
12 Oct 2010
TL;DR: This paper presents a new process-based shared memory runtime for PGAS languages, and shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous pthreads implementation.
Abstract: With multicore processors as the standard building block for high performance systems, parallel runtime systems need to provide excellent performance on shared memory, distributed memory, and hybrids. Conventional wisdom suggests that threads should be used as the runtime mechanism within shared memory, and two runtime versions for shared and distributed memory are often designed and implemented separately, retrofitting after the fact for hybrid systems. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and they can co-exist with one another. We evaluate the shared memory runtime approaches, showing that they interact in important and sometimes surprising ways with the communication layer. Using a set of microbenchmarks and application level benchmarks on an IBM BG/P, Cray XT, and InfiniBand cluster, we show that threads, processes and combinations of both are needed for maximum performance. Our new runtime shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous implementation. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.
Patent•
Memory system including a controller and processors associated with memory devices

[...]

Michael Shachar1, Barak Rotbard1, Oren Golov1, Uri Perlmutter1, Dotan Sokolov1, Julian Vlaiko1, Yair Schwartz1 •
Apple Inc.1
17 Jan 2010
TL;DR: A data storage system includes a plurality of non-volatile memory devices arranged in one or more sets, a main controller, and one or more processors. The main controller accepts commands from a host and converts them into recipes; each processor is associated with a respective set of the non-volatile memory devices and executes the recipes it receives from the main controller.
Abstract: A data storage system includes a plurality of non-volatile memory devices arranged in one or more sets, a main controller and one or more processors. The main controller is configured to accept commands from a host and to convert the commands into recipes. Each recipe includes a list of multiple memory operations to be performed sequentially in the non-volatile memory devices belonging to one of the sets. Each of the processors is associated with a respective set of the non-volatile memory devices, and is configured to receive one or more of the recipes from the main controller and to execute the memory operations specified in the received recipes in the non-volatile memory devices belonging to the respective set.
Journal Article•10.1137/090757010•
Fast Access to Distributed Atomic Memory

[...]

Partha Dutta1, Rachid Guerraoui2, Ron R. Levy, Marko Vukolic3•
IBM1, École Polytechnique Fédérale de Lausanne2, Institut Eurécom3
01 Aug 2010-SIAM Journal on Computing
TL;DR: The exact conditions under which every read and write involves one round of communication with the servers are determined, which relate the number of readers to the tolerated number of faulty servers and the nature of these failures.
Abstract: We study efficient and robust implementations of an atomic read-write data structure over an asynchronous distributed message-passing system made of reader and writer processes, as well as a number of servers implementing the data structure. We determine the exact conditions under which every read and write involves one round of communication with the servers. These conditions relate the number of readers to the tolerated number of faulty servers and the nature of these failures.
Journal Article•10.1137/090756934•
A Novel Parallel QR Algorithm for Hybrid Distributed Memory HPC Systems

[...]

Robert Granat, Bo Kågström, Daniel Kressner1•
ETH Zurich1
01 Jun 2010-SIAM Journal on Scientific Computing
TL;DR: In this paper, a novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing systems is presented, which introduces the concept of multiwindow bulge chain chasing and parallelizes aggressive early deflation.
Abstract: A novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing systems is presented. For this purpose, we introduce the concept of multiwindow bulge chain chasing and parallelize aggressive early deflation. The multiwindow approach ensures that most computations when chasing chains of bulges are performed in level 3 BLAS operations, while the aim of aggressive early deflation is to speed up the convergence of the QR algorithm. Mixed MPI-OpenMP coding techniques are utilized for porting the codes to distributed memory platforms with multithreaded nodes, such as multicore processors. Numerous numerical experiments confirm the superior performance of our parallel QR algorithm in comparison with the existing ScaLAPACK code, leading to an implementation that is one to two orders of magnitude faster for sufficiently large problems, including a number of examples from applications.