Top 322 papers published on the topic of distributed memory in 2010

Journal Article•10.1109/TIT.2010.2054295•

Network Coding for Distributed Storage Systems

Alexandros G. Dimakis¹, P. B. Godfrey², Yunnan Wu³, Martin J. Wainwright⁴, Kannan Ramchandran⁴•Institutions (4)

University of Southern California¹, University of Illinois at Urbana–Champaign², Microsoft³, University of California, Berkeley⁴

01 Sep 2010-IEEE Transactions on Information Theory

TL;DR: It is shown that there is a fundamental tradeoff between storage and repair bandwidth, which is theoretically characterized using flow arguments on an appropriately constructed graph, and regenerating codes are introduced that can achieve any point in this optimal tradeoff.

Abstract: Distributed storage systems provide reliable access to data through redundancy spread over individually unreliable nodes. Application scenarios include data centers, peer-to-peer storage systems, and storage in wireless networks. Storing data using an erasure code, in fragments spread across nodes, requires less redundancy than simple replication for the same level of reliability. However, since fragments must be periodically replaced as nodes fail, a key question is how to generate encoded fragments in a distributed way while transferring as little data as possible across the network. For an erasure coded system, a common practice to repair from a single node failure is for a new node to reconstruct the whole encoded data object to generate just one encoded block. We show that this procedure is sub-optimal. We introduce the notion of regenerating codes, which allow a new node to communicate functions of the stored data from the surviving nodes. We show that regenerating codes can significantly reduce the repair bandwidth. Further, we show that there is a fundamental tradeoff between storage and repair bandwidth which we theoretically characterize using flow arguments on an appropriately constructed graph. By invoking constructive results in network coding, we introduce regenerating codes that can achieve any point in this optimal tradeoff.

2,171 citations
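
As a rough numerical illustration of the storage versus repair-bandwidth tradeoff described in the abstract above, the sketch below evaluates the two extreme points of the tradeoff curve. The closed forms for the minimum-storage (MSR) and minimum-bandwidth (MBR) points and the cut-set condition are recalled from the regenerating-codes literature, not stated in the abstract, so treat them as assumptions. The symbols M (file size), k (nodes needed to reconstruct the file), d (helper nodes contacted during a repair), alpha (storage per node) and gamma (total repair download) are likewise introduced here for illustration.

```python
from fractions import Fraction


def msr_point(M, k, d):
    """Minimum-storage point (assumed closed form): each node stores M/k."""
    alpha = Fraction(M, k)
    gamma = Fraction(M * d, k * (d - k + 1))
    return alpha, gamma


def mbr_point(M, k, d):
    """Minimum-bandwidth point (assumed closed form): storage equals repair download."""
    alpha = gamma = Fraction(2 * M * d, 2 * k * d - k * k + k)
    return alpha, gamma


def cut_set_capacity(alpha, gamma, k, d):
    """Min-cut of the information flow graph; it must be at least M."""
    beta = gamma / d  # amount downloaded from each of the d helpers
    return sum(min(alpha, (d - i) * beta) for i in range(k))


if __name__ == "__main__":
    M, k, d = 1, 5, 9
    # Naive erasure-code repair downloads the whole object, i.e. gamma = M.
    print("naive repair download:", Fraction(M))
    for name, (alpha, gamma) in (("MSR", msr_point(M, k, d)),
                                 ("MBR", mbr_point(M, k, d))):
        cap = cut_set_capacity(alpha, gamma, k, d)
        print(name, "alpha =", alpha, "gamma =", gamma, "min-cut =", cap)
```

Under these assumed formulas, both points meet the cut-set condition with equality, and both need less repair traffic than downloading the whole object.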

Proceedings Article•10.1109/SC.2010.53•

The 48-core SCC Processor: the Programmer's View

Timothy G. Mattson¹, Michael Riepen¹, Thomas Lehnig, Paul Brett¹, Werner Haas¹, Patrick Kennedy, Jason Howard¹, Sriram R. Vangal¹, Nitin Borkar¹, Greg Ruhl¹, Saurabh Dighe¹•Institutions (1)

Intel¹

13 Nov 2010

TL;DR: Describes the programmer's view of the 48-core SCC processor, an intermediate case sharing traits of message passing and shared memory architectures, and RCCE, the native message passing model created for the SCC processor.

Abstract: The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular we describe RCCE: the native message passing model created for the SCC processor.

283 citations

Patent•

Malware investigation by analyzing computer memory

Marcus Peinado¹, Weidong Cui¹•Institutions (1)

Microsoft¹

27 Apr 2010

TL;DR: Technology for malware investigation by analyzing computer memory in a computing device: static analysis of code for a software environment forms an extended type graph, which is then used to find dynamic data structures in a raw memory snapshot obtained at runtime.

Abstract: Technology is described for malware investigation by analyzing computer memory in a computing device. The method can include performing static analysis on code for a software environment to form an extended type graph. A raw memory snapshot of the computer memory can be obtained at runtime. The raw memory snapshot may include the software environment executing on the computing device. Dynamic data structures can be found in the raw memory snapshot using the extended type graph to form an object graph. An authorized memory area can be defined having executable code, static data structures, and dynamic data structures. Implicit and explicit function pointers can be identified. The function pointers can be checked to validate that the function pointers reference a valid memory location in the authorized memory area and whether the computer memory is uncompromised.

217 citations

Proceedings Article•10.1145/1854273.1854350•

A case for NUMA-aware contention management on multicore systems

Sergey Blagodurov¹, Alexandra Fedorova¹, Sergey Zhuravlev¹, Ali Kamali¹•Institutions (1)

Simon Fraser University¹

11 Sep 2010

TL;DR: In this paper, a contention-aware scheduling algorithm for NUMA systems is proposed, in which the scheduler must consider not only the placement of threads but also the placement of their memory.

Abstract: On multicore systems contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers. We discovered that contention management is a lot more difficult on NUMA systems, because the scheduler must not only consider the placement of threads, but also the placement of their memory. This is mostly required to eliminate contention for memory controllers contrary to the popular belief that remote access latency is the dominant concern. In this work we quantify the effects on performance imposed by resource contention and remote access latency. This analysis inspires the design of a contention-aware scheduling algorithm for NUMA systems. This algorithm significantly outperforms a NUMA-unaware algorithm proposed before as well as the default Linux scheduler. We also investigate memory migration strategies, which are the necessary part of the NUMA contention-aware scheduling algorithm. Finally, we propose and evaluate a new contention management algorithm that is priority-aware.

209 citations

Patent•

System and Method for Processor-Based Security

Ruby B. Lee, David Champagne

19 Jan 2010

TL;DR: A system and method for processor-based security that provides on-chip security and trusted computing services for software applications. The processor has a processor core, a cache memory, a plurality of registers for storing at least one hash value and at least one encryption key, a memory interface, and at least one on-chip instruction for creating a secure memory area in a memory external to the processor; a hypervisor program is executed by the processor.

Abstract: A system and method for processor-based security is provided, for on-chip security and trusted computing services for software applications. A processor is provided having a processor core, a cache memory, a plurality of registers for storing at least one hash value and at least one encryption key, a memory interface, and at least one on-chip instruction for creating a secure memory area in a memory external to the processor, and a hypervisor program executed by the processor. The hypervisor program instructs the processor to execute the at least one on-chip instruction to create a secure memory area for a software module, and the processor encrypts data written to, and decrypts data read from, the external memory using the at least one encryption key and verifies data read from the external memory using the at least one hash value. Secure module interactions are provided, as well as the generation of a power-on key which can be used to protect memory in the event of a re-boot event. Lightweight, run-time attestation reports are generated which include selected information about software modules executed by the processors, for use in determining whether the processor is trusted to provide secure services.

203 citations

Posted Content•

A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence

Pablo D. Mininni¹,², Duane Rosenberg², Raghu Reddy³, Annick Pouquet²•Institutions (3)

Facultad de Ciencias Exactas y Naturales¹, National Center for Atmospheric Research², Pittsburgh Supercomputing Center³

23 Mar 2010-arXiv: Computational Physics

TL;DR: In this paper, a hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared-memory parallelism is presented, motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence.

Abstract: A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from and augments a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretizations. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a maximum mean efficiency of 83%. Data are presented that demonstrate how to choose the optimal number of MPI processes and OpenMP threads in order to optimize code performance on two different platforms.

186 citations
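
The abstract above mentions choosing the number of MPI processes and OpenMP threads and a domain decomposition of the pseudospectral grid. The sketch below illustrates one way to reason about that choice. It assumes a one-dimensional slab decomposition in which each MPI rank owns a contiguous block of grid planes, so the number of ranks cannot exceed the number of planes. That decomposition, the helper names `slab_bounds` and `hybrid_layouts`, and the example sizes are assumptions made for illustration and are not taken from the paper.

```python
def slab_bounds(n_planes, n_ranks, rank):
    """Half-open range [lo, hi) of grid planes owned by `rank` in a 1D slab split."""
    base, extra = divmod(n_planes, n_ranks)
    lo = rank * base + min(rank, extra)       # the first `extra` ranks get one more plane
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def hybrid_layouts(n_planes, cores, max_threads):
    """List (mpi_ranks, omp_threads) pairs that use exactly `cores` cores.

    A layout is kept only if every rank owns at least one plane
    (mpi_ranks <= n_planes) and the thread count fits within one node
    (omp_threads <= max_threads).
    """
    layouts = []
    for threads in range(1, max_threads + 1):
        if cores % threads == 0:
            ranks = cores // threads
            if ranks <= n_planes:
                layouts.append((ranks, threads))
    return layouts


if __name__ == "__main__":
    # Hypothetical sizes: a 1024^3 grid on 4096 cores with 8 cores per node.
    print(hybrid_layouts(1024, 4096, 8))
    print([slab_bounds(10, 4, r) for r in range(4)])
```

In this toy setting a pure MPI run (one thread per rank) is not possible on 4096 cores, because it would need more ranks than there are planes; adding threads per rank is what makes the larger core count usable.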

Journal Article•10.1016/J.ADVWATRES.2010.07.007•

ParBreZo: A parallel, unstructured grid, Godunov-type, shallow-water code for high-resolution flood inundation modeling at the regional scale

Brett F. Sanders¹, Jochen E. Schubert¹, Russell L. Detwiler¹•Institutions (1)

University of California, Irvine¹

01 Dec 2010-Advances in Water Resources

TL;DR: The Single Process Multiple Data (SPMD) paradigm of distributed-memory parallelism is implemented so the code can be executed on computer clusters with distributed memory, shared memory, or some combination of the two (now common with multi-core architectures).

156 citations

Proceedings Article•10.1145/1837274.1837363•

Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation

Jingtong Hu¹, Chun Jason Xue², Wei-Che Tseng¹, Yi He¹, Meikang Qiu³, Edwin H.-M. Sha¹•Institutions (3)

University of Texas at Dallas¹, City University of Hong Kong², University of Kentucky³

13 Jun 2010

TL;DR: This paper targets embedded Chip Multiprocessors with Scratch Pad Memory (SPM) and non-volatile main memory and introduces data migration and recomputation techniques to reduce the number of write activities on non-volatile memories.

Abstract: Recent advances in circuit and process technologies have pushed non-volatile memory technologies into a new era. These technologies exhibit appealing properties such as low power consumption, non-volatility, shock-resistivity, and high density. However, there are challenges to which we need answers on the road to applying non-volatile memories as main memory in computer systems. First, non-volatile memories have a limited number of write/erase cycles compared with DRAM memory. Second, write activities on non-volatile memory are more expensive than on DRAM memory in terms of energy consumption and access latency. Both challenges will benefit from reduction of the write activities on the non-volatile memory. In this paper, we target embedded Chip Multiprocessors (CMPs) with Scratch Pad Memory (SPM) and non-volatile main memory. We introduce data migration and recomputation techniques to reduce the number of write activities on non-volatile memories. Experimental results show that the proposed methods can reduce the number of writes by 59.41% on average, which means that the non-volatile memory can last 2.8 times as long as before. Meanwhile, the finish time of programs is reduced by 31.81% on average.

128 citations

Proceedings Article•10.5555/1883978.1884011•

Reasoning about the implementation of concurrency abstractions on x86-TSO

Scott Owens¹•Institutions (1)

University of Cambridge¹

21 Jun 2010

TL;DR: A novel principle for reasoning about assembly programs on the authors' earlier x86-TSO memory model is developed and used to analyze five concurrency abstraction implementations: two spinlocks; a non-blocking write protocol; the double-checked locking idiom; and java.util.concurrent's Parker.

Abstract: With the rise of multi-core processors, shared-memory concurrency has become a widespread feature of computation, from hardware, to operating systems, to programming languages such as C++ and Java. However, none of these provide sequentially consistent shared memory; instead they have relaxed memory models, which make concurrent programs even more challenging to understand. Programming language implementations run on hardware memory models, so VM and run-time system implementors must reason at both levels. Of particular interest are the low-level implementations of the abstractions that support language-level concurrency, especially because they invariably contain data races. In this paper, we develop a novel principle for reasoning about assembly programs on our previous x86-TSO memory model, and we use it to analyze five concurrency abstraction implementations: two spinlocks (from Linux); a non-blocking write protocol; the double-checked locking idiom; and java.util.concurrent's Parker. Our principle, called triangular-race freedom, strengthens the usual data-race freedom style of reasoning.

108 citations

Journal Article•10.1145/1814539.1814548•

A real-time Java chip-multiprocessor

Christof Pitter¹, Martin Schoeberl¹•Institutions (1)

Vienna University of Technology¹

27 Aug 2010-ACM Transactions on Embedded Computing Systems

TL;DR: In this article, three different arbitration policies are presented, evaluated, and compared with respect to their real-time and average-case performance: a fixed priority, a fair-based, and a time-sliced arbiter.

Abstract: Chip-multiprocessors are an emerging trend for embedded systems. In this article, we introduce a real-time Java multiprocessor called JopCMP. It is a symmetric shared-memory multiprocessor, and consists of up to eight Java Optimized Processor (JOP) cores, an arbitration control device, and a shared memory. All components are interconnected via a system on chip bus. The arbiter synchronizes the access of multiple CPUs to the shared main memory. In this article, three different arbitration policies are presented, evaluated, and compared with respect to their real-time and average-case performance: a fixed priority, a fair-based, and a time-sliced arbiter. Tasks running on different CPUs of a chip-multiprocessor (CMP) influence each other's execution times when accessing a shared memory. Therefore, the system needs an arbiter that is able to limit the worst-case execution time of a task running on a CPU, even though tasks executing simultaneously on other CPUs access the main memory. Our research shows that timing analysis is in fact possible for homogeneous multiprocessor systems with a shared memory. The timing analysis of tasks, executing on the CMP using time-sliced memory arbitration, leads to viable worst-case execution time bounds. The time-sliced arbiter divides the memory access time into equal time slots, one time slot for each CPU. This memory arbitration scheme allows for a calculation of upper bounds of Java application worst-case execution times, depending on the number of CPUs, the time slot size, and the memory access time. Examples of worst-case execution time calculation are presented, and the analyzed results of a real-world application task are compared to measured execution time results. Finally, we evaluate the tradeoffs when using a time-predictable solution compared to using average-case optimized chip-multiprocessors, applying three different benchmarks. These experiments are carried out by executing the programs on the CMP prototype.

81 citations
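
The abstract above says that time-sliced arbitration allows an upper bound on execution time to be computed from the number of CPUs, the time slot size, and the memory access time. The sketch below works through that idea under a deliberately simple model that is an assumption of this example, not the paper's analysis: time is counted in cycles, each CPU owns one slot per round, and an access must start and finish inside its CPU's own slot. The function names and the closed form are illustrative only.

```python
def worst_case_access_latency(n_cpus, slot, access):
    """Brute-force the worst latency (wait + access) of one memory access.

    Arrival times are measured from the start of the requesting CPU's slot.
    """
    if access > slot:
        raise ValueError("an access must fit inside one slot")
    period = n_cpus * slot                    # length of one arbitration round
    worst = 0
    for arrival in range(period):
        if arrival <= slot - access:
            start = arrival                   # still fits in the current own slot
        else:
            start = period                    # wait for the next own slot
        worst = max(worst, start - arrival + access)
    return worst


def closed_form_latency(n_cpus, slot, access):
    """Closed form matching the brute-force search under the same model."""
    return (n_cpus - 1) * slot + 2 * access - 1


def task_wcet_bound(compute_cycles, n_accesses, n_cpus, slot, access):
    """Pessimistic bound: every access is assumed to hit the worst case."""
    return compute_cycles + n_accesses * worst_case_access_latency(n_cpus, slot, access)


if __name__ == "__main__":
    print(worst_case_access_latency(4, 6, 2), closed_form_latency(4, 6, 2))
    print(task_wcet_bound(1000, 50, 4, 6, 2))
```

The bound grows linearly with the number of CPUs and the slot size, which is the tradeoff a time-predictable arbiter accepts in exchange for analyzability.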

Proceedings Article•10.1145/1706299.1706304•

Coarse-grained transactions

Eric Koskinen¹, Matthew Parkinson¹, Maurice Herlihy²•Institutions (2)

University of Cambridge¹, Brown University²

17 Jan 2010

TL;DR: Defines a generalization of transactional memory in which a transaction consists of coarse-grained data-type operations rather than simple memory read/write operations, and discusses how the semantics applies to numerous TM implementation details discussed widely in the literature.

Abstract: Traditional transactional memory systems suffer from overly conservative conflict detection, yielding so-called false conflicts, because they are based on fine-grained, low-level read/write conflicts. In response, the recent trend has been toward integrating various abstract data-type libraries using ad-hoc methods of high-level conflict detection. These proposals have led to improved performance but a lack of a unified theory has led to confusion in the literature. We clarify these recent proposals by defining a generalization of transactional memory in which a transaction consists of coarse-grained (abstract data-type) operations rather than simple memory read/write operations. We provide semantics for both pessimistic (e.g. transactional boosting) and optimistic (e.g. traditional TMs and recent alternatives) execution. We show that both are included in the standard atomic semantics, yet find that the choice imposes different requirements on the coarse-grained operations: pessimistic requires operations be left-movers, optimistic requires right-movers. Finally, we discuss how the semantics applies to numerous TM implementation details discussed widely in the literature.

Proceedings Article•10.5555/1921479.1921489•

Large data visualization on distributed memory multi-GPU clusters

Thomas Fogal¹, Hank Childs², Siddharth Shankar³, Jens Krüger⁴, R. Daniel Bergeron¹, Philip J. Hatcher¹•Institutions (4)

University of New Hampshire¹, Lawrence Berkeley National Laboratory², University of Utah³, Analysis Group⁴

25 Jun 2010

TL;DR: In this paper, the authors study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.

Abstract: Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualized on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory, and more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.

Journal Article•10.1016/J.ADVENGSOFT.2008.12.004•

Parallel discrete element simulation of poly-dispersed granular material

Rimantas Kačianauskas¹, Algirdas Maknickas¹, Arnas Kačeniauskas¹, Darius Markauskas¹, Robertas Balevičius¹•Institutions (1)

Vilnius Gediminas Technical University¹

01 Jan 2010-Advances in Engineering Software

TL;DR: It has been shown that a drastic increase of computational expenses of simulation for the poly-dispersed material in terms of CPU time is associated with the increase of its heterogeneity.

Proceedings Article•10.1109/HIPC.2010.5713180•

A space-efficient parallel algorithm for computing betweenness centrality in distributed memory

Nicholas Edmonds¹, Torsten Hoefler², Andrew Lumsdaine¹•Institutions (2)

Indiana University¹, University of Illinois at Urbana–Champaign²

1 Dec 2010

TL;DR: This paper presents a new parallelizable algorithm with low spatial complexity that is based on the best known sequential algorithm and is especially well suited to distributed memory processing because it can be implemented using coarse-grained parallelism.

Abstract: Betweenness centrality is a measure based on shortest paths that attempts to quantify the relative importance of nodes in a network. As computation of betweenness centrality becomes increasingly important in areas such as social network analysis, networks of interest are becoming too large to fit in the memory of a single processing unit, making parallel execution a necessity. Parallelization over the vertex set of the standard algorithm, with a final reduction of the centrality for each vertex, is straightforward but requires Ω(|V|²) storage. In this paper we present a new parallelizable algorithm with low spatial complexity that is based on the best known sequential algorithm. Our algorithm requires O(|V| + |E|) storage and enables efficient parallel execution. Our algorithm is especially well suited to distributed memory processing because it can be implemented using coarse-grained parallelism. The presented time bounds for parallel execution of our algorithm on CRCW PRAM and on distributed memory systems both show good asymptotic performance. Experimental results with a distributed memory computer show the practical applicability of our algorithm.
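
For readers unfamiliar with the sequential baseline, the sketch below computes betweenness centrality on an unweighted graph with Brandes' algorithm. It assumes that this is the "best known sequential algorithm" the abstract refers to, which the abstract does not state. It is not the parallel algorithm of the paper; it only shows the per-source breadth-first search and dependency accumulation that a parallel version would distribute. The adjacency-list input format is also an assumption of this example.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted graph given as {vertex: [neighbours]}.

    Scores count ordered (source, target) pairs, so for an undirected graph
    every unordered pair contributes twice.
    """
    centrality = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        dist = {v: -1 for v in adj}
        sigma = {v: 0 for v in adj}
        preds = {v: [] for v in adj}
        dist[s], sigma[s] = 0, 1
        order = []                              # vertices in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                 # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies from the farthest vertices in.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    return centrality


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1]}          # undirected path 0 - 1 - 2
    print(betweenness_centrality(path))
```

The working state for a single source (distances, path counts, predecessor lists) is linear in the number of vertices and edges. The iterations of the outer loop over sources are independent of one another apart from the final summation into `centrality`.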

Proceedings Article•10.1145/1810479.1810498•

Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Michela Becchi¹, Surendra Byna¹, Srihari Cadambi¹, Srimat Chakradhar¹•Institutions (1)

NEC Laboratories America, Princeton¹

13 Jun 2010

TL;DR: Describes a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core (CPU) and a throughput-oriented many-core (GPU), and shows that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.

Abstract: In this paper, we describe a runtime to automatically enhance the performance of applications running on heterogeneous platforms consisting of a multi-core (CPU) and a throughput-oriented many-core (GPU). The CPU and GPU are connected by a non-coherent interconnect such as PCI-E, and as such do not have shared memory. Heterogeneous platforms available today such as [9] are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on CPU or GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems. We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32] and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice-versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel data is already located on the GPU memory due to prior decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
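
The observation in the abstract above, that a kernel can be worth running on the slower device when its data already lives there, can be illustrated with a toy cost model. The sketch below is a greedy per-kernel scheduler written for this listing and is not the runtime described in the paper. The `Kernel` fields, the timing numbers, and the link bandwidth are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    data: str        # name of the buffer the kernel reads and updates in place
    size_mb: float   # size of that buffer
    cpu_ms: float    # hypothetical execution time on the CPU
    gpu_ms: float    # hypothetical execution time on the GPU


def schedule(kernels, locations, link_mb_per_ms, data_aware=True):
    """Place each kernel on "cpu" or "gpu"; return (placements, total_ms).

    A transfer is paid only when the buffer is not already on the chosen
    device, which models deferring copies until they are needed.
    """
    locations = dict(locations)               # do not mutate the caller's dict
    placements, total_ms = [], 0.0
    for k in kernels:
        cost = {}
        for dev, exec_ms in (("cpu", k.cpu_ms), ("gpu", k.gpu_ms)):
            transfer = 0.0 if locations[k.data] == dev else k.size_mb / link_mb_per_ms
            cost[dev] = exec_ms + transfer
        if data_aware:
            dev = min(cost, key=cost.get)     # execution time plus transfer time
        else:
            dev = "cpu" if k.cpu_ms <= k.gpu_ms else "gpu"  # execution time only
        total_ms += cost[dev]
        locations[k.data] = dev               # the result stays where it was computed
        placements.append((k.name, dev))
    return placements, total_ms


if __name__ == "__main__":
    kernels = [Kernel("sort", "a", 100.0, cpu_ms=10.0, gpu_ms=15.0),
               Kernel("reduce", "a", 100.0, cpu_ms=20.0, gpu_ms=5.0)]
    start = {"a": "gpu"}
    print(schedule(kernels, start, link_mb_per_ms=4.0, data_aware=True))
    print(schedule(kernels, start, link_mb_per_ms=4.0, data_aware=False))
```

With these made-up numbers the data-aware policy keeps both kernels on the GPU, even though `sort` is nominally faster on the CPU, because moving the buffer costs more than the execution-time difference.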

Proceedings Article•10.5555/1870926.1870952•

An efficient distributed memory interface for many-core platform with 3D stacked DRAM

Igor Loi¹, Luca Benini¹•Institutions (1)

University of Bologna¹

8 Mar 2010

TL;DR: This paper presents an efficient and flexible distributed memory interface for 3D-stacked DRAM that ensures ultra-low-latency access to the memory modules on top of each processing element (vertically local memory neighborhoods) and takes full advantage of the lower latency of vertical interconnect.

Abstract: Historically, processor performance has increased at a much faster rate than that of main memory, and upcoming NoC-based many-core architectures are further tightening the memory bottleneck. 3D integration based on TSV technology may provide a solution, as it enables stacking of multiple memory layers, with orders-of-magnitude increase in memory interface bandwidth, speed and energy efficiency. To fully exploit this potential, the architectural interface to vertically stacked memory must be streamlined. In this paper we present an efficient and flexible distributed memory interface for 3D-stacked DRAM. Our interface ensures ultra-low-latency access to the memory modules on top of each processing element (vertically local memory neighborhoods). Communication to these local modules does not travel through the NoC and takes full advantage of the lower latency of vertical interconnect, thus significantly speeding up the common case. The interface still supports a convenient global address space abstraction with high-latency remote access, due to the slower horizontal interconnect. Experimental results demonstrate significant bandwidth improvement that ranges from 1.44x to 7.40x as compared to the JEDEC standard, with peaks of 4.53GB/s for direct memory access, and 850MB/s for remote access through the NoC.

Proceedings Article•10.1109/HPCC.2010.23•

Enhancing Muesli's Data Parallel Skeletons for Multi-core Computer Architectures

Philipp Ciechanowicz¹, Herbert Kuchen¹•Institutions (1)

University of Münster¹

1 Sep 2010

TL;DR: This work presents an extension of the skeleton library Muesli that allows the same application to be used without modification on a variety of parallel machines, ranging from multi-processor distributed memory to many-core shared memory machines and combinations of those, such as clusters of multi-core nodes.

Abstract: Algorithmic skeletons encapsulate typical parallel programming patterns such that they can be easily applied by users. Existing skeleton libraries usually work on distributed memory machines. We present an extension of our skeleton library Muesli which now allows the same application to be used without modifications on a variety of parallel machines, ranging from multi-processor distributed memory to many-core shared memory machines and combinations of those, such as clusters of multi-core nodes. Internally, the skeletons are based on MPI and OpenMP. We demonstrate the efficiency of our approach by providing experimental results.

Proceedings Article•10.1145/1810479.1810510•

Towards optimizing energy costs of algorithms for shared memory architectures

Vijay Anand Korthikanti¹, Gul Agha¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

13 Jun 2010

TL;DR: This paper develops a methodology to evaluate how the energy consumption of a given parallel algorithm changes as the number of cores and their frequency are varied, and establishes the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement.

Abstract: Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone: it must be modeled explicitly. This paper analyzes energy consumption of parallel algorithms executed on shared memory multicore processors. Specifically, we develop a methodology to evaluate how energy consumption of a given parallel algorithm changes as the number of cores and their frequency is varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step versus the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected for a wide range of these parameters.

Proceedings Article•10.5555/1870926.1871199•

Partitioning and allocation of scratch-pad memory for priority-based preemptive multi-task systems

Hideki Takase¹, Hiroyuki Tomiyama¹, Hiroaki Takada¹•Institutions (1)

Nagoya University¹

8 Mar 2010

TL;DR: The three proposed methods (spatial, temporal, and hybrid) make effective use of the scratch-pad memory space, reduce energy consumption in the instruction memory subsystem, and are applicable to a real-time environment.

Abstract: Scratch-pad memory has been employed as a partial or entire replacement for cache memory due to its better energy efficiency. In this paper, we propose scratch-pad memory management techniques for priority-based preemptive multi-task systems. Our techniques are applicable to a real-time environment. The three methods which we propose, i.e., spatial, temporal, and hybrid methods, bring about effective usage of the scratch-pad memory space, and achieve energy reduction in the instruction memory subsystems. We formulate each method as an integer programming problem that simultaneously determines (1) partitioning of scratch-pad memory space for the tasks, and (2) allocation of program code to scratch-pad memory space for each task. Notably, the periods and priorities of tasks are considered in the formulas. Additionally, we implement an RTOS-hardware cooperative support mechanism for runtime code allocation to the scratch-pad memory space. We conducted experiments with a fully functional real-time operating system. The experimental results with four task sets have demonstrated the effectiveness of our techniques. Up to 73% energy reduction compared to a standard method was achieved.

Proceedings Article•10.1109/ICPPW.2010.62•

Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems

Jinpil Lee¹, Mitsuhisa Sato¹•Institutions (1)

University of Tsukuba¹

13 Sep 2010

TL;DR: This paper introduces XcalableMP and its compiler implementation, and shows through a performance evaluation that users can parallelize code for distributed memory systems with small modifications to the original sequential code.

Abstract: Although MPI is a de facto standard for parallel programming on distributed memory systems, writing MPI programs is often a time-consuming and complicated process. XcalableMP is a language extension of C and Fortran for parallel programming on distributed memory systems that helps users reduce this programming effort. XcalableMP provides two programming models. The first is the global view model, which supports typical parallelization based on the data and task parallel paradigms, and enables parallelizing the original sequential code with minimal modification using simple, OpenMP-like directives. The other is the local view model, which allows using CAF-like expressions to describe inter-node communication. Users can also use MPI and OpenMP explicitly in our language to optimize performance. In this paper, we introduce XcalableMP, the implementation of the compiler, and the performance evaluation results. For the performance evaluation, we parallelized the HPCC benchmark in XcalableMP. The results show that users can describe the parallelization for distributed memory systems with a small modification to the original sequential code.

Patent•

Distributed symmetric multiprocessing computing architecture

Richard S. Anderson

15 Nov 2010

TL;DR: In this article, the authors present a scalable symmetric multiprocessing (shared memory) computer architecture using a network of homogeneous multi-core servers, which combines new operating system extensions with a high speed network that supports remote direct memory access to achieve an effective global distributed shared memory.

Abstract: Example embodiments of the present invention include systems and methods for implementing a scalable symmetric multiprocessing (shared memory) computer architecture using a network of homogeneous multi-core servers. The level of processor and memory performance achieved is suitable for running applications that currently require cache coherent shared memory mainframes and supercomputers. The architecture combines new operating system extensions with a high-speed network that supports remote direct memory access to achieve an effective global distributed shared memory. A distributed thread model allows a process running in a head node to fork threads in other (worker) nodes that run in the same global address space. Thread synchronization is supported by a distributed mutex implementation. A transactional memory model allows a multi-threaded program to maintain global memory page consistency across the distributed architecture. A distributed file access implementation supports non-contentious file I/O for threads. These and other functions provide a symmetric multiprocessing programming model consistent with standards such as Portable Operating System Interface for Unix (POSIX).

Proceedings Article•10.1109/MICRO.2010.39•

Memory Latency Reduction via Thread Throttling

Hsiang-Yun Cheng¹, Chung-Hsiang Lin¹, Jian Li², Chia-Lin Yang¹•Institutions (2)

National Taiwan University¹, IBM²

4 Dec 2010

TL;DR: A memory thread throttling mechanism that tunes the allowable memory threads dynamically under workload variation to improve system performance and leads to a geometric mean of 12% performance improvement for real-world applications on the same hardware.

Abstract: The Memory Wall is a well-known obstacle to processor performance improvement. The popularity of multi-core architectures will further exacerbate the problem, since the memory resource is shared by all cores. Interference among requests from different cores may prolong the latency of memory accesses, thereby degrading system performance. To tackle the problem, this paper proposes to decouple application threads into compute and memory tasks, and to restrict the number of concurrent memory tasks to avoid interference among memory requests. Yet with this scheduling restriction, a CPU core may unnecessarily stay idle, which adversely affects overall performance. Therefore, we develop a memory thread throttling mechanism that tunes the allowable number of memory threads dynamically under workload variation to improve system performance. The proposed run-time mechanism monitors the memory and computation ratios of a program for phase detection. It then decides the memory thread constraint for the next program phase based on an analytical model that can estimate system performance under different constraint values. To prove the concept, we prototype the mechanism in some real-world applications as well as synthetic workloads. We evaluate their performance on real machines. The experimental results demonstrate up to 20% speedup with a pool of synthetic workloads on an Intel i7 (Nehalem) machine and match the speedup estimated by the proposed analytical model. Furthermore, the intelligent run-time scheduling leads to a geometric mean of 12% performance improvement for real-world applications on the same hardware.
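
The core restriction described above, splitting each thread's work into a memory task and a compute task and capping how many memory tasks run at once, can be sketched with a counting semaphore. This is only an illustration under assumptions: the limit is fixed rather than tuned by the paper's analytical model, and `MemoryThrottle`, `worker`, and `run` are hypothetical names.

```python
import threading


class MemoryThrottle:
    """Caps how many threads may be inside a memory task at the same time."""

    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self.active = 0  # memory tasks currently running
        self.peak = 0    # highest concurrency observed

    def __enter__(self):
        self._sem.acquire()  # blocks once `limit` memory tasks are active
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1
        self._sem.release()
        return False


def worker(throttle, data, out, idx):
    with throttle:            # memory task: gather operands (throttled)
        chunk = list(data)
    # compute task: runs outside the throttle, so it is never restricted
    out[idx] = sum(x * x for x in chunk)


def run(num_threads=8, limit=2):
    throttle = MemoryThrottle(limit)
    data = range(1000)
    out = [None] * num_threads
    threads = [
        threading.Thread(target=worker, args=(throttle, data, out, i))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return throttle.peak, out


peak, results = run(num_threads=8, limit=2)
print("peak concurrent memory tasks:", peak)
```

A dynamic version would replace the fixed `limit` with a value recomputed at each detected program phase, which is the part the paper's run-time mechanism supplies.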

Proceedings Article•10.1109/IPDPSW.2010.5470813•

Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

Markus Wittmann¹, Georg Hager¹, Gerhard Wellein¹•Institutions (1)

University of Erlangen-Nuremberg¹

19 Apr 2010

TL;DR: A new pipelined approach is introduced that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead and demonstrates how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.

Abstract: New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
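
As a small illustration of temporal blocking in general, the sketch below applies several time steps to one spatial tile before moving to the next, for a 1D three-point Jacobi stencil with fixed boundary values. It uses overlapped tiling with redundant halo computation, which is a simpler technique than the pipelined, shared-cache approach the paper introduces, and it demonstrates only the dependency structure, not any cache benefit. All function names and parameters are hypothetical.

```python
def jacobi_naive(u, steps):
    """Reference: one full sweep over the array per time step."""
    u = list(u)
    for _ in range(steps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = new
    return u


def jacobi_temporal_blocked(u, steps, tile, tblock):
    """Advance each tile by up to `tblock` time steps before moving on."""
    u = list(u)
    n = len(u)
    done = 0
    while done < steps:
        t = min(tblock, steps - done)
        out = u[:]
        for lo in range(0, n, tile):
            hi = min(lo + tile, n)
            # Load the tile plus a halo of t cells per side. Stale halo
            # edges corrupt one more cell per step, so after t steps the
            # damage stops just short of the tile [lo, hi).
            a = max(lo - t, 0)
            b = min(hi + t, n)
            local = u[a:b]
            for _ in range(t):
                new = local[:]
                for j in range(1, len(local) - 1):
                    new[j] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
                local = new
            out[lo:hi] = local[lo - a:hi - a]
        u = out
        done += t
    return u


u0 = [float((i * 7) % 11) for i in range(50)]
ref = jacobi_naive(u0, steps=10)
blk = jacobi_temporal_blocked(u0, steps=10, tile=8, tblock=4)
print(max(abs(x - y) for x, y in zip(ref, blk)))
```

The trade-off is visible in the halo: each tile recomputes up to `2 * tblock` extra cells per block of time steps in exchange for touching main memory once per `tblock` steps instead of once per step.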

Patent•

System and Method for Executing a Transaction Using Parallel Co-Transactions

Mark S. Moir¹, Robert E. Cypher¹, Daniel S. Nussbaum¹•Institutions (1)

Business International Corporation¹

31 Mar 2010

TL;DR: The transactional memory system described in this paper implements parallel co-transactions that access a shared memory such that at most one of the transactions in a set will succeed and all others will fail.

Abstract: The transactional memory system described herein may implement parallel co-transactions that access a shared memory such that at most one of the co-transactions in a set will succeed and all others will fail (e.g., be aborted). Co-transactions may improve the performance of programs that use transactional memory by attempting to perform the same high-level operation using multiple algorithmic approaches, transactional memory implementations and/or speculation options in parallel, and allowing only the first to complete to commit its results. If none of the co-transactions succeeds, one or more may be retried, possibly using a different approach and/or transactional memory implementation. The at-most-one property may be managed through the use of a shared “done” flag. Conflicts between co-transactions in a set and accesses made by transactions or activities outside the set may be managed using lazy write ownership acquisition and/or a priority-based approach. Each co-transaction may execute on a different processor resource.
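
The at-most-one property and the shared "done" flag can be illustrated with a short sketch. It assumes a lock as a stand-in for an atomic test-and-set on the flag and a dictionary as a stand-in for shared memory; it does not model lazy write ownership acquisition, priorities, or retries. The class and function names are hypothetical.

```python
import copy
import threading


class CoTransactionSet:
    """Tracks one set of co-transactions; at most one may commit."""

    def __init__(self, shared):
        self.shared = shared           # dict standing in for shared memory
        self.winner = None
        self._done = False             # the shared "done" flag
        self._lock = threading.Lock()  # stand-in for an atomic test-and-set

    def try_commit(self, name, writes):
        with self._lock:
            if self._done:
                return False           # lost the race: abort, discard writes
            self._done = True
            self.shared.update(writes)
            self.winner = name
            return True


def sort_builtin(snapshot):
    return {"items": sorted(snapshot["items"])}


def sort_insertion(snapshot):
    out = []
    for x in snapshot["items"]:
        pos = len(out)
        while pos > 0 and out[pos - 1] > x:
            pos -= 1
        out.insert(pos, x)
    return {"items": out}


def run_co_transactions(shared, approaches):
    """Run every approach in parallel; only the first to finish commits."""
    cset = CoTransactionSet(shared)
    snapshot = copy.deepcopy(shared)   # speculative, read-only input
    committed = {}

    def attempt(name, fn):
        committed[name] = cset.try_commit(name, fn(snapshot))

    threads = [
        threading.Thread(target=attempt, args=(name, fn))
        for name, fn in approaches.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cset.winner, committed


shared = {"items": [3, 1, 2]}
winner, committed = run_co_transactions(
    shared, {"builtin": sort_builtin, "insertion": sort_insertion}
)
print(winner, committed, shared)
```

Both approaches perform the same high-level operation, so whichever commits first leaves shared memory in the same final state; which one wins depends on thread scheduling.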

Proceedings Article•

User-defined distributions and layouts in Chapel: philosophy and framework

Bradford L. Chamberlain¹, Steven J. Deitz¹, David Iten¹, Sung-Eun Choi¹•Institutions (1)

Cray¹

14 Jun 2010

TL;DR: This paper introduces user-defined domain maps, a novel concept for implementing distributions and memory layouts for parallel data aggregates in Chapel. The domain maps are designed for generality and do not rely on hard-coding a fixed set of distributions into the compiler and runtime.

Abstract: This paper introduces user-defined domain maps, a novel concept for implementing distributions and memory layouts for parallel data aggregates. Our domain maps implement parallel arrays for the Chapel programming language and are themselves implemented using standard Chapel features. Domain maps export a functional interface that our compiler targets as it maps from a user's global-view algorithm down to the task-level operations required to implement the computation for multicore processors, GPUs, and distributed memory architectures. Unlike distributions in HPF and ZPL, Chapel's domain maps are designed for generality and do not rely on hard-coding a fixed set of distributions into the compiler and runtime. The chief contributions of this paper are its description of our motivating principles and an overview of our framework.

Journal Article•10.1016/J.JPDC.2010.02.003•

Embedded-TM: Energy and complexity-effective hardware transactional memory for embedded multicore systems

Cesare Ferri¹, Samantha Wood², Tali Moreshet³, R. Iris Bahar¹, Maurice Herlihy¹•Institutions (3)

Brown University¹, Bryn Mawr College², Swarthmore College³

01 Oct 2010-Journal of Parallel and Distributed Computing

TL;DR: It is found that ignoring energy considerations can lead to poor design choices, particularly for resource-constrained embedded platforms, and with the right balance of energy efficiency and simplicity, HTM will become an attractive choice for future embedded system designs.

Proceedings Article•10.1145/2020373.2020376•

Hybrid PGAS runtime support for multicore nodes

Filip Blagojevic¹, Paul Hargrove¹, Costin Iancu¹, Katherine Yelick¹•Institutions (1)

Lawrence Berkeley National Laboratory¹

12 Oct 2010

TL;DR: This paper presents a new process-based shared memory runtime for PGAS languages, and shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous pthreads implementation.

Abstract: With multicore processors as the standard building block for high performance systems, parallel runtime systems need to provide excellent performance on shared memory, distributed memory, and hybrids. Conventional wisdom suggests that threads should be used as the runtime mechanism within shared memory, and two runtime versions for shared and distributed memory are often designed and implemented separately, retrofitting after the fact for hybrid systems. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and they can co-exist with one another. We evaluate the shared memory runtime approaches, showing that they interact in important and sometimes surprising ways with the communication layer. Using a set of microbenchmarks and application level benchmarks on an IBM BG/P, Cray XT, and InfiniBand cluster, we show that threads, processes and combinations of both are needed for maximum performance. Our new runtime shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous implementation. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.

Patent•

Memory system including a controller and processors associated with memory devices

Michael Shachar¹, Barak Rotbard¹, Oren Golov¹, Uri Perlmutter¹, Dotan Sokolov¹, Julian Vlaiko¹, Yair Schwartz¹•Institutions (1)

Apple Inc.¹

17 Jan 2010

TL;DR: A data storage system includes a plurality of non-volatile memory devices arranged in one or more sets, a main controller, and one or more processors. The main controller accepts commands from a host and converts them into recipes, and each processor, associated with a respective set of the memory devices, executes the memory operations specified in the recipes it receives.

Abstract: A data storage system includes a plurality of non-volatile memory devices arranged in one or more sets, a main controller and one or more processors. The main controller is configured to accept commands from a host and to convert the commands into recipes. Each recipe includes a list of multiple memory operations to be performed sequentially in the non-volatile memory devices belonging to one of the sets. Each of the processors is associated with a respective set of the non-volatile memory devices, and is configured to receive one or more of the recipes from the main controller and to execute the memory operations specified in the received recipes in the non-volatile memory devices belonging to the respective set.

Journal Article•10.1137/090757010•

Fast Access to Distributed Atomic Memory

Partha Dutta¹, Rachid Guerraoui², Ron R. Levy, Marko Vukolic³•Institutions (3)

IBM¹, École Polytechnique Fédérale de Lausanne², Institut Eurécom³

01 Aug 2010-SIAM Journal on Computing

TL;DR: The exact conditions under which every read and write involves one round of communication with the servers are determined, which relate the number of readers to the tolerated number of faulty servers and the nature of these failures.

Abstract: We study efficient and robust implementations of an atomic read-write data structure over an asynchronous distributed message-passing system made of reader and writer processes, as well as a number of servers implementing the data structure. We determine the exact conditions under which every read and write involves one round of communication with the servers. These conditions relate the number of readers to the tolerated number of faulty servers and the nature of these failures.

Journal Article•10.1137/090756934•

A Novel Parallel QR Algorithm for Hybrid Distributed Memory HPC Systems

Robert Granat, Bo Kågström, Daniel Kressner¹•Institutions (1)

ETH Zurich¹

01 Jun 2010-SIAM Journal on Scientific Computing

TL;DR: In this paper, a novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing systems is presented, which introduces the concept of multiwindow bulge chain chasing and parallelizes aggressive early deflation.

Abstract: A novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing systems is presented. For this purpose, we introduce the concept of multiwindow bulge chain chasing and parallelize aggressive early deflation. The multiwindow approach ensures that most computations when chasing chains of bulges are performed in level 3 BLAS operations, while the aim of aggressive early deflation is to speed up the convergence of the QR algorithm. Mixed MPI-OpenMP coding techniques are utilized for porting the codes to distributed memory platforms with multithreaded nodes, such as multicore processors. Numerous numerical experiments confirm the superior performance of our parallel QR algorithm in comparison with the existing ScaLAPACK code, leading to an implementation that is one to two orders of magnitude faster for sufficiently large problems, including a number of examples from applications.
