Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2014
Journal Article•10.1038/NN.3655•
Changing concepts of working memory.

Wei Ji Ma1, Masud Husain2, Paul M. Bays3•
Center for Neural Science1, University of Oxford2, University of California, Berkeley3
01 Mar 2014-Nature Neuroscience
TL;DR: This review considers behavioral and emerging neural evidence for the proposal that working memory is a limited resource distributed flexibly among all items to be maintained in memory.
Abstract: Working memory is widely considered to be limited in capacity, holding a fixed, small number of items, such as Miller's 'magical number' seven or Cowan's four. It has recently been proposed that working memory might better be conceptualized as a limited resource that is distributed flexibly among all items to be maintained in memory. According to this view, the quality rather than the quantity of working memory representations determines performance. Here we consider behavioral and emerging neural evidence for this proposal.

1,157 citations

Journal Article•10.1117/1.JBO.19.4.040801•
The Toast++ software suite for forward and inverse modeling in optical tomography.

Martin Schweiger1, Simon R. Arridge1•
University College London1
01 Apr 2014-Journal of Biomedical Optics
TL;DR: The Toast++ open-source software environment for solving the forward and inverse problems in diffuse optical tomography (DOT) contains model-based iterative inverse solvers for reconstructing the volume distribution of absorption and scattering parameters from boundary measurements of light transmission.
Abstract: We present the Toast++ open-source software environment for solving the forward and inverse problems in diffuse optical tomography (DOT). The software suite consists of a set of libraries to simulate near-infrared light propagation in highly scattering media with complex boundaries and heterogeneous internal parameter distribution, based on a finite-element solver. Steady-state, time- and frequency-domain data acquisition systems can be modeled. The forward solver is implemented in C++ and supports performance acceleration with parallelization for shared and distributed memory architectures, as well as graphics processing computation. Building on the numerical forward solver, Toast++ contains model-based iterative inverse solvers for reconstructing the volume distribution of absorption and scattering parameters from boundary measurements of light transmission. A range of regularization methods are provided, including the possibility of incorporating prior knowledge of internal structure. The user can link to the Toast++ libraries either directly to compile application programs for DOT, or make use of the included MATLAB and PYTHON bindings to generate script-based solutions. This approach allows rapid prototyping and provides a rich toolset in both environments for debugging, testing, and visualization.

279 citations

Proceedings Article•10.1109/DSN.2014.50•
Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Yixin Luo1, Sriram Govindan2, Bikash Sharma2, Mark Santaniello2, Justin Meza1, Aman Kansal2, Jie Liu2, Badriddine Khessib2, Kushagra Vaid2, Onur Mutlu1 •
Carnegie Mellon University1, Microsoft2
23 Jun 2014
TL;DR: The key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost.
Abstract: Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approaches to providing reliability for memory devices pessimistically treat all data as equally vulnerable to memory errors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. For example, we found that while traditional error protection increases memory system cost by 12.5%, some applications can achieve 99.00% availability on a single server with a large number of memory errors without any error protection. This presents an opportunity to greatly reduce server hardware cost by provisioning the right amount of memory reliability for different applications. Toward this end, in this paper, we make three main contributions to enable highly-reliable servers at low datacenter cost. First, we develop a new methodology to quantify the tolerance of applications to memory errors. Second, using our methodology, we perform a case study of three new data-intensive workloads (an interactive web search application, an in-memory key-value store, and a graph mining framework) to identify new insights into the nature of application memory error vulnerability. Third, based on our insights, we propose several new hardware/software heterogeneous-reliability memory system designs to lower datacenter cost while achieving high reliability and discuss their trade-offs. We show that our new techniques can reduce server hardware cost by 4.7% while achieving 99.90% single server availability.

185 citations

Journal Article•10.14778/2732967.2732973•
NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion

Hyokun Yun1, Hsiang-Fu Yu2, Cho-Jui Hsieh2, S. V. N. Vishwanathan1, Inderjit S. Dhillon2 •
Purdue University1, University of Texas at Austin2
1 Jul 2014
TL;DR: Presents NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion), an efficient parallel distributed algorithm for matrix completion that outperforms synchronous algorithms requiring explicit bulk synchronization after every iteration.
Abstract: We develop an efficient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion). NOMAD is a decentralized algorithm with non-blocking communication between processors. One of the key features of NOMAD is that the ownership of a variable is asynchronously transferred between processors in a decentralized fashion. As a consequence it is a lock-free parallel algorithm. In spite of being asynchronous, the variable updates of NOMAD are serializable, that is, there is an equivalent update ordering in a serial implementation. NOMAD outperforms synchronous algorithms which require explicit bulk synchronization after every iteration: our extensive empirical evaluation shows that not only does our algorithm perform well in a distributed setting on commodity hardware, but also outperforms state-of-the-art algorithms on an HPC cluster both in multi-core and distributed memory settings.

183 citations

Proceedings Article•10.1109/ICDE.2014.6816682•
How to partition a billion-node graph

Lu Wang1, Yanghua Xiao2, Bin Shao3, Haixun Wang4•
Hong Kong University of Science and Technology1, Fudan University2, Microsoft3, Google4
1 Mar 2014
TL;DR: Experimental results show that the proposed multi-level label propagation (MLP) method can partition billion-node graphs within several hours on a distributed memory system consisting of merely several machines, and the quality of the partitions produced is comparable to state-of-the-art approaches applied on toy-size graphs.
Abstract: Billion-node graphs pose significant challenges at all levels from storage infrastructures to programming models. It is critical to develop a general purpose platform for graph processing. A distributed memory system is considered a feasible platform supporting online query processing as well as offline graph analytics. In this paper, we study the problem of partitioning a billion-node graph on such a platform, an important consideration because it has direct impact on load balancing and communication overhead. It is challenging not just because the graph is large, but because we can no longer assume that the data can be organized in arbitrary ways to maximize the performance of the partitioning algorithm. Instead, the algorithm must adopt the same data and programming model adopted by the system and other applications. In this paper, we propose a multi-level label propagation (MLP) method for graph partitioning. Experimental results show that our solution can partition billion-node graphs within several hours on a distributed memory system consisting of merely several machines, and the quality of the partitions produced by our approach is comparable to state-of-the-art approaches applied on toy-size graphs.

165 citations
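
The label-propagation idea named in the entry above can be illustrated with a minimal single-machine sketch. This is not the authors' multi-level MLP method: the function name `label_propagation_partition`, the size cap `max_size`, and the fixed visiting order are all assumptions made for illustration. Each vertex repeatedly adopts the most common label among its neighbours, provided the target part still has room.

```python
from collections import Counter

def label_propagation_partition(adj, max_size, rounds=10):
    """Size-constrained label propagation (illustrative sketch only).

    adj: dict mapping each vertex to a list of neighbours (undirected graph).
    max_size: hypothetical cap on the number of vertices per part.
    Returns a dict mapping vertex -> part label.
    """
    labels = {v: v for v in adj}           # every vertex starts in its own part
    sizes = Counter(labels.values())
    for _round in range(rounds):
        changed = False
        for v in sorted(adj):              # fixed order keeps the run deterministic
            counts = Counter(labels[u] for u in adj[v])
            # Try neighbour labels from most to least frequent; ties go to the smaller label.
            ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
            for label, _count in ranked:
                if label == labels[v]:
                    break                  # current part is already the best admissible one
                if sizes[label] < max_size:
                    sizes[labels[v]] -= 1
                    sizes[label] += 1
                    labels[v] = label
                    changed = True
                    break
        if not changed:
            break                          # converged: no vertex moved in this round
    return labels
```

On a toy graph of two triangles joined by a single edge, with `max_size=3`, the sketch places each triangle in its own part. The size cap is what prevents one label from absorbing the whole graph.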

Proceedings Article•10.1109/ICCD.2014.6974655•
The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost

Lavanya Subramanian1, Donghyuk Lee1, Vivek Seshadri1, Harsha Rastogi1, Onur Mutlu1 •
Carnegie Mellon University1
4 Dec 2014
TL;DR: Proposes the Blacklisting Memory Scheduler (BLISS), which achieves high system performance and fairness at low hardware cost and complexity, and evaluates it across a wide variety of workloads and system configurations.
Abstract: In a multicore system, applications running on different cores interfere at main memory. This inter-application interference degrades overall system performance and unfairly slows down applications. Prior works have developed application-aware memory request schedulers to tackle this problem. State-of-the-art application-aware memory request schedulers prioritize memory requests of applications that are vulnerable to interference, by ranking individual applications based on their memory access characteristics and enforcing a total rank order.

148 citations

Proceedings Article•10.1109/HOTCHIPS.2014.7478812•
HBM: Memory solution for bandwidth-hungry processors

Joonyoung Kim, Younsu Kim
1 Aug 2014
TL;DR: This article consists of a collection of slides from the author's conference presentation on the special features, system design and architectures, memory management and support for processors, and targeted markets for SK Hynix Inc.'s HBM, next generation memory family of products.

113 citations

Proceedings Article•10.1109/SC.2014.50•
Faster parallel traversal of scale free graphs at extreme scale with vertex delegates

Roger Pearce1, Maya Gokhale1, Nancy M. Amato2•
Lawrence Livermore National Laboratory1, Texas A&M University2
16 Nov 2014
TL;DR: This work presents techniques to distribute storage, computation, and communication of hubs for extreme scale graphs in distributed memory supercomputers, and demonstrates scalability of the new algorithmic technique using Breadth-First Search (BFS), Single Source Shortest Path (SSSP), K-Core Decomposition, and Page-Rank on synthetically generated scale-free graphs.
Abstract: At extreme scale, irregularities in the structure of scale-free graphs such as social network graphs limit our ability to analyze these important and growing datasets. A key challenge is the presence of high-degree vertices (hubs), which leads to parallel workload and storage imbalances. The imbalances occur because existing partitioning techniques are not able to effectively partition high-degree vertices. We present techniques to distribute storage, computation, and communication of hubs for extreme scale graphs in distributed memory supercomputers. To balance the hub processing workload, we distribute hub data structures and related computation among a set of delegates. The delegates coordinate using highly optimized, yet portable, asynchronous broadcast and reduction operations. We demonstrate scalability of our new algorithmic technique using Breadth-First Search (BFS), Single Source Shortest Path (SSSP), K-Core Decomposition, and Page-Rank on synthetically generated scale-free graphs. Our results show excellent scalability on large scale-free graphs up to 131K cores of the IBM BG/P, and outperform the best known Graph500 performance on BG/P Intrepid by 15%.

83 citations

Journal Article•10.1145/2616912•
Algorithm 945: modred—A Parallelized Model Reduction Library

Brandt A. Belson1, Jonathan H. Tu1, Clarence W. Rowley1•
Princeton University1
08 Jul 2014-ACM Transactions on Mathematical Software
TL;DR: A new parallelized Python library for model reduction, modal analysis, and system identification of large systems and datasets, called modred, which is designed to be easy to use, with an object-oriented design, and includes comprehensive automated tests.
Abstract: We describe a new parallelized Python library for model reduction, modal analysis, and system identification of large systems and datasets. Our library, called modred, handles a wide range of problems and any data format. The modred library contains implementations of the Proper Orthogonal Decomposition (POD), balanced POD (BPOD), Petrov-Galerkin projection, and a more efficient variant of the Dynamic Mode Decomposition (DMD). The library contains two implementations of these algorithms, each with its own advantages. One is for smaller and simpler datasets, requires minimal knowledge to use, and follows a common matrix-based formulation. The second, for larger and more complicated datasets, preserves the abstraction of vectors as elements of a vector space and, as a result, allows the library to work with arbitrary data formats and eases distributed memory parallelization. We also include implementations of the Eigensystem Realization Algorithm (ERA), and Observer/Kalman Filter Identification (OKID). These methods are typically not computationally demanding and are not parallelized. The library is designed to be easy to use, with an object-oriented design, and includes comprehensive automated tests. In almost all cases, parallelization is done internally so that scripts that use the parallelized classes can be run in serial or in parallel without any modifications.

79 citations

Journal Article•10.1109/TPWRD.2013.2297119•
Parallel Massive-Thread Electromagnetic Transient Simulation on GPU

Zhiyin Zhou1, Venkata Dinavahi1•
University of Alberta1
17 Feb 2014-IEEE Transactions on Power Delivery
TL;DR: A massive-thread EMT program (MT-EMTP) is proposed, along with an efficient node-mapping structure that transforms the original power system admittance matrix into a block-node diagonal sparse format to exploit the massive-thread parallel GPU architecture.
Abstract: The electromagnetic transient (EMT) simulation of a large-scale power system consumes so much computational power that parallel programming techniques are urgently needed in this area. For example, realistic-sized power systems include thousands of buses, generators, and transmission lines. Massive-thread computing is one of the key developments that can increase the EMT computational capabilities substantially when the processing unit has enough hardware cores. Compared to the traditional CPU, the graphic-processing unit (GPU) has many more cores with distributed memory which can offer higher data throughput. This paper proposes a massive-thread EMT program (MT-EMTP) and develops massive-thread parallel modules for linear passive elements, the universal line model, and the universal machine model for offline EMT simulation. An efficient node-mapping structure is proposed to transform the original power system admittance matrix into a block-node diagonal sparse format to exploit the massive-thread parallel GPU architecture. The developed MT-EMTP program has been tested on large-scale power systems of up to 2458 three-phase buses with detailed component modeling. The simulation results and execution times are compared with mainstream commercial software, EMTP-RV, to show the improvement in performance with equivalent accuracy.

74 citations

Proceedings Article•10.5555/2691166.2691173•
The OPS domain specific abstraction for multi-block structured grid computations

Istvan Z. Reguly1, Gihan R. Mudalige1, Michael B. Giles1, Daniel Curran2, Simon McIntosh-Smith2 •
University of Oxford1, University of Bristol2
16 Nov 2014
TL;DR: This paper introduces the OPS high-level abstraction and active library aimed at multi-block structured grid computations, and discusses some of its key design points; it demonstrates how OPS can be embedded in C/C++ and the API made to look like a traditional library, and introduces a number of automated execution techniques that enable distributed memory parallelization, optimization of communication patterns, checkpointing and cache-blocking.
Abstract: Code maintainability, performance portability and future proofing are some of the key challenges in this era of rapid change in High Performance Computing. Domain Specific Languages and Active Libraries address these challenges by focusing on a single application domain and providing a high-level programming approach, and then subsequently using domain knowledge to deliver high performance on various hardware. In this paper, we introduce the OPS high-level abstraction and active library aimed at multi-block structured grid computations, and discuss some of its key design points; we demonstrate how OPS can be embedded in C/C++ and the API made to look like a traditional library, and how through a combination of simple text manipulation and back-end logic we can enable execution on a diverse range of hardware using different parallel programming approaches. Relying on the access-execute description of the OPS abstraction, we introduce a number of automated execution techniques that enable distributed memory parallelization, optimization of communication patterns, checkpointing and cache-blocking. Using performance results from CloverLeaf from the Mantevo suite of benchmarks, we demonstrate the utility of OPS.
Proceedings Article•10.1145/2628071.2628085•
kMAF: automatic kernel-level management of thread and data affinity

Matthias Diener1, Eduardo H. M. Cruz1, Philippe O. A. Navaux1, Anselm Busse2, Hans-Ulrich Heiß2 •
Universidade Federal do Rio Grande do Sul1, Technical University of Berlin2
24 Aug 2014
TL;DR: kMAF automatically manages thread and data affinity on the kernel level by analyzing the page faults of a running application during its execution; it uses this information to migrate threads and memory pages so that overall memory access locality is optimized.
Abstract: One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories, and using local ones instead. Two techniques can be used to increase the memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the improvements. Other mechanisms require expensive operations, such as memory access traces or binary analysis, require changes to hardware or work only on specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity on the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages, such that the overall memory access locality is optimized. Extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).
Proceedings Article•10.1145/2618128.2618129•
Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer

Daniel Molka1, Daniel Hackenberg1, Robert Schöne1•
Dresden University of Technology1
13 Jun 2014
TL;DR: This work tackles the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details and builds upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem.
Abstract: Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details. For this, we build upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem. These benchmarks are extended to support AVX instructions for bandwidth measurements and to integrate the coherence states (O)wned and (F)orward. We then use these benchmarks to perform an in-depth analysis of current ccNUMA multiprocessor systems with Intel (Sandy Bridge-EP) and AMD (Bulldozer) processors. Using our benchmarks we present fundamental memory performance data and illustrate performance-relevant architectural properties of both designs.
Proceedings Article•10.1145/2593069.2593090•
An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

Jason Cong1, Peng Li1, Bingjun Xiao1, Peng Zhang1•
University of California, Los Angeles1
1 Jun 2014
TL;DR: This paper uses the stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of nonuniform memory partitioning, and develops a generalized microarchitecture to decouple stencil accesses from computation and an automated design flow to integrate it with the HLS-generated computation kernel for a complete accelerator.
Abstract: High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level (RTL) specifications. The high-throughput computation raises a high data demand. To prevent data accesses from being the bottleneck, on-chip memories are used as data reuse buffers to reduce off-chip accesses. Also memory partitioning is explored to increase the memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of non-uniform memory partitioning. We use the stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of non-uniform memory partitioning. Our novel method can always achieve the minimum memory size and the minimum number of memory banks, which cannot be guaranteed in any prior work. We develop a generalized microarchitecture to decouple stencil accesses from computation, and an automated design flow to integrate our microarchitecture with the HLS-generated computation kernel for a complete accelerator.
Journal Article•10.1016/J.COMPSTRUCT.2013.07.049•
A hybrid shared/distributed memory parallel genetic algorithm for optimization of laminate composites

I.B.C.M. Rocha1, Evandro Parente1, Antônio Macário Cartaxo de Melo1•
Federal University of Ceará1
01 Jan 2014-Composite Structures
TL;DR: A genetic algorithm combining two types of computational parallelization methods is presented, resulting in a hybrid shared/distributed memory algorithm based on the island model using both OpenMP and MPI libraries.
Proceedings Article•10.1109/GEOINFORMATICS.2014.6950819•
A geohash-based index for spatial data management in distributed memory

Jiajun Liu1, Haoran Li1, Yong Gao1, Hao Yu1, Dan Jiang1 •
Peking University1
25 Jun 2014
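TL;DR: Spatial features are mapped into distributed memory through distributed hash functions and indexed with a Geohash-based distributed spatial index, which substantially improves the read and write performance of spatial data and provides a technical foundation for high-performance geographical computation.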
Abstract: In Geographical Information System (GIS) applications, data access performance is critical to overall application performance, yet spatial data access has become a performance bottleneck in GIS research and development. In this study, features are mapped into a distributed memory through distributed hash functions. Then, the Geohash method is adopted to build a distributed spatial index for the distributed memory. Finally, a comparison experiment is carried out between our method and a traditional spatial database. The results show that, for complex data operations, our method is dozens of times faster. In sum, these techniques greatly improve the read and write performance of spatial data, which provides a solid technical foundation for high-performance geographical computation.
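
As a rough illustration of the Geohash step described in the entry above, the sketch below encodes a coordinate with the standard Geohash bit-interleaving scheme and then picks a memory node from a Geohash prefix. The encoding follows the public Geohash definition; the placement rule `node_for`, its `prefix_len` parameter, and the use of CRC32 are assumptions for illustration and are not taken from the paper.

```python
import zlib

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # standard Geohash alphabet

def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a Geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True                            # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:                    # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def node_for(geohash, num_nodes, prefix_len=3):
    """Hypothetical placement rule: hash a Geohash prefix to choose a memory node."""
    return zlib.crc32(geohash[:prefix_len].encode("ascii")) % num_nodes
```

Because a longer Geohash always extends the shorter one for the same point, features that share a prefix lie in the same cell. Hashing the prefix rather than the full string therefore keeps nearby features on the same node in this sketch, at the cost of possible load imbalance for dense regions.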
Proceedings Article•10.4230/OASICS.WCET.2014.53•
A Time-Predictable Memory Network-on-Chip

Martin Schoeberl, David Vh Chong1, Wolfgang Puffitsch, Jens Sparsø•
Technical University of Denmark1
1 Jan 2014
TL;DR: The TDM based arbitration completely decouples processor cores and allows WCET analysis of the memory accesses on individual cores without considering the tasks on the other cores, and avoids a central arbiter and scales to a large number of processors.
Abstract: To derive safe bounds on worst-case execution times (WCETs), all components of a computer system need to be time-predictable: the processor pipeline, the caches, the memory controller, and memory arbitration on a multicore processor. This paper presents a solution for time-predictable memory arbitration and access for chip-multiprocessors. The memory network-on-chip is organized as a tree with time-division multiplexing (TDM) of accesses to the shared memory. The TDM based arbitration completely decouples processor cores and allows WCET analysis of the memory accesses on individual cores without considering the tasks on the other cores. Furthermore, we perform local, distributed arbitration according to the global TDM schedule. This solution avoids a central arbiter and scales to a large number of processors.
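
The reason TDM arbitration makes per-core analysis possible can be shown with a small worked sketch. It assumes a flat round-robin schedule in which core `i` owns one slot of `slot_len` cycles per period and an access may only start at the beginning of that slot; the tree-structured network of the paper is not modeled, and both function names are hypothetical.

```python
def tdm_wait(core, cycle, num_cores, slot_len):
    """Cycles a request issued by `core` at `cycle` waits until its next slot starts."""
    period = num_cores * slot_len
    slot_start = core * slot_len
    return (slot_start - cycle) % period

def worst_case_latency(num_cores, slot_len):
    """Upper bound on wait plus transfer time for any core under this schedule.

    The bound depends only on the schedule, never on what other cores request.
    """
    return (num_cores * slot_len - 1) + slot_len
```

The worst case occurs when a request arrives one cycle after the core's own slot began, so it waits almost a full period. No term in either function refers to the traffic of other cores, which is the decoupling property the abstract describes.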
Proceedings Article•10.1109/SC.2014.51•
Pardicle: parallel approximate density-based clustering

Md. Mostofa Ali Patwary1, Nadathur Satish1, Narayanan Sundaram1, Fredrik Manne2, Salman Habib3, Pradeep Dubey1 •
Intel1, University of Bergen2, Argonne National Laboratory3
16 Nov 2014
TL;DR: A fast heuristic algorithm for DBSCAN using density based sampling is proposed, which performs equally well in quality compared to exact algorithms, but is more than an order of magnitude faster.
Abstract: DBSCAN is a widely used isodensity-based clustering algorithm for particle data well-known for its ability to isolate arbitrarily-shaped clusters and to filter noise data. The algorithm is super-linear (O(n log n)) and computationally expensive for large datasets. Given the need for speed, we propose a fast heuristic algorithm for DBSCAN using density based sampling, which performs equally well in quality compared to exact algorithms, but is more than an order of magnitude faster. Our experiments on astrophysics and synthetic massive datasets (8.5 billion numbers) show that our approximate algorithm is up to 56x faster than exact algorithms with almost identical quality (Omega-Index ≥ 0.99). We develop a new parallel DBSCAN algorithm, which uses dynamic partitioning to improve load balancing and locality. We demonstrate near-linear speedup on shared memory (15x using 16 cores, single node Intel® Xeon® processor) and distributed memory (3917x using 4096 cores, multinode) computers, with 2x additional performance improvement using Intel® Xeon Phi™ coprocessors. Additionally, existing exact algorithms can achieve up to 3.4 times speedup using dynamic partitioning.
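
For readers unfamiliar with the exact algorithm that the entry above approximates, here is a minimal serial DBSCAN sketch. It uses brute-force neighbour search, so it runs in O(n^2) rather than O(n log n), and it implements neither the density-based sampling heuristic nor the parallel partitioning of the paper. The parameter names `eps` and `min_pts` follow common usage and are otherwise assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Exact DBSCAN with brute-force neighbour search.

    Returns a list of cluster ids aligned with `points`; -1 marks noise.
    """
    n = len(points)

    def neighbours(i):
        # All points within distance eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n                   # None means "not yet visited"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue                  # already assigned; do not expand again
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:      # j is a core point, so keep expanding from it
                queue.extend(nbrs)
    return labels
```

The repeated `neighbours` calls dominate the cost, which is why spatial indexing, sampling, and partitioning are the usual targets for speeding the algorithm up.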
Proceedings Article•10.1145/2628071.2628075•
VAST: the illusion of a large memory space for GPUs

Janghaeng Lee1, Mehrzad Samadi1, Scott Mahlke1•
University of Michigan1
24 Aug 2014
TL;DR: Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space, and improves the re-targetability of the OpenCL with moderate overhead.
Abstract: Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled processing large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST does the following: automatically partitions the data parallel workload into chunks; efficiently extracts the precise working set required for the divided workload; rearranges the working set in contiguous memory space; and, transforms the kernel to operate on the reorganized working set. With VAST, the programmer is responsible for developing a data parallel kernel in OpenCL without concern for physical memory space limitations of individual GPUs. VAST transparently handles code generation dealing with the constraints of the actual physical memory and improves the re-targetability of the OpenCL with moderate overhead. Experiments demonstrate that a real GPU, NVIDIA GTX 760 with 2 GB of memory, can compute any size of data without program changes achieving 2.6× speedup over CPU execution, which is a realistic alternative for large data computation.
Proceedings Article•10.1109/ISPASS.2014.6844487•
Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

Chao Li1, Yi Yang, Hongwen Dai1, Shengen Yan2, Frank Mueller1, Huiyang Zhou1 •
North Carolina State University1, Chinese Academy of Sciences2
23 Mar 2014
TL;DR: This paper presents an in-depth study to reveal interesting and sometimes unexpected tradeoffs between shared memory and the hardware-managed L1 D-caches in GPU architecture and shows that most benchmarks perform significantly better with shared memory than with the L1 D-caches.
Abstract: On-chip caches are commonly used in computer systems to hide long off-chip memory access latencies. To manage on-chip caches, either software-managed or hardware-managed schemes can be employed. State-of-the-art accelerators, such as the NVIDIA Fermi or Kepler GPUs and Intel's forthcoming MIC “Knights Landing” (KNL), support both software-managed caches, a.k.a. shared memory (GPUs) or near memory (KNL), and hardware-managed L1 data caches (D-caches). Furthermore, shared memory and the L1 D-cache on a GPU utilize the same physical storage and their capacity can be configured at runtime (same for KNL). In this paper, we present an in-depth study to reveal interesting and sometimes unexpected tradeoffs between shared memory and the hardware-managed L1 D-caches in GPU architecture. In our study, the kernels utilizing the L1 D-caches are generated from those leveraging shared memory to ensure that the same optimizations such as tiling are applied equally in both versions. Our detailed analyses reveal that rather than cache hit rates, the following tradeoffs often have more profound performance impacts. On one hand, the kernels utilizing the L1 caches may support higher degrees of thread-level parallelism, offer more opportunities for data to be allocated in registers, and sometimes result in lower dynamic instruction counts. On the other hand, the applications utilizing shared memory enable more coalesced accesses and tend to achieve higher degrees of memory-level parallelism. Overall, our results show that most benchmarks perform significantly better with shared memory than the L1 D-caches due to the high impact of memory-level parallelism and memory coalescing.
Journal Article•10.1016/J.CAGEO.2014.06.003•
Finite-difference staggered grids in GPUs for anisotropic elastic wave propagation simulation

F. Rubio1, Mauricio Hanzich1, Albert Farrés1, Josep de la Puente1, José María Cela1 •
Barcelona Supercomputing Center1
01 Sep 2014-Computers & Geosciences
TL;DR: This work has tackled the problem of porting and optimizing a 3D elastic wave propagation engine which supports both standard- and fully-staggered grids to multi-GPU clusters and proposed and evaluated many optimization strategies and adopted the best performing ones for the final code.
Proceedings Article•10.1145/2588555.2595634•
SpongeFiles: mitigating data skew in MapReduce using distributed memory

Khaled Elmeleegy, Christopher Olston1, Benjamin Reed2•
Google1, Facebook2
18 Jun 2014
TL;DR: By enabling memory-sapped nodes to tap into the spare capacity of their neighbors, SpongeFiles minimize expensive disk spilling, thereby improving performance; in the authors' experiments with Hadoop and Pig, SpongeFiles reduce overall job runtimes by up to 55% and by up to 85% under disk contention.
Abstract: Data skew is a major problem for data processing platforms like MapReduce. Skew causes worker tasks to spill to disk what they cannot fit in memory, which slows down the task and the overall job. Moreover, the performance of other jobs sharing the same disk degrades. In many cases, this situation occurs even though the cluster has plenty of spare memory; it is just not used evenly. We introduce SpongeFiles, a novel distributed-memory abstraction tailored to data processing environments like MapReduce. A SpongeFile is a logical byte array, comprised of large chunks that can be stored in a variety of locations in the cluster. Spilled data goes to SpongeFiles, which route it to the nearest location with sufficient capacity (local memory, remote memory, local disk, or remote disk as a last resort). By enabling memory-sapped nodes to tap into the spare capacity of their neighbors, SpongeFiles minimize expensive disk spilling, thereby improving performance. In our experiments with Hadoop and Pig, SpongeFiles reduce overall job runtimes by up to 55% and by up to 85% under disk contention.
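The routing rule described in this abstract (send a spilled chunk to the nearest location with sufficient capacity) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Location` class, the `free_bytes` counter, and the `place_chunk` function are hypothetical names introduced here, and only the preference order is taken from the abstract.

```python
from dataclasses import dataclass

# Preference order as listed in the abstract: nearest first, remote disk last.
PREFERENCE = ("local memory", "remote memory", "local disk", "remote disk")


@dataclass
class Location:
    kind: str        # one of PREFERENCE
    free_bytes: int  # hypothetical capacity counter, not from the paper


def place_chunk(chunk_bytes, locations):
    """Reserve space in the most-preferred location that can hold the chunk."""
    for kind in PREFERENCE:
        for loc in locations:
            if loc.kind == kind and loc.free_bytes >= chunk_bytes:
                loc.free_bytes -= chunk_bytes  # reserve the space
                return loc
    raise RuntimeError("no location has enough free capacity")
```

With this rule, a chunk only reaches a disk when neither local nor remote memory has room for it, which is the behavior the abstract credits for the reduced spilling.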
Proceedings Article•10.1109/SC.2014.19•
Dissecting on-node memory access performance: a semantic approach

Alfredo Gimenez1, Todd Gamblin2, Barry Rountree2, Abhinav Bhatele2, Ilir Jusufi1, Peer-Timo Bremer2, Bernd Hamann1 •
University of California, Davis1, Lawrence Livermore National Laboratory2
16 Nov 2014
TL;DR: The techniques provide useful insight into the memory behaviour of applications and allow programmers to understand the performance ramifications of key design decisions: domain decomposition, multi-threading, and data motion within distributed memory systems.
Abstract: Optimizing memory access is critical for performance and power efficiency. CPU manufacturers have developed sampling-based performance measurement units (PMUs) that report precise costs of memory accesses at specific addresses. However, this data is too low-level to be meaningfully interpreted and contains an excessive amount of irrelevant or uninteresting information. We have developed a method to gather fine-grained memory access performance data for specific data objects and regions of code with low overhead and attribute semantic information to the sampled memory accesses. This information provides the context necessary to more effectively interpret the data. We have developed a tool that performs this sampling and attribution and used the tool to discover and diagnose performance problems in real-world applications. Our techniques provide useful insight into the memory behaviour of applications and allow programmers to understand the performance ramifications of key design decisions: domain decomposition, multi-threading, and data motion within distributed memory systems.
Journal Article•10.1109/TVCG.2013.261•
Exploring the Spectrum of Dynamic Scheduling Algorithms for Scalable Distributed-Memory Ray Tracing

Paul A. Navrátil1, Hank Childs2, Donald S. Fussell1, Calvin Lin1•
University of Texas at Austin1, University of Oregon2
01 Jun 2014-IEEE Transactions on Visualization and Computer Graphics
TL;DR: This paper extends and evaluates a family of dynamic ray scheduling algorithms that can be performed in-situ on large distributed memory parallel computers and shows that the dynamic approach can render data sets that are larger than aggregate system memory and that cannot be rendered by existing statically scheduled ray tracers.
Abstract: This paper extends and evaluates a family of dynamic ray scheduling algorithms that can be performed in-situ on large distributed memory parallel computers. The key idea is to consider both ray state and data accesses when scheduling ray computations. We compare three instances of this family of algorithms against two traditional statically scheduled schemes. We show that our dynamic scheduling approach can render data sets that are larger than aggregate system memory and that cannot be rendered by existing statically scheduled ray tracers. For smaller problems that fit in aggregate memory but are larger than typical shared memory, our dynamic approach is competitive with the best static scheduling algorithm.
Proceedings Article•10.1109/SC.2014.86•
High-performance computation of distributed-memory parallel 3D Voronoi and Delaunay tessellation

Tom Peterka1, Dmitriy Morozov2, Carolyn L. Phillips1•
Argonne National Laboratory1, Lawrence Berkeley National Laboratory2
16 Nov 2014
TL;DR: In this article, a distributed-memory scalable parallel Delaunay and Voronoi tessellation algorithm is proposed that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition.
Abstract: Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization, but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.
Journal Article•10.1109/TC.2013.38•
APC: A Novel Memory Metric and Measurement Methodology for Modern Memory Systems

Dawei Wang, Xian-He Sun
01 Jul 2014-IEEE Transactions on Computers
TL;DR: A novel memory metric called Access Per Cycle (APC), which is the number of data accesses per cycle, is proposed, to measure the overall memory performance with respect to the complexity of modern memory systems.
Abstract: Due to the infamous “memory wall” problem and a drastic increase in the number of data intensive applications, memory rather than processors has become the leading performance bottleneck in modern computing systems. Evaluating and understanding memory system performance is increasingly becoming the core of high-end computing. Conventional memory metrics, such as miss ratio, AMAT, etc., are designed to measure a given memory performance parameter, and do not reflect the overall performance or complexity of a modern memory system. On the other hand, widely used system-performance metrics, such as IPC, are designed to measure CPU performance, and do not directly reflect memory performance. In this paper, we propose a novel memory metric called Access Per Cycle (APC), which is the number of data accesses per cycle, to measure the overall memory performance with respect to the complexity of modern memory systems. A unique contribution of APC is its separation of memory evaluation from CPU evaluation; therefore, it provides a quantitative measurement of the “data-intensiveness” of an application. Simulation results show that the memory performance measured by APC captures the concurrency complexity of modern memory systems, while other metrics cannot. APC is simple, effective, and is significantly more appropriate than existing memory metrics in evaluating modern memory systems.
Patent•
Mixed type processing system and method oriented to industry big data diversity application

Qi Kaiyuan, Zhang Dong, Liu Zhengwei, Xin Guomao
03 Sep 2014
TL;DR: In this paper, a mixed type processing system and method oriented to industry big data diversity application is presented, which comprises a distributed data collection subsystem used for collecting data from an external system, a storing and parallel calculating subsystem for storing and calculating the collected data, and an integrated resource and system management platform used for managing the stored and calculated data.
Abstract: The invention discloses a mixed type processing system and method oriented to industry big data diversity application. The system comprises a distributed data collection subsystem used for collecting data from an external system, a storing and parallel calculating subsystem used for storing and calculating the collected data, and an integrated resource and system management platform used for managing the stored and calculated data. The storing and parallel calculating subsystem comprises a big data storing subsystem and a big data processing subsystem, wherein the big data processing subsystem comprises a memory calculating engine which is used for providing a distributed memory abstraction in a shared-nothing cluster and conducting parallel streamlined and thread lightweight processing on the collected data. The mixed type processing system and method can meet the requirements of industry big data diversity service application, the big data processing performance can be improved by ten or more times through acceleration of the memory calculating engine, and the usability, reliability and expandability of the system can be ensured through the integrated management platform.
Proceedings Article•10.1109/.41•
LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories

Hsin-Jung Yang1, Kermin Fleming2, Michael Adler2, Joel Emer1•
Massachusetts Institute of Technology1, Intel2
11 May 2014
TL;DR: This work proposes a shared-memory service that automatically manages coherent caches on multiple FPGAs and provides a set of easy-to-use declarative primitives to maintain coherency and consistency of accesses to shared memory resources.
Abstract: Parallel programming has been widely used in many scientific and technical areas to solve large problems. While general-purpose processors have rich infrastructure to support parallel programming on shared memory, such as coherent caches and synchronization libraries, parallel programming infrastructure for FPGAs is limited. Thus, development of FPGA-based parallel algorithms remains difficult. In this work, we seek to simplify parallel programming on FPGAs. We provide a set of easy-to-use declarative primitives to maintain coherency and consistency of accesses to shared memory resources. We propose a shared-memory service that automatically manages coherent caches on multiple FPGAs. Experimental results of a 2-dimensional heat transfer equation show that the shared memory service with our distributed coherent caches outperforms a centralized cache by 2.6x. To handle synchronization, we provide new lock and barrier primitives that leverage native FPGA communication capabilities and outperform traditional through-memory primitives by 1.8x.
Journal Article•10.1016/J.COMPFLUID.2014.01.005•
High order accurate simulation of compressible flows on GPU clusters over Software Distributed Shared Memory

Konstantinos I. Karantasis1, Eleftherios D. Polychronopoulos2, John A. Ekaterinaris3•
University of Illinois at Urbana–Champaign1, University of Patras2, Embry-Riddle Aeronautical University, Daytona Beach3
10 Apr 2014-Computers & Fluids
TL;DR: The results indicate that the presented approach is competitive with the message passing paradigm and they lay grounds for further research on the use of shared memory abstraction for future GPU clusters.
Proceedings Article•10.1109/IPDPS.2014.117•
Scaling Irregular Applications through Data Aggregation and Software Multithreading

Alessandro Morari1, Antonino Tumeo1, Daniel Chavarría-Miranda1, Oreste Villa2, Mateo Valero •
Pacific Northwest National Laboratory1, Nvidia2
19 May 2014
TL;DR: In this article, the authors describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation together with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems.
Abstract: Emerging applications in areas such as bioinformatics, data analytics, semantic databases and knowledge discovery employ datasets from tens to hundreds of terabytes. Currently, only distributed memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to large sizes, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses: i.e., they exhibit irregular behavior. Traditional commodity clusters, instead, exploit cache-based processors and high-bandwidth networks optimized for locality, regular computation and bulk communication. For these reasons, irregular applications are inefficient on these systems, and require custom, hand-coded optimizations to provide scaling in both performance and size. Lightweight software multithreading, which enables tolerating data access latencies by overlapping network communication with computation, and aggregation, which allows reducing overheads and increasing bandwidth utilization by coalescing fine-grained network messages, are key techniques that can speed up the performance of large scale irregular applications on commodity clusters. In this paper we describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation together with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems. We present the architecture of the runtime, explaining how it is designed around these two critical techniques. We show that irregular applications written using our runtime can outperform, even by orders of magnitude, the corresponding applications written using other programming models that do not exploit these techniques.
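The aggregation technique mentioned in this abstract (coalescing fine-grained network messages into larger ones) can be sketched in a few lines. This is a generic illustration, not GMT's API: the `Aggregator` class, the `send` callback, and the `max_batch_bytes` threshold are all assumptions made for the example.

```python
from collections import defaultdict


class Aggregator:
    """Buffers small messages per destination and sends them in batches."""

    def __init__(self, send, max_batch_bytes=4096):
        self.send = send  # hypothetical transport hook: send(dest, list_of_payloads)
        self.max_batch_bytes = max_batch_bytes
        self.buffers = defaultdict(list)  # dest -> pending payloads
        self.sizes = defaultdict(int)     # dest -> pending byte count

    def post(self, dest, payload: bytes):
        """Queue one small message; flush once the batch is large enough."""
        self.buffers[dest].append(payload)
        self.sizes[dest] += len(payload)
        if self.sizes[dest] >= self.max_batch_bytes:
            self.flush(dest)

    def flush(self, dest):
        """Send everything pending for one destination as a single batch."""
        if self.buffers[dest]:
            self.send(dest, self.buffers[dest])
            self.buffers[dest] = []
            self.sizes[dest] = 0

    def flush_all(self):
        """Drain every destination, e.g. at a synchronization point."""
        for dest in list(self.buffers):
            self.flush(dest)
```

The trade-off in a design like this is latency against bandwidth utilization: a larger threshold amortizes per-message overhead over more payloads, while a message may wait longer in its buffer before being sent.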
...
