TL;DR: In this article, the problem of constructing an erasure code for storage over a network when the data sources are distributed is considered, and it is shown that decentralized erasure codes are optimally sparse, and lead to reduced communication, storage and computation cost over random linear coding.
Abstract: In this correspondence, we consider the problem of constructing an erasure code for storage over a network when the data sources are distributed. Specifically, we assume that there are n storage nodes with limited memory and k < n sources generating the data. We want a data collector, who can appear anywhere in the network, to query any k storage nodes and be able to retrieve the data. We introduce decentralized erasure codes, which are linear codes with a specific randomized structure inspired by network coding on random bipartite graphs. We show that decentralized erasure codes are optimally sparse, and lead to reduced communication, storage and computation cost over random linear coding.
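The setting in the abstract above can be illustrated with a minimal sketch. It assumes a prime field GF(257), a hypothetical `fanout` parameter (how many storage nodes each source contacts), and plain Gaussian elimination for decoding; none of these specifics come from the paper, and a given choice of k nodes may fail to decode when its coefficient matrix happens to be singular.

```python
import random

P = 257  # assumed prime field size, chosen only for illustration


def encode(data, n, fanout, rng):
    """Each of the k sources sends its symbol to `fanout` random storage nodes.

    Every storage node keeps one symbol: a random linear combination (mod P)
    of whatever it received, together with the coefficient vector it used.
    """
    k = len(data)
    coeffs = [[0] * k for _ in range(n)]
    for src in range(k):
        for node in rng.sample(range(n), fanout):
            coeffs[node][src] = rng.randrange(1, P)
    stored = [sum(c * x for c, x in zip(row, data)) % P for row in coeffs]
    return coeffs, stored


def solve_mod_p(rows, values):
    """Solve rows * x = values over GF(P); return None if rows is singular."""
    k = len(rows)
    aug = [list(r) + [v] for r, v in zip(rows, values)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if aug[r][col] % P), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], P - 2, P)  # Fermat inverse, valid for prime P
        aug[col] = [(a * inv) % P for a in aug[col]]
        for r in range(k):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % P for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def collect(coeffs, stored, nodes):
    """A data collector queries the given k storage nodes and tries to decode."""
    return solve_mod_p([coeffs[i] for i in nodes], [stored[i] for i in nodes])
```

For example, with `coeffs, stored = encode([10, 20, 30], 8, 4, random.Random(0))`, a call such as `collect(coeffs, stored, [0, 3, 5])` returns either the original data or `None`. The sparsity of the code corresponds to keeping `fanout` small relative to n.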
TL;DR: The parallel implementation of second-order Møller-Plesset perturbation theory with the resolution-of-the-identity approximation (RI-MP2) extends the applicability of canonical MP2 to considerably larger systems.
Abstract: We present a parallel implementation of second-order Møller-Plesset perturbation theory with the resolution-of-the-identity approximation (RI-MP2). The implementation is based on a recent improved sequential implementation of RI-MP2 within the Turbomole program package and employs the message passing interface (MPI) standard for communication between distributed memory nodes. The parallel implementation extends the applicability of canonical MP2 to considerably larger systems. Examples are presented for full geometry optimizations with up to 60 atoms and 3300 basis functions and MP2 energy calculations with more than 200 atoms and 7000 basis functions.
TL;DR: The parallel SMO is developed using message passing interface (MPI) and shows great speedup on the adult data set and the Modified National Institute of Standards and Technology (MNIST) data set when many processors are used.
Abstract: Sequential minimal optimization (SMO) is a popular algorithm for training support vector machines (SVMs), but it still requires a large amount of computation time to solve large problems. This paper proposes a parallel implementation of SMO for training SVMs. The parallel SMO is developed using message passing interface (MPI). Specifically, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors to deal with each of the partitioned data sets. Experiments show that there is great speedup on the adult data set and the Modified National Institute of Standards and Technology (MNIST) data set when many processors are used. There are also satisfactory results on the Web data set.
TL;DR: ScalaBLAST is developed, which accommodates very large databases and which scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state-of-the-art in high-performance sequence alignment with scaling and portability.
Abstract: Genes in an organism's DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify protein components (proteome) being actively used for a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. However, the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and which scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state-of-the-art in high-performance sequence alignment with scaling and portability. ScalaBLAST relies on a collection of techniques (distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching) to achieve high performance and scalability. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.
TL;DR: This paper is the first to integrate a software transactional memory system with a malloc/free based memory allocator and presents the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup.
Abstract: Emerging multi-core processors promise to provide an exponentially increasing number of hardware threads with every generation. Applications will need to be highly concurrent to fully use the power of these processors. To enable maximum concurrency, libraries (such as malloc-free packages) would therefore need to use non-blocking algorithms. But lock-free algorithms are notoriously difficult to reason about and inappropriate for average programmers. Transactional memory promises to significantly ease concurrent programming for the average programmer. This paper describes a highly efficient non-blocking malloc/free algorithm that supports memory allocation and deallocation inside transactional code blocks. Thus this paper describes a memory allocator that is suitable for emerging multi-core applications, while supporting modern concurrency constructs. This paper makes several novel contributions. It is the first to integrate a software transactional memory system with a malloc/free based memory allocator. We present the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup. Unlike previous lock-free malloc packages, our algorithm avoids atomic operations on typical code paths, making our algorithm substantially more efficient.
TL;DR: This paper examines applying phase analysis algorithms and how to adapt them to parallel applications running on shared memory processors, and examines using the phase analysis to pick simulation points to guide multithreaded simulation.
Abstract: Most programs are repetitive, where similar behavior can be seen at different execution times. Algorithms have been proposed that automatically group similar portions of a program's execution into phases, where samples of execution in the same phase have homogeneous behavior and similar resource requirements. In this paper, we examine applying these phase analysis algorithms and how to adapt them to parallel applications running on shared memory processors. Our approach relies on a separate representation of each thread's activity. We first focus on showing its ability to identify similar intervals of execution across threads for a single run. We then show that it is effective at identifying similar behavior of a program when the number of threads is varied between runs. This can be used by developers to examine how different phases scale across different numbers of threads. Finally, we examine using the phase analysis to pick simulation points to guide multithreaded simulation.
TL;DR: A parallel, distributed memory algorithm for volume rendering massive data sets and two techniques for calculating the sample points: a 3D rasterization technique and a kernel-based technique, which trade off between speed and generality.
Abstract: We introduce a parallel, distributed memory algorithm for volume rendering massive data sets. The algorithm's scalability has been demonstrated up to 400 processors, rendering one hundred million unstructured elements in under one second. The heart of the algorithm is a hybrid approach that parallelizes over both the elements of the input data and over the pixels of the output image. At each stage of the algorithm, there are strong limits on how much work each processor performs, ensuring good parallel efficiency. The algorithm is sample-based. We present two techniques for calculating the sample points: a 3D rasterization technique and a kernel-based technique, which trade off between speed and generality. Finally, the algorithm is very flexible. It can be deployed in general purpose visualization tools and can also support diverse mesh types, ranging from structured grids to curvilinear and unstructured meshes to point clouds.
TL;DR: This paper describes a unique long-message MPI library 'pipeline' protocol that addresses constraints while avoiding some of the pitfalls of existing techniques by using portable send/receive semantics to hide the cost of initializing the pipeline algorithm.
Abstract: Modern network communication libraries that leverage Remote Direct Memory Access (RDMA) and OS bypass protocols, such as InfiniBand [2] and Myrinet [10], can offer significant performance advantages over conventional send/receive protocols. However, this performance often comes with hidden per-buffer setup costs [4]. This paper describes a unique long-message MPI [9] library 'pipeline' protocol that addresses these constraints while avoiding some of the pitfalls of existing techniques. By using portable send/receive semantics to hide the cost of initializing the pipeline algorithm, and then effectively overlapping the cost of memory registration with RDMA operations, this protocol provides very good performance for any large-memory usage pattern. This approach avoids the use of non-portable memory hooks or keeping registered memory from being returned to the OS. Through this approach, bandwidth may be increased up to 67% when memory buffers are not effectively reused while providing superior performance in the effective bandwidth benchmark. Several user-level protocols are explored using Open MPI's PML (point-to-point messaging layer) and compared and contrasted with this 'pipeline' protocol.
TL;DR: In this article, the authors propose a technique for providing access to shared data based on enhanced standard virtual memory mechanisms. The technique can be configured to handle shared data at the sub-page-granular level using sparse virtual pages to minimize memory access contention and thus improve the likelihood of quick re-hits.
Abstract: A technique involves providing access to shared data based on enhanced standard virtual memory mechanisms. Once data from a shared area of memory is moved into primary memory of a first computerized device from a second computerized device, the first computerized device can retain that data in order to shorten the latency of subsequent accesses. Such a technique can be configured to handle shared data at the sub-page-granular level using sparse virtual pages to minimize memory access contention and thus improve the likelihood of quick re-hits. Furthermore, such a technique can be conveniently accomplished through an enhancement to a common page fault handler of an operating system and utilizing atomic remote access support from a standard communications protocol thus alleviating the need to employ more costly and complicated solutions such as inflexible hardware implementations or independent programs that could pose additional design burdens and reliability concerns.
TL;DR: A distributed shared memory architecture suitable for low-power on-chip multiprocessors based on NoC is explored, with a focus on the energy/delay exploration of an on-chip physically distributed and logically shared memory address space for MP-SoCs based on a parameterizable NoC.
Abstract: Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoC), have been recently proposed. The shared memory represents one of the key elements in designing MP-SoCs, since its function is to provide data exchange and synchronization support. In this paper, a distributed shared memory architecture suitable for low-power on-chip multiprocessors based on NoC is explored. In particular, the paper focuses on the energy/delay exploration of on-chip physically distributed and logically shared memory address space for MP-SoCs based on a parameterizable NoC. The data allocation on the physically distributed shared memory space is dynamically managed by an on-chip Hardware Memory Management Unit. Experimental results show the impact of different NoC topologies and distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective.
TL;DR: In this article, the authors evaluate the performance of quantum arithmetic algorithms run on a distributed quantum computer (a quantum multicomputer), and show that the teledata approach performs better, and that carry-ripple adders perform well when the teleportation block is decomposed so that the key quantum operations can be parallelized.
Abstract: We evaluate the performance of quantum arithmetic algorithms run on a distributed quantum computer (a quantum multicomputer). We vary the node capacity and I/O capabilities, and the network topology. The tradeoff of choosing between gates executed remotely, through "teleported gates" on entangled pairs of qubits (telegate), versus exchanging the relevant qubits via quantum teleportation, then executing the algorithm using local gates (teledata), is examined. We show that the teledata approach performs better, and that carry-ripple adders perform well when the teleportation block is decomposed so that the key quantum operations can be parallelized. A node size of only a few logical qubits performs adequately provided that the nodes have two transceiver qubits. A linear network topology performs acceptably for a broad range of system sizes and performance parameters. We therefore recommend pursuing small, high-I/O bandwidth nodes and a simple network. Such a machine will run Shor's algorithm for factoring large numbers efficiently.
TL;DR: An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors.
Abstract: This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors. Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, eliminates several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks (HPC RandomAccess, HPC STREAM and NAS CG) to obtain scaling and absolute performance numbers for these benchmarks on up to 131072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.
TL;DR: This paper presents Parallel External A*, a parallel variant of external memory directed model checking. Assuming a sufficient number of processors and of open file pointers per process, the I/O complexity is reduced to linear by exploiting a hash-function based state space partition scheme.
Abstract: In this paper we present Parallel External A*, a parallel variant of external memory directed model checking. As a model scales up, its successor generation becomes complex and, in turn, starts to impact the running time of the model checker. Probings of our external memory model checker IO-HSF-SPIN revealed that in some of the cases about 70% of the whole running time was consumed in the internal processing. Employing a multiprocessor machine or a cluster of workstations, we can distribute the internal working load of the algorithm on multiple processors. Moreover, assuming a sufficient number of processors and number of open file pointers per process, the I/O complexity is reduced to linear by exploiting a hash-function based state space partition scheme.
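A hash-function based state space partition, as mentioned in the abstract above, can be sketched as follows. This is an illustration only: it assumes states are serialised as bytes and uses SHA-256 as a stand-in hash; the paper's actual hash function, file layout and A* bookkeeping are not described here, and the names `owner` and `partition` are hypothetical.

```python
import hashlib


def owner(state: bytes, num_procs: int) -> int:
    """Map a state to the processor responsible for it via a stable hash."""
    digest = hashlib.sha256(state).digest()
    return int.from_bytes(digest[:8], "big") % num_procs


def partition(states, num_procs):
    """Split states into per-processor buckets; duplicates share a bucket."""
    buckets = [set() for _ in range(num_procs)]
    for s in states:
        buckets[owner(s, num_procs)].add(s)
    return buckets
```

Because every copy of a state maps to the same owner, each processor can detect duplicates within its own bucket without consulting the others.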
TL;DR: The first implementation and experimental evaluation of a scalable parallel algorithm for suffix array construction that works on distributed memory computers using MPI and makes it possible to build suffix arrays for huge inputs very quickly.
Abstract: Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular in bioinformatics. We describe the first implementation and experimental evaluation of a scalable parallel algorithm for suffix array construction. The implementation works on distributed memory computers using MPI. Experiments with up to 128 processors show good constant factors and make it look likely that the algorithm would also scale to considerably larger systems. This makes it possible to build suffix arrays for huge inputs very quickly. Our algorithm is a parallelization of the linear time DC3 algorithm.
TL;DR: The paper describes the development and performance of parallel algorithms for discrete element method (DEM) software, including a novel algorithm for distributed memory PC clusters that combines link-cells for contact detection, static domain decomposition for parallelization, and MPI data transfer for processors exchanging particles.
Abstract: The paper describes the development and performance of parallel algorithms for the discrete element method (DEM) software. A spatial domain decomposition strategy and message passing inter-processor communication have been implemented in the DEMMAT code for simulation of visco-elastic frictional granular media. A novel algorithm combining link-cells for contact detection, static domain decomposition for parallelization, and MPI data transfer for processors exchanging particles has been developed for distributed memory PC clusters. The parallel software DEMMAT_PAR has been applied to model the compaction of spherical particles in a rectangular box. Two benchmark problems with different numbers of particles have been solved in order to measure the parallel efficiency of the code. The inter-processor communication has been examined in order to improve the domain decomposition topology and to achieve better load balancing. A speed-up of 11 has been obtained on 16 processors. The parallel performance study has been performed on the PC cluster VILKAS of Vilnius Gediminas Technical University, Lithuania.
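The speed-up figure quoted above can be turned into a parallel efficiency using the standard definition (speed-up divided by processor count). The helper below is a small illustration with a hypothetical function name, not code from DEMMAT_PAR.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Efficiency = speed-up divided by the number of processors used."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Figures quoted in the abstract: speed-up of 11 on 16 processors.
print(parallel_efficiency(11, 16))  # 0.6875
```

Under this definition, the reported run corresponds to an efficiency of roughly 69%.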
TL;DR: Most HPC systems are clusters of shared memory nodes, ranging from PC clusters to “constellation” type systems with large SMP nodes, so parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node.
Abstract: Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards, but also “constellation” type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside of each node.
TL;DR: The proposed distributed instruction memory organization requires minimal hardware overhead and allows execution of multiple loops in parallel in a uni-processor system and can reduce the energy consumed in the instruction and data memory hierarchy.
Abstract: Reduced energy consumption is one of the most important design goals for embedded application domains like wireless, multimedia and biomedical. The instruction memory hierarchy has been proven to be one of the most power-hungry parts of the system. This paper introduces an architectural enhancement for the instruction memory to reduce energy and improve performance. The proposed distributed instruction memory organization requires minimal hardware overhead and allows execution of multiple loops in parallel in a uni-processor system. This architecture enhancement can reduce the energy consumed in the instruction and data memory hierarchy by 70.01% and improve the performance by 32.89% compared to enhanced SMT based architectures.
TL;DR: An automated application specific co-synthesis methodology for memory and communication architectures (COSMECA) in MPSoC designs is proposed to design a communication architecture having the least number of busses, which satisfies performance and memory area constraints, while the secondary objective is to reduce the memory area cost.
Abstract: Memory and communication architectures have a significant impact on the cost, performance, and time-to-market of complex multi-processor system-on-chip (MPSoC) designs. The memory architecture dictates most of the data traffic flow in a design, which in turn influences the design of the communication architecture. Thus there is a need to co-synthesize the memory and communication architectures to avoid making sub-optimal design decisions. This is in contrast to traditional platform-based design approaches where memory and communication architectures are synthesized separately. In this paper, we propose an automated application specific co-synthesis methodology for memory and communication architectures (COSMECA) in MPSoC designs. The primary objective is to design a communication architecture having the least number of busses, which satisfies performance and memory area constraints, while the secondary objective is to reduce the memory area cost. Results of applying COSMECA to several industrial strength MPSoC applications from the networking domain indicate a saving of as much as 40% in number of busses and 29% in memory area compared to the traditional approach.
TL;DR: A two-step Modified Critical Path and Area-based (MCPA) scheduling heuristic is developed which aims to improve the processor allocation phase of an existing Critical Path and Area-based (CPA) scheduling algorithm; it turns out to be much better than the parent CPA algorithm and comparable to the high-complexity Critical Path Reduction (CPR) algorithm.
Abstract: Scheduling of most parallel scientific applications demands simultaneous exploitation of task and data parallelism for efficient and effective utilization of system and other resources. Traditional optimization techniques, like optimal control-theoretic approaches, convex programming, and bin-packing, have been suggested in the literature for dealing with the most critical processor allocation phase. However, their application to real-world problems is not straightforward, which moves the solutions away from optimality. Heuristic-based approaches, in contrast, work in the integer domain for the number of processors throughout, and perform appreciably well. A two-step Modified Critical Path and Area-based (MCPA) scheduling heuristic is developed which aims to improve the processor allocation phase of an existing Critical Path and Area-based (CPA) scheduling algorithm. The strength of the suggested algorithm lies in bridging the gap between the processor allocation and task assignment phases of scheduling. It helps in making better processor allocations for data parallel tasks without sacrificing the essential task parallelism available in the application program. The performance of the MCPA algorithm, in terms of normalized schedule length and speedup, is evaluated for random and real application task graph suites. It turns out to be much better than the parent CPA algorithm and comparable to the high-complexity Critical Path Reduction (CPR) algorithm.
TL;DR: In this paper, the authors present a data reuse analysis technique that allows the system designer to explore a wide range of customized memory hierarchy organizations with different size and energy profiles, and to find feasible memory subsystem solutions that meet power and area constraints while maintaining the necessary performance level.
Abstract: The increasing use of multiprocessor systems-on-chip (MPSoCs) for high performance demands of embedded applications results in high power dissipation. The memory subsystem is a large and critical contributor to both energy and performance, requiring system designers to perform exploration of low power memory organizations. In this paper we present a novel multiprocessor data reuse analysis technique that allows the system designer to explore a wide range of customized memory hierarchy organizations with different size and energy profiles. Our technique enables the system designer to explore feasible memory subsystem solutions that meet power and area constraints while maintaining the necessary performance level. Our experiments on the complex QSDPCM benchmark illustrate the exploration of a wide range of customized memory hierarchies for an MPSoC implementation.
TL;DR: In this article, a kernel scheduler is utilized to schedule a next application for execution in accordance with the memory availability and the determined memory requirements of the one or more monitored applications.
Abstract: A system, method, and program product for scheduling applications in a data processing system. The method of the invention includes a step of monitoring memory availability within the data processing system during execution of one or more applications. The method further includes recording a memory usage history for one or more applications from which a memory requirement may be determined. A kernel scheduler is utilized to schedule a next application for execution in accordance with the memory availability and the determined memory requirements of the one or more monitored applications.
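The scheduling step described above can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the requirement estimate is taken as the peak of the recorded usage history, the queue is scanned in order, and the names `estimate_requirement` and `pick_next` are hypothetical rather than taken from the source.

```python
def estimate_requirement(history):
    """Assumed estimator: the peak of the recorded usage samples (in MB)."""
    return max(history) if history else 0


def pick_next(ready_queue, usage_history, available_mb):
    """Return the first queued application whose estimate fits, else None."""
    for app in ready_queue:
        if estimate_requirement(usage_history.get(app, [])) <= available_mb:
            return app
    return None
```

For instance, `pick_next(["a", "b"], {"a": [500, 900], "b": [100, 200]}, 300)` skips "a" (estimated at 900 MB) and returns "b".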
TL;DR: In this paper, a shared memory is used by concurrent threads in a multithreaded processor, with any addressable storage location in the shared memory being readable and writeable by any of the threads.
Abstract: A shared memory is usable by concurrent threads in a multithreaded processor, with any addressable storage location in the shared memory being readable and writeable by any of the threads. Processing engines that execute the threads are coupled to the shared memory via an interconnect that transfers data in only one direction (e.g., from the shared memory to the processing engines); the same interconnect supports both read and write operations. The interconnect advantageously supports multiple parallel read or write operations.
TL;DR: In a multiprocessor system that executes parallel threads, a controller receives memory requests from the threads and coalesces them to improve memory transfer efficiency.
Abstract: A multiprocessor system executes parallel threads. A controller receives memory requests from the parallel threads and coalesces the memory requests to improve memory transfer efficiency.
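The abstract does not say how requests are coalesced. One common way to picture it, shown here purely as an assumed illustration, is to group per-thread addresses by the aligned memory segment they fall in, so that each segment is transferred once. The 64-byte segment size and the function name `coalesce` are hypothetical.

```python
from collections import defaultdict

SEGMENT_BYTES = 64  # assumed transfer granularity, chosen for illustration


def coalesce(requests):
    """Group (thread_id, address) requests by the aligned segment they touch.

    Returns a dict mapping segment base address -> list of thread ids, so
    each segment needs to be transferred once for all threads that want it.
    """
    segments = defaultdict(list)
    for thread_id, address in requests:
        base = address - (address % SEGMENT_BYTES)
        segments[base].append(thread_id)
    return dict(segments)
```

With this grouping, requests from threads 0, 1 and 2 at addresses 0, 4 and 8 collapse into a single transfer of the segment at base address 0.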
TL;DR: In this paper, a high-speed protocol handler combines parallelism and pipelining in multiple programmable processors, along with specialized front-end logic at the network interface that handles time critical protocol operations.
Abstract: An architecture that achieves high speed performance in a network protocol handler combines parallelism and pipelining in multiple programmable processors, along with specialized front-end logic at the network interface that handles time critical protocol operations. The multiple processors are interconnected via high-speed interconnect, and each processor's memory is globally accessible by other processors. Each processor has multiple threads, each capable of fully executing programs. Each processor contains embedded dynamic random access memory (DRAM). Threads within a processor are assigned the processing of various protocol functions in a parallel/pipelined fashion. Data frame processing is done by one or more of the threads to identify related frames. Related frames are dispatched to the same thread so as to minimize the overhead associated with memory accesses and general protocol processing. The high-speed protocol handler may also provide built-in monitors for examining the activity of its hardware resources and reallocating the workload to the resources that are not heavily used, thus balancing the resource utilization and increasing the workload throughput.
TL;DR: The findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism, and a customized on-chip memory hierarchy mapping are presented.
Abstract: Memory latency has always been a major issue in embedded systems that execute memory-intensive applications. This is even more true as the gap between processor and memory speed continues to grow. Hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in large off-chip memories; however, both types of prefetching have their shortcomings. Hardware schemes are more complex and require extra circuitry to compute data access strides, while software schemes generate prefetch instructions, which if not computed carefully may hamper performance. On the other hand, some application domains (such as multimedia) have a uniform memory access pattern that is known a priori and that, if exploited, could yield significant application performance improvement. With this characteristic in mind, we present our findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism, and a customized on-chip memory hierarchy mapping. Compared to previous approaches, we are able to estimate the performance and power metrics without actually implementing the embedded system. Experimental results on nine well-known multimedia and imaging applications prove the efficiency of our technique. Finally, we verify the performance estimations by implementing and simulating the algorithms on the TI C6201 processor.
TL;DR: This work proposes a service-oriented grid memory sharing scheme, distributed paging RAM grid (DPRG), and designs major operations and optimizations to fit the usage of grid systems.
Abstract: Memory-intensive applications often suffer from the poor performance of disk swapping when memory is inadequate. Remote memory sharing schemes, which provide a remote memory that is faster than the local hard disk, are able to improve the performance of such applications. However, because they are applicable only within single clusters, most previous remote memory mechanisms, such as the network memory scheme, cannot be extended to a large-scale, distributed, heterogeneous, and dynamic environment. In this work, we propose a service-oriented grid memory sharing scheme, distributed paging RAM grid (DPRG). We study the properties and criteria of large-scale memory sharing, and then design major operations and optimizations to fit the usage of grid systems. We collect traces from our grid environment, and evaluate DPRG through comprehensive trace-driven simulations. Results show that DPRG significantly outperforms existing remote memory sharing schemes and supports grid computing applications effectively.
TL;DR: In this paper, the authors describe a method for gathering performance-related data in a multiprocessing environment, in which instrumentation code, executed on a processor that minimizes the distortion to the processor resources used to execute the program to be profiled, writes data to a shared memory that is read by a set of profiling processors.
Abstract: A computer implemented method, apparatus, and computer usable medium for gathering performance related data in a multiprocessing environment. Instrumentation code is executed on a processor that minimizes the distortion to the processor resources used to execute the program to be profiled. Data is written by the instrumentation code to a shared memory in response to an event occurring during execution of the program. The data is generated during execution of the program on the processor and the instrumentation code uses shared memory to convey the data to a profiling application running on a set of profiling processors. The data is collected by the set of profiling processors in the shared memory written by the instrumentation code.
TL;DR: This tutorial will survey the micro-architectural and architectural differences across these processors with data management in mind, and will present previous work and future opportunities for expanding query processing algorithms to other hardware than general-purpose processors.
Abstract: The rapid increase in data volumes over the past few decades has intensified the need for high processing power for database and data mining applications. Researchers have actively sought to design and develop new architectures for improving performance. Recent research shows that performance can be significantly improved using either (a) effective utilization of the architectural features and memory hierarchies of conventional processors, or (b) the high computational power and memory bandwidth of commodity hardware such as network processing units (NPUs), Cell processors and graphics processing units (GPUs). This tutorial will survey the micro-architectural and architectural differences across these processors with data management in mind, and will present previous work and future opportunities for expanding query processing algorithms to other hardware than general-purpose processors. In addition to the database community, we intend to increase awareness in the computer architecture scene about opportunities to construct heterogeneous chips.
TL;DR: A new memory scheduler is presented that makes decisions based on the history of recently scheduled operations, providing two advantages: it can better reason about the delays associated with complex DRAM structure, and it can adapt to different observed workload.
Abstract: Careful memory scheduling can increase memory bandwidth and overall system performance. We present a new memory scheduler that makes decisions based on the history of recently scheduled operations, providing two advantages: it can better reason about the delays associated with complex DRAM structure, and it can adapt to different observed workload
TL;DR: In this paper, a transactional-memory manager coordinates memory access requests directed at the memory from a plurality of transactions in a software transactional memory system; in response to a commit request from a given transaction in a set of transactions that have requested synchronization on a shared object, the manager determines whether to commit or abort the given transaction based at least in part on the transactional states of the other transactions in the set.
Abstract: A system for implementing synchronized objects for software transactional memory may include one or more processors and a memory storing program instructions executable by the processor to implement a transactional-memory manager configured to coordinate memory access requests directed at the memory from a plurality of transactions. The transactional-memory manager records, within a collaborator record for a shared data object in the memory, identifications of a set of two or more transactions that have requested synchronization on the object. In response to a commit request from a given transaction of the set, the transactional-memory manager determines whether to commit or abort the given transaction based at least in part on the transactional states of other transactions in the set, examining the collaborator record to identify the other transactions.
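The collaborator-record mechanism described in this abstract might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the commit rule, so the policy used here (abort if any other collaborator has already aborted, otherwise commit) is hypothetical, as are all class and method names.

```python
from dataclasses import dataclass, field
from enum import Enum


class TxState(Enum):
    ACTIVE = "active"
    COMMITTED = "committed"
    ABORTED = "aborted"


@dataclass
class CollaboratorRecord:
    """IDs of the transactions that requested synchronization on one shared object."""
    tx_ids: set = field(default_factory=set)


class TransactionalMemoryManager:
    """Hypothetical manager that tracks transaction states and collaborator records."""

    def __init__(self) -> None:
        self.states: dict = {}   # transaction id -> TxState
        self.records: dict = {}  # shared object id -> CollaboratorRecord

    def begin(self, tx_id: str) -> None:
        self.states[tx_id] = TxState.ACTIVE

    def request_sync(self, tx_id: str, object_id: str) -> None:
        # Record the transaction in the collaborator record of the shared object.
        record = self.records.setdefault(object_id, CollaboratorRecord())
        record.tx_ids.add(tx_id)

    def abort(self, tx_id: str) -> None:
        self.states[tx_id] = TxState.ABORTED

    def request_commit(self, tx_id: str, object_id: str) -> TxState:
        # Examine the collaborator record to identify the other transactions.
        others = self.records[object_id].tx_ids - {tx_id}
        # Assumed policy (not from the abstract): abort if any collaborator aborted.
        if any(self.states[other] is TxState.ABORTED for other in others):
            self.states[tx_id] = TxState.ABORTED
        else:
            self.states[tx_id] = TxState.COMMITTED
        return self.states[tx_id]
```

A short usage example: after two transactions `t1` and `t2` both call `request_sync` on the same object, `request_commit("t1", ...)` commits `t1` while `t2` is still active, but aborts `t1` if `t2` was aborted first. A real system could weigh the collaborators' states quite differently; the point is only that the decision consults the record rather than the committing transaction alone.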