TL;DR: The coupling tool preCICE is presented, which offers the complete coupling functionality required for fast development of a multi-physics environment using existing, possibly black-box solvers; numerical examples show the high flexibility, correctness, high performance, and parallel scalability of coupled simulations with preCICE as the coupling unit.
TL;DR: OpenRAM is introduced, an open-source memory compiler that provides a platform for the generation, characterization, and verification of fabricable memory designs across various technologies, sizes, and configurations and enables research in computer architecture, system-on-chip design, memory circuit and device research, and computer-aided design.
Abstract: Computer systems research is often inhibited by the availability of memory designs. Existing Process Design Kits (PDKs) frequently lack memory compilers, while expensive commercial solutions only provide memory models with immutable cells, limited configurations, and restrictive licenses. Manually creating memories can be time consuming and tedious and the designs are usually inflexible. This paper introduces OpenRAM, an open-source memory compiler, that provides a platform for the generation, characterization, and verification of fabricable memory designs across various technologies, sizes, and configurations. It enables research in computer architecture, system-on-chip design, memory circuit and device research, and computer-aided design.
TL;DR: A distributed-memory library for computations with dense structured matrices using Hierarchically Semi-Separable (HSS) representations is presented; the compression algorithm that computes the HSS form of an input dense matrix relies on randomized sampling with a novel adaptive sampling mechanism.
Abstract: We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank. Here, we use Hierarchically Semi-Separable (HSS) representations. Such matrices appear in many applications, for example, finite-element methods, boundary element methods, and so on. Exploiting this structure allows for fast solution of linear systems and/or fast computation of matrix-vector products, which are the two main building blocks of matrix computations. The compression algorithm that we use, which computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of structured matrix-vector product, structured factorization, and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores. This work is part of a more global effort, the STRUctured Matrices PACKage (STRUMPACK) software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step in the direction of a distributed-memory sparse solver.
TL;DR: The MEE is a successful feat of real-world cryptographic engineering: it's the first time such cryptographic memory protection has been added to a widely deployed general-purpose processor.
Abstract: Intel's Software Guard Extensions allows general-purpose computing platforms to run software in a trustworthy manner and securely handle encrypted data. To satisfy the technology's security goals, the external system memory must be cryptographically protected. A new hardware unit added to the processor's memory controller--the Memory Encryption Engine (MEE)--was recently developed to protect the confidentiality, integrity, and freshness of this external memory traffic, against eavesdropping and tampering. The MEE is a successful feat of real-world cryptographic engineering: it's the first time such cryptographic memory protection has been added to a widely deployed general-purpose processor.
TL;DR: Shreds, a set of OS-backed programming primitives that addresses developers' currently unmet needs for fine-grained, convenient, and efficient protection of sensitive memory content against in-process adversaries, is proposed.
Abstract: Once attackers have injected code into a victim program's address space, or found a memory disclosure vulnerability, all sensitive data and code inside that address space are subject to theft or manipulation. Unfortunately, this broad type of attack is hard to prevent, even if software developers wish to cooperate, mostly because conventional memory protection only works at the process level and previously proposed in-process memory isolation methods are not practical for wide adoption. We propose shreds, a set of OS-backed programming primitives that addresses developers' currently unmet needs for fine-grained, convenient, and efficient protection of sensitive memory content against in-process adversaries. A shred can be viewed as a flexibly defined segment of a thread execution (hence the name). Each shred is associated with a protected memory pool, which is accessible only to code running in the shred. Unlike previous works, shreds offer in-process private memory without relying on separate page tables, nested paging, or even modified hardware. In addition, shreds provide the essential data flow and control flow guarantees for running sensitive code. We have built the compiler toolchain and the OS module that together enable shreds on Linux. We demonstrated the usage of shreds and evaluated their performance using five non-trivial open source programs, including OpenSSH and Lighttpd. The results show that shreds are fairly easy to use and incur low runtime overhead (4.67%).
TL;DR: This paper is primarily dedicated to the distributed memory parallelization of particle methods, targeting several thousands of CPU cores, with a focus on one particular particle method, Smoothed Particle Hydrodynamics (SPH), which is one of the most widespread today in the literature as well as in engineering.
TL;DR: This article introduces a parallel and distributed memory-based algorithm that builds vulnerability-based attack graphs on a distributed multi-agent platform and introduces a rich attack template and network model in order to form chains of vulnerability exploits in attack graphs more precisely.
Abstract: Attack graphs show possible paths that an attacker can use to intrude into a target network and gain privileges through series of vulnerability exploits. The computation of attack graphs suffers from the state explosion problem occurring most notably when the number of vulnerabilities in the target network grows large. Parallel computation of attack graphs can be utilized to attenuate this problem. When employed in online network security evaluation, the computation of attack graphs can be triggered with the correlated intrusion alerts received from sensors scattered throughout the target network. In such cases, distributed computation of attack graphs becomes valuable. This article introduces a parallel and distributed memory-based algorithm that builds vulnerability-based attack graphs on a distributed multi-agent platform. A virtual shared memory abstraction is proposed to be used over such a platform, whose memory pages are initialized by partitioning the network reachability information. We demonstrate the feasibility of parallel distributed computation of attack graphs and show that even a small degree of parallelism can effectively speed up the generation process as the problem size grows. We also introduce a rich attack template and network model in order to form chains of vulnerability exploits in attack graphs more precisely.
TL;DR: A processor is described that has multiple domains, including at least a core domain and a non-core domain that is transparent to an operating system (OS); the non-core domain can be controlled by a driver.
Abstract: In one embodiment, the present invention includes a processor having multiple domains including at least a core domain and a non-core domain that is transparent to an operating system (OS). The non-core domain can be controlled by a driver. In turn, the processor further includes a memory interconnect to interconnect the core domain and the non-core domain to a memory coupled to the processor. Still further, a power controller, which may be within the processor, can control a frequency of the memory interconnect based on memory boundedness of a workload being executed on the non-core domain. Other embodiments are described and claimed.
TL;DR: It is found that memory interference can be significantly reduced by partitioning DRAM banks, and co-locating memory-intensive tasks on the same processing core, which provides a significant improvement in task schedulability over previous work, with as much as 96 % more tasksets being schedulable.
Abstract: In multi-core systems, main memory is a major shared resource among processor cores. A task running on one core can be delayed by other tasks running simultaneously on other cores due to interference in the shared main memory system. Such memory interference delay can be large and highly variable, thereby posing a significant challenge for the design of predictable real-time systems. In this paper, we present techniques to reduce this interference and provide an upper bound on the worst-case interference on a multi-core platform that uses a commercial-off-the-shelf (COTS) DRAM system. We explicitly model the major resources in the DRAM system, including banks, buses, and the memory controller. By considering their timing characteristics, we analyze the worst-case memory interference delay imposed on a task by other tasks running in parallel. We find that memory interference can be significantly reduced by (i) partitioning DRAM banks, and (ii) co-locating memory-intensive tasks on the same processing core. Based on these observations, we develop a memory interference-aware task allocation algorithm for reducing memory interference. We evaluate our approach on a COTS-based multi-core platform running Linux/RK. Experimental results show that the predictions made by our approach are close to the measured worst-case interference under workloads with both high and low memory contention. In addition, our memory interference-aware task allocation algorithm provides a significant improvement in task schedulability over previous work, with as much as 96 % more tasksets being schedulable.
TL;DR: According to the Causal Theory of Memory (CTM), remembering a particular past event requires a causal connection between that event and its subsequent representation in memory, specifically, a connection sustained by a memory trace as discussed by the authors.
Abstract: According to the Causal Theory of Memory (CTM), remembering a particular past event requires a causal connection between that event and its subsequent representation in memory, specifically, a connection sustained by a memory trace. The CTM is the default view of memory in contemporary philosophy, but debates persist over what the involved memory traces must be like. Martin and Deutscher (Philos Rev 75:161–196, 1966) argued that the CTM required memory traces to be structural analogues of past events. Bernecker (Memory: A philosophical study. Oxford University Press, Oxford, 2010) and Michaelian (Philos Psychol 24:323–342, 2011), contemporary CTM proponents, reject structural analogues in favor of memory traces as distributed patterns of event features. The proposals are understood as distinct accounts of how memory traces represent past events. But there are two distinct questions one could ask about a trace’s representational features. One might ask how memory traces, qua mental representations, have their semantic properties. Or, what makes memory traces, qua mental representations of memories, distinct from other mental representations. Proponents of the CTM, both past and present, have failed to keep these two questions distinct. The result is a serious but unnoticed problem for the CTM in its current form. Distributed memory traces are incompatible with the CTM. Such traces do not provide a way to track the causal history of individual memories, as the CTM requires. If memory traces are distributed patterns of event features, as Bernecker and Michaelian each claim, then the CTM cannot be right.
TL;DR: This article compiles the latest results from a variety of different systems aiming at the construction of a scalable quantum computer.
Abstract: Experimental groups are now fabricating quantum processors powerful enough to execute small instances of quantum algorithms and definitively demonstrate quantum error correction that extends the lifetime of quantum data, adding urgency to architectural investigations. Although other options continue to be explored, effort is coalescing around topological coding models as the most practical implementation option for error correction on realizable microarchitectures. Scalability concerns have also motivated architects to propose distributed memory multicomputer architectures, with experimental efforts demonstrating some of the basic building blocks to make such designs possible. We compile the latest results from a variety of different systems aiming at the construction of a scalable quantum computer.
TL;DR: This paper introduces the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the necessary semantic knowledge to efficiently target distributed heterogeneous architectures and shows straightforward analyses that determine what data to distribute based on its usage.
Abstract: High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However programming such architectures is extremely difficult due to the need to implement the application using multiple programming models and combine them together in ad-hoc ways. To optimize distributed applications both for modern hardware and for modern programmers we need a programming model that is sufficiently expressive to support a variety of parallel applications, sufficiently performant to surpass hand-optimized sequential implementations, and sufficiently portable to support a variety of heterogeneous hardware. Unfortunately existing systems tend to fall short of these requirements. In this paper we introduce the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the necessary semantic knowledge to efficiently target distributed heterogeneous architectures. We show straightforward analyses that determine what data to distribute based on its usage as well as powerful transformations of nested patterns that restructure computation to enable distribution and optimize for heterogeneous devices. We present experimental results for a range of applications spanning multiple domains and demonstrate highly efficient execution compared to manually-optimized counterparts in multiple distributed programming models.
TL;DR: This paper analyzes Spark's primary framework and core technologies, runs a machine learning instance on it, and then analyzes the results and describes the hardware equipment used.
Abstract: Apache Spark is a distributed memory-based computing framework that is naturally suited to machine learning. Compared to Hadoop, Spark has better computing capability. In this paper, we analyze Spark's primary framework and core technologies, and run a machine learning instance on it. Finally, we analyze the results and describe our hardware equipment.
TL;DR: This work systematically explores what it considers to be the most promising part of the design space, precisely defining semantics and identifying implementation costs, which allows it to be much more explicit and precise about semantic and implementation trade-offs that were usually glossed over in prior work.
Abstract: It is expected that DRAM memory will be augmented, and perhaps eventually replaced, by one of several up-and-coming memory technologies. These are all non-volatile, in that they retain their contents without power. This allows primary memory to be used as a fast disk replacement. It also enables more aggressive programming models that directly leverage persistence of primary memory. However, it is challenging to maintain consistency of memory in such an environment. There is no consensus on the right programming model for doing so, and subtle differences can have large, and sometimes surprising, effects on the implementation and its performance. The existing literature describes multiple programming systems that provide point solutions to the selective persistence for user data structures. Real progress in this area requires a choice of programming model, which we cannot reasonably make without a real understanding of the design space. Point solutions are insufficient. We systematically explore what we consider to be the most promising part of the space, precisely defining semantics and identifying implementation costs. This allows us to be much more explicit and precise about semantic and implementation trade-offs that were usually glossed over in prior work. It also exposes some promising new design alternatives.
TL;DR: Three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques are proposed, which move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path.
Abstract: Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as hazard pointers are highly robust, but they are expensive because they require a large number of memory barriers. This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. These new proposals are backward-compatible with existing code that uses hazard pointers. They move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path. These proposals include (1) exploiting the operating system's memory protection ability, (2) exploiting certain x86 hardware features to trigger memory barriers only when needed, and (3) a novel hardware-assisted mechanism, called a hazard lookaside buffer (HLB) that allows a reclaiming thread to query whether there are hazardous pointers that need to be flushed to memory. We evaluate our proposals using a few fundamental data structures (linked lists and skiplists) and libcuckoo, a recent high-throughput hash-table library, and show significant improvements over the hazard pointer technique.
TL;DR: A survey of architectural techniques for using domain-wall memory (DWM) to design components in both CPU and GPU is presented; techniques related to performance, energy, and reliability are discussed, along with works that compare DWM with other memory technologies.
Abstract: Recent trends of increasing core-count and bandwidth/memory wall have motivated researchers to explore novel memory technologies for designing processor components such as cache, register file, shared memory, and so on. Domain-wall memory (DWM), also known as racetrack memory, is a promising emerging technology due to its non-volatility and very high density. However, use of DWM presents challenges due to characteristics of both DWM itself (e.g., requirement of shift operations, variable latency) and processor components. Recently, several techniques have been proposed to address these challenges. This article presents a survey of architectural techniques for using DWM for designing components in both CPU and GPU. We discuss techniques related to performance, energy, and reliability and also discuss works that compare DWM with other memory technologies. We also highlight the opportunities and obstacles in using DWM for designing processor components. This survey is expected to spark further research in this area and be useful for researchers, chip designers, and computer architects.
TL;DR: Evaluating the benefits of placing hardware accelerators at the bottom layer of a 3D stacked memory system compared to accelerators that are placed external to the memory stack shows that, for important data intensive kernels, near-memory accelerators inside a single 3D memory package provide 3x-13x speedup over a Quad-core Xeon processor.
Abstract: Emerging 3D stacked memory systems provide significantly more bandwidth than current DDR modules. However, general purpose processors do not take full advantage of these resources offered by the memory modules. Taking advantage of the increased bandwidth requires the use of specialized processing units. In this paper, we evaluate the benefits of placing hardware accelerators at the bottom layer of a 3D stacked memory system compared to accelerators that are placed external to the memory stack. Our evaluation of the design using cycle-accurate simulation and RTL synthesis shows that, for important data intensive kernels, near-memory accelerators inside a single 3D memory package provide 3x-13x speedup over a Quad-core Xeon processor. Most of the benefits are from the application of accelerators, as the near-memory configurations provide marginal benefits compared to the same number of accelerators placed on a die external to the memory package. This comparable performance for external accelerators is due to the high bandwidth afforded by the high-speed off-chip links. On the other hand, near-memory accelerators consume 7%–39% less energy than the external accelerators.
TL;DR: MITTS (Memory Inter-arrival Time Traffic Shaping) is proposed, a simple, distributed hardware mechanism that limits memory traffic at the source (Core or LLC), enabling fine-grain bandwidth allocation; it is evaluated across SPECint, PARSEC, Apache, and bhm Mail Server workloads.
Abstract: Memory bandwidth severely limits the scalability and performance of multicore and manycore systems. Application performance can be very sensitive to both the delivered memory bandwidth and latency. In multicore systems, a memory channel is usually shared by multiple cores. Having the ability to precisely provision, schedule, and isolate memory bandwidth and latency on a per-core basis is particularly important when different memory guarantees are needed on a per-customer, per-application, or per-core basis. Infrastructure as a Service (IaaS) Cloud systems, and even general purpose multicores optimized for application throughput or fairness all benefit from the ability to control and schedule memory access on a fine-grain basis. In this paper, we propose MITTS (Memory Inter-arrival Time Traffic Shaping), a simple, distributed hardware mechanism which limits memory traffic at the source (Core or LLC). MITTS shapes memory traffic based on memory request inter-arrival time, enabling fine-grain bandwidth allocation. In an IaaS system, MITTS enables Cloud customers to express their memory distribution needs and pay commensurately. For instance, MITTS enables charging customers that have bursty memory traffic more than customers with uniform memory traffic for the same aggregate bandwidth. Beyond IaaS systems, MITTS can also be used to optimize for throughput or fairness in a general purpose multi-program workload. MITTS uses an online genetic algorithm to configure hardware bins, which can adapt for program phases and variable input sets. We have implemented MITTS in Verilog and have taped-out the design in a 25-core 32nm processor and find that MITTS requires less than 0.9% of core area. 
We evaluate across SPECint, PARSEC, Apache, and bhm Mail Server workloads, and find that MITTS achieves an average 1.18× performance gain compared to the best static bandwidth allocation, a 2.69× average performance/cost advantage in an IaaS setting, and up to 1.17× better throughput and 1.52× better fairness when compared to conventional memory bandwidth provisioning techniques.
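The idea of shaping traffic by request inter-arrival time can be illustrated with a small software sketch. Everything below is an assumption made for illustration and is not the MITTS hardware design: the function name `shape_requests`, the equal-width bins, the rule that a request with no credit in its own bin is delayed into the next bin that still has credit, the unlimited catch-all last bin, and the absence of credit replenishment. Times are integer cycles.

```python
def shape_requests(arrival_times, bin_width, credits):
    """Return the cycle at which each memory request is allowed to issue.

    credits[i] is the number of requests allowed whose gap to the previous
    issued request falls in [i * bin_width, (i + 1) * bin_width).
    Assumption: the last bin is a catch-all with unlimited credits, and
    credits are never replenished in this simplified sketch.
    """
    remaining = list(credits)
    last_bin = len(remaining) - 1
    issue_times = []
    last_issue = None

    for arrival in arrival_times:
        if last_issue is None:
            # The first request has no predecessor, so it issues immediately.
            issue_times.append(arrival)
            last_issue = arrival
            continue

        # Requests issue in order, never before the previous one.
        earliest = max(arrival, last_issue)
        gap = earliest - last_issue
        bin_index = min(gap // bin_width, last_bin)

        # No credit left for this gap: move to a longer-gap bin.
        while bin_index < last_bin and remaining[bin_index] == 0:
            bin_index += 1
        if bin_index < last_bin:
            remaining[bin_index] -= 1

        # Delay the request until its gap actually falls into the chosen bin.
        issue = max(earliest, last_issue + bin_index * bin_width)
        issue_times.append(issue)
        last_issue = issue

    return issue_times


if __name__ == "__main__":
    # A burst of four requests followed by an isolated one.
    print(shape_requests([0, 1, 2, 3, 50], bin_width=10, credits=[2, 1, 0]))
```

In this sketch a bursty source quickly exhausts the short-gap credits and is slowed down, while a source with uniformly spaced requests passes through unchanged, which mirrors the distinction the abstract draws between bursty and uniform traffic at the same aggregate bandwidth.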
TL;DR: This paper highlights a recent development within the software package that allows the dominant computational task, solving a set of complex linear systems, to be performed with a distributed memory solver.
Abstract: The FEAST algorithm and eigensolver for interior eigenvalue problems naturally possesses three distinct levels of parallelism. The solver is then suited to exploit modern computer architectures containing many interconnected processors. This paper highlights a recent development within the software package that allows the dominant computational task, solving a set of complex linear systems, to be performed with a distributed memory solver. The software, written with a reverse-communication-interface, can now be interfaced with any generic MPI linear-system solver using a customized data distribution for the eigenvector solutions. This work utilizes two common "black-box" distributed memory linear-systems solvers (Cluster-MKL-Pardiso and MUMPS), as well as our own application-specific domain-decomposition MPI solver, for a collection of 3-dimensional finite-element systems. We discuss and analyze how parallel resources can be placed at all three levels simultaneously in order to achieve good scalability and optimal use of the computing platform.
TL;DR: Two separate techniques to reduce power consumption and area are shown: the main technique is using Scratch Pad Memory (SPM) in place of cache, and the second is using a Network on Chip (NOC) instead of the Advanced Microcontroller Bus Architecture (AMBA) as the communication medium between processors.
Abstract: Multiprocessor system-on-chip (MPSoC) architectures have risen as a prevalent answer to ever-increasing performance and power consumption requirements. MPSoCs that are customized to a specific application have the potential to achieve efficient area while also meeting low power consumption requirements. The power consumption and area of the system depend largely on the memory and communication medium of the processors, and several issues arise in memory communication between processors. In this paper we address those issues and show two separate techniques to reduce power consumption and area. The main technique is using Scratch Pad Memory (SPM) in place of cache; the second is using a Network on Chip (NOC) instead of the Advanced Microcontroller Bus Architecture (AMBA) as the communication medium between processors.
TL;DR: RUMA (Rewired User-space Memory Access) allows for physiological data management: developers can freely rewire the mappings from virtual to physical memory (in user space) while at the same time exploiting the virtual memory support offered by hardware and operating system.
Abstract: Memory management is one of the most boring topics in database research. It plays a minor role in tasks like free-space management or efficient space usage. Here and there we also realize its impact on database performance when worrying about NUMA-aware memory allocation, data compacting, snapshotting, and defragmentation. But, overall, let's face it: the entire topic sounds as exciting as 'garbage collection' or 'debugging a program for memory leaks'. What if there were a technique that would promote memory management from a third class helper thingie to a first class citizen in algorithm and systems design? What if that technique turned the role of memory management in a database system (and any other data processing system) upside-down? What if that technique could be identified as a key for re-designing various core algorithms with the effect of outperforming existing state-of-the-art methods considerably? Then we would write this paper. We introduce RUMA: Rewired User-space Memory Access. It allows for physiological data management, i.e. we allow developers to freely rewire the mappings from virtual to physical memory (in user space) while at the same time exploiting the virtual memory support offered by hardware and operating system. We show that fundamental database building blocks such as array operations, partitioning, sorting, and snapshotting benefit strongly from RUMA.
TL;DR: The dCUDA programming model is introduced, which implements device-side remote memory access with target notification for GPU clusters and applies latency hiding at cluster scale to automatically overlap computation and communication.
Abstract: Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks.
TL;DR: It is shown that the task paradigm makes it possible to control the memory footprint of the application by throttling the task submission flow rate, striking a compromise between the performance benefits of anticipative task submission and the resulting memory consumption.
Abstract: The ever-increasing supercomputer architectural complexity emphasizes the need for high-level parallel programming paradigms. Among such paradigms, task-based programming manages to abstract away much of the architecture complexity while efficiently meeting the performance challenge, even at large scale. Dynamic run-time systems are typically used to execute task-based applications, to schedule computation resource usage and memory allocations. While computation scheduling has been well studied, the dynamic management of memory resource subscription inside such run-times has been little explored. This paper studies the cooperation between a task-based distributed application code and a run-time system engine to control the memory subscription levels throughout the execution. We show that the task paradigm makes it possible to control the memory footprint of the application by throttling the task submission flow rate, striking a compromise between the performance benefits of anticipative task submission and the resulting memory consumption. We illustrate the benefits of our contribution on a compressed dense linear algebra distributed application.
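Throttling task submission against a memory budget can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the run-time system studied in the paper: the class `ThrottledSubmitter`, the per-task `mem_cost` estimate supplied by the caller, and the use of a Python thread pool are all hypothetical choices made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThrottledSubmitter:
    """Blocks task submission while in-flight tasks would exceed a memory budget."""

    def __init__(self, max_workers, memory_budget):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._budget = memory_budget
        self._in_use = 0          # memory subscribed by submitted, unfinished tasks
        self.peak = 0             # highest subscription level observed
        self._cond = threading.Condition()

    def submit(self, fn, mem_cost, *args):
        if mem_cost > self._budget:
            raise ValueError("task can never fit in the memory budget")
        with self._cond:
            # Throttle: the submitting thread waits until the task fits.
            while self._in_use + mem_cost > self._budget:
                self._cond.wait()
            self._in_use += mem_cost
            self.peak = max(self.peak, self._in_use)
        return self._pool.submit(self._run, fn, mem_cost, *args)

    def _run(self, fn, mem_cost, *args):
        try:
            return fn(*args)
        finally:
            # Release the subscription and wake a waiting submitter.
            with self._cond:
                self._in_use -= mem_cost
                self._cond.notify_all()

    def shutdown(self):
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    submitter = ThrottledSubmitter(max_workers=4, memory_budget=100)
    futures = [submitter.submit(lambda x: x * x, 40, i) for i in range(10)]
    print([f.result() for f in futures], "peak subscription:", submitter.peak)
    submitter.shutdown()
```

A larger budget lets the submitter run further ahead of execution, which exposes more parallelism at the cost of a higher peak footprint; a smaller budget does the opposite. That is the compromise the abstract describes.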
TL;DR: It is shown that a computational cognitive model that assumes a distributed version of working memory accounts for both behavioral and neuroimaging data better than a model that takes a more centralized approach.
TL;DR: This paper proposes an OpenMP4 run-time that reduces memory consumption while providing the same performance, and relies on a new compiler pass capable of generating the task dependency graph of OpenMP programs, which is then efficiently stored in memory.
Abstract: OpenMP is increasingly being adopted by current many-core embedded processors to exploit their parallel computation capabilities. Unfortunately, current run-time implementations of the latest specification (v4.0) are not suitable for processors relying on small and fast on-chip memories, due to their memory consumption. This paper proposes an OpenMP4 run-time that reduces memory consumption while providing the same performance. Our run-time relies on a new compiler pass capable of generating the task dependency graph of OpenMP programs, which is then efficiently stored in memory.
TL;DR: To the authors' knowledge, Kmerind is the first k-mer indexing library for distributed memory environments, and the first fully customizable and extensible library for general k-mer indexing and counting.
Abstract: Counting and indexing fixed length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, there exist few bioinformatics tools and libraries that can efficiently process the data sets at the current generation rate of 1.8 terabases every 3 days. We present Kmerind, a high performance k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Using Kmerind, a user can easily instantiate application-specific indices, such as k-mer counter and position index, from built-in or user-supplied components without extensive high performance computing expertise. Kmerind's k-mer counter performs similarly or better than existing, best-in-class k-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read data set in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1% of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. To our knowledge, Kmerind is the first k-mer indexing library for distributed memory environments, and the first fully customizable and extensible library for general k-mer indexing and counting. Kmerind is available from https://github.com/ParBLiSS/kmerind.
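Illustration: for readers unfamiliar with the two index types named in the abstract above, the following is a minimal sequential Python sketch of a k-mer counter and a k-mer position index. It does not use or imitate the Kmerind API; the function names count_kmers and index_kmer_positions are hypothetical, and the sketch ignores everything that makes the real problem hard (distribution across nodes, compact encodings, reverse complements).

```python
from collections import Counter, defaultdict


def count_kmers(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def index_kmer_positions(seq, k):
    """Map each k-mer to the sorted list of start offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return dict(index)


if __name__ == "__main__":
    read = "ACGTACGT"
    print(count_kmers(read, 3))
    print(index_kmer_positions(read, 3))
```

For the read "ACGTACGT" with k = 3, the k-mers ACG and CGT each occur twice (at offsets 0, 4 and 1, 5), while GTA and TAC occur once.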
TL;DR: DegAwareRHH, a high performance dynamic graph data store designed for scaling out to store large, scale-free graphs by leveraging compact hash tables with high data locality, is proposed and extended for multiple processes and distributed memory.
Abstract: In many graph applications, the structure of the graph changes dynamically over time and may require real time analysis. However, constructing a large graph is expensive, and most studies for large graphs have not focused on a dynamic graph data structure, but rather a static one. To address this issue, we propose DegAwareRHH, a high performance dynamic graph data store designed for scaling out to store large, scale-free graphs by leveraging compact hash tables with high data locality. We extend DegAwareRHH for multiple processes and distributed memory, and perform dynamic graph construction on large scale-free graphs using emerging 'Big Data HPC' systems such as the Catalyst cluster at LLNL. We demonstrate that DegAwareRHH processes a request stream 206.5x faster than a state-of-the-art shared-memory dynamic graph processing framework, when both implementations use 24 threads/processes to construct a graph with 1 billion edge insertion requests and 54 million edge deletion requests. DegAwareRHH also achieves a processing rate of over 2 billion edge insertion requests per second using 128 compute nodes to construct a large-scale web graph, containing 128 billion edges, the largest open-source real graph dataset to our knowledge.
TL;DR: Simulation results show that the combined techniques reduce the performance degradation for supporting full confidentiality and integrity, from 25-34 percent to less than 8-14 percent in 8-core and 16-core secure processors, with minimal extra hardware costs.
Abstract: To prevent physical attacks on systems, secure processors have been proposed to reduce the trusted computing base to the processor itself. In a secure processor, all off-chip data are encrypted and their integrity is protected. This paper investigates how the limited memory bandwidth of multi-core processors affects the design of secure processors. Although the performance of a single-core secure processor has improved significantly with the counter-mode encryption combined with Bonsai Merkle Tree, our results indicate that multi-core secure processors can suffer from significant performance degradation due to the limited memory bandwidth. To mitigate the performance overheads, this paper proposes three techniques for the multi-core design of secure processors. First, the paper advocates using a combined cache for all normal and security-supporting data. Second, the paper proposes memory scheduling and mapping schemes for secure processors. Finally, the paper investigates a type-aware cache insertion scheme considering the distinct characteristics of normal and security-supporting data. Our simulation results show that the combined techniques reduce the performance degradation for supporting full confidentiality and integrity, from 25-34 percent to less than 8-14 percent in 8-core and 16-core secure processors, with minimal extra hardware costs.
TL;DR: The objective is to fully utilize PCM to reduce the energy consumption while ensuring that the real-time performance of applications is guaranteed, and a two-phase approach is proposed to solve the hybrid main memory address mapping problem.
Abstract: In embedded systems, especially battery-driven mobile devices, energy is one of the most critical performance metrics. Due to its high density and low standby power, phase change memory (PCM), an emerging non-volatile memory device, is becoming a promising dynamic random access memory (DRAM) alternative. Recent studies have proposed the hybrid main memory architecture integrating both PCM and DRAM to fully take advantage of the properties of both memories. However, the low power performance of PCM in the hybrid main memory architecture has not been fully explored. Therefore, it becomes an interesting problem to utilize PCM and DRAM as hybrid main memory for energy optimization in embedded systems. In this paper, we present an energy optimization technique for hybrid main memory architecture. The objective is to fully utilize PCM to reduce the energy consumption while ensuring that the real-time performance of applications is guaranteed. We propose a two-phase approach to solve the hybrid main memory address mapping problem. In the first phase, we calculate energy and time cost for each address based on the task models. Then the applications can be modeled as data-flow graph nodes, and different access times will associate with different energy consumption. In the second phase, for different memory types and the given timing constraint, we formulate the scheduling problem as an integer linear programming (ILP) model and obtain an optimal solution. The ILP model can map a proper memory type for each address such that the total energy consumption can be minimized while the timing constraint is satisfied. In addition, we propose a heuristic approach to efficiently obtain a near-optimal solution. We conduct experiments on an ARM-based simulator. The experimental results show that our method can effectively reduce the energy consumption with the least system cost compared with the previous work.
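Illustration: the second phase described in the abstract above chooses a memory type per address so that total energy is minimal under a timing constraint. The Python sketch below shows that optimization on a toy instance by exhaustive search rather than by an ILP solver, so it is a stand-in for the technique and not the authors' formulation. The function name best_mapping, the cost table, and the assumption that the total time is simply the sum of per-address access times are all hypothetical simplifications; the paper models applications as data-flow graphs, which this sketch does not attempt.

```python
from itertools import product

MEMORY_TYPES = ("DRAM", "PCM")


def best_mapping(costs, deadline):
    """Pick a memory type per address minimizing energy within the deadline.

    costs[i][mem] is a hypothetical (energy, time) pair for placing
    address i in memory type mem. Returns (energy, time, choice) for the
    cheapest feasible assignment, or None if no assignment is feasible.
    Exhaustive search: only practical for a handful of addresses.
    """
    best = None
    for choice in product(MEMORY_TYPES, repeat=len(costs)):
        energy = sum(costs[i][mem][0] for i, mem in enumerate(choice))
        time = sum(costs[i][mem][1] for i, mem in enumerate(choice))
        if time <= deadline and (best is None or energy < best[0]):
            best = (energy, time, choice)
    return best


if __name__ == "__main__":
    # Made-up numbers: PCM is cheaper in energy but slower than DRAM here.
    costs = [
        {"DRAM": (5, 1), "PCM": (2, 3)},
        {"DRAM": (6, 1), "PCM": (1, 4)},
        {"DRAM": (4, 2), "PCM": (3, 3)},
    ]
    print(best_mapping(costs, deadline=7))
```

With these made-up costs and a deadline of 7, the cheapest feasible assignment places only the second address in PCM. Placing more addresses in PCM would save energy but miss the deadline, which is the tension the ILP model resolves at scale.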
TL;DR: The system proposed here is implemented on a Field Programmable Gate Array (FPGA) Zynq XC7Z020 board using a Modified Background Subtraction algorithm for real-time object detection and tracking, programmed in VHDL.
Abstract: Computer vision has played a key role in developing object detection and tracking techniques for surveillance systems. Most of the implementations currently employed are based on serial execution on general-purpose processors. But the high cost and complexity of such implementations do not make them a viable option for real-time surveillance systems. The system proposed here is implemented on a Field Programmable Gate Array (FPGA) Zynq XC7Z020 board using a Modified Background Subtraction algorithm for real-time object detection and tracking. The presence of numerous configurable logic blocks, distributed memory and hard Digital Signal Processing (DSP) modules offers great flexibility in achieving temporal and spatial parallelism. This paper uses Xilinx ISE software for implementation, which is programmed in VHDL. The OV7670 camera used in the paper has a resolution of 0.3 megapixels and captures video at 30 fps. The reference frame and the subsequent incoming frames are stored in different memory modules before the Modified Background Subtraction algorithm is applied to these frames to obtain the difference image. After comparing it with the threshold, the resultant image is displayed and its addresses are stored in order to track it. The system works in real time with minimum time lag between the capture and display. Moreover, the entire system is optimized in terms of speed, memory requirements as well as the number of logic elements used, which makes it suitable for application in real-time surveillance systems.
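Illustration: the pipeline in the abstract above (difference against a reference frame, comparison with a threshold, storing the addresses of detected pixels for tracking) can be sketched in software as follows. This is plain background subtraction in Python on small grayscale grids, not the paper's modified algorithm or its VHDL implementation; the function names and the bounding-box step used for tracking are assumptions made for illustration.

```python
def subtract_background(reference, frame, threshold):
    """Return a binary mask: 1 where |frame - reference| exceeds threshold."""
    return [
        [1 if abs(f - r) > threshold else 0 for r, f in zip(ref_row, frame_row)]
        for ref_row, frame_row in zip(reference, frame)
    ]


def foreground_addresses(mask):
    """List the (row, col) addresses of all foreground pixels, row by row."""
    return [
        (y, x)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]


def bounding_box(addresses):
    """Return (min_row, min_col, max_row, max_col), or None if empty."""
    if not addresses:
        return None
    rows = [y for y, _ in addresses]
    cols = [x for _, x in addresses]
    return (min(rows), min(cols), max(rows), max(cols))


if __name__ == "__main__":
    reference = [[10, 10, 10, 10], [10, 10, 10, 10], [10, 10, 10, 10]]
    frame = [[10, 12, 10, 10], [10, 90, 95, 10], [10, 10, 88, 10]]
    mask = subtract_background(reference, frame, threshold=20)
    print(mask)
    print(bounding_box(foreground_addresses(mask)))
```

Each pixel is processed independently of its neighbours, which is what makes this step a natural fit for the spatial parallelism of an FPGA mentioned in the abstract.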