TL;DR: This paper presents a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
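The two routines described above can be illustrated with a minimal sequential sketch. The names `vertex_map`, `edge_map`, `bfs`, and the `threshold_div` parameter are hypothetical and are not taken from the paper; the density test (switching to a dense traversal when the frontier and its outgoing edges exceed a fraction of all edges) is an assumed heuristic, and the graph is assumed to be undirected and stored as symmetric adjacency lists.

```python
from typing import Callable, Dict, List, Set

Graph = Dict[int, List[int]]  # symmetric adjacency lists (assumption)


def vertex_map(vertices: Set[int], fn: Callable[[int], bool]) -> Set[int]:
    """Apply fn to every vertex in the subset; keep those for which it returns True."""
    return {v for v in vertices if fn(v)}


def edge_map(graph: Graph, frontier: Set[int],
             update: Callable[[int, int], bool],
             cond: Callable[[int], bool],
             threshold_div: int = 20) -> Set[int]:
    """Apply update(src, dst) to edges leaving the frontier; return the new frontier."""
    num_edges = sum(len(nbrs) for nbrs in graph.values())
    frontier_work = len(frontier) + sum(len(graph[v]) for v in frontier)
    out: Set[int] = set()
    if frontier_work > num_edges // threshold_div:
        # Dense frontier: each vertex that still satisfies cond scans its neighbours.
        for dst in graph:
            for src in graph[dst]:
                if not cond(dst):
                    break
                if src in frontier and update(src, dst):
                    out.add(dst)
    else:
        # Sparse frontier: each frontier vertex pushes along its outgoing edges.
        for src in frontier:
            for dst in graph[src]:
                if cond(dst) and update(src, dst):
                    out.add(dst)
    return out


def bfs(graph: Graph, root: int, threshold_div: int = 20) -> Dict[int, int]:
    """Breadth-first search expressed with edge_map; returns a parent map (-1 = unreached)."""
    parent = {v: -1 for v in graph}
    parent[root] = root

    def update(src: int, dst: int) -> bool:
        if parent[dst] == -1:
            parent[dst] = src
            return True
        return False

    def cond(dst: int) -> bool:
        return parent[dst] == -1

    frontier = {root}
    while frontier:
        frontier = edge_map(graph, frontier, update, cond, threshold_div)
    return parent
```

In this sketch the BFS itself is only a few lines; the choice between the sparse and dense traversal is hidden inside `edge_map`, which is the property the abstract attributes to the framework.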
TL;DR: Preliminary performance results show that the new solution, a simple yet effective alternative to the traditional MPI-based approaches, achieves competitive, if not superior, performance on large clusters.
Abstract: Parallelizing dense matrix computations to distributed memory architectures is a well-studied subject and generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid 1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed memory architectures within a single processor, these packages must be revisited since the traditional MPI-based approaches will likely need to be extended. Thus, this is a good time to review lessons learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show the new solution achieves competitive, if not superior, performance on large clusters.
TL;DR: A simple Memory-Interference-induced Slowdown Estimation model that estimates slowdowns caused by memory interference is presented, and two new memory scheduling schemes are developed: one that provides soft quality-of-service guarantees and another that explicitly attempts to minimize maximum slowdown in the system.
Abstract: Applications running concurrently on a multicore system interfere with each other at the main memory. This interference can slow down different applications differently. Accurately estimating the slowdown of each application in such a system can enable mechanisms that can enforce quality-of-service. While much prior work has focused on mitigating the performance degradation due to inter-application interference, there is little work on estimating the slowdown of individual applications in a multi-programmed environment. Our goal in this work is to build such an estimation scheme. To this end, we present our simple Memory-Interference-induced Slowdown Estimation (MISE) model that estimates slowdowns caused by memory interference. We build our model based on two observations. First, the performance of a memory-bound application is roughly proportional to the rate at which its memory requests are served, suggesting that request-service-rate can be used as a proxy for performance. Second, when an application's requests are prioritized over all other applications' requests, the application experiences very little interference from other applications. This provides a means for estimating the uninterfered request-service-rate of an application while it is run alongside other applications. Using the above observations, our model estimates the slowdown of an application as the ratio of its uninterfered and interfered request service rates. We propose simple changes to the above model to estimate the slowdown of non-memory-bound applications. We demonstrate the effectiveness of our model by developing two new memory scheduling schemes: 1) one that provides soft quality-of-service guarantees and 2) another that explicitly attempts to minimize maximum slowdown (i.e., unfairness) in the system. Evaluations show that our techniques perform significantly better than state-of-the-art memory scheduling approaches to address the above problems.
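The ratio at the core of the model can be written down directly. The sketch below covers only the memory-bound case stated in the abstract; the function names and the `(uninterfered, interfered)` tuple layout are hypothetical, and the adjustment for non-memory-bound applications is not shown because the abstract does not specify it.

```python
from typing import Dict, Tuple


def estimate_slowdown(uninterfered_rate: float, interfered_rate: float) -> float:
    """Slowdown of a memory-bound application as the ratio of its
    uninterfered to its interfered request service rate."""
    if uninterfered_rate <= 0 or interfered_rate <= 0:
        raise ValueError("request service rates must be positive")
    return uninterfered_rate / interfered_rate


def max_slowdown(rates: Dict[str, Tuple[float, float]]) -> float:
    """Largest estimated slowdown across applications (the unfairness measure).

    `rates` maps an application name to (uninterfered_rate, interfered_rate)."""
    return max(estimate_slowdown(alone, shared) for alone, shared in rates.values())
```

For instance, an application served at 4 requests per unit time when prioritized and at 2 when sharing memory would be estimated to run at half speed (slowdown 2.0).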
TL;DR: In this article, a switching device is configured to route memory requests based on the mappings between data addresses associated with memory requests from a consumer device relating to a data object and information relating to storage location in the one or more memory resources associated with the data from the data object.
Abstract: Systems, methods and devices for distributed memory management comprising a network component configured for network communication with one or more memory resources that store data and one or more consumer devices that use data, the network component comprising a switching device in operative communication with a mapping resource, wherein the mapping resource is configured to associate mappings between data addresses associated with memory requests from a consumer device relating to a data object and information relating to a storage location in the one or more memory resources associated with the data from the data object, wherein each data address has contained therein identification information for identifying the data from the data object associated with that data address; and the switching device is configured to route memory requests based on the mappings.
TL;DR: A parallel FFT framework that is based on a combination of local FFTs, local data permutations, and global data transpositions is proposed that can be generalized to arbitrary multidimensional data and process meshes.
Abstract: We present a software library for computing fast Fourier transforms (FFTs) on massively parallel, distributed memory architectures based on the Message Passing Interface (MPI) standard. Similar to established transpose FFT algorithms, we propose a parallel FFT framework that is based on a combination of local FFTs, local data permutations, and global data transpositions. This framework can be generalized to arbitrary multidimensional data and process meshes. All performance-relevant building blocks can be implemented with the help of the FFTW software library. Therefore, our library offers great flexibility and portable performance. Similarly to FFTW, we are able to compute FFTs of complex data, real data, and even- or odd-symmetric real data. All the transforms can be performed completely in place. Furthermore, we propose an algorithm to calculate pruned FFTs more efficiently on distributed memory architectures. For example, we provide performance measurements of FFTs of sizes between $512^3$ ...
TL;DR: In this article, NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion), a lock-free parallel distributed algorithm for matrix completion, is proposed.
Abstract: We develop an efficient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion). NOMAD is a decentralized algorithm with non-blocking communication between processors. One of the key features of NOMAD is that the ownership of a variable is asynchronously transferred between processors in a decentralized fashion. As a consequence it is a lock-free parallel algorithm. In spite of being an asynchronous algorithm, the variable updates of NOMAD are serializable, that is, there is an equivalent update ordering in a serial implementation. NOMAD outperforms synchronous algorithms which require explicit bulk synchronization after every iteration: our extensive empirical evaluation shows that not only does our algorithm perform well in a distributed setting on commodity hardware, but it also outperforms state-of-the-art algorithms on an HPC cluster both in multi-core and distributed memory settings.
TL;DR: An efficient MPI-based distributed memory parallel algorithm, called PATRIC, for counting triangles in massive networks, which scales well to networks with billions of nodes and can compute the exact number of triangles in a network with one billion nodes and 10 billion edges in 16 minutes.
Abstract: Massive networks arising in numerous application areas pose significant challenges for network analysts as these networks grow to billions of nodes and become too large to fit in the main memory. Finding the number of triangles in a network is an important problem in the analysis of complex networks. Several interesting graph mining applications depend on the number of triangles in the graph. In this paper, we present an efficient MPI-based distributed memory parallel algorithm, called PATRIC, for counting triangles in massive networks. PATRIC scales well to networks with billions of nodes and can compute the exact number of triangles in a network with one billion nodes and 10 billion edges in 16 minutes. Balancing computational loads among processors for a graph problem like counting triangles is a challenging issue. We present and analyze several schemes for balancing load among processors for the triangle counting problem. These schemes achieve very good load balancing. We also show how our parallel algorithm can adapt an existing edge sparsification technique to approximate the number of triangles with very high accuracy. This modification allows us to count triangles in even larger networks.
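For orientation, the exact triangle counting problem itself can be sketched sequentially. This is not PATRIC: it is a common single-machine technique (order the nodes, keep only higher-numbered neighbours, and intersect neighbour sets along each edge), with no MPI, partitioning, or load balancing. The function name `count_triangles` is hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def count_triangles(edges: Iterable[Tuple[int, int]]) -> int:
    """Count triangles in an undirected simple graph given as an edge list."""
    higher: Dict[int, Set[int]] = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        higher.setdefault(a, set()).add(b)  # keep only the higher-numbered neighbour
        higher.setdefault(b, set())
    # For each edge (u, v) with u < v, every common higher neighbour w closes
    # exactly one triangle u < v < w, so each triangle is counted once.
    return sum(len(higher[u] & higher[v]) for u in higher for v in higher[u])
```

The per-node work in this scheme depends on neighbour-set sizes, which hints at why balancing load across processors, as the abstract notes, is not trivial for this problem.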
TL;DR: To the best of the authors' knowledge, this is the first work reporting end-to-end fully automatic distributed-memory parallelization and code generation for input programs and transformation techniques as general as those the authors allow.
Abstract: We present new techniques for compilation of arbitrarily nested loops with affine dependences for distributed-memory parallel architectures. Our framework is implemented as a source-level transformer that uses the polyhedral model, and generates parallel code with communication expressed with the Message Passing Interface (MPI) library. Compared to all previous approaches, ours is a significant advance either (1) with respect to the generality of input code handled, or (2) efficiency of communication code, or both. We provide experimental results on a cluster of multicores demonstrating its effectiveness. In some cases, code we generate outperforms manually parallelized codes, and in another case is within 25% of it. To the best of our knowledge, this is the first work reporting end-to-end fully automatic distributed-memory parallelization and code generation for input programs and transformation techniques as general as those we allow.
TL;DR: This paper compares in detail the OpenMP implementation of two applications of Jacobi solvers with the corresponding implementation of these applications using HPX, the first feature-complete, open-source implementation of ParalleX, and analyzes the results of both implementations on a multi-socket NUMA node.
Abstract: In the prospect of the upcoming exa-scale era with millions of execution units, the question of how to deal with this level of parallelism efficiently is of time-critical relevance. State-of-the-art parallelization techniques such as OpenMP and MPI are not guaranteed to solve the expected problems of starvation, growing latencies, overheads, and contention. On the other hand, new parallelization paradigms promise to efficiently hide latencies and contain starvation and contention.
In this paper we analyze the performance of one novel parallelization strategy for shared and distributed memory machines. We will focus on shared memory architectures and compare the performance of the ParalleX execution model against the quasi-standard OpenMP for a standard stencil-based problem. We compare in detail the OpenMP implementation of two applications of Jacobi solvers (one based on regular grid and one based on an irregular grid structure) with the corresponding implementation of these applications using HPX (High Performance ParalleX), the first feature-complete, open-source implementation of ParalleX, and analyze the results of both implementations on a multi-socket NUMA node.
TL;DR: A tighter lower bound on latency is provided, which dictates that the latency overhead is necessary to reduce bandwidth along the critical path of execution, and achieves impressive performance and scaling to 24,576 cores of a Cray XE6 supercomputer by utilizing well-tuned intra-node kernels within the distributed memory algorithm.
Abstract: We consider distributed memory algorithms for the all-pairs shortest paths (APSP) problem. Scaling the APSP problem to high concurrencies requires both minimizing inter-processor communication as well as maximizing temporal data locality. The 2.5D APSP algorithm, which is based on the divide-and-conquer paradigm, satisfies both of these requirements: it can utilize any extra available memory to perform asymptotically less communication, and it is rich in semiring matrix multiplications, which have high temporal locality. We start by introducing a block-cyclic 2D (minimal memory) APSP algorithm. With a careful choice of block-size, this algorithm achieves known communication lower-bounds for latency and bandwidth. We extend this 2D block-cyclic algorithm to a 2.5D algorithm, which can use $c$ extra copies of data to reduce the bandwidth cost by a factor of $c^{1/2}$, compared to its 2D counterpart. However, the 2.5D algorithm increases the latency cost by $c^{1/2}$. We provide a tighter lower bound on latency, which dictates that the latency overhead is necessary to reduce bandwidth along the critical path of execution. Our implementation achieves impressive performance and scaling to 24,576 cores of a Cray XE6 supercomputer by utilizing well-tuned intra-node kernels within the distributed memory algorithm.
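The semiring matrix multiplication mentioned above is multiplication over (min, +) rather than (+, ×). The sequential sketch below shows that kernel and uses it for APSP by repeated squaring, which is a simpler substitute for the divide-and-conquer 2.5D algorithm of the paper and is not the authors' method. The names `min_plus` and `apsp` are hypothetical, and the sketch assumes a dense weight matrix with no negative cycles.

```python
from typing import List

INF = float("inf")  # marks a missing edge
Matrix = List[List[float]]


def min_plus(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product over the (min, +) semiring: c[i][j] = min_k a[i][k] + b[k][j]."""
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def apsp(weights: Matrix) -> Matrix:
    """All-pairs shortest paths by repeated min-plus squaring."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for i in range(n):
        dist[i][i] = min(dist[i][i], 0)  # a vertex reaches itself at cost 0
    hops = 1  # dist currently covers paths of at most `hops` edges
    while hops < n - 1:
        dist = min_plus(dist, dist)
        hops *= 2
    return dist
```

Each call to `min_plus` has the same triple-loop structure as ordinary matrix multiplication, which is where the high temporal locality noted in the abstract comes from.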
TL;DR: Efficient numerical integrators such as the one demonstrated here bring closer to reality the goal of implementing fully distributed real-time flood forecasting systems supported by physics based hydrological models and high-quality/high-resolution rainfall products.
TL;DR: Hash-Distributed A* (HDA*), a simple approach to parallel best-first search that asynchronously distributes and schedules work among processors based on a hash function of the search state, scales well, allowing the effective utilization of large amounts of distributed memory to optimally solve problems which require terabytes of RAM.
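The hash-based distribution of work can be sketched as a mapping from a search state to the processor that owns it. Everything below is illustrative: the names `owner` and `route` are hypothetical, the use of CRC32 over the state's `repr` is an assumed stand-in for whatever hash function an implementation would choose, and the search itself (open lists, heuristics, termination detection) is omitted.

```python
import zlib
from typing import Hashable, Iterable, List


def owner(state: Hashable, num_procs: int) -> int:
    """Processor that owns `state`: a deterministic hash of the state, modulo
    the processor count. CRC32 is used because it is stable across processes."""
    return zlib.crc32(repr(state).encode("utf-8")) % num_procs


def route(successors: Iterable[Hashable], num_procs: int) -> List[list]:
    """Place each generated successor in the inbox of its owning processor."""
    inboxes: List[list] = [[] for _ in range(num_procs)]
    for state in successors:
        inboxes[owner(state, num_procs)].append(state)
    return inboxes
```

Because every copy of a state is sent to the same owner, duplicate detection can stay local to that processor.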
TL;DR: An automated cache management system that both intelligently decides how to scale a distributed caching system and uses a new, adaptive partitioning algorithm that ensures that load is evenly distributed despite variations in object size and popularity is presented.
Abstract: Distributed in-memory caching systems such as memcached have become crucial for improving the performance of web applications. However, memcached by itself does not control which node is responsible for each data object, and inefficient partitioning schemes can easily lead to load imbalances. Further, a statically sized memcached cluster can be insufficient or inefficient when demand rises and falls. In this paper we present an automated cache management system that both intelligently decides how to scale a distributed caching system and uses a new, adaptive partitioning algorithm that ensures that load is evenly distributed despite variations in object size and popularity. We have implemented an adaptive hashing system as a proxy and node control framework for memcached, and evaluate it on EC2 using a set of realistic benchmarks including database dumps and traces from Wikipedia.
TL;DR: In this paper, a packet handling system is described that can include at least one main processor, a plurality of offload processors connected to a memory bus and configured to provide security related services on packets prior to redirection to the main processor.
Abstract: A packet handling system is disclosed that can include at least one main processor, a plurality of offload processors connected to a memory bus and configured to provide security related services on packets prior to redirection to the main processor; an arbiter connected to each of the plurality of offload processors, the arbiter capable of scheduling resource priority for instructions or data received from the memory bus; and a virtual switch respectively connected to the main processor and the plurality of offload processors using the memory bus, with the virtual switch capable of receiving memory read/write data over the memory bus, and further directing at least some memory read/write data to the arbiter.
TL;DR: A set of simple algorithmic changes coupled with commonly available OS functionality suffice to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks and lead to a fairer and more accurate performance evaluation on NUMA-multicore systems.
Abstract: An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA) the performance critically depends on the distribution of data and computations. The actual memory access patterns have a large influence on performance on systems with aggressive prefetcher units. This paper describes an analysis of the memory system performance of multithreaded programs and shows that some programs are (unintentionally) structured so that they use the memory system of today's NUMA-multicores inefficiently: Programs exhibit program-level data sharing, a performance-limiting factor that makes data and computation distribution in NUMA systems difficult. Moreover, many programs have irregular memory access patterns that are hard to predict by processor prefetcher units. The memory system performance as observed for a given program on a specific platform depends also on many algorithm and implementation decisions. The paper shows that a set of simple algorithmic changes coupled with commonly available OS functionality suffice to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks. These simple source-level changes result in performance improvements of up to 3.1X, but more importantly, they lead to a fairer and more accurate performance evaluation on NUMA-multicore systems. They also illustrate the importance of carefully considering all details of algorithms and architectures to avoid drawing incorrect conclusions.
TL;DR: This paper contributes a real-time multi-channel memory controller architecture with a new programmable Multi-Channel Interleaver unit and a novel method for logical-to-physical address translation that enables interleaving memory requests across multiple memory channels at different granularities.
Abstract: Optimal utilization of a multi-channel memory, such as Wide IO DRAM, as shared memory in multi-processor platforms depends on the mapping of memory clients to the memory channels, the granularity at which the memory requests are interleaved in each channel, and the bandwidth and memory capacity allocated to each memory client in each channel. Firm real-time applications in such platforms impose strict requirements on shared memory bandwidth and latency, which must be guaranteed at design-time to reduce verification effort. However, there is currently no real-time memory controller for multi-channel memories, and there is no methodology to optimally configure multi-channel memories in real-time systems. This paper has four key contributions: (1) A real-time multi-channel memory controller architecture with a new programmable Multi-Channel Interleaver unit. (2) A novel method for logical-to-physical address translation that enables interleaving memory requests across multiple memory channels at different granularities. (3) An optimal algorithm based on an Integer Linear Program (ILP) formulation to map memory clients to memory channels considering their communication dependencies, and to configure the memory controller for minimum bandwidth utilization. (4) We experimentally evaluate the run-time of the algorithm and show that an optimal solution can be found within 15 minutes for realistically sized problems. We also demonstrate configuring a multi-channel Wide IO DRAM in a High-Definition (HD) video and graphics processing system to emphasize the effectiveness of our approach.
TL;DR: This paper uses a combination of powerful static analyses relying on the polyhedral compiler framework and lightweight runtime routines they generate, to build a source-to-source transformation tool that automatically generates communication code and demonstrates that the tool is scalable and leads to substantial gains in efficiency.
Abstract: Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster are examples of such systems. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We use a combination of powerful static analyses relying on the polyhedral compiler framework and lightweight runtime routines they generate, to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11× to 83× over state-of-the-art, translating into a mean execution time speedup of 1.53×. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4× to 63.5× over state-of-the-art, resulting in a mean speedup of 1.55×. In addition, our scheme yields a mean speedup of 2.19× over hand-optimized UPC codes.
TL;DR: This paper presents MPI-based distributed memory parallel algorithms for generating random scale-free networks using the preferential-attachment model that scale very well to a large number of processors and provide almost linear speedups.
Abstract: Recently, there has been substantial interest in the study of various random networks as mathematical models of complex systems. As these complex systems grow larger, the ability to generate progressively large random networks becomes all the more important. This motivates the need for efficient parallel algorithms for generating such networks. Naive parallelization of the sequential algorithms for generating random networks may not work due to the dependencies among the edges and the possibility of creating duplicate (parallel) edges. In this paper, we present MPI-based distributed memory parallel algorithms for generating random scale-free networks using the preferential-attachment model. Our algorithms scale very well to a large number of processors and provide almost linear speedups. The algorithms can generate scale-free networks with 50 billion edges in 123 seconds using 768 processors.
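To make the dependency and duplicate-edge issues concrete, here is a minimal sequential sketch of preferential attachment. It is not the paper's parallel algorithm. The function name, the parameters `n`, `m`, and `seed`, and the choice to start from a small clique are all assumptions for illustration; the sketch requires `n > m >= 1`.

```python
import random
from typing import List, Tuple


def preferential_attachment(n: int, m: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Generate a scale-free network with n nodes in which every new node
    attaches to m distinct existing nodes chosen proportionally to degree."""
    rng = random.Random(seed)
    # Seed graph (assumption): a clique on the first m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # redraw on repeats to avoid duplicate edges
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges
```

Each new node's choices depend on the degrees produced by all earlier nodes, and repeated draws must be rejected; these are the two obstacles to naive parallelization that the abstract points out.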
TL;DR: Turbine is described, a new highly scalable and distributed many-task dataflow engine that executes a generalized many-task intermediate representation with automated self-distribution and is scalable to multi-petaflop infrastructures.
Abstract: Efficiently utilizing the rapidly increasing concurrency of multi-petaflop computing systems is a significant programming challenge. One approach is to structure applications with an upper layer of many loosely coupled coarse-grained tasks, each comprising a tightly-coupled parallel function or program. “Many-task” programming models such as functional parallel dataflow may be used at the upper layer to generate massive numbers of tasks, each of which generates significant tightly coupled parallelism at the lower level through multithreading, message passing, and/or partitioned global address spaces. At large scales, however, the management of task distribution, data dependencies, and intertask data movement is a significant performance challenge. In this work, we describe Turbine, a new highly scalable and distributed many-task dataflow engine. Turbine executes a generalized many-task intermediate representation with automated self-distribution and is scalable to multi-petaflop infrastructures. We present here the architecture of Turbine and its performance on highly concurrent systems.
TL;DR: The implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
Abstract: This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, ''communication-avoiding''), it is natural to consider hierarchical trees composed of an ''inter-node'' tree which acts on top of ''intra-node'' trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) ''TS level'' for cache-friendliness, (1) ''low-level'' for decoupled highly parallel inter-node reductions, (2) ''domino level'' to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (ii) build insights on how these levels influence performance and interact within each other. 
Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
TL;DR: A new parallel algorithm for calculating nonequispaced fast Fourier transforms on massively parallel distributed memory architectures and a new parallel distributed memory algorithm for the fast computation of fully Coulomb interactions in a charged particle system with nonperiodic boundary conditions based on a particle-mesh approximation scheme are developed.
Abstract: Starting from an approved serial algorithm, we develop a new parallel algorithm for calculating nonequispaced fast Fourier transforms on massively parallel distributed memory architectures. We demonstrate how to deal with the inherent load imbalance of the serial algorithm due to the use of oversampled FFT. This algorithm has been implemented in a new open source software library called PNFFT. Furthermore, we derive a new parallel distributed memory algorithm for the fast computation of fully Coulomb interactions in a charged particle system with nonperiodic boundary conditions based on a particle-mesh approximation scheme. We show that an appropriate adjustment of the underlying parallel nonequispaced fast Fourier transform circumvents severe load imbalance due to particle scaling. To prove the high scalability of our algorithms we provide performance results on a BlueGene/P system using up to 65536 cores.
TL;DR: In this paper, a new form of computer memory has been devised, which does not function like a computer memory where specific data is stored in specific numbered registers and retrieval is done by reading the contents of the specified memory register or matching key words as with a document search.
TL;DR: Using a hybrid parallelization approach that uses both threading and message-passing, the core linear algebra operations used in spatial statistics and Gaussian process regression are implemented in an R package called bigGP that relies on C and MPI.
Abstract: We consider parallel computation for Gaussian process calculations to overcome computational and memory constraints on the size of datasets that can be analyzed. Using a hybrid parallelization approach that uses both threading (shared memory) and message-passing (distributed memory), we implement the core linear algebra operations used in spatial statistics and Gaussian process regression in an R package called bigGP that relies on C and MPI. The approach divides the matrix into blocks such that the computational load is balanced across processes while communication between processes is limited. The package provides an API enabling R programmers to implement Gaussian process-based methods by using the distributed linear algebra operations without any C or MPI coding. We illustrate the approach and software by analyzing an astrophysics dataset with n=67,275 observations.
TL;DR: This work explores the selection of tasks that are favourable for migration across nodes in a distributed memory cluster, a lesser-explored dimension to distributed work-stealing, guided by the application-level task locality rather than hardware memory topology.
Abstract: Improving the performance of work-stealing load-balancing algorithms in distributed shared-memory systems is challenging. These algorithms need to overcome high costs of contention among workers, communication and remote data-references between nodes, and their impact on the locality preferences of tasks. Prior research focuses on stealing from a victim that best exploits data locality, and on using special deques that minimize the contention between local and remote workers. This work explores the selection of tasks that are favourable for migration across nodes in a distributed memory cluster, a lesser-explored dimension to distributed work-stealing. The selection of tasks is guided by the application-level task locality rather than hardware memory topology as is the norm in the literature. The prototype for the performance evaluation of these ideas is implemented in X10, a realization of the asynchronous partitioned global address space programming model. This evaluation reveals the applicability of this new approach on several real-world applications chosen from the Cowichan and the Lone star suites. On a cluster of 128 processors, the new work-stealing strategy demonstrates a speedup between 12% and 31% over X10's existing scheduler. Moreover, the new strategy does not degrade the performance of any of the applications studied.
TL;DR: A combined fast data and big data processing method is proposed in which (1) input from different data sources is received, classified as fast data or big data, and transmitted accordingly; (2) fast data enter a real-time trading module that performs real-time calculation and querying with the aid of a distributed memory; (3) a full-text retrieval module performs full-text retrieval on the fast data result; and (4) big data enter a volume historical data analysis module, where they are stored and subjected to complete querying and statistical analysis.
Abstract: The invention discloses a combined fast data and big data processing method comprising the following steps: (1) input from different data sources is received, classified as fast data or big data, and transmitted accordingly; (2) fast data enter a real-time trading module that performs real-time calculation and querying on the fast data with the aid of a distributed memory; (3) a full-text retrieval module performs full-text retrieval on the fast data result; (4) big data enter a volume historical data analysis module, where they are stored and subjected to complete querying and statistical analysis; and (5) an application module receives the data processed in steps (2), (3) and (4) and displays them on a terminal as required. The invention further provides a combined fast data and big data processing system. The method and system are low in cost and convenient to maintain, resources are allocated on demand, and performance scales linearly.
TL;DR: An API and a static analysis technique are proposed to transform legacy MPI codes automatically and transparently to the programmer; the scheme is shown to be easy to use in practice and yields communication speedups of up to 51% over a standard message passing implementation on state-of-the-art multicore systems.
Abstract: The number of cores in multi- and many-core high-performance processors is steadily increasing. MPI, the de facto standard for programming high-performance computing systems, offers a distributed memory programming model. MPI's semantics force a copy from one process's send buffer to another process's receive buffer. This makes it difficult to achieve the same performance on modern hardware as shared memory programs, which are arguably harder to maintain and debug. We propose generalizing MPI's communication model to include ownership passing, which makes it possible to fully leverage the shared memory hardware of multi- and many-core CPUs to stream communicated data concurrently with the receiver's computations on it. The benefits and simplicity of message passing are retained by extending MPI with calls to send (pass) ownership of memory regions, instead of their contents, between processes. Ownership passing is achieved with a hybrid MPI implementation that runs MPI processes as threads and is mostly transparent to the user. We propose an API and a static analysis technique to transform legacy MPI codes automatically and transparently to the programmer, demonstrating that this scheme is easy to use in practice. Using the ownership passing technique, we see communication speedups of up to 51% over a standard message passing implementation on state-of-the-art multicore systems. Our analysis and interface will lay the groundwork for future development of MPI-aware optimizing compilers and multi-core specific optimizations, which will be key for success in current and next-generation computing platforms.
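The difference between copying a message and passing ownership of its buffer can be sketched with two threads sharing one address space. This is only an analogy under stated assumptions: the `Sender` class, `run` helper, and method names are hypothetical and are not the proposed MPI extension or any real MPI call; a `queue.Queue` stands in for the communication channel.

```python
import queue
import threading


class Sender:
    def __init__(self, data):
        self.buf = bytearray(data)

    def send_copy(self, channel):
        """Copy semantics: the receiver gets a duplicate; the sender keeps its buffer."""
        channel.put(bytearray(self.buf))

    def pass_ownership(self, channel):
        """Ownership semantics: hand over the buffer itself and give up access."""
        buf, self.buf = self.buf, None
        channel.put(buf)


def receiver(channel, out):
    """Take one buffer from the channel and compute on it."""
    buf = channel.get()
    out["buf"] = buf
    out["total"] = sum(buf)


def run(transfer):
    """Start a receiver thread, perform one transfer, and return what it received."""
    channel = queue.Queue()
    out = {}
    thread = threading.Thread(target=receiver, args=(channel, out))
    thread.start()
    transfer(channel)
    thread.join()
    return out
```

With `send_copy` the payload is duplicated before the receiver can touch it; with `pass_ownership` the receiver works on the very same memory and the sender can no longer reach it, which is the property that removes the copy.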
TL;DR: How the memory management of a recent Linux kernel is re-architected to realize a three-way collaboration between hardware, supervisory software, and application tasks is described.
Abstract: This paper proposes a collaborative approach in which applications can provide guidance to the operating system regarding allocation and recycling of physical memory. The operating system incorporates this guidance to decide which physical page should be used to back a particular virtual page. The key intuition behind this approach is that application software, as a generator of memory accesses, is best equipped to inform the operating system about the relative access rates and overlapping patterns of usage of its own address space. It is also capable of steering its own algorithms in order to keep its dynamic memory footprint in check when there is a need to reduce power or to contain the spillover effects from bursts in demand. Application software, working cooperatively with the operating system, can therefore help the latter schedule memory more effectively and efficiently than when the operating system is forced to act alone without such guidance. It is particularly difficult to achieve power efficiency without application guidance since power expended in memory is a function not merely of the intensity with which memory is accessed in time but also how many physical ranks are affected by an application's memory usage. Our framework introduces an abstraction called "colors" for the application to communicate its intent to the operating system. We modify the operating system to receive this communication in an efficient way, and to organize physical memory pages into intermediate level grouping structures called "trays" which capture the physically independent access channels and self-refresh domains, so that it can apply this guidance without entangling the application in lower level details of power or bandwidth management. This paper describes how we re-architect the memory management of a recent Linux kernel to realize a three-way collaboration between hardware, supervisory software, and application tasks.
TL;DR: In this article, the parallel scalability of the seismic response analysis (SRA) module of the Integrated Earthquake Simulator (IES) is enhanced to achieve near-ideal scalability.
Abstract: The parallel scalability of the seismic response analysis (SRA) module of the Integrated Earthquake Simulator (IES) is enhanced to achieve near-ideal scalability. In previous studies, the SRA module was enhanced with standard distributed memory parallel computing techniques. However, some bottlenecks seriously hinder its scalability. Parallel computing performance is significantly improved by eradicating all the bottlenecks; input/output modules are enhanced with advanced MPI functions, a static load balancer is introduced, and the link to seismic structure analysis programs is strengthened. Details of the bottlenecks, remedies implemented, and other performance enhancements are presented. With a moderate size problem, it is demonstrated that the modified SRA module has near-ideal scalability. Further, it is shown that this near-ideal scalability is sustained even in the simulation of a large urban area.
TL;DR: In this article, multiple variants of a data processing system, which maintains I/O priority from the time a process makes an I/O request until the hardware services that request, are described.
Abstract: Multiple variants of a data processing system, which maintains I/O priority from the time a process makes an I/O request until the hardware services that request, are described. In one embodiment, a data processing system has one or more processors having one or more processor cores, which execute an operating system and one or more applications of the data processing system. The data processing system also can have one or more non-volatile memory devices coupled to the one or more processors to store data of the data processing system, and one or more non-volatile memory controllers coupled to the one or more processors. The one or more non-volatile memory controllers enable a transfer of data to at least one non-volatile memory device, and the priority level assigned by the operating system is maintained throughout the logical data path of the data processing system.
TL;DR: The design and implementation of a storage system for high performance systems based on a multiple level I/O caching architecture that relies on Memcached as a parallel storage system, preserving its powerful capacities such as transparency, quick deployment, and scalability is presented.
Abstract: This paper presents the design and implementation of a storage system for high performance systems based on a multiple level I/O caching architecture. The solution relies on Memcached as a parallel storage system, preserving its powerful capacities such as transparency, quick deployment, and scalability. The designed parallel storage system aims to reduce the I/O latency in data-intensive high performance applications. The proposed solution consists of a user-level library and extended Memcached servers. The solution aims to be hierarchical by deploying Memcached-based I/O servers across the entire infrastructure data path. Our experiments demonstrate that our solution is up to 40% faster than PVFS2.