TL;DR: Empirical evidence is provided that the proposed scalable and efficient parallel solution for incremental gradient descent in GLADE is limited only by the physical hardware characteristics, uses the available resources effectively, and achieves maximum scalability.
Abstract: Incremental gradient descent is a general technique to solve a large class of convex optimization problems arising in many machine learning tasks. GLADE is a parallel infrastructure for big data analytics providing a generic task specification interface. In this paper, we present a scalable and efficient parallel solution for incremental gradient descent in GLADE. We provide empirical evidence that our solution is limited only by the physical hardware characteristics, uses the available resources effectively, and achieves maximum scalability. When deployed in the cloud, our solution has the potential to dramatically reduce the cost of complex analytics over massive datasets.
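For readers unfamiliar with the technique named in the abstract above, the following is a minimal sequential sketch of incremental gradient descent on a least-squares objective. It is not the GLADE implementation and does not reflect its task specification interface; the function name, the step size, the epoch count, and the toy dataset are all assumptions chosen for illustration.

```python
import random


def incremental_gradient_descent(examples, dim, step=0.01, epochs=200, seed=0):
    """Fit weights w by updating after every single example (hypothetical helper).

    examples: list of (x, y) pairs, where x is a tuple of `dim` floats.
    The objective is the sum of squared errors (w . x - y)^2 / 2.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order each pass
        for i in order:
            x, y = examples[i]
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            # Gradient of one example's loss is err * x; step against it.
            w = [wj - step * err * xj for wj, xj in zip(w, x)]
    return w


# Toy data generated from y = 1 + 2 * t, with a constant feature for the intercept.
data = [((1.0, float(t)), 1.0 + 2.0 * t) for t in range(-5, 6)]
weights = incremental_gradient_descent(data, dim=2)
print(weights)
```

The key property is that the model is updated after each example rather than after a full pass over the data, which is what makes the method attractive for large datasets and also what makes parallelizing it non-trivial.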
TL;DR: This article presents a methodology to categorize a specified CA algorithm as compute bound or I/O bound, and provides a rigorous analysis of each of the two cases, identifying the various parameters that control the mapping process and are defined both by the Cellular Automata algorithm and the given FPGA hardware specifications.
Abstract: FPGA-based computation engines have been used as Cellular Automata accelerators in the scientific community for some time now. With the recent availability of more advanced FPGA logic, it becomes necessary to better understand the mapping of Cellular Automata to these systems. There are many trade-offs to consider when mapping a Cellular Automata algorithm from an abstract system to the physical implementation using FPGA logic. The trade-offs include both the available FPGA resources and the Cellular Automata algorithm's execution time. The most important aspect is to fully understand the behavior of the specified CA algorithm in terms of its execution time, which is either compute bound or I/O bound. In this article, we present a methodology to categorize a specified CA algorithm as compute bound or I/O bound. We take the methodology further by presenting a rigorous analysis of each of the two cases, identifying the various parameters that control the mapping process and are defined both by the Cellular Automata algorithm and the given FPGA hardware specifications. This methodology helps to predict the performance of running Cellular Automata algorithms on specific FPGA hardware and to determine optimal values for the various parameters that control the mapping process. The model is validated for both compute and I/O bound two-dimensional Cellular Automata algorithms. We find that our model predictions are accurate within 7%.
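The compute bound versus I/O bound distinction in the abstract above can be illustrated with a simple back-of-the-envelope comparison of two time estimates. This sketch is not the model from the article; the function name and every parameter (cycles per cell, number of processing elements, clock rate, bytes moved per cell, memory bandwidth) are hypothetical and chosen only to show the idea.

```python
def classify_ca_step(cells, cycles_per_cell, num_pe, clock_hz,
                     bytes_per_cell, bandwidth_bytes_per_s):
    """Classify one CA time step as compute bound or I/O bound (illustrative only).

    compute_s: time to update all cells with num_pe parallel processing elements.
    io_s: time to stream all cell data through the memory interface.
    """
    compute_s = cells * cycles_per_cell / (num_pe * clock_hz)
    io_s = cells * bytes_per_cell / bandwidth_bytes_per_s
    bound = "compute bound" if compute_s >= io_s else "I/O bound"
    return bound, compute_s, io_s


# Hypothetical design points: the same algorithm with 16 and then 64 processing elements.
few_pes = classify_ca_step(1_000_000, 8, 16, 100e6, 2, 1e9)
many_pes = classify_ca_step(1_000_000, 8, 64, 100e6, 2, 1e9)
print(few_pes[0])
print(many_pes[0])
```

Under these assumed numbers, adding processing elements shortens the compute estimate until the memory interface becomes the limit, which is the kind of trade-off a mapping methodology has to capture.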
TL;DR: This work presents an investigation into accelerating I/O bound streaming applications through the coupling of custom computing cores, a hardware filesystem, and an integrated on-chip and off-chip network on the all-FPGA node cluster.
Abstract: The Reconfigurable Computing Cluster project is exploring novel parallel computing architectures in high performance computing with FPGA devices. Although there are no discrete microprocessors in the system, highly-integrated FPGAs (with embedded processors) are capable of hosting Linux-based systems and can run arbitrary MPI applications. This work presents an investigation into accelerating I/O bound streaming applications through the coupling of custom computing cores, a hardware filesystem, and an integrated on-chip and off-chip network on the all-FPGA node cluster. Such an infrastructure enables productivity by minimizing hardware design while maintaining high performance. A hardware implementation of the BLASTn algorithm is used to demonstrate the performance gains and scalability of the custom computing cores across the Spirit cluster. Results show linear speedup across multiple nodes while supporting productivity by eliminating modifications to the original hardware core when scaling up to 512 parallel cores on the cluster.
TL;DR: A new scheduling algorithm for Hadoop-based distributed systems is proposed, based on the classification of workloads to assign a specific category to a particular cluster according to the current load of the cluster, which increases the performance of both CPU and I/O resources in a cluster under heterogeneous workloads.
Abstract: Currently, most cloud-based applications require large-scale data processing capability. The data to be processed is growing at a rate much faster than available computing power. Hadoop is used to enable distributed processing on large clusters of commodity hardware. In large clusters, the workloads may be heterogeneous in nature, that is, I/O bound, CPU bound, or network intensive jobs that demand different types of resources while running simultaneously on a large cluster. Hadoop's job scheduling is based on FIFO, where parallelization based on job type is not taken into account for scheduling. In this paper, we propose a new scheduling algorithm for Hadoop-based distributed systems, based on the classification of workloads to assign a specific category to a particular cluster according to the current load of the cluster. The proposed scheduler increases the performance of both CPU and I/O resources in a cluster under heterogeneous workloads by approximately 12% when compared to Hadoop's FIFO scheduler.
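The general idea of classification-based placement described in the abstract above can be sketched as follows. This is not the scheduler proposed in the paper and it uses no Hadoop API; the job profile (CPU seconds versus I/O seconds), the per-node load dictionary, and both function names are assumptions made for illustration.

```python
def classify_job(cpu_seconds, io_seconds):
    """Label a job by its dominant resource (hypothetical profiling inputs)."""
    return "cpu" if cpu_seconds >= io_seconds else "io"


def pick_node(job_kind, nodes):
    """Choose the node whose current load on the job's dominant resource is lowest.

    nodes: dict mapping node name -> {"cpu": load, "io": load}, loads in [0, 1].
    """
    return min(nodes, key=lambda name: nodes[name][job_kind])


# Hypothetical cluster state: node-a is busy on CPU, node-b is busy on I/O.
cluster = {
    "node-a": {"cpu": 0.9, "io": 0.1},
    "node-b": {"cpu": 0.2, "io": 0.8},
}
cpu_job_target = pick_node(classify_job(cpu_seconds=120, io_seconds=15), cluster)
io_job_target = pick_node(classify_job(cpu_seconds=10, io_seconds=90), cluster)
print(cpu_job_target, io_job_target)
```

In contrast to a FIFO queue, which would place jobs in arrival order regardless of type, this kind of placement lets a CPU bound job and an I/O bound job run side by side without competing for the same resource.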
TL;DR: In this paper, the authors describe a hierarchical simulation structure using the Integrated Plasma Simulator (IPS) that enables the flexible execution of coupled simulations at the system, node, and core level using the same coupling abstraction and API.
Abstract: We present our experience using containers to scale up a massive ensemble of coupled I/O bound workloads on the NERSC Cori supercomputer. We describe the design of a hierarchical simulation structure using the Integrated Plasma Simulator (IPS) that enables the flexible execution of coupled simulations at the system, node, and core level using the same coupling abstraction and API. The hierarchical design allows the node-level workloads to be executed efficiently using containers without impacting the structure of the simulation at the system level. We demonstrate the viability of the approach by presenting experimental results from applications in coupled fusion plasma simulations that illustrate the performance impact of using containers to deploy the node-level workloads, in conjunction with user-mountable XFS file systems to ameliorate the load on the Lustre parallel file system. We also present results from production runs showing the ability of the ensemble simulations to scale to hundreds of Cori Haswell nodes with little or no overhead.