Zhong‐Zhi Bai, Qizhong Wu, Kai Cao, Yiming Sun, Huaqiong Cheng
2 Jan 2024
TL;DR: The Loongson 3A4000 CPU platform with MIPS64 architecture is well-suited for running the WRF-CAMx air quality modelling system in the Beijing-Tianjin-Hebei region. It has high energy efficiency and scientific usability.
Abstract: The MIPS processor architecture is a type of Reduced Instruction Set Computing (RISC) processor architecture, which has advantages in terms of energy consumption and efficiency. There are few studies on the application of MIPS CPUs in geoscientific numerical models. In this study, a Loongson 3A4000 CPU platform with MIPS64 architecture was used to establish the runtime environment for the air quality modelling system WRF-CAMx in the Beijing-Tianjin-Hebei region. The results show that the relative errors for the major species (NO2, SO2, O3, CO, PNO3 and PSO4) between the MIPS and X86 benchmark platforms are within ±0.1 %. The maximum Mean Absolute Error (MAE) of the major species is on the order of 10⁻² ppbV or μg m⁻³, the maximum Root Mean Square Error (RMSE) is on the order of 10⁻¹ ppbV or μg m⁻³, and the Mean Absolute Percentage Error (MAPE) remains within 0.5 %. CAMx takes about 15.2 minutes on the Loongson 3A4000 CPU and 4.8 minutes on the Intel Xeon E5-2697 v4 CPU when simulating a 2 h case with four parallel processes using MPICH. As a result, the single-core computing capability of the Loongson 3A4000 CPU for the WRF-CAMx modeling system is about one-third of that of the Intel Xeon E5-2697 v4 CPU, but the thermal design power (TDP) of the Loongson 3A4000 is 30 W, only about one-fifth of that of the Intel Xeon E5-2697 v4, whose TDP is 145 W. Thus, the Loongson 3A4000 has higher energy efficiency in the application of the WRF-CAMx modeling system. The results also verify the feasibility of cross-platform porting and the scientific usability of the ported model. This study provides a technical foundation for the porting and optimization of numerical models based on MIPS or other RISC platforms.
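The error metrics and the energy-efficiency argument in this abstract can be illustrated with a short sketch. The concentration lists below are made-up placeholder values, not data from the study; only the runtimes (15.2 and 4.8 minutes) and the TDP figures (30 W and 145 W) come from the abstract. Treating runtime × TDP as an energy proxy is a simplifying assumption that ignores core counts and actual power draw.

```python
import math

def mae(ref, test):
    """Mean Absolute Error between reference and test values."""
    return sum(abs(t - r) for r, t in zip(ref, test)) / len(ref)

def rmse(ref, test):
    """Root Mean Square Error between reference and test values."""
    return math.sqrt(sum((t - r) ** 2 for r, t in zip(ref, test)) / len(ref))

def mape(ref, test):
    """Mean Absolute Percentage Error, in percent (reference must be non-zero)."""
    return 100.0 * sum(abs((t - r) / r) for r, t in zip(ref, test)) / len(ref)

# Hypothetical NO2 concentrations (ppbV) on the two platforms, for illustration only.
x86_no2 = [12.0, 15.5, 9.8, 20.1]
mips_no2 = [12.001, 15.498, 9.801, 20.102]

print("MAE :", mae(x86_no2, mips_no2))
print("RMSE:", rmse(x86_no2, mips_no2))
print("MAPE:", mape(x86_no2, mips_no2), "%")

# Figures quoted in the abstract.
runtime_ratio = 15.2 / 4.8      # Loongson runtime relative to Xeon, about 3.2x
tdp_ratio = 30.0 / 145.0        # Loongson TDP relative to Xeon, about 0.21x

# Crude energy proxy (assumption): energy ~ runtime * TDP.
energy_ratio = runtime_ratio * tdp_ratio
print("Loongson/Xeon energy proxy:", round(energy_ratio, 3))
```

Under this proxy a value below 1 means the Loongson run uses less energy than the Xeon run despite taking longer, which is the sense in which the abstract calls it more energy efficient.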
TL;DR: This study explores secure deep learning inference using Intel SGX on Intel Ice Lake-SP Xeon Processor, demonstrating feasibility despite a 70% inference rate loss and 13X runtime overhead, with potential for future performance improvements.
Abstract: Deep learning, one of the most reputable techniques deployed for artificial intelligence, is actively used in many applications, including security-critical ones. As more industries migrate their applications to the cloud or to the edge, serious security concerns have arisen. Although securing deep learning inference has been an area of exploration for many researchers over the years, the majority of the efforts revolve around securing the input data and the deep learning model, without much focus on securing the application code or the inference forward pass. Among the most popular methodologies proposed to secure deep learning inference are cryptographic primitives and trusted hardware. Due to the high performance overhead incurred by cryptographic primitives, this paper proposed to secure a deep learning inference application through the trusted hardware approach, particularly via Intel SGX on the 3rd Gen Intel® Xeon Scalable processor. This research found that Intel SGX incurred a loss of up to around 70% in the number of inferences per second and an overhead of up to 13X in overall application runtime. Nevertheless, it demonstrated that, with the greatly expanded Intel SGX enclave size on the first Intel Xeon Scalable Processor that comes with Intel SGX support, it is feasible to secure a deep learning application with Intel SGX without any code modification, despite the performance trade-off.
TL;DR: This paper introduces the Intel Xeon 6 SoC, a system-on-chip designed for edge computing, emphasizing security, connectivity, and management capabilities to support real-time processing and data analytics in IoT and industrial applications.
TL;DR: The study investigates the performance of CICE and its EVP solver and identifies two bottlenecks. The study refactors the standard EVP solver based on two generic patterns and achieves significant performance improvements.
Abstract: This study focuses on the performance of CICE and its Elastic-Viscous-Plastic (EVP) dynamical solver. The study was conducted in two steps. First, the standard EVP solver was extracted from CICE for experiments with refactored versions of it. Second, one refactored version was integrated and tested as part of the full model. Two dominant bottlenecks were revealed. The first is the number of MPI and OpenMP synchronization points required for halo exchanges during each time step, combined with the irregular domain of active sea ice points. The second is the lack of Single Instruction Multiple Data (SIMD) code generation. The study refactors the standard EVP solver based on two generic patterns. The first pattern shows how general finite differences on masked multi-dimensional arrays can be expressed in order to produce significantly better code generation. The primary change is that the memory access pattern is changed from random access to direct access. The second pattern shows an alternative approach to handling static grid properties. The measured single-core performance improves by more than a factor of five compared to the standard implementation. The refactored implementation strong scales on the Intel® Xeon® Scalable Processor Series node until the available bandwidth of the node is used. For the Intel® Xeon® CPU Max Series there is sufficient bandwidth to allow the strong scaling to continue for all the cores on the node, resulting in a single-node improvement factor of 35 over the standard implementation. This study also shows improved performance on GPU processors.
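The contrast between random (indirect) and direct memory access for a finite difference on a masked array can be sketched as follows. This is an illustrative NumPy analogy, not the refactored Fortran solver from the study: the function names, the grid size, and the centred-difference stencil are all assumptions, and NumPy slices merely stand in for the unit-stride loops that a compiler can vectorize.

```python
import numpy as np

def diff_indirect(field, ii, jj):
    """Centred x-difference that visits active points through index lists.

    Each point is reached via (i, j) lookups, so consecutive iterations may
    touch non-contiguous memory (gather-style access).
    """
    out = np.zeros_like(field)
    for i, j in zip(ii, jj):
        out[i, j] = 0.5 * (field[i, j + 1] - field[i, j - 1])
    return out

def diff_direct(field, mask):
    """Same stencil over contiguous slices; the mask is applied arithmetically."""
    out = np.zeros_like(field)
    out[:, 1:-1] = 0.5 * (field[:, 2:] - field[:, :-2]) * mask[:, 1:-1]
    return out

# Hypothetical 6x8 grid with an irregular set of active interior points.
rng = np.random.default_rng(0)
field = rng.random((6, 8))
mask = np.zeros((6, 8))
mask[1:5, 2:6] = 1.0
mask[2, 3] = 0.0            # a hole, making the active domain irregular
ii, jj = np.nonzero(mask)

# Both formulations give the same result; only the access pattern differs.
same = np.allclose(diff_indirect(field, ii, jj), diff_direct(field, mask))
print("results agree:", same)
```

The direct version does some redundant arithmetic on inactive points, which is the usual trade-off for getting contiguous, vectorizable access.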
TL;DR: This paper explores FPGA acceleration for nuclear particle transport simulators using Intel oneAPI, implementing XS-Bench on Stratix10 FPGA and comparing its performance with Intel Xeon CPU-based systems for heterogeneous computing applications.
Abstract: Field-Programmable Gate Arrays (FPGAs) are becoming an interesting component for heterogeneous computing systems in the post-Moore era thanks to their reconfigurable nature. The current generation of FPGAs includes specialized hard blocks for floating point operations, making them attractive for scientific computing. FPGA programming has historically been done in hardware description languages, which required a deep understanding of hardware design. Emerging high-level synthesis tools, such as Intel oneAPI and AMD Vitis™, provide a more common programming environment for FPGAs. In this paper, we explore the capabilities of FPGAs for acceleration in the context of nuclear particle transport simulators. As a case study, we implement XS-Bench in Intel oneAPI targeting FPGAs, including basic optimizations. We then compare the performance of Intel Stratix10 FPGA and Intel Xeon CPU-based systems and evaluate the viability of FPGA use in heterogeneous systems.
Abstract: Today, genetic algorithms are widely used in many fields such as bioinformatics, computer science, artificial intelligence, and finance, where they are applied to produce high-quality solutions for complex optimization problems. Many studies have proposed new hardware architectures that aim to speed up the execution of genetic algorithms. Some studies suggest parallel genetic algorithms on systems with multicore CPUs and/or graphics processing units (GPUs). However, very few solutions propose a genetic algorithm that can be run on systems that use the Intel Xeon Phi co-processor (Intel Many Integrated Core (MIC) architecture). For that reason, we propose and develop a study of the genetic algorithm on high-performance computing systems with Intel Xeon Phi co-processors. This study presents the results of parallel approaches to the genetic algorithm on one or more Intel Xeon Phi co-processors using the following methods: (i) the Intel Xeon Phi Offload and Native programming models; and (ii) a combined model of MPI and OpenMP. The proposed genetic algorithm can find the optimal schedule for the energy-efficient scheduling problem of virtual machines on physical machines, with the goal of minimizing total energy consumption. The results of the simulations show the feasibility of implementing a genetic algorithm on one or many Intel Xeon Phi co-processors. A genetic algorithm on one or more distributed Intel Xeon Phi co-processors always executes faster than the sequential genetic algorithm, and it can find better solutions when more Intel Xeon Phi co-processors are used. This result can be applied to other meta-heuristics such as tabu search and ant colony optimization.
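A minimal sequential sketch of a genetic algorithm for placing virtual machines on physical hosts to reduce energy is shown below. It is not the Xeon Phi implementation from the study: the VM demands, host capacities, linear power model, overload penalty, and all GA parameters are hypothetical values chosen for illustration. In a parallel version, the fitness evaluations inside each generation are the natural part to distribute.

```python
import random

VM_DEMAND = [2, 4, 1, 3, 2, 5, 1, 2]   # hypothetical CPU demand per VM
HOST_CAP = [8, 8, 8, 8]                # hypothetical CPU capacity per host
P_IDLE, P_PER_UNIT = 100.0, 10.0       # hypothetical power model (watts)
OVERLOAD_PENALTY = 1000.0              # hypothetical penalty per excess unit

def energy(assign):
    """Total power of an assignment; assign[v] is the host index of VM v."""
    load = [0] * len(HOST_CAP)
    for vm, host in enumerate(assign):
        load[host] += VM_DEMAND[vm]
    total = 0.0
    for used, cap in zip(load, HOST_CAP):
        if used > 0:                    # a host only draws power when in use
            total += P_IDLE + P_PER_UNIT * used
        if used > cap:                  # discourage infeasible placements
            total += OVERLOAD_PENALTY * (used - cap)
    return total

def evolve(generations=200, pop_size=30, p_mut=0.1, seed=0):
    """Run the GA and return the best assignment found."""
    rng = random.Random(seed)
    n_vm, n_host = len(VM_DEMAND), len(HOST_CAP)
    pop = [[rng.randrange(n_host) for _ in range(n_vm)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=energy)
        nxt = pop[:2]                   # elitism: keep the two best unchanged
        while len(nxt) < pop_size:
            # Tournament selection of two parents.
            a = min(rng.sample(pop, 3), key=energy)
            b = min(rng.sample(pop, 3), key=energy)
            # One-point crossover followed by per-gene mutation.
            cut = rng.randrange(1, n_vm)
            child = a[:cut] + b[cut:]
            child = [rng.randrange(n_host) if rng.random() < p_mut else g
                     for g in child]
            nxt.append(child)
        pop = nxt
    return min(pop, key=energy)

best = evolve()
print("best assignment:", best, "power:", energy(best))
```

Because the two best individuals are carried over unchanged each generation, the best energy found never gets worse as the number of generations increases.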
TL;DR: This study benchmarks Open MPI collective communication on Intel Xeon dual quad-core clusters, comparing Gigabit Ethernet and InfiniBand performance. Results show InfiniBand outperforms Gigabit Ethernet in latency and throughput for most collective operations.
Abstract: The performance of MPI implementation operations still presents critical issues for high performance computing systems, particularly for more advanced processor technology. Consequently, this study concentrates on benchmarking MPI implementation on multi-core architecture by measuring the performance of Open MPI collective communication on Intel Xeon dual quad-core Gigabit Ethernet and InfiniBand clusters using SKaMPI. It focuses on well-known collective communication routines such as MPI-Bcast, MPI-Alltoall, MPI-Scatter and MPI-Gather. From the collection of results, MPI collective communication on InfiniBand clusters had distinctly better performance in terms of latency and throughput. The analysis indicates that the algorithm used for collective communication performed very well for all message sizes except for the MPI-Bcast and MPI-Alltoall operations of inter-node communication. However, InfiniBand provides the lowest latency for all operations since it provides applications with an easy-to-use messaging service, compared to Gigabit Ethernet, which still requests the operating system for access to one of the server communication resources with the complex dance between an application and a network.