About: Single instruction, multiple threads (SIMT) is a research topic. Over its lifetime, 39 publications have appeared within this topic, receiving 324 citations.
TL;DR: A CUDA implementation of EM, applied to an input of 230K for a 32-order mixture of 32-dimensional Gaussians, completes one iteration in 264 msec on a Quadro FX 5800 (NVIDIA 200 series) with 240 cores, about 164 times faster than a naive single-threaded C implementation on a CPU.
Abstract: The Expectation Maximization (EM) algorithm is an iterative technique widely used in signal processing and data mining. We present a parallel implementation of EM for finding maximum likelihood estimates of the parameters of Gaussian mixture models, designed for the many-core architecture of Graphics Processing Units (GPUs). The algorithm is implemented on NVIDIA GPUs using CUDA, following the single instruction multiple threads model. The emphasis of this paper is on exploiting data parallelism with CUDA to accelerate the computations. The CUDA implementation of EM is designed so that the speed of computation scales with the number of GPU cores, and experimental results confirm this scalability across cores. The results also show that the CUDA implementation of EM, applied to an input of 230K for a 32-order mixture of 32-dimensional Gaussians, takes 264 msec to complete one iteration on a Quadro FX 5800 (NVIDIA 200 series) with 240 cores, which is about 164 times faster than a naive single-threaded C implementation on a CPU.
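The data parallelism the abstract refers to can be illustrated with a minimal sketch of one EM iteration for a 1-D Gaussian mixture. This is not the paper's CUDA kernel: the one-thread-per-sample mapping, the 1-D simplification, and all function names (`gaussian_pdf`, `e_step_point`, `e_step`, `m_step`) are assumptions made for illustration, and the code runs serially on a CPU.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step_point(x, weights, means, variances):
    """Responsibilities of every mixture component for one sample.

    In a one-thread-per-sample mapping (an assumption, not the paper's
    documented layout) this is the work a single GPU thread would do.
    """
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def e_step(samples, weights, means, variances):
    # Each sample is processed independently of the others, so this loop
    # is the part that maps naturally onto many SIMT threads.
    return [e_step_point(x, weights, means, variances) for x in samples]

def m_step(samples, resp):
    """Re-estimate weights, means and variances from the responsibilities."""
    n = len(samples)
    k = len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        # Each sum below is a reduction over all samples; on a GPU it
        # would typically be a parallel reduction rather than a loop.
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, samples)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, samples)) / nj
        weights.append(nj / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances
```

The E-step needs no communication between samples, while the M-step is dominated by reductions, which is one plausible reason an implementation of this kind can scale with the number of cores.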
TL;DR: To assess the costs incurred by incompletely vectorized code, a micro-benchmark is developed that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.
Abstract: All modern processors include a set of vector instructions. While this gives a tremendous boost to performance, it requires vectorized code that can take advantage of such instructions. As ideal vectorization is hard to achieve in practice, one has to decide when different instructions may be applied to different elements of the vector operand. This is especially important in implicit vectorization, as in the NVIDIA CUDA Single Instruction Multiple Threads (SIMT) model, where the vectorization details are hidden from the programmer. To assess the costs incurred by incompletely vectorized code, we have developed a micro-benchmark that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.
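The cost of divergent loops can be approximated with a deliberately simplified model, sketched below. It assumes that a warp stays occupied until its slowest thread leaves the loop, and it ignores memory effects and architecture-specific reconvergence behaviour. It is not the paper's micro-benchmark; `warp_loop_cost` and its parameters are hypothetical names.

```python
WARP_SIZE = 32  # warp width on NVIDIA GPUs

def warp_loop_cost(trip_counts, warp_size=WARP_SIZE):
    """Estimate SIMT lane utilisation for a loop with per-thread trip counts.

    trip_counts[i] is the number of loop iterations thread i needs.
    Returns (issued_slots, useful_slots, efficiency).
    """
    issued = 0
    useful = 0
    for start in range(0, len(trip_counts), warp_size):
        warp = trip_counts[start:start + warp_size]
        # Simplifying assumption: every lane of the warp is held for as
        # many iterations as the slowest lane needs.
        issued += max(warp) * len(warp)
        # Only iterations a thread actually wanted count as useful work.
        useful += sum(warp)
    efficiency = useful / issued if issued else 1.0
    return issued, useful, efficiency

uniform = [10] * 32            # every thread runs the same number of iterations
divergent = [1] * 31 + [100]   # one thread runs far longer than the rest
print(warp_loop_cost(uniform))
print(warp_loop_cost(divergent))
```

Under this model a single long-running thread is enough to leave most lanes of a warp idle, which is the kind of effect a divergence micro-benchmark would measure on real hardware.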
TL;DR: A reformulation of EBCOT context modeling that allows full parallelization on massively parallel architectures, such as GPUs with their single instruction multiple threads architecture, is proved equivalent to the EBCOT specification in the JPEG2000 standard.
Abstract: Embedded Block Coding with Optimal Truncation (EBCOT) is the fundamental and computationally very demanding part of the compression process of the JPEG2000 image compression standard. In this paper, we present a reformulation of the context modeling of EBCOT that allows full parallelization for massively parallel architectures such as GPUs with their single instruction multiple threads architecture. We prove that the reformulation is equivalent to the EBCOT specification in the JPEG2000 standard. The behavior of the reformulated algorithm is demonstrated using the NVIDIA CUDA platform and compared to other state-of-the-art implementations.
TL;DR: A novel warp scheduling scheme is proposed to maintain data locality and relieve cache pollution and thrashing, together with a new insertion method, LPI (Locality Protected Insertion), that reorders warps in the supervised warp queue so that long-latency warps are better hidden by short-latency warps, such as those performing ALU operations and on-chip accesses.
TL;DR: The design and implementation of a conflict-driven ASP solver capable of exploiting the parallelism offered by GPUs is presented; preliminary experimental results confirm the feasibility and scalability of the approach and its potential to enhance the performance of ASP solvers.
Abstract: General-purpose Graphics Processing Units (GPUs) are affordable multi-core platforms, providing access to a large number of cores, but at the price of a complex architecture with non-trivial synchronization and communication costs. This paper presents the design and implementation of a conflict-driven ASP solver that is capable of exploiting the parallelism offered by GPUs. The proposed system builds on the notion of ASP computation, which avoids the generation of unfounded sets, enhanced by conflict analysis and learning. The proposed system uses the CPU exclusively for input and output, in order to reduce the negative impact of the expensive data transfers between the CPU and the GPU. All the solving components, i.e., the management of nogoods, the search strategy, backjumping, the search heuristics, conflict analysis and learning, and unit propagation, are performed on the GPU by exploiting Single Instruction Multiple Threads (SIMT) parallelism. The preliminary experimental results confirm the feasibility and scalability of the approach, and the potential to enhance the performance of ASP solvers.
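Of the components listed, unit propagation over nogoods is perhaps the easiest to picture in SIMT terms, because each nogood can be inspected independently of the others. The sketch below is a serial illustration under stated assumptions, not the solver's implementation: literals are modelled as `(atom, value)` pairs, a nogood is a list of literals that must not all hold at once, and `inspect_nogood` and `propagate` are hypothetical names. Conflict analysis, learning, and the search heuristics are not shown.

```python
def inspect_nogood(nogood, assignment):
    """Classify one nogood under a partial assignment.

    Returns ("conflict", None) if every literal holds,
    ("unit", (atom, value)) if exactly one literal is unassigned and all
    others hold (the returned literal is the forced complement),
    and ("none", None) otherwise.
    """
    unassigned = []
    for atom, value in nogood:
        if atom not in assignment:
            unassigned.append((atom, value))
        elif assignment[atom] != value:
            return ("none", None)  # a false literal: this nogood cannot be violated
    if not unassigned:
        return ("conflict", None)
    if len(unassigned) == 1:
        atom, value = unassigned[0]
        return ("unit", (atom, not value))
    return ("none", None)

def propagate(nogoods, assignment):
    """Apply unit propagation to a fixpoint; return None on conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        # Every nogood is checked against the same snapshot of the
        # assignment; on a GPU this step could use one thread per nogood.
        results = [inspect_nogood(ng, assignment) for ng in nogoods]
        for status, literal in results:
            if status == "conflict":
                return None
            if status == "unit":
                atom, value = literal
                if atom in assignment and assignment[atom] != value:
                    return None  # two nogoods forced opposite values
                if atom not in assignment:
                    assignment[atom] = value
                    changed = True
    return assignment
```

For example, with the nogoods `[("a", True), ("b", False)]` and `[("b", True), ("c", True)]` and the starting assignment `{"a": True}`, the first round forces `b` to true and the second forces `c` to false.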