Single instruction, multiple threads

Papers

Proceedings Article•10.1109/HPCC.2009.45•

Fast Parallel Expectation Maximization for Gaussian Mixture Models on GPUs Using CUDA

N. S. L. Phani Kumar¹, Sanjiv Satoor¹, Ian Buck¹•Institutions (1)

25 Jun 2009

TL;DR: The results show that the CUDA implementation of EM, applied to an input of 230K for a 32-order mixture of 32-dimensional Gaussian model, takes 264 msec to complete one iteration on a Quadro FX 5800 (NVIDIA 200 series) with 240 cores.

Abstract: The Expectation Maximization (EM) algorithm is an iterative technique widely used in signal processing and data mining. We present a parallel implementation of EM for finding maximum likelihood estimates of the parameters of Gaussian mixture models, designed for the many-core architecture of Graphics Processing Units (GPUs). The algorithm is implemented on NVIDIA's GPUs using CUDA, following the single instruction multiple threads model. In this paper, the emphasis is on exploiting data parallelism with CUDA to accelerate the computations. The CUDA implementation of EM is designed so that its computation speed scales with the number of GPU cores. Experimental results confirm the scalability across cores. The results also show that the CUDA implementation of EM, applied to an input of 230K for a 32-order mixture of 32-dimensional Gaussian model, takes 264 msec to complete one iteration on a Quadro FX 5800 (NVIDIA 200 series) with 240 cores, which is about 164 times faster than a naive single-threaded C implementation on a CPU.
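The data parallelism mentioned in the abstract can be illustrated with a minimal sketch of the EM expectation step. This is not the authors' CUDA kernel: the diagonal-covariance assumption and the function names `log_gaussian_diag`, `responsibilities`, and `e_step` are hypothetical choices made for illustration. The point is that each data point's responsibilities depend only on that point and the shared model parameters, so the work maps naturally to one thread per data point.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at point x (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def responsibilities(x, weights, means, variances):
    """E-step for one data point: posterior probability of each component."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # shift by the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def e_step(points, weights, means, variances):
    # Every point is processed independently of the others. In a SIMT
    # setting this loop body would be the per-thread work.
    return [responsibilities(x, weights, means, variances) for x in points]

# Two well-separated 1-D components; each point sits on one component mean.
resp = e_step(
    points=[[0.0], [4.0]],
    weights=[0.5, 0.5],
    means=[[0.0], [4.0]],
    variances=[[1.0], [1.0]],
)
```

Each row of `resp` sums to 1, and each point is assigned almost entirely to the component centered on it.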

108 citations

Book Chapter•10.1007/978-3-319-32149-3_53•

Benchmarking the Cost of Thread Divergence in CUDA

Piotr Białas¹, Adam Strzelecki¹•Institutions (1)

Jagiellonian University¹

6 Sep 2015

TL;DR: To assess the costs incurred by incompletely vectorized code, a micro-benchmark is developed that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.

Abstract: All modern processors include a set of vector instructions. While this gives a tremendous boost to performance, it requires vectorized code that can take advantage of such instructions. As ideal vectorization is hard to achieve in practice, one has to decide when different instructions may be applied to different elements of the vector operand. This is especially important in implicit vectorization, as in the NVIDIA CUDA Single Instruction Multiple Threads (SIMT) model, where the vectorization details are hidden from the programmer. In order to assess the costs incurred by incompletely vectorized code, we have developed a micro-benchmark that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.
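The cost of divergent loops can be illustrated with a deliberately simplified model. This is not the authors' micro-benchmark: it assumes that a warp of 32 threads keeps issuing the loop body until its slowest lane finishes, with finished lanes masked off, and it ignores the architecture-specific behavior that the paper actually measures. The names `warp_loop_cost` and `simt_efficiency` are hypothetical.

```python
WARP_SIZE = 32  # warp width on NVIDIA GPUs

def warp_loop_cost(trip_counts):
    """Loop iterations the warp issues when each lane has its own trip count.

    Simplified assumption: the warp iterates until the slowest lane is done,
    and lanes that finished earlier sit idle (masked off).
    """
    return max(trip_counts)

def simt_efficiency(trip_counts):
    """Fraction of issued lane-iterations that perform useful work."""
    issued = warp_loop_cost(trip_counts) * len(trip_counts)
    useful = sum(trip_counts)
    return useful / issued if issued else 1.0

# All lanes run the same number of iterations: no divergence.
uniform = [8] * WARP_SIZE

# One lane runs 32 iterations while the other 31 run a single iteration.
divergent = [1] * (WARP_SIZE - 1) + [32]

uniform_eff = simt_efficiency(uniform)
divergent_eff = simt_efficiency(divergent)
```

Under this model the uniform warp has an efficiency of 1.0, while the divergent warp issues 32 iterations across 32 lanes (1024 lane-iterations) to perform only 63 useful ones.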

31 citations

Proceedings Article•10.1109/DCC.2011.49•

Efficient JPEG2000 EBCOT Context Modeling for Massively Parallel Architectures

Jiri Matela¹, Vit Rusnak¹, Petr Holub¹•Institutions (1)

Masaryk University¹

29 Mar 2011

TL;DR: It is proved that the reformulation of the context modeling of EBCOT, which allows full parallelization for massively parallel architectures such as GPUs with their single instruction multiple threads architecture, is equivalent to the EBCOT specification in the JPEG2000 standard.

Abstract: Embedded Block Coding with Optimal Truncation (EBCOT) is the fundamental and computationally very demanding part of the compression process of the JPEG2000 image compression standard. In this paper, we present a reformulation of the context modeling of EBCOT that allows full parallelization for massively parallel architectures such as GPUs with their single instruction multiple threads architecture. We prove that the reformulation is equivalent to the EBCOT specification in the JPEG2000 standard. The behavior of the reformulated algorithm is demonstrated using the NVIDIA CUDA platform and compared to other state-of-the-art implementations.

22 citations

Journal Article•10.1016/J.FUTURE.2017.02.036•

Locality based warp scheduling in GPGPUs

Yang Zhang¹, Zuocheng Xing¹, Cang Liu¹, Chuan Tang¹, Qinglin Wang¹ +1 more•Institutions (1)

National University of Defense Technology¹

24 Feb 2017-Future Generation Computer Systems

TL;DR: A novel warp scheduling scheme is proposed to maintain data locality and to relieve cache pollution and thrashing. A new insertion method called LPI (Locality Protected Insertion) is put forward to reorder warps in the supervised warp queue, so that long-latency warps are better hidden by short-latency warps such as ALU operations and on-chip accesses.

20 citations

Book Chapter•10.1007/978-3-319-28228-2_3•

A GPU Implementation of the ASP Computation

Agostino Dovier¹, Andrea Formisano², Enrico Pontelli³, Flavio Vella⁴•Institutions (4)

University of Udine¹, University of Perugia², New Mexico State University³, Sapienza University of Rome⁴

18 Jan 2016

TL;DR: The paper presents the design and implementation of a conflict-driven ASP solver capable of exploiting the parallelism offered by GPUs. Preliminary experimental results confirm the feasibility and scalability of the approach and its potential to enhance the performance of ASP solvers.

Abstract: General Purpose Graphical Processing Units (GPUs) are affordable multi-core platforms, providing access to large number of cores, but at the price of a complex architecture with non-trivial synchronization and communication costs. This paper presents the design and implementation of a conflict-driven ASP solver, that is capable of exploiting the parallelism offered by GPUs. The proposed system builds on the notion of ASP computation, that avoids the generation of unfounded sets, enhanced by conflict analysis and learning. The proposed system uses the CPU exclusively for input and output, in order to reduce the negative impact of the expensive data transfers between the CPU and the GPU. All the solving components, i.e., the management of nogoods, the search strategy, backjumping, the search heuristics, conflict analysis and learning, and unit propagation, are performed on the GPU, by exploiting Single Instruction Multiple Threads (SIMT) parallelism. The preliminary experimental results confirm the feasibility and scalability of the approach, and the potential to enhance performance of ASP solvers.

19 citations

Year	Papers
2021	2
2020	1
2019	6
2018	3
2017	2
2016	4
