Symposium on Application Specific Processors

About: The Symposium on Application Specific Processors is an academic conference. It publishes mainly in the areas of instruction sets and field-programmable gate arrays. Over its lifetime, the conference has published 83 papers, which have received 1,469 citations.

Topics: Instruction set, Field-programmable gate array, Graphics processing unit, System on a chip, Application software

Papers

Proceedings Article • 10.1109/SASP.2011.5941084

A massively parallel implementation of QC-LDPC decoder on GPU

Guohui Wang¹, Michael Wu¹, Yang Sun¹, Joseph R. Cavallaro¹

Rice University¹

5 Jun 2011

TL;DR: A GPU-based implementation of a real-world digital signal processing (DSP) application, a low-density parity-check (LDPC) decoder, that exploits the multi-core computational power of the GPU to achieve a throughput of up to 100.3 Mbps.

Abstract: The graphics processing unit (GPU) provides a low-cost, flexible, software-based multi-core architecture for high-performance computing. However, efficiently mapping real-world applications to the GPU and fully utilizing its computational power remains very challenging. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: a low-density parity-check (LDPC) decoder. The paper describes the work required to map the algorithm onto the massively parallel architecture of the GPU and to fully utilize the GPU's computational resources to significantly boost performance. In addition, we propose several efficient data structures that reduce memory access latency and memory bandwidth requirements. Experimental results show that the proposed GPU-based LDPC decoding accelerator takes advantage of the GPU's multi-core computational power and achieves a throughput of up to 100.3 Mbps.
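The abstract does not state which decoding rule is mapped onto the GPU. As a hedged illustration only, assuming the widely used min-sum approximation of belief propagation (an assumption, not confirmed by the abstract), the per-check-node update that a parallel decoder could assign to one thread might look like the following sketch. The function name `min_sum_check_node` is hypothetical.

```python
def min_sum_check_node(incoming):
    """Min-sum check-node update (hypothetical helper, illustration only).

    incoming: variable-to-check LLR messages on this check node's edges.
    Returns one check-to-variable message per edge; each outgoing message
    is computed from all incoming messages except the one on its own edge.
    """
    n = len(incoming)
    if n < 2:
        raise ValueError("a check node needs at least two edges")
    mags = [abs(x) for x in incoming]
    # Keep the two smallest magnitudes so "min over the other edges" is O(1):
    # the edge holding the minimum gets min2, every other edge gets min1.
    idx1 = min(range(n), key=lambda i: mags[i])
    min1 = mags[idx1]
    min2 = min(mags[i] for i in range(n) if i != idx1)
    # Overall sign parity; a zero LLR is treated as positive (assumption).
    total_sign = 1
    for x in incoming:
        if x < 0:
            total_sign = -total_sign
    out = []
    for i, x in enumerate(incoming):
        own_sign = -1 if x < 0 else 1
        # Multiplying by own_sign removes this edge from the parity product.
        mag = min2 if i == idx1 else min1
        out.append(total_sign * own_sign * mag)
    return out
```

For example, `min_sum_check_node([2.0, -3.0, 1.0, -0.5])` returns `[0.5, -0.5, 0.5, -1.0]`. Because each check node depends only on its own incoming messages, many such updates can run independently, which is the kind of parallelism a GPU decoder can exploit.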

57 citations

Proceedings Article • 10.1109/SASP.2010.5521139

Minimizing write activities to non-volatile memory via scheduling and recomputation

Jingtong Hu¹, Chun Jason Xue², Wei-Che Tseng¹, Qingfeng Zhuge¹, Edwin H.-M. Sha¹

University of Texas at Dallas¹, City University of Hong Kong²

13 Jun 2010

TL;DR: This paper proposes two optimization techniques, write-aware scheduling and recomputation, to minimize write activity on non-volatile memory, and shows that these techniques can both shorten program completion time and extend the lifetime of non-volatile memory.

Abstract: Non-volatile memories, such as flash memory, Phase Change Memory (PCM), and Magnetic Random Access Memory (MRAM), have many characteristics that make them attractive as main memory for embedded DSP systems, including low cost, shock resistance, non-volatility, power economy, and high density. However, two common challenges must be addressed before non-volatile memory can be used as main memory in practice. First, non-volatile memory endures far fewer write/erase cycles than DRAM. Second, writes to non-volatile memory are slower than reads. Both challenges can be addressed by reducing the number of writes to non-volatile main memory. In this paper, we propose two optimization techniques, write-aware scheduling and recomputation, to minimize write activity on non-volatile memory. With the proposed techniques, we can both shorten program completion time and extend the lifetime of non-volatile memory. Experimental results show that the proposed techniques reduce the number of writes to non-volatile memory by 55.71% on average. As a result, the lifetime of non-volatile memory is extended to 2.5 times its original length on average. Program completion time is reduced on average by 55.32% on systems with NOR flash memory and by 40.69% on systems with NAND flash memory.

44 citations

Proceedings Article • 10.1109/SASP.2010.5521145

Accelerating DNA analysis applications on GPU clusters

Antonino Tumeo¹, Oreste Villa¹

Pacific Northwest National Laboratory¹

13 Jun 2010

TL;DR: This paper presents an efficient implementation of the Aho-Corasick algorithm for high-performance clusters accelerated with Graphics Processing Units (GPUs) and compares it with MPI-based and MPI-plus-pthreads implementations on a homogeneous cluster of x86 processors.

Abstract: DNA analysis is an emerging application of high-performance bioinformatics. Modern sequencing machines can produce, within a few hours, large streams of data that must be matched against exponentially growing databases of known fragments. Recognizing these patterns quickly and effectively could extend the scale and reach of the investigations biologists perform. Aho-Corasick, an exact multiple-pattern matching algorithm, often lies at the core of these applications. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high-performance clusters accelerated with Graphics Processing Units (GPUs). We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU, and then present an MPI-based implementation for a heterogeneous high-performance cluster. We compare this implementation with MPI-based and MPI-plus-pthreads implementations for a homogeneous cluster of x86 processors, discussing stability versus performance and the scaling of the solutions, and taking into consideration aspects such as the bandwidth between nodes.
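To make "exact, multiple pattern matching" concrete, the following is a minimal sequential sketch of the textbook Aho-Corasick automaton (trie, failure links, output sets). It does not reproduce the paper's GPU partitioning or MPI distribution, and the function names `build_automaton` and `search` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick goto, failure, and output tables (sketch)."""
    goto = [{}]    # goto[state][char] -> next state (trie edges)
    fail = [0]     # failure link of each state; state 0 is the root
    out = [set()]  # patterns recognized on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass: the failure link of a state points to the longest
    # proper suffix of its string that is also a prefix in the trie.
    # Depth-1 states keep their failure link at the root.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state (e.g. "CG" in "ACG").
            out[t] |= out[fail[t]]
    return goto, fail, out


def search(text, goto, fail, out):
    """Scan text once; return (end_index, pattern) for every match."""
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in sorted(out[state]):  # sorted for deterministic output
            matches.append((i, pat))
    return matches
```

For example, with the patterns `["ACG", "CG", "GT"]`, `search("ACGT", *build_automaton(["ACG", "CG", "GT"]))` returns `[(2, "ACG"), (2, "CG"), (3, "GT")]`: all patterns are found in a single pass over the text, which is why the algorithm suits long sequencing streams.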

44 citations

Proceedings Article • 10.1109/SASP.2008.4570784

An FPGA Design Space Exploration Tool for Matrix Inversion Architectures

Ali Irturk¹, Bridget Benson¹, Shahnam Mirzaei², Ryan Kastner³

University of California, Berkeley¹, University of California, Santa Barbara², University of California, San Diego³

8 Jun 2008

TL;DR: GUSTO is the first tool of its kind to provide automatic generation of a variety of general purpose matrix inversion architectures with different parameterization options, and provides an optimized application specific architecture with an average of 59% area decrease and 3X throughput increase over its general purpose architecture.

Abstract: Matrix inversion is a common function found in many algorithms used in wireless communication systems. As FPGAs become an increasingly attractive platform for wireless communication, it is important to understand the tradeoffs in designing a matrix inversion core on an FPGA. This paper describes a matrix inversion core generator tool, GUSTO, that we developed to ease the design space exploration across different matrix inversion architectures. GUSTO is the first tool of its kind to provide automatic generation of a variety of general purpose matrix inversion architectures with different parameterization options. GUSTO also provides an optimized application specific architecture with an average of 59% area decrease and 3X throughput increase over its general purpose architecture. The optimized architectures generated by GUSTO provide comparable results to published matrix inversion architecture implementations, but offer the advantage of providing the designer the ability to study the tradeoffs between architectures with different design parameters.

36 citations

Proceedings Article • 10.1109/SASP.2009.5226334

A memory optimization technique for software-managed scratchpad memory in GPUs

Maryam Moazeni¹, Alex A. T. Bui¹, Majid Sarrafzadeh¹

University of California, Los Angeles¹

27 Jul 2009

TL;DR: A memory optimization scheme based on graph coloring is proposed; it minimizes memory space usage by discovering opportunities for memory reuse, with the goal of maximizing application performance.

Abstract: With the advent of massively parallel, inexpensive platforms such as the G80 generation of NVIDIA GPUs, more real-life applications will be designed for or ported to these platforms. This requires structured transformation methods that remove existing application bottlenecks on these platforms. On such platforms, balancing the use of the on-chip resources that improve application performance is often non-intuitive, and some applications will run into resource limits. In this paper, we present a memory optimization technique for the software-managed scratchpad memory in the G80 architecture that alleviates the constraints on using the scratchpad memory. We propose a memory optimization scheme that minimizes memory space usage by discovering opportunities for memory reuse, with the goal of maximizing application performance. Our solution is based on graph coloring. We evaluated the scheme through a set of experiments on an image processing benchmark suite from the medical imaging domain, using an NVIDIA Quadro FX 5600 and CUDA. Implementations based on the proposed scheme showed up to a 37% decrease in execution time compared with their naive GPU implementations.
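The abstract says only that the scheme is "based on graph coloring". A common way to apply graph coloring to memory reuse, assumed here and not taken from the paper, is to build an interference graph whose nodes are scratchpad buffers, connect two buffers when their live ranges overlap, and color the graph so that buffers sharing a color share one memory slot. The buffer format, the largest-first ordering, and the name `assign_slots` are all hypothetical.

```python
def assign_slots(buffers):
    """Greedy interference-graph coloring for buffer reuse (sketch).

    buffers: dict name -> (size_bytes, first_use, last_use), where the live
    range [first_use, last_use] is inclusive on both ends (assumption).
    Returns (slot_of, total_bytes): the slot chosen for each buffer and the
    scratchpad space needed when each slot is sized for its largest member.
    """
    names = list(buffers)

    def overlaps(a, b):
        _, s1, e1 = buffers[a]
        _, s2, e2 = buffers[b]
        return s1 <= e2 and s2 <= e1

    # Interference graph: an edge means both buffers are live at once,
    # so they must not share a slot.
    adj = {n: {m for m in names if m != n and overlaps(n, m)} for n in names}

    slot_of = {}
    # Color larger buffers first (a common heuristic, assumed here);
    # ties are broken by name for deterministic results.
    for n in sorted(names, key=lambda k: (-buffers[k][0], k)):
        used = {slot_of[m] for m in adj[n] if m in slot_of}
        slot = 0
        while slot in used:
            slot += 1
        slot_of[n] = slot

    # Each slot must be large enough for the largest buffer assigned to it.
    slot_size = {}
    for n, s in slot_of.items():
        slot_size[s] = max(slot_size.get(s, 0), buffers[n][0])
    return slot_of, sum(slot_size.values())
```

For example, with `{"a": (1024, 0, 2), "b": (512, 1, 3), "c": (1024, 3, 5), "d": (256, 4, 6)}`, buffers `a` and `c` share slot 0 and `b` and `d` share slot 1, so the total is 1536 bytes instead of the 2816 bytes needed without reuse. Freed scratchpad space can then hold more data or allow more thread blocks per multiprocessor, which is one plausible route to the performance gains the abstract reports.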

32 citations

Performance Metrics

Papers

662

Citations

Number of papers from the conference in previous years
Year	Papers
2011	22
2010	20
2009	19
2008	19
1987	1
1982	2