Conference
Symposium on Application Specific Processors
About: Symposium on Application Specific Processors is an academic conference. It publishes mainly in the areas of Instruction set and Field-programmable gate array. Over its lifetime, the conference has published 83 papers, which have received 1469 citations.
Topics: Instruction set, Field-programmable gate array, Graphics processing unit, System on a chip, Application software
Papers
5 Jun 2011
TL;DR: A GPU-based implementation of a real-world digital signal processing (DSP) application, a low-density parity-check (LDPC) decoder, that takes advantage of the multi-core computational power of the GPU and achieves throughput of up to 100.3 Mbps.
Abstract: The graphics processing unit (GPU) is able to provide a low-cost and flexible software-based multi-core architecture for high performance computing. However, it is still very challenging to efficiently map real-world applications to the GPU and fully utilize its computational power. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: a low-density parity-check (LDPC) decoder. The paper shows the efforts we made to map the algorithm onto the massively parallel architecture of the GPU and fully utilize the GPU's computational resources to significantly boost performance. Moreover, several efficient data structures are proposed to reduce the memory access latency and the memory bandwidth requirement. Experimental results show that the proposed GPU-based LDPC decoding accelerator can take advantage of the multi-core computational power provided by the GPU and achieve throughput of up to 100.3 Mbps.
57 citations
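The abstract above does not say which decoding algorithm the authors mapped to the GPU, so the following is only a minimal sketch of the parity-check test that defines a valid codeword, which iterative LDPC decoders commonly use as a stopping condition. The matrix `H` is an assumption chosen for illustration: it is the small (7,4) Hamming parity-check matrix, which is not actually low-density, and the function names are hypothetical.

```python
def syndrome(H, word):
    """Return H * word over GF(2) as a list of parity-check results."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def is_codeword(H, word):
    """A word is a valid codeword when every parity check evaluates to 0."""
    return not any(syndrome(H, word))


# Toy parity-check matrix (assumption: (7,4) Hamming code, for illustration only).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

valid = [1, 1, 1, 0, 0, 0, 0]      # satisfies all three checks
corrupted = [1, 1, 1, 0, 1, 0, 0]  # same word with the bit at index 4 flipped

print(is_codeword(H, valid))       # all checks pass
print(syndrome(H, corrupted))      # nonzero entries mark the failed checks
```

Each row of `H` is an independent check, which is one reason this kind of computation lends itself to a massively parallel mapping.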
13 Jun 2010
TL;DR: This paper proposes two optimization techniques, write-aware scheduling and recomputation, to minimize write activities on non-volatile memory, and shows that these techniques can both speed up the completion time of programs and extend non-volatile memory's lifetime.
Abstract: Non-volatile memories, such as flash memory, Phase Change Memory (PCM), and Magnetic Random Access Memory (MRAM), have many characteristics that make them desirable as main memory for embedded DSP systems. These characteristics include low cost, shock resistance, non-volatility, power economy and high density. However, there are two common challenges we need to address before we can practically use non-volatile memory as main memory. First, non-volatile memory has limited write/erase cycles compared to DRAM. Second, a write operation is slower than a read operation on non-volatile memory. Both challenges can be addressed by reducing the number of write activities on non-volatile main memory. In this paper, we propose two optimization techniques, write-aware scheduling and recomputation, to minimize write activities on non-volatile memory. With the proposed techniques, we can both speed up the completion time of programs and extend non-volatile memory's lifetime. The experimental results show that the proposed techniques can reduce the number of write activities on non-volatile memory by 55.71% on average. Thus, the lifetime of non-volatile memory is extended to 2.5 times as long as before on average. The completion time of programs can be reduced by 55.32% on systems with NOR flash memory and by 40.69% on systems with NAND flash memory on average.
44 citations
13 Jun 2010
TL;DR: This paper presents an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with Graphic Processing Units (GPUs) and compares this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors.
Abstract: DNA analysis is an emerging application of high performance bioinformatics. Modern sequencing machinery is able to provide, in a few hours, large input streams of data which need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and quickly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with Graphic Processing Units (GPUs). We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present an MPI-based implementation for a heterogeneous high performance cluster. We compare this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.
44 citations
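For readers unfamiliar with the algorithm named in the abstract above, here is a minimal single-threaded sketch of textbook Aho-Corasick matching. It is not the paper's GPU or MPI implementation; the function names and the short DNA-like input are assumptions made for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links, and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first traversal: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


print(sorted(search("GATTACA", ["A", "ATT", "TTA", "CA"])))
```

The text is scanned once regardless of how many patterns are loaded, which is what makes the approach attractive when the pattern database is large.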
8 Jun 2008
TL;DR: GUSTO is the first tool of its kind to provide automatic generation of a variety of general purpose matrix inversion architectures with different parameterization options, and provides an optimized application specific architecture with an average of 59% area decrease and 3X throughput increase over its general purpose architecture.
Abstract: Matrix inversion is a common function found in many algorithms used in wireless communication systems. As FPGAs become an increasingly attractive platform for wireless communication, it is important to understand the tradeoffs in designing a matrix inversion core on an FPGA. This paper describes a matrix inversion core generator tool, GUSTO, that we developed to ease the design space exploration across different matrix inversion architectures. GUSTO is the first tool of its kind to provide automatic generation of a variety of general purpose matrix inversion architectures with different parameterization options. GUSTO also provides an optimized application specific architecture with an average of 59% area decrease and 3X throughput increase over its general purpose architecture. The optimized architectures generated by GUSTO provide comparable results to published matrix inversion architecture implementations, but offer the advantage of providing the designer the ability to study the tradeoffs between architectures with different design parameters.
36 citations
27 Jul 2009
TL;DR: A memory optimization scheme that minimizes the usage of memory space by discovering the chances of memory reuse with the goal of maximizing the application performance is proposed, based on graph coloring.
Abstract: With the appearance of massively parallel and inexpensive platforms such as the G80 generation of NVIDIA GPUs, more real-life applications will be designed or ported to these platforms. This requires structured transformation methods that remove existing application bottlenecks in these platforms. Balancing the usage of on-chip resources, used for improving the application performance, in these platforms is often non-intuitive and some applications will run into resource limits. In this paper, we present a memory optimization technique for the software-managed scratchpad memory in the G80 architecture to alleviate the constraints of using the scratchpad memory. We propose a memory optimization scheme that minimizes the usage of memory space by discovering opportunities for memory reuse, with the goal of maximizing the application performance. Our solution is based on graph coloring. We evaluated our memory optimization scheme with a set of experiments on an image processing benchmark suite in the medical imaging domain using NVIDIA Quadro FX 5600 and CUDA. Implementations based on our proposed memory optimization scheme showed up to a 37% decrease in execution time compared to their naive GPU implementations.
32 citations
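The abstract above states only that the scheme is based on graph coloring, so the sketch below shows the general idea rather than the authors' method: buffers whose live ranges overlap interfere, and a greedy coloring assigns each buffer a slot so that non-interfering buffers can share the same scratchpad space. The buffer names, live ranges, and function names are hypothetical.

```python
def build_interference(live_ranges):
    """Connect two buffers when their inclusive (start, end) ranges overlap."""
    names = list(live_ranges)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            start_a, end_a = live_ranges[a]
            start_b, end_b = live_ranges[b]
            if start_a <= end_b and start_b <= end_a:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_color(graph):
    """Give each node the smallest color not used by an already colored neighbor."""
    colors = {}
    # Visit high-degree nodes first, a common greedy heuristic.
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[m] for m in graph[node] if m in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Hypothetical buffers with the program steps over which each one is live.
live_ranges = {
    "tile_in": (0, 3),
    "tmp": (2, 5),
    "tile_out": (4, 8),
    "lut": (6, 9),
}

slots = greedy_color(build_interference(live_ranges))
print(slots)  # buffers with the same slot number can share memory
```

Greedy coloring does not guarantee the minimum number of slots in general; it is used here because it is short and easy to follow.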
Performance Metrics
| Year | Papers |
|---|---|
| 2011 | 22 |
| 2010 | 20 |
| 2009 | 19 |
| 2008 | 19 |
| 1987 | 1 |
| 1982 | 2 |