1. What are the contributions in "A quantitative performance analysis model for GPU architectures"?
The authors develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Because the model is based on the GPU's native instruction set, it predicts performance with a 5–15% error. To demonstrate the usefulness of the model, the authors analyze three representative real-world and already highly optimized programs: dense matrix multiply, a tridiagonal systems solver, and sparse matrix-vector multiply. The model provides a detailed quantitative analysis of performance, allowing the authors to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix-vector multiply by 60% and 18% respectively. The model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. Furthermore, applying the model to these codes allows the authors to suggest architectural improvements in hardware resource allocation, bank-conflict avoidance, block scheduling, and memory transaction granularity.
2. What future work is proposed in "A quantitative performance analysis model for GPU architectures"?
The authors' quantitative performance model for the GPU allows programmers and architects to identify optimization possibilities in modern GPU programs and architectures. The work has several limitations that the authors hope to address in future research: (1) incorporate a cache model in the memory system simulation (for texture memory and Fermi hardware caches), (2) develop a bank-conflict simulator for more general cases, (3) model the synchronization barrier's effects on warp-level parallelism, and (4) identify and model situations of non-perfect overlap of instruction execution, shared memory access, and global memory access. Today, programmers do not know how effective a potential optimization will be until they try it out. In contrast, the performance analysis tool enables programmers to identify performance bottlenecks, foresee the benefit of removing a given bottleneck quantitatively, and decide whether a potential optimization is worth the programming effort.
3. How does the padding technique improve CR?
The padding technique has shifted the bottleneck from shared memory to the instruction pipeline, which improves the performance of CR by 1.6×.
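To illustrate why padding relieves the shared memory bottleneck, here is a minimal sketch that counts bank conflicts for strided accesses. It is not the authors' simulator. It assumes 16 shared memory banks serviced per half-warp of 16 threads, a simplified access pattern in which thread t reads word t * stride (as in the power-of-two strides of a reduction), and a padding rule that inserts one extra word after every 16 words. The names `pad` and `conflict_degree` are hypothetical.

```python
from collections import Counter

NUM_BANKS = 16  # assumption: 16 banks, one half-warp (16 threads) per request


def pad(addr, num_banks=NUM_BANKS):
    """Hypothetical padding rule: skip one word after every num_banks words."""
    return addr + addr // num_banks


def conflict_degree(stride, padded, num_threads=16, num_banks=NUM_BANKS):
    """Largest number of distinct addresses in one half-warp that hit the same bank.

    A degree of 1 means the access is conflict-free; a degree of n means the
    hardware needs n serialized passes to serve the request.
    """
    addrs = [t * stride for t in range(num_threads)]  # simplified strided pattern
    if padded:
        addrs = [pad(a, num_banks) for a in addrs]
    banks = Counter(a % num_banks for a in set(addrs))
    return max(banks.values())


if __name__ == "__main__":
    for stride in (2, 4, 8, 16):
        print(stride, conflict_degree(stride, False), conflict_degree(stride, True))
```

In this simplified pattern, the unpadded conflict degree equals the stride (up to 16), while the padded layout is conflict-free for these strides. Once shared memory accesses stop serializing, another resource becomes the limiting factor, which is consistent with the bottleneck shifting to the instruction pipeline.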
4. Why do the authors assume different stages could still be overlapped?
Because GPU synchronization is local to a block, the authors assume that when there are multiple blocks, different stages can still be overlapped, and they therefore estimate a single performance bottleneck for the whole program.
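Estimating a single bottleneck can be sketched as taking the largest of the per-component time estimates, assuming instruction execution, shared memory access, and global memory access overlap perfectly. The function name `estimate_bottleneck` and all time values below are hypothetical placeholders in arbitrary units, not figures from the paper.

```python
def estimate_bottleneck(component_times):
    """Return (name, time) of the slowest component.

    Assumes perfect overlap between components, so the slowest one
    determines the estimated time for the whole program.
    """
    name = max(component_times, key=component_times.get)
    return name, component_times[name]


# Hypothetical estimates (arbitrary units) before an optimization.
before = {
    "instruction_pipeline": 6.0,
    "shared_memory": 9.0,
    "global_memory": 4.0,
}

# Hypothetical estimates after shared memory time is reduced.
after = dict(before, shared_memory=3.0)

if __name__ == "__main__":
    print(estimate_bottleneck(before))
    print(estimate_bottleneck(after))
```

With these placeholder numbers, the bottleneck moves from shared memory to the instruction pipeline once the shared memory time drops, which is the kind of shift the model is meant to expose before an optimization is implemented.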





