TL;DR: On August 24, 2006, the Standard Performance Evaluation Corporation (SPEC) announced CPU2006, which replaces CPU2000, and the SPEC CPU benchmarks are widely used in both industry and academia.
Abstract: On August 24, 2006, the Standard Performance Evaluation Corporation (SPEC) announced CPU2006 [2], which replaces CPU2000. The SPEC CPU benchmarks are widely used in both industry and academia [3].
TL;DR: This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks that improve over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements.
Abstract: Since benchmarks drive computer science research and industry product development, which ones we use and how we evaluate them are key questions for the community. Despite complex runtime tradeoffs due to dynamic compilation and garbage collection required for Java programs, many evaluations still use methodologies developed for C, C++, and Fortran. SPEC, the dominant purveyor of benchmarks, compounded this problem by institutionalizing these methodologies for their Java benchmark suite. This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks. We demonstrate that the complex interactions of (1) architecture, (2) compiler, (3) virtual machine, (4) memory management, and (5) application require more extensive evaluation than C, C++, and Fortran which stress (4) much less, and do not require (3). We use and introduce new value, time-series, and statistical metrics for static and dynamic properties such as code complexity, code size, heap composition, and pointer mutations. No benchmark suite is definitive, but these metrics show that DaCapo improves over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements. This paper takes a step towards improving methodologies for choosing and evaluating benchmarks to foster innovation in system design and implementation for Java and other managed languages.
TL;DR: The MinneSPEC inputset for the SPEC CPU 2000 benchmark suite is developed to facilitate efficient simulations with a range of benchmarkprograms and it is found that for some programs, the Minne SPECprofiles match the SPEC reference dataset program behavior very closely; for other programs, however, theMinneSPEC inputs produce significantly different programbehavior.
Abstract: Computer architects must determine how tomost effectively use finite computational resources whenrunning simulations to evaluate new architectural ideas.To facilitate efficient simulations with a range of benchmarkprograms, rn have developed the MinneSPEC inputset for the SPEC CPU 2000 benchmark suite. Thisnew workload allows computer architects to obtain simulationresults in a reasonable time using existing sirnulators.While the MinneSPEC workload is derived from thestandard SPEC CPU 2000 warklcad, it is a valid benchmarksuite in and of itself for simulation-based research.MinneSPEC also may be used to run Iarge numbers ofsimulations to find "sweet spots" in the evaluation parameterspace. This small number of promising designpoints subsequently may be investigated in more detailwith the full SPEC reference workload. In the processof developing the MinneSPEC datasets, we quantify itsdifferences in terms of function-level execution patterns,instruction mixes, and memory behaviors compared tothe SPEC programs when executed with the reference inputs.We find that for some programs, the MinneSPECprofiles match the SPEC reference dataset program behaviorvery closely. For other programs, however, theMinneSPEC inputs produce significantly different programbehavior. The MinneSPEC workload has been recognizedby SPEC and is distributed with Version 1.2 andhigher of the SPEC CPU 2000 benchmark suite.
TL;DR: This paper analyzes the SPEC CPU2006 benchmarks using performance counter based experimentation from several state of the art systems, and uses statistical techniques such as principal component analysis and clustering to draw inferences on the similarity of the benchmarks and the redundancy in the suite and arrive at meaningful subsets.
Abstract: The recently released SPEC CPU2006 benchmark suite is expected to be used by computer designers and computer architecture researchers for pre-silicon early design analysis. Partial use of benchmark suites by researchers, due to simulation time constraints, compiler difficulties, or library or system call issues is likely to happen; but a random subset can lead to misleading results. This paper analyzes the SPEC CPU2006 benchmarks using performance counter based experimentation from several state of the art systems, and uses statistical techniques such as principal component analysis and clustering to draw inferences on the similarity of the benchmarks and the redundancy in the suite and arrive at meaningful subsets.The SPEC CPU2006 benchmark suite contains several programs from areas such as artificial intelligence and includes none from the electronic design automation (EDA) application area. Hence there is a concern on the application balance in the suite. An analysis from the perspective of fundamental program characteristics shows that the included programs offer characteristics broader than the EDA programs' space. A subset of 6 integer programs and 8 floating point programs can yield most of the information from the entire suite.