Proceedings Article, DOI: 10.1109/IPDPS.2003.1213137
Quantifying locality effect in data access delay: memory logP
Kirk W. Cameron, Xian-He Sun +1 more
- 22 Apr 2003
- pp 48
TL;DR: This work presents a simple and useful model of point-to-point memory communication to predict and analyze the latency of memory copy, pack and unpack and uses the model to isolate contributions of hardware, middleware, and software to data transfers on Intel- and MIPS-based platforms.
Abstract: The application of hardware-parameterized models to distributed systems can result in omission of key bottlenecks such as the full cost of inter-node communication in a shared memory cluster. However, inclusion in the model of message characteristics and complex memory hierarchies may result in impractical models. Nonetheless, the growing gap between memory and CPU performance combined with the trend toward large scale clustered shared memory platforms implies an increased need to consider the impact of local memory communication on parallel processing in distributed systems. We present a simple and useful model of point-to-point memory communication to predict and analyze the latency of memory copy, pack and unpack. We use the model to isolate contributions of hardware, middleware, and software to data transfers on Intel- and MIPS-based platforms.
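The three operations named in the abstract (memory copy, pack, and unpack) can be illustrated with a minimal sketch. This is not the paper's model or its measurement code; the function names, buffer sizes, and stride below are hypothetical, chosen only to show how a contiguous copy differs from packing strided data into a contiguous buffer and unpacking it again. Timings taken this way in Python include interpreter overhead, so they indicate relative cost only.

```python
import time


def copy_contiguous(src: bytearray, n: int) -> bytearray:
    """Copy the first n bytes of src into a new contiguous buffer."""
    return bytearray(src[:n])


def pack_strided(src: bytearray, n: int, stride: int) -> bytearray:
    """Gather n bytes, one every `stride` bytes of src, into a contiguous buffer."""
    return bytearray(src[0:n * stride:stride])


def unpack_strided(packed: bytearray, dst: bytearray, stride: int) -> None:
    """Scatter a contiguous buffer back into dst, one byte every `stride` bytes."""
    dst[0:len(packed) * stride:stride] = packed


def time_call(fn, *args, repeats: int = 5) -> float:
    """Return the best-of-`repeats` wall-clock time of fn(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    n, stride = 1 << 20, 8  # hypothetical sizes: 1 MiB payload, 8-byte stride
    src = bytearray(range(256)) * (n * stride // 256)
    dst = bytearray(n * stride)
    packed = pack_strided(src, n, stride)

    print("copy   (s):", time_call(copy_contiguous, src, n))
    print("pack   (s):", time_call(pack_strided, src, n, stride))
    print("unpack (s):", time_call(unpack_strided, packed, dst, stride))
```

All three calls move the same number of payload bytes, so any difference in the printed times comes from the access pattern rather than the data volume, which is the locality effect the title refers to.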
Citations
Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi
Sabela Ramos, Torsten Hoefler +1 more
- 17 Jun 2013
TL;DR: An intuitive performance model for cache-coherent architectures is developed and used to develop several optimal and optimized algorithms for complex parallel data exchanges that beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries.
Improving the performance of MPI derived datatypes by optimizing memory-access cost
Byna, Gropp, Xian-He Sun, Thakur +3 more
- 01 Jan 2003
TL;DR: This paper presents performance results for a matrix-transpose example that demonstrate that the implementation of derived datatypes significantly outperforms both manual packing by the user and the existing derived-datatype code in the MPI implementation (MPICH).
Performance analysis and optimization of MPI collective operations on multi-core clusters
TL;DR: A new parallel computation model is proposed that abstracts the memory hierarchy of multi-core clusters in a unified way at both the vertical and horizontal levels, providing the theoretical underpinning for the optimal design of MPI collective operations.
45
$\log_{\rm n}{\rm P}$ and $\log_{3}{\rm P}$: Accurate Analytical Models of Point-to-Point Communication in Distributed Systems
Kirk W. Cameron, R. Ge, X.-H. Sun +2 more
TL;DR: This work presents a general software-parameterized model of point-to-point communication for use in performance prediction and evaluation, and illustrates the utility of the model in three ways, including to derive a simplified, useful, more accurate model and to express, compare, and contrast existing communication models.
44
Models of parallel computation: a survey and classification
TL;DR: State-of-the-art research on parallel computational models is reviewed, and the various models developed over the past decades are introduced and classified into three generations according to their target architecture features, especially memory organization.
26
References
A bridging model for parallel computation
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate bridging model for parallel computation, and results are given quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
4.1K
LogP: a practical model of parallel computation
David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, Thorsten von Eicken +7 more
TL;DR: The LogP model aims to be simple enough to be generally useful and to keep algorithm analysis tractable, while still leading to better programs in practice.
344
Reproducible Measurements of MPI Performance Characteristics
William Gropp, Ewing Lusk +1 more
- 26 Sep 1999
TL;DR: The mpptest suite of performance measurement programs, developed at Argonne National Laboratory, attempts to avoid mistakes in performance measurement and to obtain reproducible measures of MPI performance that are useful to both MPI implementors and MPI application writers.
Memory hierarchy considerations for cost-effective cluster computing
TL;DR: This study shows that the depth of the memory hierarchy is the most sensitive factor affecting the execution time for many types of workloads, and presents quantitative recommendations for building cost-effective clusters for different workloads.
Exploiting transparent remote memory access for non-contiguous- and one-sided-communication
J. Worringen, A. Gaer, F. Reker +2 more
- 15 Apr 2002
TL;DR: This paper presents two of the most recent optimizations in SCI-MPICH, an MPICH variant for the SCI interconnect, which make use of the global shared memory provided by this interconnect: efficient communication with non-contiguous MPI datatypes and one-sided communication according to the MPI-2 standard.
30