TILEPro64

Papers

Journal Article • 10.1109/MM.2007.4378780

On-Chip Interconnection Architecture of the Tile Processor

Wentzlaff, Griffin, Hoffmann, Bao, Edwards, Ramey, Mattina, Miao, Brown, Agarwal

01 Jan 2007 - IEEE Micro

555 citations

Proceedings Article • 10.1109/PDP.2013.27

Parallel Patterns for General Purpose Many-Core

Daniele Buono¹, Marco Danelutto¹, Silvia Lametti¹, Massimo Torquati¹

University of Pisa¹

27 Feb 2013

TL;DR: The paper discusses porting the FastFlow framework to the Tilera TilePro64 architecture. Results from synthetic benchmarks and real application kernels demonstrate the efficiency achieved with patterns on the TilePro64, both for programming stand-alone skeleton-based parallel applications and for accelerating existing sequential code.

Abstract: Efficient programming of general purpose many-core accelerators poses several challenging problems. The high number of cores available, the peculiarity of the interconnection network, and the complex memory hierarchy organization, all contribute to make efficient programming of such devices difficult. We propose to use parallel design patterns, implemented using algorithmic skeletons, to abstract and hide most of the difficulties related to the efficient programming of many-core accelerators. In particular, we discuss the porting of the FastFlow framework on the Tilera TilePro64 architecture and the results obtained running synthetic benchmarks as well as true application kernels. These results demonstrate the efficiency achieved while using patterns on the TilePro64 both to program stand-alone skeleton-based parallel applications and to accelerate existing sequential code.

21 citations

Posted Content • 10.3233/978-1-61499-381-0-63

An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors

Ashkan Tousimojarad¹, Wim Vanderbauwhede¹

University of Glasgow¹

31 Mar 2014 - arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This paper proposes a low-overhead thread mapping heuristic, based on the amount of time each CPU spends doing useful work, that fairly distributes workloads among the cores in a multiprogramming environment. The scheme can outperform the native GNU/Linux thread scheduler in both single-programming and multiprogramming environments.

Abstract: The emergence of multicore and manycore processors is set to change the parallel computing world. Applications are shifting towards increased parallelism in order to utilise these architectures efficiently. This leads to a situation where every application creates its desirable number of threads, based on its parallel nature and the system resources allowance. Task scheduling in such a multithreaded multiprogramming environment is a significant challenge. In task scheduling, not only the order of the execution, but also the mapping of threads to the execution resources is of a great importance. In this paper we state and discuss some fundamental rules based on results obtained from selected applications of the BOTS benchmarks on the 64-core TILEPro64 processor. We demonstrate how previously efficient mapping policies such as those of the SMP Linux scheduler become inefficient when the number of threads and cores grows. We propose a novel, low-overhead technique, a heuristic based on the amount of time spent by each CPU doing some useful work, to fairly distribute the workloads amongst the cores in a multiprogramming environment. Our novel approach could be implemented as a pragma similar to those in the new task-based OpenMP versions, or can be incorporated as a distributed thread mapping mechanism in future manycore programming frameworks. We show that our thread mapping scheme can outperform the native GNU/Linux thread scheduler in both single-programming and multiprogramming environments.

19 citations

Proceedings Article • 10.3233/978-1-61499-381-0-63

An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors

Ashkan Tousimojarad¹, Wim Vanderbauwhede¹

University of Glasgow¹

1 Mar 2014

TL;DR: In this article, the authors propose a low-overhead heuristic based on the amount of time spent by each CPU doing some useful work, to fairly distribute the workloads among the cores in a multiprogramming environment.

18 citations

Proceedings Article • 10.1109/IISWC.2012.6402921

Model-based, memory-centric performance and power optimization on NUMA multiprocessors

Chun-Yi Su¹, Dong Li², Dimitrios S. Nikolopoulos³, Kirk W. Cameron¹, Bronis R. de Supinski⁴, Edgar A. León⁴

Virginia Tech¹, Oak Ridge National Laboratory², Queen's University Belfast³, Lawrence Livermore National Laboratory⁴

4 Nov 2012

TL;DR: DyNUMA is a framework for dynamic optimization of programs on NUMA architectures. It combines memory-centric performance and energy models, which capture the interacting effects of system layout, program concurrency, data placement, and memory controller contention, with an artificial neural network whose input, output, and intermediate layers emulate program threads, memory controllers, processor cores, and their interactions.

Abstract: Non-Uniform Memory Access (NUMA) architectures are ubiquitous in HPC systems. NUMA along with other factors including socket layout, data placement, and memory contention significantly increase the search space to find an optimal mapping of applications to NUMA systems. This search space may be intractable for online optimization and challenging for efficient offline search. This paper presents DyNUMA, a framework for dynamic optimization of programs on NUMA architectures. DyNUMA uses simple, memory-centric, performance and energy models with non-linear terms to capture the complex and interacting effects of system layout, program concurrency, data placement, and memory controller contention. DyNUMA leverages an artificial neural network (ANN) with input, output, and intermediate layers that emulate program threads, memory controllers, processor cores, and their interactions. Using an ANN in conjunction with critical path analysis, DyNUMA autonomously optimizes programs for performance or energy-efficiency metrics. We used DyNUMA on a variety of benchmarks from the NPB and ASC Sequoia suites on three different architectures (a 16-core AMD Barcelona system, a 32-core AMD Magny-Cours system, and a 64-core Tilera TilePro64 system). Our results show that DyNUMA achieves on average 8.7% improvement in performance (12.9% in the best case), 16% improvement in Energy-Delay (30.6% in the best case) and 9.1% improvement in MFLOPS/Watt (10.7% in the best case) compared to the default Linux scheduling.

18 citations

Performance Metrics

No. of papers in the topic in previous years
Year	Papers
2017	1
2016	3
2015	3
2014	6
2013	6
2012	11