Topic

Dryad (programming)

About: Dryad (programming) is a research topic. Over its lifetime, 78 publications have appeared within this topic, receiving 3727 citations.

Papers

Proceedings Article•10.5555/1855741.1855742•

DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

Yuan Yu¹, Michael Isard¹, Dennis Fetterly¹, Mihai Budiu¹, Úlfar Erlingsson², Pradeep Kumar Gunda¹, Jon Currey¹

Microsoft¹, Reykjavík University²

8 Dec 2008

TL;DR: It is shown that excellent absolute performance can be attained (a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster), and that execution time scales near-linearly on representative applications as the authors vary the number of computers used for a job.

Abstract: DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language. A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan. We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained (a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster), as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job.
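
As a rough illustration of the programming model the abstract describes (a sequential-looking program built from side-effect-free transformations, which a system can then split across machines), here is a minimal Python sketch. It is not DryadLINQ and does not use its API; the function names (`partition`, `local_stage`, `merge_stage`, `run_plan`) and the toy word-count workload are assumptions made purely for illustration.

```python
from collections import Counter
from functools import reduce


def partition(records, n):
    """Split records into n partitions (round-robin); hypothetical helper."""
    return [records[i::n] for i in range(n)]


def local_stage(part):
    """Side-effect-free per-partition work: filter, then count by key."""
    return Counter(word for word in part if len(word) > 3)


def merge_stage(partials):
    """Combine per-partition results into one final result."""
    return reduce(lambda a, b: a + b, partials, Counter())


def run_plan(records, n_partitions=4):
    """Toy 'execution plan': partition, run the local stage, then merge.

    Because local_stage has no side effects, each call could in principle
    run on a different machine without changing the result.
    """
    parts = partition(records, n_partitions)
    partials = [local_stage(p) for p in parts]
    return merge_stage(partials)


records = ["dryad", "linq", "dryad", "sort", "graph", "dryad", "graph", "sql"]
result = run_plan(records)
```

The result does not depend on how many partitions are used, which is the property that lets a planner choose the degree of parallelism freely.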

942 citations

Journal Article•10.1111/2041-210X.12790•

Distance sampling with camera traps

Eric J. Howe¹, Stephen T. Buckland¹, Marie-Lyne Després-Einspenner², Hjalmar S. Kühl²

University of St Andrews¹, Max Planck Society²

01 Nov 2017-Methods in Ecology and Evolution

TL;DR: In this article, the densities of Maxwell's duikers were estimated using Distance software, and data describing start times of videos of the duikers were archived at the Dryad data repository.

Abstract: We thank the Robert Bosch Foundation, the Max Planck Society, and the University of St Andrews for funding. The data files from which densities of Maxwell's duikers were estimated using Distance software, and data describing start times of videos of Maxwell's duikers, have been archived at the Dryad data repository (https://doi.org/10.5061/dryad.b4c70) (Howe et al. 2017).

234 citations

Book Chapter•10.1007/978-3-642-12636-9_2•

High Performance Parallel Computing with Clouds and Cloud Technologies

Jaliya Ekanayake¹, Geoffrey C. Fox¹

Indiana University¹

19 Oct 2009

TL;DR: This paper first discusses large-scale data analysis using different MapReduce implementations, then presents a performance analysis of high-performance parallel applications on virtualized resources.

Abstract: Infrastructure services (Infrastructure-as-a-service), provided by cloud vendors, allow any user to provision a large number of compute instances fairly easily. Whether leased from public clouds or allocated from private clouds, utilizing these virtual resources to perform data/compute intensive analyses requires employing different parallel runtimes to implement such applications. Among many parallelizable problems, most “pleasingly parallel” applications can be performed fairly easily using MapReduce technologies such as Hadoop, CGL-MapReduce, and Dryad. However, many scientific applications, which have complex communication patterns, still require the low-latency communication mechanisms and rich set of communication constructs offered by runtimes such as MPI. In this paper, we first discuss large scale data analysis using different MapReduce implementations and then we present a performance analysis of high performance parallel applications on virtualized resources.

184 citations

Proceedings Article•

Camdoop: exploiting in-network aggregation for big data applications

Paolo Costa¹, Austin Donnelly², Antony Rowstron², Greg O'Shea²

Imperial College London¹, Microsoft²

25 Apr 2012

TL;DR: The authors build Camdoop, a MapReduce-like system running on CamCube, a cluster design that uses a direct-connect network topology with servers directly linked to other servers. They demonstrate that it significantly reduces network traffic and substantially improves performance compared with a version of Camdoop running over a switch and with two production systems, Hadoop and Dryad/DryadLINQ.

Abstract: Large companies like Facebook, Google, and Microsoft as well as a number of small and medium enterprises daily process massive amounts of data in batch jobs and in real time applications. This generates high network traffic, which is hard to support using traditional, oversubscribed, network infrastructures. To address this issue, several novel network topologies have been proposed, aiming at increasing the bandwidth available in enterprise clusters. We observe that in many of the commonly used workloads, data is aggregated during the process and the output size is a fraction of the input size. This motivated us to explore a different point in the design space. Instead of increasing the bandwidth, we focus on decreasing the traffic by pushing aggregation from the edge into the network. We built Camdoop, a MapReduce-like system running on CamCube, a cluster design that uses a direct-connect network topology with servers directly linked to other servers. Camdoop exploits the property that CamCube servers forward traffic to perform in-network aggregation of data during the shuffle phase. Camdoop supports the same functions used in MapReduce and is compatible with existing MapReduce applications. We demonstrate that, in common cases, Camdoop significantly reduces the network traffic and provides high performance increase over a version of Camdoop running over a switch and against two production systems, Hadoop and Dryad/DryadLINQ.
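
The traffic-reduction idea in this abstract (aggregating partial results at intermediate hops instead of shipping every intermediate pair to the final reducer) can be sketched in a few lines of Python. This is not Camdoop's implementation and models nothing about CamCube's topology; the tree shape, the `fan_in` parameter, and the function names are assumptions for illustration, and "traffic" is approximated by the number of key/value pairs delivered on the final hop.

```python
from collections import Counter


def forward_raw(leaf_outputs):
    """Baseline: every key/value pair reaches the final reducer unchanged."""
    pairs_delivered = sum(len(out) for out in leaf_outputs)
    total = Counter()
    for out in leaf_outputs:
        for key, value in out:
            total[key] += value
    return total, pairs_delivered


def aggregate_in_tree(leaf_outputs, fan_in=2):
    """Combine partial results hop by hop.

    Returns the final totals and the number of pairs delivered on the
    last hop (the one feeding the final reducer).
    """
    if not leaf_outputs:
        return Counter(), 0
    # Each leaf first combines its own output by key.
    level = []
    for out in leaf_outputs:
        combined = Counter()
        for key, value in out:
            combined[key] += value
        level.append(combined)
    last_hop_pairs = sum(len(c) for c in level)
    # Merge groups of `fan_in` partial results until one remains.
    while len(level) > 1:
        last_hop_pairs = sum(len(c) for c in level)
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [sum(group, Counter()) for group in groups]
    return level[0], last_hop_pairs


# Four hypothetical mappers, each emitting three pairs over two keys.
leaf_outputs = [[("a", 1), ("b", 1), ("a", 1)] for _ in range(4)]
raw_total, raw_pairs = forward_raw(leaf_outputs)
tree_total, tree_pairs = aggregate_in_tree(leaf_outputs)
```

Both paths compute the same totals; the aggregated path simply delivers fewer pairs to the final reducer, which matches the abstract's observation that output size is often a fraction of input size.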

179 citations

Proceedings Article•10.1145/1646468.1646474•

Cloud technologies for bioinformatics applications

Xiaohong Qiu¹, Jaliya Ekanayake¹, Scott Beason¹, Thilina Gunarathne¹, Geoffrey C. Fox¹, Roger Barga², Dennis Gannon²

Indiana University¹, Microsoft²

16 Nov 2009

TL;DR: This paper presents the authors' experience applying two cloud technologies, Apache Hadoop and Microsoft DryadLINQ, to two bioinformatics applications with the above characteristics, and compares the performance of these technologies on virtual and non-virtual hardware platforms.

Abstract: Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package to identify HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768 core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different databases (PhyloD). There are different structures for final stages in each application.
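
One way to picture the "doubly data parallel step" mentioned in this abstract is as a set of independent tasks, one per pair of items. The Python sketch below assumes the pairs are drawn from a single collection and uses a Hamming distance as a stand-in for the real alignment computation; neither the function names nor the toy sequences come from the paper.

```python
from itertools import combinations


def hamming(a, b):
    """Toy dissimilarity: number of differing positions (equal-length strings)."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_tasks(sequences):
    """Enumerate each unordered pair of indices once; one task per pair."""
    return list(combinations(range(len(sequences)), 2))


def run_pairwise(sequences):
    """Evaluate all pairs.

    No task reads another task's output, so the tasks could be executed
    in any order, or distributed across workers.
    """
    return {
        (i, j): hamming(sequences[i], sequences[j])
        for i, j in pairwise_tasks(sequences)
    }


sequences = ["GATTACA", "GACTACA", "GATTTCA", "CATTACA"]
distances = run_pairwise(sequences)
```

With n sequences there are n(n-1)/2 such tasks, which is why this pattern suits runtimes built for many independent jobs.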

172 citations

Performance Metrics

Papers

749

Citations

No. of papers in the topic in previous years
Year	Papers
2020	1
2019	2
2018	2
2017	6
2016	6
2015	7
