Efficient Metadata Generation to Enable Interactive Data Discovery over Large-Scale Scientific Data Collections

doi:10.1109/CLOUDCOM.2010.99

Proceedings Article•10.1109/CLOUDCOM.2010.99

Sangmi Lee Pallickara, +3 more

- 30 Nov 2010

- pp 573-580

26

TL;DR: The Granules cloud runtime is used to orchestrate the MapReduce computations that extract metadata from the datasets, and fine-grained metadata is automatically extracted to provide a micro view and profile of the large dataset to the users.

Abstract: Discovering the correct dataset efficiently is critical for computations and effective simulations in scientific experiments. In contrast to searching web documents over the Internet, massive binary datasets are difficult to browse or search. Users must select a reliable data publisher from the large collection of data services available over the Internet. Once a publisher is selected, the user must then discover the dataset that matches the computation's needs, among tens of thousands of large data packages that are available. Some of the data hosting services provide advanced data search interfaces, but their search scope is often limited to local datasets. Because scientific datasets are often encoded as binary data formats, querying or validating missing data over hundreds of megabytes of a binary file involves a compute-intensive decoding process. We have developed a system, GLEAN, that provides an efficient data discovery environment for users in scientific computing. Fine-grained metadata is automatically extracted to provide a micro view and profile of the large dataset to the users. We have used the Granules cloud runtime to orchestrate the MapReduce computations that extract metadata from the datasets. Here we focus on the overall architecture of the system and how it enables efficient data discovery. We applied our framework to a data discovery application in the atmospheric science domain. This paper includes a performance evaluation with observational datasets.
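
The MapReduce-style metadata extraction mentioned in the abstract can be illustrated with a minimal sketch. This is not the GLEAN implementation and does not use the Granules API. It assumes, purely for illustration, that each binary file has already been decoded into chunks of `(variable, value)` observations, with `None` marking a missing value. The names `map_chunk`, `merge_stats`, and `extract_metadata` are hypothetical.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step (hypothetical): summarize one decoded chunk of observations.

    `chunk` is an iterable of (variable, value) pairs; value is None if missing.
    """
    stats = {}
    for variable, value in chunk:
        s = stats.setdefault(
            variable, {"count": 0, "missing": 0, "min": None, "max": None}
        )
        s["count"] += 1
        if value is None:
            s["missing"] += 1
            continue
        s["min"] = value if s["min"] is None else min(s["min"], value)
        s["max"] = value if s["max"] is None else max(s["max"], value)
    return stats


def merge_stats(a, b):
    """Reduce step (hypothetical): merge two partial per-variable summaries."""
    merged = {name: dict(s) for name, s in a.items()}
    for variable, s in b.items():
        if variable not in merged:
            merged[variable] = dict(s)
            continue
        m = merged[variable]
        m["count"] += s["count"]
        m["missing"] += s["missing"]
        for key, pick in (("min", min), ("max", max)):
            candidates = [x for x in (m[key], s[key]) if x is not None]
            m[key] = pick(candidates) if candidates else None
    return merged


def extract_metadata(chunks):
    """Run the map step over every chunk, then fold the partial results."""
    return reduce(merge_stats, map(map_chunk, chunks), {})


if __name__ == "__main__":
    # Made-up sample observations, for illustration only.
    chunks = [
        [("temperature", 280.5), ("temperature", None)],
        [("temperature", 275.0), ("pressure", 1013.2)],
    ]
    print(extract_metadata(chunks))
```

Because each chunk is summarized independently and the merge is associative, the map step could run in parallel across files, which is the property a MapReduce runtime exploits. The resulting per-variable profile (counts, missing values, value range) is one plausible form of the fine-grained metadata the abstract describes.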

Citations

•Journal Article•10.1002/ASI.24165

Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines

Kathleen Gregory, +5 more

- 01 May 2019

- Journal of the Association for Informati...

TL;DR: In this article, a review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data in selected disciplines, and two analytical frameworks are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.

62

•Posted Content

Searching Data: A Review of Observational Data Retrieval Practices.

Kathleen Gregory, +4 more

- 21 Jul 2017

TL;DR: Two analytical frameworks, rooted in information retrieval and science and technology studies, are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.

56

•Dissertation

A perspective and framework for the conceptual modelling of knowledge

Diarmuid Pigott

- 01 Jan 2012

TL;DR: This thesis addresses the lack of a systematic method for the conceptual modelling of knowledge by presenting a novel perspective for dynamic knowledge exchange together with an associated modelling framework and tools and introduces appropriate modelling constructs, the Functional Entity and the Knowledge Relation.

43

•Journal Article•10.1016/J.FUTURE.2011.05.010

Towards efficient data search and subsetting of large-scale atmospheric datasets

Sangmi Lee Pallickara, +2 more

- 01 Jan 2012

- Future Generation Computer Systems

TL;DR: An infrastructure, the Atmospheric Data Discovery System (ADDS), that provides an efficient data discovery environment for observational datasets in the atmospheric sciences and automatically extracts and indexes fine-grained metadata to support complex querying capabilities.

22

•Journal Article•10.1016/J.ENVSOFT.2013.10.012

Managing a community shared vocabulary for hydrologic observations

Jeffery S. Horsburgh, +3 more

- 01 Feb 2014

- Environmental Modelling and Software

TL;DR: A community shared vocabulary and its supporting management tools that can be used by data publishers to populate metadata describing hydrologic observations to ensure that data from multiple sources published within the CUAHSI HIS are semantically consistent.

21
