Proceedings Article, DOI: 10.1109/CLOUDCOM.2010.99
Efficient Metadata Generation to Enable Interactive Data Discovery over Large-Scale Scientific Data Collections
Sangmi Lee Pallickara,Shrideep Pallickara,Milija Zupanski,Stephen Sullivan +3 more
- 30 Nov 2010
- pp 573-580
TL;DR: The Granules cloud runtime is used to orchestrate the MapReduce computations that extract metadata from the datasets, and fine-grained metadata is automatically extracted to provide a micro view and profile of the large dataset to the users.
Abstract: Discovering the correct dataset efficiently is critical for computations and effective simulations in scientific experiments. In contrast to searching web documents over the Internet, massive binary datasets are difficult to browse or search. Users must select a reliable data publisher from the large collection of data services available over the Internet. Once a publisher is selected, the user must then discover the dataset that matches the computation's needs, among tens of thousands of large data packages that are available. Some of the data hosting services provide advanced data search interfaces, but their search scope is often limited to local datasets. Because scientific datasets are often encoded as binary data formats, querying or validating missing data over hundreds of megabytes of a binary file involves a compute-intensive decoding process. We have developed a system, GLEAN, that provides an efficient data discovery environment for users in scientific computing. Fine-grained metadata is automatically extracted to provide a micro view and profile of the large dataset to the users. We have used the Granules cloud runtime to orchestrate the MapReduce computations that extract metadata from the datasets. Here we focus on the overall architecture of the system and how it enables efficient data discovery. We applied our framework to a data discovery application in the atmospheric science domain. This paper includes a performance evaluation with observational datasets.
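The metadata extraction step described in the abstract can be illustrated with a minimal sketch. It is not GLEAN's implementation and does not use the Granules API; it only shows the general map/reduce shape of the idea in plain Python. The record layout (`RECORD`), the use of NaN as a missing-value marker, and the function names `encode`, `map_chunk`, `merge`, and `extract_metadata` are all assumptions made for illustration.

```python
import math
import struct
from functools import reduce

# Hypothetical record layout (an assumption, not a real observational format):
# a 4-byte integer variable id followed by an 8-byte float, little-endian.
RECORD = struct.Struct("<id")
MISSING = float("nan")  # assumed marker for a missing observation


def encode(records):
    """Pack (variable_id, value) pairs into a binary blob for the demo."""
    return b"".join(RECORD.pack(var, val) for var, val in records)


def map_chunk(blob):
    """Map step: decode one binary chunk into per-variable partial summaries."""
    partial = {}
    for var, val in RECORD.iter_unpack(blob):
        s = partial.setdefault(
            var, {"count": 0, "missing": 0, "min": math.inf, "max": -math.inf}
        )
        s["count"] += 1
        if math.isnan(val):
            s["missing"] += 1
        else:
            s["min"] = min(s["min"], val)
            s["max"] = max(s["max"], val)
    return partial


def merge(a, b):
    """Reduce step: combine two partial summaries without mutating either."""
    out = {var: dict(s) for var, s in a.items()}
    for var, s in b.items():
        if var not in out:
            out[var] = dict(s)
            continue
        t = out[var]
        t["count"] += s["count"]
        t["missing"] += s["missing"]
        t["min"] = min(t["min"], s["min"])
        t["max"] = max(t["max"], s["max"])
    return out


def extract_metadata(chunks):
    """Run the map step on every chunk, then fold the partial results."""
    return reduce(merge, map(map_chunk, chunks), {})


# Two chunks stand in for pieces of a large binary file.
chunks = [
    encode([(1, 2.0), (1, MISSING), (2, -5.0)]),
    encode([(1, 7.5), (2, 3.0)]),
]
meta = extract_metadata(chunks)
```

Because each chunk is summarized independently and the merge is associative, the map step could in principle run on separate workers. A query such as "which variables have missing values" then reads only the small `meta` summary instead of decoding the binary data again.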
Citations
Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines
TL;DR: This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data in selected disciplines. Two analytical frameworks are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.
•Posted Content
Searching Data: A Review of Observational Data Retrieval Practices.
Kathleen Gregory,Paul Groth,Helena Cousijn,Andrea Scharnhorst,Sally Wyatt +4 more
- 21 Jul 2017
TL;DR: Two analytical frameworks, rooted in information retrieval and science and technology studies, are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.
•Dissertation
A perspective and framework for the conceptual modelling of knowledge
Diarmuid Pigott
- 01 Jan 2012
TL;DR: This thesis addresses the lack of a systematic method for the conceptual modelling of knowledge by presenting a novel perspective for dynamic knowledge exchange together with an associated modelling framework and tools and introduces appropriate modelling constructs, the Functional Entity and the Knowledge Relation.
Towards efficient data search and subsetting of large-scale atmospheric datasets
TL;DR: An infrastructure, the Atmospheric Data Discovery System (ADDS), that provides an efficient data discovery environment for observational datasets in the atmospheric sciences and automatically extracts and indexes fine-grained metadata to support complex querying capabilities.
Managing a community shared vocabulary for hydrologic observations
TL;DR: A community shared vocabulary and its supporting management tools that can be used by data publishers to populate metadata describing hydrologic observations to ensure that data from multiple sources published within the CUAHSI HIS are semantically consistent.