Data discovery

Papers published on a yearly basis

Papers

Journal Article•10.1029/2007EO020003

Online analysis enhances use of NASA Earth science data

James G. Acker, Gregory Leptoukh¹•Institutions (1)

Goddard Space Flight Center¹

09 Jan 2007-Eos, Transactions American Geophysical Union

TL;DR: Giovanni, the Goddard Earth Sciences Data and Information Services Center (GES DISC) Interactive Online Visualization and Analysis Infrastructure, has provided researchers with advanced capabilities to perform data exploration and analysis with observational data from NASA Earth observation satellites.

Abstract: Giovanni, the Goddard Earth Sciences Data and Information Services Center (GES DISC) Interactive Online Visualization and Analysis Infrastructure, has provided researchers with advanced capabilities to perform data exploration and analysis with observational data from NASA Earth observation satellites. In the past 5-10 years, examining geophysical events and processes with remote-sensing data required a multistep process of data discovery, data acquisition, data management, and ultimately data analysis. Giovanni accelerates this process by enabling basic visualization and analysis directly on the World Wide Web. In the last two years, Giovanni has added new data acquisition functions and expanded analysis options to increase its usefulness to the Earth science research community.

915 citations

Journal Article•10.1016/J.IJINFOMGT.2017.12.002

Social media analytics – Challenges in topic discovery, data collection, and data preparation

Stefan Stieglitz¹, Milad Mirbabaie¹, Björn Ross¹, Christoph Neuberger²•Institutions (2)

University of Duisburg-Essen¹, Ludwig Maximilian University of Munich²

01 Apr 2018-International Journal of Information Management

TL;DR: An extended, structured literature analysis identifies the most important challenges for researchers and discusses potential solutions; the findings are used to extend an existing framework on social media analytics.

739 citations

Proceedings Article•10.1145/1807167.1807278

Data warehousing and analytics infrastructure at Facebook

Ashish Thusoo¹, Zheng Shao¹, Suresh Anthony¹, Dhruba Borthakur¹, Namit Jain¹, Joydeep Sen Sarma¹, Raghotham Murthy¹, Hao Liu¹, +4 more•Institutions (1)

Facebook¹

6 Jun 2010

TL;DR: This paper presents how Scribe, Hadoop and Hive together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook, and how they enabled a data warehouse that stores more than 15PB of data and loads more than 60TB of new data every day.

Abstract: Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.

500 citations

Proceedings Article•10.1145/3233547.3233725

The National Sleep Research Resource: Towards a Sleep Data Commons

Guo-Qiang Zhang¹, Licong Cui¹, Remo Mueller², Shiqiang Tao¹, Matthew Kim², Michael Rueschman², Sara Mariani², Daniel Mobley², Susan Redline², +5 more•Institutions (2)

University of Kentucky¹, Harvard University²

15 Aug 2018

TL;DR: The NSRR provides a single point of access to analysis-ready physiological signals from polysomnography obtained from multiple sources, and a wide variety of clinical data to facilitate sleep research, and provides the design of a functional architecture for implementing a Sleep Data Commons.

Abstract: Objective: The gold standard for diagnosing sleep disorders is polysomnography, which generates extensive data about biophysical changes occurring during sleep. We developed the National Sleep Research Resource (NSRR), a comprehensive system for sharing sleep data. The NSRR embodies elements of a data commons aimed at accelerating research to address critical questions about the impact of sleep disorders on important health outcomes. Approach: We used a metadata-guided approach, with a set of common sleep-specific terms enforcing uniform semantic interpretation of data elements across three main components: (1) annotated datasets; (2) user interfaces for accessing data; and (3) computational tools for the analysis of polysomnography recordings. We incorporated the process for managing dataset-specific data use agreements, evidence of Institutional Review Board review, and the corresponding access control in the NSRR web portal. The metadata-guided approach facilitates structural and semantic interoperability, ultimately leading to enhanced data reusability and scientific rigor. Results: The authors curated and deposited retrospective data from 10 large, NIH-funded sleep cohort studies, including several from the Trans-Omics for Precision Medicine (TOPMed) program, into the NSRR. The NSRR currently contains data on 26,808 subjects and 31,166 signal files in European Data Format. Since its launch in April 2014, over 3000 registered users have downloaded over 130 terabytes of data. Conclusions: The NSRR offers a use case and an example for creating a full-fledged data commons. It provides a single point of access to analysis-ready physiological signals from polysomnography obtained from multiple sources, and a wide variety of clinical data to facilitate sleep research. The NIH Data Commons (or Commons) is an ambitious vision for a shared virtual space to allow digital objects to be stored and computed upon by the scientific community. The Commons would allow investigators to find, manage, share, use and reuse data, software, metadata and workflows. It imagines an ecosystem that makes digital objects Findable, Accessible, Interoperable and Reusable (FAIR). Four components are considered integral parts of the Commons: a computing resource for accessing and processing of digital objects; a "digital object compliance model" that describes the properties of digital objects that enable them to be FAIR; datasets that adhere to the digital object compliance model; and software and services to facilitate access to and use of data. This paper describes the contributions of NSRR along several aspects of the Commons vision: metadata for sleep research digital objects; a collection of annotated sleep data sets; and interfaces and tools for accessing and analyzing such data. More importantly, the NSRR provides the design of a functional architecture for implementing a Sleep Data Commons. The NSRR also reveals complexities and challenges involved in making clinical sleep data conform to the FAIR principles. Future directions: Shared resources offered by emerging resources such as cloud instances provide promising platforms for the Data Commons. However, simply expanding storage or adding compute power may not allow us to cope with the rapidly expanding volume and increasing complexity of biomedical data. Concurrent efforts must be spent to address digital object organization challenges. To make our approach future-proof, we need to continue advancing research in data representation and interfaces for human-data interaction. A possible next phase of NSRR is the creation of a universal self-descriptive sequential data format. The idea is to break large, unstructured, sequential data files into minimal, semantically meaningful fragments. Such fragments can be indexed, assembled, retrieved, rendered, or repackaged on-the-fly, for multitudes of application scenarios. Data points in such a fragment will be locally embedded with relevant metadata labels, governed by terminology and ontology. Potential benefits of such an approach may include precise levels of data access, increased analysis readiness with on-the-fly data conversion, multi-level data discovery and support for effective web-based visualization of contents in large sequential files.

405 citations

Proceedings Article•10.1145/3308558.3313685

Google Dataset Search: Building a search engine for datasets in an open Web ecosystem

Dan Brickley¹, Matthew Burgess¹, Natasha Noy¹•Institutions (1)

Google¹

13 May 2019

TL;DR: Google Dataset Search is a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. It relies on an open ecosystem in which dataset owners and providers publish semantically enhanced metadata on their own sites.

Abstract: There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and providing data journalists easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the “long tail” of the Web. In this paper, we discuss both social and technical challenges in building this type of tool, and the lessons that we learned from this experience.

316 citations

Year	Papers
2025	9
2024	12
2023	13
2022	28
2021	52
2020	59
