Proceedings Article, DOI: 10.1109/HPEC.2013.6670330
Understanding query performance in Accumulo
Scott M. Sawyer, B. David O'Gwynn, An Tran, Tamara Yu +3 more
- 21 Nov 2013
- pp 1-6
TL;DR: An Apache Accumulo-based big data system designed for a network situational awareness application is studied: its storage schema and data retrieval requirements are analyzed, and the corresponding Accumulo performance bottlenecks are characterized.
Abstract: Open-source, BigTable-like distributed databases provide a scalable storage solution for data-intensive applications. The simple key-value storage schema provides fast record ingest and retrieval, nearly independent of the quantity of data stored. However, real applications must support non-trivial queries that require careful key design and value indexing. We study an Apache Accumulo-based big data system designed for a network situational awareness application. The application's storage schema and data retrieval requirements are analyzed. We then characterize the corresponding Accumulo performance bottlenecks. Queries are shown to be communication-bound and server-bound in different situations. Inefficiencies in the open-source communication stack and filesystem limit network and I/O performance, respectively. Additionally, in some situations, parallel clients can contend for server-side resources. Maximizing data retrieval rates for practical queries requires effective key design, indexing, and client parallelization.
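The abstract's point about key design and value indexing can be illustrated with a small sketch. This is not the paper's schema and not the Accumulo API: `SortedKV`, `prefix_range`, `ingest`, `lookup`, the `|` separator, and the field names are all hypothetical, chosen only to show why a sorted key-value table answers range and prefix queries cheaply, and why a query on a field value needs a separate index table.

```python
import bisect


class SortedKV:
    """Toy in-memory stand-in for a sorted key-value table (hypothetical, not Accumulo)."""

    def __init__(self):
        self._keys = []   # row keys kept in lexicographic order
        self._vals = {}   # row key -> value

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, key):
        return self._vals[key]

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._vals[key]


def prefix_range(prefix):
    """Return (start, stop) covering keys that begin with prefix.

    Assumes no key contains the character U+FFFF.
    """
    return prefix, prefix + "\uffff"


events = SortedKV()  # row key: "<ISO timestamp>|<event id>"
index = SortedKV()   # row key: "<field>=<value>|<event row key>"


def ingest(timestamp, event_id, fields):
    """Store one event and one index entry per field (assumes values contain no '|')."""
    row = f"{timestamp}|{event_id}"
    events.put(row, fields)
    for name, value in fields.items():
        index.put(f"{name}={value}|{row}", "")
    return row


def lookup(name, value):
    """Find events by field value: scan the index, then fetch each event row."""
    start, stop = prefix_range(f"{name}={value}|")
    rows = [key.split("|", 1)[1] for key, _ in index.scan(start, stop)]
    return [(row, events.get(row)) for row in rows]
```

With timestamp-leading row keys, a time window is one contiguous scan of `events`, for example `events.scan("2013-11-21", "2013-11-22")`. A query such as `lookup("src_ip", "10.0.0.1")` instead costs one index scan plus one fetch per matching event, which is the kind of access pattern the abstract says demands careful indexing and client parallelization.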
Citations
Achieving 100,000,000 database inserts per second using Accumulo and D4M
Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Vijay Gadepally, Matthew Hubbell, Peter Michaleas, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee +12 more
TL;DR: Apache Accumulo is an open-source, relaxed-consistency database that is widely used for government applications and is designed to deliver high performance on unstructured data such as graphs of network data.
52
Graphulo implementation of server-side sparse matrix multiply in the Accumulo database
Dylan Hutchison, Jeremy Kepner, Vijay Gadepally, Adam Fuchs +3 more
- 12 Nov 2015
TL;DR: A server-side implementation of GraphBLAS sparse matrix multiplication that leverages Accumulo's native, high-performance iterators and is offered as a core component of the Graphulo library, which will deliver matrix math primitives for graph analytics within Accumulo.
46
Modeling and Indexing Spatiotemporal Trajectory Data in Non-Relational Databases
Berkay Aydin, Vijay Akkineni, Rafal A. Angryk +2 more
- 01 Jan 2016
TL;DR: In this chapter, the important aspects of non-relational (NoSQL) databases for storing large-scale spatiotemporal trajectory data are investigated and two data storage schemata are proposed for storing trajectories.
18
Lustre, Hadoop, Accumulo
Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Lauren Edwards, Vijay Gadepally, Matthew Hubbell, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther +13 more
- 01 Sep 2015
TL;DR: In this article, the authors compare Lustre, Hadoop, and Accumulo on a hypothetical common cluster and show that Lustre provides 2x more storage capacity, is less likely to lose data during 3 simultaneous drive failures, and provides higher bandwidth on general-purpose workloads.
15
CloudDBGuard: A framework for encrypted data storage in NoSQL wide column stores
Lena Wiese, Tim Waage, Michael Brenner +2 more
- 01 Mar 2020
TL;DR: This article comprehensively presents details of the CloudDBGuard framework, which allows using property-preserving encryption in unmodified wide column stores, hides the complexity of the encryption and decryption process, and allows various adjustments for specific use cases in order to maximize security, functionality, and performance.
14
References
Bigtable: A Distributed Storage System for Structured Data
Fay W. Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Deepak Chandra, Andrew Fikes, Robert E. Gruber +8 more
TL;DR: The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.
3.5K
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)
Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jörg Schad +5 more
- 01 Sep 2010
TL;DR: This paper proposes a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'), and shows the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.
747
YCSB++: benchmarking and performance debugging advanced features in scalable table stores
Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, Julio Lopez, Garth A. Gibson, Adam Fuchs, Billie Rinaldi +8 more
- 26 Oct 2011
TL;DR: YCSB++ is described, a set of extensions to the Yahoo! Cloud Serving Benchmark that includes multi-tester coordination for increased load and eventual consistency measurement, multi-phase workloads to quantify the consequences of work deferment and the benefits of anticipatory configuration optimization, and abstract APIs for explicit incorporation of advanced features in benchmark tests.
Dynamic distributed dimensional data model (D4M) database and computation system
Jeremy Kepner, William Arcand, William Bergeron, Nadya T. Bliss, Robert A. Bond, Chansup Byun, Gary R. Condon, Kenneth Gregson, Matthew Hubbell, Jonathan Kurz, Andrew McCabe, Peter Michaleas, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee +15 more
- 25 Mar 2012
TL;DR: D4M (Dynamic Distributed Dimensional Data Model) has been developed to provide a mathematically rich interface to tuple stores (and structured query language “SQL” databases) and it is possible to create composable analytics with significantly less effort than using traditional approaches.
Driving big data with big compute
Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Jeremy Kepner, Andrew McCabe, Peter Michaleas, Julie Mullen, David O'Gwynn, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee +13 more
- 01 Sep 2012
TL;DR: The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds, including LLGrid MapReduce, which allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster.