TL;DR: In this article, the authors present a system for managing asynchronous replication in a distributed database environment, while providing for scaling of the distributed database, where a cluster of nodes can be assigned roles for managing partitions of data within the database and processing database requests.
Abstract: Systems and methods are provided for managing asynchronous replication in a distributed database environment, while providing for scaling of the distributed database. A cluster of nodes can be assigned roles for managing partitions of data within the database and processing database requests. In one embodiment, each cluster includes a node with a primary role to process write operations and manage asynchronous replication of the operations to at least one secondary node. Each cluster or set of nodes can host one or more partitions of database data. Collectively, the cluster or set of nodes define a shard cluster that hosts all the data of the distributed database. Each shard cluster, individual nodes, or sets of nodes can be configured to manage the size of any hosted partitions, splitting database partitions, migrating partitions, and/or managing expansion of shard clusters to encompass new systems.
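The primary/secondary arrangement described in this abstract can be illustrated with a minimal in-memory sketch. It assumes the primary records each write in an operation log that a secondary replays later; the names `Primary`, `Secondary`, `oplog`, and `catch_up` are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Node with the primary role: accepts writes and records them in a log."""
    data: dict = field(default_factory=dict)
    oplog: list = field(default_factory=list)  # hypothetical operation log

    def write(self, key, value):
        # The write is acknowledged as soon as it is applied locally;
        # secondaries are NOT updated here (asynchronous replication).
        self.data[key] = value
        self.oplog.append((key, value))


@dataclass
class Secondary:
    """Node with a secondary role: replays the primary's log at its own pace."""
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries already replayed

    def catch_up(self, oplog):
        # Replay only the entries that have not been applied yet.
        for key, value in oplog[self.applied:]:
            self.data[key] = value
        self.applied = len(oplog)


primary, secondary = Primary(), Secondary()
primary.write("user:1", "alice")
# At this point the secondary still lags behind the primary.
secondary.catch_up(primary.oplog)
```

The sketch leaves out partition splitting, migration, and failover, which the abstract mentions but does not detail.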
TL;DR: In this paper, the authors present a horizontally scaled database, partitioned by data ownership, for a cloud-based collaboration and/or storage platform/service. The database comprises multiple shard databases, and all files and folders owned by a user are stored on a single shard database.
Abstract: Systems and methods of maintaining a horizontally scaled database based on data ownership for a cloud-based platform (e.g., cloud-based collaboration and/or storage platform/service) are disclosed. The system database comprises multiple shard databases, and all files and folders owned by a user are stored on a single shard database. When a user transfers ownership of a file and/or a folder to a second user, the transferred file and/or folder is stored on the shard database that stores all of the data for the second user.
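A rough sketch of ownership-based placement follows. It assumes each user is pinned to one shard on first use, and that an ownership transfer moves the item to the new owner's shard. The class `OwnershipShardedStore` and the round-robin user assignment are illustrative assumptions, not the method described in the source.

```python
class OwnershipShardedStore:
    """Toy store in which every item lives on its owner's shard."""

    def __init__(self, num_shards):
        # Each shard maps item_id -> (owner, content).
        self.shards = [dict() for _ in range(num_shards)]
        self.user_shard = {}  # user -> shard index

    def shard_for(self, user):
        # Assumption: new users are assigned to shards round-robin.
        if user not in self.user_shard:
            self.user_shard[user] = len(self.user_shard) % len(self.shards)
        return self.user_shard[user]

    def put(self, owner, item_id, content):
        self.shards[self.shard_for(owner)][item_id] = (owner, content)

    def transfer(self, item_id, old_owner, new_owner):
        # Remove the item from the old owner's shard and store it
        # on the shard that holds the new owner's data.
        source = self.shards[self.shard_for(old_owner)]
        _, content = source.pop(item_id)
        self.shards[self.shard_for(new_owner)][item_id] = (new_owner, content)


store = OwnershipShardedStore(num_shards=2)
store.put("alice", "report.txt", "draft")
store.put("bob", "notes.txt", "todo")
store.transfer("report.txt", "alice", "bob")
```

Keeping all of a user's items on one shard means a listing of that user's files touches a single shard database, at the cost of moving data when ownership changes.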
TL;DR: The Clause-Iteration algorithms form the basis of the scalable, SHARD graph-store built on the Hadoop implementation of MapReduce, which performs favorably when compared to existing "industrial" graph-stores on a standard benchmark graph with 800 million edges.
Abstract: Graph data processing is an emerging application area for cloud computing because there are few other information infrastructures that cost-effectively permit scalable graph data processing. We present a scalable cloud-based approach to process queries on graph data utilizing the MapReduce model. We call this approach the Clause-Iteration approach. We present algorithms that, when used in conjunction with a MapReduce framework, respond to SPARQL queries over RDF data. Our innovation in the Clause-Iteration approach comes from 1) the iterative construction of query responses by incrementally growing the number of query clauses considered in a response, and 2) our use of flagged keys to join the results of these incremental responses. The Clause-Iteration algorithms form the basis of our scalable, SHARD graph-store built on the Hadoop implementation of MapReduce. SHARD performs favorably when compared to existing "industrial" graph-stores on a standard benchmark graph with 800 million edges. We discuss design considerations and alternatives associated with constructing scalable graph processing technologies.
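The idea of building a query response by adding one clause per iteration can be sketched in plain Python. This is an in-memory illustration only: it does not use MapReduce or Hadoop, and it replaces the flagged-key join with a simple dictionary comparison. The functions `match_clause` and `answer` and the sample triples are hypothetical.

```python
def match_clause(clause, triple):
    """Return variable bindings if the triple matches the clause, else None.

    Terms starting with '?' are variables; all other terms are constants.
    """
    binding = {}
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            if binding.get(term, value) != value:
                return None  # same variable bound to two different values
            binding[term] = value
        elif term != value:
            return None
    return binding


def answer(clauses, triples):
    """Grow partial answers by considering one additional clause per iteration."""
    results = [{}]  # start with a single empty partial answer
    for clause in clauses:
        extended = []
        for partial in results:
            for triple in triples:
                binding = match_clause(clause, triple)
                if binding is None:
                    continue
                # Join step: shared variables must agree with the partial answer.
                if all(partial.get(k, v) == v for k, v in binding.items()):
                    extended.append({**partial, **binding})
        results = extended
    return results


triples = [
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "acme"),
    ("carol", "knows", "dave"),
]
query = [("?p", "knows", "?q"), ("?q", "worksAt", "?c")]
print(answer(query, triples))
```

In a MapReduce setting each iteration of the outer loop would correspond to a job, with the join performed by grouping on the shared variable values.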
TL;DR: In this paper, a method and apparatus are used to divide a storage volume into shards (202-210), where the division is made using a directed graph having a vertex for each block in the storage volume and directed edges between pairs of vertices representing a shard of blocks.
Abstract: A method and apparatus are used to divide a storage volume into shards (202-210). The division is made using a directed graph having a vertex for each block in the storage volume and directed edges between pairs of vertices representing a shard of blocks (304), associating a weight with each directed edge that represents the dissimilarity for the shard of blocks between the corresponding pair of vertices (306), selecting a maximum number of shards (K) for dividing the storage volume (402), identifying a minimum aggregate weight associated with a current vertex for a combination of no more than K shards (512-514), performing the identification of the minimum aggregate weight for vertices in the directed graph (406), and picking the smallest aggregated weight associated with the last vertex to determine a sharding that spans the storage volume and provides a minimal dissimilarity among no more than K shards of blocks (408).
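One way to read the procedure above is as a shortest-path dynamic program over at most K edges. The sketch below makes that reading concrete under two assumptions: vertices are modelled as block boundaries 0..n, so an edge (i, j) stands for the shard of blocks i..j-1, and the dissimilarity weight is supplied by the caller. The function `best_sharding` and the example weight (spread of per-block values) are hypothetical.

```python
import math


def best_sharding(n, K, weight):
    """Split blocks 0..n-1 into at most K contiguous shards of minimal total weight.

    weight(i, j) is the dissimilarity of the shard holding blocks i..j-1.
    Requires n >= 1 and K >= 1. Returns (total_weight, [(start, end), ...]).
    """
    INF = math.inf
    # best[k][j]: minimal aggregate weight to cover blocks 0..j-1 with exactly k shards.
    best = [[INF] * (n + 1) for _ in range(K + 1)]
    prev = [[None] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            for i in range(j):
                if best[k - 1][i] == INF:
                    continue
                cand = best[k - 1][i] + weight(i, j)
                if cand < best[k][j]:
                    best[k][j] = cand
                    prev[k][j] = i
    # Pick the smallest aggregate weight at the last vertex over 1..K shards.
    k_best = min(range(1, K + 1), key=lambda k: best[k][n])
    shards, j, k = [], n, k_best
    while k > 0:
        i = prev[k][j]
        shards.append((i, j))
        j, k = i, k - 1
    shards.reverse()
    return best[k_best][n], shards


# Hypothetical per-block values; dissimilarity = spread of values inside a shard.
values = [1, 1, 1, 9, 9, 9]
spread = lambda i, j: max(values[i:j]) - min(values[i:j])
cost, shards = best_sharding(len(values), 2, spread)
```

The triple loop runs in O(K·n²) weight evaluations, which is acceptable for a sketch but would need pruning or a cheaper weight function for large volumes.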
TL;DR: In this article, a sharded database system for partitioning data among a plurality of shard servers is presented, which includes a first shard server, a second shard server, and a shard control record.
Abstract: A sharded database system configured for partitioning data amongst a plurality of shard servers is provided. In one implementation, the sharded database system comprises a sharded database including a first shard server, a second shard server, and a shard control record. The shard control record is configured to define a first data structure for distributing a first plurality of data records or rows based on a first sharding by monotonic key range across the first and second shard servers. The sharded database is also configured to further distribute the first plurality of records or rows across the first shard server and the second shard server via a subsidiary hashing method. A method of partitioning data of a database is also provided.
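The combination of range sharding with a subsidiary hash might look like the following sketch. It reflects one possible interpretation: the control record maps each key range to a list of candidate shard servers, and a hash of the record identifier picks one of them. The class `ShardControlRecord`, its fields, and the use of SHA-256 are assumptions made for illustration.

```python
import bisect
import hashlib


class ShardControlRecord:
    """Maps a monotonic key range to candidate servers, then hashes among them."""

    def __init__(self, upper_bounds, servers_per_range):
        # upper_bounds: sorted exclusive upper bounds of all ranges but the last,
        # so there must be exactly one more server list than bounds.
        if len(servers_per_range) != len(upper_bounds) + 1:
            raise ValueError("need one server list per key range")
        self.upper_bounds = upper_bounds
        self.servers_per_range = servers_per_range

    def locate(self, key, record_id):
        # Step 1: sharding by monotonic key range.
        idx = bisect.bisect_right(self.upper_bounds, key)
        servers = self.servers_per_range[idx]
        # Step 2: subsidiary hashing spreads rows of one range across servers.
        digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
        return servers[int.from_bytes(digest[:8], "big") % len(servers)]


record = ShardControlRecord(
    upper_bounds=[1000],
    servers_per_range=[["shard-1", "shard-2"], ["shard-1", "shard-2"]],
)
server = record.locate(key=42, record_id="row-42")
```

A cryptographic hash is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would make placement unstable across restarts.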