About: Failover is a research topic. Over the lifetime, 2352 publications have been published within this topic receiving 48755 citations. The topic is also known as: fail-over.
TL;DR: In this paper, the authors propose to assign failover priorities to virtual servers in a cluster of two or more autonomous server nodes, where each virtual server has one or more virtual IP addresses and load balancing can be provided by distributing virtual servers from a failed node to multiple different nodes.
Abstract: Systems and methods, including computer program products, providing high-availability in server systems. In one implementation, a server system is cluster of two or more autonomous server nodes, each running one or more virtual servers. When a node fails, its virtual servers are migrated to one or more other nodes. Connectivity between nodes and clients is based on virtual IP addresses, where each virtual server has one or more virtual IP addresses. Virtual servers can be assigned failover priorities, and, in failover, higher priority virtual servers can be migrated before lower priority ones. Load balancing can be provided by distributing virtual servers from a failed node to multiple different nodes. When a port within a node fails, the node can reassign virtual IP addresses from the failed port to other ports on the node until no good ports remain and only then migrate virtual servers to another node or nodes.
TL;DR: PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees and utilizes automated load-balancing and failover to reduce operational complexity.
Abstract: We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
TL;DR: Megastore provides fully serializable ACID semantics within ne-grained partitions of data, which allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters.
Abstract: Megastore is a storage system developed to meet the requirements of today’s interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within ne-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore’s semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.
TL;DR: In this paper, a failover recovery approach encapsulates the knowledge of failureover recovery between components within a storage server and between storage server systems, including information about what components are participating in a Failover Set, how they are configured for failover, what is the Fail-Stop policy, and what are the steps to perform when "failing-over" a component.
Abstract: Failover processing in storage server system utilizes policies for managing fault tolerance (FT) and high availability (HA) configurations. The approach encapsulates the knowledge of failover recovery between components within a storage server and between storage server systems. This knowledge includes information about what components are participating in a Failover Set, how they are configured for failover, what is the Fail-Stop policy, and what are the steps to perform when “failing-over” a component.
TL;DR: This paper describes Fastpass, a datacenter network architecture built using this principle that achieves high throughput comparable to current networks at a 240x reduction is queue lengths, and achieves much fairer and consistent flow throughputs than the baseline TCP.
Abstract: An ideal datacenter network should provide several properties, including low median and tail latency, high utilization (throughput), fair allocation of network resources between users or applications, deadline-aware scheduling, and congestion (loss) avoidance. Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate control---to a centralized arbiter---of when each packet should be transmitted and what path it should follow. This paper describes Fastpass, a datacenter network architecture built using this principle. Fastpass incorporates two fast algorithms: the first determines the time at which each packet should be transmitted, while the second determines the path to use for that packet. In addition, Fastpass uses an efficient protocol between the endpoints and the arbiter and an arbiter replication strategy for fault-tolerant failover. We deployed and evaluated Fastpass in a portion of Facebook's datacenter network. Our results show that Fastpass achieves high throughput comparable to current networks at a 240x reduction is queue lengths (4.35 Mbytes reducing to 18 Kbytes), achieves much fairer and consistent flow throughputs than the baseline TCP (5200x reduction in the standard deviation of per-flow throughput with five concurrent connections), scalability from 1 to 8 cores in the arbiter implementation with the ability to schedule 2.21 Terabits/s of traffic in software on eight cores, and a 2.5x reduction in the number of TCP retransmissions in a latency-sensitive service at Facebook.