Lucian's Blog: February 2009

Wednesday, February 25, 2009

HIVE: Data Warehousing & Analytics on Hadoop

HIVE is a system for querying and managing structured data built on top of Hadoop. The client describes a query in the Hive query language (based on SQL). Based on this query, Hive creates a map reduce execution plan. The Hive data model is slightly simpler than a full relational database, but appears strong enough for most datacenter query applications. Unlike the other extensions to database systems for scalability that we looked at, Hive is designed to be mostly used offline rather than online.

PNUTS: Yahoo!'s Hosted Data Serving Platform

The paper presents a database system called PNUTS. PNUTS offers a service with slightly stronger semantics than the models seen before such as Bigtable and Dynamo but still way simpler than a full relational database. PNUTS has relational tables, but accepts queries only on individual tables. The consistency model of PNUTS can be stronger than eventual consistency, and the API enables the user to specify the model of read consistency desired (I think this is a nice feature). However, the consistency model is weaker than serializable transactions.

A key architectural component of PNUTS is a pub-sub system, used to ensure reliability and replication. PNUTS uses a proprietary message broker, and updates are considered committed when they have been published to the message broker.

Wednesday, February 18, 2009

Dynamo: Amazon's Highly Available Key-Value Store

The paper presents the design and implementation of a highly available key-value storage system. The problem is that database systems are far too complex and scale poorly in a distributed setting for the requirements of modern large scale software services.

The high level idea of Dynamo is to trade consistency for availability. The query model is a single key lookup and eventual consistency. In Dynamo, conflict resolution occurs at reads rather than writes (in contrast to traditional systems or Google's Bigtable). Dynamo's architecture can be characterized as a zero-hop DHT, where each node has pointers to all the other nodes (to reduce latency and its variability).

I found interesting to see how previous research works in DHTs (e.g virtual nodes) and distributed systems (e.g. vector clocks) are combined to form a real working system in industry.

Wednesday, February 11, 2009

Bigtable: A Distributed Storage System for Structured Data

Bigtable is a distributed structured storage used inside the Google data centers.
The Bigtable interface is very simple, consisting of mapping a key tuple with three fileds to a string object (array of bytes).
For implementation, a row is the unit of distribution and load balancing.

I really liked this paper, it presents a simple abstraction that is useful in many circumstances and easy to scale. I think it will probably be influential in 10 years because traditional databases are harder and harder to scale at modern data center sizes.

The Chubby Lock Service for Loosely-Coupled Distributed Systems

The paper presents a lock service to be used by loosely coupled distributed systems.
Chubby is adapted for requirements encountered by Google applications. Therefore, Chubby is optimized for availability and reliability rather than high performance, with locks expected to be course-grained (in time). Chubby is implemented as a set of servers that use a consensus protocol among them. Interestingly, Chubby does not mandate locks, meaning that the applications choose to not modify or corrupt the locked file. This is based mainly on two reasons: the lock service simpler to implement and it does not have to handle real malicious applications as it is running in a closed environment.

The paper makes a thorough discussion of why a lock service is useful compared to a distributed consensus algorithm such as Paxos which would require no new service. I mostly agree with the arguments presented in the paper, which are: some applications are not created to be highly available as required by Paxos (e.g. are started as prototypes); application writing is simplified with a lock service (rather than implement distributed consensus in all applications); a distributed consensus would require to advertise and store small amounts of data which is difficult to do and requires a lock through the name service anyway; a single reliable node can make progress without requiring a quorum.

As a critique, the arguments presented for why building a lock service rather than a consensus protocol are valid but not well structured. Another thing is that by not mandating a lock, poorly written applications could lead to hard to detect bugs.

The Google File System

This file system is very interesting as it is adapted to new assumptions, compared to the classical distributed file systems. The new assumptions are specific to modern data center environments such as failures at such scale are common, files are huge, most files are only appended, most reads are sequential, high sustained bandwidth is more important than latency. GFS splits files into chunks and replicates individually each chunk (3 times by default). GFS uses a centralized master that controls the file system namespace and which returns a mapping from a name to a chunk handle. Chunkservers control the actual access to chunks. Compared to traditional file systems, GFS adds two new operations, snapshot and record append; also, the GFS consistency model is more relaxed, file mutations are atomic, but writes are done on new chunks and users can read stale data since they cache chunk handles (old chunks are garbage collected).

I think this paper will be influential in 10 years since it is a new large step in the modern era of computer science, advancing the state of the art from the traditional file system models since assumptions have changed. Moreover, many companies (such as Microsoft) have implemented file systems that practically copy the GFS.

What I was less comfortable with is the centralized master, which although presents advantages towards simplicity and smarter replica placement, it feels in a way contradictory to the huge scaling issues that this file system is actually addressing.

Monday, February 9, 2009

eBay Scalling Odyssey & All I Need is Scale!

The two presentations show five principles used within Ebay when implementing a webservice. I really liked these principles. I think they show the hidden side of a success story, which many may tend to say it is mostly related to the business model and overlook the hard technical problems that were overcame.

Authors then outline the challenges of moving such an application to Cloud computing. Among the most important challenges are: the difficulty in migration between Cloud/non Cloud, the fact that accountability and SLAs are difficult/not clear.
I think this is a classical dilemma in computer science, to structure better for evolvability but loose efficiency VS to be monolithic and not evolvable but be more efficient (e.g. a similar example is micro-kernel vs monolithic kernel). I think in time a better structure always wins, although the time when the performance requirements can be met by such a structuring cannot be easily determined. The principles presented before for structuring an application, strongly advise to use a better structuring for scalability/evolvability.

Wednesday, February 4, 2009

A Policy-aware Switching Layer for Data Centers

The problem is that data centers contain a large number of middleboxes (firewalls, load balancers) and configuring it (order and when they need to be used) is hard.
The paper presents a switching layer enabling policies to be deployed for which middleboxes to be used and in which order, making configuration much easier.

I liked the paper and think it may be influential in 10 years because I think this type of enhancement would really improve the current state of on-path design.

DCell: A Scalable and Fault-Tolerant Network Structure for Data

Authors propose a data center network architecture using a recursive structure called DCell. Authors also propose a routing algorithm (traditional routing cannot be used) enabling fault tolerance for servers, links or even racks.

The advantages of the proposed architecture are better fault tolerance and scaling with the number of nodes (although I'm not fully convinced by the second argument).

A clear disadvantage of this structure is that the bisection bandwidth is somewhat small, i.e. it is (O(2*log(hosts):1)). This impacts the deployment of random communication patterns and random placement of data in cloud computing environments.
For this reason and due to its complexity, I would tend to think this paper will not be influential in 10 years, at least not in deployments.

A Scalable, Commodity Data Center Network Architecture

The paper addresses the problem of expensive network infrastructure in data centers and the fact that the current topologies are under provisioned for the bisection bandwidth. Authors propose to replace the current topologies using large core switches (currently 10Gbps) with a fat-tree topology formed only by smaller switches (currently 1Gbps). This helps in: reducing costs, increase bisection bandwidth, increase fault tolerance and reduce energy consumption.

The problem seems real particularly with the cloud computing presumably requiring more bandwidth between data center nodes (at least due to higher utilization). The solution is interesting.

There are two main issues created by adopting such a solution: load balancing becomes much harder due to the large variety of smaller size paths, and wiring is more complex. Authors address both problems by proposing both a static load balancing using a two level routing algorithm and possibly a per (large) flow scheduling using a centralized scheduler. To improve wiring, authors propose a packaging solution.

I liked the paper and I think it will be influential in the next 10 year, not necessarily through deployments of this form but due to potentially triggering a paradigm shift in data center design.

Monday, February 2, 2009

Designing a highly availabile directory service

The paper provides a few strategies for addressing failures in a directory service. The paper classifies failures in two classes: system unavailable and system unreliable (classification I did not find particularly enlightening). The paper presents two backup and recovery strategies: binary (which is full binary replication) and LDIFF (a logical backup based on differences from previous version). The second strategy is more general but slower in both backup and recovery.

The paper then describes replication topologies starting from one data center up to 5 data centers. The high level idea is to have one/two master servers per data center and keep them consistent by linking them all in a circular topology (a connected component with one redundant link). To return consistent data even in master failure (within the same data center), the design also uses hub replicas one per master in datacenter (that are read only). The paper further optimizes the design for read performance with an extra layer of client servers (below the hub layer, still read only as hubs).

Random thoughts: The design is interesting, as consistency is a property hard to achieve, but the presented design not motivated (at least not in this section) and alternative designs are not really discussed. It is interesting that in general, the more mechanism is thrown to cope with failures (e.g. more redundancy), the more complex the system design is and more errors can occur due to this complexity. One can see in the paper the reliability vs overhead (delay, communication) tradeoff. Finally, the authors mention that the links between masters are “separate” but I did not understood whether they are dedicated links or just the outgoing datacenter interface is separate (if dedicated, I would wonder about the failure probability of routing vs that of the dedicated links).

Failure Trends in a Large Disk Drive Population

The paper presents failure patterns of hard drives collected from a large sample (over 100k hard drives) in Google’s data center. The paper makes the argument that failures are hard to define because failure can be context dependent (e.g. bad driver). Thus, they roughly define a hard drive to have failed if replaced in a repair procedure. As expected, failure results are significantly affected by model/manufacturer, the paper mentions this but does not show such a breakdown.

I found interesting the survival probability breakdown by age after a scan error and after a reallocation and the failure rate by age. I also liked the tracing/detection mechanism (using the Google stack) which is quite general.

I think these results may be interesting for quite some time since such detailed studies for large populations are hard to achieve and not often published, however, at least to me, a breakdown for model/manufacturer before drawing all of these charts would have made much more sense.

Failure stories

Crash: Data Center Horror Stories – this article describes a data center failure due to bad power cabling (but quite a hidden one). The most important lessons learned are: to not use a single point of failure and use specialized (data center specific) and highly reliable equipment.

Data Center Failure As A Learning Experience - this article emphasizes the role of planning and design in data center construction to “fail small”, since “fail never” may not be achievable. The article also notices that most failures are concentrated in shortly after being put in service and when approaching the end of life.

Lucian's Blog