Wednesday, February 25, 2009
A key architectural component of PNUTS is a pub-sub system, used to ensure reliability and replication. PNUTS uses a proprietary message broker, and updates are considered committed when they have been published to the message broker.
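The commit-on-publish idea can be sketched as follows; this is an illustrative toy, not the actual Yahoo! Message Broker API (all class and function names here are made up):

```python
# Toy sketch of PNUTS-style commit-on-publish: a write is acknowledged as
# committed once the broker has durably logged it; replicas apply it later.
class MessageBroker:
    """Stand-in for a reliable, ordered pub-sub broker."""
    def __init__(self):
        self.log = []            # durably logged updates
        self.subscribers = []    # replica callbacks

    def publish(self, update):
        self.log.append(update)  # once logged, the update will not be lost
        for deliver in self.subscribers:
            deliver(update)      # asynchronous fan-out in the real system

class Replica:
    def __init__(self, broker):
        self.store = {}
        broker.subscribers.append(self.apply)

    def apply(self, update):
        key, value = update
        self.store[key] = value

def write(broker, key, value):
    broker.publish((key, value))
    return "committed"           # committed as soon as the broker accepts it

broker = MessageBroker()
r1, r2 = Replica(broker), Replica(broker)
write(broker, "user:42", {"name": "Ana"})
```

The key point is that durability comes from the broker's log, not from waiting for every replica to apply the update.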
Wednesday, February 18, 2009
The high level idea of Dynamo is to trade consistency for availability. The query model is a single key lookup and eventual consistency. In Dynamo, conflict resolution occurs at reads rather than writes (in contrast to traditional systems or Google's Bigtable). Dynamo's architecture can be characterized as a zero-hop DHT, where each node has pointers to all the other nodes (to reduce latency and its variability).
I found it interesting to see how previous research work on DHTs (e.g. virtual nodes) and distributed systems (e.g. vector clocks) is combined to form a real, working system in industry.
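The read-time conflict resolution mentioned above can be sketched with vector clocks; this is a minimal illustration of the general technique (function names are mine, not Dynamo's API):

```python
# Hedged sketch of vector-clock reconciliation at read time, Dynamo-style:
# a read may return several versions; only those not dominated by another
# version survive, and multiple survivors mean the client must merge.
def descends(a, b):
    """True if clock a is a causal descendant of (or equal to) clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    """versions: list of (vector_clock, value). Drop dominated versions."""
    survivors = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and descends(other, clock) and other != clock
            for j, (other, _) in enumerate(versions)
        )
        if not dominated:
            survivors.append((clock, value))
    return survivors
```

With one linear history the newer version wins; with concurrent updates (incomparable clocks) both survive and the application resolves the conflict, which is exactly the "resolution at reads" design choice.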
Wednesday, February 11, 2009
The Bigtable interface is very simple: it maps a key tuple with three fields (row key, column key, and timestamp) to an uninterpreted string value.
In the implementation, a tablet (a contiguous range of rows) is the unit of distribution and load balancing.
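The data model can be sketched in a few lines; this toy ignores sparseness, sorting, and column families, and just shows the three-field key (identifiers are illustrative):

```python
# Minimal sketch of the Bigtable data model: a map from
# (row key, column key, timestamp) to an uninterpreted byte string.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the value with the highest timestamp for this cell, if any."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# Example cell from the paper's webtable illustration
put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
put("com.cnn.www", "contents:", 2, b"<html>v2</html>")
```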
I really liked this paper, it presents a simple abstraction that is useful in many circumstances and easy to scale. I think it will probably be influential in 10 years because traditional databases are harder and harder to scale at modern data center sizes.
Chubby is adapted to the requirements of Google applications. It is therefore optimized for availability and reliability rather than high performance, with locks expected to be coarse-grained (held for long periods). Chubby is implemented as a set of servers running a consensus protocol among themselves. Interestingly, Chubby locks are advisory rather than mandatory: applications are trusted not to modify or corrupt a locked file they do not hold the lock for. This choice rests mainly on two reasons: an advisory lock service is simpler to implement, and it does not have to defend against truly malicious applications, since it runs in a closed environment.
The paper thoroughly discusses why a lock service is useful compared to having every application run a distributed consensus algorithm such as Paxos directly, which would require no new service. I mostly agree with the arguments presented: some applications are not built to be highly available as Paxos requires (e.g. they start as prototypes); a lock service simplifies application writing (rather than implementing distributed consensus in every application); applications need to advertise and store small amounts of data anyway, which is difficult to do and would require going through the name service regardless; and a single reliable client node can make progress without requiring a quorum.
As a critique, the arguments for building a lock service rather than using a consensus protocol directly are valid but not well structured. Also, because locks are advisory, poorly written applications could lead to hard-to-detect bugs.
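The advisory-lock pitfall mentioned above is easy to demonstrate; a minimal sketch (in-process only, with made-up names, nothing like Chubby's actual API):

```python
# Illustrative sketch of an advisory lock: holding the lock is a convention,
# so nothing stops a buggy client from writing without it.
import threading

class AdvisoryLock:
    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def acquire(self, client):
        self._lock.acquire()
        self.holder = client

    def release(self, client):
        if self.holder == client:
            self.holder = None
            self._lock.release()

store = {}
lock = AdvisoryLock()

def well_behaved_write(client, key, value):
    lock.acquire(client)
    try:
        store[key] = value       # writes only while holding the lock
    finally:
        lock.release(client)

def buggy_write(key, value):
    store[key] = value           # never takes the lock: nothing enforces it
```

A `buggy_write` silently clobbers data written under the lock, which is exactly the class of hard-to-detect bug the critique points at.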
I think this paper will be influential in 10 years, since it is a large step forward for modern computer science, advancing the state of the art beyond traditional file system models whose assumptions no longer hold. Moreover, many companies (such as Microsoft) have implemented file systems that practically copy GFS.
What I was less comfortable with is the centralized master: although it brings simplicity and smarter replica placement, it feels somewhat contradictory to the huge scaling issues this file system is actually addressing.
Monday, February 9, 2009
The authors then outline the challenges of moving such an application to cloud computing. Among the most important are the difficulty of migrating between cloud and non-cloud deployments, and the fact that accountability and SLAs are difficult to define and unclear.
I think this is a classic dilemma in computer science: structure better for evolvability but lose efficiency, versus staying monolithic and not evolvable but more efficient (a similar example is micro-kernel vs. monolithic kernel). I think in time the better structure always wins, although the moment when performance requirements can be met by such structuring cannot be easily determined. The principles presented earlier for structuring an application strongly advise better structuring for scalability and evolvability.
Wednesday, February 4, 2009
The paper presents a switching layer that lets operators deploy policies specifying which middleboxes to use and in what order, making configuration much easier.
I liked the paper and think it may be influential in 10 years because I think this type of enhancement would really improve the current state of on-path design.
The advantages of the proposed architecture are better fault tolerance and scaling with the number of nodes (although I'm not fully convinced by the second argument).
A clear disadvantage of this structure is its relatively small bisection bandwidth, i.e. an oversubscription of roughly O(2*log(hosts)):1. This hurts random communication patterns and random placement of data in cloud computing environments.
For this reason and due to its complexity, I would tend to think this paper will not be influential in 10 years, at least not in deployments.
The problem seems real, particularly since cloud computing presumably requires more bandwidth between data center nodes (at least due to higher utilization). The solution is interesting.
Adopting such a solution creates two main issues: load balancing becomes much harder due to the large number of smaller equal-cost paths, and wiring is more complex. The authors address the first with static load balancing via a two-level routing algorithm, possibly complemented by per-(large-)flow scheduling through a centralized scheduler, and the second with a packaging solution.
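The two-level lookup can be sketched as a prefix match with a suffix fallback; the table entries and port names below are invented for illustration, not taken from the paper:

```python
# Hedged sketch of two-level routing: first try a destination prefix match
# (pod-local traffic); if nothing matches, fall back to a host-suffix match
# that statically spreads upward traffic across uplinks.
import ipaddress

prefix_table = [
    ("10.2.0.0/24", "port0"),   # subnet local to this pod, exit via port0
    ("10.2.1.0/24", "port1"),   # second local subnet, exit via port1
]
suffix_table = [
    (2, "port2"),               # hosts whose address ends in .2 go up port2
    (3, "port3"),               # hosts whose address ends in .3 go up port3
]

def route(dst):
    addr = ipaddress.ip_address(dst)
    for prefix, port in prefix_table:
        if addr in ipaddress.ip_network(prefix):
            return port
    last_octet = int(dst.split(".")[-1])
    for suffix, port in suffix_table:
        if last_octet == suffix:
            return port
    return None
```

Because the suffix match depends only on the host ID, different destinations deterministically take different uplinks, which is how the static scheme spreads load without per-flow state.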
I liked the paper and I think it will be influential in the next 10 years, not necessarily through deployments of this exact form but by potentially triggering a paradigm shift in data center design.
Monday, February 2, 2009
The paper then describes replication topologies, starting from one data center and going up to five. The high-level idea is to have one or two master servers per data center and keep them consistent by linking them all in a circular topology (a connected component with one redundant link). To return consistent data even under master failure (within the same data center), the design also uses hub replicas, one per master per data center (read-only). The paper further optimizes read performance with an extra layer of client servers (below the hub layer, read-only like the hubs).
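The circular master topology can be illustrated with a toy ring that forwards each update around until it returns to its origin; everything here (class names, stopping rule) is my own simplification, not the paper's protocol:

```python
# Toy sketch of update propagation around a ring of masters, one per data
# center: each master applies the update and forwards it to the next master,
# stopping once the update has gone full circle.
class Master:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.next = None         # next master in the ring

    def update(self, key, value, origin=None):
        if origin == self.name:
            return               # update came full circle: stop forwarding
        self.data[key] = value
        self.next.update(key, value, origin or self.name)

# Build a 3-data-center ring: A -> B -> C -> A
a, b, c = Master("A"), Master("B"), Master("C")
a.next, b.next, c.next = b, c, a
a.update("inbox:7", "mail")
```

Even this toy shows the tradeoff discussed below: one link failure still leaves a path between any two masters, but propagation latency grows with ring size.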
Random thoughts: The design is interesting, as consistency is a hard property to achieve, but it is not motivated (at least not in this section) and alternative designs are not really discussed. It is interesting that, in general, the more mechanism is thrown at failures (e.g. more redundancy), the more complex the system becomes, and the more errors this complexity itself can introduce. One can see in the paper the reliability vs. overhead (delay, communication) tradeoff. Finally, the authors mention that the links between masters are "separate", but I did not understand whether these are dedicated links or just a separate outgoing data center interface (if dedicated, I would wonder how the failure probability of routed paths compares to that of the dedicated links).
I found the survival-probability breakdown by age after a scan error and after a reallocation interesting, as well as the failure rate by age. I also liked the tracing/detection mechanism (built on the Google stack), which is quite general.
I think these results may remain interesting for quite some time, since such detailed studies of large populations are hard to carry out and rarely published. However, at least to me, a breakdown by model/manufacturer before drawing all of these charts would have made much more sense.
Crash: Data Center Horror Stories – this article describes a data center failure due to bad power cabling (a well-hidden one). The most important lessons learned: avoid single points of failure, and use specialized (data-center-specific), highly reliable equipment.
Data Center Failure As A Learning Experience – this article emphasizes the role of planning and design in data center construction, aiming to "fail small" since "fail never" may not be achievable. The article also notes that most failures are concentrated shortly after equipment is put in service and when it approaches end of life.