Monday, April 27, 2009

BOINC: A System for Public-Resource Computing and Storage

The paper presents BOINC, a piece of software similar to SETI@home for harnessing home users' computational resources.

A common problem with this approach is the incentive for home users, which the paper addresses through "credits" for contributed computational power (which may be rewarded) and by allowing users to select computational topics they feel strongly about.

An interesting discussion point is the total energy use of this approach compared to a dedicated cluster (under a single organization's control). For example, BOINC launches multiple redundant copies of each task to guard against malicious results, which is clearly a waste of energy. Also, these tasks can wake processors up from low-power sleep states (a global optimization could reduce this energy use) and transfer data over long distances.

Incentives Build Robustness in BitTorrent

This paper presents a piece of good software. The idea behind BitTorrent is to download file chunks in parallel from multiple peers. Chunks are requested rarest-piece-first to increase the overall availability of the file. Peers are found through a tracker (a server) whose address is hard-coded in the torrent file. Each peer selfishly allows other peers to download from it based on reciprocal downloads, and to avoid oscillations this is done over a larger time scale.
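
To make the rarest-first idea concrete, here is a minimal Python sketch of piece selection (the data structures are mine, not BitTorrent's actual implementation):

import random
from collections import Counter

def pick_next_piece(my_pieces, peer_bitfields):
    """Rarest-first: among pieces we still need, choose one that the
    fewest connected peers have; break ties randomly."""
    availability = Counter()
    for bitfield in peer_bitfields:          # one set of piece indices per peer
        availability.update(bitfield)
    needed = [p for p in availability if p not in my_pieces]
    if not needed:
        return None
    min_count = min(availability[p] for p in needed)
    rarest = [p for p in needed if availability[p] == min_count]
    return random.choice(rarest)

# Example: we already have piece 0; three peers advertise their pieces.
print(pick_next_piece({0}, [{0, 1, 2}, {0, 2}, {2, 3}]))  # prints 1 or 3 (the rarest)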

One thing that I liked is the extensive use of randomization (the tracker returns random peers, optimistic unchoking picks random peers), which has shown good results in practice.

Measuring and Evaluating Large-Scale CDNs

Very interesting paper: it shows the different design choices for CDNs and their current deployment scale and performance. It is interesting that CDNs do not necessarily redirect to the lowest-delay servers (they use other metrics as well) and that they also use IP anycast. Another interesting aspect is that the paper measures the marginal gain of each data center, which offers a good starting point for new CDNs on how to deploy their networks (lowering the barrier to entry a bit).

Wednesday, April 15, 2009

Open Cloud/AppDrop/GoogleApp

The Open Cloud Manifesto proposes the idea of an open cloud, mainly for portability/flexibility in choosing a cloud provider and for efficiency (developers already used to the API, maturing code, etc.).

Unfortunately, the paper is written from a cloud user's standpoint. From a business perspective, the leading cloud providers have no incentive to open up their interfaces. The track record of successful IT businesses seems instead to favor very closed systems, e.g. Microsoft or Google (which appears open but is actually very closed).

In fact, this is the approach to the cloud that Google is pursuing with Google Apps. AppDrop could turn out to be a poor replica of the original for many reasons, including unknown tweaks/details of Google's setup.

Wednesday, April 1, 2009

Erlang

Erlang is a programming language with higher-level constructs such as processes, scheduling, and memory management built into the language itself. These constructs are typically associated with operating systems. By construction, Erlang runs in a virtual-machine-like environment (supposedly just as fast as unoptimized C code, though this is not an appropriate performance metric).
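
To give a feel for the process and message-passing style (only the style; Erlang processes are far cheaper than OS threads and are scheduled by its own VM), here is a rough Python analogue using a thread and a mailbox queue:

import queue
import threading

def counter_process(mailbox):
    """A toy 'process' that owns its state and reacts only to messages,
    loosely in the spirit of an Erlang receive loop."""
    count = 0
    while True:
        msg = mailbox.get()
        if msg == "increment":
            count += 1
        elif msg == "stop":
            print("final count:", count)
            return

mailbox = queue.Queue()
threading.Thread(target=counter_process, args=(mailbox,)).start()
for _ in range(3):
    mailbox.put("increment")
mailbox.put("stop")   # prints "final count: 3"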

While I was convinced of the usefulness of having such high-level operations as language constructs, I was not fully convinced by the language itself, which at first glance does not look very appealing. I would argue (possibly wrongly :) ) that a language with these constructs but closer in syntax to an existing programming language would be much more easily adopted.

Monday, March 30, 2009

Friday: Global Comprehension for Distributed Replay

Friday is a system for distributed debugging using deterministic replay. Friday allows distributed watchpoints and breakpoints to be placed in the replayed system. Beyond these, Friday lets commands written in a scripting language (Python) be associated with the watchpoints and breakpoints. These commands can modify debug variables, implement complex predicates, or even call functions of the debugged application. This seems a really important feature to have, as shown by the simplicity with which the Chord example can be debugged.
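
As a sketch of what such a watchpoint command could look like, here is a hypothetical example in the spirit of the Chord case; the register_watchpoint call and all variable names are invented, only the idea of attaching Python code to a distributed watchpoint comes from the paper:

NUM_NODES = 3
successors = {}  # debugger-maintained view: node id -> its successor id

def forms_single_cycle(succ):
    """Check whether following successor pointers from node 0 visits every node."""
    seen, node = set(), 0
    while node not in seen:
        seen.add(node)
        node = succ[node]
    return len(seen) == len(succ) and node == 0

def on_successor_change(node_id, new_successor):
    """Command run whenever the watched variable changes on any replayed node."""
    successors[node_id] = new_successor
    if len(successors) == NUM_NODES and not forms_single_cycle(successors):
        print("ring inconsistency detected after update at node", node_id)

# In Friday this would be wired to a distributed watchpoint, e.g.:
# register_watchpoint("chord.successor", on_successor_change)   # hypothetical API
# Simulated updates for illustration:
on_successor_change(0, 1)
on_successor_change(1, 2)
on_successor_change(2, 2)   # broken ring: node 2 points to itself -> warning printed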

XTrace

XTrace is a tracing system for distributed and complex applications. XTrace allows traces to span multiple layers. It uses labels (metadata propagated with each request) to reconstruct the trace, plus a conceptually centralized database. XTrace requires code instrumentation at all (traced) stack layers.
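
A minimal Python sketch of the labeling idea, with simplified metadata fields (a per-request task id plus parent pointers) rather than X-Trace's exact format:

import uuid

reports = []   # stands in for the conceptually centralized report database

def report(task_id, op_id, parent_id, layer, event):
    reports.append({"task": task_id, "op": op_id, "parent": parent_id,
                    "layer": layer, "event": event})

def next_op(task_id, parent_id, layer, event):
    """Each instrumented layer logs an event and passes the label downward."""
    op_id = uuid.uuid4().hex[:8]
    report(task_id, op_id, parent_id, layer, event)
    return op_id

task = uuid.uuid4().hex[:8]                      # one label per traced request
http_op = next_op(task, None, "HTTP", "GET /index")
tcp_op = next_op(task, http_op, "TCP", "connect")
ip_op = next_op(task, tcp_op, "IP", "route")

# Reconstructing the trace tree amounts to grouping reports by task and
# following parent pointers:
for r in reports:
    print(r["layer"], r["event"], "<-", r["parent"])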

I liked XTrace and I think it can be very useful to check where the distributed execution got stuck as it generates an easy to follow tree output.

DTrace: Dynamic Instrumentation of Production Systems

DTrace is a tracing system implemented in the Solaris kernel that allows tracing of both user-level and kernel-level programs. It is dynamic, meaning that instrumentation is enabled explicitly on demand and, when not in use, consumes no extra resources.
The architecture is two-tiered, with a core DTrace module (which acts as a multiplexer and disables interrupts while probes fire) and providers, the modules that actually perform the tracing. In this way, different instrumentation methodologies can be added. Many different providers were already implemented by the authors.
Users specify arbitrary predicates to monitor their programs using a C-like programming language called "D".

This architecture is interesting. I think one recurrent tradeoff in tracing systems is the amount of user work versus the overhead of tracing.

Wednesday, March 18, 2009

Chukwa

Chukwa is a data collection system built on top of Hadoop. It solves some particular problems in this context, such as the fact that HDFS is not well suited to holding the many small files produced by monitoring; for this, Chukwa uses collectors to aggregate logs and reduce the number of HDFS files generated. Chukwa is in use at Yahoo! and the evaluation shows a small overhead.

Artemis

Artemis is a framework designed to analyze logs for performance troubleshooting. It consists of four parts: log collection and data extraction, a database, a visualization tool, and a plug-in interface.

I liked that the paper presents a real problem that was detected using this tool. After reading the paper, I am not sure how much work is required to adapt Artemis to a new environment/application versus writing a quick-and-dirty application-specific script to monitor the interesting variables.
I think a tradeoff here is between structuring the logs (pushing more work into the automated part of the analyzer and making analysis easier) and the ease of generating unstructured logs.
Related to this, the paper does not specify how the DryadLINQ computation that summarizes the logs works, nor whether using a commercial database for the analyzed data always scales.

Wednesday, March 4, 2009

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

DryadLINQ is a layer on top of Dryad that lets users easily implement distributed computations written in a high-level language, LINQ. DryadLINQ compiles user programs into Dryad tasks and runs them on clusters under the hood, largely transparently to the user. LINQ allows users to write both SQL-style queries and imperative, object-oriented programs. From the examples and demos that I have seen, it looks like a really neat tool. Compared to both Hive and Pig, the LINQ language seems more powerful, and the underlying Dryad offers more room for optimization than the MapReduce backend of the other two. Again, DryadLINQ pays the price in worldwide usage and adoption for being built on proprietary technology.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad is another framework for writing distributed computations. Compared to MapReduce, it allows a general computational DAG, with arbitrary structure and vertices that can implement arbitrary computations. This gain is paid for with increased complexity. However, as suggested by recent industry feedback (e.g. from Yahoo!), these features seem useful in practice. A classic example is a join between a small and a large table, where the small table can be distributed to all nodes and held in memory.

Dryad is used at wide scale within Microsoft, and I think it will be influential in 10 years because, as an extension of MapReduce-like jobs, it is the first paper to show how this can be done in a data center. However, due to the lack of an open-source implementation, the more complex Dryad paradigm lags MapReduce in worldwide usage.

MapReduce: Simplified Data Processing on Large Clusters

MapReduce presents a framework for running large distributed computations. The main contribution is the identification of an abstraction that is simple, yet general enough to naturally capture a (really) wide variety of distributed computations used in practice. This is the map-reduce programming model.
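
For reference, the canonical word-count example expressed as map and reduce functions, with a toy single-machine driver standing in for the distributed runtime:

from collections import defaultdict

def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1                      # emit (key, value) pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                # combine all values for one key

def run_local(documents):
    """Toy single-machine driver: shuffle pairs by key, then reduce."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return dict(kv for key, values in groups.items()
                   for kv in reduce_fn(key, values))

print(run_local({"d1": "the cat sat", "d2": "the cat ran"}))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}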

It seems obvious that this paper will be influential in 10 years.
A tradeoff that can be identified from reading subsequent papers such as Dryad is the one between the simplicity of the model and its expressivity (the latter translates into more efficiency and ease of expressing some complex computations).

Monday, March 2, 2009

Pig Latin: A Not-So-Foreign Language for Data Processing

The paper presents a new language for querying information in data centers, trying to fill a gap between the high level SQL and the low level, hand written, map-reduce execution plans.
The main advantage that I see is that, unlike SQL, this language does not impose a schema on the data, is extensible with user-defined functions, and allows nested structures.
The paper also makes the case that it is easier for programmers to write Pig than declarative SQL (it is more natural to write imperative code, and it is well known that debugging declarative programs is difficult).
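
To illustrate the step-by-step dataflow style (this is plain Python over an in-memory list, not Pig Latin syntax), a hypothetical load/filter/group/aggregate pipeline might look like:

from collections import defaultdict

records = [                                      # stand-in for LOAD 'visits'
    {"user": "alice", "url": "a.com", "time": 10},
    {"user": "bob",   "url": "a.com", "time": 20},
    {"user": "alice", "url": "b.com", "time": 5},
]

filtered = [r for r in records if r["time"] >= 10]          # FILTER BY time >= 10

grouped = defaultdict(list)                                  # GROUP BY url
for r in filtered:
    grouped[r["url"]].append(r)

visits_per_url = {url: len(rs) for url, rs in grouped.items()}   # FOREACH ... GENERATE COUNT
print(visits_per_url)   # {'a.com': 2}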

In general I am skeptical when presented with yet another new language, and at first glance it seems that most examples could be written in SQL. However, after reading more, I actually liked Pig: I think there is a need for such a language, and the choices Pig makes make sense to me. Since it is open source and not many such systems are readily available to outside communities, I would say Pig may have the traction to be influential in 10 years.

Wednesday, February 25, 2009

HIVE: Data Warehousing & Analytics on Hadoop

HIVE is a system for querying and managing structured data built on top of Hadoop. The client describes a query in the Hive query language (based on SQL). Based on this query, Hive creates a MapReduce execution plan. The Hive data model is slightly simpler than a full relational database, but appears strong enough for most data center query applications. Unlike the other scalability-oriented extensions of database systems that we looked at, Hive is designed to be used mostly offline rather than online.

PNUTS: Yahoo!'s Hosted Data Serving Platform

The paper presents a database system called PNUTS. PNUTS offers a service with slightly stronger semantics than the models seen before, such as Bigtable and Dynamo, but still much simpler than a full relational database. PNUTS has relational tables, but accepts queries only on individual tables. The consistency model can be stronger than eventual consistency, and the API lets the user specify the desired read consistency level per call (I think this is a nice feature). However, the consistency model is weaker than serializable transactions.
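
A hypothetical Python wrapper sketching these per-call read levels, loosely following the read-any / read-critical / read-latest calls described in the paper (the classes below are invented for illustration):

class Replica:
    """Toy in-memory replica storing key -> (value, version)."""
    def __init__(self, data=None):
        self.data = dict(data or {})
    def get(self, key):
        return self.data[key][0]
    def get_with_version(self, key):
        return self.data[key]

class PnutsLikeClient:
    """Hypothetical client exposing per-call read consistency levels."""
    def __init__(self, replicas, master_index=0):
        self.replicas = replicas
        self.master = replicas[master_index]
    def read_any(self, key):
        # Possibly stale: serve from any (e.g. the closest) replica.
        return self.replicas[-1].get(key)
    def read_critical(self, key, required_version):
        # Return a value at least as fresh as required_version.
        for r in self.replicas:
            value, version = r.get_with_version(key)
            if version >= required_version:
                return value
        raise RuntimeError("no replica fresh enough")
    def read_latest(self, key):
        # Strongest read: go to the record's master replica.
        return self.master.get(key)

client = PnutsLikeClient([Replica({"user:1": ("v2", 2)}),   # master copy
                          Replica({"user:1": ("v1", 1)})])  # lagging replica
print(client.read_any("user:1"), client.read_latest("user:1"))   # v1 v2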

A key architectural component of PNUTS is a pub-sub system, used to ensure reliability and replication. PNUTS uses a proprietary message broker, and updates are considered committed when they have been published to the message broker.

Wednesday, February 18, 2009

Dynamo: Amazon's Highly Available Key-Value Store

The paper presents the design and implementation of a highly available key-value storage system. The problem is that database systems are far too complex and scale poorly in a distributed setting for the requirements of modern large scale software services.

The high-level idea of Dynamo is to trade consistency for availability. The query model is a single-key lookup with eventual consistency. In Dynamo, conflict resolution occurs at reads rather than writes (in contrast to traditional systems or Google's Bigtable). Dynamo's architecture can be characterized as a zero-hop DHT, where each node has pointers to all the other nodes (to reduce latency and its variability).
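
A toy Python sketch of the consistent-hashing ring with virtual nodes that underlies this partitioning (preference lists, replication, and vector clocks are omitted):

import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hashing ring with virtual nodes, in the spirit of
    Dynamo's partitioning scheme; details are simplified."""
    def __init__(self, nodes, vnodes=8):
        self.ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        # Walk clockwise to the first virtual node at or past the key's hash.
        i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"), ring.owner("cart:7"))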

I found it interesting to see how previous research on DHTs (e.g. virtual nodes) and distributed systems (e.g. vector clocks) is combined to form a real working system in industry.

Wednesday, February 11, 2009

Bigtable: A Distributed Storage System for Structured Data

Bigtable is a distributed structured storage used inside the Google data centers.
The Bigtable interface is very simple, mapping a key tuple with three fields (row, column, timestamp) to an uninterpreted string (an array of bytes).
In the implementation, a range of rows (a tablet) is the unit of distribution and load balancing.
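
A toy Python sketch of this data model, i.e. a sparse map from (row, column, timestamp) to an uninterpreted string of bytes (the example row and column names follow the paper's webtable example):

table = {}

def put(row, column, value, timestamp):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the value with the highest timestamp for this (row, column) cell."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", b"<html>v1</html>", timestamp=1)
put("com.cnn.www", "contents:", b"<html>v2</html>", timestamp=2)
print(get_latest("com.cnn.www", "contents:"))   # b'<html>v2</html>'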


I really liked this paper: it presents a simple abstraction that is useful in many circumstances and easy to scale. I think it will probably be influential in 10 years because traditional databases are harder and harder to scale at modern data center sizes.

The Chubby Lock Service for Loosely-Coupled Distributed Systems

The paper presents a lock service to be used by loosely coupled distributed systems.
Chubby is adapted to the requirements of Google applications. Therefore, it is optimized for availability and reliability rather than high performance, with locks expected to be coarse-grained (held for long periods). Chubby is implemented as a set of servers that run a consensus protocol among themselves. Interestingly, Chubby locks are advisory rather than mandatory, meaning applications must cooperate and choose not to modify or corrupt a locked file. This is based mainly on two reasons: it keeps the lock service simpler to implement, and the service does not have to handle truly malicious applications since it runs in a closed environment.
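
A small sketch of what advisory locking means in practice; the ChubbyLikeCell class and its methods are invented for illustration:

class ChubbyLikeCell:
    def __init__(self):
        self.locks = {}                      # path -> holder, kept by the lock service

    def try_acquire(self, path, client_id):
        if self.locks.get(path) is None:
            self.locks[path] = client_id
            return True
        return False

    def release(self, path, client_id):
        if self.locks.get(path) == client_id:
            del self.locks[path]

cell = ChubbyLikeCell()
if cell.try_acquire("/ls/cell/primary", "worker-1"):
    # The lock is advisory: nothing physically prevents a buggy client that
    # skipped try_acquire from touching the protected resource anyway.
    print("worker-1 acting as primary")
    cell.release("/ls/cell/primary", "worker-1")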

The paper thoroughly discusses why a lock service is useful compared to a distributed consensus library such as Paxos, which would require no new service. I mostly agree with the arguments presented: some applications are not written to be highly available as Paxos requires (e.g. they start as prototypes); application writing is simplified with a lock service (rather than implementing distributed consensus in every application); a consensus-based approach would still need to advertise and store small amounts of data, which is difficult to do and would require a lock through the name service anyway; and with a lock service even a single reliable node can make progress, without requiring a quorum.

As a critique, the arguments for building a lock service rather than using a consensus protocol are valid but not well structured. Another issue is that, with advisory rather than mandatory locks, poorly written applications could lead to hard-to-detect bugs.

The Google File System

This file system is very interesting as it is adapted to new assumptions compared to classical distributed file systems. The new assumptions are specific to modern data center environments: failures at this scale are common, files are huge, most files are only appended to, most reads are sequential, and high sustained bandwidth is more important than latency. GFS splits files into chunks and replicates each chunk individually (3 times by default). GFS uses a centralized master that controls the file system namespace and returns the mapping from a file name to chunk handles; chunkservers control the actual access to chunks. Compared to traditional file systems, GFS adds two new operations, snapshot and record append; also, the GFS consistency model is more relaxed: file mutations are atomic, but writes produce new chunk versions and users can read stale data since they cache chunk locations (stale chunks are garbage collected).
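
A rough sketch of the client read path under these assumptions (the Master and Chunkserver classes and their methods are invented; only the flow, i.e. ask the master for a chunk handle and locations, then read directly from a chunkserver, follows the paper):

CHUNK_SIZE = 64 * 1024 * 1024     # GFS uses 64 MB chunks

class Master:
    def __init__(self, namespace):
        self.namespace = namespace           # (filename, chunk_index) -> (handle, [servers])
    def lookup(self, filename, chunk_index):
        return self.namespace[(filename, chunk_index)]

class Chunkserver:
    def __init__(self, chunks):
        self.chunks = chunks                 # handle -> bytes
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def read(master, chunkservers, filename, offset, length):
    handle, locations = master.lookup(filename, offset // CHUNK_SIZE)
    server = chunkservers[locations[0]]      # in practice, pick the closest replica
    return server.read(handle, offset % CHUNK_SIZE, length)

cs = Chunkserver({"h1": b"hello, gfs"})
m = Master({("/logs/a", 0): ("h1", ["cs0"])})
print(read(m, {"cs0": cs}, "/logs/a", 0, 5))   # b'hello'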

I think this paper will be influential in 10 years since it is a large step in the modern era of computer science, advancing the state of the art past traditional file system models now that the assumptions have changed. Moreover, many companies (such as Microsoft) have implemented file systems that practically copy GFS.

What I was less comfortable with is the centralized master, which, although it offers advantages in simplicity and smarter replica placement, feels somewhat contradictory to the huge scaling issues that this file system is actually addressing.

Monday, February 9, 2009

eBay Scaling Odyssey & All I Need is Scale!

The two presentations show five principles used within eBay when implementing a web service. I really liked these principles. I think they show the hidden side of a success story, which many tend to attribute mostly to the business model, overlooking the hard technical problems that were overcome.

The authors then outline the challenges of moving such an application to cloud computing. Among the most important: the difficulty of migrating between cloud and non-cloud deployments, and the fact that accountability and SLAs are difficult or unclear.
I think this is a classical dilemma in computer science: structure better for evolvability but lose efficiency, versus be monolithic and not evolvable but more efficient (a similar example is micro-kernel vs monolithic kernel). I think in time the better structure always wins, although the moment when the performance requirements can be met by such a structuring cannot be easily determined. The principles presented earlier strongly advise structuring an application for scalability/evolvability.

Wednesday, February 4, 2009

A Policy-aware Switching Layer for Data Centers

The problem is that data centers contain a large number of middleboxes (firewalls, load balancers) and configuring them (in what order and when they need to be traversed) is hard.
The paper presents a switching layer on which policies can be deployed specifying which middleboxes to use and in which order, making configuration much easier.
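
A toy illustration of the idea: a policy table maps a traffic class to an ordered middlebox sequence, and the switching layer forwards each packet to the next middlebox in that sequence (the policy format and classification rule below are invented):

policies = {
    # traffic class -> ordered list of middlebox types to traverse
    "external_web": ["firewall", "load_balancer"],
    "internal":     [],                      # internal traffic skips middleboxes
}

def classify(packet):
    return "external_web" if packet["src"].startswith("external:") else "internal"

def next_hop(packet):
    """Return the next middlebox this packet should visit, or 'server'
    once the policy's sequence has been fully traversed."""
    sequence = policies[classify(packet)]
    done = packet.get("traversed", 0)        # how many middleboxes already visited
    return sequence[done] if done < len(sequence) else "server"

pkt = {"src": "external:1.2.3.4", "dst": "web-42"}
print(next_hop(pkt))                                     # firewall
print(next_hop({**pkt, "traversed": 2}))                 # server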

I liked the paper and think it may be influential in 10 years because I think this type of enhancement would really improve the current state of on-path design.

DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers

The authors propose a data center network architecture based on a recursive structure called DCell. They also propose a routing algorithm (traditional routing cannot be used) that provides fault tolerance against failures of servers, links, or even entire racks.

The advantages of the proposed architecture are better fault tolerance and scaling with the number of nodes (although I'm not fully convinced by the second argument).

A clear disadvantage of this structure is that the bisection bandwidth is somewhat small, i.e. roughly O(2*log(hosts)):1. This hurts random communication patterns and random placement of data in cloud computing environments.
For this reason and due to its complexity, I would tend to think this paper will not be influential in 10 years, at least not in deployments.

A Scalable, Commodity Data Center Network Architecture

The paper addresses the problem of expensive network infrastructure in data centers and the fact that current topologies are under-provisioned in bisection bandwidth. The authors propose replacing current topologies built around large core switches (currently 10 Gbps) with a fat-tree topology formed only of smaller commodity switches (currently 1 Gbps). This helps reduce costs, increase bisection bandwidth, improve fault tolerance, and reduce energy consumption.
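
As a back-of-the-envelope check on how far commodity k-port switches go in this construction, the standard k-ary fat-tree arithmetic in Python:

def fat_tree_capacity(k):
    """Back-of-the-envelope sizing of a k-ary fat-tree built from k-port switches."""
    hosts = k ** 3 // 4               # k pods * (k/2 edge switches) * (k/2 hosts each)
    core = (k // 2) ** 2              # core switches
    pod_switches = k * k              # k pods, each with k/2 edge + k/2 aggregation
    return hosts, core + pod_switches

for k in (4, 24, 48):
    hosts, switches = fat_tree_capacity(k)
    print(f"k={k}: {hosts} hosts with {switches} identical k-port switches")
# k=48 -> 27648 hosts, i.e. full bisection built only from commodity 48-port switches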

The problem seems real, particularly with cloud computing presumably requiring more bandwidth between data center nodes (at least due to higher utilization). The solution is interesting.

There are two main issues created by adopting such a solution: load balancing becomes much harder due to the large number of smaller-capacity paths, and wiring is more complex. The authors address the first with static load balancing based on a two-level routing algorithm and, optionally, per-(large-)flow scheduling using a centralized scheduler. To ease the wiring, they propose a packaging solution.

I liked the paper and I think it will be influential in the next 10 years, not necessarily through deployments of this exact form but by potentially triggering a paradigm shift in data center design.

Monday, February 2, 2009

Designing a highly available directory service

The paper provides a few strategies for addressing failures in a directory service. It classifies failures into two classes, system unavailable and system unreliable (a classification I did not find particularly enlightening). The paper presents two backup and recovery strategies: binary (full binary replication) and LDIFF (a logical backup based on differences from the previous version). The second strategy is more general but slower in both backup and recovery.

The paper then describes replication topologies scaling from one data center up to five. The high-level idea is to have one or two master servers per data center and keep them consistent by linking them all in a ring topology (a connected component with one redundant link). To return consistent data even under master failure (within the same data center), the design also uses read-only hub replicas, one per master per data center. The paper further optimizes the design for read performance with an extra layer of client-facing servers (below the hub layer, read-only like the hubs).

Random thoughts: The design is interesting, as consistency is a hard property to achieve, but the presented design is not well motivated (at least not in this section) and alternative designs are not really discussed. It is interesting that, in general, the more mechanism is thrown at coping with failures (e.g. more redundancy), the more complex the system design becomes and the more errors can occur due to this complexity. One can see in the paper the reliability vs. overhead (delay, communication) tradeoff. Finally, the authors mention that the links between masters are “separate”, but I did not understand whether these are dedicated links or just separate outgoing data center interfaces (if dedicated, I would wonder about the failure probability of routing vs. that of the dedicated links).

Failure Trends in a Large Disk Drive Population

The paper presents failure patterns of hard drives collected from a large sample (over 100k drives) in Google’s data centers. The paper argues that failures are hard to define because failure can be context dependent (e.g. a bad driver). Thus, they roughly define a hard drive to have failed if it was replaced in a repair procedure. As expected, failure results are significantly affected by model/manufacturer; the paper mentions this but does not show such a breakdown.

I found the survival probability breakdown by age after a scan error and after a reallocation interesting, as well as the failure rate by age. I also liked the tracing/detection mechanism (built on the Google stack), which is quite general.

I think these results may remain interesting for quite some time, since such detailed studies over large populations are hard to conduct and not often published; however, at least to me, a breakdown by model/manufacturer before drawing all of these charts would have made much more sense.

Failure stories

Crash: Data Center Horror Stories – this article describes a data center failure due to bad power cabling (quite a hidden failure). The most important lessons learned are to avoid single points of failure and to use specialized (data-center-specific), highly reliable equipment.

Data Center Failure As A Learning Experience - this article emphasizes the role of planning and design in data center construction, so as to “fail small”, since “fail never” may not be achievable. The article also notes that most failures are concentrated shortly after equipment is put in service and as it approaches end of life.