Monday, March 30, 2009

Friday: Global Comprehension for Distributed Replay

Friday is a system for distributed debugging based on deterministic replay. It allows distributed watchpoints and breakpoints to be placed in the replayed system, and in addition lets commands written in a scripting language (Python) be attached to those watchpoints and breakpoints. These commands can modify debug variables, implement complex predicates, or even call functions of the debugged application. This seems like a really important feature, as exemplified by the simplicity with which the Chord example can be debugged.
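To make the idea concrete, here is a minimal Python sketch of how a command attached to a watchpoint might look. All names here are illustrative, not Friday's actual API:

```python
# Hypothetical sketch of Friday-style watchpoints: a Python command is
# attached to a watchpoint and runs whenever the watched variable changes
# during replay. All names are invented for illustration.

class Watchpoint:
    def __init__(self, var, command):
        self.var = var          # name of the watched variable
        self.command = command  # Python callable run on each change
        self.last = None

    def check(self, state):
        value = state.get(self.var)
        if value != self.last:
            self.command(state, self.last, value)
            self.last = value

# Example command: a predicate over distributed state, e.g. flagging a
# Chord-like successor inconsistency.
violations = []
def check_successor(state, old, new):
    if new is not None and new < state["node_id"]:
        violations.append((state["node_id"], new))

wp = Watchpoint("successor", check_successor)
for step in [{"node_id": 5, "successor": 8},
             {"node_id": 5, "successor": 3}]:   # replayed states
    wp.check(step)

print(violations)  # the second state violates the predicate
```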

XTrace

XTrace is a tracing system for distributed, complex applications that can trace across multiple layers. It propagates labels with the execution and collects reports in a conceptually centralized database, from which the trace is reconstructed. XTrace requires code instrumentation at every traced layer of the stack.

I liked XTrace and I think it can be very useful for checking where a distributed execution got stuck, since it generates an easy-to-follow tree output.
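A toy Python sketch of how labeled reports could be stitched back into that tree by the central database. The report fields and layer names are made up for illustration:

```python
# Rough sketch of the idea behind XTrace labels: each report carries
# (task id, operation id, parent operation id), and the central database
# recovers the trace tree by linking parent ids. Fields are illustrative.

reports = [
    {"task": "t1", "op": "A", "parent": None, "layer": "HTTP"},
    {"task": "t1", "op": "B", "parent": "A",  "layer": "TCP"},
    {"task": "t1", "op": "C", "parent": "A",  "layer": "DNS"},
    {"task": "t1", "op": "D", "parent": "B",  "layer": "IP"},
]

def render(reports, parent=None, depth=0):
    """Return the trace as an indented list of lines, one per operation."""
    lines = []
    for r in reports:
        if r["parent"] == parent:
            lines.append("  " * depth + f"{r['op']} ({r['layer']})")
            lines.extend(render(reports, r["op"], depth + 1))
    return lines

tree = render(reports)
print("\n".join(tree))
```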

DTrace: Dynamic Instrumentation of Production Systems

DTrace is a tracing system implemented in the Solaris kernel that allows tracing of both user-level and kernel-level programs. It is dynamic, meaning that instrumentation is enabled only when explicitly requested; when not in use, it consumes no extra resources.
The architecture is two-tiered, with a core DTrace module (which acts as a multiplexer and disables interrupts) and providers, the modules that actually perform the tracing. In this way, different instrumentation methodologies can be added; the authors already implemented many different providers.
Users specify arbitrary predicates to monitor their programs using a C-like programming language called "D".
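Not D itself, but a Python sketch of the predicate idea: a probe fires on every event, and an arbitrary user-supplied predicate decides which firings get recorded. The probe and field names are invented:

```python
# Python sketch of the concept behind D predicates (this is not the D
# language): a probe fires on each event, the user's predicate filters
# firings, and the action decides what to record. Names are illustrative.

recorded = []

def probe(event, predicate, action):
    if predicate(event):
        recorded.append(action(event))

# e.g. trace only syscalls slower than 1000 microseconds
events = [
    {"syscall": "read",  "usec": 120},
    {"syscall": "write", "usec": 4500},
    {"syscall": "open",  "usec": 2300},
]
for e in events:
    probe(e,
          lambda ev: ev["usec"] > 1000,              # predicate
          lambda ev: (ev["syscall"], ev["usec"]))    # action

print(recorded)  # only the slow syscalls are recorded
```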

This architecture is interesting. I think a recurring tradeoff in tracing systems is the amount of user work versus the overhead of tracing.

Wednesday, March 18, 2009

Chukwa

Chukwa is a data collection system built on top of Hadoop. It solves some problems particular to this context, such as the fact that HDFS is not well suited to holding the many small files produced by monitoring; for this, Chukwa uses collectors to aggregate logs and reduce the number of HDFS files generated. Chukwa is in use at Yahoo, and the evaluation shows a small overhead.
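A toy Python sketch (not Chukwa's actual code) of why the collector helps: many small records from agents are buffered and written out as a few large sink files, instead of one file per record:

```python
# Illustrative sketch of collector-side batching: buffer incoming log
# records and flush them to storage in large batches, so 250 small
# records end up as 3 sink files instead of 250.

class Collector:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.files_written = []   # stands in for HDFS sink files

    def receive(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.files_written.append(list(self.buffer))
            self.buffer.clear()

c = Collector(batch_size=100)
for i in range(250):                 # 250 small records arrive...
    c.receive(f"log line {i}")
c.flush()                            # ...and land in only 3 files
print(len(c.files_written))
```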

Artemis

Artemis is a framework designed to analyze logs for performance troubleshooting. It consists of four parts: log collection and data extraction, a database, a visualization tool, and a plug-in interface.

I liked that the paper presents a real problem that was detected using this tool. After reading the paper, I am not sure how much work is required to adapt Artemis to a new environment/application versus writing a quick-and-dirty, application-specific script to monitor interesting variables.
I think a tradeoff here is between structuring logs, which pushes more work into the automated part of the analyzer and makes analysis easier, and the ease of generating (unstructured) logs.
Related to this, the paper does not specify how the DryadLINQ computation that summarizes the logs works, nor whether using a commercial database for the analyzed data scales in all cases.

Wednesday, March 4, 2009

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

DryadLINQ is a layer on top of Dryad that lets users easily implement distributed computations written in a high-level language, LINQ. DryadLINQ compiles user programs into Dryad tasks and runs them on clusters under the hood, with the user needing little awareness of this process. LINQ allows users to write both SQL-style queries and imperative, object-oriented programs. From the examples and demos that I've seen, it looks like a really neat tool. Compared to both HIVE and Pig, the LINQ language seems more powerful, and the underlying Dryad offers more room for optimization than map-reduce does for the other two. Again, DryadLINQ pays a price in worldwide usage and adoption for relying on proprietary technology.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad is another framework for writing distributed computations. Compared to map-reduce, it allows a general computational DAG, with arbitrary structure and vertices that can implement arbitrary computations. This gain is paid for with increased complexity. However, as suggested by recent industry feedback, such as Yahoo's, these features seem useful in practice. A classical example is a join between a small and a large table, where the small table can be distributed to all nodes and held in memory.
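The broadcast-join example can be sketched in a few lines of Python. The data and names are illustrative, and in the real system the per-partition joins would run on different nodes:

```python
# Sketch of a broadcast join: the small table is shipped to every node
# and held in memory as a hash map, so each partition of the large table
# joins locally without a shuffle. Data is made up for illustration.

small = {1: "us", 2: "eu", 3: "asia"}        # broadcast to all nodes

def join_partition(partition, broadcast):
    """Join one partition of the large table against the in-memory small table."""
    return [(user, broadcast[region])
            for user, region in partition
            if region in broadcast]

# the large table, split across two "nodes"
partitions = [
    [("alice", 1), ("bob", 2)],
    [("carol", 3), ("dave", 1)],
]
joined = [row for p in partitions for row in join_partition(p, small)]
print(joined)
```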

Dryad is used at wide scale at Microsoft, and I think it will be influential in 10 years because, as an extension of map-reduce-style jobs, it is the first paper to show how this can be done in a data center. However, due to the lack of an open-source implementation, the more complex Dryad paradigm lags map-reduce in worldwide usage.

MapReduce: Simplified Data Processing on Large Clusters

Map-reduce presents a framework for running large distributed computations. The main contribution of map-reduce is the identification of a construction that is simple, yet general enough to naturally capture a (really) wide variety of distributed computations used in practice. That construction is the map-reduce framework itself.
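The canonical word-count example, sketched sequentially in Python to show the map/shuffle/reduce structure (the real system runs the map and reduce tasks in parallel across a cluster):

```python
# Sequential sketch of the map-reduce structure on the classic word-count
# example; the framework itself would parallelize each phase.

from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1                  # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)          # group values by key
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"], counts["fox"])
```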

It seems obvious that this paper will be influential in 10 years.
A tradeoff that can be identified from reading subsequent papers, such as Dryad, is the one between the simplicity of the model and its expressivity (the latter translating into more efficiency and easier expression of some complex computations).

Monday, March 2, 2009

Pig Latin: A Not-So-Foreign Language for Data Processing

The paper presents a new language for querying information in data centers, trying to fill the gap between high-level SQL and low-level, hand-written map-reduce execution plans.
The main advantage I see is that, unlike SQL, this language does not impose a schema on the information, is extensible with user-defined functions, and allows nested structures.
The paper also makes the case that it is easier for programmers to write Pig than declarative SQL (as it is more natural to write imperative code, and it is well known that debugging declarative programs is difficult).
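To illustrate that step-by-step imperative style, here is plain Python mimicking the shape of a Pig script (this is not Pig syntax, and the data is made up):

```python
# Python sketch of the dataflow style of a Pig script: a sequence of
# named steps instead of one declarative query. The comments show the
# rough Pig-like step each line mimics; data is illustrative.

from collections import defaultdict

# LOAD: read the input records
records = [("alice", 18), ("bob", 25), ("carol", 30), ("dave", 17)]

# FILTER: keep only adults
adults = [(name, age) for name, age in records if age >= 18]

# FOREACH ... GENERATE: project out the names
names = [name for name, _ in adults]

# GROUP: bucket by decade, yielding the nested (bag-like) structures
# that Pig's data model allows
by_decade = defaultdict(list)
for name, age in adults:
    by_decade[age // 10 * 10].append(name)

print(names)
print(dict(by_decade))
```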

In general I am skeptical when presented with yet another new language, and at first glance it seems that most examples could be written in SQL. However, after reading more, I actually liked Pig: I think there is a need for such a new language, and the choices made by Pig make sense to me. Since it is open source, and not many such systems are readily available to outside communities, I would say Pig may have the traction to be influential in 10 years.