Wednesday, February 11, 2009

The Google File System

This file system is interesting because it is built around assumptions that differ from those of classical distributed file systems. The new assumptions reflect modern data center environments: at this scale, component failures are common, files are huge, most files are mutated by appending rather than overwriting, most reads are sequential, and high sustained bandwidth matters more than low latency.

GFS splits files into fixed-size chunks and replicates each chunk individually (3 times by default). A single centralized master controls the file system namespace and maps a file name and chunk index to a chunk handle; chunkservers control the actual access to chunk data. Compared to traditional file systems, GFS adds two new operations, snapshot and record append. Its consistency model is also more relaxed: file mutations are atomic, but writes are done on new chunks, and clients can read stale data because they cache chunk handles (old chunks are eventually garbage collected).
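To make the split of responsibilities concrete, here is a minimal sketch in Python of the client read path as I understand it from the paper. All class and method names below are my own invention, not the real GFS API (GFS is a C++ system with its own client library), but the flow matches the design: the client asks the master only for metadata, caches the answer, and then reads data directly from a chunkserver.

    CHUNK_SIZE = 64 * 2**20  # GFS uses 64 MB chunks

    class Chunkserver:
        """Stores chunk data keyed by handle; serves reads directly to clients."""
        def __init__(self):
            self.chunks = {}  # handle -> bytes

        def read(self, handle, offset, length):
            return self.chunks[handle][offset:offset + length]

    class Master:
        """Keeps only metadata: the namespace mapping and replica locations."""
        def __init__(self):
            self.chunk_table = {}  # (filename, chunk index) -> handle
            self.locations = {}    # handle -> list of chunkserver replicas

        def lookup(self, filename, chunk_index):
            handle = self.chunk_table[(filename, chunk_index)]
            return handle, self.locations[handle]

    class Client:
        def __init__(self, master):
            self.master = master
            self.cache = {}  # cached lookups; these can go stale

        def read(self, filename, offset, length):
            index = offset // CHUNK_SIZE
            if (filename, index) not in self.cache:  # contact the master only on a miss
                self.cache[(filename, index)] = self.master.lookup(filename, index)
            handle, replicas = self.cache[(filename, index)]
            # Data is served by a chunkserver, never by the master, so the
            # master stays off the data path. A stale cache entry is exactly
            # how a client can end up reading an old chunk.
            return replicas[0].read(handle, offset % CHUNK_SIZE, length)

    # Toy setup: one chunkserver holding chunk 0 of a hypothetical file.
    cs = Chunkserver()
    cs.chunks["h1"] = b"GET /index.html 200\n"
    m = Master()
    m.chunk_table[("/logs/web.log", 0)] = "h1"
    m.locations["h1"] = [cs]
    print(Client(m).read("/logs/web.log", 0, 19))  # b'GET /index.html 200'

The point of the sketch is the division of labor: the master handles a tiny metadata lookup once per chunk, and all the heavy byte traffic flows between clients and chunkservers.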

I think this paper will be influential in 10 years, since it is a large step forward in the modern era of computer science, advancing the state of the art beyond traditional file system models now that the underlying assumptions have changed. Moreover, many companies (such as Microsoft) have implemented file systems that practically copy GFS.

What I was less comfortable with is the centralized master: although it offers advantages in simplicity and smarter replica placement, it feels somewhat contradictory to the huge scaling issues that this file system is actually addressing.
