The paper provides a few strategies for addressing failures in a directory service. The paper classifies failures in two classes: system unavailable and system unreliable (classification I did not find particularly enlightening). The paper presents two backup and recovery strategies: binary (which is full binary replication) and LDIFF (a logical backup based on differences from previous version). The second strategy is more general but slower in both backup and recovery.
The paper then describes replication topologies starting from one data center up to 5 data centers. The high level idea is to have one/two master servers per data center and keep them consistent by linking them all in a circular topology (a connected component with one redundant link). To return consistent data even in master failure (within the same data center), the design also uses hub replicas one per master in datacenter (that are read only). The paper further optimizes the design for read performance with an extra layer of client servers (below the hub layer, still read only as hubs).
Random thoughts: The design is interesting, as consistency is a property hard to achieve, but the presented design not motivated (at least not in this section) and alternative designs are not really discussed. It is interesting that in general, the more mechanism is thrown to cope with failures (e.g. more redundancy), the more complex the system design is and more errors can occur due to this complexity. One can see in the paper the reliability vs overhead (delay, communication) tradeoff. Finally, the authors mention that the links between masters are “separate” but I did not understood whether they are dedicated links or just the outgoing datacenter interface is separate (if dedicated, I would wonder about the failure probability of routing vs that of the dedicated links).