Citation
Xu, Lihao (1999) Highly available distributed storage systems. Dissertation (Ph.D.), California Institute of Technology. http://resolver.caltech.edu/CaltechETD:etd05162005084223
Abstract
As the need for data explodes with the passage of time and the increase of computing power, data storage becomes more and more important. Distributed storage, as distributed computing before it, is coming of age as a good solution to make systems highly available, i.e., highly scalable, reliable and efficient. The focus of this thesis is how to achieve data reliability and efficiency in distributed storage systems. This thesis consists of two parts. The first part deals with the reliability of distributed storage systems. Reliability is achieved by computationally efficient MDS array codes that eliminate single points of failure in the systems, thus providing more reliability and flexibility to the systems. Such codes can be used as general MDS errorcorrecting codes. They are particularly suitable for use in distributed storage systems. The second part deals with the efficiency of distributed storage systems. Methods are proposed to improve the performance of data server and storage systems significantly through the proper use of data redundancy. These methods are based on errorcorrecting codes, particularly the MDS array codes developed in the first part.
Two new classes of MDS array codes are presented: the XCode and the BCode. The encoding operations of both codes are optimal, i.e., their update complexity achieves the theoretical lower bound. They distribute parity bits over all columns rather than concentrating them on some parity columns. As with other array codes, the error model for both codes is that errors or erasures are columns of the array, i.e., if at least one bit of a column is an error or erasure, then the whole column is considered to be an error or erasure. Both codes are of distance 3, i.e., they can either: correct two erasures, detect two errors or correct one error. In addition to encoding algorithms, efficient decoding algorithms are proposed, both for erasurecorrecting and for errorcorrecting. In fact, the erasurecorrecting algorithms are also optimal in terms of computation complexity.
The XCode has a very simple geometrical structure: the parity bits are constructed along two groups of parallel parity lines of slopes 1 and 1. This is the origin of the name XCode. This simple geometrical structure allows simple erasuredecoding and errordecoding algorithms, using only XORs and vector cyclicshift operations.
The significance of the Bcode not only includes all its optimality properties: MDS, optimal encoding and optimal decoding, but also its relation with a 3decade old graph theory problem. It is proven in this thesis that constructing a BCode of odd length is exactly equivalent to constructing a perfect onefactorization (or P1F) of a complete graph. Constructing a P1F of an arbitrary complete graph has remained a conjecture since the early 1960's. Though the P1F conjecture remains unsolved, the Bcode as the first real application of the P1F problem will hopefully spur more research on it. It is also conjectured in this thesis that constructing a BCode of any length, even or odd, is equivalent to constructing a P1F of a complete graph. An efficient errorcorrecting algorithm for the BCode is also presented, which is based on the relations between the BCode and its dual. The algorithm might give a hint of how to develop efficient decoding algorithms for other codes.
While it is intuitive that redundancy can bring reliability to a system, this thesis gives another direction: using redundancy actively to improve performance (efficiency) of distributed data systems. The results in this direction are both theoretical and experimental. System models are extracted from experiments in real practical systems; analytical results are derived using these and are then fed back to experiments for verification.
In this thesis, a novel deterministic voting scheme that uses errorcorrecting codes is proposed. The voting scheme generalizes all known simple deterministic voting algorithms. It can be tuned to various application environments with different error rates to drastically reduce average communication complexity, i.e., the amount of information that must be transmitted in order to get correct voting results.
Two problems are identified to improve the performance of general data server systems, namely the data distribution problem and the data acquisition problem. Solutions to these are proposed, as are general analytical results on performance of (n, k) systems. A simple service time model of a practical diskbased distributed server system is given. This model, which is based on experimental results, is a starting point for data distribution and data acquisition schemes. These results, both experimental and analytical, can be further used for more sophisticated scheduling schemes to optimize or improve the performance of data server systems that serve multiple clients simultaneously.
Finally, some research problems related to storage systems are proposed as future directions.
Item Type:  Thesis (Dissertation (Ph.D.)) 

Degree Grantor:  California Institute of Technology 
Division:  Engineering and Applied Science 
Major Option:  Electrical Engineering 
Thesis Availability:  Public (worldwide access) 
Research Advisor(s): 

Thesis Committee: 

Defense Date:  30 November 1998 
Record Number:  CaltechETD:etd05162005084223 
Persistent URL:  http://resolver.caltech.edu/CaltechETD:etd05162005084223 
Default Usage Policy:  No commercial reproduction, distribution, display or performance rights in this work are provided. 
ID Code:  1834 
Collection:  CaltechTHESIS 
Deposited By:  Imported from ETDdb 
Deposited On:  16 May 2005 
Last Modified:  26 Dec 2012 02:42 
Thesis Files

PDF (00_cover.pdf)
 Final Version
See Usage Policy. 7Kb  

Postscript (00_cover.ps)
 Final Version
See Usage Policy. 25Kb  

PDF (01_abst.pdf)
 Final Version
See Usage Policy. 44Kb  

Postscript (01_abst.ps)
 Final Version
See Usage Policy. 169Kb  

PDF (02_content.pdf)
 Final Version
See Usage Policy. 48Kb  

Postscript (02_content.ps)
 Final Version
See Usage Policy. 171Kb  

PDF (03_chpt1.pdf)
 Final Version
See Usage Policy. 143Kb  

Postscript (03_chpt1.ps)
 Final Version
See Usage Policy. 222Kb  

PDF (04_chpt2.pdf)
 Final Version
See Usage Policy. 182Kb  

Postscript (04_chpt2.ps)
 Final Version
See Usage Policy. 237Kb  

PDF (05_chpt3.pdf)
 Final Version
See Usage Policy. 201Kb  

Postscript (05_chpt3.ps)
 Final Version
See Usage Policy. 303Kb  

PDF (06_chpt4.pdf)
 Final Version
See Usage Policy. 273Kb  

Postscript (06_chpt4.ps)
 Final Version
See Usage Policy. 511Kb  

PDF (07_chpt5.pdf)
 Final Version
See Usage Policy. 231Kb  

Postscript (07_chpt5.ps)
 Final Version
See Usage Policy. 489Kb  

PDF (08_chpt6.pdf)
 Final Version
See Usage Policy. 104Kb  

Postscript (08_chpt6.ps)
 Final Version
See Usage Policy. 194Kb  

PDF (09_bbl.pdf)
 Final Version
See Usage Policy. 56Kb  

Postscript (09_bbl.ps)
 Final Version
See Usage Policy. 170Kb  

PDF (full_thesis.pdf)
 Final Version
See Usage Policy. 734Kb  

Postscript (full_thesis.ps)
 Final Version
See Usage Policy. 1196Kb 
Repository Staff Only: item control page