A Caltech Library Service

Fault-tolerant cluster of networking elements


Fan, Chenggong Charles (2001) Fault-tolerant cluster of networking elements. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/R15B-VD58.


The explosive growth of the Internet demands higher reliability and performance than what the current networking infrastructure can provide. This dissertation explores novel architectures and protocols that provide a methodology for grouping together multiple networking elements, such as routers, gateways, and switches, to create a more reliable and performant distributed networking system. Clustering of networking elements is a novel concept that requires the invention of distributed computing protocols that facilitate efficient and robust support of networking protocols. We introduce the Raincore protocol architecture that achieves these goals by bridging the fields of computer networks and distributed systems. In designing Raincore, we paid special attention to the unique requirements from the networking environment. First, networking clusters need to scale up the networking throughput in addition to the scaling up of computing power. Second, task switching between the different services supported by a networking element has a major negative impact on performance. Third, fast fail-over time is critical for maintaining network connections in the event of failures. We discuss in depth the design of Raincore Group Communication Manager that addresses the forgoing requirements and provides group membership management and reliable multicast transport. It is based on a novel token-ring protocol. We prove that this protocol is formally correct, namely, it satisfies the set of formal specifications that defines the Group Membership problem. The creation of Raincore has already made a substantial impact both at Caltech and the academic community as well as in the industry. The first application is SNOW, a scalable web server cluster that is part of RAIN, a collaborative project between Caltech and JPL/NASA. The second application is RainWall, a commercial solution created by Rainfinity, a Caltech spin-off company, that provides the first fault-tolerant and scalable firewall cluster. These applications exhibit the fast fail-over response, low overhead, and near-linear scalability of the Raincore protocols. In addition, we studied fault-tolerant networking architectures. In particular, we considered efficient constructions of extra-stage fault-tolerant Multistage Interconnection Networks. Multistage Interconnection Networks provide a way to construct a larger switching network using smaller switching elements. We discovered an optimal family of constructions, in the sense that it requires the least number of extra components to tolerate multiple switching element failures. We prove that this is the only family of constructions that has this optimal fault-tolerance property.

Item Type:Thesis (Dissertation (Ph.D.))
Subject Keywords:Communications.; Computer Networks; Distributed Protocols; Distributed Systems; Fault tolerance; Group Membership; High Availability; Load Balancing; Multistage Interconnection Networks
Degree Grantor:California Institute of Technology
Division:Engineering and Applied Science
Major Option:Electrical Engineering
Thesis Availability:Public (worldwide access)
Research Advisor(s):
  • Bruck, Jehoshua
Thesis Committee:
  • Bruck, Jehoshua (chair)
  • DeHon, Andre
  • Effros, Michelle
  • McEliece, Robert J.
  • Low, Steven H.
Defense Date:11 May 2001
Non-Caltech Author Email:fan (AT)
Record Number:CaltechETD:etd-08152001-144501
Persistent URL:
Default Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:3121
Deposited By: Imported from ETD-db
Deposited On:06 Sep 2001
Last Modified:21 Dec 2019 01:56

Thesis Files

PDF (Thesis.pdf) - Final Version
See Usage Policy.


Repository Staff Only: item control page