Citation
Li, Ling (2006) Data complexity in machine learning and novel classification algorithms. Dissertation (Ph.D.), California Institute of Technology. http://resolver.caltech.edu/CaltechETD:etd04122006114210
Abstract
This thesis summarizes four of my research projects in machine learning. One of them is on a theoretical challenge of defining and exploring complexity measures for data sets; the others are about new and improved classification algorithms.

We first investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by that data set. It is closely related to several existing principles used in machine learning, such as Occam's razor, the minimum description length, and the Bayesian approach. We demonstrate the application of data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and we propose methods for estimating the complexity contribution. Experiments were carried out with a practical complexity measure on several toy problems.

We then propose a family of novel learning algorithms to directly minimize the 0/1 loss for perceptrons. A perceptron is a linear threshold classifier that separates examples with a hyperplane. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the 0/1 loss, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost.

We also study ensemble methods that aggregate many base hypotheses in order to achieve better performance. AdaBoost is one such method for binary classification problems.
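To make the 0/1 loss concrete, here is a minimal sketch of a perceptron trained by randomized local search on that loss. This is an illustration only, not the thesis's random coordinate descent algorithm; all function names and parameters are hypothetical.

```python
import numpy as np

def zero_one_loss(w, X, y):
    # Fraction of examples misclassified by the linear threshold
    # classifier sign(X @ w); labels y are in {-1, +1}.
    return float(np.mean(np.sign(X @ w) != y))

def randomized_search_01(X, y, n_iters=1000, seed=0):
    # Hypothetical sketch: perturb one random coordinate of the weight
    # vector at a time, keeping the change only when the (non-smooth)
    # 0/1 loss does not increase. No gradient is ever needed.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w[0] = 1.0  # arbitrary non-zero starting hyperplane
    best = zero_one_loss(w, X, y)
    for _ in range(n_iters):
        i = rng.integers(d)           # pick a random coordinate
        w_new = w.copy()
        w_new[i] += rng.normal()      # random step on that coordinate
        loss = zero_one_loss(w_new, X, y)
        if loss <= best:              # accept only non-worsening moves
            w, best = w_new, loss
    return w, best
```

Because moves are accepted only when the training 0/1 loss does not increase, the returned loss can never exceed that of the starting hyperplane; the thesis's algorithms improve on this naive scheme with directed searches along random directions.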
The superior out-of-sample performance of AdaBoost has been attributed to the fact that it minimizes a cost function based on the margin, in that it can be viewed as a special case of AnyBoost, an abstract gradient descent algorithm. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are difficult to optimize. Our experiments demonstrate that CGBoost generally outperforms AnyBoost in cost reduction. With suitable cost functions, CGBoost can have better out-of-sample performance.

A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. The quality of the final solution, which is an ensemble of base classifiers learned on the binary problems, is affected by both the performance of the base learner and the error-correcting ability of the coding matrix. A coding matrix with strong error-correcting ability may not be overall optimal if the binary problems are too hard for the base learner. Thus a trade-off between error-correcting ability and base learning should be sought. We propose a new multiclass boosting algorithm that modifies the coding matrix according to the learning ability of the base learner. We show experimentally that our algorithm is very efficient in optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one. The improvement is especially significant when the base learner is not very powerful.
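The coding-matrix reduction described above can be sketched as follows. This is a generic error-correcting output code (ECOC) illustration, not the thesis's adaptive algorithm; the example matrix is the standard exhaustive code for four classes, and the function names are hypothetical.

```python
import numpy as np

# Exhaustive coding matrix for 4 classes (rows) and 7 binary problems
# (columns), entries in {-1, +1}. Each column splits the classes into
# two groups; any two rows differ in 4 positions, so a single binary
# classifier's error can be corrected at decoding time.
M = np.array([
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
])

def ecoc_decode(binary_preds, M):
    # Given the {-1, +1} outputs of the binary base classifiers on one
    # example, pick the class whose code row agrees with them most
    # (equivalently, minimum Hamming distance).
    agreements = M @ np.asarray(binary_preds)
    return int(np.argmax(agreements))
```

The trade-off discussed in the abstract appears here directly: a matrix with larger row separation corrects more binary mistakes, but its columns may define harder binary problems for a weak base learner.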
Item Type:  Thesis (Dissertation (Ph.D.)) 

Subject Keywords:  0/1 loss; AdaBoost; boosting; CGBoost; complexity contribution; conjugate gradient; data complexity; data decomposition; data pruning; decision stump; error-correcting codes; fingerprint plot; gradient descent; Kolmogorov complexity; max-cut; multiclass boosting; Occam's razor; outlier detection; perceptron learning; principal subset; random coordinate descent; repartition; SVM 
Degree Grantor:  California Institute of Technology 
Division:  Engineering and Applied Science 
Major Option:  Computer Science 
Thesis Availability:  Public (worldwide access) 
Research Advisor(s): 

Thesis Committee: 

Defense Date:  4 May 2006 
Record Number:  CaltechETD:etd04122006114210 
Persistent URL:  http://resolver.caltech.edu/CaltechETD:etd04122006114210 
Default Usage Policy:  No commercial reproduction, distribution, display or performance rights in this work are provided. 
ID Code:  1361 
Collection:  CaltechTHESIS 
Deposited By:  Imported from ETDdb 
Deposited On:  30 May 2006 
Last Modified:  29 May 2014 18:33 
Thesis Files

PDF (thesis.pdf) - Final Version - See Usage Policy. 2349Kb
Postscript (thesis.ps) - Final Version - See Usage Policy. 3055Kb