Data Complexity in Machine Learning and Novel Classification Algorithms

Citation

Li, Ling (2006) Data Complexity in Machine Learning and Novel Classification Algorithms. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/EW2G-9986. https://resolver.caltech.edu/CaltechETD:etd-04122006-114210

Abstract

This thesis summarizes four of my research projects in machine learning. One of them is on a theoretical challenge of defining and exploring complexity measures for data sets; the others are about new and improved classification algorithms.

We first investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by that data set. It is closely related to several existing principles used in machine learning such as Occam's razor, the minimum description length, and the Bayesian approach. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Experiments were carried out with a practical complexity measure on several toy problems.

We then propose a family of novel learning algorithms to directly minimize the 0/1 loss for perceptrons. A perceptron is a linear threshold classifier that separates examples with a hyperplane. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the 0/1 loss, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost.

We also study ensemble methods that aggregate many base hypotheses in order to achieve better performance. AdaBoost is one such method for binary classification problems. The superior out-of-sample performance of AdaBoost has been attributed to the fact that it minimizes a cost function based on the margin, in that it can be viewed as a special case of AnyBoost, an abstract gradient descent algorithm. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are difficult to optimize. Our experiments demonstrate that CGBoost generally outperforms AnyBoost in cost reduction. With suitable cost functions, CGBoost can have better out-of-sample performance.

A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. The quality of the final solution, which is an ensemble of base classifiers learned on the binary problems, is affected by both the performance of the base learner and the error-correcting ability of the coding matrix. A coding matrix with strong error-correcting ability may not be overall optimal if the binary problems are too hard for the base learner. Thus a trade-off between error-correcting and base learning should be sought. In this paper, we propose a new multiclass boosting algorithm that modifies the coding matrix according to the learning ability of the base learner. We show experimentally that our algorithm is very efficient in optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one. The improvement is especially significant when the base learner is not very powerful.

Item Type:	Thesis (Dissertation (Ph.D.))
Subject Keywords:	0/1 loss; AdaBoost; boosting; CGBoost; complexity contribution; conjugate gradient; data complexity; data decomposition; data pruning; decision stump; error-correcting codes; fingerprint plot; gradient descent; Kolmogorov complexity; max-cut; multiclass boosting; Occam's razor; outlier detection; perceptron learning; principal subset; random coordinate descent; repartition; SVM
Degree Grantor:	California Institute of Technology
Division:	Engineering and Applied Science
Major Option:	Computer Science
Minor Option:	Electrical Engineering
Thesis Availability:	Public (worldwide access)
Research Advisor(s):	Abu-Mostafa, Yaser S.
Thesis Committee:	Abu-Mostafa, Yaser S. (chair) Perona, Pietro McEliece, Robert J. Low, Steven H.
Defense Date:	4 May 2006
Record Number:	CaltechETD:etd-04122006-114210
Persistent URL:	https://resolver.caltech.edu/CaltechETD:etd-04122006-114210
DOI:	10.7907/EW2G-9986
Default Usage Policy:	No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:	1361
Collection:	CaltechTHESIS
Deposited By:	Imported from ETD-db
Deposited On:	30 May 2006
Last Modified:	20 Apr 2020 21:21

Thesis Files

Preview

PDF - Final Version
See Usage Policy.
2MB

Repository Staff Only: item control page