
Generalization Error Estimates and Training Data Valuation


Nicholson, Alexander Marshall (2002) Generalization Error Estimates and Training Data Valuation. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/1H16-VX81.


This thesis addresses several problems related to generalization in machine learning systems. We introduce a theoretical framework for studying learning and generalization. Within this framework, a closed form is derived for the expected generalization error that estimates the out-of-sample performance in terms of the in-sample performance. We consider the problem of overfitting and show that, using a simple exhaustive learning algorithm, overfitting does not occur. These results do not assume a particular form of the target function, input distribution, or learning model, and hold even with noisy data sets. We apply our analysis to practical learning systems, illustrate how it may be used to estimate out-of-sample errors in practice, and demonstrate that the resulting estimates improve upon errors estimated with a validation set for real-world problems. Based on this study of generalization, we develop a technique for quantitative valuation of training data. We demonstrate that this valuation may be used to select training sets that improve generalization performance. With a reasonable prior over target functions, it further allows us to estimate the level of noise in a data set and provides for detection and correction of noise in individual examples. Finally, this data valuation can be used to classify new examples, yielding a new learning algorithm that is shown to be relatively robust to noise.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: data valuation
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Computer Science
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Abu-Mostafa, Yaser S.
Thesis Committee:
  • Abu-Mostafa, Yaser S. (chair)
  • Bruck, Jehoshua
  • Hickey, Jason J.
  • Magdon-Ismail, Malik
  • Perona, Pietro
Defense Date: 16 May 2002
Non-Caltech Author Email: zander (AT)
Record Number: CaltechETD:etd-09062005-083717
Persistent URL:
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 3347
Deposited By: Imported from ETD-db
Deposited On: 06 Sep 2005
Last Modified: 18 Aug 2022 23:53

Thesis Files

PDF (Nicholson_a_2002.pdf) - Final Version
See Usage Policy.

