Citation
Angelova, Anelia Nedelcheva (2004) Data Pruning. Master's thesis, California Institute of Technology. doi:10.7907/T1GM-1R20. https://resolver.caltech.edu/CaltechETD:etd-05282004-000943
Abstract
Could a training example be detrimental to learning? Contrary to the common belief that more training data is needed for better generalization, we show that the learning algorithm might be better off when some training examples are discarded. In other words, the quality of the examples matters.
We explore a general approach to identifying examples that are troublesome for learning with a given model and excluding them from the training set in order to achieve better generalization. We term this process 'data pruning'. The method is intended as a pre-learning step for obtaining better data to train on.
The approach consists of creating multiple semi-independent learners from the dataset, each of which is influenced differently by individual examples. The learners' opinions about which examples are difficult are arbitrated by an inference mechanism. Although it comes without guarantees of optimality, data pruning is shown to decrease the generalization error in experiments on real-life data. It is not assumed that the data or the noise can be modeled or that additional training examples are available.
We apply data pruning to obtain visual category data with little supervision. In this setting, the object data is contaminated with non-object examples. We show that a mechanism for pruning noisy datasets prior to learning can be very successful, especially in the presence of a large amount of contamination or when the algorithm is sensitive to noise.
Our experiments demonstrate that data pruning can be worthwhile even if the algorithm has regularization capabilities or mechanisms to cope with noise, and that it has the potential to become a more refined method for regularization or model selection.
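The abstract describes the general recipe: train several semi-independent learners on the data, let each one vote on which examples are difficult, arbitrate the votes, and discard the flagged examples before learning the final model. The sketch below is a minimal illustration of that idea, not the thesis's exact procedure; the random half-splits, the decision-tree base learner, the vote threshold, and the function name `prune_training_set` are all illustrative assumptions.

```python
# Illustrative sketch of data pruning as a pre-learning step.
# The base learner, the half-data splits, and the majority-vote
# arbitration are assumptions, not the thesis's exact algorithm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def prune_training_set(X, y, n_learners=15, vote_threshold=0.5, seed=0):
    """Flag examples that most semi-independent learners misclassify
    when those examples are held out, and return the pruned set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = np.zeros(n)    # times an example was judged difficult
    counts = np.zeros(n)   # times an example was judged at all

    for _ in range(n_learners):
        # Train on a random half of the data; judge the held-out half.
        train_idx = rng.choice(n, size=n // 2, replace=False)
        held_out = np.setdiff1d(np.arange(n), train_idx)
        learner = DecisionTreeClassifier(max_depth=3, random_state=0)
        learner.fit(X[train_idx], y[train_idx])
        wrong = learner.predict(X[held_out]) != y[held_out]
        votes[held_out] += wrong
        counts[held_out] += 1

    # Arbitrate: prune examples that a majority of the learners
    # that saw them considered difficult.
    difficult = (votes / np.maximum(counts, 1)) > vote_threshold
    keep = ~difficult
    return X[keep], y[keep]
```

A final classifier would then be trained on the pruned `(X, y)` returned by this routine, which is the "better data to train on" that the abstract refers to.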
| Item Type: | Thesis (Master's thesis) |
|---|---|
| Subject Keywords: | data pruning; example elimination; learning from noisy data; outlier detection; regularization |
| Degree Grantor: | California Institute of Technology |
| Division: | Engineering and Applied Science |
| Major Option: | Computer Science |
| Thesis Availability: | Public (worldwide access) |
| Research Advisor(s): | |
| Thesis Committee: | |
| Defense Date: | 28 May 2004 |
| Record Number: | CaltechETD:etd-05282004-000943 |
| Persistent URL: | https://resolver.caltech.edu/CaltechETD:etd-05282004-000943 |
| DOI: | 10.7907/T1GM-1R20 |
| ORCID: | |
| Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided. |
| ID Code: | 2184 |
| Collection: | CaltechTHESIS |
| Deposited By: | Imported from ETD-db |
| Deposited On: | 01 Jun 2004 |
| Last Modified: | 08 Nov 2023 00:44 |
Thesis Files
PDF (DataPruning.pdf) - Final Version. See Usage Policy. 1MB