CaltechTHESIS
  A Caltech Library Service

An investigation of several document classification algorithms leading to the design of an autonomous software agent for locating specific, relevant information on the World Wide Web

Citation

Lindal, John (2001) An investigation of several document classification algorithms leading to the design of an autonomous software agent for locating specific, relevant information on the World Wide Web. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/k937-cy67. https://resolver.caltech.edu/CaltechTHESIS:12162010-110439957

Abstract

The goal of the research described in this thesis was to design an autonomous software agent that can locate specific, relevant information on the World Wide Web. The first chapter provides the motivation behind this project and a brief overview of the challenges associated with it. The next chapter presents the analysis which led to the development of a new, improved version of the computer program called ITRule. The improvements consist of a new algorithm for classifying documents that outperforms the previous one; significantly enhanced support for data exploration, i.e., the process of extracting information from raw data; and a new algorithm for quantizing numeric variables so they can be used by ITRule. The third part of this thesis compares the performance of three versions of ITRule, two versions of the Naive Bayes classifier, several neural networks, the decision tree algorithm called CART, and a linear support vector machine to determine which one is best suited for selecting relevant web pages. An analysis of the test results shows that a new ITRule classification algorithm, based on cross validation combined with the J-measure, performs best. The fourth and final part of the thesis describes how some of these results were used in the design of a user-friendly, autonomous software agent called Poirot that can help World Wide Web users stay up to date on new developments in topics of interest.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Electrical Engineering
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Electrical Engineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Goodman, Rodney M.
Thesis Committee:
  • Unknown, Unknown
Defense Date: 19 December 2000
Record Number: CaltechTHESIS:12162010-110439957
Persistent URL: https://resolver.caltech.edu/CaltechTHESIS:12162010-110439957
DOI: 10.7907/k937-cy67
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 6210
Collection: CaltechTHESIS
Deposited By: Rita Suarez
Deposited On: 16 Dec 2010 19:56
Last Modified: 09 Nov 2022 19:19

Thesis Files

PDF - Final Version (7MB). See Usage Policy.