Citation
Lindal, John (2001) An investigation of several document classification algorithms leading to the design of an autonomous software agent for locating specific, relevant information on the World Wide Web. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/k937-cy67. https://resolver.caltech.edu/CaltechTHESIS:12162010-110439957
Abstract
The goal of the research described in this thesis was to design an autonomous software agent that can locate specific, relevant information on the World Wide Web. The first chapter provides the motivation behind this project and a brief overview of the challenges associated with it. The next chapter presents the analysis which led to the development of a new, improved version of the computer program called ITRule. The improvements consist of a new algorithm for classifying documents that outperforms the previous one, significantly enhanced support for data exploration, i.e., the process of extracting information from raw data, and a new algorithm for quantizing numeric variables so they can be used by ITRule. The third part of this thesis compares the performance of three versions of ITRule, two versions of the Naive Bayes classifier, several neural networks, the decision tree algorithm called CART, and a linear support vector machine, in order to determine which one is best suited for selecting relevant web pages. An analysis of the test results shows that a new ITRule classification algorithm, based on cross-validation combined with the J-measure, performs best. The fourth and final part of the thesis describes how some of these results were used in the design of a user-friendly, autonomous software agent called Poirot that can help World Wide Web users stay up to date on new developments in topics of interest.
Item Type: | Thesis (Dissertation (Ph.D.)) |
---|---|
Subject Keywords: | Electrical Engineering |
Degree Grantor: | California Institute of Technology |
Division: | Engineering and Applied Science |
Major Option: | Electrical Engineering |
Thesis Availability: | Public (worldwide access) |
Research Advisor(s): | |
Thesis Committee: | |
Defense Date: | 19 December 2000 |
Record Number: | CaltechTHESIS:12162010-110439957 |
Persistent URL: | https://resolver.caltech.edu/CaltechTHESIS:12162010-110439957 |
DOI: | 10.7907/k937-cy67 |
Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided. |
ID Code: | 6210 |
Collection: | CaltechTHESIS |
Deposited By: | Rita Suarez |
Deposited On: | 16 Dec 2010 19:56 |
Last Modified: | 09 Nov 2022 19:19 |
Thesis Files
PDF - Final Version, 7MB. See Usage Policy.