Bora Cenk Gazen

Thesis Title: Discovering Web Structure with Multiple Experts in a Clustering Framework
Degree Type: Ph.D. in Computer Science
Advisor(s): Jaime Carbonell
Graduated: December 2008

Abstract:

The World Wide Web contains vast amounts of data, but only a small portion of it is accessible to machines in an operational form. The rest of this collection sits behind a presentation layer that renders web pages in a human-friendly form but also hampers machine processing of the data. Converting web data into operational form is the task of data extraction. Current approaches to data extraction from the web either require human effort to guide supervised learning algorithms or are customized to extract a narrow range of data types in specific domains. We focus on the broader problem of discovering the underlying structure of any database-generated web site. Our approach automatically discovers the relational data hidden behind these web sites by combining experts that identify the relationship between surface structure and the underlying structure.

Our approach is to have a set of software experts that analyze a web site's pages. Each of these experts is specialized to recognize a particular type of structure. These experts discover similarities between data items within the context of the particular types of structure they analyze and output their discoveries as hypotheses in a common hypothesis language. We find the most likely clustering of data using a probabilistic framework in which the hypotheses provide the evidence. From the clusters, the relational form of the data is derived.
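The pipeline described above (experts emit hypotheses in a shared form, and the hypotheses drive a clustering) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Hypothesis` record, the `url_pattern_expert` function, and the use of simple union-find merging in place of the probabilistic framework, whose details the abstract does not give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical common form: one expert claims these items belong together."""
    expert: str
    items: frozenset


def url_pattern_expert(pages):
    """Illustrative expert: groups pages whose URLs share a path prefix."""
    groups = {}
    for page in pages:
        prefix = page.rsplit("/", 1)[0]
        groups.setdefault(prefix, set()).add(page)
    # Only groups with at least two members say anything about similarity.
    return [Hypothesis("url", frozenset(g)) for g in groups.values() if len(g) > 1]


def cluster(items, hypotheses):
    """Merge items linked by any hypothesis (union-find).

    This is a deliberately simplified stand-in, not the probabilistic
    clustering the thesis describes. Hypotheses must mention only known items.
    """
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for hyp in hypotheses:
        members = sorted(hyp.items)
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(c) for c in clusters.values())
```

Because every expert returns the same record type, hypotheses from heterogeneous experts can be concatenated into one list before clustering, which is the point of having a common language.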

We develop two frameworks following the principles of our approach. The first framework introduces a common hypothesis language in which heterogeneous experts express their discoveries. The second framework extends the common language to allow experts to assign confidence scores to their hypotheses.
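One way to picture how confidence scores might enter as evidence is sketched below. This is an assumption for illustration only: the abstract does not say how scores are combined, so the sketch uses a noisy-OR rule, which treats each hypothesis as independent evidence that a pair of items belongs together. The function name and the `(items, confidence)` input shape are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def combine_confidences(scored_hypotheses):
    """Combine per-hypothesis confidences into pairwise scores via noisy-OR.

    scored_hypotheses: iterable of (items, confidence) pairs, where items is a
    set of data items and confidence lies in [0, 1]. The noisy-OR rule is an
    illustrative choice, not the combination rule used in the thesis.
    """
    # Probability that every hypothesis covering a pair is wrong.
    disbelief = defaultdict(lambda: 1.0)
    for items, confidence in scored_hypotheses:
        for a, b in combinations(sorted(items), 2):
            disbelief[(a, b)] *= 1.0 - confidence
    return {pair: 1.0 - d for pair, d in disbelief.items()}
```

Under this rule, two experts that each assert with confidence 0.5 that a pair belongs together yield a combined score of 0.75, so agreement between independent experts strengthens the evidence for placing the pair in one cluster.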

We experiment in the web domain by comparing the output of our approach to the data extracted by a supervised wrapper-induction system and validated manually. Our results show that our approach performs well in the data extraction task on a variety of web sites.

Our approach is applicable to other structure discovery problems as well. We demonstrate this by successfully applying our approach in the record deduplication domain.

Thesis Committee:
Jaime Carbonell (Chair)
William Cohen
John Lafferty
Steven Minton (Fetch Technologies)

Peter Lee, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords:
Structure discovery, heterogeneous experts, hypothesis language, confidence scores, clustering, unsupervised data extraction, world wide web, record linkage

CMU-CS-08-154.pdf (1003.17 KB) (134 pages)