A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections
Created by W.Langdon from
gp-bibliography.bib Revision:1.8110
@PhdThesis{Yuxin_Chen:thesis,
author = "Yuxin Chen",
title = "A Novel Hybrid Focused Crawling Algorithm to Build
Domain-Specific Collections",
school = "Virginia Polytechnic Institute and State University",
year = "2007",
address = "Blacksburg, Virginia, USA",
month = feb # " 5",
keywords = "genetic algorithms, genetic programming, digital
libraries, focused crawler, classification,
meta-search",
URL = "http://scholar.lib.vt.edu/theses/available/etd-02162007-005107/",
URL = "http://scholar.lib.vt.edu/theses/available/etd-02162007-005107/unrestricted/YuxinDissertation_etd_final1.pdf",
URN = "etd-02162007-005107",
size = "85 pages",
abstract = "The Web, containing a large amount of useful
information and resources, is expanding rapidly.
Collecting domain-specific documents/information from
the Web is one of the most important methods to build
digital libraries for the scientific community. Focused
Crawlers can selectively retrieve Web documents
relevant to a specific domain to build collections for
domain-specific search engines or digital libraries.
Traditional focused crawlers normally adopting the
simple Vector Space Model and local Web search
algorithms typically only find relevant Web pages with
low precision. Recall also often is low, since they
explore a limited sub-graph of the Web that surrounds
the starting URL set, and will ignore relevant pages
outside this sub-graph. In this work, we investigated
how to apply an inductive machine learning algorithm
and a meta-search technique to the traditional focused
crawling process to overcome the above-mentioned
problems and to improve performance. We proposed a
novel hybrid focused crawling framework based on
Genetic Programming (GP) and meta-search. We showed
that our novel hybrid framework can be applied to
traditional focused crawlers to accurately find more
relevant Web documents for the use of digital libraries
and domain-specific search engines. The framework is
validated through experiments performed on test
documents from the Open Directory Project. Our studies
have shown that improvement can be achieved relative to
the traditional focused crawler if genetic programming
and meta-search methods are introduced into the focused
crawling process.",
}
Genetic Programming entries for
Yuxin (Jerry) Chen