An evolutionary approach for combining different sources of evidence in search engines
Introduction
Modern search engines usually use several sources of evidence to compute the ranking of documents that satisfy a user query. Examples of such sources of evidence are anchor text concatenation, the URL of each page, page title, the link structure available on the Web, geographic information about the user or about the Web pages, user profiles, etc. The final result produced by each search engine can be seen as a combination of partial results where each source of relevance evidence produces an individual ranking for each query. This requires a combination of all sources of evidence to provide a single final ranking.
Previous work in the literature presents simple generic solutions to this problem, such as computing a linear combination of pieces of evidence [1], or modeling each source of evidence as an independent relevance probability [2], [3]. Such solutions have the disadvantage of missing important detailed information about each source of evidence that might affect the overall system quality. Examples of properties that are difficult to generalize are dependencies among different sources of evidence (e.g., dependencies among textual and link evidence based on query terms), or discrepancies in the values (or in their distribution) provided by each individual source of evidence.
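To make the two generic baselines concrete, the following minimal sketch combines three per-document scores first as a weighted sum and then under an independence assumption. The scores, the weights, and the noisy-OR form of the independent-probability combination are illustrative assumptions, not the exact formulations used in [1], [2] or [3].

```python
from math import prod

def linear_combination(scores, weights):
    """Weighted sum of the per-source scores of one document."""
    return sum(w * s for w, s in zip(weights, scores))

def independent_probability(probs):
    """Noisy-OR combination: probability that at least one source signals
    relevance, assuming the sources are independent (an assumption here)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical scores (text, anchor text, link), already normalized to [0, 1].
docs = {
    "d1": (0.8, 0.1, 0.2),
    "d2": (0.4, 0.9, 0.5),
    "d3": (0.3, 0.2, 0.9),
}
weights = (0.5, 0.3, 0.2)  # illustrative weights, not tuned values

linear_ranking = sorted(
    docs, key=lambda d: linear_combination(docs[d], weights), reverse=True
)
noisy_or_ranking = sorted(
    docs, key=lambda d: independent_probability(docs[d]), reverse=True
)
```

On this toy data the two schemes already disagree on the order of d1 and d3, which hints at how sensitive the final ranking is to the chosen combination function.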
An alternative approach is to use learning methods for combining sources of relevance evidence [4]. Following this strategy, we propose the use of genetic programming (GP) as a tuning approach to solve the problem of combining different sources of evidence into a single ranking function. GP has proven to be a successful framework for discovering ranking functions in information retrieval systems [5]. By using GP to find evidence combination functions, we aim not only at providing a good solution to combine different sources of evidence, but also at generating valuable information that allows a detailed analysis of the usefulness of each individual piece of evidence to the different types of queries submitted to a search engine.
To evaluate the proposed approach, we performed experiments with a Web collection and a real search engine query log, using three different sources of evidence explored in the literature. The first is the result of the well-known vector space model applied to the full text of each document. The second also uses the vector space model, but applies it to the concatenation of all anchor text referring to a document in the collection. The third is the well-known PageRank [6] of the Web pages, which is computed by analyzing the Web link structure.
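As an illustration of how a link-based score can be derived from the link structure alone, here is a minimal power-iteration sketch of PageRank. The toy graph, the damping factor of 0.85, the iteration count, and the uniform handling of pages without outlinks are assumptions chosen for the example, not details taken from the experiments reported here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `links` maps every page to the list of pages it links to; every link
    target must itself appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Page without outlinks: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page graph: "c" is linked from a, b and d; "d" has no inlinks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

The resulting score is query-independent, which is one reason it has to be combined with the two text-based scores rather than used on its own.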
Our experiments were conducted taking into consideration two distinct types of queries as described in [7]: navigational and informational queries. These two query types were further divided into sets of popular and non-popular queries, thus producing four distinct query sets. The experiments confirmed that each of these sets is affected in a different way by the three pieces of evidence studied. Therefore, distinct ranking functions are required to combine the studied pieces of evidence according to each query type. The experiments indicate that our approach achieves a good final ranking based on the combination of pieces of evidence. In addition, it is useful for studying the importance of each piece of evidence to the final ranking.
This paper is organized as follows. In Section 2 we discuss related work. Section 3 presents a brief overview of GP. Section 4 describes how we use GP to generate ranking functions that combine distinct sources of evidence. Section 5 discusses the methodology we use to conduct our experiments. Section 6 presents the results obtained by our evolutionary approach, compared with the best results obtained with each individual source of evidence used in isolation and with other methods. Finally, Section 7 presents final remarks, with directions for future work.
Related work
The combination of different pieces of evidence has been studied in several previous research efforts with the goal of improving the overall search result quality. Salton [8] described an effort in this direction showing a way to improve the search result quality by combining textual evidence with citation information, concluding that documents that have similar citations have similar subjects in most cases.
The combination of citation and content information in Web search systems was also
Genetic programming
In this section we present a brief overview of GP, to provide the necessary background for the description of our proposed method.
Usually, a GP system has as its main goal to evolve a population of individuals, each one representing a single solution to a given problem. As usual in many problems, in our case individuals are arithmetic functions, as illustrated in Fig. 1.
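A minimal sketch of how such an individual could be represented and evaluated is given below. The nested-tuple encoding, the terminal names `text`, `anchor` and `pagerank`, the example tree, and the protected division (a common GP convention) are all assumptions made for illustration; they are not the representation described in this paper.

```python
import operator

def protected_div(a, b):
    """Division that returns 1.0 instead of failing when the divisor is zero."""
    return a / b if b != 0 else 1.0

# Function set: binary arithmetic operators (internal nodes of the tree).
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}

def evaluate(node, terminals):
    """Recursively evaluate an individual for one (query, document) pair.

    A node is a tuple (operator, left, right), a terminal name (str),
    or a numeric constant.
    """
    if isinstance(node, tuple):
        op, left, right = node
        return FUNCTIONS[op](evaluate(left, terminals), evaluate(right, terminals))
    if isinstance(node, str):
        return terminals[node]
    return node

# Hypothetical individual encoding (text * 2.0) + (anchor / pagerank).
individual = ("+", ("*", "text", 2.0), ("/", "anchor", "pagerank"))
score = evaluate(individual, {"text": 0.5, "anchor": 0.3, "pagerank": 0.1})
```

Encoding individuals as trees keeps crossover and mutation simple, since both can be expressed as swapping or replacing subtrees.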
When using trees in GP, a set of terminals and functions should be defined. Terminals are inputs, constants or zero arguments
Evidence combination with GP
In this section we show how GP can be used to generate ranking functions that combine distinct sources of evidence. More specifically, we address the problem stated below.
Consider a set of sources of evidence that can be used to assess the relevance of a Web page d to a given query q in the database of a search engine. Typically, each source of evidence produces a score, so that it is possible to build a ranking of the documents according to each individual source of
Experimental setup
In this section we present the details of the experiments performed, describing the sources of evidence combined, the query types considered and the fitness function evaluation process.
Experimental results
Before presenting and discussing the results, we show in Table 3 statistics on the number of distinct words present in the text and anchor text of the relevant documents, for each query type studied. These statistics are useful to show some particular characteristics of each query type. For instance, the text of documents relevant to informational queries contains many more distinct words than the text of documents relevant to navigational queries. We can also see that the anchor text information increases with
Conclusions and future research direction
This paper presented a GP-based approach for the combination of different sources of evidence for ranking documents in Web search engines. The quality of the final ranking is crucial to achieve better results and thus user satisfaction. The sources of evidence considered in this work are the document textual content, anchor text concatenation and cross-reference information (PageRank).
In order to evaluate our approach we performed experiments using a real search engine log from which we
Acknowledgments
This work was supported by CNPq research grants 302209/2007-7 (Edleno S. de Moura), 308528/2007-7 (Altigran S. da Silva), 303738/2006-5 (João M. Cavalcanti) and 301043/2006-0 (Marcos Gonçalves), by a FAPEAM scholarship (Thomaz P. C. Silva) and by the SIRIAA project CNPq 553126/2005-9.
References (41)
- et al., Integration of multiple evidences based on a query type for web search, Inf. Process. Manage. (2004)
- et al., Results and challenges in web search evaluation, Comput. Networks (1999)
- et al., Retrieving web pages using content, links, urls and anchors
- et al., Link-based and content-based evidential information in a belief network model
- et al., Combining link-based and content-based methods for web document classification
- T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the ACM KDD Conference, ACM, New...
- W. Fan, M. Gordon, P. Pathak, W. Xi, E. Fox, Ranking function optimization for effective web search by genetic...
- L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: bringing order to the web, Technical Report,...
- A taxonomy of web search, SIGIR Forum (2002)
- Associative document retrieval techniques using bibliographic information, J. ACM (JACM) (1963)
- Authoritative sources in a hyperlinked environment, J. ACM (JACM)
- Core algorithms in the clever system, ACM Trans. Internet Technol.
- Models for metasearch
- Rank aggregation methods for the web
- Effective site finding using link anchor information
- Analyses of multiple-evidence combinations for retrieval strategies
- A belief network model for ir
- Query-independent evidence in home page finding, ACM Trans. Inf. Syst.
- Relevance weighting for query independent evidence