An evolutionary approach for combining different sources of evidence in search engines

doi:10.1016/j.is.2008.07.003

Information Systems

Volume 34, Issue 2, April 2009, Pages 276-289

https://doi.org/10.1016/j.is.2008.07.003

Abstract

Modern Web search engines use different strategies to improve the overall quality of their document rankings. Usually the strategy adopted involves the combination of multiple sources of relevance into a single ranking. This work proposes the use of evolutionary techniques to derive good evidence combination functions using three different sources of evidence of relevance: the textual content of documents, the reputation of documents extracted from the connectivity information available in the processed collection and the anchor text concatenation. The combination functions discovered by our evolutionary strategies were tested on a collection of over 12 million documents, using 368 queries extracted from the query log of a real nation-wide search engine. The experiments performed indicate that our proposal is an effective and practical alternative for combining sources of evidence into a single ranking. We also show that different types of queries submitted to a search engine can require different combination functions and that our proposal is useful for coping with such differences.

Introduction

Modern search engines usually use several sources of evidence to compute the ranking of documents that satisfy a user query. Examples of such sources of evidence are anchor text concatenation, the URL of each page, page title, the link structure available on the Web, geographic information about the user or about the Web pages, user profiles, etc. The final result produced by each search engine can be seen as a combination of partial results where each source of relevance evidence produces an individual ranking for each query. This requires a combination of all sources of evidence to provide a single final ranking.

Previous work in the literature presents simple generic solutions to this problem, such as computing a linear combination of pieces of evidence [1], or modeling each source of evidence as an independent relevance probability [2], [3]. Such solutions have the disadvantage of missing important detailed information about each source of evidence that might affect the overall system quality. Examples of properties that are difficult to generalize are dependencies among different sources of evidence (e.g., dependencies between textual and link evidence based on query terms), or discrepancies in the values (or in their distribution) provided by each individual source of evidence.
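For reference, the linear-combination baseline mentioned above can be sketched in a few lines of Python. This is an illustration only: the function names, the min-max normalization step and the weights are assumptions made for this example, not procedures or values taken from the cited work. Normalization is included because scores from different sources may lie on very different scales, which is one form of the discrepancy in values discussed above.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]. If all scores are equal, map them to 0.0
    (an assumed convention for this sketch)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def linear_combination(evidence: List[Dict[str, float]],
                       weights: List[float]) -> Dict[str, float]:
    """Combine per-source document scores as a weighted sum.

    evidence[i] maps a document id to the score given by source i.
    A document missing from a source contributes 0.0 for that source.
    """
    if len(evidence) != len(weights):
        raise ValueError("one weight is required per source of evidence")
    normalized = [min_max_normalize(e) for e in evidence]
    docs = set().union(*(e.keys() for e in normalized))
    return {
        doc: sum(w * e.get(doc, 0.0) for w, e in zip(weights, normalized))
        for doc in docs
    }


def rank_documents(scores: Dict[str, float]) -> List[str]:
    """Order documents by decreasing score, breaking ties by document id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


if __name__ == "__main__":
    # Hypothetical scores for three documents from two sources.
    text_scores = {"a": 0.2, "b": 0.8, "c": 0.5}
    link_scores = {"a": 10.0, "b": 0.0, "c": 5.0}
    combined = linear_combination([text_scores, link_scores], [0.7, 0.3])
    print(rank_documents(combined))
```

A fixed weighted sum of this kind uses the same formula for every query, so it cannot express dependencies between sources of evidence; this is the limitation that motivates searching over a richer space of combination functions.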

An alternative approach is to use learning methods for combining sources of relevance evidence [4]. Following this strategy, we propose the use of genetic programming (GP) as a tuning approach to solve the problem of combining different sources of evidence into a single ranking function. GP has proven to be a successful framework for discovering ranking functions in information retrieval systems [5]. By using GP to find evidence combination functions, we aim not only at providing a good solution to combine different sources of evidence, but also at generating valuable information that allows a detailed analysis of the usefulness of each individual piece of evidence for the different types of queries submitted to a search engine.

To evaluate the proposed approach we performed experiments with a Web collection and a real search engine query log, using three different sources of evidence explored in the literature. The first is the result of the well-known vector space model applied to the full text of each document. The second also uses the vector space model, but applies it to the concatenation of all anchor text referring to a document in the collection. The third is the well-known Pagerank [6] of the Web pages, which is computed by analyzing the Web link structure.
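To make the third source of evidence more concrete, the following Python sketch computes a Pagerank-style score by power iteration over a small link graph. It is a minimal illustration under stated assumptions: the damping factor of 0.85, the fixed number of iterations, the uniform redistribution of score from pages without out-links, and the name `pagerank` itself are choices made for this example and are not details reported in this paper.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration sketch of a Pagerank-style score.

    links maps each page to the list of pages it links to.
    Pages that appear only as link targets are included automatically.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Score held by pages without out-links is spread uniformly
        # (assumed convention for dangling pages).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n
            for page in pages
        }
        # Each page passes an equal share of its score to every page it links to.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page graph: "a" is linked from both "c" and "d".
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Unlike the two vector space scores, a score of this kind does not depend on the query, which is one reason the three sources can produce values with very different distributions.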

Our experiments considered two distinct types of queries, as described in [7]: navigational and informational queries. These two query types were further divided into sets of popular and non-popular queries, thus producing four distinct query sets. The experiments confirmed that each of these sets is affected in a different way by the three pieces of evidence studied. Therefore, distinct ranking functions are required to combine the studied pieces of evidence according to each query type. The experiments indicate that our approach achieves a good final ranking based on the combination of pieces of evidence. It is also useful for studying the importance of each piece of evidence to the final ranking.

This paper is organized as follows. In Section 2 we discuss related work. Section 3 presents a brief overview of GP. Section 4 describes how we use GP to generate ranking functions that combine distinct sources of evidence. Section 5 discusses the methodology we use to conduct our experiments. Section 6 presents the results obtained by our evolutionary approach, compared with the best results obtained with each individual source of evidence used in isolation and with other methods. Finally, Section 7 presents final remarks, with directions for future work.

Section snippets

Related work

The combination of different pieces of evidence has been studied in several previous research efforts with the goal of improving the overall search result quality. Salton [8] described an effort in this direction showing a way to improve the search result quality by combining textual evidence with citation information, concluding that documents that have similar citations have similar subjects in most cases.

The combination of citation and content information in Web search systems was also

Genetic programming

In this section we present a brief overview of GP, to provide the necessary background for the description of our proposed method.

Usually, a GP system has as its main goal to evolve a population of individuals, each one representing a single solution to a given problem. As usual in many problems, in our case individuals are arithmetic functions, as illustrated in Fig. 1.
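As a rough illustration of what such an individual might look like in code, the Python sketch below represents an arithmetic function as a nested tuple and evaluates it on the scores of one document. Everything specific here is an assumption made for the example: the terminal names `text`, `anchor` and `pagerank`, the four arithmetic operators, the protected division convention, and the probabilities used to grow random trees. It does not reproduce Fig. 1 or the exact function and terminal sets used in the paper.

```python
import operator
import random
from typing import Dict, Tuple, Union

# A tree is either a terminal (an evidence name or a numeric constant)
# or a tuple (function_name, left_subtree, right_subtree).
Tree = Union[str, float, Tuple]


def protected_div(a: float, b: float) -> float:
    """Division that returns 1.0 for a near-zero denominator
    (an assumed convention to keep every tree evaluable)."""
    return a / b if abs(b) > 1e-9 else 1.0


# Hypothetical function and terminal sets for this sketch.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}
TERMINALS = ["text", "anchor", "pagerank"]


def evaluate(tree: Tree, scores: Dict[str, float]) -> float:
    """Evaluate an individual on the evidence scores of one document."""
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, scores), evaluate(right, scores))
    if isinstance(tree, str):
        return scores[tree]
    return float(tree)


def random_tree(rng: random.Random, max_depth: int) -> Tree:
    """Grow a random individual whose depth does not exceed max_depth."""
    if max_depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.8:
            return rng.choice(TERMINALS)
        return round(rng.uniform(0.0, 1.0), 2)
    name = rng.choice(sorted(FUNCTIONS))
    return (name,
            random_tree(rng, max_depth - 1),
            random_tree(rng, max_depth - 1))


def depth(tree: Tree) -> int:
    """Number of function nodes on the longest path from the root."""
    if isinstance(tree, tuple):
        return 1 + max(depth(tree[1]), depth(tree[2]))
    return 0


if __name__ == "__main__":
    # text * anchor + pagerank / 2, written as a tree.
    individual = ("+", ("*", "text", "anchor"), ("/", "pagerank", 2.0))
    doc_scores = {"text": 0.5, "anchor": 0.4, "pagerank": 0.6}
    print(evaluate(individual, doc_scores))
    print(random_tree(random.Random(0), max_depth=3))
```

An initial population could be built by calling `random_tree` repeatedly; the genetic operators and the fitness evaluation that drive the evolution are not shown in this sketch.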

When using trees in GP, a set of terminals and functions should be defined. Terminals are inputs, constants or zero arguments¹

Evidence combination with GP

In this section we show how GP can be used to generate ranking functions that combine distinct sources of evidence. More specifically, we address the problem stated below.

Let $E = \{e_{1}, \dots, e_{k}\}$ be a set of sources of evidence that can be used to assess the relevance of a Web page $d$ to a given query $q$ in the database of a search engine. Typically, each source of evidence $e_{i}$ produces a score $s_{i}(q, d)$, so that it is possible to build a ranking of the documents according to each individual source of

Experimental setup

In this section we present the details of the experiments performed, describing the sources of evidence combined, the query types considered and the evaluation process for the fitness functions.

Experimental results

Before presenting and discussing the results, we show in Table 3 statistics on the number of distinct words present in the text and anchor text of the relevant documents, for each query type studied. These statistics highlight some particular characteristics of each query type. For instance, the text of relevant documents contains many more distinct words for informational queries than for navigational queries. We can also see that the anchor text information increases with

Conclusions and future research direction

This paper presented a GP-based approach for the combination of different sources of evidence for ranking documents in Web search engines. The quality of the final ranking is crucial to achieve better results and thus user satisfaction. The sources of evidence considered in this work are the document textual content, anchor text concatenation and cross-reference information (Pagerank).

In order to evaluate our approach we performed experiments using a real search engine log from which we

Acknowledgments

This work was supported by CNPq research grants 302209/2007-7 (Edleno S. de Moura), 308528/2007-7 (Altigran S. da Silva), 303738/2006-5 (João M. Cavalcanti) and 301043/2006-0 (Marcos Gonçalves), by a FAPEAM scholarship (Thomaz P. C. Silva) and by the SIRIAA project, CNPq 553126/2005-9.

References (41)

I.-H. Kang et al., Integration of multiple evidences based on a query type for web search, Inf. Process. Manage. (2004)
D. Hawking et al., Results and challenges in web search evaluation, Comput. Networks (1999)
T. Westerveld et al., Retrieving web pages using content, links, urls and anchors
I. Silva et al., Link-based and content-based evidential information in a belief network model
P. Calado et al., Combining link-based and content-based methods for web document classification
T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the ACM KDD Conference, ACM, New...
W. Fan, M. Gordon, P. Pathak, W. Xi, E. Fox, Ranking function optimization for effective web search by genetic...
L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: bringing order to the web, Technical Report,...
A. Broder, A taxonomy of web search, SIGIR Forum (2002)
G. Salton, Associative document retrieval techniques using bibliographic information, J. ACM (JACM) (1963)
J.M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM (JACM) (1999)
K. Bharat, M.R. Henzinger, Improved algorithms for topic distillation in a hyperlinked environment, in: Proceedings of...
R. Kumar et al., Core algorithms in the clever system, ACM Trans. Internet Technol. (2006)
J.A. Aslam et al., Models for metasearch
C. Dwork et al., Rank aggregation methods for the web
N. Craswell et al., Effective site finding using link anchor information
A. Chowdhury et al., Analyses of multiple-evidence combinations for retrieval strategies
B. Ribeiro-Neto et al., A belief network model for ir
T. Upstill et al., Query-independent evidence in home page finding, ACM Trans. Inf. Syst. (2003)
N. Craswell et al., Relevance weighting for query independent evidence
