Elsevier

Information Systems

Volume 34, Issue 2, April 2009, Pages 276-289

An evolutionary approach for combining different sources of evidence in search engines

https://doi.org/10.1016/j.is.2008.07.003

Abstract

Modern Web search engines use different strategies to improve the overall quality of their document rankings. Usually the strategy adopted involves the combination of multiple sources of relevance evidence into a single ranking. This work proposes the use of evolutionary techniques to derive good evidence combination functions using three different sources of evidence of relevance: the textual content of documents, the reputation of documents extracted from the connectivity information available in the processed collection, and the anchor text concatenation. The combination functions discovered by our evolutionary strategies were tested on a collection with over 12 million documents, using 368 queries extracted from the query log of a real nation-wide search engine. The experiments performed indicate that our proposal is an effective and practical alternative for combining sources of evidence into a single ranking. We also show that different types of queries submitted to a search engine can require different combination functions and that our proposal is useful for coping with such differences.

Introduction

Modern search engines usually use several sources of evidence to compute the ranking of documents that satisfy a user query. Examples of such sources of evidence are anchor text concatenation, the URL of each page, page title, the link structure available on the Web, geographic information about the user or about the Web pages, user profiles, etc. The final result produced by each search engine can be seen as a combination of partial results where each source of relevance evidence produces an individual ranking for each query. This requires a combination of all sources of evidence to provide a single final ranking.

Previous work in the literature presents simple generic solutions to this problem, such as computing a linear combination of pieces of evidence [1], or modeling each source of evidence as an independent relevance probability [2], [3]. Such solutions have the disadvantage of missing important detailed information about each source of evidence that might affect the overall system quality. Examples of properties that are difficult to generalize are dependencies among different sources of evidence (e.g., dependencies among textual and link evidence based on query terms), or discrepancies in the values (or in their distribution) provided by each individual source of evidence.
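To make the linear-combination baseline mentioned above concrete, the following is a minimal sketch. The source names (`text`, `anchor`, `link`), the weights, and the scores are hypothetical values chosen for illustration; they are not taken from the paper or from [1].

```python
from typing import Dict, List


def linear_combination(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-source scores of one document."""
    return sum(weights[name] * scores[name] for name in weights)


def rank(documents: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Order document ids by decreasing combined score."""
    return sorted(
        documents,
        key=lambda doc_id: linear_combination(documents[doc_id], weights),
        reverse=True,
    )


# Hypothetical per-source scores for three documents and one query.
docs = {
    "doc_a": {"text": 0.9, "anchor": 0.1, "link": 0.2},
    "doc_b": {"text": 0.4, "anchor": 0.8, "link": 0.7},
    "doc_c": {"text": 0.2, "anchor": 0.3, "link": 0.1},
}
# Hypothetical fixed weights; the same weights apply to every query.
weights = {"text": 0.5, "anchor": 0.3, "link": 0.2}

ranking = rank(docs, weights)
print(ranking)
```

Because the weights are fixed and each source contributes independently, a function of this form cannot express interactions between sources, which is the limitation the paragraph above points out.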

An alternative approach is to use learning methods for combining sources of relevance evidence [4]. Following this strategy, we propose the use of genetic programming (GP) as a tuning approach to solve the problem of combining different sources of evidence into a single ranking function. GP has proven to be a successful framework for discovering ranking functions in information retrieval systems [5]. By using GP to find evidence combination functions, we aim not only at providing a good solution to combine different sources of evidence, but also at generating valuable information that allows a detailed analysis of the usefulness of each individual piece of evidence for the different types of queries submitted to a search engine.

To evaluate the proposed approach, we performed experiments with a Web collection and a real search engine query log using three different sources of evidence explored in the literature. The first is the result of the well-known vector space model applied to the full text of each document. The second also uses the vector space model, but applies it to the concatenation of all anchor text referring to a document in the collection. The third is the well-known Pagerank [6] of the Web pages, which is computed by analyzing the Web link structure.
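As an illustration of how a link-based score of this kind can be computed, here is a simplified power-iteration sketch in the spirit of Pagerank [6]. The damping factor of 0.85, the iteration count, the uniform handling of pages without outlinks, and the toy graph are all assumptions for this example; the paper's own configuration is not given in this excerpt.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,   # assumed value, not taken from the paper
    iterations: int = 50,    # assumed value, not taken from the paper
) -> Dict[str, float]:
    """Simplified power iteration; every link target must also be a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Page without outlinks: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The resulting score depends only on the link structure, not on the query, which is what distinguishes it from the two text-based sources of evidence.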

Our experiments were conducted taking into consideration two distinct types of queries as described in [7]: navigational and informational queries. These two query types were further divided into sets of popular and non-popular queries, thus producing four distinct query sets. The experiments confirmed that each of these sets is affected in a different way by the three pieces of evidence studied. Therefore, distinct ranking functions are required to combine the studied pieces of evidence according to each query type. The experiments indicate that our approach achieves a good final ranking based on the combination of pieces of evidence. In addition, it is useful for studying the importance of each piece of evidence to the final ranking.

This paper is organized as follows. In Section 2 we discuss related work. Section 3 presents a brief overview of GP. Section 4 describes how we use GP to generate ranking functions that combine distinct sources of evidence. Section 5 discusses the methodology we use to conduct our experiments. Section 6 presents the results obtained by our evolutionary approach, compared with the best results obtained with each individual source of evidence used in isolation and with other methods. Finally, Section 7 presents final remarks, with directions for future work.

Related work

The combination of different pieces of evidence has been studied in several previous research efforts with the goal of improving the overall search result quality. Salton [8] described an effort in this direction showing a way to improve the search result quality by combining textual evidence with citation information, concluding that documents that have similar citations have similar subjects in most cases.

The combination of citation and content information in Web search systems was also

Genetic programming

In this section we present a brief overview of GP, to provide the necessary background for the description of our proposed method.

Usually, a GP system has as its main goal to evolve a population of individuals, each one representing a single solution to a given problem. As usual in many problems, in our case individuals are arithmetic functions, as illustrated in Fig. 1.
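As a rough illustration of an individual represented as an arithmetic function, the sketch below encodes an expression tree with nested tuples and evaluates it. Leaves hold input names or constants and internal nodes hold binary operators. The operator set, the input names, and the example tree are hypothetical; they do not reproduce Fig. 1 or the paper's actual function set.

```python
import operator
from typing import Callable, Dict, Union

# Hypothetical set of binary operators available to internal nodes.
FUNCTIONS: Dict[str, Callable[[float, float], float]] = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}

# A tree is an input name (str), a constant (float),
# or a tuple of the form (operator, left_subtree, right_subtree).
Tree = Union[str, float, tuple]


def evaluate(tree: Tree, inputs: Dict[str, float]) -> float:
    """Recursively evaluate an expression tree for the given input values."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, inputs), evaluate(right, inputs))
    if isinstance(tree, str):
        return inputs[tree]   # leaf holding an input
    return float(tree)        # leaf holding a constant


# Hypothetical individual: (text * 2.0) + (anchor * link)
individual = ("+", ("*", "text", 2.0), ("*", "anchor", "link"))
score = evaluate(individual, {"text": 0.5, "anchor": 0.4, "link": 0.25})
print(score)
```

A tree encoding like this is convenient for evolutionary search because subtrees can be swapped or replaced while the result remains a valid arithmetic function.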

When using trees in GP, a set of terminals and functions should be defined. Terminals are inputs, constants or zero arguments1

Evidence combination with GP

In this section we show how GP can be used to generate ranking functions that combine distinct sources of evidence. More specifically, we address the problem stated below.

Let $E = \{e_1, \ldots, e_k\}$ be a set of sources of evidence that can be used to assess the relevance of a Web page $d$ to a given query $q$ in the database of a search engine. Typically, each source of evidence $e_i$ produces a score $s_i(q, d)$, so that it is possible to build a ranking of the documents according to each individual source of
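The setting described above, where each source of evidence assigns its own score to a query-document pair and therefore induces its own ranking, can be sketched as follows. The score tables, the source names, and the helper `from_table` are hypothetical and serve only to show that individual sources may disagree on the order of the documents.

```python
from typing import Callable, Dict, List

# A score function maps (query, document id) to a relevance score.
ScoreFn = Callable[[str, str], float]


def from_table(table: Dict[str, float]) -> ScoreFn:
    """Hypothetical helper: a toy score function backed by a lookup table."""
    return lambda query, doc_id: table[doc_id]


def rank_by_source(score: ScoreFn, query: str, documents: List[str]) -> List[str]:
    """Order documents by decreasing score under a single source of evidence."""
    return sorted(documents, key=lambda doc_id: score(query, doc_id), reverse=True)


# Hypothetical scores for one query and three documents.
evidence: Dict[str, ScoreFn] = {
    "text": from_table({"d1": 0.8, "d2": 0.3, "d3": 0.5}),
    "anchor": from_table({"d1": 0.1, "d2": 0.9, "d3": 0.4}),
    "link": from_table({"d1": 0.2, "d2": 0.3, "d3": 0.7}),
}

documents = ["d1", "d2", "d3"]
rankings = {
    name: rank_by_source(score, "example query", documents)
    for name, score in evidence.items()
}
print(rankings)
```

In this toy data each source places a different document first, which is the situation in which a combination function is needed to produce one final ranking.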

Experimental setup

In this section we present the details about the experiments performed, describing the sources of evidence combined, the query types considered and the fitness functions evaluation process.

Experimental results

Before presenting and discussing the results, we show in Table 3 statistics on the number of distinct words present in the text and anchor text of the relevant documents, for each query type in our experiments. These statistics are useful to show some particular characteristics of each query type. For instance, the text of relevant documents contains many more distinct words for informational queries than for navigational queries. We can also see that the anchor text information increases with

Conclusions and future research direction

This paper presented a GP-based approach for the combination of different sources of evidence for ranking documents in Web search engines. The quality of the final ranking is crucial to achieve better results and thus user satisfaction. The sources of evidence considered in this work are the document textual content, anchor text concatenation and cross-reference information (Pagerank).

In order to evaluate our approach we performed experiments using a real search engine log from which we

Acknowledgments

This work was supported by CNPq research grants 302209/2007-7 (Edleno S. de Moura), 308528/2007-7 (Altigran S. da Silva), 303738/2006-5 (João M. Cavalcanti) and 301043/2006-0 (Marcos Gonçalves), by a FAPEAM scholarship (Thomaz P. C. Silva) and by SIRIAA project CNPq 553126/2005-9.

References (41)

  • I.-H. Kang et al., Integration of multiple evidences based on a query type for web search, Inf. Process. Manage. (2004)
  • D. Hawking et al., Results and challenges in web search evaluation, Comput. Networks (1999)
  • T. Westerveld et al., Retrieving web pages using content, links, urls and anchors
  • I. Silva et al., Link-based and content-based evidential information in a belief network model
  • P. Calado et al., Combining link-based and content-based methods for web document classification
  • T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the ACM KDD Conference, ACM, New...
  • W. Fan, M. Gordon, P. Pathak, W. Xi, E. Fox, Ranking function optimization for effective web search by genetic...
  • L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: bringing order to the web, Technical Report,...
  • A. Broder, A taxonomy of web search, SIGIR Forum (2002)
  • G. Salton, Associative document retrieval techniques using bibliographic information, J. ACM (JACM) (1963)
  • J.M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM (JACM) (1999)
  • K. Bharat, M.R. Henzinger, Improved algorithms for topic distillation in a hyperlinked environment, in: Proceedings of...
  • R. Kumar et al., Core algorithms in the clever system, ACM Trans. Internet Technol. (2006)
  • J.A. Aslam et al., Models for metasearch
  • C. Dwork et al., Rank aggregation methods for the web
  • N. Craswell et al., Effective site finding using link anchor information
  • A. Chowdhury et al., Analyses of multiple-evidence combinations for retrieval strategies
  • B. Ribeiro-Neto et al., A belief network model for ir
  • T. Upstill et al., Query-independent evidence in home page finding, ACM Trans. Inf. Syst. (2003)
  • N. Craswell et al., Relevance weighting for query independent evidence