Supervisor: Dr. Orland Hoeber
Improving Web Search Personalization Using Luhn-Inspired Vector Re-Weighting
Department of Computer Science
Wednesday, November 23, 2011, 1:00 p.m., Room EN 2022
Web search is an essential tool for people to find information among the vast resources available on the Web. However, for the same query, conventional Web search systems return the same search results to different users, regardless of their different information needs. This limits the effectiveness of Web search, especially in light of the short queries that are common on the Web. To address this problem, personalization has been studied as a way to tailor Web search to individual users based on their interests and preferences. A common approach is to model users’ interests as term frequency (TF) vectors, and to re-rank the search results based on the similarity of documents to these vector-based models. However, a difficulty with this approach is that the high-frequency terms within the vector-based models are usually overweighted, and such common terms can easily diminish the capabilities of personalization because of their potentially ambiguous nature. A classical approach to address this problem is TF*IDF, which scales down the importance of the high-frequency terms using the inverse document frequency (IDF). However, the calculation of IDF is usually difficult and costly within the context of Web search personalization, since it requires knowledge of the distribution of terms across all documents on the Web (or at least in a subset of the Web). Inspired by Luhn’s model of term importance, a novel approach is proposed in this thesis to identify significant terms in vector-based personalization models in order to improve the personalized order of the search results. Unlike TF*IDF, this approach does not require knowledge of the entire collection of documents, but only the information that is already contained in the target model.
Based on the features of the term frequency histogram derived from the target model, this Luhn-inspired vector re-weighting approach is able to automatically re-weight the terms in the target model according to their significance values. Evaluations with a set of ambiguous queries illustrate that this approach is effective for improving the ranked order of the search results over the original ranking and the baseline TF approach. Although its performance is similar to that of TF*IDF, the approach requires access to less information during the construction of the personalization models, and can be applied more broadly when only limited information regarding the collection being searched is available.
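As a rough illustration of the general idea, the following Python sketch re-ranks results by cosine similarity to a TF-based user model, and then repeats the ranking after re-weighting the model using only its own frequency range. The abstract does not give the thesis's significance function, so `luhn_style_reweight` and its triangular weighting (highest in the middle of the model's frequency range, falling to zero at both extremes) are assumptions made purely for this example, as are the function names, the whitespace tokenizer, and the toy data.

```python
import math
from collections import Counter


def tf_vector(text):
    """Build a raw term-frequency vector from whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def luhn_style_reweight(model):
    """Hypothetical significance weights (NOT the thesis's formula).

    Uses only the frequencies already in the model: terms in the middle of
    the frequency range get weight 1.0, and the weight falls linearly to
    0.0 at the lowest and highest frequencies.
    """
    if not model:
        return {}
    lo, hi = min(model.values()), max(model.values())
    if lo == hi:
        return {t: 1.0 for t in model}
    mid = (lo + hi) / 2
    half_range = (hi - lo) / 2
    return {t: 1.0 - abs(f - mid) / half_range for t, f in model.items()}


def rerank(results, model):
    """Order result texts by descending similarity to the user model."""
    return sorted(results, key=lambda r: cosine(tf_vector(r), model), reverse=True)


# Toy user model: "the" dominates the raw frequencies.
model = {"the": 9, "car": 5, "engine": 4, "typo": 1}
results = ["the the the jaguar cat", "jaguar car engine"]

tf_order = rerank(results, model)
luhn_order = rerank(results, luhn_style_reweight(model))
```

With this toy data, the raw TF model ranks the first result highest because it repeats the common term "the", while the re-weighted model ranks the second result highest because it matches the mid-frequency terms "car" and "engine". No document collection statistics are consulted at any point, which is the property that distinguishes this style of re-weighting from TF*IDF.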