Information Retrieval

Susan Dumais, Microsoft Research

     The Web has made literally terabytes of information available at the click of a mouse. The challenge is in finding the right information. Information retrieval is concerned with providing access to textual data for which we have no good formal model, such as a relational model. Statistical approaches have been widely applied to this problem. This tutorial will provide an overview of: a) statistical characteristics of large text collections (e.g., size, sparcity, word distributions), b) important retrieval models (e.g., Boolean, vector space and probabilistic), and c) enhancements which use unsupervised learning to model structure in text collections, or supervised learning to incorporate user feedback. We will conclude with a discussion of open research issues where improved statistical models can improve performance.