Towards Story Understanding and Search - Web Mining Methods and Tools for Exploration, Search and Discovery (Web mining methoden en applicaties voor exploratie, zoeken en ontdekking voor het verstaan van en zoeken in verhaallijnen)

Publication date: 2011-12-15

Author:

Subasic, Ilija

Keywords:

text mining, news processing, visual web search, query log analysis

Abstract:

Over the past decade, the Internet has become one of the leading sources of news content, and for many people the news services available online have become the main medium for staying informed about the world. Such services support Internet users in their interaction with stories. In this thesis, we regard a story as a set of time-stamped documents describing correlated subjects, such as persons, event descriptions, and topics. Our particular interest is the time dimension of stories, and especially story tracking: following a story over time. The goal of the research areas concerned with story tracking is to identify and highlight developments, that is, novel and relevant information in a story. In this work we restrict ourselves to news collections and investigate the effectiveness and usability of temporal text mining (TTM) story tracking methods.

Across the thesis we investigate four areas related to stories: (a) stories and search engines, (b) story tracking methods and tools, (c) story tracking evaluation frameworks, and (d) stories and sources. We formalize these four thematic areas into more concrete research questions: (Q1) How are search engines affected by story developments? (Q2) Does the semi-automatic story tracking approach we developed enable user comprehension and navigation of stories? (Q3) Can the graph-based patterns extracted by our algorithm be used for story tracking? (Q4) How can different bursty text patterns be used for discovering the origins of changes in document sets? (Q5) How do users interact with interfaces for story tracking? (Q6) How can differences in a story across different sources be measured?

We start by exploring how search engine users change their behaviour when new developments emerge in a story. For this we investigate a one-year query log from a leading commercial search engine and describe the changes in user behaviour correlated with the emergence of new developments. We then explore story tracking methods and tools as a means of accommodating these changes in user behaviour. We propose a new graph-based story tracking method and build a tool to support it. Additionally, we investigate the effectiveness of story tracking methods and define a new framework for automatic and user-oriented evaluation. Although many TTM methods have been developed, there is no common evaluation procedure. We propose an evaluation framework for measuring how different TTM methods discover novel developments. Apart from the automatic evaluation, we are interested in how users interact with patterns and learn about the developments of the story they track. For this we propose a set of metrics and procedures for evaluating user interfaces in the context of story tracking. To test our tool, we conducted a user study of four interfaces in the context of story tracking. Finally, we look at the source dimension of stories and explore the possible differences in news reporting across different families of news sources, and how to measure them.

The results of our analysis show that our method is comparable in performance to other TTM methods and that it meets the requirements for story tracking. We also show that, by leveraging pattern structure and sentence retrieval, TTM methods can help discover developments in the news domain. The user study results show that users prefer our tool to the other tools used in the study. The results also indicate that the tool we built meets a number of the requirements discovered in the query log analysis.