|Institution:||Technische Universität Darmstadt|
|Full text PDF:||http://tuprints.ulb.tu-darmstadt.de/5385/|
Modern communication relies on electronic messages organized in the form of discussion threads. Emails, IMs, SMS, website comments, and forums are all composed of threads, which consist of individual user messages connected by metadata and discourse coherence to messages from other users. Threads are used to display user messages effectively in a GUI such as an email client, providing a background context for understanding a single message. Many messages are meaningless without the context provided by their thread. However, a number of factors may result in missing thread structure, ranging from user error (replying to the wrong message), to missing metadata (some email clients do not produce or save headers that fully capture thread structure, and converting archived threads from one repository to another may also lose metadata), to covert use (users may avoid metadata to render discussions difficult for third parties to understand). In the field of security, law enforcement agencies may obtain vast collections of discussion turns that require automatic thread reconstruction to understand. For example, the Enron Email Corpus, obtained by the Federal Energy Regulatory Commission during its investigation of the Enron Corporation, has no inherent thread structure. In this thesis, we will use natural language processing approaches to reconstruct threads from message content. Reconstruction based on message content sidesteps the problem of missing metadata, permitting post hoc reorganization and discussion understanding. We will investigate corpora of email threads and Wikipedia discussions. However, there is a scarcity of annotated corpora for this task. For example, the Enron Email Corpus contains no inherent thread structure. Therefore, we also investigate issues faced when creating crowdsourced datasets and learning statistical models from them.
Several of our findings are applicable to other natural language machine classification tasks, beyond thread reconstruction. We will divide our investigation of discussion thread reconstruction into two parts. First, we explore techniques needed to create a corpus for our thread reconstruction research. Like other NLP pairwise classification tasks such as Wikipedia discussion turn/edit alignment and sentence pair text similarity rating, email thread disentanglement is a heavily class-imbalanced problem. Although the advent of crowdsourcing has reduced annotation costs, the common practice of crowdsourcing redundancy is too expensive for class-imbalanced tasks. As the first contribution of this thesis, we evaluate alternative strategies for reducing crowdsourcing annotation redundancy for class-imbalanced NLP tasks. We also examine techniques to learn the best machine classifier from our crowdsourced labels. In order to reduce noise in training data, most natural language crowdsourcing annotation tasks gather redundant labels and aggregate them into an integrated label, which is provided to the classifier. However, aggregation…
Advisors/Committee Members: Gurevych, Iryna (advisor), Fürnkranz, Johannes (advisor), Daelemans, Walter (advisor).