Modernizing Topic Models: Accounting for Noise, Time, and Domain Knowledge
Churchill, Robert J
Data has evolved rapidly since the inception of topic models over twenty years ago. The most popular topic models perform poorly on large contemporary data sets that contain short, noisy texts. This dissertation aims to produce a suite of topic models capable of accurately modeling these new types of data. We begin by tracking the evolution of topic models from their inception to the present day. We then propose a flexible preprocessing pipeline that can be adjusted for different levels of noise in the data. The core contribution of this dissertation is the development of a new class of topic models, the topic-noise model. Topic-noise models jointly model topic and noise distributions, greatly increasing the quality of topics derived from social media posts. While static topic models are useful for many settings, they are not well suited for temporal analysis. To identify meaningful topics in this setting, we propose a dynamic topic-noise model that tracks the evolution of topics through time by passing the topic and noise distributions from one time period to the next. Often, researchers have considerable domain knowledge pertaining to the data set being modeled and wish to infuse their expertise into the topics being generated. For this scenario, we propose a semi-supervised topic model that uses seed topics and selective oversampling to produce a topic set reflective of the user's guidance. We conduct an extensive empirical analysis of all of our models, using quantitative and qualitative methods. Together, these contributions advance the collective capabilities of topic models, and now topic-noise models; by sharing the models and code, we aim to enhance future research in the field.
Showing items related by title, author, creator and subject.
Churchill, Robert (Georgetown University, 2017) In the modern era, data is being created faster than ever. Social media, in par-