    Modernizing Topic Models: Accounting for Noise, Time, and Domain Knowledge

    View/Open: Churchill_georgetown_0076D_15119.pdf (3.7MB)

    Creator
    Churchill, Robert J
    Advisor
    Singh, Lisa
    ORCID
    0000-0003-4798-1582
    Abstract
    Data has evolved rapidly since the inception of topic models over twenty years ago. The most popular topic models perform poorly on large contemporary data sets that contain short, noisy texts. This dissertation aims to produce a suite of topic models capable of accurately modeling these new types of data. We begin by tracking the evolution of topic models from their inception to the present day. We then propose a flexible preprocessing pipeline that can be adjusted for different levels of noise in the data. The core contribution of this dissertation is the development of a new class of topic models, the topic-noise model. Topic-noise models jointly model topic and noise distributions, greatly increasing the quality of topics derived from social media posts. While static topic models are useful in many settings, they are not well suited for temporal analysis. To identify meaningful topics in this setting, we propose a dynamic topic-noise model that tracks the evolution of topics through time by passing the topic and noise distributions from one time period to the next. Often, researchers have considerable domain knowledge pertaining to the data set being modeled and wish to infuse their expertise into the topics being generated. For this scenario, we propose a semi-supervised topic model that uses seed topics and selective oversampling to produce a topic set reflective of the user's guidance. We conduct an extensive empirical analysis of all of our models, using quantitative and qualitative methods. Together, these contributions further the collective capabilities of topic models, and now topic-noise models, and, through the sharing of the models and code, support future research in the field.
    Description
    Ph.D.
    Permanent Link
    http://hdl.handle.net/10822/1063045
    Date Published
    2021
    Subject
    semi-supervised learning; social media; temporal; topic modeling; topic noise model; unsupervised learning; Computer science; Artificial intelligence
    Type
    thesis
    Embargo Lift Date
    2022-08-02
    Publisher
    Georgetown University
    Extent
    177 leaves
    Collections
    • Graduate Theses and Dissertations - Computer Science

    Georgetown University Seal
    ©2009 - 2023 Georgetown University Library
    37th & O Streets NW
    Washington DC 20057-1174
    202.687.7385
    digitalscholarship@georgetown.edu
    Accessibility
     

     
