A Multifactorial, Multitask Approach to Automated Speaker Profiling
Abstract
Automated Speaker Profiling (ASP) refers broadly to the computational prediction of speaker traits based on cues mined from the speech signal. Accurate prediction of such traits has a wide variety of applications, such as automating the collection of customer metadata, improving smart-speaker and voice-assistant interactions, and narrowing down suspect pools in forensic situations.
Approaches to ASP to date have primarily focused on single-task computational models, i.e. models which each predict one speaker trait in isolation. Recent work, however, has suggested that a multi-task learning framework can increase classification accuracy along all trait axes considered. In such a framework, a system learns to predict multiple related traits simultaneously, and each trait-prediction task has access to the training signals of all other trait-prediction tasks.
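As a rough illustration of the multi-task idea described above, the following sketch computes a joint loss for one example under hard parameter sharing: a single shared hidden layer feeds one softmax head per trait, and the per-trait cross-entropy losses are summed. The architecture, layer sizes, unweighted sum, and all function names are assumptions made for illustration; the abstract does not specify how the dissertation's models are built.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Affine layer: one row of weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def multitask_loss(x, labels, shared, heads):
    """Sum of per-task cross-entropy losses over one shared representation."""
    # Shared hidden layer with ReLU: every task's loss depends on it,
    # so every task's training signal would shape these parameters.
    hidden = [max(0.0, h) for h in linear(shared[0], shared[1], x)]
    total = 0.0
    for task, (w, b) in heads.items():
        probs = softmax(linear(w, b, hidden))
        total += -math.log(probs[labels[task]])
    return total


rng = random.Random(0)
n_features, n_hidden = 4, 3
# Hypothetical class counts for two of the five traits.
task_classes = {"sex": 2, "age": 3}
shared = init_layer(n_hidden, n_features, rng)
heads = {t: init_layer(k, n_hidden, rng) for t, k in task_classes.items()}

loss = multitask_loss([0.5, -1.0, 0.2, 0.0], {"sex": 1, "age": 2},
                      shared, heads)
```

Because the shared layer appears in every task's loss term, a gradient step on the summed loss would update it using information from all traits at once, which is the mechanism the paragraph above refers to.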
Likewise, most work on ASP to date has focused primarily on acoustic cues as predictive features for speaker profiling. However, there is a wide range of evidence from the sociolinguistic literature that lexical and phonological cues may also be of use in predicting social characteristics of a given speaker. Recent work in the field of author profiling has also demonstrated the utility of lexical features in predicting social information about authors of textual data, though few studies have investigated whether this carries over to spoken data.
In this dissertation I focus on prediction of five different social traits: sex, ethnicity, age, region, and education. Linguistic features from the acoustic, phonetic, and lexical realms are extracted from 60-second chunks of speech taken from the 2008 NIST SRE corpus and used to train several types of predictive models. Naive (majority class prediction) and informed (single-task neural network) models are trained to provide baseline predictions against which multi-task neural network models are evaluated. Feature importance experiments are performed in order to investigate which features and feature types are most useful for predicting which social traits.
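The naive baseline mentioned above can be sketched in a few lines: it always predicts the most frequent class seen in training, so its accuracy equals that class's share of the test data. The labels below are invented placeholders, not data from the NIST SRE corpus, and the function names are hypothetical.

```python
from collections import Counter


def fit_majority(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy labels for a single trait (placeholders only).
train = ["A", "A", "B", "A", "C"]
test = ["A", "B", "A", "A"]

majority = fit_majority(train)
preds = [majority] * len(test)
baseline_acc = accuracy(preds, test)
```

A trained model is only informative for a given trait if it beats this figure, which is why such a baseline is useful when class distributions are skewed.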
Results presented in chapters 5-7 of this dissertation demonstrate that multitask models consistently outperform single-task models, that models are most accurate when provided information from all three linguistic levels considered, and that lexical features as a group contribute substantially more predictive power than either phonetic or acoustic features.
Description
Ph.D.
Permanent Link
http://hdl.handle.net/10822/1057312
Date Published
2019
Publisher
Georgetown University
Extent
300 leaves