A Predictive and Interpretable Model for Toxic Content Classification
Goharian, Nazli NG
In this thesis, we develop methods to make current neural models for online toxicity detection more robust. Specifically, we aim to add predictive power and interpretability to transformer-based models. To improve predictive power, we propose further pre-training the model on a domain-related corpus, i.e., social media text. To add interpretability, we introduce a simple and effective assumption, that a post is at least as toxic as its most toxic span, which enables the model to explain its output at prediction time. We incorporate this assumption into transformer-based models by scoring a post by the maximum toxicity of its spans and by augmenting the training process to identify the correct spans. Our experiments show that further pre-training improves the model's performance on toxicity detection. We also find that the interpretable approach does not reduce the model's predictive power and that, according to a human study, it produces explanations of higher quality than those provided by Logistic Regression analysis (often regarded as a highly interpretable model). Finally, the approach generalizes to different transformer-based models and even to tasks in other domains.
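The scoring assumption can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `score_post` and `explain`, the 0.5 threshold, and the per-span scores are all hypothetical, and each span is taken to be a single token for simplicity. In the thesis, span scores would come from a transformer-based model rather than being written by hand.

```python
from typing import List, Tuple


def score_post(span_scores: List[float]) -> Tuple[float, int]:
    """Score a post as the maximum toxicity over its spans.

    Returns the post-level score and the index of the most toxic span,
    which doubles as the explanation for the prediction.
    """
    if not span_scores:
        raise ValueError("a post must contain at least one span")
    best = max(range(len(span_scores)), key=lambda i: span_scores[i])
    return span_scores[best], best


def explain(tokens: List[str], span_scores: List[float],
            threshold: float = 0.5) -> List[str]:
    """Return the spans whose toxicity meets the (assumed) threshold."""
    return [tok for tok, s in zip(tokens, span_scores) if s >= threshold]


# Made-up scores for illustration only; a real model would produce these.
tokens = ["you", "are", "an", "idiot"]
span_scores = [0.02, 0.01, 0.03, 0.91]

post_score, top_index = score_post(span_scores)
toxic_spans = explain(tokens, span_scores)
print(post_score, tokens[top_index], toxic_spans)
```

Because the post score is a maximum over span scores, it can never fall below the score of any individual span, which is exactly the "at least as toxic as its most toxic span" assumption. The span that attains the maximum is available at prediction time without a separate explanation step.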