Thesis Examination Committee
Prof Bertram SHI, ECE/HKUST (Chairperson)
Prof Pascale FUNG, ECE/HKUST (Thesis Supervisor)
Prof Chiew Lan TAI, CSE/HKUST
For better human-machine communication, machines need to interpret human emotions properly, since emotion is an essential part of human-to-human communication. One aspect of emotion is reflected in the language we use, and representing emotions in text remains a challenge in natural language processing. This thesis focuses on the significance of learning good representations of emotions, especially for solving affect-related text classification tasks such as sentiment/emotion analysis and abusive language detection.
We first propose representations called emotional word vectors (EVEC), which are learned by a convolutional neural network (CNN) model trained on an emotion-labeled corpus constructed from tweet hashtags. The CNN model is trained on a pseudo task of emotion labeling, and its word embedding layers are then used for other tasks.
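As a minimal sketch of how hashtags can act as weak emotion labels, the snippet below assigns a label to a tweet from its hashtags and strips the label hashtag from the text. The hashtag list, the emotion names, and the function `label_tweet` are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import re

# Hypothetical hashtag-to-emotion mapping (an assumption for illustration;
# the hashtags actually used to build the corpus are not stated here).
HASHTAG_TO_EMOTION = {
    "#happy": "joy",
    "#joy": "joy",
    "#sad": "sadness",
    "#angry": "anger",
    "#scared": "fear",
}

HASHTAG_PATTERN = re.compile(r"#\w+")


def label_tweet(tweet):
    """Return (text, emotion) if the tweet carries exactly one emotion, else None."""
    tags = [tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)]
    emotions = {HASHTAG_TO_EMOTION[tag] for tag in tags if tag in HASHTAG_TO_EMOTION}
    if len(emotions) != 1:
        # Skip tweets with no emotion hashtag or with conflicting ones.
        return None
    # Drop the label hashtags so a model cannot read the answer from its input;
    # other hashtags are kept as ordinary tokens.
    text = HASHTAG_PATTERN.sub(
        lambda m: "" if m.group(0).lower() in HASHTAG_TO_EMOTION else m.group(0),
        tweet,
    )
    return " ".join(text.split()), emotions.pop()
```

Pairs produced this way could serve as training data for an emotion classifier whose embedding layer is later reused, under the assumption that a hashtag reflects the emotion of the tweet.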
Second, we extend our research to learning sentence-level emotion representations by training a bidirectional Long Short-Term Memory (LSTM) model on a large corpus of text with the pseudo task of recognizing emojis. We evaluate both representations through qualitative and quantitative analysis and report high-ranked results in the Semantic Evaluation (SemEval-2018) competition. Our results show that representations trained on millions of tweets with weakly supervised labels, such as hashtags and emojis, allow us to solve sentiment/emotion analysis tasks more effectively.
Lastly, as an example of understanding affect in text, we explore the more specific problem of automatically detecting abusive language (also known as hate speech). We propose a two-step classification approach to alleviate the data shortage problem. We also address gender bias in various neural network models by conducting experiments that measure and reduce such bias in the representations, with the aim of building more robust abusive language detection models.
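One common way to quantify gender bias in a classifier is to compare its error rates on sentences that differ only in a gendered term. The sketch below illustrates that idea on non-abusive template sentences. The templates, word lists, function names, and the toy classifier are all hypothetical, and this is not necessarily the measurement used in the thesis.

```python
# Hypothetical gendered word lists and non-abusive templates (assumptions).
MALE_TERMS = ["he", "man", "boy"]
FEMALE_TERMS = ["she", "woman", "girl"]
TEMPLATES = ["{} is a great friend", "that {} works hard"]


def false_positive_rate(classify, sentences):
    """Fraction of non-abusive sentences that the classifier flags as abusive."""
    flagged = sum(1 for sentence in sentences if classify(sentence))
    return flagged / len(sentences)


def gender_fpr_gap(classify):
    """Absolute gap in false positive rate between male and female sentence sets."""
    male = [t.format(w) for t in TEMPLATES for w in MALE_TERMS]
    female = [t.format(w) for t in TEMPLATES for w in FEMALE_TERMS]
    return abs(false_positive_rate(classify, male) - false_positive_rate(classify, female))


def toy_biased_classifier(sentence):
    """Toy stand-in for a trained model: flags any sentence containing 'woman'."""
    return "woman" in sentence.split()


if __name__ == "__main__":
    print(gender_fpr_gap(toy_biased_classifier))
```

A gap of zero means the classifier treats the two sets identically on these templates, while a larger gap suggests the model has tied a gendered term to the abusive label.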