- ACII 2022. Audio and ASR-based Filled Pause Detection. Aggelina Chatziagapi, Dimitris Sgouropoulos, Constantinos Karouzos, Thomas Melistas, Theodoros Giannakopoulos, Athanasios Katsamanis, and Shrikanth Narayanan. Oct 2022.
Filled pauses (or fillers) are the most common form of speech disfluency. They can be recognized as hesitation markers ("um", "uh", and "er") produced by speakers, usually to gain extra time while thinking about their next words, and they are very frequent in spontaneous speech. Their detection is therefore important for two basic reasons: (a) their presence affects the performance of individual components of human-machine interaction systems, such as Automatic Speech Recognition (ASR), and (b) their frequency can characterize the overall speech quality of a particular speaker, as it can be strongly associated with the speaker's confidence. Despite this, only limited work has been published on detecting filled pauses in speech, especially from audio. In this work, we propose a framework for filled pause detection using both audio and textual information. For the audio modality, we transfer knowledge from a range of supervised tasks, such as emotion recognition or speaking-rate estimation, using Convolutional Neural Networks (CNNs). For the text modality, we develop a temporal Recurrent Neural Network (RNN) method that takes into account textual information derived from an ASR system. In addition, the proposed transfer learning approach for the audio classifier leads to better results when benchmarked on our internal dataset, for which the text is not transcribed but estimated by an ASR system. In this case, a simple late fusion approach boosts the performance even further. This demonstrates that the audio-based approach is suitable for real-world applications where transcribed text is unavailable and the system must leverage imperfect ASR results, or where textual information is omitted entirely (to reduce computational cost).
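The late fusion step mentioned above can be sketched as a weighted combination of the two classifiers' posterior probabilities. This is a minimal illustration, not the paper's implementation: the probability values and the mixing weight `alpha` are hypothetical.

```python
import numpy as np

def late_fusion(audio_probs, text_probs, alpha=0.5):
    """Fuse two classifiers' filled-pause posteriors by weighted averaging.

    audio_probs, text_probs: per-segment probabilities from the audio CNN
    and the text RNN (illustrative values; not from the paper).
    alpha: weight given to the audio modality.
    """
    audio_probs = np.asarray(audio_probs, dtype=float)
    text_probs = np.asarray(text_probs, dtype=float)
    return alpha * audio_probs + (1.0 - alpha) * text_probs

# Hypothetical per-segment posteriors from each modality:
fused = late_fusion([0.9, 0.2, 0.6], [0.7, 0.4, 0.8], alpha=0.5)
print(fused)  # element-wise average of the two probability streams
```

With `alpha=1.0` this reduces to the audio-only classifier, which matches the setting where textual information is dropped to save computation.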
- NAACL 2021. UDALM: Unsupervised Domain Adaptation through Language Modeling. Jun 2021.
In this work we explore Unsupervised Domain Adaptation (UDA) of pretrained language models for downstream tasks. We introduce UDALM, a fine-tuning procedure that uses a mixed classification and Masked Language Model (MLM) loss to adapt to the target domain distribution in a robust and sample-efficient manner. Our experiments show that the performance of models trained with the mixed loss scales with the amount of available target data, and that the mixed loss can be effectively used as a stopping criterion during UDA training. Furthermore, we discuss the relationship between A-distance and the target error and explore some limitations of the Domain Adversarial Training approach. Our method is evaluated on twelve domain pairs of the Amazon Reviews Sentiment dataset, yielding 91.74% accuracy, a 1.11% absolute improvement over the state of the art.
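The mixed objective described above can be sketched as a convex combination of the supervised classification loss (computed on labeled source-domain data) and the MLM loss (computed on unlabeled target-domain data). The mixing weight `lam` below is a hypothetical placeholder, not a value from the paper.

```python
def mixed_loss(classification_loss, mlm_loss, lam=0.5):
    """Weighted sum of the supervised task loss and the MLM loss.

    classification_loss: cross-entropy on labeled source-domain examples.
    mlm_loss: masked-language-model loss on unlabeled target-domain text.
    lam: mixing weight (illustrative; the paper's choice may differ).
    """
    return lam * classification_loss + (1.0 - lam) * mlm_loss

# Hypothetical per-batch loss values from the two objectives:
total = mixed_loss(0.42, 1.30, lam=0.5)
print(total)
```

Because the MLM term needs no labels, the same scalar can also serve as the stopping criterion on held-out target data, as the abstract notes.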