Spoken Corpora Data, Automatic Speech Recognition, and Bias Against African American Language: The Case of Habitual ‘Be’
Recent work has revealed that major automatic speech recognition (ASR) systems, such as those from Apple, Amazon, Google, IBM, and Microsoft, perform much more poorly for Black U.S. speakers than for white U.S. speakers. Researchers postulate that this may result from biased, largely racially homogeneous training datasets. However, while the study of ASR performance with regard to the intersection of racial identity and language use is slowly gaining traction within AI, machine learning, and algorithmic bias research, little to nothing has been done to examine the data drawn from the spoken corpora commonly used to train and evaluate ASRs in order to determine whether they are actually biased.

This study seeks to begin addressing that gap by investigating spoken corpora used for ASR training and evaluation for a grammatical feature of what the field of linguistics terms African American Language (AAL), a systematic, rule-governed, and legitimate linguistic variety spoken by many (but not all) African Americans in the U.S. This grammatical feature, habitual 'be', is an uninflected form of 'be' that encodes the characteristic of habituality, as in "I be in my office by 7:30am", paraphrasable as "I am usually in my office by 7:30" in Standardized American English. The study applies established corpus linguistics methods to the transcribed data of four major spoken corpora -- Switchboard, Fisher, TIMIT, and LibriSpeech -- to characterize the frequency, distribution, and usage of habitual 'be' within each corpus as compared to a reference corpus of spoken AAL, the Corpus of Regional African American Language (CORAAL). The results show that habitual 'be' appears far less frequently, is dispersed across far fewer transcribed texts, and is surrounded by a much less diverse set of word types and parts of speech in the four ASR corpora than in CORAAL.
This work provides foundational evidence that spoken corpora used in the training and evaluation of widely used ASR systems are, in fact, biased against AAL and likely contribute to poorer ASR performance for Black users.
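The frequency and dispersion comparison described above can be illustrated with a minimal sketch. The transcripts, the candidate-matching pattern, and the measures below are illustrative assumptions, not the study's actual pipeline: real corpus analyses of habitual 'be' require part-of-speech context and manual verification of every hit, whereas this toy heuristic merely flags uninflected 'be' tokens not preceded by a modal or infinitival 'to', then reports a normalized frequency (per million words) and a simple dispersion measure (share of texts containing at least one candidate).

```python
import re

# Toy transcript set standing in for corpus files; the actual study uses
# transcripts from Switchboard, Fisher, TIMIT, LibriSpeech, and CORAAL.
transcripts = {
    "doc1": "I be in my office by seven thirty most days",
    "doc2": "she was at home yesterday",
    "doc3": "they be playing outside after school",
}

# Heuristic pattern for candidate habitual 'be': the word 'be' not
# immediately preceded by a modal/auxiliary or infinitival 'to'.
pattern = re.compile(
    r"\b(?<!will )(?<!would )(?<!can )(?<!could )(?<!should )"
    r"(?<!may )(?<!might )(?<!must )(?<!to )be\b",
    re.IGNORECASE,
)

total_tokens = sum(len(t.split()) for t in transcripts.values())
hits = {doc: len(pattern.findall(t)) for doc, t in transcripts.items()}

# Normalized frequency (tokens per million words) and dispersion
# (proportion of texts containing at least one candidate token).
freq_pmw = sum(hits.values()) / total_tokens * 1_000_000
dispersion = sum(1 for n in hits.values() if n) / len(transcripts)

print(f"candidate tokens: {sum(hits.values())}")       # prints 2
print(f"frequency per million words: {freq_pmw:.0f}")
print(f"dispersion across texts: {dispersion:.2f}")    # prints 0.67
```

Comparing these two measures for an ASR training corpus against CORAAL is the basic shape of the analysis: a feature that is frequent and widely dispersed in AAL speech but rare and concentrated in the training data is a candidate source of recognition bias.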