Paper Session 15
Mar 09, 2021, 20:00 - 21:45 UTC
https://2021.facctconference.org/conference-agenda/spoken-corpora-data-automatic-speech-recognition-and-bias-against-african-american-language-the-case-of-habitual-be
Abstract

Recent work has revealed that major automatic speech recognition (ASR) systems, such as those from Apple, Amazon, Google, IBM, and Microsoft, perform much more poorly for Black U.S. speakers than for white U.S. speakers. Researchers postulate that this may be a result of biased datasets which are largely racially homogeneous. However, while the study of ASR performance with regard to the intersection of racial identity and language use is slowly gaining traction within AI, machine learning, and algorithmic bias research, little to nothing has been done to examine the data drawn from the spoken corpora which are commonly used in the training and evaluation of ASR systems in order to understand whether or not they are actually biased. This study seeks to begin addressing this gap in the research by investigating spoken corpora used for ASR training and evaluation for a grammatical feature of what the field of linguistics terms African American Language (AAL), a systematic, rule-governed, and legitimate linguistic variety spoken by many (but not all) African Americans in the U.S. This grammatical feature, habitual 'be', is an uninflected form of 'be' that encodes the characteristic of habituality, as in "I be in my office by 7:30am", paraphrasable as "I am usually in my office by 7:30" in Standardized American English. This study utilizes established corpus linguistics methods on the transcribed data of four major spoken corpora -- Switchboard, Fisher, TIMIT, and LibriSpeech -- to understand the frequency, distribution, and usage of habitual 'be' within each corpus as compared to a reference corpus of spoken AAL -- the Corpus of Regional African American Language (CORAAL). The results show that habitual 'be' appears far less frequently, is dispersed in far fewer transcribed texts, and is surrounded by a much less diverse set of word types and parts of speech in the four ASR corpora as compared with CORAAL.
This work provides foundational evidence that spoken corpora used in the training and evaluation of widely used ASR systems are, in fact, biased against AAL and likely contribute to poorer ASR performance for Black users.
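
The three measures named in the abstract (frequency, dispersion across transcribed texts, and diversity of surrounding word types) can be illustrated with a minimal Python sketch. This is not the authors' method: the function names, the pronoun list, and the "subject pronoun immediately followed by 'be'" heuristic are assumptions made for illustration only. The heuristic merely flags candidates (it would also match non-habitual strings such as "let it be"), so real use would require manual review of every hit.

```python
import re
from collections import Counter

# Assumption: a crude candidate filter, not a validated detector of habitual 'be'.
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def candidate_stats(transcripts):
    """Summarize candidate 'pronoun + be' hits over a {text_id: transcript} dict."""
    total_tokens = 0
    hits = 0
    texts_with_hit = 0
    right_collocates = Counter()  # word types immediately following the candidate 'be'

    for text in transcripts.values():
        tokens = tokenize(text)
        total_tokens += len(tokens)
        found = False
        for i in range(1, len(tokens)):
            # Uninflected 'be' directly after a subject pronoun; 'will be' etc. is skipped
            if tokens[i] == "be" and tokens[i - 1] in SUBJECT_PRONOUNS:
                hits += 1
                found = True
                if i + 1 < len(tokens):
                    right_collocates[tokens[i + 1]] += 1
        if found:
            texts_with_hit += 1

    per_10k = 10_000 * hits / total_tokens if total_tokens else 0.0
    return {
        "hits": hits,                                   # raw frequency
        "per_10k_tokens": per_10k,                      # normalized frequency
        "range": texts_with_hit,                        # dispersion: texts with >= 1 hit
        "n_texts": len(transcripts),
        "right_collocate_types": len(right_collocates), # diversity of following words
    }


if __name__ == "__main__":
    # Toy transcripts, invented for illustration; not drawn from any corpus.
    toy = {
        "a": "I be in my office by 7:30am",
        "b": "she will be late",
        "c": "they be working and they be tired",
    }
    print(candidate_stats(toy))
```

Running the same function over each corpus and over a reference corpus would give comparable normalized frequencies and ranges, which is the kind of comparison the abstract describes; part-of-speech diversity would additionally need a tagger, which this sketch omits.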
