Speech technologies have become integral to everyday life and have many applications. For instance, with Automatic Speech Recognition (ASR), you can control devices and access information simply by using voice commands. A speech-to-text application allows for real-time dictation and note taking, while a speaker identification application can identify who is speaking in an audio sample. The technology can also aid communication between people who speak different languages by converting spoken audio from one language to another through translation applications. These are just a few of the many tasks speech recognition technology can be used for.

However, as this technology grows in popularity, we are uncovering its fragility. Imagine you're trying to use a virtual assistant such as Siri, Alexa, or OK Google, but no matter how hard you try, the assistant has trouble understanding you. Even though you're a bilingual or native speaker, you have to repeat yourself multiple times, or even change your accent, before the assistant finally understands you. This unfair behaviour, called bias, is the tendency of a system to perform differently depending on factors such as age, gender, and accent, among others. To mitigate bias, it is essential to use diverse training data and to continually evaluate and improve the system's performance on underrepresented groups.
Several groups may suffer from bias in ASR systems, including second-language speakers, speakers of regional varieties, and groups defined by gender, age, ethnicity, or education.
To diagnose this bias, it is essential to have annotated data that enables the evaluation of ASR systems. For example, the Speech Accent Archive [1] has been used to analyse the performance of commercial ASR systems provided by Amazon and Google [2]. The results showed that both produce high error rates for second-language speakers of English, male speakers, and speakers of some varieties from the North and Northeast of England, compared with native speakers, women, and speakers from the South of England.
Another requirement is a set of evaluation metrics for speech recognition. Some of the most common are:
Character Error Rate (CER) is the number of character-level errors divided by the total number of characters in the reference text. A lower CER indicates better performance:

CER = (S + D + I) / N

where S is the number of substitutions (incorrect characters), D the number of deletions (missing characters), I the number of insertions (extra characters), and N the total number of characters in the reference text.
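To make the formula concrete, here is a minimal Python sketch that computes CER from scratch with a standard Levenshtein (edit-distance) dynamic programme. The function names and example strings are ours, for illustration only:

```python
def min_edits(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the reference sequence into the hypothesis."""
    # dp[i][j] = edits to transform ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all remaining reference items
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all remaining hypothesis items
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (S + D + I) / N over characters."""
    return min_edits(reference, hypothesis) / len(reference)


print(cer("the cat sat", "the bat sat"))  # 1 substitution / 11 chars ≈ 0.09
```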
Word Error Rate (WER) is the number of word-level errors divided by the total number of words in the reference text. A lower WER indicates better performance:

WER = (S + D + I) / N

where S is the number of substitutions (incorrect words), D the number of deletions (missing words), I the number of insertions (extra words), and N the total number of words in the reference text.
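WER uses the same edit-distance machinery as CER, only over words. Here is a self-contained sketch; the whitespace tokenisation and the example utterances are our own simplifications, and real evaluations usually normalise case and punctuation first:

```python
def min_edits(ref, hyp):
    """Levenshtein distance with unit costs (S + D + I), single-row DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N over whitespace-separated words."""
    ref_words = reference.split()
    return min_edits(ref_words, hypothesis.split()) / len(ref_words)


print(wer("turn on the lights", "turn off the light"))  # 2 errors / 4 words = 0.5
```

In practice, an off-the-shelf package such as the open-source jiwer library implements WER along with the text normalisation steps that real evaluations need.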
Dialect Density Measure (DDM) evaluates the degree of dialectal variation present in a speech corpus, which is especially useful in multilingual or multidialectal settings, and can help identify regions or speakers where a system may perform poorly. DDM counts the number of dialect-specific phonemes and compares it to the total number of phonemes in the corpus; a higher dialect density indicates a higher degree of dialectal variation. The exact formula varies by implementation, but a common form is:

DDM = N_d / N

where N_d is the number of dialect-specific phonemes in the corpus, and N is the total number of phonemes in the corpus.
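To illustrate, here is a minimal Python sketch under strong assumptions: the corpus has already been transcribed as a phoneme sequence, and a set of dialect-specific phonemes has been defined by an annotator. Both the inventory and the example transcription below are made-up placeholders, not taken from any published DDM implementation:

```python
# Placeholder set of dialect-specific phonemes (illustrative assumption only;
# in practice a linguist defines this per language variety).
DIALECT_SPECIFIC = {"ʔ", "ɾ"}  # e.g. glottal stop, alveolar tap


def dialect_density(phonemes):
    """DDM = N_d / N: fraction of dialect-specific phonemes in a corpus."""
    n_dialect = sum(1 for p in phonemes if p in DIALECT_SPECIFIC)
    return n_dialect / len(phonemes)


# A toy phonemic transcription of "butter" with a glottal stop: /b ʌ ʔ ə/
print(dialect_density(["b", "ʌ", "ʔ", "ə"]))  # 1 / 4 = 0.25
```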
In this blog, we show you some great datasets to consider when analysing bias in your ASR systems. The selected datasets are well documented, support different speech recognition tasks, and annotate important attributes such as accent, gender, age, region, ethnicity, and education. Remember: the license is important for your application!
Here are some excellent datasets to consider:
- Speech Accent Archive [1]
- Open-source Multi-speaker Corpora of the English Accents in the British Isles [4]
- Datatang speech datasets [5]
- Artie Bias Corpus [6]
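Once you have a dataset annotated with demographic attributes, a first-pass bias diagnosis can be as simple as computing WER per subgroup. The sketch below assumes the open-source jiwer package for WER; the sample records, field names, and accent labels are hypothetical:

```python
from collections import defaultdict
from jiwer import wer  # pip install jiwer

# Hypothetical evaluation records: reference transcript, ASR output,
# and the speaker's annotated accent group.
samples = [
    {"accent": "southern_england", "reference": "turn on the lights", "hypothesis": "turn on the lights"},
    {"accent": "northern_england", "reference": "turn on the lights", "hypothesis": "turn on the light"},
    {"accent": "northern_england", "reference": "call my mother",     "hypothesis": "call my brother"},
]

# Pool references and hypotheses per group so short utterances
# are not over-weighted in the group's score.
groups = defaultdict(lambda: ([], []))
for s in samples:
    refs, hyps = groups[s["accent"]]
    refs.append(s["reference"])
    hyps.append(s["hypothesis"])

# A large gap between groups is a signal of potential bias worth investigating.
for accent, (refs, hyps) in groups.items():
    print(f"{accent}: WER = {wer(refs, hyps):.2f}")
```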
---
References
[1] Weinberger, S. (2013). Speech Accent Archive. George Mason University.
[2] Markl, N. (2022). Language variation and algorithmic bias: Understanding algorithmic bias in British English automatic speech recognition. In 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 521-534).
[4] Demirsahin, I., Kjartansson, O., Gutkin, A., & Rivera, C. (2020). Open-source multi-speaker corpora of the English accents in the British Isles. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020).
[5] https://www.datatang.ai/datasets/950
[6] Meyer, J., Rauchenstein, L., Eisenberg, J. D., & Howell, N. (2020). Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 6462-6468).