Speech technologies have become integral to everyday life and have many applications. For instance, with Automatic Speech Recognition (ASR), you can control devices and access information simply by using voice commands. A speech-to-text application allows for real-time dictation and note taking, while a speaker identification application can identify who is speaking in an audio sample. The technology can also aid communication between people who speak different languages by converting spoken audio from one language to another through translation applications. These are just a few of the many tasks speech recognition technology can be used for.

However, as this technology grows in popularity, we are uncovering its fragility. Imagine you're trying to use a virtual assistant such as Siri, Alexa, or OK Google, but no matter how hard you try, the assistant has trouble understanding you. Even though you're a bilingual or native speaker, you have to repeat yourself multiple times, or even change your accent, before the assistant finally understands you. This unfair behaviour, called bias, is the tendency of a system to perform differently depending on factors such as age, gender, and accent, among others. To mitigate bias, it is essential to use diverse training data and to continually evaluate and improve the system's performance on underrepresented groups.
Several groups may suffer from bias in ASR systems, including second-language speakers, speakers of regional varieties, and groups defined by gender, age, ethnicity, or education.
To diagnose this bias, it is essential to have annotated data that enables the evaluation of ASR systems. For example, the Speech Accent Archive [1] has been used to analyse the performance of commercial ASR systems provided by Amazon and Google [2]. The results showed that both produce high error rates for second-language speakers of English, male speakers, and speakers of some varieties from the North and Northeast of England, compared with native speakers, women, and speakers from the South of England.
Another requirement is a set of evaluation metrics for speech recognition. Some of the most common are:
Character Error Rate (CER) is the number of character-level errors divided by the total number of characters in the reference text. A lower CER indicates better performance:

CER = (S + D + I) / N

where S is the number of substitutions (incorrect characters), D the number of deletions (missing characters), I the number of insertions (extra characters), and N the total number of characters in the reference text.
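To make the formula concrete, here is a minimal Python sketch that computes CER from scratch with a standard Levenshtein (edit-distance) dynamic programme. The function names and example strings are ours, for illustration only:

```python
def min_edits(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the reference sequence into the hypothesis."""
    # dp[i][j] = edits to transform ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all remaining reference items
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all remaining hypothesis items
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (S + D + I) / N over characters."""
    return min_edits(reference, hypothesis) / len(reference)


print(cer("the cat sat", "the bat sat"))  # 1 substitution / 11 chars ≈ 0.09
```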
Word Error Rate (WER) is the number of word-level errors divided by the total number of words in the reference text. A lower WER indicates better performance:

WER = (S + D + I) / N

where S is the number of substitutions (incorrect words), D the number of deletions (missing words), I the number of insertions (extra words), and N the total number of words in the reference text.
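WER uses the same edit-distance machinery as CER, only over words. Here is a self-contained sketch; the whitespace tokenisation and the example utterances are our own simplifications, and real evaluations usually normalise case and punctuation first:

```python
def min_edits(ref, hyp):
    """Levenshtein distance with unit costs (S + D + I), single-row DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N over whitespace-separated words."""
    ref_words = reference.split()
    return min_edits(ref_words, hypothesis.split()) / len(ref_words)


print(wer("turn on the lights", "turn off the light"))  # 2 errors / 4 words = 0.5
```

In practice, an off-the-shelf package such as the open-source jiwer library implements WER along with the text normalisation steps that real evaluations need.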
Dialect Density Measure (DDM) evaluates the degree of dialectal variation present in a speech corpus, which is especially useful in multilingual or multidialectal settings, and can help identify regions or speakers where a system may perform poorly. DDM counts the number of dialect-specific phonemes and compares it to the total number of phonemes in the corpus; a higher dialect density indicates a higher degree of dialectal variation. The exact formula varies by implementation, but a common form is:

DDM = N_d / N

where N_d is the number of dialect-specific phonemes in the corpus, and N is the total number of phonemes in the corpus.
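To illustrate, here is a minimal Python sketch under strong assumptions: the corpus has already been transcribed as a phoneme sequence, and a set of dialect-specific phonemes has been defined by an annotator. Both the inventory and the example transcription below are made-up placeholders, not taken from any published DDM implementation:

```python
# Placeholder set of dialect-specific phonemes (illustrative assumption only;
# in practice a linguist defines this per language variety).
DIALECT_SPECIFIC = {"ʔ", "ɾ"}  # e.g. glottal stop, alveolar tap


def dialect_density(phonemes):
    """DDM = N_d / N: fraction of dialect-specific phonemes in a corpus."""
    n_dialect = sum(1 for p in phonemes if p in DIALECT_SPECIFIC)
    return n_dialect / len(phonemes)


# A toy phonemic transcription of "butter" with a glottal stop: /b ʌ ʔ ə/
print(dialect_density(["b", "ʌ", "ʔ", "ə"]))  # 1 / 4 = 0.25
```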
In this blog, we show you some great datasets to consider when analysing bias in your ASR systems. The selected datasets are well documented, support different speech recognition tasks, and annotate important attributes such as accent, gender, age, region, ethnicity, and education. Remember: the license is important for your application!
Here are some excellent datasets to consider:
- Speech Accent Archive [1]
- Open-source Multi-speaker Corpora of the English Accents in the British Isles [4]
- Datatang speech datasets [5]
- Artie Bias Corpus [6]
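Once you have a dataset annotated with demographic attributes, a first-pass bias diagnosis can be as simple as computing WER per subgroup. The sketch below assumes the open-source jiwer package for WER; the sample records, field names, and accent labels are hypothetical:

```python
from collections import defaultdict
from jiwer import wer  # pip install jiwer

# Hypothetical evaluation records: reference transcript, ASR output,
# and the speaker's annotated accent group.
samples = [
    {"accent": "southern_england", "reference": "turn on the lights", "hypothesis": "turn on the lights"},
    {"accent": "northern_england", "reference": "turn on the lights", "hypothesis": "turn on the light"},
    {"accent": "northern_england", "reference": "call my mother",     "hypothesis": "call my brother"},
]

# Pool references and hypotheses per group so short utterances
# are not over-weighted in the group's score.
groups = defaultdict(lambda: ([], []))
for s in samples:
    refs, hyps = groups[s["accent"]]
    refs.append(s["reference"])
    hyps.append(s["hypothesis"])

# A large gap between groups is a signal of potential bias worth investigating.
for accent, (refs, hyps) in groups.items():
    print(f"{accent}: WER = {wer(refs, hyps):.2f}")
```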
---
References
[1] Weinberger, S. (2013). Speech Accent Archive. George Mason University.
[2] Markl, N. (2022). Language variation and algorithmic bias: Understanding algorithmic bias in British English automatic speech recognition. In 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 521-534).
[4] Demirsahin, I., Kjartansson, O., Gutkin, A., & Rivera, C. (2020). Open-source multi-speaker corpora of the English accents in the British Isles. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020).
[5] https://www.datatang.ai/datasets/950
[6] Meyer, J., Rauchenstein, L., Eisenberg, J. D., & Howell, N. (2020). Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 6462-6468).