โŒ


Transcribing speech is never neutral. It shapes power and bias

Vaselena / Getty Images

Earlier this year I gave a talk about my research at Oxford’s All Souls College, and worked with a chef to design an accompanying menu.

Thinking about my work in southwest Western Australia, I typed “Boorloo”, the Nyungar name for the City of Perth.

Autocorrect had other ideas. It replaced it with “Barolo” – which, I thought, made for a fitting wine choice on the night.

It was an amusing moment, but also a revealing one. The system’s dictionary, trained largely on mainstream English data, didn’t know what Boorloo was, so it reached for a more familiar alternative. This seemingly minor miscorrection offers a glimpse into how language technologies are shaped – including which words they recognise, and which they overlook.

Why does this happen?

Part of the answer is that technologies such as automatic speech recognition convert spoken language into text. Transcription is often presented as a straightforward technical exercise: you listen, you write down what was said.

But every transcription protocol carries within it assumptions about what standardised speech looks like. In the words of linguist Mary Bucholtz, “all transcripts take sides”.

In practice, the standardised language is almost always the “prestige dialect” of powerful institutions. For English, that may be the variety used in the Oxford English Dictionary or by the BBC.

Recent research from Cornell University and Carnegie Mellon shows what this means in concrete terms.

When people watched a video presentation with automatically generated, error-prone subtitles, they consistently rated the speaker as less clear and less knowledgeable than did viewers who saw the same presentation with accurate captions. The quality of the transcription affected not only how viewers perceived the speaker, but also how they perceived the content of the talk.

Automated systems, amplified consequences

The stakes are particularly high for First Nations people in Australia. Here, the mismatch between the conventions of transcription and the actual practice of communication can be severe.

In many Indigenous communities, pauses and silences themselves function as meaningful acts of communication.

In places such as Wadeye in Australia’s Northern Territory, a sustained silence is not a gap to be filled. Instead it is part of the structure of what is being communicated.

Transcription systems developed in northern hemisphere academic contexts will generally render those silences with hesitation markers, ellipses, or editorial cuts, stripping out meaning.

Common words in languages other than English (such as “Boorloo” for Perth) go unrecognised. They may be mistranscribed to fit the language models on which technology is trained.

In legal, medical and welfare contexts, transcription can determine someone’s liberty, diagnosis, or entitlements. Here, systematic misrepresentation of non-standardised language is a justice issue.

Tools using artificial intelligence (AI) for transcription are now being deployed in hospitals and GP practices across Australia, resulting in mistakes, omissions and so-called hallucinations. A recent study of several AI scribes found all of them made errors in transcription and note-taking.

About half of the samples also included factual inaccuracies. Hallucinations occurred frequently, fabricating diagnoses or listing medications that were never taken. In one case, a male patient was even recorded as being on the contraceptive pill.

Making conventions visible

Making things better includes developing more diverse models for automated speech recognition.

But for anyone producing transcripts right now โ€“ in journalism, oral history, the law, clinical records, or sociolinguistic research โ€“ certain obligations apply. Make your conventions explicit, acknowledge what your system cannot represent, and resist the impulse to normalise speech into something legible to an imagined standard reader.

Rendering speech into writing may seem natural, but writing is itself a technology. The task is not to achieve perfect objectivity, but to be visible and accountable for decisions about what is included and excluded, and how those decisions are made.

The Conversation

Celeste Rodriguez Louro receives funding from the Australian Research Council and Google.
