We ask Alexa to play songs from the ’70s. We ask Siri for a weather report. We dictate e-mails and text messages to our phones. We speak into machines and they do things for us. They seem to understand us. But is this technology up to the task of preparing important documents like legal transcripts?
Over the last five years, there have been several new Internet-based companies promoting the use of voice recognition, coupled with artificial intelligence (AI), to produce transcripts. Are these transcripts of the same quality as those a skilled court reporter produces? Is voice recognition now sophisticated enough to remove the human from the equation?
Reliable statistics on the accuracy of voice-to-text software are hard to find. This is likely because of the lack of consistency in the ever-changing topics and speakers that voice recognition software tries to turn into text. In litigation, the factors that change on a daily basis include:
- Technical nature of the testimony;
- How clearly people speak, including accents;
- Overtalking by the participants;
- Names, acronyms, etc. that have no Oxford dictionary context;
- Changing nature of cases (medical, technical, construction, etc.) and terminology.
Whether you’re using AI voice-to-text software or a traditional court reporter, all of the factors listed above will affect one’s ability to produce an accurate transcript. To overcome these obstacles, a human being is required to intervene. As court reporters, we ask witnesses to spell and clarify if we’re unsure about something they’ve said; we stop people from speaking over each other or ask them to speak clearly; and in preparing a transcript, we often do research to ensure that what we heard is, in fact, correct. For example, I recently had a case involving a U.S. statute unknown to me, and I needed to be sure of the citation style and numbering, so I Googled it.
For voice-to-text transcription to work well, it starts with an excellent recording. That means using microphones for everyone who is going to speak in the room and identifying them by their microphone designation (e.g., Mr. Smith is on mic one, Ms. Jones is on mic two, and the witness is on mic three).
High-quality recording equipment, with redundancy, is a must. The digital monitor must create a log of notes and perform all the tasks of a court reporter to overcome the obstacles outlined above. All of these elements need to be present for reasonably accurate AI voice-to-text translation. If you think you can slap a tape recorder down on a table without these safeguards, good luck getting anything comprehensible from an AI system.
At present, voice-to-text still struggles with the lack of punctuation, the inability to “decide” where to place the speech of overtalking speakers within the text and, of course, deciphering some of the language being spoken. All of these issues require the intervention of a human, usually referred to as an editor. The editor reviews, against the audio recording, what the automated voice-to-text software has produced; inserts punctuation; corrects errors, misidentified speakers, and misspellings; and ensures that, contextually, the “machine” has understood what is being said.
I am always reminded of the book Eats, Shoots & Leaves with the cover photo of a panda. If we were to contextualize the title of the book with the picture, the title should read Eats Shoots and Leaves. There’s a big difference between those two meanings.
No one can dispute that voice recognition has come a long way in a very short time. It’s hard to predict when, if ever, it will be able to fully replace a human being’s intervention. For the foreseeable future, voice-to-text, accompanied by excellent audio quality and a trained editor (preferably a seasoned ex-court reporter or experienced transcriber), is the most likely next step, but only once editing and certifying a voice-to-text transcript becomes more efficient than our current methods.
So, we won’t be turning over the creation of a certified legal transcript to the machines today.