This blog post contains some
examples of speech-to-text transcription apps which could be useful for
qualitative researchers, e.g. to quickly transcribe and summarise meetings,
interviews, or conversations. After an overview of some key features, I reflect
on some key considerations for using these apps – for example, the ethics, inclusivity,
and privacy implications of digital/automated methods. This is the second post in a 2-part
introduction to using automated speech-to-text apps – see part 1 for an
overview and background information.
In this post, I reflect on
my experiences of using Otter.ai to record, take notes, and embed photos during
the Talking Maps exhibition at the Weston Library in Oxford (I’ll write a
separate blog post about this exhibition, as it was fantastic!). At this
exhibition, we joined a large group as part of a guided tour with Stewart
Ackland from the Map Department at the Bodleian Library. With permission, this
tour was recorded using the Otter.ai app on my smartphone (Samsung Galaxy s10).
Of course, you can do a lot
of the following tasks manually (or by using Natural Language Processing
features in a programming language/environment, NVivo, or similar). However,
these in-built features in Otter.ai could be very useful for those who are new
to automatic ways of transcribing, summarising, and displaying qualitative data
(or would benefit from having these features in an accessible, engaging, and
free mobile/computer app).
Automatic word frequencies
Otter.ai automatically finds
key words, i.e. the most frequently mentioned words. It displays these as a
list at the top of the transcription, once it has finished processing the
conversation after recording. The words are ordered by how frequently they are
mentioned, and you can click on any of them to highlight every occurrence
throughout the transcript. Otter can also generate a word cloud from these
frequent words, with the size of each word proportional to its frequency.

Word clouds are by no means
a sophisticated way to analyse text; however, they do provide a quick, easy, and
engaging way to see which words are most prevalent in your transcript. For example, the photos at the beginning of this blog (parts 1 and 2) are word clouds created from the text in the post - including words I've frequently used like 'transcription', 'otter.ai' and 'example' (see bottom of article for citation). In the
word cloud above, you can see that our conversation at the map
exhibition was (unsurprisingly!) about maps, and things that can be related to
maps (country, area, land, ocean, world, Europe, people, etc.). It’s important
to note that the transcript has been automatically cleaned: common English
‘stop words’ (e.g. “so”, “if”, “and”) have been removed for you, so they
don’t affect the frequency of the words you might be most interested in.
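As a rough illustration of what a word-frequency feature does behind the scenes, here is a minimal Python sketch that counts words in a snippet of transcript after removing common stop words. The stop-word list and the example text here are invented for illustration; Otter.ai’s actual processing is proprietary and this is only the general idea.

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; real text-mining tools (such as the
# R 'tm' package used for this blog's title image) ship much longer lists.
STOP_WORDS = {"so", "if", "and", "the", "a", "of", "to", "is", "was", "it", "that"}

def word_frequencies(text, top_n=5):
    """Count words in a transcript, ignoring common stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Invented example text, loosely echoing the map-exhibition conversation.
transcript = ("So this map of the world shows the ocean and the land, "
              "and the map was drawn so that the world is flat.")
print(word_frequencies(transcript, top_n=3))
# → [('map', 2), ('world', 2), ('this', 1)]
```

The frequency pairs returned here are exactly what a word cloud visualises: each word drawn at a size proportional to its count.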
Exploring word frequencies
In the previous example, you
can see that ‘field’ was one of the key words in this conversation about maps.
If you want to quickly find out more about this word, you can click on it to
highlight all of its occurrences in the
transcript (like pressing CTRL + F in a document). Let’s have a look at where
‘field’ is mentioned in the Talking Maps transcript.
When we navigate to the mentions
of ‘field’ in the transcript, we can see that this is clustered around 17
minutes in – the exhibition guide is talking about a very interesting map from
the 1600s, which depicts common agricultural practices at the time.
As you might be able to tell
from the excerpt above, one downside to Otter.ai is that it transcribes almost everything
that is said. This can be an issue because humans naturally don’t always
speak in coherent, flowing sentences, and often change the direction
of what they are saying mid-sentence (and pause, ‘umm’ and ‘err’ a lot). You can
end up with a lot of repetition, broken sentences, and some sentences that
don’t make sense. It’s therefore useful to listen to your recording as you
edit (you can do this easily within the Otter.ai application, or elsewhere).
When you edit your transcript in Otter.ai, it automatically realigns your text
with the audio, which is useful. You can also see that it has highlighted the
position of the word ‘field’ throughout your transcript along the time bar at
the bottom, which makes it easy to skip to the word you are interested in.
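To make the idea concrete, here is a minimal Python sketch of this kind of keyword lookup, assuming a transcript stored as timestamped segments. The segment times and text below are invented for illustration and are not from the actual Talking Maps recording.

```python
# Invented (timestamp-in-seconds, text) segments for illustration only.
segments = [
    (1020, "this map shows the open field system"),      # 17:00
    (1032, "each strip of field was farmed in common"),  # 17:12
    (1500, "later maps show enclosed land instead"),     # 25:00
]

def find_mentions(segments, keyword):
    """Return (seconds, text) for every segment containing the keyword."""
    keyword = keyword.lower()
    return [(t, text) for t, text in segments if keyword in text.lower()]

# Print each mention with a MM:SS timestamp, like jumping along a time bar.
for seconds, text in find_mentions(segments, "field"):
    minutes, secs = divmod(seconds, 60)
    print(f"{minutes:02d}:{secs:02d}  {text}")
```

Running this lists the two mentions of ‘field’ clustered around 17 minutes in, which mirrors how clicking a key word lets you skip straight to the relevant part of the recording.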
Editing, photos, and speaker assignment
As I mentioned before, no
transcription software perfectly understands every word that is being said –
particularly if there are different accents, speeds and tones of speaking,
multiple people trying to speak at once (as was the case with our conversation
in the museum), or if acronyms and unusual place names are used. However, you
can easily edit any mistakes while listening to the recording, before you
export the file for further analysis. Over time, Otter.ai will learn to pick
up how you say certain words, and you can also teach Otter names, words,
acronyms, and phrases to improve the accuracy of the transcription (you can
teach it up to 5 words for free, or 200 if you upgrade).
The examples below also show
how you can easily integrate photos within the flow of text. This can be done
by taking a photo on your smartphone, for example, while also recording on
Otter.ai (on the mobile app). The embedded photos are useful to refer back to, so you know
exactly what the speaker is discussing at that point in the conversation (in this case,
unsurprisingly, it’s maps again!). It’s also a nice feature for researchers
interested in mobile research methods (particularly those involving walking
interviews, smartphones, and/or human-technology interactions), however
background noise and the recording of multiple participants might be an issue
here.

You might have noticed that
in the pictures above, the person who is speaking is labelled as ‘Speaker 1’. At
first, the speaker’s name was blank. Once I had labelled it, Otter began
scanning through and automatically labelling ‘Speaker 1’ whenever it
picked up that they were saying something. This is mostly accurate (ish), but you
might want to double-check by listening back through your recording. You can
also save the names (or code names) of ‘suggested speakers’ in the Otter app.
I’ve found this useful when recording regularly occurring meetings, for example
those with my PhD supervisors.
Is there anything I should consider before using it?
Otter.ai is not 100%
accurate, and it might not be the best, most reliable (or most time- or
cost-effective) choice for everyone. Otter can struggle to recognise the
voices of different speakers and to pick up some accents, and is also quite
limited in the languages it recognises (though this is something the
company is improving). It also requires a clear recording with little or no
background noise, and can struggle to transcribe multiple voices when people
speak at once (however, it did work rather well for me in a museum with lots of
people talking in the background!). Further to this, Otter can miss out quite a
bit of punctuation (or, on the other hand, overuse punctuation and put
unexpected full stops in place of a natural pause), which requires further
edits. Finally, particularly if you are using your mobile phone to record
meetings and interviews, it is worth noting where the microphones are on your
device to ensure that you can record two or more voices (e.g. most smartphones
have mics on the top and bottom of the handset).
As with any digital research
tool, you might want to critically evaluate the ways that technology includes
(and excludes) individuals and groups of people. Ethics, inclusivity, and power
relations are all important considerations here, including how this affects the
knowledge produced by the research encounter. If you’re interested in digital
research methods and ethics, this is a topic of interest in digital
geographies, for example – the RGS-IBG
Digital Geographies Research Group
hosts and promotes some great events and resources. Considering the explosion
of the use of digital tools during the coronavirus pandemic and social
distancing measures, this LSE Impact Blog post outlines some practical and ethical
considerations of carrying out qualitative research under lockdown (this Google Doc on ‘doing fieldwork in a pandemic’,
edited by Deborah Lupton, also contains some excellent resources).
Importantly, the use of speech-to-text
applications (including Otter.ai) for research purposes comes with important
concerns regarding privacy and security. This is because sections of your
recorded information could be used for training and quality testing purposes -
see Otter.ai FAQs on “Is my data safe?” for more information on this,
and view their full privacy policy here. It is important to carefully consider the privacy and
security of any application or service you use for transcription, particularly
if you are responsible for handling sensitive data. It is also important to
think about how using apps like Otter.ai fits in with your institution’s GDPR
and ethics guidelines, and/or the guidelines of the organisation you are
collecting data for. As best practice, you should consider gaining informed consent from anyone you wish to record using Otter.ai (or similar apps). You should
also think about whether your institution’s ethics committee needs to be aware that you are
intending to use this method of recording and data storage.
Conclusion: are automated speech-to-text apps useful for qualitative research?
Automated speech-to-text
applications have the potential to be incredibly useful, if used with
consideration and for suitable purposes. Apps like Otter.ai can save you a large
amount of time by allowing a computer to perform the labour-intensive task of
transcription for you. They can also help by identifying emerging themes,
highlighting key words, embedding photographs, and visualising your text (for
example, as word clouds).
However, speech-to-text apps
are not 100% there yet in terms of the accuracy and reliability of
transcription (and thus require a certain level of manual editing after the
transcript has been generated). That said, some manual editing isn’t necessarily
a bad thing, as listening through your recordings again can help you gain a better
understanding of the data you have collected. As with many digital methods,
these apps may also provoke concerns regarding the ethics, privacy, and
security of data collection, processing, and storage.
In sum, artificial
intelligence and machine learning in speech recognition have certainly come a
long way, and apps like Otter.ai are getting there and will only continue to
improve. Speech-to-text transcription is a very exciting and continuously
developing area, with great potential to improve working conditions for social
scientists and other researchers. I’d definitely recommend looking at Otter.ai and
testing different speech-to-text transcription apps for yourself, to see what
works best for you and your research!
Links to some useful resources:
Wordcloud blog title image created from the text in this article in R. R Core Team (2019). R: A language and
environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. URL: https://www.r-project.org/.
The tm (v0.7-7; Feinerer & Hornik, 2019), readtext (v0.76; Benoit
& Obeng, 2020), wordcloud2 (v0.2.2; Lang, 2020), RColorBrewer
(v1.1-2; Neuwirth, 2014), and wordcloud (v2.6; Fellows, 2018) packages were used.