Guest blog by Angus Addlesee, one of the showcase presenters at DataTech during DataFest21’s Our People week.
The Benefits of Voice Assistants
Voice Assistants (like Siri or Amazon Alexa) are now commonplace and can be really useful. For example, the other day my hands were covered in flour while I was making pizza. In the past I would have had to wash my hands to set a timer, but now I can set one with just my voice.
This is handy, but voice assistants can hugely improve some people’s lives.
In a recent conversation with a gentleman who has MS, he mentioned that even something as routine as turning off his light at night was impossible for him. Now, using just his voice, he has some of his independence back. He noted that this has made a world of difference to his wellbeing.
Similarly, independently putting on music becomes difficult as dementia progresses. Buttons are fiddly, tablet interfaces are confusing, and record players require many steps to put on a particular song. However, music is extremely beneficial to people with dementia (check out Playlist for Life), so can voice assistants help here? I am currently working to answer this, and initial results suggest the answer is a resounding yes! One participant in the project said that he now turns off the TV to listen to music, and has even found new songs from artists that he loved when he was younger.
If you are still not convinced, I have one last example for you – people affected by sight loss. Computers can now ‘see’ and interpret these visual scenes. Can we answer questions asked by blind and partially sighted people? Read until the end to find out…
Our Conversations are Messy
How many times have you forgotten a word mid-sentence, pausing to remember it? Often these pauses are silent, but if you have ever listened to a recording of yourself, you may notice some pauses filled with ‘uums’. How many times have you jumbled a sentence and just decided to restart it entirely? I do this all the time… We even repeat words without really realising it, like: “I really liked that that that show about chess on Netflix”.
And that is just how we talk on our own, not with others. How many times has someone spoken over you completely, or attempted to finish your sentence when you hesitate? We even say things like “yep”, “uh-huh”, and “oh” while someone is telling us a story.
Conversation is more complex than you would initially think.
This is before we even start on the visual cues we use! We nod in agreement, shake our heads in disgust, make eye contact to show engagement, furrow our brows to show confusion, and point at things around us.
We use these ‘conversational phenomena’ subconsciously every day to guide our interactions. Maybe look out for some of these when you next chat to someone.
Importantly, dialogue changes as cognition declines, and these phenomena become more common and more pronounced over time. People pause mid-sentence more often as dementia progresses, for example, and these pauses become longer. To get around these long pauses, a person with dementia might say “Can you get the thing in the oven?” instead of “Can you get the chicken?”, making it less obvious that they have forgotten the word ‘chicken’. These changes are important to explore if we are to design better voice assistants.
We Adapt to Voice Assistants
If I said the following to you, what would you think I wanted you to do?
“Alex – my hands are dirty so could you maybe set me a timer for uuh, I don’t know, a a couple of minutes?”
I’m pretty sure you are all thinking the same thing. However, I have tested this, and my voice assistant has no idea. Why? Because my sentence had too many ‘conversational phenomena’ in it. Instead, I would say to my assistant:
“Assistant, set a timer for two minutes”
Now this is clearer in my DataTech presentation, but hopefully my point still comes across – we have learned to clean our speech of conversational phenomena when speaking to our voice assistants. In other words, we adapt to the system…
We know our speech is messy, and these phenomena are incredibly useful when we speak to each other. Current voice assistants do not understand them well, however, and they become more common and more pronounced as cognition declines. This is an important point, as this happens to be a group that can benefit hugely from this technology!
So how do we start sorting this? Many people have adapted to current systems, so can we adapt current systems to certain groups of people? My PhD supervisors and I published a paper in 2019 exploring this. We identified two main challenges: a lack of data and unnatural interaction.
Lack of Data
Today’s voice assistants are trained on millions of hours of people’s speech. This is usually clean speech and not from specific user groups. For example, the Common Voice dataset actively chucks out any audio that contains ‘umm’. Now this is an incredible dataset containing almost 14,000 hours of transcribed audio in 76 languages – but it doesn’t help us understand how conversational phenomena change as cognition declines.
I spent over a year developing a new data collection that would provide researchers with recordings (both audio and video) of conversations between someone with dementia and a volunteer. This was set to take place in April 2020 with Alzheimer Scotland, but has been postponed for now… I did publish a paper with Pierre Albert in 2020 on the ethics of collecting a dataset like this (a more casually written article was also published in Towards Data Science).
Due to Covid, I had to develop an alternative data collection, and I am delighted to say that it is now underway!
We are capturing the audio of interactions between people with dementia and an Amazon Alexa device with Spotify. Most of the participants who have taken part so far have loved it. As I mentioned earlier, the ability to independently access music has been a hit. With this data, we will be able to explore how often Alexa misunderstands the user and which conversational phenomena cause the most misunderstandings – guiding our future work.
Unnatural Interaction
In order to ask a voice assistant for something, you must say the ‘wake word’, e.g. Alexa, Hey Siri, Hey Google, etc… It then lights up or beeps to show you that it is listening. You ask your question, and a short silence lets the assistant know you are done. At this point the assistant stops listening, indicated again by lights or beeps, and it finally answers once it has a response ready.
This is not a natural interaction.
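To make this concrete, here is a rough Python sketch of that strict turn-taking loop. It is only a toy you can run at the keyboard – the wake word, beeps and timer answer are illustrative placeholders, not any real assistant’s API.

import time

# A toy simulation of the strict turn-taking loop, typed rather than spoken.
# Everything here is a hypothetical stand-in for a real assistant.

def record_until_silence():
    # On a real device this would be audio, ended by a short pause.
    return input("(listening... a pause ends your turn) > ")

def answer(text):
    if "timer" in text.lower():
        return "Timer set."
    return "Sorry, I am not sure about that one."

def assistant_loop():
    while True:  # Ctrl+C to stop
        input("Say the wake word (press Enter to simulate it): ")
        print("*beep* (now listening)")
        utterance = record_until_silence()
        print("*beep* (stopped listening: anything you add now is ignored)")
        time.sleep(0.5)  # 'thinking'
        print("Assistant:", answer(utterance))

assistant_loop()

Notice that once the assistant stops listening, nothing you say afterwards can change its answer – which is exactly the problem in the example below.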
Imagine you are with a human travel agent and ask for “all flights tomorrow from Scotland to Paris… oh no Berlin”. The travel agent might start looking for flights to Paris for a second, but would then understand and give you the results you actually wanted. A voice assistant, on the other hand, would stop listening at the pause – something like this:
You: “Assistant, give me all flights tomorrow from Scotland to Paris”.
Assistant: <light twirls or beeps>
You: “oh no Berlin”.
Assistant: provides you with flights to Paris.
You: “No. Berlin”.
You: “Assistant, to Berlin”.
Assistant: “I’m sorry, I am not sure about that one”.
You: “Assistant, give me all the flights tomorrow from Scotland to Berlin”.
Assistant: provides you with flights to Berlin.
In order to make these interactions more natural, we must break away from taking strict turns. The next generation of voice assistants needs to understand us as we say each word.
How to Understand Someone Word by Word
As someone speaks, the sound waves they produce hit our ears and our brain translates them into words. Computers do this similarly using automatic speech recognition (ASR), transforming audio into text. Voice assistants translate the audio of an entire utterance (everything the user has said) in one go, hence the strict turn-taking. We want to do this word by word, aka incrementally, outputting text live as the user says each word. This is a harder task – but how?
Well, let’s say we hear someone say “I want to cut a []edge”. We don’t quite hear the last word there but “edge” makes no sense, so it could be “hedge”. That is a fair prediction, but then the user finishes their utterance: “I want to cut a []edge of cheese”. We now know that the word is “wedge” thanks to future words. Incremental ASR does not have access to future words, so it may predict “hedge”, and then change its mind later to “wedge”.
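As a rough illustration (with hand-written hypotheses rather than real recogniser output), here is how that stream of incremental ASR hypotheses might look to the rest of the system – earlier words get revoked and replaced once later audio arrives:

# Toy illustration of incremental ASR output: the recogniser emits a new
# partial hypothesis as the audio comes in, and may revise earlier words
# once later context arrives. These hypotheses are hand-written for the
# "[]edge of cheese" example above, not real recogniser output.

partial_hypotheses = [
    "i want",
    "i want to cut",
    "i want to cut a hedge",            # best guess without future words
    "i want to cut a wedge of cheese",  # 'hedge' revised to 'wedge'
]

previous = []
for hyp in partial_hypotheses:
    words = hyp.split()
    # Find the first position where the new hypothesis disagrees with the old one.
    stable = 0
    while stable < min(len(previous), len(words)) and previous[stable] == words[stable]:
        stable += 1
    revoked = previous[stable:]  # words the recogniser takes back
    added = words[stable:]       # words it now commits to (for the moment)
    print(f"hypothesis: {hyp!r}")
    if revoked:
        print(f"  revoked: {revoked}")
    print(f"  added:   {added}")
    previous = words

This add-and-revoke behaviour is what makes incremental ASR harder to build on than a single final transcript – everything downstream has to be able to change its mind too.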
My colleagues and I released a whole paper on this topic in 2020, evaluating state-of-the-art incremental ASR systems from Google, IBM, and Microsoft. We found that none of them are yet suitable for conversational AI – but Microsoft’s is currently the best, for several technical reasons detailed in the paper.
Once we get this text word by word, we need to understand the user word by word. For example, if I say “Yesterday I ate some bananas and”, you all understand that I ate bananas and something else yesterday. You do not need me to finish the sentence to understand it. When I add “grapes”, your brain updates its understanding incrementally.
This is what I am currently working on with my supervisor Arash Eshghi. We are building a system that can understand what is being said, as it is being said. This work is ongoing, but we have just had a new paper published on the 4th of October 2021. We hope to tweak this understanding as we learn more from the data we collect, making the system more dementia friendly. This understanding can be displayed as a graph that grows with every word.
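As a toy illustration of the idea (and not the actual system or its semantic graphs), here is a Python sketch that updates a very simple meaning representation after every word of the banana example above:

# A toy sketch of word-by-word understanding: a partial meaning representation
# is updated as each word arrives, instead of waiting for the full sentence.
# The rules below are hand-written for this one example only.

meaning = {"time": None, "agent": None, "action": None, "objects": []}

def update(meaning, word):
    if word == "yesterday":
        meaning["time"] = "yesterday"
    elif word == "i":
        meaning["agent"] = "speaker"
    elif word == "ate":
        meaning["action"] = "eat"
    elif word in {"bananas", "grapes"}:
        meaning["objects"].append(word)
    # words like 'some' and 'and' do not change this toy representation
    return meaning

for word in "yesterday i ate some bananas and grapes".split():
    meaning = update(meaning, word)
    print(f"after {word!r}: {meaning}")

By the time it has heard “bananas”, the sketch already knows that I ate bananas yesterday – it never has to wait for the end of the sentence.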
Building Voice Assistants for Blind and Partially Sighted People
I had the wonderful opportunity to supervise almost 30 MSc students over the past two years, working on voice assistants for visually impaired people. We identified the kitchen as a particularly problematic area by talking to Sight Scotland and by looking through a dataset (called VizWiz) of questions asked by people affected by sight loss.
The first group built a foundation, called Aye-saac, that could ‘see’ and answer some standard visual questions. This was fantastic work that you can read about here.
This year we focused on two challenges: textual visual question answering (VQA) and spatial relations.
The textual VQA group tackled questions about text in images, such as the writing on a tin of soup, requiring the system to ‘read’ text in an image. Aye-saac can now answer questions like “I am a vegetarian, can I eat this?”, “Is this still in date?”, and “Am I allergic to this?”. These are all extremely difficult for people with sight impairment to answer on their own. The students did such a fantastic job that they had a paper published on the 18th of October 2021 – and a Rasa article!
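As a very rough sketch of the general idea (and not Aye-saac’s actual pipeline), the snippet below reads the text printed on a product photo with off-the-shelf OCR and tries to answer “Is this still in date?”. It assumes the pytesseract and Pillow libraries are installed, and ‘soup_tin.jpg’ is just a hypothetical example image:

import re
from datetime import date, datetime

from PIL import Image
import pytesseract

# Sketch only: real packaging text is far messier than this regex assumes.
def answer_is_it_in_date(image_path):
    text = pytesseract.image_to_string(Image.open(image_path))
    # Look for a best-before style date such as 25/12/2021.
    match = re.search(r"\d{2}/\d{2}/\d{4}", text)
    if not match:
        return "I could not find a date on the packaging."
    best_before = datetime.strptime(match.group(0), "%d/%m/%Y").date()
    if best_before >= date.today():
        return f"Yes, it is in date until {best_before}."
    return f"No, it went out of date on {best_before}."

print(answer_is_it_in_date("soup_tin.jpg"))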
The other group looked into guiding a person to objects in their kitchen. We learned that people affected by sight loss can easily locate ‘anchor objects’ like sinks, ovens, and fridges that don’t move. Finding ingredients and utensils that do move around, however, is more of a challenge. The students built a system that could find almost 300 different kitchen objects in a scene and describe their location relative to the ‘anchor objects’. Aye-saac’s responses were so much more usable than those of other systems that the students also had a paper published, on the 5th of October 2021.
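Again as a rough sketch rather than the students’ actual system, the relative-location idea can be illustrated with nothing more than bounding-box centres from an object detector (the boxes below are made up):

# Describe an object's position relative to an 'anchor' object using
# bounding-box centres from a detector. The boxes are invented for this sketch.

detections = {
    # name: (x_min, y_min, x_max, y_max) in image pixels; y grows downwards
    "sink": (400, 300, 600, 450),
    "mug": (150, 320, 220, 400),
}

def centre(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def describe(target, anchor, detections):
    tx, ty = centre(detections[target])
    ax, ay = centre(detections[anchor])
    horizontal = "to the left of" if tx < ax else "to the right of"
    vertical = "above" if ty < ay else "below"
    # Report whichever axis the two objects are separated on the most.
    relation = horizontal if abs(tx - ax) >= abs(ty - ay) else vertical
    return f"The {target} is {relation} the {anchor}."

print(describe("mug", "sink", detections))  # The mug is to the left of the sink.

A real system obviously has to handle much more (depth, occlusion, natural phrasing), but the relative-position idea is the same.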
I couldn’t be more delighted with all their hard work; it was a pleasure.
Want to Find Out More?
If you would rather watch than read, I presented this work at DataTech, part of DataFest 2021. You can also follow my latest updates on Twitter, LinkedIn, or Medium.
Last but not least, I am sponsored by Wallscope and The Data Lab – thanks!