The novelty of asking Siri to ‘call Mum’ has worn off and there is more competition than ever for our smooth-talking iPhone. Just as we moved from print to the screen, and from the desktop to mobile, voice technology is the next big shift in how we receive and consume information.
Similar to the evolutions that came before, voice technology poses its own unique challenges to brands, particularly in considering how to develop and deliver information so that it leverages the channel and elevates the content. This technology is young, and as we discovered, still very limited, and many are getting caught in the trap of trying to apply screen-rules to voice in an emerging voice-driven world.
How voice technology works
The status and limitations of the technology
Current applications and examples
Its influence over how we receive information
Tips for creating a good experience
Predictions for the future
Nick Howden, CEO, Real Thing
Melinda Jennings, Copywriter, Avion Communications
Ed Sartori, Senior Digital Producer, Versa Agency
You speak, and it responds - how exactly does voice technology work?
To put it simply, there are three layers to voice technology -
1. Speech-to-text - input
The first step involves converting the audio - what you say - into text.
2. AI - search and discover
Once converted to text, the AI engine works to interpret definitions and meaning, and searches to produce the correct response.
3. Text-to-speech - output
The response is first identified as text, and to be delivered it needs to be converted back into audio.
Voice also runs on the same web technologies as websites - HTTP and HTML - and while speech-to-text technology has been around for some time, AI and text-to-speech are where huge strides have been made in recent years, enabling Silicon Valley to bring to market Alexa, Google Home and HomePod.
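The three layers can be sketched as a simple pipeline. This is only an illustration: every function name here is hypothetical, the speech-to-text and text-to-speech steps are stubs that pass text through as bytes, and the 'AI' layer is a keyword lookup standing in for a real engine.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    name: str
    query: str


def speech_to_text(audio: bytes) -> str:
    """Layer 1 (stub): a real system would run speech recognition here."""
    return audio.decode("utf-8").strip().lower()


def interpret(text: str) -> Intent:
    """Layer 2 (stub): map the transcript to an intent by keyword."""
    if "weather" in text:
        return Intent("weather", text)
    if "time" in text:
        return Intent("time", text)
    return Intent("unknown", text)


def respond(intent: Intent) -> str:
    """Layer 2, continued: find the text response for the intent."""
    responses = {
        "weather": "Here is today's forecast.",
        "time": "Here is the current time.",
    }
    return responses.get(intent.name, "Sorry, I did not understand that.")


def text_to_speech(text: str) -> bytes:
    """Layer 3 (stub): a real system would synthesise audio here."""
    return text.encode("utf-8")


def handle_request(audio: bytes) -> bytes:
    """Run one request through all three layers in order."""
    transcript = speech_to_text(audio)
    intent = interpret(transcript)
    return text_to_speech(respond(intent))
```

Calling handle_request(b"What is the weather like today?") runs the input through each layer in turn and returns the response as bytes.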
How sophisticated is the technology now? What are its limitations?
While the hardware for the mainstream devices is impressive (able to pick up voice from across the room, cancel out background noise, and hear you over the top of music), the technology is young, evolving and, unfortunately, limited and ‘stupid’. Despite impressive campaigns intimating that Google Home or Alexa could be the Jarvis to your Iron Man, the reality is different.
Currently the market options can only complete very basic tasks, or skills - e.g. ‘What is the weather like today?’, ‘What time is it?’ or ‘Play ‘OK Computer’ by Radiohead’. These skills require only one or two interactions. Once you start asking for something more complex - a multistep dialogue - or attempt to switch skills before completing your original request, the technology becomes confused.
Even creating new skills or content within these limitations can be difficult for developers working on voice-tailored projects for brands. The bottlenecks holding up the technology are the differences between the market leaders (if there were uniformity it would likely progress faster), and AI.
As we learned from our AI event last year, developing algorithms that can truly derive context and meaning in order to provide an intelligent response is one of our greatest technological challenges. There are smaller companies that are managing to overcome the limitations of the more highly publicised devices. Real Thing have developed specialised systems for accessibility and specialist industries that do allow for multistep dialogue, or even switching in and out of skills.
How can you create a good experience now?
The risk of a bad user experience in voice is higher than for screen. Unless you invest more in a custom solution through a smaller company (like Real Thing), you need to work within the limitations of Echo and Google Home. The tolerance for a bad experience is also much lower for voice users - they are likely to be less familiar with voice than they are with screen, but also expect it to be the quicker, easier option, making your window for success smaller.
To give your skill the best chance of succeeding, be sure to do the following -
Use natural language - write for conversation, and embrace long-form
While some might bark orders at their phone, the best experience is one that feels conversational. To create that you need to use copywriters, and you need to do your research. Look into any forums related to your industry and observe how your audience are using language.
This also has implications for SEO, which is now rewarding natural language and longer-form content.
Determine every possible iteration of your skill request, and develop a response to match
Did you know there are over 40,000 ways to request a flight? They include airlines, times, days, locations, and then mannerisms and phrasings for each combination. You need to identify each of the iterations for your skill, and develop responses to match. Nothing turns a user off quicker than, ‘Sorry, something went wrong. Please repeat your request.’
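To see how quickly the iterations multiply, here is a minimal sketch that expands a few phrasing templates against a few slot values. The phrasings, cities and days are made up for illustration and are not taken from any real skill.

```python
from itertools import product

# Illustrative values only: a real skill would have far more of each.
PHRASINGS = [
    "book me a flight to {city} on {day}",
    "I need to fly to {city} on {day}",
    "find flights to {city} for {day}",
]
CITIES = ["Sydney", "Perth"]
DAYS = ["Monday", "Friday"]


def generate_utterances(phrasings, cities, days):
    """Expand every phrasing against every combination of slot values."""
    return [
        phrasing.format(city=city, day=day)
        for phrasing, city, day in product(phrasings, cities, days)
    ]


utterances = generate_utterances(PHRASINGS, CITIES, DAYS)
```

Three phrasings, two cities and two days already produce 3 x 2 x 2 = 12 requests. Each extra slot, such as airline or time of day, multiplies the total again.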
But also try to do some of the thinking
It is a monstrous and largely manual task identifying the thousands, and thousands, and thousands, and thousands, of different ways different people can ask the same question, so chances are high you may miss one or two. Create shortcuts that allow your skill to fill in the blanks, rather than failing to progress unless it has a 100% match to something it recognises.
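One possible shortcut is approximate matching: accept the closest known phrasing instead of insisting on an exact one. The sketch below uses Python's standard difflib module. The utterance table, the match_intent helper and the 0.8 cutoff are assumptions for illustration, and a production engine would use something far more sophisticated.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical table of known phrasings and the skill each one triggers.
KNOWN_UTTERANCES = {
    "what is the weather today": "weather",
    "what time is it": "time",
    "play some music": "music",
}


def match_intent(transcript: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the intent of the closest known phrasing, or None if nothing is close."""
    candidates = get_close_matches(
        transcript.lower(), list(KNOWN_UTTERANCES), n=1, cutoff=cutoff
    )
    return KNOWN_UTTERANCES[candidates[0]] if candidates else None
```

With this in place, a slightly mangled transcript such as 'whats the wether today' still resolves to the weather skill, while an unrelated request like 'order a pizza' returns None and can be handled with a clarifying question.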
You can do this through account-linking and assumptions. Account-linking may help to pull in some of the user's basic data - full name, birth date etc. - without having to ask too many follow-up questions. These follow-ups can make the experience feel more like a phone conversation, and the point of difference (convenience) is lost.
Assumptions can be built into the AI engine to assist the skill. For example, if a user asks for session times for a movie but does not state the day, you can assume it is for that day and respond with those times, rather than asking.
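A minimal sketch of how account-linking and assumptions might combine: what the user actually said wins, then linked account data, then built-in assumptions. The fill_slots function and the slot names are hypothetical.

```python
from datetime import date


def fill_slots(spoken: dict, profile: dict, today: date) -> dict:
    """Fill in missing details without asking the user follow-up questions."""
    # Built-in assumption: no day stated means the user wants today.
    assumptions = {"day": today}
    # Ignore slots the user left unstated (None) so they cannot overwrite defaults.
    stated = {key: value for key, value in spoken.items() if value is not None}
    # Later dicts take priority: assumptions < linked account data < spoken request.
    return {**assumptions, **profile, **stated}
```

For example, fill_slots({"movie": "Dune", "day": None}, {"name": "Sam"}, date(2018, 5, 1)) returns the movie and the linked name, with the day assumed to be 1 May 2018.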
Don’t use Google Translate
The nuances of language and translation are a key challenge to taking a skill to audiences in different countries, with different languages. To capture these, it is best to employ a team on the ground in each of the countries you wish to target - translation is not a simple conversion.
Use your manners
The unique freedom that comes from interacting with a screen, rather than a person, is why we have keyboard warriors. The same can apply to voice interactions. It’s important to include polite responses as part of your output, not only for these users but for older generations who will speak to technology as if it were a person sitting across from them. Programming greetings, ‘thank yous’ and ‘pleases’ will help to personalise the experience for your audience.
Avoid a branded voice, for now
For around fifty thousand dollars, brands can get their own ‘voice’. There are benefits to having your own - it can mirror your brand more closely, and appeal to your specific audience. However, because of the limitations of the technology, the risks of errors and a bad experience are high, and you may find that the voice you have invested in becomes associated with failure. So for now, let Alexa take the heat.
Remember that you’re catering to different senses and skills
Voice is a whole new experience compared to screen, using entirely different human skills and senses - speech and hearing, rather than sight. Just as you can’t directly translate into another language without consulting cultural norms and mannerisms, you can’t do a direct translation of your screen content to voice. Imagine trying to maintain the attention of your user while Alexa reads out a full web page. Consider the best applications of voice for your users’ needs and develop an experience tailored to help fulfil and achieve these.
What are the implications for advertising and the way we consume content?
Facebook’s influence on the US election has shown us how our digital environments can shape, influence and narrow the information we receive, and in turn how we perceive the world around us.
Voice creates a context where overt advertising is likely to be less tolerated. When we type our searches, we are presented with the top 20 ‘organic’ results, and paid results. We’re unlikely to sit patiently while Alexa reads out these results in the same way, and this poses a challenge to the way search engines derive much of their revenue. The predicted effect is that advertisements will become more and more intertwined with our ‘regular’ content.
In a post-Cambridge Analytica world - what does technology like this mean for privacy?
In 2009 Google launched Google Voice, a call transfer and voicemail system. It was exclusive to the US and Canada, and millions of people signed up to use the free service over the 3 years that it ran. Google recorded all of the calls for their millions of customers, and used this data to inform the machine learning that is now the basis of Google Home, leapfrogging voice technology leaders like Nuance. (If it’s free, you are the product, right?)
In the case of Amazon Echo, there are two chips: one that listens and records specifically for commands, and another that listens and records everything else. There are laws around the retention of data for the second chip, which should delete data that isn’t useful, but as we’ve learned through Google, listening and recording interactions can be useful in terms of improving the machine learning.
The lesson? Whilst there are legal safeguards and ethics in place, it is worth investigating what you are signing up to when you purchase or engage with this technology.
Where is voice technology heading?
Young and stupid might be what it is now, but if a demonstration by Real Thing and the predictions from our panel are anything to go by, it’s unlikely to stay that way for much longer. These were the key insights:
Language nuance, tone, and accents are the next frontier
Hardware seems to be the strong suit for this technology at the moment and it’s only going to get better when buoyed by increasingly sophisticated machine learning. Amazon are even working on developing voice profiles, to help safeguard certain features and information (so your 12-year-old can’t buy that bike on Amazon Prime).
The future of work
Voice seems to have a similar appeal to Google Glass for industry sectors that work with their hands. There are multitudes of opportunities for voice technology to support workers in construction, logistics, engineering and more. One such example was aged care, where carers are often expected to work under stressful conditions, and above their education and skill level. Voice can not only assist them in helping patients, or residents, but record their activities to improve accuracy in administration (a key component of funding).
A dominant search channel
In the last 10 years, mobile has superseded desktop for searches and it’s predicted that voice will soon secure a large portion of that pie.
Current screen readers are fairly basic and unengaging. Voice technology is enabling the vision impaired to interact with content in a way that is tailored to their strongest senses, and allowing them to complete tasks with an ease they otherwise would not have.
Nobody likes an awkward pause, and with Alexa and Siri there are many, as the engine ticks over looking for what you have requested and converting it to an audio response. Developers are working on how to fill these gaps, either with conversation (a bit of banter) or by enabling users to ask for other information while it completes the original request.
This report was produced prior to Google unveiling its polarising Duplex software at I/O 2018, where its Assistant called a hair salon to book an appointment and a restaurant to make a reservation. The people on the other end of the phone were not informed, and did not realise, that they were conversing with a voice assistant. Following the widespread public backlash, Google has just reiterated that Assistant would explicitly identify itself at the top of the call rather than just say it was calling on behalf of a user.