28.09.14
Big data helping speech recognition become mainstream
Source: National Health Executive September/October 2014
Steve Young, Professor of Information Engineering at the University of Cambridge, and a global expert in speech recognition technologies, gives his thoughts on the advances and challenges facing this ‘growing’ research area. David Stevenson reports.
University of Cambridge’s Professor Steve Young will be the 2015 recipient of the Institute of Electrical and Electronics Engineers (IEEE) James L Flanagan Speech and Audio Processing Award.
The annual prize is given to an individual or small team for “an outstanding contribution to the advancement of speech and/or audio signal processing”. For the last 35 years, Young, who is Professor of Information Engineering at the University, has focused his attention on developing systems that allow humans to interact with machines using voice.
He told NHE that research in this area made steady but not spectacular progress from the mid-1980s to the mid-2000s. “But over the last five to 10 years we’ve seen really quite significant acceleration in progress,” he said. “And that is why we are now seeing speech recognition coming into the mainstream with services like Apple Siri and Google Now, and the new smart watches that do speech recognition.”
Prof Young, who is the senior pro-vice-chancellor responsible for planning and resources at Cambridge, added that modern systems are built around statistical models that represent the data.
“So the way you build a speech recogniser, essentially, is that you get some data, which is people speaking, you transcribe the data and then you try to model the data and find a way to automatically generate the transcriptions yourself – and then you have a speech recogniser,” he said. “The key to all of that is some quite sophisticated statistical modelling algorithms and the availability of the data.”
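The train-then-transcribe loop Prof Young describes can be sketched at toy scale. The code below is purely illustrative, not his method: a hypothetical "recogniser" fits one mean feature vector per word from transcribed examples, then labels new input by its nearest model. Real systems use far richer models, such as hidden Markov models and neural networks, over real acoustic features.

```python
# Toy illustration of statistical speech recognition: model the
# transcribed training data, then generate transcriptions for new
# data. Short feature vectors stand in for processed audio.

def train(transcribed_data):
    """Fit one mean feature vector (a crude acoustic model) per word."""
    grouped = {}
    for features, word in transcribed_data:
        grouped.setdefault(word, []).append(features)
    return {
        word: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for word, vecs in grouped.items()
    }

def recognise(models, features):
    """Transcribe by choosing the word whose model best fits the input."""
    def distance(mean):
        return sum((f - m) ** 2 for f, m in zip(features, mean))
    return min(models, key=lambda word: distance(models[word]))

# Transcribed "recordings": (feature vector, spoken word).
data = [
    ([1.0, 0.1], "yes"), ([0.9, 0.2], "yes"),
    ([0.1, 1.0], "no"),  ([0.2, 0.8], "no"),
]
models = train(data)
print(recognise(models, [0.95, 0.15]))  # -> yes
print(recognise(models, [0.15, 0.90]))  # -> no
```

The point Prof Young makes about data availability shows up even here: the models are nothing but summaries of the transcribed examples, so more data directly means better models.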
Big data
The expert told us that it is the nature of data, and its wide availability nowadays, that has changed the speech recognition landscape. “When you speak into your phone, the signal is being routed to a server farm somewhere in northern Virginia if you’re Apple or the Arizona desert if you’re Google, and it is being processed there and the result is being fed back to your phone,” said Prof Young.
This allows two things to happen. First, it opens up the possibility of using very powerful computing to recognise people’s voices. Second, and more importantly, the companies are capturing the data.
“When Siri was first launched, for example, it wasn’t that great,” said Prof Young, “but as more people started using it the company was capturing huge amounts of data. And then by collecting the data and upgrading the models, people found the recognition improved, so they used the system more and gave more data. That has happened over a wide range of fields, and it is the ‘big data’ paradigm that we are hearing a lot about.”
He added that the internet is also allowing organisations, be it research or commercial, to collect huge datasets and do things that they could never do before, and that is what has led to a rapid improvement in performance and an “explosion of interest” in the field.
“It is nice, looking back, that we’ve had these bursts of speech recognition research being a very hot topic, then people being disillusioned with it and it going out of fashion, and then coming back,” added Prof Young. “And we are in one of those phases where it is back and people are very interested, especially with the big players investing huge amounts in improving services.”
Dictation and voice recognition in healthcare
Medical dictation has been one of the mainstays of commercial speech applications, he added.
He noted that doctors have persevered, and because they have persevered, “and in some cases had to – particularly in the US where everything has to be recorded – the dictation systems have made progress and been widely used”.
He stated that advances have also been made in transcription and more general conversation systems (where a computer can listen to two humans having a conversation or it can be one of the participants).
In fact, Prof Young feels that the challenges in developing these technologies are now moving more from transcribing the audio into words, which has been the focus for the last 30 years, into actually understanding what the words mean and the semantics behind them, especially with regards to conversational systems.
“I would expect that what has been started by Siri and Google Now is going to expand and we’re going to see a whole plethora of agents being available for having conversations about booking hotels and restaurants,” he said, “but particularly in healthcare, as this is the field which is ripe for providing this type of service.
“I think we’ll start to see these coming in within the next few years in focused application areas and then becoming more and more general and widely acceptable over the next decade.”
Conversational systems
Currently, Prof Young is working on developing conversational systems – not specifically in healthcare yet – to access tourist information.
“For example, finding a restaurant or hotel,” he told us, “and we’ve been working with some automobile companies to develop in-car voice recognition.”
He outlined that with many people now used to satnavs, in the future drivers may be able to talk to their cars and say: ‘I’d like to stop off and have a meal, what is there in the local area?’ The car would then be able to search for and book a suitable venue, after a conversation with the driver about the available options.
“We’re working on that now and many of the algorithms we’re starting to develop are not rule based,” said Prof Young.
“Traditionally these types of things have been developed by a programmer sitting down and writing rules, such as ‘what would the user ask?’ And ‘how should the system respond?’ But this doesn’t scale and the system you deploy doesn’t get any better. What we want to do is deploy systems that learn from their own users and get better and more competent automatically, and that really is the focus of my work now.”
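The contrast Prof Young draws can be sketched in a few lines. This is a heavily simplified, hypothetical illustration, not his group's algorithms: real statistical dialogue systems of the kind he describes are typically trained with reinforcement learning, not a simple score counter. The rule-based agent is frozen at whatever its programmer wrote; the learning agent shifts towards whichever reply its users respond well to.

```python
# Rule-based: fixed behaviour written by a programmer; it never
# improves, however many users it serves.
def rule_based_reply(user_input):
    if "restaurant" in user_input:
        return "Here is a list of restaurants."
    return "Sorry, I didn't understand."

# Learning-based (toy): pick the reply with the best feedback score
# so far, and update the score after every interaction.
class LearningAgent:
    def __init__(self, candidate_replies):
        self.scores = {reply: 0 for reply in candidate_replies}

    def reply(self):
        return max(self.scores, key=self.scores.get)

    def feedback(self, reply, success):
        self.scores[reply] += 1 if success else -1

agent = LearningAgent(["Here is a list.", "Shall I book a table?"])
# Simulated deployment: these users prefer the proactive booking offer.
for _ in range(10):
    r = agent.reply()
    agent.feedback(r, success=(r == "Shall I book a table?"))
print(agent.reply())  # -> Shall I book a table?
```

The rule-based agent would need a programmer to change its behaviour; the learning agent changed its own behaviour from its users' reactions, which is the scaling property Prof Young is after.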
He added that conversational systems are particularly interesting, and believes that the use of automation, if it is done “sensibly and effectively”, could make a big impact in the future care of the elderly and in managing an ageing population.
Despite dedicating 35 years to research in the field of speech recognition, and with his research helping to set global standards for benchmarking systems and being the basis of many commercial systems, Prof Young remains modest about his award, joking that organisations sometimes feel they have to give them out “just because someone has been around long enough”.
Nevertheless, he said he is “humbled” to become the 2015 recipient of the IEEE James L Flanagan Speech and Audio Processing Award.