Webapps You Can Talk to, Webapps That Talk Back: Using the New Voice APIs

You’re making a complicated meal for someone you’d really like to impress. There’s a great recipe online, but using your phone whilst cooking is a total no-go. Your hands are wet and dirty, and you want to be watching the food, not squinting at tiny text on a screen. Wouldn’t it be handy if the website could read out instructions for you, and let you navigate with voice commands?

This is exactly the problem I set out to solve on a recent Guardian Hack Day. With the new Web Speech API, I wrote an app, ‘Hands Free Recipes’, that could read out the steps of a recipe and respond to hands-free voice commands. In this blog post, I’m going to show you how you too can experiment with this exciting new technology.

Overview of the API

Web Speech is an as-yet-experimental API that lets JavaScript applications output synthesized speech and respond to speech input in turn. Because the API is still in its formative stages, browser support is essentially limited to Chrome, but the feature is under active development in Firefox, and other browser vendors have made positive noises too.

Because the specification is still under review, some details in this post may fall out of date. For now, I’d advise against using the API in a serious production application: features could still change, or even be withdrawn, over time.
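
If you do want to experiment, it’s worth feature-detecting before wiring anything up. Here’s a minimal check (assuming Chrome’s prefixed recognition constructor):

const supportsSpeech =
  'speechSynthesis' in window && 'webkitSpeechRecognition' in window;

if (!supportsSpeech) {
  console.warn('The Web Speech API is not available in this browser');
}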

Speech synthesis

Making the browser talk is easy.

const utterance = new SpeechSynthesisUtterance('War. War never changes');
window.speechSynthesis.speak(utterance);

The utterance also has various fields you can play around with to change the way it sounds –

utterance.volume = 0.5;   // 0 is silent, 1 is loudest
utterance.rate = 5;       // 0.1 is slowest, 10 is fastest
utterance.pitch = 1;      // 0 is lowest, 2 is highest
utterance.text = 'Blah';  // you don't have to pass the message into the constructor
utterance.lang = 'en-US'; // gives the browser a clue which pronunciation to use
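
Speaking is asynchronous, so if you need to know when the browser has finished talking (to queue up the next instruction, say), you can listen for the utterance's end event. A quick sketch:

utterance.onend = function () {
  console.log('Finished speaking');
};
window.speechSynthesis.speak(utterance);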

There’s a very comprehensive run-through of the synthesis API over on SitePoint.

Speech recognition

The recognition API is a bit more complicated.

As I speak into the microphone, the recognizer builds up a list of words and begins trying to identify each one asynchronously. Sometimes, identifying one word helps identify another: if my first word is ‘I’, then my second is more likely to be ‘am’ than ‘an’. As words are identified, the recognizer raises result events containing the full list of results and an index pointing to the result that has just changed. Each ‘result’ is itself a list of candidate words – the most likely match first, followed by less likely alternatives. Once a result will no longer change – either its identity is certain, or there is no way to resolve it further – it is flagged as final.
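
In code, the raw API looks roughly like this (a sketch, assuming Chrome’s prefixed constructor):

const recognition = new webkitSpeechRecognition();
recognition.continuous = true;     // keep listening, rather than stopping after one phrase
recognition.interimResults = true; // raise events for guesses that may still change

recognition.onresult = function (event) {
  // event.resultIndex points at the first result that has changed
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i]; // a list of alternatives, best match first
    console.log(result[0].transcript, result.isFinal);
  }
};

recognition.start();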

This is a very rich API that makes perfect sense given the complexities of speech recognition. But it is not exactly suited to our needs – we just want to dictate commands, and have each command drive an action. In the example below, I’m going to show you how you can build a proxy over the API to achieve just that.

The application comprises five key classes:

  • Microphone. Listens for voice input, and broadcasts events when it hears certain commands.
  • Monitor. Displays words heard on screen, so the user gets feedback for their input.
  • Book. Keeps track of where we are in a recipe, and the current ‘page’ we’re on.
  • Speaker. When the book’s ‘page’ changes, it speaks it aloud. It also repeats the page when it hears a ‘repeat’ command.
  • Page. Shows the book’s ‘page’ on screen, in case the user didn’t hear the instruction.

We’re also going to need an event emitter to tie them all together. Using events lets us keep the various classes nicely decoupled, and allows us to ‘plug in’ different kinds of input and output easily. In this instance, I’ll just be using EventEmitter – it’s one of the simplest.

Most applications that use voice will also use other input and output types for accessibility reasons (and for the sake of browser support), so making things ‘pluggable’ is really helpful.
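
The stubs below assume a shared emitter called event, with on and emit methods. Any emitter library with that surface would do; here’s a minimal stand-in so the examples are self-contained:

const event = {
  handlers: {},
  on(name, handler) {
    (this.handlers[name] = this.handlers[name] || []).push(handler);
  },
  emit(name, payload) {
    (this.handlers[name] || []).forEach((handler) => handler(payload));
  }
};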

Microphone

function Microphone() {
  // Chrome currently prefixes the recognition constructor
  const recognition = new webkitSpeechRecognition();
  recognition.continuous = true;

  recognition.onresult = (e) => {
    const word = e.results[e.resultIndex][0].transcript.trim().toLowerCase();
    event.emit('heard', word);
  };

  recognition.start();
}

Monitor

function Monitor() {
  // assumes an element with id="monitor" exists to echo what was heard
  const el = document.getElementById('monitor');
  const printWord = (word) => { el.textContent = word; };

  event.on('heard', printWord);
}

Book

function Book(pages) {
  // 'pages' is assumed to be an array of instruction strings
  let current = 0;
  const processCommand = (word) => {
    if (word === 'next' && current < pages.length - 1) { current++; }
    else if (word === 'previous' && current > 0) { current--; }
    else { return; } // ignore anything that isn't a navigation command
    event.emit('newPage', pages[current]);
  };

  event.on('heard', processCommand);
}

Speaker

function Speaker() {
  let lastPage = '';
  const speakPage = (page) => {
    lastPage = page;
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(page));
  };

  event.on('newPage', speakPage);
  event.on('heard', (word) => { if (word === 'repeat') { speakPage(lastPage); } });
}

Page

function Page() {
  // assumes an element with id="page" shows the current instruction
  const el = document.getElementById('page');
  const displayPage = (page) => { el.textContent = page; };

  event.on('newPage', displayPage);
}
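
Finally, here’s how the pieces might be wired together. The recipe steps are made up purely for illustration:

const pages = [
  'Preheat the oven to 180 degrees',
  'Dice the onions',
  'Fry the onions until golden'
];

Microphone();
Monitor();
Book(pages);
Speaker();
Page();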
