Using Text to Speech effectively for phone menus with SSML

I’ve mentioned this before, but I recently began working for a new healthcare company. It’s a fairly small and budget-conscious organization, so paying for a professional voice for every phone menu isn’t going to happen. Instead, an IT employee recorded almost every phone menu. This, of course, isn’t a great use of an IT resource: that person constantly had to take time out of their day to record, rerecord, change a word here, change a word there, change our open hours in the information prompt, etc. My solution, so that I didn’t have to bother that person, was to use Text to Speech.

I began by exploring the options. First I used Amazon Polly. I signed up for an AWS account, learned some of the SSML expressions available with Amazon Polly, and went to work. The results were good, without a doubt, but not great. They weren’t really what you want for phone menus in a healthcare environment, because the voice sounds so painfully robotic.

After exploring some more, I finally landed on the IBM TTS engine. It’s free to use, and even better, you don’t have to sign up for an IBM Cloud account and build an application to use it: IBM has set up a very nice demo application that accepts SSML.

Here are your resources:

IBM Watson Text to Speech Demo Generator

IBM Watson SSML Elements Reference Guide

I would highly recommend reviewing these, and even bookmarking them if you think you will use them often.

So let’s get started.

First I’m going to navigate to the IBM TTS Demo Generator. I’m going to select my favorite voice (personally, I prefer the Lisa V3 voice), and then I’m going to select the SSML option.

Some demo text will be generated with demo SSML. You can see a few things done here, such as the prosody rate adjustments, which control the speed at which the voice speaks, as well as a couple of inserted breaks.

I tend to use the prosody, break, and say-as elements the most for my SSML recordings. Prosody, by definition, is the rhythm and sound of speech. In this SSML editor, we can adjust prosody pitch, rate, and volume using the following attributes:

<speak version="1.0">
  <prosody pitch="150Hz">Transpose pitch to 150 Hz</prosody>
  <prosody pitch="-20Hz">Lower pitch by 20 Hz from baseline</prosody>
  <prosody pitch="+20Hz">Increase pitch by 20 Hz from baseline</prosody>
  <prosody pitch="-12st">Lower pitch by 12 semitones from baseline</prosody>
  <prosody pitch="+12st">Increase pitch by 12 semitones from baseline</prosody>
  <prosody pitch="x-low">Lower pitch by 12 semitones from baseline</prosody>
</speak>
<speak version="1.0">
  <prosody rate="slow">Decrease speaking rate by 25%</prosody>
  <prosody rate="50">Set speaking rate at 50 words per minute</prosody>
  <prosody rate="+5%">Increase speaking rate by 5 percent</prosody>
</speak>
<speak version="1.0">
  <prosody volume="75">Modified volume is 75</prosody>
  <prosody volume="88.9">Modified volume is 88.9</prosody>
  <prosody volume="loud">Modified volume is 90</prosody>
</speak>

The break element is pretty self-explanatory: it inserts a pause in the recording. There are several ways to specify the pause. Personally, I will generally just go by milliseconds.

<speak version="1.0">
  Different sized <break strength="none"/> no pause
  Different sized <break strength="x-weak"/> x-weak pause
  Different sized <break strength="weak"/> weak pause
  Different sized <break strength="medium"/> medium pause
  Different sized <break strength="strong"/> strong pause
  Different sized <break strength="x-strong"/> x-strong pause
  Different sized <break time="1s"/> one-second pause
  Different sized <break time="1500ms"/> 1500-millisecond pause
</speak>

The say-as element allows you to really personalize how something is said. I use this mostly for phone numbers, so I will generally use the digits value of its interpret-as attribute. There is an official telephone value, but I prefer digits because I can better control how the number is read. Refer to the IBM reference for full usage of the say-as element.

<speak version="1.0">
  <say-as interpret-as="digits">123456</say-as>
</speak>
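
As a rough sketch of what that kind of personalizing can look like, a phone number can be split into groups, each wrapped in its own say-as element with a short break in between. The number below is made up, the wording is only a placeholder, and the 250 ms pauses are just a starting point to tune by ear:

```xml
<speak version="1.0">
  <!-- Hypothetical fax number; the group sizes and pause lengths are illustrative -->
  Our fax number is
  <say-as interpret-as="digits">555</say-as><break time="250ms"/>
  <say-as interpret-as="digits">555</say-as><break time="250ms"/>
  <say-as interpret-as="digits">0123</say-as>.
</speak>
```

Grouping the digits this way lets you decide where the caller gets a moment to write things down, rather than leaving the rhythm up to the voice.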

So to start off, I will need to know the transcript of this phone menu. I’m going to provide one here:

Thank you for calling Our Healthcare Group, the office of Dr. This Guy.  If you are experiencing a medical emergency, please hang up and dial 911.  Please listen closely to the following options:
Press 1 to schedule an appointment.
Press 2 to speak to a medical assistant.
Press 3 for our hours, location, and fax number.
For all other calls, please hold on the line for the next available agent.
To repeat this message, please press * now.
Thank you for calling Our Healthcare Group, the office of Dr. This Guy.

Go ahead and throw that into the SSML editor and see what it sounds like.

Not terrible, right? But there are a few issues. She reads 911 as nine hundred eleven. She doesn’t read the * as “star”. She adds in a few breaks that aren’t wanted. And overall she reads really fast.

Let’s go ahead and apply some SSML to see if we can clean it up a bit. I’m going to use the Visual Studio Code editor, but you can use whatever works for you, or even just edit right there in the generator.

The first thing I want to do is slow down the rate of speech. I feel like she was speaking way too fast for our older callers to be able to follow her. I’m going to encapsulate the entire script in a prosody element, and set the rate to -10%.
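
At this stage the script might look roughly like this, with only the wrapper added and the transcript itself still untouched (the comment stands in for the remaining lines):

```xml
<prosody rate="-10%">
  Thank you for calling Our Healthcare Group, the office of Dr. This Guy.
  <!-- ...the rest of the transcript goes here, unchanged for now... -->
</prosody>
```

Because the rate is set once on the outer element, it is easy to try -5% or -15% later without touching anything else.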

Next, because it treated the period in Dr. as the end of the sentence, I’m going to write out the word Doctor.

Going down to the next line, 911 was read as nine hundred eleven, and I need that to be read as 9-1-1. To accomplish this, I’ll use the say-as element with interpret-as set to digits.

As mentioned earlier, the asterisk isn’t recognized, so we need to write out the word star.
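
Taken together, the three text-level fixes so far touch only these lines of the transcript. This is a sketch of just the changed lines; everything else stays as it was:

```xml
Thank you for calling Our Healthcare Group, the office of Doctor This Guy.
If you are experiencing a medical emergency, please hang up and dial <say-as interpret-as="digits">911</say-as>.
To repeat this message, please press star now.
```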

Just that alone actually sounds pretty good. But just to have some fun, I’m going to add in a break near the end for some added effect.

And finally, I want to put a bit of inflection on the word “office” in the last line just to make it a bit more realistic. This prevents us from repeating the same sentence in the same way multiple times. I’ll go ahead and add a comma for a small break after the inflection as well.

Here is our final SSML script:

<prosody rate="-10%">
Thank you for calling Our Healthcare Group, the office of Doctor This Guy.  
If you are experiencing a medical emergency, please hang up and dial <say-as interpret-as="digits">911</say-as>.  
Please listen closely to the following options:
Press 1 to schedule an appointment.
Press 2 to speak to a medical assistant.
Press 3 for our hours, location, and fax number.
For all other calls, please hold on the line for the next available agent.
To repeat this message, please press star now.
<break time="300ms"/> 
Thank you for calling Our Healthcare Group, the <prosody pitch="+2st">office</prosody>, of Doctor This Guy.
</prosody>

Here is the recording before we applied SSML:

Here is the recording after we applied SSML:
