New TTS engine: Microsoft Translator

@tomgru

As a follow-up to my previous post, I did a bit more digging and can see that the Project Oxford Bing Voice Output API is also available:

https://msdn.microsoft.com/en-us/library/mt679063.aspx

Whose API endpoint is:

https://speech.platform.bing.com/synthesize

I’m guessing this is a newer API, and I suppose it is the one that will be best supported going forward? In your post, however, the people you spoke with did mention the “/Speak” endpoint. This one is not that endpoint, whereas the one used by the plugin, while a very different URL, does call “/Speak”.

Regardless, I used the documentation for the Project Oxford-based API, where gender can indeed be specified, and built out some PHP code to test. With 100% consistency, my text is converted to audio in the specific “dialect” and “gender” I request.

The key difference is calling this:

https://speech.platform.bing.com/synthesize

Versus:

http://api.microsofttranslator.com/V2/Http.svc/Speak

So, I suppose, if worst comes to worst, the plugin could potentially be updated to use the Project Oxford-based API. However, it seems the limits are a bit lower. I was able to use the same client ID, but had to sign up there to obtain a new, different key/secret to get it going.

I confirm there is no gender in the API.
For French, it is stable, but the voice has changed from female to male.

@lolodomo,

Any chance of considering using / offering the use of the Project Oxford version of the MS API referenced above, so that users can choose the specific voice they prefer?

Here (Dutch) the female voice also suddenly changed to a male voice last week. It is also much slower now: it takes more than 10 seconds before the TTS starts. I also use the Google TTS, which is much faster (immediate response), but I can only use it 2 times per day on one of my Sonos players. I hope the Microsoft TTS can be improved…

I noticed a longer delay between the end of the TTS message and the resumption of the previous playback. I have to check whether they increased the bitrate of the MP3 file.

The delay is also at the start. When I manually run a “say” message, it takes more than 5 seconds before it starts. Before, this was much faster with the MS TTS. Google TTS always reacts instantly. I have made my own personal weather announcement that runs every morning and consists of several separate messages. With Google it’s ready in 15-20 seconds. With the recent MS it takes more than 1 minute… Really annoying, so I switched this one back to Google.

This was another reply from the Microsoft team. Do you want me to ask them about the delay?


Looking at the doc in the link below (Azure Cognitive Services Translator documentation), I could see why this is not clear, as it is not fully explained.

This is a sample API call for a female voice in the en-CA language.

http://api.microsofttranslator-int.com:80/v2/http.svc/Speak?appId=xxxx&language=en-CA&format=audio%2Fmp3&options=MinSize|Female&text=Someone%20at%20the%20door.

In this case, the language query parameter has the value of:
en-CA

The options query parameter has the value of:
MinSize|Female
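
To make the parameter layout concrete, here is a minimal Lua sketch that assembles a request URL with the same query parameters as the sample. `url_escape` and `build_speak_url` are hypothetical helpers written for illustration, not part of the plugin or of any Microsoft SDK. The base URL is the public “/Speak” endpoint mentioned earlier in the thread, and `xxxx` stands in for a real appId.

```lua
-- Percent-encode every byte except RFC 3986 unreserved characters.
local function url_escape(s)
  return (s:gsub("[^%w%-%._~]", function(c)
    return string.format("%%%02X", string.byte(c))
  end))
end

-- Hypothetical helper: builds a Speak URL shaped like the sample above.
-- base_url and app_id are placeholders supplied by the caller.
local function build_speak_url(base_url, app_id, language, options, text)
  local params = {
    "appId=" .. url_escape(app_id),
    "language=" .. url_escape(language),
    "format=" .. url_escape("audio/mp3"),
    "options=" .. url_escape(options),
    "text=" .. url_escape(text),
  }
  return base_url .. "?" .. table.concat(params, "&")
end

print(build_speak_url(
  "http://api.microsofttranslator.com/V2/Http.svc/Speak",
  "xxxx", "en-CA", "MinSize|Female", "Someone at the door."))
```

Unlike the sample, this sketch percent-encodes the `|` separator as `%7C`. That is an assumption on my side: servers normally decode it back to `|`, and the encoded form is easier to pass safely through a shell command.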

Hot diggity dog! That’s it. The documentation does not reference gender and they didn’t state that this was an options parameter in their initial response to you!

I just took this line of code from the Lua file:

local returnCocde = os.execute(SAY_EXECUTE:format(file, file, token, url.escape(text), language, url.escape("audio/mp3"), "MaxQuality"))

I modified the last value (hard-coded), which is what is sent as the “options” query string parameter, so that the line became:

local returnCocde = os.execute(SAY_EXECUTE:format(file, file, token, url.escape(text), language, url.escape("audio/mp3"), "MaxQuality|Male"))

And after an upload and Luup restart, I’m 10 for 10 with the Male voice (based on my settings).

While I didn’t mind my morning weather announcement as a female, my nightly joke of the day, after setting the alarm, comes off so much better in a male voice :slight_smile:

Thanks for looking into this for us!

I’m guessing a quick update to the plugin could be released with a drop-down for gender (assuming all languages support it), so that it can be set dynamically. This may be a bit tricky, though, since only the Microsoft API supports gender (I’m assuming).

The solution could be to finally add a voice/gender parameter to the Say action.
For MS Translator, we will use Male or Female as the parameter value.
For Mary TTS, it will allow choosing the voice.
And for MS Translator, a default gender will be required.

My guess was right.
After checking, the MP3 file is now @128 kbps (MaxQuality).
If I switch to option “MinSize”, bitrate is @32 kbps.

So I fixed the file L_SonosTTS.lua. Delay to resume is now ok.
You can get the last version from the ZIP file you can download at the bottom of this page: trunk – Sonos Wireless HiFi Music Systems

This increased bitrate could explain why it takes more time to start playing. The file to download is bigger. We could decide to switch to option MinSize and have a smaller file and so a faster download…
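
A small worked example of why the bitrate matters for both the download and the resume timing. This is only a sketch under an assumption: that playback duration is estimated from the file size and a constant bitrate. The thread does not show the actual calculation in L_SonosTTS.lua, and `estimate_duration_seconds` and the file size are hypothetical.

```lua
-- Constant-bitrate estimate: bits in the file divided by bits per second.
local function estimate_duration_seconds(size_bytes, bitrate_kbps)
  return (size_bytes * 8) / (bitrate_kbps * 1000)
end

-- Hypothetical file size, chosen only to keep the arithmetic simple.
local size_bytes = 160000

-- Treated as 128 kbps (MaxQuality), this is 10 seconds of audio.
print(estimate_duration_seconds(size_bytes, 128))

-- Treated as 32 kbps (MinSize), the same file looks like 40 seconds,
-- so a wrong bitrate assumption makes the wait before resume 4 times too long.
print(estimate_duration_seconds(size_bytes, 32))
```

The same ratio applies to the download: for a given message, the 128 kbps file is about 4 times bigger than the 32 kbps one.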

I have finally added a parameter “Microsoft option”. By default, the parameter is empty and the result will be a small audio file and no gender specified.
You can set in the plugin UI one of these values: “Male”, “Female”, “MinSize|Male”, “MinSize|Female”, “MaxQuality|Male”, “MaxQuality|Female”.
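
While the field is free text, a small check like the following could guard against typos in the option value. It is only a sketch: `normalize_ms_option` is a hypothetical helper, and falling back to the empty default for an unrecognised value is my assumption, not necessarily what the plugin does.

```lua
-- The values listed above, plus the empty default.
local ALLOWED_OPTIONS = {
  [""] = true,
  ["Male"] = true,
  ["Female"] = true,
  ["MinSize|Male"] = true,
  ["MinSize|Female"] = true,
  ["MaxQuality|Male"] = true,
  ["MaxQuality|Female"] = true,
}

-- Strips whitespace and returns the value if it is one of the allowed
-- options; otherwise returns "" (assumed fallback: small file, no gender).
local function normalize_ms_option(value)
  local cleaned = (value or ""):gsub("%s+", "")
  if ALLOWED_OPTIONS[cleaned] then
    return cleaned
  end
  return ""
end

print(normalize_ms_option(" MaxQuality | Male "))
```

The check is case-sensitive on purpose, since the thread only shows the exact spellings “MinSize”, “MaxQuality”, “Male” and “Female” being accepted.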

You will have to upload the 4 files I have updated: I_Sonos1.xml J_Sonos1.js L_SonosTTS.lua S_Sonos1.xml
Don’t forget to clear your web browser cache, as the JavaScript file is updated.

So with this update, smaller files will be produced by default. That is safer for our Vera, which has so little memory, and by the way I don’t really hear a difference in quality. The other advantage is a shorter delay before the text is heard. It is especially noticeable with a long text.
Of course, if you think the quality is reduced too much, you can set the new option to MaxQuality to restore the previous quality.
And of course, with this update the duration calculation is fixed, meaning a faster resume.
And finally, you can even use the new option parameter to force a male or female voice.

Tomorrow I will replace the free-text field in the UI with a list of choices. That will be easier for everybody.

awesome…thanks guys. Glad my stomping grounds could actually help!

Thanks to everyone who helped figure this out. I’ll see in the AM if the friendly Canadian lady will be telling me the weather (as I set) or if it’s the burly Canuck guy doing so…

Thanks again for your help. Without your “insider” status, it would have taken a while, if ever, to find a resolution.

Files updated, but not tested… need to get home to do that.

Does this look right?

this worked beautifully… thank you!

I also uploaded the 4 files and did some tests. It works indeed, both male and female voice (Dutch).
At “MinSize” the TTS works much faster, but still not as fast as the Google TTS.
Tomorrow morning, when I hear my longer weather report (several sentences), I will know whether the MS TTS is workable this way.
(It’s now 23h in Holland…)

Rather than one single call, you can make several calls, for example one for each sentence. You don’t even have to care about timings; the messages will be read in sequence. Doing it like that, you only have small audio files to download. That is how my weather report is done, and it works well.
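
A rough sketch of the one-call-per-sentence idea, in plain Lua. `split_sentences` and `say_in_parts` are hypothetical helpers, and `say` is a placeholder for whatever function triggers the plugin’s Say action on your system. The split is deliberately naive: it breaks on every “.”, “!” or “?”, so a decimal such as “12.5” would be cut in two.

```lua
-- Splits text after each run of '.', '!' or '?' and trims whitespace.
local function split_sentences(text)
  local sentences = {}
  for chunk in text:gmatch("[^%.!?]+[%.!?]*") do
    local trimmed = chunk:match("^%s*(.-)%s*$")
    if trimmed ~= "" then
      sentences[#sentences + 1] = trimmed
    end
  end
  return sentences
end

-- Calls say(sentence) once per sentence, in order, and returns the count.
local function say_in_parts(text, say)
  local parts = split_sentences(text)
  for _, sentence in ipairs(parts) do
    say(sentence)
  end
  return #parts
end

-- Stand-in for the real Say call: just print each part.
say_in_parts("Good morning. It is 12 degrees! Rain expected later", print)
```

Each part then becomes its own small audio file, which is the point of the approach: nothing large has to be downloaded before the first sentence starts playing.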