AllThumbs
join:2006-02-07
Charleston, SC

AllThumbs

Member

[Asterisk] Amazon Takes Text-to-Speech to Whole New Level with Polly TTS

We've put up a demo on SoundCloud of Amazon's Polly TTS service delivering Yahoo News using Incredible PBX for Wazo. Delivery is not only flawless and almost real-time, it's also dirt cheap. The first year is free for up to 5 million characters a month. After that, it's $4 per million characters of text-to-speech.
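
For a rough sense of what those prices mean in practice, here is a minimal Python sketch that uses only the two figures quoted above. The function name and the treatment of usage above the free allowance (billed at the same $4 rate) are assumptions for illustration, not Amazon's published billing rules.

```python
# Figures taken from the post; everything else here is a hypothetical illustration.
FREE_CHARS_PER_MONTH = 5_000_000   # free allowance per month during the first year
PRICE_PER_MILLION_USD = 4.00       # quoted price per million characters


def monthly_cost_usd(chars: int, in_free_year: bool) -> float:
    """Estimate one month's TTS bill for `chars` characters of text.

    Assumption: during the free year, only characters above the free
    allowance are billed, at the same per-million rate.
    """
    if in_free_year:
        billable = max(0, chars - FREE_CHARS_PER_MONTH)
    else:
        billable = chars
    return billable / 1_000_000 * PRICE_PER_MILLION_USD


if __name__ == "__main__":
    for chars in (500_000, 2_000_000, 6_000_000):
        print(chars, monthly_cost_usd(chars, True), monthly_cost_usd(chars, False))
```

At a couple of million characters a month, the post-free-year bill under these assumptions stays in single-digit dollars.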

Demo TTS at SoundCloud: »soundcloud.com/nerduno/a ··· ible-pbx

Setup Instructions for Asterisk/Wazo on PIAF Forum: »pbxinaflash.com/communit ··· e.21318/

LSBINTB
join:2014-10-06

1 recommendation

LSBINTB

Member

Google chick IS still the best TTS IMO. Polly's voice made me think of "open the pod bay doors HAL". Too catatonic for me.
Stewart
join:2005-07-13

2 recommendations

Stewart to AllThumbs

Member

to AllThumbs

TTS shootout

yahootest.wav
557,826 bytes
TTS samples
A clip with samples of the big four (not identified) is attached.

Please post your opinions on intelligibility and naturalness.

Trev
AcroVoice & DryVoIP Official Rep
Premium Member
join:2009-06-29
Victoria, BC

Trev

Premium Member

The voice behind door number 3 sounds the least unnatural to me. They all still have a strong robotic accent.
RonR
join:2003-10-10
Ash Flat, AR

1 recommendation

RonR to Stewart

Member

to Stewart
My vote is for number 2.
restamp
join:2016-07-22
Pickerington, OH

restamp to Stewart

Member

to Stewart
4, then 3, then 1, then 2. But I suspect the order would change if we had longer samples.
hwittenb
join:2003-12-20

hwittenb to Stewart

Member

to Stewart
I would favor 3, then 4, as the top two.
VoipisGreat
join:2013-03-25

VoipisGreat to Stewart

Member

to Stewart
I vote for #4

WhyADuck
Premium Member
join:2003-03-05

WhyADuck to Stewart

Premium Member

to Stewart
#4, then #3.
JeanInNepean
join:2012-09-19
Grenoble, FR

JeanInNepean to AllThumbs

Member

to AllThumbs

Re: [Asterisk] Amazon Takes Text-to-Speech to Whole New Level with Polly TTS

My order (best to worst) is 3, 4, 2, 1
OzarkEdge
join:2014-02-23
USA

1 recommendation

OzarkEdge to Stewart

Member

to Stewart

Re: TTS shootout

said by Stewart:

Please post your opinions on intelligibility and naturalness.

I listened repeatedly to the clip on my PC's 2.0 speakers, on a tablet, and cast to my AVR's 5.0 speakers (streamed as Linear PCM 2.0 at 48 kHz). I did not listen to it on a phone handset.

I added a star to a sample after each hearing if it still sounded more intelligible and natural than the other samples still earning stars:

1 **** best, ignoring that it sounded like listening on a handset.
2 ** a bit unnatural.
3 *** good, but there is a high frequency noise component.
4 * clearly worst.

OE
Stewart
join:2005-07-13

2 recommendations

Stewart to AllThumbs

Member

to AllThumbs

Re: [Asterisk] Amazon Takes Text-to-Speech to Whole New Level with Polly TTS

#1: Amazon, copied directly from Ward's clip.

#2: Google, using Translate interface (may favor intelligibility over pleasantness, compared with OK Google).

#3: IBM, using website demo.

#4: Microsoft, using website demo.

Apple was not included as AFAIK they don't offer a speech synthesis or recognition API to the general public.

All converted to 16-bit 16 kHz mono PCM and normalized to peak at 6 dB below full scale, based on the assumption that the audio was intended for the PBX user and heard on an IP phone or typical SIP app. For PSTN callers, the balance would tip towards #1, which I believe is already limited to 8 kHz sample rate.
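
The normalization step described above can be sketched in a few lines of Python. This is only an illustration of peak normalization to 6 dB below full scale on 16-bit samples; the function name is hypothetical, it is not necessarily the tool used for the clip, and it does not cover the resampling or downmixing to 16 kHz mono.

```python
def normalize_peak(samples, target_dbfs=-6.0, full_scale=32767):
    """Scale 16-bit PCM samples so the loudest one peaks at `target_dbfs`.

    `samples` is a sequence of signed 16-bit integers. A level in dBFS maps
    to a linear amplitude of full_scale * 10 ** (dBFS / 20), so -6 dBFS is
    roughly half of full scale.
    """
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence (or empty input): nothing to scale
    target_peak = full_scale * 10 ** (target_dbfs / 20)
    gain = target_peak / peak
    # Round to the nearest integer and clamp to the valid 16-bit range.
    return [max(-32768, min(32767, round(s * gain))) for s in samples]


if __name__ == "__main__":
    print(normalize_peak([0, 1000, -2000]))
```

Applying the same target peak to every sample keeps level differences from biasing a listening comparison, although equal peaks do not guarantee equal perceived loudness.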

IMO, for a short clip where 100% understanding is important (account balance, phone number, contact's name, weather forecast), #4 would best survive impairments such as moderate hearing loss, non-native language, highway or crowd noise. OTOH as restamp noted, for a podcast or audiobook where missing an occasional word is unimportant but natural sound and minimal fatigue are paramount, I'd choose #3. To me, #1 sounds like it was transmitted over G.729 or GSM, which I find very annoying (it wasn't really; those codecs share algorithms also used in speech synthesis).

@OzarkEdge, I greatly respect your opinions and am very curious why your choices were almost the complete opposite of mine.
OzarkEdge
join:2014-02-23
USA

1 recommendation

OzarkEdge

Member

said by Stewart:

@OzarkEdge, I greatly respect your opinions and am very curious why your choices were almost the complete opposite of mine.

Well, it's subjective, I suppose, like all audio evaluation. At first I was unsure how to evaluate your two criteria. Then I decided to focus on naturalness since they are all intelligible more or less... and played them loud on various output devices. I was then able to hear/focus on the differences better. When I went back to my PC, I could then hear/focus on those differences more easily. Then I scored them as noted by repeated hearings until one outscored the others.

I was confident from the beginning that #4 is the least natural sounding. A more robotic voice could be more intelligible, so maybe intelligibility and naturalness (comfort) are somewhat opposing criteria.

I would suggest that naturalness is the icing on the intelligibility cake. Which icing one prefers can get into a whole slew of personal biases... but sexy is universally popular.

OE

AllThumbs
join:2006-02-07
Charleston, SC

AllThumbs to Stewart

Member

to Stewart
FWIW my Amazon clip was converted from WAV to GSM because that was the best format I could come up with that could actually be used with Asterisk. Someone with more expertise probably could have come up with a usable WAV format. I'm all ears on how to do it.
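
One possible route, offered as a sketch rather than a tested recipe: Asterisk can play raw signed-linear files, and sln16 (16-bit, 16 kHz, mono, headerless) is one such format. Assuming the source WAV is already 16-bit 16 kHz mono PCM, the Python below just copies the sample data out of the WAV container. The function name and file names are hypothetical, and a file at any other rate or channel count would need resampling first, which this does not attempt.

```python
import wave


def wav_to_sln16(wav_path, sln_path):
    """Copy the PCM payload of a 16-bit 16 kHz mono WAV into a raw file.

    Returns the number of bytes written. Raises ValueError if the input
    is not in the expected layout, because no conversion is attempted.
    """
    with wave.open(wav_path, "rb") as src:
        layout = (src.getnchannels(), src.getsampwidth(), src.getframerate())
        if layout != (1, 2, 16000):
            raise ValueError(f"expected mono 16-bit 16 kHz PCM, got {layout}")
        frames = src.readframes(src.getnframes())
    # WAV stores 16-bit PCM as little-endian signed samples; the raw file
    # is written with the same byte layout, minus the header.
    with open(sln_path, "wb") as dst:
        dst.write(frames)
    return len(frames)


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        print(wav_to_sln16(sys.argv[1], sys.argv[2]), "bytes written")
```

The trade-off is size: uncompressed 16-bit audio at 16 kHz takes 32,000 bytes per second, far more than GSM.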
OzarkEdge
join:2014-02-23
USA

OzarkEdge

Member

I recently switched my VoIP.ms voicemail to WAV format (lossless)... the voice quality is noticeably better... I find it to be more enjoyable to hear my callers sounding more like themselves. Given my level of usage and the short life of stored messages, I figure the small increase in storage space required is worth it.

So, should we test the Amazon sample again in its native WAV format, or at least WAV converted to "16-bit 16 kHz mono PCM and normalized to peak at 6 dB below full scale"?

OE
Stewart
join:2005-07-13

Stewart

Member

yahootest2.wav
263,654 bytes
said by OzarkEdge:

So, should we test the Amazon sample again in its native WAV format ...?

OK, here you go. GSM vs. sln16.

cb14
join:2013-02-04
Miami Beach, FL

cb14 to Stewart

Member

to Stewart
I listened to it after you published the names, but that does not influence me in this case. My clear preference goes to #2, followed by #4, closely followed by #3, with #1 clearly last.
cb14

cb14 to Stewart

Member

to Stewart
The first one is horrible, the second one better.
OzarkEdge
join:2014-02-23
USA

1 recommendation

OzarkEdge to Stewart

Member

to Stewart
said by Stewart:

OK, here you go. GSM vs. sln16.

#2 (#1v2) sounds more precise than #1 (#1v1). Leaves me wondering about the WAV version, but I suppose that's not your test intent.

Back to the original four samples, I felt #2 had some unnatural pacing at certain points.

OE

AllThumbs
join:2006-02-07
Charleston, SC

1 recommendation

AllThumbs

Member

This thread just proves what we all probably already knew. Sound quality is in the ear of the beholder. That's what makes the world go 'round, I suppose.
OzarkEdge
join:2014-02-23
USA

1 recommendation

OzarkEdge

Member

said by AllThumbs:

Sound quality is in the ear of the beholder.

Yes, but... by my own experience, one can listen casually, or one can listen more intently to discern specific attributes (as much as their ear and brain will permit), and come to different conclusions. Sound is complex... life is complex... that is why we are prone to simplify it with idioms and shortcuts in critical thinking.

/political commentary

OE

cb14
join:2013-02-04
Miami Beach, FL

cb14

Member

Depends on our expectations. For me, intelligibility even under less ideal circumstances is the top priority while I could not care less about pleasantness of a machine voice. But I agree, it's in the ear of the beholder.

AllThumbs
join:2006-02-07
Charleston, SC

AllThumbs

Member

Turnkey toolkit for Incredible PBX now available on Nerd Vittles: »nerdvittles.com/?p=22087