[Asterisk] Amazon Takes Text-to-Speech to Whole New Level with Polly TTS

We've put up a demo on SoundCloud of Amazon's Polly TTS service delivering Yahoo News using Incredible PBX for Wazo. Delivery is not only flawless and almost real time, it's also dirt cheap. The first year is free for up to 5 million characters a month. After that, it's $4 per million characters of text-to-speech.

Demo TTS at SoundCloud: » soundcloud.com/nerduno/a ··· ible-pbx
Setup Instructions for Asterisk/Wazo on PIAF Forum: » pbxinaflash.com/communit ··· e.21318/
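For anyone budgeting, the pricing quoted above can be turned into a quick estimate. This is only a sketch: the function name and the `first_year` flag are made up for illustration, and it assumes that usage beyond the free 5 million characters in the first year is billed at the same $4 per million rate. Check Amazon's current pricing before relying on it.

```python
# Figures taken from the post above; current Polly pricing may differ.
FREE_CHARS_PER_MONTH = 5_000_000   # free tier during the first year
PRICE_PER_MILLION_USD = 4.00       # rate once the free tier no longer applies


def monthly_cost_usd(chars: int, first_year: bool = True) -> float:
    """Estimate one month's TTS bill for `chars` characters of text.

    Assumption (not stated in the post): first-year usage above the
    free allowance is billed at the regular per-million rate.
    """
    if first_year:
        billable = max(0, chars - FREE_CHARS_PER_MONTH)
    else:
        billable = chars
    return billable / 1_000_000 * PRICE_PER_MILLION_USD


if __name__ == "__main__":
    # A small PBX reading a few news stories a day stays inside the free tier.
    print(monthly_cost_usd(3_000_000, first_year=True))
    # The same kind of usage after the first year, at 250,000 characters.
    print(monthly_cost_usd(250_000, first_year=False))
```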
|
|
Google chick IS still the best TTS IMO. Polly's voice made me think of "open the pod bay doors HAL". Too catatonic for me. |
|
to AllThumbs
TTS shootout

A clip with samples of the big four (not identified) is attached. Please post your opinions on intelligibility and naturalness. |
|
Trev (AcroVoice & DryVoIP Official Rep, Premium Member, join:2009-06-29, Victoria, BC)
2017-Apr-27 9:47 pm
The voice behind door number 3 sounds the least unnatural to me. They all still have a strong robotic accent. |
|
RonR join:2003-10-10 Ash Flat, AR
to Stewart
My vote is for number 2. |
|
restamp join:2016-07-22 Pickerington, OH |
to Stewart
4, then 3, then 1, then 2. But I suspect the order would change if we had longer samples. |
|
|
to Stewart
I would favor 3 then 4 as the top two |
|
|
to Stewart
I vote for #4 |
|
|
to Stewart
#4, then #3. |
|
|
to AllThumbs
Re: [Asterisk] Amazon Takes Text-to-Speech to Whole New Level with Polly TTS

My order (best to worst) is 3, 4, 2, 1. |
|
to Stewart
Re: TTS shootout

said by Stewart:
Please post your opinions on intelligibility and naturalness.

I listened repeatedly to the clip on my PC 2.0 speakers, tablet, and cast to my AVR 5.0 speakers (streamed as Linear PCM 2.0 48 kHz). I did not listen to it on a phone handset. I added a star to a sample after each hearing if it still sounded more intelligible and natural than the other samples still earning stars:

1 **** best, ignoring that it sounded like listening on a handset.
2 ** a bit unnatural.
3 *** good, but there is a high frequency noise component.
4 * clearly worst.

OE
|
to AllThumbs
Re: [Asterisk] Amazon Takes Text-to-Speech to Whole New Level with Polly TTS

#1: Amazon, copied directly from Ward's clip.
#2: Google, using Translate interface (may favor intelligibility over pleasantness, compared with OK Google).
#3: IBM, using website demo.
#4: Microsoft, using website demo.
Apple was not included as AFAIK they don't offer a speech synthesis or recognition API to the general public.
All converted to 16-bit 16 kHz mono PCM and normalized to peak at 6 dB below full scale, based on the assumption that the audio was intended for the PBX user and heard on an IP phone or typical SIP app. For PSTN callers, the balance would tip towards #1, which I believe is already limited to 8 kHz sample rate.
IMO, for a short clip where 100% understanding is important (account balance, phone number, contact's name, weather forecast), #4 would best survive impairments such as moderate hearing loss, non-native language, highway or crowd noise. OTOH as restamp noted, for a podcast or audiobook where missing an occasional word is unimportant but natural sound and minimal fatigue are paramount, I'd choose #3. To me, #1 sounds like it was transmitted over G.729 or GSM, which I find very annoying (it wasn't really; those codecs share algorithms also used in speech synthesis).
@OzarkEdge, I greatly respect your opinions and am very curious why your choices were almost the complete opposite of mine. |
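The level matching described above (16-bit PCM normalized to peak at 6 dB below full scale) can be sketched in a few lines of Python. This is not the tool actually used for the clip, just an illustration of the arithmetic; the function name and defaults are made up, and it assumes the samples are already 16-bit signed mono integers.

```python
from array import array


def normalize_peak(samples, target_dbfs=-6.0, full_scale=32767):
    """Scale 16-bit samples so the loudest one sits at target_dbfs.

    -6 dBFS corresponds to a linear factor of 10 ** (-6 / 20), about 0.501,
    so the peak ends up at roughly half of full scale.
    """
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        # Silence (or empty input): nothing to scale.
        return array("h", samples)
    target_peak = full_scale * 10 ** (target_dbfs / 20)
    gain = target_peak / peak
    # Clamp to the int16 range in case a positive target is requested.
    return array(
        "h",
        (max(-32768, min(32767, int(round(s * gain)))) for s in samples),
    )


if __name__ == "__main__":
    quiet_clip = [0, 1000, -2000, 500]
    print(list(normalize_peak(quiet_clip)))
```

Applying the same target to every sample keeps loudness differences from biasing the comparison, which is presumably why the clips were matched this way.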
|
said by Stewart:
@OzarkEdge, I greatly respect your opinions and am very curious why your choices were almost the complete opposite of mine.

Well, it's subjective, I suppose, like all audio evaluation. At first I was unsure how to evaluate your two criteria. Then I decided to focus on naturalness since they are all intelligible more or less... and played them loud on various output devices. I was then able to hear/focus on the differences better. When I went back to my PC, I could then hear/focus on those differences more easily. Then I scored them as noted by repeated hearings until one outscored the others. I was confident from the beginning that #4 is the least natural sounding.

A more robotic voice could be more intelligible, so maybe intelligibility and naturalness (comfort) are somewhat opposing criteria. I would suggest that naturalness is the icing on the intelligibility cake. Which icing one prefers can get into a whole slew of personal biases... but sexy is universally popular.

OE
|
|
to Stewart
FWIW my Amazon clip was converted to GSM from WAV because that was the best format I could come up with that could actually be used with Asterisk. Someone with better expertise probably could have come up with a usable WAV format. I'm all ears on how to do it. |
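One possible answer to the question above, offered as a sketch rather than a tested recipe: SoX can write headerless 16-bit signed linear audio at 16 kHz, which is the layout Asterisk expects for files with the .sln16 extension. The helper below only builds the command line; the function name and file names are hypothetical, and it assumes `sox` is installed and can read the source WAV.

```python
import subprocess


def sox_to_sln16_cmd(src_wav: str, dst_sln16: str) -> list[str]:
    """Build a SoX command converting a WAV file to raw 16 kHz mono PCM.

    Output options must come before the output file name:
      -t raw             headerless output (SoX cannot guess .sln16)
      -r 16000           16 kHz sample rate
      -c 1               mono
      -b 16              16 bits per sample
      -e signed-integer  signed linear encoding
    """
    return [
        "sox", src_wav,
        "-t", "raw", "-r", "16000", "-c", "1",
        "-b", "16", "-e", "signed-integer",
        dst_sln16,
    ]


if __name__ == "__main__":
    cmd = sox_to_sln16_cmd("polly-news.wav", "polly-news.sln16")
    print(" ".join(cmd))
    # Uncomment to actually run the conversion (requires sox on the PATH):
    # subprocess.run(cmd, check=True)
```

Whether the wideband file is an audible improvement depends on the endpoint; a PSTN caller is limited to 8 kHz audio either way.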
|
|
I recently switched my VoIP.ms voicemail to WAV format (lossless)... the voice quality is noticeably better... I find it to be more enjoyable to hear my callers sounding more like themselves. Given my level of usage and the short life of stored messages, I figure the small increase in storage space required is worth it. So, should we test the Amazon sample again in its native WAV format, or at least WAV converted to "16-bit 16 kHz mono PCM and normalized to peak at 6 dB below full scale"? OE |
|
|
said by OzarkEdge:
So, should we test the Amazon sample again in its native WAV format ...?

OK, here you go. GSM vs. sln16. |
|
cb14 join:2013-02-04 Miami Beach, FL |
to Stewart
I listened to it after you published the names, but that does not influence me in this case. My clear preference goes to #2, followed by #4, closely followed by #3, with #1 clearly last. |
|
cb14 |
to Stewart
The first one is horrible, the second one better. |
|
to Stewart
said by Stewart:
OK, here you go. GSM vs. sln16.

#2 (#1v2) sounds more precise than #1 (#1v1). Leaves me wondering about the WAV version, but I suppose that's not your test intent. Back to the original four samples, I felt #2 had some unnatural pacing at certain points.

OE
|
This thread just proves what we all probably already knew. Sound quality is in the ear of the beholder. That's what makes the world go 'round, I suppose. |
|
said by AllThumbs:
Sound quality is in the ear of the beholder.

Yes, but... by my own experience, one can listen casually, or one can listen more intently to discern specific attributes (as much as their ear and brain will permit), and come to different conclusions. Sound is complex... life is complex... that is why we are prone to simplify it with idioms and shortcuts in critical thinking. /political commentary

OE
|
cb14 join:2013-02-04 Miami Beach, FL |
cb14
Member
2017-Apr-29 3:57 pm
Depends on our expectations. For me, intelligibility even under less ideal circumstances is the top priority while I could not care less about pleasantness of a machine voice. But I agree, it's in the ear of the beholder. |
|
|
Turnkey toolkit for Incredible PBX now available on Nerd Vittles: » nerdvittles.com/?p=22087 |
|