gnuspeech-contact

Re: [gnuspeech-contact] Synthetic Speech...


From: David Hill
Subject: Re: [gnuspeech-contact] Synthetic Speech...
Date: Wed, 30 Mar 2016 13:14:13 -0700

Dear Alex,

I have copied this response to the gnuspeech list as it will be of general interest, and also encourages others to participate.

You certainly could use gnuspeech to do the sort of thing you discuss in your email (copy below). However, right now you'd have to get yourself a Macintosh running OS X 10.10.x and use the Monet system. Used Mac Pro Towers (say early 2009 versions) are available at Other World Computing (a very reliable company). They are powerful and upgradeable. I have an early 2008 and an early 2009, and have added an SSD to both of them, which gives even better performance than the standard machine.

The platform-independent gnuspeechsa does not yet incorporate the Monet facility, though I believe Marcelo is working on that aspect, judging by some of the image material he has previewed to me.

In order to get different accents, intonation and rhythm, as required for your examples, you may have to get involved in significant manual work, modifying the databases. For intonation, you'd have to create the required intonation contour manually.

However, here are two variants of the "Miss Jones" utterance you suggested using just the basic system and the existing databases:

Attachment: english-mj.au
Description: Basic audio

Attachments: four TIFF images

Attachment: german-mj.au
Description: Basic audio


It would be helpful if the original code were modified to allow a manually constructed intonation contour to be saved. At the moment, if the synthesis is saved to a sound file, the contour is regenerated prior to synthesis, losing any contour that was constructed manually. This is an unfortunate omission, but would be fairly easy to correct. You'll notice that the threatening "German accent" version has two added tonic feet (marked by '*' in the parsed version of the utterance).

In order to make the process easier and less trouble, the user and application dictionaries should be added and made usable. Then particular dictionaries (a lot smaller than the main dictionary) could be set up for particular dialogue and accent requirements.
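A minimal sketch of such a layered lookup, assuming the user dictionary is consulted first, then the application dictionary, then the main dictionary. The entries, the placeholder pronunciation strings and the `lookup` helper are all hypothetical; they do not reflect gnuspeech's actual dictionary format or API.

```python
from collections import ChainMap

# Hypothetical entries; the pronunciation strings are placeholders,
# not gnuspeech notation.
main_dictionary = {"will": "w_i_l", "with": "w_i_dh", "work": "w_er_k"}
application_dictionary = {"with": "v_i_z"}  # e.g. a small staged-accent set
user_dictionary = {"will": "v_i_l"}

# ChainMap searches its maps left to right, so earlier maps override later ones.
pronunciations = ChainMap(user_dictionary, application_dictionary, main_dictionary)

def lookup(word):
    """Return the first pronunciation found, or None if no dictionary has the word."""
    return pronunciations.get(word.lower())
```

With a small per-character or per-accent dictionary placed in front, only the words that differ need to be entered; everything else falls through to the main dictionary.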

The cut-in and phrase echoing would have to be done by synthesising the cut-in phrase and then mixing, or possibly in the future by having two copies of Monet running.
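A minimal sketch of the mixing step, assuming both phrases have already been synthesised to signed 16-bit PCM samples at the same sample rate and loaded into plain Python lists. The `mix_with_offset` function and its parameters are hypothetical and are not part of gnuspeech or Monet.

```python
def mix_with_offset(first, second, offset, limit=32767):
    """Overlay `second` on `first`, starting `offset` samples into `first`.

    `offset` must be zero or positive. Summed samples are clipped to the
    signed 16-bit range so that overlapping speech cannot wrap around.
    """
    length = max(len(first), offset + len(second))
    mixed = [0] * length
    for i, sample in enumerate(first):
        mixed[i] = sample
    for i, sample in enumerate(second):
        total = mixed[offset + i] + sample
        mixed[offset + i] = max(-limit - 1, min(limit, total))
    return mixed
```

The offset would be chosen by hand, or derived from a cut-in marker in the script, so that the second voice starts before the first has finished.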

Having access to the source code and databases, you could in principle create any facilities you needed to facilitate the kinds of dramatic dialogue for which you are looking. Do you have a programmer with whom you could work? It would amount to creating a "dramatic dialogue" application, based on gnuspeech.

Warm regards.

david


On Mar 29, 2016, at 2:53:19 AM, Farlie A wrote:

On 29/03/2016 00:59, David Hill wrote:
Dear Alex,

Yes, indeed. There have been too many alligators in the swamp recently, and something in the speech synthesis line is way overdue. Thanks for the prod! :-)

Real soon now!

All good wishes.

david
--------
David Hill
--------
Simplicity, patience, compassion. These three are your greatest treasures  (Tao Te Ching #67)
---------



Thanks. My query was prompted by the absence of any "free" voices that I could use for free content development, namely the use of synthetic speech for audio-drama. Not many of the voices in Espeak (http://espeak.sourceforge.net/; most work is now being done on the Espeak-ng fork at https://github.com/espeak-ng/espeak-ng), which was about the only "free" synthetic speech generator I could find for Windows XP, sound that naturalistic, which may be a limitation of the speech generation model it uses (formant synthesis). MBROLA (a diphone-based system) sounded better, but can't be used commercially, which means I couldn't really use it for audio-drama purposes for licensing reasons.

Free content audio-drama needs free resources, like sound effects (and this includes free voices for synthetic speech).

Please note I am not in speech research or any related fields professionally.

Here are some issues with dramatic speech that I noted, and which I am not sure current synthetic speech systems handle that well.

1. Different prosody for the same text phrase.
"You will come with us, Miss Jones?" spoken (as British RP) in a comedy sounds very different from "You WILL come WITH US, Miss Jones!" spoken as a command in a heavily German-accented voice in a thriller, despite being nominally the same text.

Another example I can think of is the way that different actors have approached Shakespearean passages. I've heard at least three versions of some of them, which sounded very different.

A phrase like "We have seen the outcome of the dozens of missed opportunities for the government establishment to resolve this unfortunate situation.." is another which has different prosody depending on the context and on who's speaking, which is also related to the next issue I ran up against.

2. Accented and dialect styles.
In an audio-drama script I was working on, one of the characters, although speaking in English, has a Germanic accent for dramatic purposes. In order to get this working in Espeak, I was in effect having to manually re-code English words into an appropriately sounding phonetic form. This could be somewhat automated, but would need information about phoneme mapping between different languages.
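A minimal sketch of how that re-coding might be automated, assuming each word is already available as a list of phoneme labels. The mapping table, the ASCII phoneme labels and the `remap_phonemes` helper are hypothetical illustrations and do not correspond to the phoneme set of Espeak, MBROLA or gnuspeech.

```python
# Hypothetical substitutions for a staged "Germanic" accent.
# The labels are illustrative ASCII names, not a real synthesiser's phoneme set.
GERMANIC_STAGE_MAP = {"w": "v", "th": "z", "dh": "z"}

def remap_phonemes(phonemes, mapping):
    """Replace each phoneme that has a mapping entry; leave the rest unchanged."""
    return [mapping.get(phoneme, phoneme) for phoneme in phonemes]
```

A real mapping would need context-dependent rules as well, since a one-to-one table cannot express changes that depend on the neighbouring sounds.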

In addition, different accents run at different speeds. The example phrase, "We have seen the outcome of the dozens of missed opportunities for the government establishment to resolve this most unfortunate situation!", can be spoken at different speeds even in English. A standard BBC journalistic voice would read it slightly faster than an Irish Republican might, but considerably slower than a British Asian.

There are many "staged" accents/dialects which are not necessarily representative of the language they stereotype, and thus, to do "staged" voices, looking at the original language group may not be the whole story. Some examples (nominally all speaking in English) follow; I've tried to give some sample phrases.

The "Indian doctor" (invented by Peter Sellers among others): "Well, I think you'll find it's not as usual as you think..."

Nanette (the French maid in farces): "B-B-ut, I has only juzt feeneeshed ze floors!"

The mad scientist: "You vill understand the importance of my vork, even if it kills me vurst!"

'Strine (as in broad but clearly staged Australian): "I don't know about you mate, but that fella was a darned lucky one!"

Mummerset (a generic rural dialect used by the BBC amongst other producers, based on a British West Country dialect): "I don't know what the squire was on about. That bull never left the field."

"Pirate" (which is supposedly Bristol/Plymouth but, according to the internet, seems to have been a complete invention for a film): "I'll be thinkin, you'll be more respectful, when the Cap'n brings his friends."

"Countess" (a Central European/Slavic dialect trope used in a lot of vampire films): "So you, wish to view the cryp-t? I would advise against dist-urbing the sleep of my anc-es-tors.."

Whilst professional research has understandably focused on actual languages and dialects, for audio drama work, especially that aiming to emulate earlier production styles, synthetic speech that follows "staged styles" should be considered in my view.

3. Cut-in and echoing.

In dramatic speech, it often happens that one character cuts in on, or echoes the phrasing of, another. This may present an issue for multi-voice synthetic speech generation, as without additional coding (such as marking cut-in and echo points) there isn't an easy way of knowing where a second voice should be overlaid on the first.

An example (both voices are New York, possibly Brooklyn):

DANIELS: "So you went to the Gallery, Joey?"
JOEY: "Yes.. Mr Daniels"
DANIELS: "And?"
JOEY: "And I didn't fi(nd it!)"
DANIELS (cutting in angrily): "I don't want excuses!"
JOEY: "You want me to take a second look?"
DANIELS (nods): "A second, third look until you (find the 'artefact')"
JOEY (echoing over DANIELS): "find the 'artefact'"
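A minimal sketch of reading such marks from a script, assuming that a parenthesised span at the end of a spoken line marks the part the next voice overlaps. The `OVERLAP` pattern and the `split_overlap` helper are hypothetical and not taken from any existing synthesiser or script format.

```python
import re

# Assumption: a parenthesised span at the end of a spoken line marks the
# portion that the next voice cuts in on or echoes.
OVERLAP = re.compile(r"^(?P<clear>.*?)\((?P<overlapped>[^()]*)\)\s*$")

def split_overlap(spoken):
    """Return (clear_text, overlapped_text); overlapped_text is '' if unmarked."""
    match = OVERLAP.match(spoken)
    if match is None:
        return spoken, ""
    return match.group("clear"), match.group("overlapped")
```

The length of the clear part, once synthesised, would give the point at which the second voice should be overlaid on the first.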

I'm not personally aware of synthetic speech generators that can handle cut-in and phrase echoing, even though they occur in normal speech. I am not sure if it could even be done in real-time because of the need to mix two audio sequences.

I said earlier that I am not in speech research professionally, so these issues may have already been dealt with in certain systems.

Alex Farlie