Dear Alex,
I have interpolated my replies into your email, in what follows. On Mar 30, 2016, at 16:30, Farlie A wrote:
On 30/03/2016 21:14, David Hill wrote:
Dear Alex,
I have copied this response to the gnuspeech list as it
will be of general interest, and also encourages others to
participate.
Feel free to copy my further thoughts below.
You certainly could use gnuspeech to do the sort
of thing you discuss in your email (copy below). However,
right now you'd have to get yourself a Macintosh running OS
X 10.10.x and use the Monet system. Used Mac Pro towers (say,
early 2009 versions) are available at Other World Computing
(a very reliable company). They are powerful and
upgradeable. I have an early 2008 and an early 2009, and
have added an SSD to both of them, which gives even
better performance than the standard machine.
Is Monet "free" software, or is at the very least the voice generator
compatible with Creative Commons Share-Alike? That was important
because the aim was to be as 'free' as possible, so that
the output was usable by others (and could be edited in tools like
GarageBand).
Monet, like all the gnuspeech software, is free in the sense that Richard Stallman pioneered: you are free to use it and modify it at your pleasure, but you can't make it non-free. It is also free in the financial sense. The sources are available (see below).
The platform-independent gnuspeechsa does not yet
incorporate the Monet facility, though I believe Marcelo is
working on that aspect, judging by some of the image
material he has previewed to me.
Thanks.
In order to get different accents, intonation and rhythm,
as required for your examples, you may have to get involved
in significant manual work, modifying the databases. For
intonation, you'd have to create the required intonation
contour manually.
Hmm, and as I am not a speech professional, this may be beyond my
level of expertise, other than marking notes in the script as to
intonation intent. Your note about adding tonic feet below is
something I was missing.
Something else that will need to be worked out is how to translate
between Gnuspeech's phoneme names and eSpeak's (based on the
Kirshenbaum encoding; see my other recent e-mail).
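To make the idea concrete, here is the kind of symbol mapping I have in mind. The pairs below are purely illustrative guesses, not the real Gnuspeech or Kirshenbaum inventories, since I don't have both symbol sets in front of me:

```python
# Hypothetical Kirshenbaum -> Gnuspeech symbol map; every pair here is a
# placeholder guess, not taken from either system's actual documentation.
KIRSHENBAUM_TO_GNUSPEECH = {
    "@": "uh",   # schwa (assumed name on the Gnuspeech side)
    "A": "ar",
    "i": "ee",
    "s": "s",
}

def translate(phones):
    """Translate a list of phoneme symbols, leaving unknown ones untouched."""
    return [KIRSHENBAUM_TO_GNUSPEECH.get(p, p) for p in phones]

print(translate(["@", "s", "i", "x"]))  # unknown 'x' passes through unchanged
```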
I think that would be a bad idea. The gnuspeech phonetic representation is well described in the Monet manual. You shouldn't need to arbitrarily change the set of phonetic symbols; that is likely to cause problems and seems pointless. The input is punctuated, plain English text. If you want to modify the phonetic script produced, learn the symbols. They are very intuitive and are documented in the manual.
However, here are two variants of the "Miss Jones"
utterance you suggested
using just the basic system and the existing
databases:
It would be helpful if the original code were modified to
allow a manually constructed intonation contour to be saved.
At the moment, if the synthesis is saved to a sound file,
the contour is regenerated prior to synthesis, losing any
contour that was constructed manually. This is an
unfortunate omission, but would be fairly easy to correct.
You'll notice that the threatening "German accent" version
has two tonic feet added (marked by '*' in the parsed
version of the utterance).
Ah, that's possibly something I also need to look up how to do in
eSpeak: adding tonic feet.
In order to make the process easier and less trouble, the
user and application dictionaries should be added and made
usable. Then particular dictionaries (a lot smaller than the
main dictionary) could be set up for particular dialogue and
accent requirements.
Hmm... would some kind of "Unintophonic" (universal
intonation phonetic encoding) be worth considering, to represent both
sounds and intonation intent? An older speech synth program I found,
called Superior Speech! (running under RISC OS 3 years ago), allowed for
at least 8 different (albeit fixed) intonation pitches on individual
phonemes, as well as some more advanced features for "singing"
phonemes at specific notes (something which I understand is an area
of current research by others). There are some possible encodings,
like X-SAMPA, which incorporate intonation advice. MBROLA (which is
non-free) stores intonation data in a format which works at a much
lower level, so it is possible to do much more finely tuned
intonation contouring, if I understand what that means correctly.
(Thought: if there was a way to add MBROLA's PHO-style data to
GNUspeech/eSpeak input files... hmmm...)
You really need to read the Monet manual. I have just updated my university web site, specifically the page accessed through the left-hand menu selection "Gnuspeech material", to include both the TRAcT and Monet manuals, together with precompiled versions of both TRAcT and Monet. Monet needs Mac OS X 10.10.x or better to run; TRAcT will run on OS X 10.6 or higher. On that same page, in the list of papers relevant to Gnuspeech, there's also a new historical view of the work on intonation and rhythm that may be of interest (the first paper in the list), and there's access to the early data on which the rhythm model was based (the last item, which is a report that was presented at an Acoustical Society of America conference in 1977).
There are a whole bunch more papers, less specific to Gnuspeech, but undoubtedly some of interest, under the left-hand menu selection "Published papers", which takes you to a new main page.
The cut-in and phrase echoing would have to be done by
synthesising the cut-in phrase and then mixing, or possibly
in the future by having two copies of Monet running.
That's what I thought the current situation was likely to need.
However, for audio drama this is less of an issue, given that the
generated speech audio will probably be edited together in a
non-linear way anyway. Marking the cut-ins then becomes a
partitioning(?) issue during the lexical parsing(?) and timecoding in
any automated scripts that would be generated to reassemble the audio
output. MusE (http://www.muse-sequencer.org/) is certainly
scriptable, and depending on programmer interest, it looks possible
that a future gnuspeech might be able to pipe output directly into
the tool via various audio interfaces like LV2, JACK, etc.
Granted, 'scripted' semi-automated editing for cues is outside
your area of focus on the speech generation portion.
Having access to the source code and databases, you could
in principle create any facilities you needed for
the kinds of dramatic dialogue you are looking for. Do
you have a programmer with whom you could work? It would
amount to creating a "dramatic dialogue" application based
on gnuspeech.
I don't yet, but was considering asking around on projects like
Wikipedia/Wikisource/Wikiversity, given that certain aspects of it
are quite broad.
You could put out a request on the gnuspeech list (address in the "Copy" field of this email). People reading the list are quite likely to be interested.
Although not strictly within the remit of Gnuspeech, I am wondering
if anyone collects dialect examples. There was an Australian
Government project to collect examples of Australian English, and
the BBC may have done so in the UK a couple of years ago. I am not
sure how 'free' these examples would be for researchers. I will
also note that there is a Spoken Wikipedia archive at Wikimedia
Commons, which contains audio examples of manually read Wikipedia
articles. (Aside: I wonder if anyone's tried using text-to-speech
for Wikipedia articles?)
Well, there are a number of projects that involve "accessibility" issues. There's the Orca project in the Free Software community, but mainstream manufacturers (Apple, Microsoft) are expected to deal with accessibility. On the Mac there are various accessibility facilities under the "System Preferences" heading "Universal Access".
You may also have an interest in some of the work I and my students did, not just for reading pages, but for editing them:
http://pages.cpsc.ucalgary.ca/~hill/papers/ieee-touch-n-talk-1988.pdf
On a different but related topic... from some of your papers, you
built an approximate tract model. This is presumably flexible
enough to cope with most human characteristics, including voices that
"sound like that guy from the trailer, that's been smoking since he
was old enough to buy them"
(another 'staged' voice type I will add to my earlier examples of
vocal types).
If you read the Monet manual, you'll find that there are various controls to change various aspects of the voice, and yes, they include changing the settings for the tube resonance model (TRM). You can investigate the quality directly by using the TRAcT application to play with the TRM, but it isn't dynamic. Monet is dynamic, can speak, and has an equivalent set of controls.
Call this a very premature April 1st piece if you like (though others
with an interest may want to take this more seriously), but
this got me thinking way outside the box, in respect of what might
be termed "fantasy voices" and how they could be modelled. (Clearly
an involved project for someone creative, given that you'd almost
have to develop a constructed language at the same time in order to
have a dictionary/transliteration to draw on. "Analysis of the
creation of speculative phonology for speech vocalisation in
non-humanoids" almost sounds like an unusual paper ;) )
Presumably, by modifying the tract model you could have "fantasy
voices" based on an alternative evolution of a dominant mammal; at the
very least it would be a matter of scaling certain frequencies, IIRC,
based on some audio reconstruction done in the early 1990s to
recreate mammoth calls. I'm also not a biologist, so whilst I
appreciate that speech capability in humans has certain required
anatomical aspects, I'm not sure precisely what the
characteristics are.
Humanoid-like aliens are also a possibility. The so-termed Nordic
types would probably have a voice closest to human (from a tract
model perspective, based on internet accounts of alleged
encounters). "Stage" aliens, such as in old radio/TV, are from what I
recall mostly accented human language, albeit with much modified
grammar or intonational rhythm. On the other hand, you may have
aliens that have "clicks" in their language (I'm not sure what these
are called in speech/IPA terms) in addition to tonal and noise-based
phonemes.
Clicks are not yet in the repertoire! You'd have to generate them somewhere else, for now, and edit them in. :-(
Nearly all the non-naturalistic 'robot/computer' voices I've heard
in TV/film/radio have different (or at least modified) tone and
intonation contours, with many being post-processed using specific
effects. Although not strictly a "robot", I will note here that both
the Daleks and the Classic Series Cybermen used a ring modulator, and that
Classic-era Galactica Cylons seemingly used a vocoder. I'm not
sure how GLaDOS (Portal) was done, but the intonation contour(?)
is not naturalistic: "Oh, HelLO, itS You AGaiN!" However, in some UK
sci-fi shows the robots speak a natural-sounding stereotyped
British RP, albeit pitched ever so slightly higher than
normal. Example: "Oh really... I don't think that would be possible
at all, Sir."
Some of those effects could be generated, I think. For example, Daleks are basically monotone with a particular voicing frequency.
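As a rough illustration of the ring-modulator idea (my own sketch, not anything built into gnuspeech), multiplying the speech samples by a low-frequency sine carrier gives the classic Dalek-style warble:

```python
import math

def ring_modulate(samples, carrier_hz=30.0, rate=8000):
    """Multiply the signal by a sine carrier -- the classic Dalek-style effect."""
    return [s * math.sin(2 * math.pi * carrier_hz * n / rate)
            for n, s in enumerate(samples)]

# a constant input comes out as a 30 Hz sine, audible as the warble
warbled = ring_modulate([1.0] * 8000)
```

In practice you would run synthesised speech through this rather than a constant, and tune the carrier frequency by ear.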
Getting back to some other thoughts on biological "voices" (these
are all HIGHLY speculative)
Human speech (at least in English) doesn't have a teeth-grinding
component in its phonemes, but if modelling speculative speech for
other biological species this may need to be considered (see my
thoughts on insects below).
Avians (i.e. birds) would have a beak component, in addition to a
throat... I'm not sure if this would be directly comparable to
adding an overly scaled nose, though.
There is already provision for changing the nose shape and creating nasality.
Insectoid (highly speculative): Insect (audio) speech, in some
examples I once considered informally, could, depending on the type,
consist of:
1. Pure tone, albeit very high pitched: the insectoid effectively
singing in a pure tone at very high pitch (maybe ultrasound?), but
slowed down so humans could understand it.
2. Clicks or rasping, caused by moving mouth parts.
3. Drone: this component may be caused by high-speed movement
of part of the body, or an interference pattern set up by vents for
air and so on.
(This is the sawtooth/noise sound you hear when a bee comes close
to your ear, etc.)
Not surprisingly, the best approximation I thought of for a bee-like
"voice" was the output from a not-very-advanced '80s speech
add-on called the Currah MicroSpeech: extending the ends of certain
of the z, s, sh, ch sounds (fricatives?) gives the output a 'buzzy'
quality...
Try playing with the glottal pulse shape.
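For anyone wanting to experiment outside Monet, here is a minimal toy sketch of a Rosenberg-style glottal pulse (my own simplified version, not the TRM's actual waveform); sharpening the closing phase relative to the opening phase is one way to change the voice quality towards buzzy:

```python
import math

def rosenberg_pulse(n_samples, open_quotient=0.6):
    """One cycle of a Rosenberg-style glottal pulse: smooth rise, faster fall."""
    n_open = max(1, int(n_samples * open_quotient))
    pulse = []
    for n in range(n_samples):
        if n < n_open:
            # opening phase: half-cosine rise from 0 up to 1
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n_open)))
        else:
            # closing phase: quarter-cosine fall back towards 0
            pulse.append(math.cos(math.pi * (n - n_open)
                                  / (2.0 * (n_samples - n_open))))
    return pulse
```

Repeating the cycle at the desired pitch period gives a crude voiced excitation to experiment with.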
A sample phrase in a sci-fi audio drama with bees might be "Yooou
hasss beeen iinviiiteeed, yooou wiiil meeet thhheee quuueeen,
oooonccceee yooouu hasss beeen trrraaansssiiitionnnneeddd. Yooou
wwwilll preeepaaaiiiirrr fffoor trrraaansssiitiiiooon" (English:
"You have been invited, you will meet the queen, once you have been
transitioned. You will prepare for transition."), with the intonation
contour emphasising certain vowel frequencies and extended fricatives.
In this instance the specific context is to build a sense of some
fearful act to come.
Can possibly be faked by duplicating the phonetic elements.
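That duplication could even be scripted over the phonetic string; a sketch (the phone symbols and the set of 'stretchable' ones below are made up for illustration):

```python
def stretch_phones(phones, targets=frozenset("iuszv"), factor=3):
    """Fake the droning, extended quality by repeating selected phone symbols."""
    out = []
    for p in phones:
        out.extend([p] * (factor if p in targets else 1))
    return out

print(stretch_phones(list("yus")))  # -> ['y', 'u', 'u', 'u', 's', 's', 's']
```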
Unlike human speech, some fricatives(?) may be a sawtooth sound
rather than noise. (Hmm... maybe you could have "bee" speech where
the drone level indicates the urgency of the speaker? This
suggests a more general thought: looking at how speech
rhythm changes with urgency is something to consider in respect of
'dramatic' speech generation.)
To make "bee speech" (or to have any creature or machine speak) you could simply substitute the noise made by the bee, orchestra, car, or whatever for the pitch pulse input.
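The source-filter separation that makes this substitution possible can be sketched in a few lines. The single two-pole resonator below is a crude stand-in for the real tube model, but it shows how swapping the excitation changes the "speaker" while the resonance (the "vowel") stays put:

```python
import math

def resonate(excitation, freq_hz=500.0, r=0.95, rate=8000):
    """Drive a single two-pole resonator (a toy 'vocal tract') with any excitation."""
    w = 2.0 * math.pi * freq_hz / rate
    a1, a2 = -2.0 * r * math.cos(w), r * r
    y1 = y2 = 0.0
    out = []
    for x in excitation:
        y = x - a1 * y1 - a2 * y2   # standard two-pole recursion
        out.append(y)
        y2, y1 = y1, y
    return out

# voice-like pulse train vs. a sawtooth 'bee drone' through the same filter
pulse_train = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
bee_drone = [(n % 40) / 40.0 - 0.5 for n in range(400)]
voiced = resonate(pulse_train)
buzzy = resonate(bee_drone)
```

The real TRM is a waveguide tube, not a lone resonator, so treat this purely as an illustration of the excitation-substitution idea.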
Most bees would be female-sounding, but in an audio drama attempting
a "special voice" they may be pitched wrongly (some
research on actual bees suggested they 'spoke' at around alto pitch).
I'm not sure how you might do reptilian or snake-like creatures...
maybe by extending the fricatives and vowels as with an insectoid,
but with a different intonation contour. Snakes would be "noisy"
rather than having an insect's sawtooth drone...
Moving into the realm of mythological creatures (albeit
human-ish ones for now): again speculative, and mostly noting the
trope styles used in various radio/TV/film.
Jinn/Genie: convention seems to vary; sometimes they have a
definite Arab dialect, and in others not; sometimes they have a very
deep bass pitching. An example line known to most people that have
seen British pantos is: "(flashbang) Al-A-din! I AM the GENIE of
the LAMP!"
Leprechauns: convention seems to be that these are high pitched,
with a not-quite-child-like Irish dialect. "Well then, you'll be not
getting me pot o' gold!"
Elves: Tolkien-like elves speak like humans, albeit the
pitching may be a tad higher. Tolkien wrote extensively about
Elvish, in both the appendices to The Lord of the Rings and in other
works published since. (Of constructed languages, Quenya probably
has enough information on its phonology for someone to make a
plausible Elvish voice, in my view.)
Fairies: these vary considerably. Some are very human-like,
perhaps with a very slight upward pitching (Oberon and Titania in
Shakespeare come to mind).
Yes, you can best answer your questions by experiment. If you need the source code to do things like substitute the excitation source for the vocal tract, it is available on the Savannah site that is referenced on the website page for Gnuspeech material.
I can probably think of more (or related aspects to ones mentioned)
if anyone is interested in discussing this in more depth seriously.
Alex Farlie.
----------------------------------------------