
Re: [gnuspeech-contact] GNUSpeech Console Utility


From: David Hill
Subject: Re: [gnuspeech-contact] GNUSpeech Console Utility
Date: Thu, 5 Nov 2009 12:29:22 -0800

Hi John,

The original text-to-speech system on the NeXT, on which the port is based, did address the "question" intonation pattern.

The intonation patterns are affected by the punctuation and by the intonation control parameters. But, properly, only questions expecting the answer "Yes" or "No", and statements expressing uncertainty, really take a rising intonation at the end.

The rampant "up-talk" by the younger generation in Canada is an exception -- everything in "up-talk" gets a rising intonation at the end, perhaps a sign of insecurity in the speaker! :-).

Wh- questions don't show the rising intonation. The system did not make allowance for this distinction -- it would have required some grammatical analysis which we had not tackled, though it should be. It isn't just a matter of detecting the presence of words like "why", "when", "who", "what", and "how", because it is fairly easy to frame a "Yes/No" question that also contains one or more of these words (for example: "Did you tell her when we were supposed to meet?").
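Just to illustrate the trap (a toy heuristic only -- nothing like what the kit does, and it will still be wrong for plenty of real sentences): it is the position of the wh-word, not its mere presence, that matters, so even a crude first cut has to look at how the sentence starts rather than scan for wh-words anywhere in it:

#import <Foundation/Foundation.h>

typedef NS_ENUM(NSInteger, QuestionKind) { QuestionKindYesNo, QuestionKindWh };

// Toy classifier for a sentence already known to be a question.
static QuestionKind classifyQuestion(NSString *sentence)
{
    NSCharacterSet *separators = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSArray<NSString *> *words =
        [[sentence lowercaseString] componentsSeparatedByCharactersInSet:separators];
    NSString *first = words.count > 0 ? words[0] : @"";
    first = [first stringByTrimmingCharactersInSet:
             [NSCharacterSet punctuationCharacterSet]];
    NSSet<NSString *> *whWords = [NSSet setWithArray:
        @[@"who", @"what", @"when", @"where", @"why", @"which", @"how"]];
    // A wh-word in initial position suggests a falling contour; anything
    // else (typically an auxiliary such as "did", "is", "can") suggests a
    // rising, yes/no pattern.  Real grammatical analysis would do better.
    return [whWords containsObject:first] ? QuestionKindWh : QuestionKindYesNo;
}

Even this misclassifies echo questions and the like, which is why proper grammatical analysis would be needed.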

The system also had regular statements and emphatic statements. There should have been a lot more, and the plan was to implement the whole of Michael Halliday's description of the intonation of British English (he wrote an excellent tutorial book, with accompanying taped examples: "A Course in Spoken English: Intonation" -- Oxford U. Press, 1970, SBN [sic] 19 453066 3).

The intonation system was tied to the metrical aspects of English described by a number of British linguists -- most notably Professor David Abercrombie, who was at Edinburgh University. We carried out significant research at the U of Calgary on the rhythm and intonation of British English, and this was used when we spun off Trillium Sound Research and built the original NeXT system. The rhythm and intonation were regarded as significantly effective features of the text-to-speech system, even though the research results and Halliday were only partially implemented. The speech was found to be much less tiring to listen to for long periods than, for example, DECtalk (which was based on MITalk, developed at MIT: "From Text to Speech: The MITalk System," Allen, Hunnicutt & Klatt, Cambridge University Press, 1987, ISBN 0-521-30641-8).

Abercrombie's claim was that spoken British English had "a tendency towards isochrony". Specifically, spoken phrases and sentences could be split into "feet", rather like the bars in music, and the rhythmic "beat" falls on the first syllable of each foot (the stressed syllables dictate where the foot boundaries fall). A tendency towards isochrony then asserts that the beats fall at more regular intervals than would be expected from the differing number of syllables in each foot, and this is because the syllables become shorter as their number increases. American linguists are skeptical about this idea, but our analyses of a corpus of English spoken for the purpose of illustrating intonation revealed that such a tendency definitely exists. You'd think it was an easy enough question to resolve one way or the other, but if you think this you don't know linguists! :-)
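If you want to see what the claim amounts to in numbers (a back-of-the-envelope sketch, not the statistical treatment in the papers below): if every syllable took the same time, a foot's duration would simply scale with its syllable count, so a tendency towards isochrony shows up as the measured foot durations spreading out less than that prediction.

#import <Foundation/Foundation.h>
#include <math.h>

static double stddev(NSArray<NSNumber *> *xs)
{
    double mean = 0.0, variance = 0.0;
    for (NSNumber *x in xs) mean += x.doubleValue;
    mean /= xs.count;
    for (NSNumber *x in xs) variance += pow(x.doubleValue - mean, 2.0);
    return sqrt(variance / xs.count);
}

/* footDurations:  measured duration of each foot, in seconds.
   syllableCounts: number of syllables in the corresponding foot.
   Returns a ratio below 1 when the measured feet are more nearly equal in
   length than a constant syllable duration would predict, i.e. when
   syllables compress as the foot fills up. */
static double isochronyRatio(NSArray<NSNumber *> *footDurations,
                             NSArray<NSNumber *> *syllableCounts)
{
    double totalDuration = 0.0, totalSyllables = 0.0;
    for (NSNumber *d in footDurations) totalDuration += d.doubleValue;
    for (NSNumber *n in syllableCounts) totalSyllables += n.doubleValue;
    double meanSyllableDuration = totalDuration / totalSyllables;

    NSMutableArray<NSNumber *> *predicted = [NSMutableArray array];
    for (NSNumber *n in syllableCounts)
        [predicted addObject:@(n.doubleValue * meanSyllableDuration)];

    return stddev(footDurations) / stddev(predicted);
}

A ratio noticeably below 1 over a decent corpus is the "tendency"; nobody claims the feet come out exactly equal.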

There are several descriptions of the rhythm work we did.  The most complete one, though very academic, is:

JASSEM, W., HILL, D.R. & WITTEN, I.H. (1984) Isochrony in English speech: its statistical validity and linguistic relevance. In Pattern, Process and Function in Discourse Phonology (ed. Davydd Gibbon), Berlin: de Gruyter, 203-225.

but there is a shorter version that summarises the actual research data:

HILL, D.R., WITTEN, I.H. & JASSEM, W. (1977) Some results from a preliminary study of British English speech rhythm. Presented at the 94th Meeting of the Acoustical Society of America, Miami, Dec 12-16, but only a summary appears in the proceedings. The full text is available as U of Calgary Computer Science Dept. Report 78/26/5.

I could send you a draft electronic copy, as I am currently working on putting a copy on the web, but there is also a hard-copy version published as a departmental report.

The intonation work is best accessed through Halliday's book though Craig Taube-Schock's thesis (for which he received the Governor General of Canada's Gold Medal) reports the initial experimental work we did to validate and extend Halliday's descriptions for purposes of computer speech intonation: 

"Synthesizing intonation for computer speech output" Craig-Richard Taube-Schock. M.Sc. Thesis, Department of Computer Science, The University of Calgary 1993, 109 pages.

It is available from Proquest (who archive all university theses in North America) though they have the date as 1994. In implementing the intonation for the TextToSpeech kit, a number of improvements were made that are not written up in the thesis, especially the smoothing of contours.
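To give a feel for what "smoothing of contours" means (an illustrative sketch only -- the kit's actual smoothing is among the things that never got written up): take the piecewise pitch targets and iron out the abrupt steps between neighbours, for instance with a short moving average:

#import <Foundation/Foundation.h>

// Average each pitch target with its immediate neighbours (a 3-point
// window), which removes abrupt steps while keeping the overall
// rise/fall shape of the contour.
static NSArray<NSNumber *> *smoothContour(NSArray<NSNumber *> *pitchTargets)
{
    NSMutableArray<NSNumber *> *smoothed = [NSMutableArray array];
    for (NSInteger i = 0; i < (NSInteger)pitchTargets.count; i++) {
        double sum = 0.0;
        NSInteger n = 0;
        for (NSInteger j = i - 1; j <= i + 1; j++) {
            if (j >= 0 && j < (NSInteger)pitchTargets.count) {
                sum += pitchTargets[(NSUInteger)j].doubleValue;
                n++;
            }
        }
        [smoothed addObject:@(sum / n)];
    }
    return smoothed;
}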

From the original Developer TextToSpeech kit manual:

The Parser Module takes the text supplied by the client application (using the speakText:
or speakStream: methods) and converts it into an equivalent phonetic representation. The
input text is parsed, where possible, into sentences and tone groups. This subdivision is done
primarily by examining the punctuation. Each word or number or symbol within a tone group
is converted to a phoneme string which indicates how the word is to be pronounced. The
pronunciation is retrieved from one of five pronunciation knowledge bases.
The Parser must also deal with text entered in any of the special text modes. For example, a
word may be marked in letter mode, which means the word is to be spelled out a letter at a time, or
in emphasis mode, which means the word is to receive special emphasis by lengthening it and
altering its pitch. The Parser marks the phonetic representation appropriately in these cases.
...

The system attempts to speak the text as a person would. Punctuation is not pronounced, but
is used as a guide to pronounce the text it marks. For example, a period that marks the end of a
sentence is not pronounced, but does indicate that a pause occurs before proceeding to the next
sentence.
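For concreteness, a caricature of that first subdivision step (nothing like the real Parser Module, which consults the pronunciation knowledge bases and special text modes described above -- this just shows punctuation-driven splitting into tone groups and contour selection):

#import <Foundation/Foundation.h>

typedef NS_ENUM(NSInteger, ContourType) { ContourFalling, ContourRising };

@interface ToneGroup : NSObject
@property (nonatomic, copy) NSString *text;
@property (nonatomic) ContourType contour;
@end

@implementation ToneGroup
@end

static NSArray<ToneGroup *> *toneGroupsForText(NSString *input)
{
    NSMutableArray<ToneGroup *> *groups = [NSMutableArray array];
    NSMutableString *current = [NSMutableString string];
    for (NSUInteger i = 0; i < input.length; i++) {
        unichar c = [input characterAtIndex:i];
        [current appendFormat:@"%C", c];
        if (c == '.' || c == '?' || c == '!' || c == ',' || c == ';' || c == ':') {
            ToneGroup *group = [ToneGroup new];
            group.text = [current stringByTrimmingCharactersInSet:
                          [NSCharacterSet whitespaceAndNewlineCharacterSet]];
            // The punctuation itself is not spoken; it only selects the
            // contour (subject to the wh-question caveat above).
            group.contour = (c == '?') ? ContourRising : ContourFalling;
            [groups addObject:group];
            [current setString:@""];
        }
    }
    return groups;
}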

A question mark at the end of a sentence caused the rising intonation of a question to be selected. Another special mode allowed punctuation to be spoken, rather than used to control how the text was spoken. I have put the whole manual on my university web site, where it is easier to find than by digging through the Savannah repository, though it doesn't really address these issues completely (but it is useful for many purposes, and you will find it useful background). Go to:


Select "Published papers" from the left-hand menu, scroll down to section "E. Other publications" and you'll find a whole lot of Gnuspeech-related documents there. The sixth item is "Manual for the original NeXT Developer TextToSpeech kit". Clicking the link witll allow you to download a .pdf file of the whole manual. The five previous links in that section are also useful references for Gnuspeech and will help you in your work on porting the server.

Many thanks for your willingness to get involved. Very much appreciated. Feel free to bug me with any questions/problems that come up.

HTH. All good wishes.

david
---------
David Hill
--------
 The only function of economic forecasting is to make astrology look respectable. (J.K. Galbraith)
--------

On Nov 4, 2009, at 6:21 PM, John Delaney wrote:

Here I was trying to implement a speech synthesis API for a graduate musical synthesis class, and now I'm getting roped into actually working on the project. I'll implement some sort of Parameter class to hold the current intonation parameters; that should be pretty simple.
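Something along these lines is what I have in mind -- the names here (pitch, volume, useQuestionIntonation) are just placeholders, not the server's actual parameter set:

#import <Foundation/Foundation.h>

@interface SpeechParameters : NSObject
@property (nonatomic) float pitch;                 // baseline pitch, Hz
@property (nonatomic) float volume;                // output gain, 0..1
@property (nonatomic) BOOL  useQuestionIntonation; // rise on a final '?'
@end

@implementation SpeechParameters
- (instancetype)init
{
    if ((self = [super init])) {
        // Placeholder defaults; the current server gets its defaults from Monet.
        _pitch = 110.0f;
        _volume = 1.0f;
        _useQuestionIntonation = YES;
    }
    return self;
}
@end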

Would it be possible for the synthesis engine to ramp up the intonation at the end of a sentence whenever there is a question mark? I don't think I have seen a synthesis engine do this yet, and it seems like such a small/easy thing to do.

Perhaps I'll revisit this when I eventually take machine learning classes.

Thank you,
John Delaney

On Wed, Nov 4, 2009 at 5:09 PM, Dalmazio Brisinda <address@hidden> wrote:
Yes, you are correct. All those server methods are yet to be implemented. Currently the server just supports speaking text with the defaults that were taken from Monet. This is certainly one area that could use some filling out, and any contribution would be more than welcome.

Best,
Dalmazio



On 2009-11-04, at 5:54 PM, John Delaney wrote:

Thank you all for your help.  I have switched to using the server method because it's very easy and functional.  Am I mistaken, though, that many of the parameters such as pitch and intonation have not yet been implemented in the server?  I am looking at the server and all the get/set methods just return zero.  I suppose I will need to implement those if this is the case.

On Wed, Nov 4, 2009 at 12:37 PM, Dalmazio Brisinda <address@hidden> wrote:
Have a look at the Linked Frameworks section in the Xcode Groups & Files pane. I've found in the past that, when setting up the project on a different system, I've often had to remove the custom frameworks (Tube and GnuSpeech) and then add them again so that Xcode correctly picks up the new locations -- unless they're in standard system Framework folders. If you would like additional information on Xcode, have a look at the book "Xcode Unleashed" -- there may be others.


[snip]

---------

