The xvoice-sphinx Project: detailed status

Our most recent development efforts focused on generating a good language model (LM). Sphinx comes with a small LM for testing purposes, but we need one that lets the user dictate a full range of English text.

We were experimenting with Usenet posts as our corpus. The source is available at http://www.ai.mit.edu/~jrennie/20Newsgroups/ and there is information on how to use it.

We have been using the CMU-Cambridge Statistical Language Modeling toolkit to generate the LM.
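For orientation, here is a minimal sketch of the sequence of toolkit programs typically chained together to go from a text corpus to an ARPA-format LM. It is not the donated scripts described below, and it may differ from what they do. The tool names and the flags shown (-top, -vocab, -idngram, -arpa) come from the toolkit; the function names, the file-naming scheme, and the default vocabulary size are assumptions made for illustration.

```python
import subprocess


def slm_commands(corpus, stem, top=20000):
    """Return the toolkit stages as (argv, stdin_path, stdout_path) tuples.

    The intermediate file names derived from `stem` are an assumption of
    this sketch, not something the toolkit requires.
    """
    wfreq = stem + ".wfreq"
    vocab = stem + ".vocab"
    idngram = stem + ".idngram"
    arpa = stem + ".arpa"
    return [
        # Count how often each word occurs in the corpus.
        (["text2wfreq"], corpus, wfreq),
        # Keep the `top` most frequent words as the vocabulary.
        (["wfreq2vocab", "-top", str(top)], wfreq, vocab),
        # Turn the corpus into id n-grams over that vocabulary.
        (["text2idngram", "-vocab", vocab], corpus, idngram),
        # Estimate the model and write it out in ARPA text format.
        (["idngram2lm", "-idngram", idngram, "-vocab", vocab, "-arpa", arpa],
         None, None),
    ]


def run_pipeline(corpus, stem, top=20000):
    """Run each stage in order, stopping at the first one that fails."""
    for argv, stdin_path, stdout_path in slm_commands(corpus, stem, top):
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle is not None:
                    handle.close()
```

Calling run_pipeline("corpus.txt", "news") would need the toolkit binaries on the PATH; both file names are placeholders. The sketch leaves out how the special words file is passed to the toolkit.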

Jonathan Young kindly donated some scripts for use in building an LM with this toolkit. I (Jessica) was able to build an LM with them, but Sphinx segfaulted when it was handed the LM. Jonathan believed that my specialwords.txt file did not contain all of the special words it needed (words like <sil> to denote a pause in speech), but I was unable to determine which words it was missing.
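One way to narrow down candidates for the missing words is to list every word in the LM that appears in neither the pronunciation dictionary nor specialwords.txt. The sketch below does that under several assumptions that were not verified against the actual files: the LM is in ARPA text format, specialwords.txt holds one word per line, the dictionary uses the CMU layout (word first, ";;;" comment lines, "WORD(2)" for alternate pronunciations), and dictionary lookups can ignore case. All function names are hypothetical. It does not confirm that such words are what caused the segfault.

```python
import re


def arpa_unigrams(lines):
    """Collect the words listed in the 1-grams section of an ARPA-format LM."""
    words = set()
    in_unigrams = False
    for line in lines:
        line = line.strip()
        if line.startswith("\\"):
            # Section markers look like \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams and line:
            fields = line.split()
            if len(fields) >= 2:
                # Fields: log probability, word, optional back-off weight.
                words.add(fields[1])
    return words


def dictionary_words(lines):
    """Collect lowercased head words from a CMU-style dictionary."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        head = line.split()[0]
        # Alternate pronunciations are marked as WORD(2), WORD(3), ...
        head = re.sub(r"\(\d+\)$", "", head)
        words.add(head.lower())
    return words


def unresolved_words(lm_lines, dict_lines, special_lines):
    """Return LM words found in neither the dictionary nor the special words."""
    special = {w.strip() for w in special_lines if w.strip()}
    known = dictionary_words(dict_lines)
    return sorted(
        w for w in arpa_unigrams(lm_lines)
        if w not in special and w.lower() not in known
    )
```

Each argument is any iterable of lines, so open file objects work directly, for example unresolved_words(open("my.arpa"), open("cmudict"), open("specialwords.txt")), where the first two file names are placeholders.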

Steven Benner contributed some more scripts. His Java program to build a dictionary ran for several days on my 650 MHz machine before I gave up and killed it. I was able to run the rest of his scripts using the (presumably incomplete) dictionary that resulted, and did get a working LM.

I never did determine what exact blend of these scripts would give an error-free build.

The various files I used to generate an LM are also available. The tarball is rather large because it contains my corpus (after it was run through some Perl scripts).

You may also want a copy of the CMU unstressed dictionary.
