diff --git a/README.old b/CHANGELOG
similarity index 55%
rename from README.old
rename to CHANGELOG
index 5f65aa9..1086a9c 100644
--- a/README.old
+++ b/CHANGELOG
@@ -1,7 +1,44 @@
-cmudict Notes
+cmudict Change Log and Notes
 -------------
 
-[20060203] (air)
+Additional Notes
+----------------
+scripts/make_baseform.pl
+Takes a base cmudict and transforms it into Sphinx-compatible format.
+
+scripts/test_dict.pl
+Checks a Sphinx format dictionary for correctness (since Sphinx is
+picky about format). This should not be necessary unless you manually
+modify the Sphinx dictionary. You can use it to check a .handdict file
+for errors (see Logios).
+
+
+ToDo
+----
+1. a version of test_dict for the base dictionary.
+2. consider moving from a flat file to an xml format such as PLS.
+3. consider elininating predictable inflected forms (such as -S -'S
+   -ED) to reduce the size of the dictionary.
+
+
+[200803] (air)
+cmudict has been placed under version control in cmusphinx on Sourceforge.
+Subsequent changes are documented there.
+
+
+[20071029] (air)
+removed some bogus variants
+
+
+[20070921] (air)
+switched the pronunciation order for AND to make reduced form first.
+
+
+[20070607] (air)
+added words from DeepSpeak and TIMIT sx
+
+
+[20060203] (air)
 cmudict.0.6d was the last publicly released version of the
 dictionary. It was used in the lmtool service until 2008.
 
@@ -9,21 +46,21 @@ dictionary. It was used in the lmtool service until 2008.
 cmudict.0.7 was an extension of the 6d version and represented an
 additional year or so of development work, however it was never
 publicly released. At some point it removed from distribution by the
-then maintainer.
+then maintainer.
 
 cmudict.0.7a and cmudict.0.6e are modifed versions of the above
 dictionaries. 0.6e updates 0.6d with new words found in 0.7 as well
 as any corrections to 0.6d entries. It should be considered a
 transitional version of cmudict, or perhaps as a corrected version of
 0.7.
-0.7a reflects corrections to systematic errors as well as the addition
+0.7a reflects corrections to systematic errors as well as the addition
 of new words.
 
 The modifications are as follows:
 
 1) Entries appearing in both 0.6d and 0.7 that had different
 pronunciations were examined and the appropriate version inserted in
-both dictionaries. Notionally this meant taking a corrected entry
+both dictionaries. Notionally this meant taking a corrected entry
 from 0.7 and putting it into 0.6d.
 
 2) 0.7 was examined for variant errors. Specifically, words that had >6
@@ -37,43 +74,53 @@ collapsed.
 corrected. (Specifically McXxxx and O'Xxxxx forms.)
 
 5) Serendipitously discovered errors (encountered in the course of the
-above clean-ups) were corrected.
+above clean-ups) were corrected.
 
 6) Proper names in the top 10k found in the 1990 US Census and not
 already in these dictionaries were added.
 
-[For modifications 2-6, all changes were propagated to 0.6d.]
-??? not sure what this means anymore
+[For modifications 2-6, all changes were propagated to 0.6d.]
+??? not sure what this means anymore
 
-[20070607] (air)
-added words from DeepSpeak and TIMIT sx
 
-[20070921] (air)
-switched the pronunciation order for AND to make reduced form first.
+[19951108 weide]
 
-[20071029] (air)
-removed some bogus variants
+Version 0.4
 
-[200803] (air)
-cmudict has been placed under version control in cmusphinx on Sourceforge.
-Subsequent changes are documented there.
+cmudict.0.1.Z is the first one we put out. cmudict.0.4.Z is the latest
+and most up-to-date) containing approximately 100k words and their
+transcriptions; lists of the words are in cmulex.0.[134].Z. We use
+these dictionaries at Carnegie Mellon in our speech understanding
+systems.
 
-Additional Notes
-----------------
-scripts/make_baseform.pl
-Takes a base cmudict and transforms it into Sphinx-compatible format.
+The phone set for cmudict.0.4 contains 39 phones, a list of which can be
+found in phoneset.0.4.
 
-scripts/test_dict.pl
-Checks a Sphinx format dictionary for correctness (since Sphinx is
-picky about format). This should not be necessary unless you manually
-modify the Sphinx dictionary. You can use it to check a .handdict file
-for errors (see Logios).
+Lexical stress is indicated by means of a numeral [012] attached to a vowel:
+        0 = no stress
+        1 = primary stress
+        2 = secondary stress
 
+Alternate transcriptions are identified with a numeral in parentheses as
+part of the lexical entry.
 
-ToDo
------
-1. a version of test_dict for the base dictionary.
-2. consider moving from a flat file to an xml format such as PLS.
-3. consider elininating predictable inflected forms (such as -S -'S
--   -ED) to reduce the size of the dictionary.
+We generated this dictionary using the following independent sources:
+- a 20k+ general English dictionary, built by hand at Carnegie Mellon
+  (extensively proofed and used).
+- a 200k+ UCLA-proofed version of the shoup dictionary.
+- a 32k subset of the Dragon dictionary.
+- a 53k+ dictionary of proper names, synthesiser-generated, unproofed.
+- a 200k dictionary generated with Orator, unproofed.
+- a 200k dictionary generated with Mitalk, unproofed.
+
+All entries that occur solely in copyrighted sources, like the Dragon
+dictionary, are not currently included in this dictionary. If you have
+words and transcriptions that you would like included in this unrestricted
+resource, please send them to Robert L. Weide (weide@cs.cmu.edu) and we
+will consider them for an upcoming version.
 
+All of the above sources were preprocessed and the transcriptions in the
+current cmudict.0.1 were selected from the transcriptions in the sources or
+a combination thereof. We have removed some potentially unreliable
+transcriptions from this dictionary, including those based on only one
+source, and will reintroduce them once we have verified the transcriptions.
diff --git a/README.developer b/MAINTENANCE
similarity index 97%
rename from README.developer
rename to MAINTENANCE
index 2d9a628..29264c9 100644
--- a/README.developer
+++ b/MAINTENANCE
@@ -1,50 +1,49 @@
-Development and maintenance for cmudict
----------------------------------------
-[20100118] (air)
-
-The maintainer is responsible for acquiring and vetting new entries,
-and for fixing errors that they otherwise encounter.
-
-At this point, the cmudict project has been incrementally
-re-organized; maintenance has been simplified and several aspects have
-been automated. The scripts/ folder contains instructions and scripts
-for routine maintenance. It has everything you should need to get
-started.
-
-Version numbers and files
--------------------------
-There is no particular rule for incrementing the version number. To
-date the minor version (letter suffix) has been incremented to reflect
-changes in maintainers. Major version increments (right now, the
-decimal) are incurred when some (subjectively) large change
-occurs. For example, the 0.6-->0.7 increment was marked by a large
-number of new entries and by the removal of many incorrect entries
-from the preceeding 0.6e version.
-
-The cmudict.*.phones file lists all legal phones, plus their phonetic class.
-The cmudict.*.symbols file lists all legal phonetic symbols (the only
-substantive difference is that stress combinations are explicitly noted).
-
-
-Projects for the ambitious
---------------------------
-
-1. Change the current flat-file version to a database format. This
-should still allow producing a flat file, but it will simplify adding
-useful information to the dictionary. Some possible data includes:
-
-a. part-of-speech information
-b. domain information (e.g., location, medical, non-english, etc)
-c. spelling variants
-d. source information (who, when, ...)
-e. probabilities for pronunciation variants
-
-There's additional stuff that can be done but the above bits seem the
-most useful ones. I also have ideas on how to do it, so feel free to
-get in touch (air at cs cmu edu).
-
-2. Create an OS independent GUI for managing the database. This should
-allow the maintainer to view and modify entries, while dealing with
-bookkeeping. It would be nice if the GUI included a synthesizer so
-that entries can be checked by listening.
-
+Development and maintenance for cmudict
+---------------------------------------
+[20100118] (air)
+
+The maintainer is responsible for acquiring and vetting new entries,
+and for fixing errors that they otherwise encounter.
+
+At this point, the cmudict project has been incrementally
+re-organized; maintenance has been simplified and several aspects have
+been automated. The scripts/ folder contains instructions and scripts
+for routine maintenance. It has everything you should need to get
+started.
+
+Version numbers and files
+-------------------------
+There is no particular rule for incrementing the version number. To
+date the minor version (letter suffix) has been incremented to reflect
+changes in maintainers. Major version increments (right now, the
+decimal) are incurred when some (subjectively) large change
+occurs. For example, the 0.6-->0.7 increment was marked by a large
+number of new entries and by the removal of many incorrect entries
+from the preceeding 0.6e version.
+
+The cmudict.*.phones file lists all legal phones, plus their phonetic class.
+The cmudict.*.symbols file lists all legal phonetic symbols (the only
+substantive difference is that stress combinations are explicitly noted).
+
+
+Projects for the ambitious
+--------------------------
+
+1. Change the current flat-file version to a database format. This
+should still allow producing a flat file, but it will simplify adding
+useful information to the dictionary. Some possible data includes:
+
+a. part-of-speech information
+b. domain information (e.g., location, medical, non-english, etc)
+c. spelling variants
+d. source information (who, when, ...)
+e. probabilities for pronunciation variants
+
+There's additional stuff that can be done but the above bits seem the
+most useful ones. I also have ideas on how to do it, so feel free to
+get in touch (air at cs cmu edu).
+
+2. Create an OS independent GUI for managing the database. This should
+allow the maintainer to view and modify entries, while dealing with
+bookkeeping. It would be nice if the GUI included a synthesizer so
+that entries can be checked by listening.
diff --git a/00README_FIRST.txt b/README
similarity index 87%
rename from 00README_FIRST.txt
rename to README
index b75ac47..337e690 100644
--- a/00README_FIRST.txt
+++ b/README
@@ -17,9 +17,8 @@ time to time a new major version will be released.
 We welcome input from users: Please send email to Alex Rudnicky
 (air+cmudict@cs.cmu.edu).
 
-The Carnegie Mellon Pronouncing Dictionary, in its current and
-previous versions is Copyright (C) 1993-2014 by Carnegie Mellon
-University. Use of this dictionary for any research or commercial
+The Carnegie Mellon Pronouncing Dictionary has been dedicated to
+the public domain. Use of this dictionary for any research or commercial
 purpose is completely unrestricted. If you make use of or
 redistribute this material we request that you acknowledge its
 origin in your descriptions.
@@ -31,7 +30,7 @@ subsequent version. All submissions will be reviewed and approved by
 the current maintainer, Alex Rudnicky at Carnegie Mellon.
 
 ---------------------------------------------------------------------
-The current version of cmudict is now cmudict-0.7b
+The current version of cmudict is now cmudict-0.7b
 
 [First released November 19, 2014]
 
diff --git a/README.weide b/README.weide
deleted file mode 100644
index 70b1d21..0000000
--- a/README.weide
+++ /dev/null
@@ -1,67 +0,0 @@
-Date: 11-8-95
-
-Files: README (this file), cmudict.0.1.Z (compressed), cmulex.0.1.Z,
-cmudict.0.2.Z (compressed), cmudict.0.3.Z (compressed), cmudict.0.4.Z,
-cmulex.0.3.Z, cmulex.0.4.Z, phoneset.0.1, phoneset.0.3, phoneset.0.4.
-
-This directory contains pronunciation dictionaries (cmudict.0.1.Z is
-the first one we put out, cmudict.0.4.Z is the latest and most
-up-to-date) containing approximately 100k words and their
-transcriptions; lists of the words are in cmulex.0.[134].Z. We use
-these dictionaries at Carnegie Mellon in our speech understanding
-systems.
-
-The phone set for cmudict.0.4 contains 39 phones, a list of which can be
-found in phoneset.0.4.
-
-Lexical stress is indicated by means of a numeral [012] attached to a vowel:
-        0 = no stress
-        1 = primary stress
-        2 = secondary stress
-
-Alternate transcriptions are identified with a numeral in parentheses as
-part of the lexical entry.
-
-We generated this dictionary using the following independent sources:
-- a 20k+ general English dictionary, built by hand at Carnegie Mellon
-  (extensively proofed and used).
-- a 200k+ UCLA-proofed version of the shoup dictionary.
-- a 32k subset of the Dragon dictionary.
-- a 53k+ dictionary of proper names, synthesiser-generated, unproofed.
-- a 200k dictionary generated with Orator, unproofed.
-- a 200k dictionary generated with Mitalk, unproofed.
-
-All entries that occur solely in copyrighted sources, like the Dragon
-dictionary, are not currently included in this dictionary. If you have
-words and transcriptions that you would like included in this unrestricted
-resource, please send them to Robert L. Weide (weide@cs.cmu.edu) and we
-will consider them for an upcoming version.
-
-All of the above sources were preprocessed and the transcriptions in the
-current cmudict.0.1 were selected from the transcriptions in the sources or
-a combination thereof. We have removed some potentially unreliable
-transcriptions from this dictionary, including those based on only one
-source, and will reintroduce them once we have verified the transcriptions.
-
-CMU does not guarantee the accuracy of this dictionary, nor its suitablity
-for any specific purpose. In fact, we expect a number of errors, omissions
-and inconsistencies to remain in the current result. We intend to
-continually update the dictionary as we make progress in correcting them.
-We will make subsequent versions available via anonymous ftp, and those
-who would like notification when updated versions are available should
-send email to weide@cs.cmu.edu.
-
-We welcome input from users: send e-mail to Robert L. Weide
-(weide@cs.cmu.edu) if you have comments and suggestions on the content
-of the dictionary.
-
-The Carnegie Mellon Pronouncing Dictionary [cmudict.0.4 and all previous
-versions] is Copyright 1993, 1994, and 1995 by Carnegie Mellon University.
-Use of this dictionary for any research or commercial purpose is completely
-unrestricted. If you make use of or redistribute this material, we would
-appreciate acknowlegement of its origin.
-
-If you add words to or correct words in this dictionary, we would like
-the additions and corrections sent to us (weide@cs) for consideration
-in a subsequent version. All final entries will be approved by Robert L.
-Weide, editor of the dictionary.