# HG changeset patch
# User HackEso
# Date 1564836217 0
# Node ID 1dacb6b7788bb42864d793a622ae2b1332346a74
# Parent d6abdb6694662250b919e28a4a038292ef57ea98
fetch share/dict/12dicts/ReadMe.html https://hack.esolangs.org/get/share/dict/12dicts/ReadMe.html

diff -r d6abdb669466 -r 1dacb6b7788b share/dict/12dicts/ReadMe.html
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/share/dict/12dicts/ReadMe.html Sat Aug 03 12:43:37 2019 +0000
@@ -0,0 +1,9313 @@

The 12dicts Word Lists

Introduction

Welcome to version 6.0.2 of 12dicts, a collection of English word lists. It differs in several important ways from most of the other free word lists you can download.

Originally, 12dicts was composed of lists derived from a specific set of 12 source dictionaries. In addition to these "classic" lists, 12dicts now includes lists derived from other sources. It would perhaps be appropriate to rename 12dicts to something more generic, such as BAWL (Beale's Assorted Word Lists), but I have not done so in order to preserve continuity.

The remainder of this document is organized as follows:

This release

This is release 6.0.2 of 12dicts, released in June 2016. It is an update to release 6.0. The following is a brief rundown of the changes and additions in release 6.0 and beyond:

Some general observations

With the exception of the neol2016 list, all the 12dicts lists were assembled in a similar fashion. Words were extracted from a set of source dictionaries and, in most cases, a list was assembled by selecting all words and phrases present in some number of the sources meeting certain criteria. For instance, the 2of12 list comprises lower-case and hyphenated words present in at least two of twelve source dictionaries. For some lists, rules are added establishing exceptions for certain words or classes of words; for instance, the 2of12 list contains the upper-case words I and O as exceptions to its general exclusion of upper-case words and names.

Some lists contain annotations, which are special characters appended to certain words. For instance, the ":" character is used in some lists to identify abbreviations which are ordinarily used without a terminating period. This annotation allows these abbreviations to be distinguished from possibly similar regular words. Another annotation, used in the 3of6game and 3of6all lists, is the "$" character, indicating a word that was placed in the list even though fewer than three of the sources mention it. The "+" and "!" annotations are used to identify signature words and neologisms, as described below. Note that it is possible for a word to have more than one annotation, though this is uncommon. For instance, in the 6of12 list, the word boldfaced~= has both a "~" and a "=" annotation, signifying that the word was an arbitrary choice between two equally attested forms (boldfaced and bold-faced), and that it was not given a separate definition in a majority of the sources listing it.

A number of the lists contain signature words. These are words (or phrases) which do not meet the formal criteria for inclusion in a list, but which I have chosen to add anyway, as words which "ought to be" present. Whether a list contains signature words depends on the specific list. Usually, but not always, a signature word is present in some of the sources used for a list, but not enough of them to qualify for inclusion on that basis. Some lists may "inherit" signature words from other lists from which they were assembled. For instance, the 6phrase list includes the signature words from the 3of6all list. In most cases, signature words are marked with the "+" annotation.

The neol2016 list contains neologisms, words which are not listed in some or all of the source dictionaries for 12dicts, generally for one of two reasons. First, many of the words are recent coinages which were not yet fully recognized by mainstream lexicographers when the 12dicts sources were published. Examples of such words are selfie, Obamacare, emoji and snarky. Other so-called neologisms are well-established, often well-known, words which are considered scandalous, such as sexual slang and ethnic slurs, and which are often deliberately omitted from dictionaries. (I will not give any examples of this sort of word here, but you will find some in the neol2016 list.) Note that the neologism list has been accumulating for about fifteen years now, and some of its words have become almost old-fashioned, such as spam and dotcom. The neologism list is provided so that some or all of its words can be added to the other lists where the intended usage makes that appropriate. However, I have added the single-word neologisms to the 2of12inf and 3of6game lists, as these lists are the most likely to be used in coding word games, where it is desirable to recognize the very latest hot vocabulary. In these lists, neologisms are annotated with the "!" character.

One other observation worth making is about diacritics. Some dictionaries will tell you that there are English words correctly spelled café, naïve, façade and piñata, and I do not wish to disagree with these authorities. But as a practical matter, Americans do not like to use diacritics. Furthermore, they use keyboards which do not contain accented letters, and are often unfamiliar with the often clumsy techniques that their software provides to use such characters. For this reason, 12dicts drops all the accents from its English vocabulary. This is particularly valuable for coding word games, where expecting players to accent the e in cafe is not going to make them happy. (I cannot help pointing out that Scrabble® contains no É tiles.) I apologize to those who consider it a matter of some emotional importance that resume and résumé should be differently spelled.
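
For anyone who needs to match user input or another word source against the accent-free spellings used in 12dicts, the following is a minimal Python sketch of the same kind of accent folding. The function name `strip_diacritics` is hypothetical, and this is not the procedure actually used to build the lists; it simply shows one common way to drop accents using the standard library.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return `word` with combining accent marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for accented in ["café", "naïve", "façade", "piñata", "résumé"]:
        print(accented, "->", strip_diacritics(accented))
```

Letters that are not built from a base letter plus a combining mark (for example "ø") are left unchanged by this approach, so it is only a partial solution for text from other languages.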

The organization of 12dicts

The 12dicts lists are organized into four directories, grouping lists with similar characteristics together. The remainder of this document follows this organization as well. For each directory, a section of the documentation describes in detail the lists it contains.

Most users of 12dicts end up using only a single list. If it is clear which directory will contain the list you need, you can go directly to the appropriate documentation.

The four directories are:

Picking a list to use

If you are not certain which directory might contain the kind of list you are looking for, here is a breakdown of the 12dicts lists by size and purpose which may be helpful. If it does not help you find what you are looking for, you might want to check out the table put together by Kevin Atkinson, which summarizes the characteristics of all the 12dicts files. Also, I suggest reading the introduction to each directory presented in the previous section, each of which contains a table summarizing exactly what you can expect from each list in that directory.

The classic (American) 12dicts lists

The 12dicts project began as the n-dicts project, n being a variable whose value finally stabilized at 12. The purpose of the project was to create a list of words approximating the common core of the vocabulary of American English.

The methodology of the project was to record and correlate the words listed in a number of small dictionaries. The number of dictionaries so recorded ended up as 12, comprising 8 ESL (English as a Second Language) dictionaries and 4 "desk dictionaries". The dictionaries chosen varied widely by publisher, by style, by completeness and by depth. All of them were dictionaries of American English (three from British publishers). The smallest of them contained about 20,000 entries, and the largest 46,000. (In total, there are about 75,000 entries, many of which appeared in only a single dictionary.) All but two of the sources were published between 1992 and 1999, when 12dicts was first released.

The following table summarizes the contents of each of the classic lists, located in the American directory, ordered by size in words:

                     3esl      6of12     2of12     2of12inf
Size (Words)         22,000    32,000    41,000    82,000
Number of Sources    3         12        12        12
American English     Y         Y         Y         Y
British English      -         -         -         -
Ordinary words       Y         Y         Y         Y
Inflections          -         -         -         Y
Hyphenations         Y         Y         Y         -
Phrases              Y         Y         -         -
Names                Y         Y         -         -
Abbreviations        Y         Y         -         -
Acronyms             Y         Y         -         -
Prefixes/Suffixes    -         -         -         -
Signature words      -         Y         Y         *
Neologisms           -         -         -         Y
Annotations          Y         Y         Y         Y

A * in the "Signature words" row means that signature words associated with some other list may be present, but there are no signature words associated specifically with that list.

The 6of12 and 2of12 lists

I initially tried two different ways of winnowing the 12dicts data to produce lists of common words. Both produced interesting results. One list, the 6of12 list, contained all words and phrases listed in 6 of the 12 dictionaries. One way of describing this list is that it contains those words and phrases which a (seeming) majority of lexicographers believe are relevant to people learning English, and/or to everyday usage. This list contained about 32,000 words and phrases. The other list, the 2of12 list, was more inclusive in that it included words listed in as few as two of the source dictionaries, but less inclusive in that it excluded items of various sorts, including multi-word phrases, proper names and abbreviations. This list contained about 41,000 words. It was likely more suitable for use in areas like spell checking or word games than the 6of12 list. (Honesty compels me to admit that neither of these lists is, by itself, a good choice for spell checking, due to the absence of inflections, proper names, Roman numerals, etc.)

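The "present in at least N of the sources" idea behind both lists can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `words_in_at_least` and the toy data are hypothetical, each source is treated as a plain set of headwords, and none of the additional exclusion rules (phrases, proper names, abbreviations) are modeled.

```python
from collections import Counter


def words_in_at_least(sources, minimum):
    """Return the sorted words that occur in at least `minimum` of the sources."""
    counts = Counter()
    for source in sources:
        # Convert to a set so a word counts once per source, however often it is listed.
        counts.update(set(source))
    return sorted(word for word, n in counts.items() if n >= minimum)


if __name__ == "__main__":
    # Three toy "dictionaries" standing in for the twelve real sources.
    toy_sources = [
        {"apple", "banana", "cherry"},
        {"apple", "banana"},
        {"apple", "durian"},
    ]
    print(words_in_at_least(toy_sources, 2))  # ['apple', 'banana']
    print(words_in_at_least(toy_sources, 3))  # ['apple']
```

Raising the threshold shrinks the list toward the words every source agrees on, which is the trade-off between the 6-of-12 and 2-of-12 criteria described above.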

A third list, 2of12inf.txt, developed later, was of a rather different character, and is discussed in a later section.

A more precise description of the criteria by which the above lists were composed is as follows:

6of12 list word selection


2of12 list word selection


Signature words

As indicated, both lists have been augmented with words (and, in the case of the 6of12 list, phrases) which fail to meet the formal requirements for inclusion. In the case of the 6of12 list, 1024 words were added (about 3% of the total). These are all words which, in the judgment of the compiler, are as familiar as many of the words which did meet the criteria for inclusion. Examples of some of the sorts of words which were added are:

Note that the signature words in the 6of12 list can be identified via the annotation "+", and eliminated if desired.

A much smaller set of words (49) was added to the 2of12 list. These were of two sorts:

These words are not marked with suffix characters.

Annotations

Some of the 6of12 list entries are annotated with a suffix character, giving additional information about the associated word. The annotations can be easily removed with an editor or a script if they are unwanted.

These annotations are:

:    The word is an otherwise unmarked abbreviation. This suffix always occurs before any other suffix.
&    The word is primarily a non-American usage.
#    The word is generally held to be a variant or less preferred form of another word.
=    Roughly, this indicates a "second class" word, as described below.
<    This form of a word is held to be the primary form by fewer dictionaries than some other form of the word.
^    This form of the word was selected as the most commonly listed of a set of variant spellings.
~    This form of a word is one of a set of variant spellings, none of which was clearly preferred.
+    The word is a signature word.

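As an illustration of removing the suffix annotations with a script, here is a minimal Python sketch. The names `ANNOTATION_CHARS` and `split_annotations` are hypothetical; the character set is the one listed for the 6of12 list, and other lists that use further characters (such as "!" or "$") would need them added.

```python
# Suffix characters documented for the 6of12 list.
ANNOTATION_CHARS = ":&#=<^~+"


def split_annotations(entry, chars=ANNOTATION_CHARS):
    """Split a list entry into (word, annotations).

    Annotations are the run of characters from `chars` at the end of the entry.
    """
    word = entry.rstrip(chars)
    return word, entry[len(word):]


if __name__ == "__main__":
    for entry in ["boldfaced~=", "aardvark", "back seat+"]:
        word, marks = split_annotations(entry)
        print(repr(word), repr(marks))
```

Keeping the annotations as a separate string, rather than discarding them, makes it easy to filter on them later, for example to drop every entry whose annotations contain "+".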

The reasons a word might be marked with the = annotation are:

Note that, in the determination of the "<", "^", and "~" suffixes, only certain very close spelling variations are considered, namely single word vs. hyphenated word vs. multi-word, differences in capitalization, and presence or absence of a terminating period for abbreviations. The words tenderhearted and tender-hearted are close variants by this definition, but judgment and judgement are not.

The words in the 2of12 list are not annotated.

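The notion of "very close spelling variations" used for the variant-spelling annotations can be approximated by mapping each entry to a comparison key. The following Python sketch is an assumption about how one might test for such variants, not the procedure used to build the lists; `close_variant_key` is a hypothetical name.

```python
def close_variant_key(entry):
    """Collapse the differences that the close-variant rule ignores."""
    key = entry.lower()                          # differences in capitalization
    key = key.replace("-", "").replace(" ", "")  # single vs. hyphenated vs. multi-word
    return key.rstrip(".")                       # terminating period on abbreviations


def are_close_variants(a, b):
    """True if two different entries differ only in the ignored respects."""
    return a != b and close_variant_key(a) == close_variant_key(b)


if __name__ == "__main__":
    print(are_close_variants("tenderhearted", "tender-hearted"))  # True
    print(are_close_variants("judgment", "judgement"))            # False
```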

The 2of12inf list

The 2of12inf list is of a rather different character from the two original "classic" lists. Conceptually, it is simple. It consists of all the unhyphenated words in the 2of12 list, plus their inflections, amounting to about 82,000 words. This list may be more useful than the other lists for applications like word games. It was created to help Kevin Atkinson in his Aspell and SCOWL projects. Unlike the 6of12 and 2of12 lists, this list was not based exclusively on the contents of my 12 source dictionaries, and for this reason it has, I feel, less authority than the other classic 12dicts lists. It also probably has a significantly higher error rate than the other lists, for reasons explained below.

The criteria defining the 2of12inf list are as follows:

Though the 2of12inf list still consists mostly of very common words, criteria 3 through 5 above cause the 2of12inf list to contain a greater proportion of unfamiliar and unusual words than the other classic 12dicts lists.

The 2of12inf list was not derived directly from the 12 source dictionaries. The starting point was a subset of Kevin Atkinson's AGID list, a list of words, parts of speech and inflections derived from public-domain sources, notably Moby Words and WordNet. (See the file agid.txt in the 12dicts archive, which is a copy of the AGID "readme", for more information on the antecedents of AGID.) 2of12inf was created by a process of editing the AGID subset to remove spurious entries and those which reflected a more esoteric English vocabulary than the other 12dicts lists, and to add inflections which AGID failed to identify. This process required significantly less effort than would have been needed to derive the list directly from the source dictionaries. Unfortunately, a side effect of the process was that the result is probably somewhat less reliable than the other 12dicts lists. In particular, Moby Words is notoriously unreliable, and I find it unlikely that I have successfully identified all the spurious inflections its use has introduced. It would be nice to release another edition of 2of12inf which is not derived from AGID, and therefore not "infected" by Moby Words, but I haven't done so in 15 years, and so it probably won't happen.

After the first version of the 2of12inf list was released, I replaced one of the source dictionaries, officially an international dictionary but in actuality rather British in its orientation, with a more American dictionary by the same publisher. It was not practical (nor necessarily desirable) for me to go through the list removing inflections endorsed only by the superseded dictionary. For this reason, the 2of12inf list has a slightly more international character than the other 12dicts lists. It is not altogether clear that this is a bad thing.

Selection of inflections

Ideally, the 2of12inf list would contain only inflections listed in one of the 12dicts source dictionaries. This proved not to be practical. The reason for this has to do with the nature of these sources, which are mostly ESL dictionaries. An ESL dictionary might well list the word esophagus, but, because an English learner is unlikely to need to talk about this organ in the plural, it will probably not bother to list the plural form esophagi. For words of this sort, I therefore needed to obtain their inflections from other sources. Obviously, the decisions on when to include additional inflections were judgment calls, as were the choices of which inflections to add.

Adjectival inflections (comparatives and superlatives) proved to be an especially annoying problem. Only 2 of my 12 source dictionaries provided remotely reliable information of this sort. In fact, such information is sparse and inconsistent in most dictionaries of any size. I relied on a small set of additional dictionaries for this information, which was mostly disjoint from the sources for plurals and verb forms. Several of these sources were Scrabble®-related, and therefore inclined to include forms of little plausibility such as iller/illest or fertiler/fertilest. Accordingly, I ended up rejecting some of the documented inflections on grounds of implausibility. I have no doubt that, in the process, I made a number of errors of both inclusion and exclusion and, in any case, many of the forms listed have no connection with any of the 12dicts source dictionaries.

One additional problem in the creation of the 2of12inf list was that of "uncountable" nouns and their plurals. Some English dictionaries, especially ESL dictionaries, as well as other linguistic sources attest to the existence of nouns which cannot be counted or used in the plural. Examples of such nouns include mud, rayon, oregano, chess, fairness, wisdom, aluminum, training, materialism and chickenpox. This is an entirely commonsense notion, but a difficulty is the fact that the boundary between the countable and the uncountable is extremely vague and ill-defined. For example, the word coffee is ordinarily uncountable, but not when ordering in a restaurant, as is the word symmetry, except in physics or math. In general, it is possible to contrive a context where use of the plural of any noun whatsoever is reasonable.

An alternate position, therefore, is that in fact no nouns are uncountable, and that any noun which is not already plural possesses a plural. This position is especially useful in the context of word games, where words such as zeals and anthraxes may produce large scores. For this reason, the official Scrabble dictionaries list words such as thens, onces and mankinds, which most people find rather implausible. The fact that the 2of12inf list might well be useful in gaming contexts, together with the fact that the boundary between countable and uncountable nouns is so ill-defined, served as a powerful argument for inclusion of all plural forms, whether commonly used or not, while its derivation from ESL sources argued for including only the plurals of countable nouns, however distinguished.

As I prepared the list for release, I was unable to resolve this dilemma, and adopted a compromise. The 2of12inf list includes all plurals, but with the plurals of uncountable nouns marked, making it easy to remove them if they are not wanted. That left the issue of how to establish countability. Six of my source dictionaries included information on countability, which was adequate to decide the status of most of the included nouns. As for the rest, as usual, I used my best judgment. I will confess to occasionally overriding the source dictionaries when I believed they were clearly incorrect. (For instance, I chose not to mark the word hatreds as an uncountable plural, in defiance of the opinion of all my sources, on the grounds that it has been used in too many news stories from Bosnia to be considered unusual.) It is interesting to note that most of the plurals I added from auxiliary sources were of words considered uncountable. I also note that at some point after the release of the 2of12inf list, I decided that it would have been better to have left the Scrabble plurals out, and, while I was not comfortable with removing them, no list I've created since then which lists inflections includes them.

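Removing the marked plurals is a simple filtering job. The sketch below is a hedged Python illustration: the function names are hypothetical, and because this section does not state which character marks an uncountable plural, the "%" used in the example is a placeholder assumption that should be checked against the list's own documentation. The "!" marker for neologisms is the one described earlier in this document.

```python
def trailing_marks(entry):
    """Return the run of non-alphanumeric characters at the end of an entry."""
    i = len(entry)
    while i > 0 and not entry[i - 1].isalnum():
        i -= 1
    return entry[i:]


def drop_marked(entries, markers):
    """Keep only entries whose trailing marks contain none of `markers`."""
    unwanted = set(markers)
    return [e for e in entries if not unwanted & set(trailing_marks(e))]


if __name__ == "__main__":
    # "%" is an ASSUMED marker for uncountable plurals; "!" marks neologisms.
    sample = ["mud", "muds%", "coffee", "coffees", "selfie!"]
    print(drop_marked(sample, "%"))
    print(drop_marked(sample, "%!"))
```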

The difficulties listed above, and the fact that I was forced to exercise personal judgment frequently in creating it, emphasize a fundamental difference between this list and the other classic 12dicts lists. I have tried to make the 6of12 and 2of12 lists reflect only the source dictionaries, and to keep my own judgments and opinions out of the picture (except for my addition of signature words). This has proved impossible to achieve for the 2of12inf list, which accordingly represents a less authoritative and more arbitrary collection. Additionally, the 2of12inf list has undergone less proofreading and validation than the other lists, and I suspect the error rate is somewhat higher than the idealistic goal of 0.02% I adopted for this project. Nevertheless, I hope it may prove to be of some use and interest.

I wish to offer my special thanks to Kevin Atkinson, for supplying me with the AGID list, and for encouraging me to add the inflections. Of course, any errors that remain in the 2of12inf list are my own responsibility, and should not be blamed on Kevin, AGID, or even on Moby.

The 3esl list

The 3esl list represents another attempt to produce an English "core vocabulary" list. It is about 2/3 of the size of the 6of12 list, which it resembles in terms of the sorts of words included.

The 3esl list is a far more subjective list than any of the classic 12dicts lists. It was compiled from 3 small ESL dictionaries, using the same criteria for eligibility as the 6of12 list. I started with a list composed of all words from the smallest of the 3 sources, plus all words contained in both of the others. This list was then edited in the following ways:

  1. I removed alternate spellings for included words, such as grey and off-stage. I also removed very similar synonyms for the same concept, for instance, removing cable television as a duplicate of cable TV.
  2. I added one form of each word which would have been included if the sources had agreed on spelling, such as shortchange and back seat.
  3. I removed some words which were present in the smallest of the sources but seemed too esoteric, such as the symbols of chemical elements. I did this only for words which were not present in the other sources.
  4. I added some words which were present in only one of the two larger sources, but which seemed appropriate to add. These words were frequently of the sort added to the 6of12 list as signature words, as well as some inflections that often function as words with meanings of their own, such as comforting and notes.

All of these changes were quite subjective in nature, and quite numerous. Probably more than 10% of the candidate words were added or removed in this way. For this reason, it is pointless to speak of signature words for this list; the composition of the list is too arbitrary for the term to make any sense. (I will note that the list is still not entirely arbitrary, as I added only words found in some form in one of the sources, and removed no words present in two of the sources other than duplicates. Thus, words like front page were not added, no matter how familiar, and words such as lugubrious were not removed, despite clearly not being part of anyone's "core vocabulary".)

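The starting point for the 3esl list, and the two limits placed on the later hand edits, can be expressed with set operations. This Python sketch uses hypothetical function names and toy data, and it simplifies "found in some form" to exact string matches, so it is only a rough model of the rules described above.

```python
def starting_candidates(smallest, other_a, other_b):
    """All words from the smallest source, plus words found in both other sources."""
    return set(smallest) | (set(other_a) & set(other_b))


def rule_violations(added, removed, sources):
    """Return (bad_additions, bad_removals) for a proposed set of hand edits.

    An addition is bad if no source lists the word; a removal is bad if the
    word is listed by two or more sources. Duplicate handling is not modeled.
    """
    in_any_source = set().union(*sources)
    bad_additions = set(added) - in_any_source
    bad_removals = {w for w in removed if sum(w in s for s in sources) >= 2}
    return bad_additions, bad_removals


if __name__ == "__main__":
    smallest = {"cat", "dog", "au"}
    larger_a = {"cat", "bird", "fish"}
    larger_b = {"bird", "fish", "emu"}
    print(sorted(starting_candidates(smallest, larger_a, larger_b)))
    print(rule_violations({"emu", "front page"}, {"au", "cat"},
                          [smallest, larger_a, larger_b]))
```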

Like the 6of12 list, the 3esl list marks lower-case abbreviations with a ":" suffix, to prevent them from being mistaken for regular English words.

One final note on this list. The 3esl list contains about 1500 words not present in the 6of12 list. Because these two lists have the same rules for the kinds of words included, one could easily combine the two to produce a slightly larger list including a number of words whose omission from 6of12 is rather surprising. Be warned that in a few cases, the spelling chosen for words with multiple spellings is different in the two lists, and I would recommend that the duplicates be removed. (I'll be happy to provide a list of the duplicates if anyone wants one.)

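A combined list of this kind might be produced along the following lines. This Python sketch is an assumption about one reasonable approach rather than a supplied tool: `merge_lists` is a hypothetical name, the inputs are assumed to have had their annotation suffixes stripped already, and only spelling differences in case, hyphenation and spacing are detected, so pairs such as grey and gray would still need to be checked by hand.

```python
def merge_lists(primary, secondary):
    """Append to `primary` the words of `secondary` that are new.

    Returns (merged, skipped), where `skipped` holds secondary words that
    differ from a primary word only in case, hyphenation or spacing.
    """
    def key(word):
        return word.lower().replace("-", "").replace(" ", "")

    exact = set(primary)
    seen = {key(w) for w in primary}
    merged = list(primary)
    skipped = []
    for word in secondary:
        if key(word) in seen:
            # Already present; record it only if the spelling differs.
            if word not in exact:
                skipped.append(word)
        else:
            merged.append(word)
            exact.add(word)
            seen.add(key(word))
    return merged, skipped


if __name__ == "__main__":
    six_of_twelve = ["back seat", "cat"]  # toy stand-in for 6of12
    three_esl = ["backseat", "cat", "emu"]  # toy stand-in for 3esl
    print(merge_lists(six_of_twelve, three_esl))
```

Returning the skipped spellings separately leaves the choice of which form to keep with the person building the combined list.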

The international 12dicts lists

Four 12dicts lists contain a more cosmopolitan vocabulary than the classic lists. Two of these lists, 2of4brif and 5d+2a (previously called 5desk), were released over ten years ago. The 2of4brif list was derived from four British dictionaries, and has now been deprecated, as I believe the 3of6game list to be a superior implementation of the same concept, compiled from more recent sources. The 5d+2a list was originally compiled from a variety of sources, but was extensively revised for this release by addition of several fairly recently published sources.

For release 6, two new international lists were added to 12dicts: 3of6game and 3of6all. These were based on 6 "advanced learner's" ESL dictionaries, released by both American and British publishers, most of which covered both strains of English. The 3of6game list is intended primarily for use in word games, and can be compared to 2of12inf in its general approach. The 3of6all list includes more forms of words (hyphenated, capitalized, multi-word phrases, etc.), and can be compared to 6of12 in its general approach.

Two other more unusual lists were derived from these sources: 6phrase and 2of5core. 6phrase is a collection of all the multi-word phrases from any of the six dictionaries. Five of the six international sources flag some words as being the most important words for an English beginner to master. The 2of5core list collects those words that are flagged in at least two of these dictionaries. Both of these lists are discussed in a little more detail in the "Specialized Lists" section of this document.

The following table summarizes the contents of each of the lists in the International directory, ordered by size in words:

                     2of4brif   3of6game   5d+2a           3of6all
Size (Words)         60,000     65,000     68,000          83,000
Number of Sources    4          6          7 (+5 minor)    6
American English     Some       Y          Y               Y
British English      Y          Y          Y               Y
Ordinary words       Y          Y          Y               Y
InflectionsYYY
HyphenationsY
PhrasesY
NamesYY
AbbreviationsY
AcronymsYY
Prefixes/SuffixesY
Signature wordsYY
NeologismsY
AnnotationsYY
+ + + + + + + + +

+ + + + + + + + + +

The 2of4brif list

+ + + + + + + + +

+All of the classic 12dicts lists are unabashedly oriented towards +American English. After receiving a few expressions of interest in a +British English list, I put together the 2of4brif list. This list +was compiled from 4 large "international" ESL dictionaries, published +by British publishers. To this American, they are more British than +they are international; quite possibly, they seem more American than +international to British readers. It is interesting to note that, +although there were only a third as many sources for this list as for +the 12dicts lists, these dictionaries resembled each other far more +closely than their American counterparts, which could mean that the +2of4brif list is as good an approximation of a "core" British English +vocabulary as the 6of12 list is for American English. (Or, alternately, +it may simply mean that my choice of sources was too narrow.) +

+ + + + + + + + +

The criteria for inclusion in this list were +basically those of the +2of12inf list. In particular, inflections are included for all words, +but hyphenated words, contractions, phrases, proper names and +abbreviations are all excluded. One important difference between +the two is the way in which inflections were determined for inclusion. +The 2of12inf list includes some inflections found in one (or even none) +of its sources. Further, as discussed in detail above, +it includes plurals for words which are not normally +considered to have plurals. The 2of4brif list differs in both of +these regards. It includes only inflections endorsed by two or more +of the sources, specifically excluding any plural forms for nouns +listed as uncountable. +

+ + + + + + + + +

The 2of4brif list includes no signature words as +such. I made a small +number of adjustments for consistency, such as making sure that +-ise and -ize spellings were +equally +represented, and adding plurals for ordinal numbers. (Why +fourteenth would be defined as a fraction, but not +seventeenth, I must simply regard as a mystery.) +These +edits were so few, and so clearly harmless, that I have not marked +them. +

+ + + + + + + + +

Prospective users of the 2of4brif list should +realize that it was +compiled by an American. If my sources contained any glaring errors +(and most dictionaries have a few), I might well not have noticed, +and perpetuated them in the list. The fact that two citations were +required is some protection against such an event, but no guarantee. +

+ + + + + + + + +

As the 2of4brif list is very similar in makeup to +the 2of12inf list, +a user who wants a larger, more international list than either could +reasonably merge the two. If you do this, you should remove the +unusual plurals (marked with a "%") from the 2of12inf list in the +process, for consistency.
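The merge suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of 12dicts: it assumes each list has been read as one entry per line and that the "%" mark is the last character of a flagged entry. The function name `merge_lists` and the sample entries are hypothetical.

```python
def merge_lists(brif_entries, inf_entries, plural_mark="%"):
    """Merge 2of4brif-style and 2of12inf-style entries into one sorted list.

    Assumption: unusual plurals in the second list end with `plural_mark`.
    Any other annotation characters are left untouched.
    """
    merged = {e.strip() for e in brif_entries if e.strip()}
    for entry in inf_entries:
        entry = entry.strip()
        # Skip blank lines and the flagged plurals, for consistency with 2of4brif.
        if not entry or entry.endswith(plural_mark):
            continue
        merged.add(entry)
    return sorted(merged)


if __name__ == "__main__":
    brif = ["colour", "cat", "cats"]           # hypothetical sample entries
    inf = ["color", "cat", "cats", "musics%"]  # hypothetical sample entries
    print(merge_lists(brif, inf))  # ['cat', 'cats', 'color', 'colour']
```

Using a set removes the words the two lists share, and sorting at the end keeps the result in the same alphabetical form as the inputs.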

+ + + + + + + + +

Note that I have deprecated the 2of4brif list. I +believe that any applications of this list would be better off using +the 3of6game list in its place.

+ + + + + + + + +

The 3of6 +lists

+ + + + + + + + +The lists 3of6game and 3of6all are new with version 6 of +12dicts. Both were derived from a set of six advanced learner's ESL +dictionaries. The dictionaries can be broken down as follows:
+ + + + + + + + +
+ + + + + + + + + +This provided a good balance between British and American +usage. My goal was to produce lists that contained blancmange and swede as well as applesauce and boysenberry. Note +that +some of the British dictionaries include words from Australian, Indian, +African and Caribbean English, and a fraction of this vocabulary made +it into the 3of6 lists.
+ + + + + + + + +
+ + + + + + + + +In previous versions of 12dicts, I asked users to tell me what they +were doing with the lists. The most common answer was that they were +used to supply the vocabulary for a word game. The 3of6game list was +designed to fulfill this purpose. It contains only the sort of words +likely to be used in a word game (no hyphenated words, proper names, +abbreviations, contractions or phrases), but does contain inflections. +In general, words must appear in three of the sources to be +included. The rules, however, do provide for a number of (annotated) +exceptions, including uncommon inflections and words whose most common +form is either hyphenated or phrasal. Details are below.
+ + + + + + + + +
+ + + + + + + + +The 3of6all list is a larger list, basically containing any kind of +word you can imagine, if found in three of the sources. As with +3of6game, some additional words were added as exceptions, but +there are not as many of them, as the goal of this list is to be as +faithful as reasonable to the sources.
+ + + + + + + + +
+ + + + + + + + +Both the 3of6game and 3of6all lists contain signature words/phrases. +The 3of6game list also contains neologisms, as game players are likely +to want to play recently coined or popularized words.
+

The +3of6game list

+ + + + + + + + +The 3of6game list contains words which are listed in 3 of +the 6 advanced learner's dictionaries described above. Only words +suitable for play in most word games are included, excluding hyphenated +words, multi-word phrases, capitalized words, abbreviations and +contractions. There are no restrictions on length - in particular, it +contains four one-letter words: a, +x (a verb +meaning to cross out), I +and O, the +last two of which are included despite their capitalization (which is +an English spelling phenomenon entirely disconnected from +logic). In certain cases, words are present in this list despite being +listed in fewer than three sources. This serves the purpose of +offering game players more words in situations where lexicographers +differ about what word forms are correct. Some exceptional situations +are:
+ + + + + + + + +
+ + + + + + + + + +Words not present in three of the source dictionaries are +marked with the "$" suffix character if the "^" and "&" +annotations do not apply.
+ + + + + + + + +
+ + + + + + + + +The 3of6game list includes both signature words and neologisms, marked +with a "+" or "!" respectively.
There are 520 +signature words for this list, representing words +that I feel "ought to be" included. Each signature word is present in +at least one of the source dictionaries. Virtually all of these words +are American English, as I am not qualified to tell whether an +interesting Britishism like tosspot +is used often enough to justify its addition as a signature word. Note +that the presence of annotations allows a user to remove these +extra words if she finds their addition unjustified.
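Removing annotated words, as described above, might look like the following Python sketch. It assumes one entry per line and that the annotation characters named in the text ("$", "^", "&", "+", "!") appear only as trailing characters. The function names and the sample entries are hypothetical, not taken from the list itself.

```python
ANNOTATION_CHARS = "$^&+!"  # marks named in the text, assumed to be trailing only


def split_annotations(entry):
    """Split one entry into (word, trailing annotation characters)."""
    entry = entry.strip()
    word = entry.rstrip(ANNOTATION_CHARS)
    return word, entry[len(word):]


def filter_entries(entries, drop="+!"):
    """Return bare words, omitting entries that carry any mark listed in `drop`.

    The default drops signature words ("+") and neologisms ("!").
    """
    kept = []
    for entry in entries:
        word, marks = split_annotations(entry)
        if word and not any(mark in drop for mark in marks):
            kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["apple", "exampleword+", "newword!", "rareform$"]  # hypothetical
    print(filter_entries(sample))  # ['apple', 'rareform']
```

Passing `drop=""` keeps every word and simply strips the marks, which is the form most word-game programs would want.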
+ + + + + + + + +
+ + + + + + + + +The 3of6game list could be combined with the 2of12inf list (minus the +uncountable plurals) and/or 2of4brif if a larger list is required. Note +that because 2of12inf is very strongly American, such a combination will +be less balanced between American and British English than 3of6game +itself.
+ + + + + + + + +
+

The +3of6all list

+ + + + + + + + +The 3of6all list contains words which are listed in three of +the six advanced learner's dictionaries. In contrast to the 3of6game +list, no words are excluded, not even abbreviations, prefixes or +suffixes. Most words have their inflections included. An exception is +made for phrasal verbs and other verb phrases, whose inflections are +completely predictable from the initial word of the phrase.
+ + + + + + + + +
+ + + + + + + + +The 3of6all list contains many phrasal verbs, such as let down, take after, sound off and make out, whose +meanings are often quite hard for inexperienced +students of English to guess. Phrasal verbs are marked by the ";" +suffix +character. Only four of the six source dictionaries provide phrasal +verb information in an easy-to-collect way. For this +reason, I put a phrasal verb into the 3of6all list even if I found it +in only two of the sources.
+ + + + + + + + +
+ + + + + + + + +The 3of6all list contains some other words present in fewer than three +of the +dictionaries, though not as many as 3of6game. All such words are +marked. The cases where this occurs are as follows:
+ + + + + + + + +
+ + + + + + + + + +There is one other situation where an annotation suffix is +used. This occurs when a word is shown by a majority of the sources as +being used only in a few +specific phrases, even though other dictionaries may give it a regular +definition. An example is the word bated, +which is shown by most of the sources as used only in the phrase with bated breath. +In this case, the word is flagged with a ">" suffix. A search on +a word so flagged will reveal the key phrase(s) elsewhere in the list.
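The search described above can be automated. The following Python sketch assumes one entry per line, that annotation characters appear only at the end of an entry, and that key phrases are stored as ordinary space-separated entries. The set of suffix characters, the function name and the sample data are all assumptions made for illustration.

```python
SUFFIX_CHARS = ";:>^+$&!"  # assumed set of trailing annotation characters


def key_phrases(entries):
    """Map each word flagged with ">" to the phrases in the list that contain it."""
    cleaned = []
    flagged = []
    for entry in entries:
        entry = entry.strip()
        text = entry.rstrip(SUFFIX_CHARS)
        cleaned.append(text)
        if ">" in entry[len(text):]:
            flagged.append(text)
    result = {}
    for word in flagged:
        # A phrase is any multi-word entry with the flagged word as a whole word.
        result[word] = [t for t in cleaned if " " in t and word in t.split()]
    return result


if __name__ == "__main__":
    sample = ["bated>", "breath", "let down;", "with bated breath"]  # hypothetical
    print(key_phrases(sample))  # {'bated': ['with bated breath']}
```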
+ + + + + + + + +
+ + + + + + + + +Recall that, sometimes, a word may have more than one suffix. An +abbreviation shown with the ":" suffix (indicating the absence of a +final period) may be followed by another suffix, and the combination +">^" appears upon occasion.
+ + + + + + + + +
+ + + + + + + + +The 3of6all list contains signature phrases, but no neologisms. The +signature phrases are marked with the "+" suffix. The 629 3of6all +signatures are all basic conversational idioms and common connective +phrases, like I told you +so, in +front of and on +the other hand. Though these phrases often show up in the +sources in lists of idioms, they generally do not appear as separate +headwords, which kept me from easily recording their presence. I +believe, however, that all of these phrases are extremely common, and +deserve to be included in this list.

+ + + + + + + + +
+

The 5d+2a list

+ + + + + + + + +

+I created the 5d+2a list (originally called 5desk) in an attempt to do +a better /usr/dict/words +(the failings of which were a large part of my motivation for doing +12dicts in the first place). +The sorts of words admitted are the same sorts that /usr/dict/words +traditionally contains. Though somewhat larger in size than many +versions of +/usr/dict/words, this is still a short word list, striving for +inclusion +of words one is likely to encounter rather than the complete jargon of +every possible scientific, artistic or occult endeavor. +

+ + + + + + + + +

The original 5desk list was assembled primarily +from five "desk +dictionaries". It +was augmented by words from five minor sources, including a "vocabulary +builder" and a collection of proper names. It excluded +prefixes, suffixes, phrases, hyphenated words, contractions and most +abbreviations and acronyms. There was no requirement for multiple +listings; all qualifying words from each of the sources were included. +Inflections of included words were not included themselves except when +irregular, or separately defined. Variant and non-American spellings +were not excluded, and no signature words were added.

+ + + + + + + + +

Words commonly considered to be +abbreviations/acronyms were included +if they contained at least one upper case character, and were defined +with an explicit part of speech. This excluded items like Mr +and +Feb, which are abbreviations in the classic sense, +but allowed words +like DNA and ATM, which are +used far more frequently than that +which they abbreviate. While there is a trend in modern dictionaries +to list such words as nouns (or occasionally verbs, adverbs, etc.), +it is a trend in progress, and rather inconsistently applied. For +this reason, the set of such words in the 5desk list is somewhat +incoherent, including SPCA but not PETA, +AIDS but not SAD, +KGB +but +not CIA, and PDQ but not ASAP.

+ + + + + + + + +

When version 6 of 12dicts was released, the 5desk +list was +augmented by adding qualifying words from two advanced learner's ESL +dictionaries, and as a result renamed to 5d+2a.txt. Both of the +additional dictionaries had a strongly international vocabulary, +causing the new list to have a less American and more cosmopolitan +character. This increased the size of the list by about 20% to about +68,000 words.

+ + + + + + + + +

One class of commonly-used words is regrettably +absent from the 5desk +list, because I was unable to find a satisfactory source for them. +This is the class of commercial names such as Exxon, Tylenol, +Pepsi and Chevy. This is probably +forgivable, +as this class of names is as ephemeral and transitory as teenage slang. +The one-time household words Kool, Ovaltine, Philco +and +Ipana serve now only as answers to trivia questions, +with modern wonders like Starbucks, Google, Ritalin +and TiVo taking their place on the tongues of the +trendy.

+ + + + + + + + +

The 5d+2a list contains no signature words. I did +take the liberty of adding the personal names of around thirty +well-known individuals, mostly statesmen and politicians. Though the +original 5desk list contained many such names from all periods of human +history, I have not found a useful source to bring the list into the +twenty-first century. At the same time, I felt that distributing a list +full of +names that did not include Cheney and Obama was not +reasonable. So I compromised by adding a few names whose historical +significance was clear to me, until such time as a better source than +my own memories of the last 15 years can be found. +

+ + + + + + + + +

The 5d+2a list has clearly moved beyond any "core +vocabulary" concept. +It includes quite esoteric words (ogee, pleonastic), +very +uncommon spellings (thiamine, yuppy), and obscure +geographical +and historical names (Paricutin, Nevelson). Like +the traditional /usr/dict/words, it is frequently inconsistent and +arbitrary, but I +hope at the least I have avoided including spelling errors, and +overlooking the stuff of everyday conversation. Perhaps it will be +useful as a compromise between basic lists such as 3esl, and truly +massive lists like Mendel Cooper's ENABLE.

+ + + + + + + + +

The +lemmatized 12dicts lists

+ + + + + + + + +Version 6 of 12dicts provides three lemmatized lists +combining words from the 2of12inf, 3of6game and 2of4brif lists. The +word "lemmatized" is a rare +word, which you will find in none of these lists, but what it means is +that these lists are formatted as a collection of word sets, called +lemmas (or lemmata, if you're into irregular plurals), each set +composed of a headword and some number (possibly zero) of closely +related +words. Two of these lists were introduced in version 5 of 12dicts, but +they have undergone major revisions since then.
+ + + + + + + + +
+ + + + + + + + +The three lists are 2+2+3lem (originally 2+2lemma), 2+2+3frq +(originally 2+2gfreq) and 2+2+3cmn. 2+2+3lem simply arranges +the words of the three source lists into lemmas and lists them +alphabetically by headword. 2+2+3frq arranges the same lemmas by +approximate order of their frequency of usage, computed with the help +of a frequency list obtained from Brigham Young University (BYU), +omitting those words and lemmas whose usage is so small that they fail +to show up in the BYU material. 2+2+3cmn extracts a subset of the +lemmas of 2+2+3lem, namely those lemmas with a certain minimum level of +usage (approximately the level of the word butterscotch), and +lists them alphabetically by headword. This is yet another attempt in +12dicts to generate a core English vocabulary.
+ + + + + + + + +
+ + + + + + + + +The advantage of a lemmatized presentation of words is that it puts +related words together, even when spellings differ greatly, as for
be, are, is and were. A moderate +disadvantage is that the same word can appear in more than one lemma, +such as putting, +which is present in the lemmas headed by both put and putt. Overall, I +find the lemmatized format to be clearer and more useful than a simple +alphabetized list, and I rather wish I had released the other lists +which include inflections in that format.
+ + + + + + + + +
+ + + + + + + + +
The following table summarizes the contents of +each +of the lists in the Lemmatized directory, ordered +by size in words:
+ + + + + + + + +

                   | 2+2+3cmn | 2+2+3frq | 2+2+3lem
Size (Words)       | 25,000   | 34,000   | 84,000
Number of Sources  | 21       | 21       | 21
American English   | Y        | Y        | Y
British English    | Some     | Some     | Y
Ordinary words     | Y        | Y        | Y
Inflections        | Some     | Some     | Y
Hyphenations       | Some     | Some     | Y
Phrases            |          |          |
Names              | Some     | Some     |
Abbreviations      | Some     | Some     |
Acronyms           | Some     | Some     |
Prefixes/Suffixes  |          |          |
Signature words    | Y        | *        | *
Neologisms         | A few    | A few    | Y
Annotations        | Y        | Y        | Y
+ + + + + + + + +

+ + + + + + + + +

A * in the "Signature Words" row means that +signature +words associated with some other list may be present, but there are no +signature words associated specifically with that list.

+ + + + + + + + +

The 2+2+3lem list

+ + + + + + + + +

The list 2+2+3lem.txt contains the words in the +2of12inf, 2of4brif and 3of6game lists. +Also, the new words from the neol2016.txt list have +been added, marked with a "!" if they would not have otherwise been +included. (Marking the new words permits them to be removed if it is +preferred for this list to be in synch with the other 12dicts lists.) +Furthermore, some high-frequency hyphenated words from 2of12.txt and +3of6all have been added. These words were originally added to the +lemmatized frequency list (see below), +and I liked the results so much that I added them to this list as well. +Finally, British forms of words in +the 2of12inf list not already in the other lists have been added. +Words +marked with a % in the 2of12inf list ("Scrabble plurals") have +however been omitted.

+ + + + + + + + +

In the previous version of 12dicts, the 2+2+3lem list was +called 2+2lemma. The only significant changes were the addition of new +words, and switching from "+" to "!" to mark neologisms in the list.

+ + + + + + + + +

The 2+2+3lem list is not formatted as a simple list +of words. +It is composed of entries of 1 or 2 lines each. The +first +line contains a headword, and the second line, which is indented if +present, contains an alphabetized list of related words. A +simple example:

+ + + + + + + + +

funny
+ + + + + + + + +    funnier, funnies, funniest, funnily, funniness
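Reading this two-line format might be sketched as follows in Python. The sketch assumes that a headword line starts in the first column, that the optional second line starts with whitespace, and that related words are separated by commas. It does not handle cross-reference entries, which contain commas inside brackets. The function name is hypothetical.

```python
def parse_lemmas(lines):
    """Return a list of (headword, related_words) pairs from lemma-format lines."""
    lemmas = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace():
            # Indented line: related words belonging to the previous headword.
            if lemmas:
                words = [w.strip() for w in line.split(",") if w.strip()]
                lemmas[-1][1].extend(words)
        else:
            lemmas.append((line.strip(), []))
    return lemmas


if __name__ == "__main__":
    sample = [
        "funny\n",
        "    funnier, funnies, funniest, funnily, funniness\n",
        "gorse\n",
    ]
    for headword, related in parse_lemmas(sample):
        print(headword, related)
```

A headword with no second line simply gets an empty list of related words.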

+ + + + + + + + +

The list of related words contains three sorts of +entries.

+ + + + + + + + +
  1. Inflections.
  2. Variant spellings.
  3. Words formed with certain suffixes.
+ + + + + + + + +

In addition to true variant spellings such +as grey +for gray +and thru +for through, +item 2 also includes words +which, though pronounced differently, are clearly variants +of the headword. Thus, hooray is considered +a variant of hurrah +(but mere synonyms like furze +and gorse +remain +independent).

+ + + + + + + + +

Item 3 is based on a small list of suffixes, +producing closely +and consistently related words. These suffixes are -ful, -ish, +-less, -like, -ly, -most and -ness. -ally is also +allowed, if +there is no -al +word to apply the -ly +suffix to. (For instance, basically is +considered to be derived from basic, because there +is +no word basical.) When +one of these suffixes is used in an +unusual way, the resulting word is considered independent. +For +instance, likely +is not considered to be derived from like, nor bashful +from bash. +There are some rather difficult questions +here, such as how closely slavish +is related to slave, +or sluggish +to slug. +In general, I have chosen the course of +least surprise by treating such pairs as independent.
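The mechanical part of this rule can be sketched in Python. This is an illustration only, not the procedure actually used to build the list: it ignores spelling changes (such as funnily from funny), and the judgment calls described above (likely, bashful, slavish) have to be supplied by hand through the hypothetical `independent` parameter.

```python
SUFFIXES = ("ful", "ish", "less", "like", "ly", "most", "ness")


def candidate_base(word, vocabulary, independent=frozenset()):
    """Suggest the headword a suffixed word might be grouped under, or None.

    `vocabulary` is a set of known words; `independent` holds words that
    were judged by hand to stand on their own.
    """
    if word in independent:
        return None
    # -ally counts only when there is no -al word to take the -ly suffix.
    if word.endswith("ally"):
        al_word = word[:-2]  # e.g. "basically" -> "basical"
        base = word[:-4]     # e.g. "basically" -> "basic"
        if al_word not in vocabulary and base in vocabulary:
            return base
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            base = word[:-len(suffix)]
            if base in vocabulary:
                return base
    return None


if __name__ == "__main__":
    vocab = {"basic", "formal", "hope", "like"}
    print(candidate_base("basically", vocab))  # basic
    print(candidate_base("formally", vocab))   # formal
    print(candidate_base("likely", vocab, independent={"likely"}))  # None
```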

+ + + + + + + + +

Here are some other notes on the determination of +what words are related.

+ + + + + + + + +

Certain uses of the suffixes -ed and -s are treated as +inflections, even though technically they are not. +Thus, talented +is treated as derived from talent, +and optics +from optic.

+ + + + + + + + +

Words ending with the suffix -ability/ibility are +treated as relatives of the corresponding -able/ible word.

+ + + + + + + + +

Sometimes, the choice of which variant to treat as +the headword +is somewhat arbitrary. I have consistently chosen an American +spelling over a British spelling here. This has some effect on +the number of headwords. I treat cheque as a variant +of check, +whereas, to an observer with a British bias, they would no doubt be +separate headwords.

+ + + + + + + + +

No distinction is made of different meanings of the +same word, +even when they are so different that dictionaries list them +separately. wind +the noun and wind +the verb are considered as a +single word, as are second +the adjective, second +the noun and second +the verb.

+ + + + + + + + +

It may sometimes happen that two different words +have the same inflection (putting +derives both from putt +and put; holier relates +to holey +as well as holy), +or that an inflection +is a headword in its own right (as with wound, the past +tense of wind, +or crooked, +the past tense of crook). +These +situations are noted in the 2+2+3lem list as cross-references to the +alternate headword. There are two specific situations +which might not be obvious where +inflections are treated as different words. +These occur when a present participle (-ing form) or a -ness word has a +plural inflection, as with meaning +and weakness. +Such words +are always made headwords, even when the relationship to the original +root is very close. Here is an example showing how +cross-references are indicated:

+ + + + + + + + +

base
+ + + + + +    based, baseless, basely, baseness, +baser, bases -> [basis], basest, basing

+ + + + + + + + +

Almost always, a given word has only one +cross-reference - the +biggest exception is the incredible tangle shown in the example below:

+ + + + + + + + +

slue +-> [slough]
+ + + + + + + + +     slew -> [slay, slew, slough], +slewed, slewing, +slews -> [slew, slough], slued, slues -> [slough], sluing +

+ + + + + + + + +

where 4 uncommon words mostly pronounced sloo have become +thoroughly confused.
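Because a cross-reference can list several headwords separated by commas, splitting a line of related words on every comma would break entries like the ones above. The Python sketch below splits only on commas outside the square brackets. It assumes the `word -> [a, b]` notation exactly as shown in the examples; the function name is hypothetical.

```python
import re

# A comma that is not followed by a closing bracket before the next opening one,
# that is, a comma outside any [...] cross-reference.
ENTRY_SPLIT = re.compile(r",\s*(?![^\[]*\])")
XREF = re.compile(r"^(?P<word>.+?)\s*->\s*\[(?P<targets>[^\]]*)\]$")


def parse_related(line):
    """Return a list of (word, cross_referenced_headwords) pairs."""
    result = []
    for item in ENTRY_SPLIT.split(line.strip()):
        item = item.strip()
        if not item:
            continue
        match = XREF.match(item)
        if match:
            targets = [t.strip() for t in match.group("targets").split(",")]
            result.append((match.group("word"), targets))
        else:
            result.append((item, []))
    return result


if __name__ == "__main__":
    line = ("    slew -> [slay, slew, slough], slewed, slewing, "
            "slews -> [slew, slough], slued, slues -> [slough], sluing")
    for word, targets in parse_related(line):
        print(word, targets)
```

The same function can be applied to a headword line such as `slue -> [slough]`, which yields a single pair.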

+ + + + + + + + +

The 2+2+3frq list

+ + + + + + + + +In the previous version of 12dicts, there was +a file called +2+2gfreq.txt. This file has been completely replaced by a new +implementation of the same idea. Like the older list, the 2+2+3frq list +presents the lemmas of 2+2+3lem in bands of lemmas +with about +the same frequency of use. However, there are the following major +differences from what was done before:
+ + + + + + + + +
+ + + + + + + + + +The lemmas of 2+2+3frq are grouped into bands by the +combined +number of occurrences in the BYU data of the words in the lemmas. Band +21 contains lemmas whose words together appear between 16 and 31 times +in the BYU data. Each other band contains lemmas of twice the frequency +of the following band, that is, each lemma in band 20 appears in the +BYU data between 32 and 63 times, and so on. The first band contains +the three lemmas most frequently used in the English language +(according to BYU), namely the, +be (plus its +inflections) and to. +As already noted, some words are found in multiple lemmas. One helpful +aspect of the BYU data is that it separates frequency data for a word +by parts of speech, and notes the base word for inflected words. This +often allows the frequency counts for a word like building to be +accumulated under the correct lemma (either build or building). +In the event that the BYU data is unable to completely resolve the +appropriate lemma for a word, its frequency count is divided equally +among the various candidates.
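The banding rule above amounts to counting how many times the combined count can be halved before it drops below 16. The Python sketch below is one reading of that rule, with three assumptions that the text does not state: counts below 16 receive no band, fractional counts (from the equal division among candidate lemmas) are rounded down, and band 1 has no upper limit. The function names are hypothetical.

```python
def band_for_count(count, last_band=21, last_band_floor=16):
    """Return the band number for a combined occurrence count, or None if too rare."""
    whole = int(count)  # assumption: fractional (split) counts are rounded down
    if whole < last_band_floor:
        return None
    # Each doubling of the count moves the lemma one band closer to band 1.
    doublings = (whole // last_band_floor).bit_length() - 1
    # Assumption: band 1 is open-ended, so never go below it.
    return max(1, last_band - doublings)


def split_count(count, candidate_lemmas):
    """Divide an unresolved word's count equally among its candidate lemmas."""
    share = count / len(candidate_lemmas)
    return {lemma: share for lemma in candidate_lemmas}


if __name__ == "__main__":
    for n in (15, 16, 31, 32, 63, 64):
        print(n, band_for_count(n))
    print(split_count(90, ["build", "building"]))
```

With these definitions, counts of 16 to 31 fall in band 21 and counts of 32 to 63 in band 20, matching the figures given above.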
+ + + + + + + + +
+ + + + + + + + +2+2+3frq is divided into bands by lines like this:
+ + + + + + + + +
+ + + + + + + + +----- 5 -----
+ + + + + + + + +
+ + + + + + + + +
The lemmas in each band are presented in alphabetical +order, not by the frequency of the individual lemma.
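Splitting the file into bands could be sketched as follows in Python. The sketch assumes that each separator line consists of dashes, a band number and more dashes, as in the example above, and that every other non-blank line belongs to the most recent band. The function name is hypothetical.

```python
import re

BAND_LINE = re.compile(r"^-+\s*(\d+)\s*-+$")


def split_into_bands(lines):
    """Return {band_number: [lines of that band]}, preserving indentation."""
    bands = {}
    current = None
    for line in lines:
        text = line.rstrip("\n")
        match = BAND_LINE.match(text.strip())
        if match:
            current = int(match.group(1))
            bands[current] = []
        elif current is not None and text.strip():
            bands[current].append(text)
    return bands


if __name__ == "__main__":
    sample = ["----- 1 -----", "be", "    am, are", "the", "----- 2 -----", "have"]
    print(split_into_bands(sample))
```

Keeping the indentation intact means the lines of each band can still be told apart as headword lines and related-word lines.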
+ + + + + + + + +
+ + + + + + + + +Note that because the BYU data was extracted from a corpus of American +English, the 2+2+3frq file tilts in an American direction, though some +British words like bloke, +colour and lorry have made it +through.
+ + + + + + + + +
+ + + + + + + + +A useful attribute of the BYU +data is that it, +unlike the Google data, includes hyphenated words, as well as some +abbreviations, contractions and capitalized words. The two cases are +rather different. The inclusion of hyphenated words is explicitly +intended. However, the BYU documentation states that proper names have +been excluded where possible, while admitting that, in many cases, the +software processing the data was unable to be sure whether a word was a +proper name or not, in which case the word was included. The effect is +that there are many words generally considered to be proper names +present, notably the names of months of the year and days of the week, +plus those of religions, nationalities and ideologies. You will not +find names like linda, +picasso, vladivostok, microsoft or rumpelstiltskin in +the data, but you will find november, +buddhist, peruvian and marxist, +to the extent that I wonder if BYU might have used a different +definition of "proper name" than the one I was taught in school. As for +abbreviations, the BYU documentation makes no mention of them, but +there are some very familiar abbreviations in the data. There are not a +lot of them, which makes me wonder whether their presence was +intentional or a processing error. Either way, I have no reason to +doubt their frequency counts.
+ + + + + + + + +
+ + + + + + + + +I decided that I wanted to add high-frequency hyphenated words, proper +names and abbreviations to the frequency list, as I consider this data +to be very interesting. When I did so, I discovered in band 17 the +words atlantean +and klingon. +I really don't think that these words have anywhere close to the same +frequency as armband +and carpool, +which are also present in band 17. This makes me suspect that, for +words of this frequency or less, the BYU data is starting to become +less reliable. For this reason, I decided to stop adding hyphenated +words, capitalized words, contractions and abbreviations after band 17.
+ + + + + + + + +
+ + + + + + + + +In the case of hyphenated words, I added them to the 2+2+3frq list only +if they were present in either 2of12.txt or 3of6all.txt. I also added +these words to the 2+2+3lem list. In the case of abbreviations and +capitalized words, there were not all that many of them, and some of +them were not present in any other 12dicts list, such as Americanist, Thatcherism and, of +course, Klingon. +For this reason, when I added capitalized words, contractions and +abbreviations to 2+2+3frq, I parenthesized them to indicate that their +presence had nothing to do with any source but the BYU data. The same +consideration led me to omit these words from the 2+2+3lem list.
+
+I should note that, though the BYU data is superior to the previous +Google web data, it is not without its flaws. Three issues of +particular importance are difficulties with part of speech information +for words like painting and filling, an inconsistent approach to words which are also proper names like rose, king and miller, and a tendency to combine data for words and common acronyms, such as eta/ETA and sac/SAC. +I have attempted to tweak the frequencies in such cases, using various +public word frequency sources, whenever I observed them, which is to +say whenever the results of taking the BYU data at face value led to +implausible results.
+ + + + + + + + +
+ + + + + + + + +The 2+2+3frq list is considerably smaller than the previous 2+2gfreq +list due to my decision to drop lemmas which were absent from the BYU +data, especially since the BYU data was considerably less voluminous +and so left out many more words than the Google data. In addition, I +observed that many high-frequency lemmas contained unusual spellings +and archaic forms that were not present in the BYU data, such as cocoanut, iodin and didst, +and decided to drop non-headwords from the lemmas unless their +frequency was at or above the level of band 17. A similar decision was +made to drop regular inflections from the lemmas in the 2+2+3frq list +unless they had high frequency with a different part of speech, for +example, loving +as an adjective or fighting +as a noun. Finally, I chose to drop the word/lemma cross-references +from the 2+2+3frq list, replacing them with a * indicating that a word +was to be found under another headword (though it might have been +suppressed if it was a regular inflection).
+ + + + + + + + +
+ + + + + + + + +As an example of how this works out in practice, here is the lemma for time from 2+2+3lem:
+ + + + + + + + +
+ + + + + + + + +time
+ + + + + + + + +    timed, timeless, timelessly, timelessness, times, timing -> +[timing]
+ + + + + + + + +
+ + + + + + + + +and here is the condensed version from 2+2+3frq.
+ + + + + + + + +
+ + + + + + + + +time
+ + + + + + + + +    timed, timeless
+ + + + + + + + +
+ + + + + + + + +
The words timelessly +and timelessness +are not used often enough (according to BYU) to mention in the +frequency list, while the word times +was not frequently used except as a form of time, and, while the +word timing +was frequently used as a noun, its counts were collected under the +lemma timing +rather than time.
+ + + + + + + + +
+

The 2+2+3cmn list

+ + + + + + + + +The 2+2+3cmn list is a relatively simple transformation of +the +2+2+3frq list, in yet another attempt to produce a "core English" word +list. It is composed of the lemmas of the 2+2+3frq list from bands 1 +through 17, sorted in alphabetical order by headword. Minor formatting +differences are that the "!" is removed from neologisms, and +the +parentheses are removed from capitalized words, abbreviations and +contractions.
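The transformation described above can be sketched in Python, starting from lemmas that have already been grouped by band. The input shape (a dictionary from band number to a list of headword and related-word pairs), the case-insensitive sort and the exact placement of "!" and parentheses around a word are assumptions; the function names and sample entries are hypothetical.

```python
def clean(word):
    """Drop a trailing "!" and surrounding parentheses from one word."""
    word = word.strip().rstrip("!")
    if word.startswith("(") and word.endswith(")"):
        word = word[1:-1]
    return word


def to_common(bands, max_band=17):
    """Collect the lemmas of bands 1..max_band, sorted by headword."""
    selected = []
    for band, lemmas in bands.items():
        if 1 <= band <= max_band:
            for headword, related in lemmas:
                selected.append((clean(headword), [clean(w) for w in related]))
    # Assumption: alphabetical order ignores upper and lower case.
    return sorted(selected, key=lambda lemma: lemma[0].lower())


if __name__ == "__main__":
    bands = {
        3: [("time", ["timed", "timeless"])],
        17: [("(Klingon)", []), ("newword!", [])],  # hypothetical formatting
        18: [("rareword", [])],                      # dropped: beyond band 17
    }
    for lemma in to_common(bands):
        print(lemma)
```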
+ + + + + + + + +
+ + + + + + + + +I have added 77 signature words to 2+2+3cmn, which are +abbreviations, contractions and capitalized words (mostly +contractions) which I know to be extremely high frequency, but which +were not present in the BYU data, words such as can't, Mr. and DVD. These words are +marked with a + to indicate their absence from the 2+2+3frq source data.
+ + + + + + + + +
Like 2+2+3frq, 2+2+3cmn tilts strongly in the direction of American English.
+ + + + + + + + +
Because all the words of 2+2+3cmn are of moderately high frequency (assuming the BYU data is to be trusted), it probably has a better claim than either 2of5core or 3esl to representing a true core English vocabulary, at least of the American variety.
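The transformation from 2+2+3frq to 2+2+3cmn described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be already-parsed (band, headword) pairs, since the on-disk layout of the frequency list is not described here, and the sort is assumed to ignore case. The name build_common_list is hypothetical, and the 77 added signature words are not handled.

```python
def build_common_list(banded_headwords, max_band=17):
    """Keep headwords from bands 1..max_band, clean their markers, and sort them.

    banded_headwords: iterable of (band_number, headword) pairs (assumed input shape).
    """
    kept = []
    for band, headword in banded_headwords:
        if 1 <= band <= max_band:
            # Drop the "!" neologism marker and the parentheses around
            # capitalized words, abbreviations and contractions.
            cleaned = headword.replace("!", "").replace("(", "").replace(")", "")
            kept.append(cleaned)
    # Case-insensitive alphabetical order is an assumption of this sketch.
    return sorted(kept, key=str.lower)
```

For example, the invented input [(1, "the"), (12, "(Monday)"), (16, "selfie!"), (18, "zither")] would give Monday, selfie, the, with zither dropped because it lies beyond band 17.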
+

Specialized 12dicts lists

The following table summarizes the contents of each of the lists in the Special directory, ordered by size in words:

                     neol2016    2of5core    6phrase
Size (Words)         600         4,700       22,000
Number of Sources    0           5           6
American English     Y           Y           Y
British English      A little    Y           Y
Ordinary words       Y           Y
Inflections          Y
HyphenationsYA few
Phrases              Y           A few       Y
Names                Y           A few       A few
Abbreviations        Y           A few       A few
AcronymsYA few
Prefixes/Suffixes
Signature words                              *
Neologisms           Y
Annotations          Y           N           Y
+ + + + + + + + +

+ + + + + + + + +

A * in the "Signature words" row means that signature words associated with some other list may be present, but there are no signature words associated specifically with that list.

+ + + + + + + + +

The neol2016 list

The neol2016 list is a very simple list of new or newly recognized words, as described above. It consists of three parts, separated by blank lines.
+ + + + + + + + +
The first part lists regular (non-hyphenated, non-capitalized) words together with their inflections and variants, laid out similarly to the 2+2+3lem list. It includes plurals for uncountable nouns, marked with a "%" suffix. These words (except for the uncountable plurals) have been pre-added to the 2of12inf and 3of6game lists, suffixed with "!", allowing them to be easily removed if desired.
+ + + + + + + + +
The second part of the file is a small set of words for which additional inflections have been added. This portion of the file is in the same format as the first part. These inflections have also been added to the 2of12inf and 3of6game lists.
+ + + + + + + + +
The third part of the file contains new words and phrases which are not regular words: hyphenated words, multi-word phrases, proper names, abbreviations and acronyms. These words have not been pre-added to any other list.
+ + + + + + + + +
In all cases, users are encouraged to add some or all of these words to any of the other lists, as they feel appropriate.
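For anyone processing these files programmatically, here is a minimal Python sketch of the two operations this section implies: splitting neol2016 into its blank-line-separated parts, and removing the pre-added neologisms from a list such as 2of12inf. It assumes that the "!" marker is the last character of a marked entry. The function names split_parts and drop_neologisms are hypothetical.

```python
def split_parts(text):
    """Split file text into parts separated by one or more blank lines."""
    parts, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the part collected so far.
            parts.append(current)
            current = []
    if current:
        parts.append(current)
    return parts


def drop_neologisms(words):
    """Remove entries carrying the "!" suffix (assumed to be the final character)."""
    return [w for w in words if not w.rstrip().endswith("!")]
```

With neol2016 itself, split_parts should return three parts in the order described above. drop_neologisms is meant for the lists to which the new words were pre-added, not for neol2016.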
+ + + + + + + + +
+

The 2of5core list

Five of the six advanced learner's ESL dictionaries from which the 3of6 lists were compiled mark a subset of their words as being important words which every student of English should master. These subsets vary widely from dictionary to dictionary. As one of the original goals of the 12dicts project was to compile a list representing the English core vocabulary, I thought it would be interesting to combine these lists. My original thought was to provide a list that was simply the union of the marked subsets for each source. However, one particular dictionary had at least twice as many words in its subset as any of the others, and in many cases the words seemed to me to be poorly chosen. (Do moor and cash flow seem like key English language concepts to you?) So I chose when assembling my list to require that all words be marked as important words by at least two of the sources. The result was the 2of5core list, which contains about 4,700 words.
+ + + + + + + + +
While most words selected in this way were the same in American and British English, some belonged to one variant or the other. In some cases, a word appeared in two forms, such as center and centre. When I observed that a word was present in two forms, I combined them into a single line, for example center/centre. No other changes were made to the list.
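The selection rule is simple enough to sketch in Python. The following is an illustration only, not the procedure actually used: it assumes each source's marked subset is available as a collection of strings, and that the American/British pairs to be combined are supplied by hand (as they were in practice). The names core_words and merge_variants are hypothetical.

```python
from collections import Counter


def core_words(marked_sets, min_sources=2):
    """Return words marked as important by at least min_sources of the sources."""
    counts = Counter()
    for marked in marked_sets:
        counts.update(set(marked))  # count each word once per source
    return sorted(word for word, n in counts.items() if n >= min_sources)


def merge_variants(words, variant_pairs):
    """Combine pairs such as ("center", "centre") into one "center/centre" entry.

    A pair is merged only when both forms are present in the list.
    """
    merged = set(words)
    for american, british in variant_pairs:
        if american in merged and british in merged:
            merged -= {american, british}
            merged.add(f"{american}/{british}")
    return sorted(merged)
```

Using the union of the subsets instead would correspond to min_sources=1, which is the approach rejected above.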
+ + + + + + + + +
Due to the way in which the list was constructed, it seems somewhat haphazard. You may want to check out the Oxford 3000™, a list of 3000 words available from Oxford University. It is a core vocabulary created by lexicographers and, to my eye, superior to the 2of5core list.
+ + + + + + + + +
+

The 6phrase list

When I was compiling the 3of6all list, I noticed something interesting. There were an extraordinary number of phrases listed by only one of the sources. Many of these were extremely common phrases, which I would expect most experienced English speakers to understand. So, naturally, I decided to compile them all into a list.
+ + + + + + + + +
The 6phrase list contains all multi-word phrases from any of the six advanced learner's dictionaries which were used as sources for 3of6all, all 22,000 of them. The list does not include inflections, except in a few cases where a plural cannot easily be guessed from the words in a phrase. Usually, this happens for phrases of non-English origin, such as eau de cologne, whose plural is eaux de cologne. The list includes phrasal verbs, which are suffixed by the ";" character, as in the 3of6all list. The list is sorted in a different order than the lexicographical ordering used by the other lists, in order to group all phrases starting with the same word together.
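As a hedged sketch of how the list might be consumed, the Python below strips the ";" phrasal-verb marker and groups phrases by their first word, mirroring the grouping that the sort order is meant to achieve. It assumes one phrase per line, and it lowercases the first word when grouping, which is a choice of this sketch rather than a documented property of the list. The name group_by_first_word is hypothetical.

```python
from collections import defaultdict


def group_by_first_word(lines):
    """Group phrases by first word; each entry is (phrase, is_phrasal_verb)."""
    groups = defaultdict(list)
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        # A trailing ";" marks a phrasal verb, as in the 3of6all list.
        is_phrasal_verb = entry.endswith(";")
        phrase = entry.rstrip(";").strip()
        if not phrase:
            continue
        first_word = phrase.split()[0].lower()
        groups[first_word].append((phrase, is_phrasal_verb))
    return dict(groups)
```

Because the file already keeps phrases with the same first word together, a single pass like this preserves the file's own ordering within each group.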
+ + + + + + + + +
You will observe that the same phrase will often be repeated several times in the list, with slightly different spelling, capitalization and/or hyphenation. No attempt was made to edit the list to remove or reduce such "clutter".
+ + + + + + + + +
The 6phrase list includes the 3of6all signature phrases. These are not marked with a suffix.
+ + + + + + + + +
In contrast to most of the other lists, I am unable to think of any applications of the 6phrase list. But I find it rather interesting, which is why I'm bothering to include it. At the very least, it may serve as an illustration of the incredible richness of the English language, without even venturing into vocabulary too esoteric to be included in a learner's dictionary.
+ + + + + + + + +
+

How 12dicts came to be

+ + + + + + + + +

It may have occurred to some to wonder about how something like the 12dicts project came to be (though I assume that anyone who bothers to download this archive must already have some idea that such a project could be of interest).

+ + + + + + + + +

Many years ago, there was a post to the sci.crypt Usenet newsgroup, on the subject of creating PGP passphrases using randomly selected entries from a supplied list of very short words. (If this sounds interesting, follow this link for an expanded version of the post.) The word list, which was extracted from /usr/dict/words on some UNIX system, seemed to me ill-suited to its intended purpose. It included arcane acronyms (bstj, fmc), misspellings (diety, ouvre) and words of amazing obscurity (bhoy, kombu). I decided I could do better, and eventually did. This caused me to start downloading English word lists, of which there were many, from the Internet. I was not impressed by the overall quality of these lists, and the few which were high-quality were all-inclusive, burying the everyday words under a mountain of archaisms and esoterica.

+ + + + + + + + +

This was a long time ago, and an Internet search for word lists now turns up lists of higher quality than back then (thanks in part to the influence of 12dicts), so I will limit myself to two brief criticisms of the various lists available at that time. First, they contained far too many misspellings and typos, and had obviously never been proofread. Additionally, their approach to vocabulary was scattershot, omitting common words while adding a random selection of highly technical words, often associated with UNIX and academic computer science. (My favorite is the list which included bremsstrahlung, but omitted log and beer.) Due to my original purpose of finding a list of short, common words, I found this sort of thing particularly frustrating.

+ + + + + + + + +

One result of my frustration with this situation was my working with Mendel Cooper on ENABLE, a large Scrabble®-oriented list, which was close to unique in having an active caretaker who was clearly concerned with quality, and in being oriented towards American rather than British English. But ENABLE was an all-encompassing list and, even if it had been complete at the time I started my search for a list of common words, it would not have been what I wanted for that reason. (The ENABLE web site is no longer online, but a Google search will turn up places where you can still download it.)

+ + + + + + + + +

I finally decided that only starting from scratch with a systematic approach was likely to get me what I was looking for, and that dictionaries intended for non-native speakers of English were the best possible source for words that are in some cases so familiar that we never think of them. This has led to the 12dicts lists, which I hope have managed to avoid the flaws recited above.

+ + + + + + + + +

My other projects

During the intervals between releases of 12dicts, I have been fooling around with English spelling reform. One of the results of this activity is the development of CAAPR and ABCD, both of which may be downloaded from my website, www.wyrdplay.org. CAAPR is the Combined Anglo-American Pronunciation Reference, a fancy name for a bi-dialectal pronunciation dictionary whose word list is derived primarily from the 12dicts 6of12 list. ABCD, Alan's Basic Codes with Diacritics, is also a pronunciation dictionary, of a somewhat different sort - the notation is designed to clarify when a word is spelled in accordance with normal English spelling patterns (as with fault or tunnel), and when it is not (as with fought or colonel). Though these files were developed as a result of my interest in spelling reform, they may be of interest to other "word nerds" unconcerned with that particular quixotic pastime.

Click the following links to CAAPR and ABCD if interested.

+ + + + + + + + +

Conclusions

+ + + + + + + + +

When I released the first version of 12dicts in 1999, I assumed I was done with it. It hasn't worked out that way. I now think I'm pretty much done with it again, though an occasional update to neol20xx.txt might be called for. Perhaps in ten more years I'll have reached version 9, and be laughing uncontrollably at the thought that I might have finished earlier, but for the present I don't see what else might be both useful and fun to add.

+ + + + + + + + +

Feel free to send comments, suggestions, inquiries and/or large sums of money to me at 12dicts@pobox.com. (Actually, the bit about money is a joke. Do not send me even small amounts of money; 12dicts is free wordware.) After making this request in previous versions, I have been delighted to see the interest in these lists for projects ranging from interactive games to literacy programs. And I have been particularly pleased to occasionally hear of first-year Computer Science assignments specifying a 12dicts list rather than /usr/dict/words for their input. Keep up the good work, and do let me know what you're doing. (Oh, and please put "12dicts" in the subject line when you email me. This will allow me to easily notice your mail even if it is misclassified by an overzealous filter as spam. Speaking of spam, the publication of my email address in this package has led to a marked increase in the amount of spam I receive and, ironically, much of it contains subject lines which appear to have been extracted at random from my own lists. This is a use of 12dicts of which I do not approve!)

+ + + + + + + + +

The 12dicts lists were compiled by Alan Beale. I explicitly release them to the public domain, but request acknowledgment of their use. (Actually, the dependency of the 2of12inf list and the 2+2+3 lists on AGID prevents their release into the public domain. However, I do not impose any additional requirements on their use beyond those imposed by AGID and its sources, as described in agid.txt.)

+ + + + + + + + +

- Alan Beale -

+ + + + + + + + + +