# HG changeset patch
# User HackEso
+Welcome to version 6.0.2 of 12dicts, a
+collection of English word lists. It differs in several important
+ways from most of the other free word lists you can download.
+
+Originally, 12dicts was composed of lists derived from a specific set
+of 12 source dictionaries. In addition to these "classic" lists,
+12dicts now includes lists derived from other sources. It would
+perhaps be appropriate to rename 12dicts to something more generic,
+such as BAWL (Beale's Assorted Word Lists), but I have not done so, in
+order to preserve continuity.
+
+The remainder of this document is organized as follows:
+
+This is release 6.0.2 of 12dicts, released June 2016.
+This is an update to release 6.0. The following is a brief rundown of the
+changes and additions in release 6.0 and beyond:
+
+The 12dicts project began as the n-dicts project, n being a variable
+whose value finally stabilized as 12. The purpose of the project was to
+create a list of words approximating the common core of the vocabulary
+of American English.
+
+The methodology of the project was to record and correlate the words
+listed in a number of small dictionaries. The number of dictionaries
+so recorded ended up as 12, comprising 8 ESL (English as a Second
+Language) dictionaries and 4 "desk dictionaries". The dictionaries chosen
+varied widely by publisher, by style, by completeness and by depth. All
+of them were dictionaries of American English (three from British
+publishers). The smallest of them contained about 20,000 entries, and
+the largest 46,000. (All told, there are about 75,000 entries, many of
+which appeared in only a single dictionary.) All but two of the sources
+were published between 1992 and 1999, when 12dicts was first released.
+
+The following table summarizes the contents of each of the classic
+lists, located in the American directory, ordered by size in words:
+
+Introduction
+
+ This release
+
+Some general observations
+
+With the exception of the neol2016 list, all the 12dicts
+lists were assembled in a similar fashion. Words were extracted from a
+set of source dictionaries and, in most cases, a list was assembled by
+selecting all words and phrases present in some number of the sources
+meeting certain criteria. For instance, the 2of12 list comprises
+lower-case and hyphenated words present in at least two of twelve
+source dictionaries. For some lists, rules are added establishing
+exceptions for certain words or classes of words - for instance,
+the 2of12 list contains the upper-case words I and O as exceptions to
+its general exclusion of upper-case words and names.
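As an illustration only, the "present in at least N of the sources" rule described above can be sketched in Python. The function name `select_words`, the toy source sets, and the decision that exception words must still meet the source count are all assumptions made for this sketch; the document does not describe the tooling actually used to build the lists.

```python
from collections import Counter

def select_words(sources, min_sources, exceptions=()):
    """Return sorted words listed in at least `min_sources` of the sources.

    `sources` is a list of sets, one per (hypothetical) source dictionary.
    Only lower-case entries made of letters and hyphens qualify, plus any
    explicitly listed exceptions (assumed to need the same source count).
    """
    counts = Counter(word for source in sources for word in set(source))
    selected = set()
    for word, n in counts.items():
        if n < min_sources:
            continue
        is_plain = word.replace("-", "").isalpha() and word == word.lower()
        if is_plain or word in exceptions:
            selected.add(word)
    return sorted(selected)

# Toy data standing in for three source dictionaries (not real 12dicts data).
sources = [
    {"cat", "well-being", "I", "Boston"},
    {"cat", "well-being", "I", "zebra"},
    {"cat", "Boston", "dog"},
]
result = select_words(sources, min_sources=2, exceptions={"I", "O"})
print(result)
```

In this toy run, "Boston" is dropped despite appearing in two sources because it is capitalized, while "I" survives only because it is named as an exception.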
+
+Some lists contain annotations, which are special characters
+appended to certain words. For instance, the ":" character is used in
+some lists to identify abbreviations which are ordinarily used without
+a terminating period. This annotation allows these abbreviations to be
+distinguished from possibly similar regular words. Another annotation,
+used in the 3of6game and 3of6all lists, is the "$" character,
+indicating a word that was placed in the list even though fewer than
+three of the sources mention it. The "+" and "!" annotations are used
+to identify signature words and neologisms, as described below. Note
+that it is possible for a word to have more than one annotation, though
+this is uncommon. For instance, in the 6of12 list, the entry
+boldfaced~= has both a "~" and a "=" annotation, signifying that the
+word was an arbitrary choice between two equally attested forms
+(boldfaced and bold-faced), and that it was not given a separate
+definition in a majority of the sources listing it.
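A program reading an annotated list will usually want the bare word and its annotations separately. The following is a minimal sketch, assuming that annotations are always trailing characters and that the set below (collected from the characters mentioned in this document) is complete for the list being read; `ANNOTATION_CHARS` and `split_annotations` are names invented for this example.

```python
# Annotation characters mentioned in this document. Treat the set as an
# assumption and adjust it for the particular list you are reading.
ANNOTATION_CHARS = set(":&#=<^~+!$%")

def split_annotations(entry):
    """Split a list entry into (word, annotations).

    Annotations are assumed to be a run of trailing characters, so the
    entry is scanned from the right until a non-annotation character.
    """
    entry = entry.strip()
    end = len(entry)
    while end > 0 and entry[end - 1] in ANNOTATION_CHARS:
        end -= 1
    return entry[:end], entry[end:]

print(split_annotations("boldfaced~="))  # ('boldfaced', '~=')
```

Scanning from the right leaves internal hyphens and other characters inside the word untouched.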
+
+A number of the lists contain signature words. These are words (or
+phrases) which do not meet the formal criteria for inclusion in a
+list, but which I have chosen to add anyway, as words which "ought to
+be" present. Whether a list contains signature words depends on the
+specific list. Usually, but not always, a signature word is present in
+some of
+the sources used for a list, but not enough of them to qualify for
+inclusion on that basis. Some lists may "inherit" signature words from
+other lists from which they were assembled. For instance, the 6phrase
+list includes the signature words from the 3of6all list. In most
+cases, signature words are marked with the "+" annotation.
+
+The neol2016 list contains neologisms, words which are not listed in
+some or all of the source dictionaries for 12dicts, generally for one
+of two reasons. First, many of the words are recent coinages which were
+not yet fully recognized by mainstream lexicographers when the 12dicts
+sources were published. Examples of such words are selfie, Obamacare,
+emoji and snarky. Other so-called neologisms are well-established,
+often well-known, words which are considered scandalous, such as sexual
+slang and ethnic slurs, and which are often deliberately omitted from
+dictionaries. (I will not give any examples of this sort of word here,
+but you will find some in the neol2016 list.) Note that the neologism
+list has been accumulating for about fifteen years now, and some of its
+words have become almost old-fashioned, such as spam and dotcom. The
+neologism list is provided so that some or all of its words can be
+added to the other lists where the intended usage makes that
+appropriate. However, I have added the single-word neologisms to the
+2of12inf and 3of6game lists, as these are the lists most likely to be
+used in coding word games, where it is desirable to recognize the very
+latest hot vocabulary. In these lists, neologisms are annotated with
+the "!" character.
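For applications that would rather not accept the newest vocabulary, the "!" marker makes the neologisms easy to separate out. This is a hedged sketch rather than a documented procedure: `partition_neologisms` is a hypothetical helper, and it assumes that "!" appears only among the trailing annotation characters of an entry.

```python
def partition_neologisms(entries):
    """Split entries into (regular, neologisms) using the "!" annotation.

    Trailing annotation characters are stripped from neologisms; regular
    entries are returned unchanged.
    """
    annotation_chars = ":&#=<^~+!$%"  # assumed set of trailing markers
    regular, neologisms = [], []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines
        word = entry.rstrip(annotation_chars)
        suffix = entry[len(word):]
        if "!" in suffix:
            neologisms.append(word)
        else:
            regular.append(entry)
    return regular, neologisms

# Toy entries in the style of an annotated list (not real list contents).
sample = ["selfie!", "cat", "emoji!", "dog", ""]
regular, neologisms = partition_neologisms(sample)
print(regular, neologisms)
```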
+
+One other observation worth making is about diacritics. Some
+dictionaries will tell you that there are English words correctly
+spelled café, naïve, façade and piñata, and I do not wish to disagree
+with these authorities. But as a practical matter, Americans do not
+like to use diacritics. Furthermore, they use keyboards which do not
+contain accented letters, and are often unfamiliar with the clumsy
+techniques that their software provides for entering such characters.
+For this reason, 12dicts drops all the accents from its English
+vocabulary. This is particularly valuable for coding word games, where
+expecting players to accent the e in cafe is not going to make them
+happy. (I cannot help pointing out that Scrabble® contains no É
+tiles.) I apologize to those who consider it a matter of some
+emotional importance that resume and résumé should be differently
+spelled.
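If you want to compare user input or another word list against 12dicts, the same accent-dropping can be applied on your side. One common way to do this in Python, shown here as a sketch and not as the method used to prepare the lists, is to decompose each character with `unicodedata.normalize("NFD", ...)` and discard the combining marks. The function name `strip_diacritics` is an assumption of this example.

```python
import unicodedata

def strip_diacritics(word):
    """Return `word` with accents removed, e.g. 'résumé' -> 'resume'.

    NFD normalization splits an accented letter into its base letter
    followed by combining marks; the marks are then filtered out.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for w in ["café", "naïve", "façade", "piñata", "résumé"]:
    print(w, "->", strip_diacritics(w))
```

Letters that are not base-plus-accent combinations (for example "ø" or "ß") are left unchanged by this approach and would need separate handling.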
+
+The organization of 12dicts
+
+The 12dicts lists are organized into four directories,
+grouping
+lists with similar characteristics together. The remainder of this
+document follows this organization as well. For each directory, a
+section of the documentation describes in detail the lists it contains.
+
+Most users of 12dicts end up using only a single list. If it is clear
+which directory will contain the list you need, you can go directly to
+the appropriate documentation.
+
+The four directories are:
+
+Picking a list to use
+
+If you are not certain which directory might contain the kind of
+list you are looking for, here is a breakdown of the 12dicts lists by
+size and purpose, which may be helpful. If it does not help you find
+what you are looking for, you might want to check out this table,
+which summarizes the characteristics of all the 12dicts files, put
+together by Kevin Atkinson. Also, I suggest reading the introduction to
+each directory presented in the previous paragraph, each
+of which contains a table summarizing exactly what you can expect from
+each list in that directory.
+
+The classic (American) 12dicts lists
+
+| | 3esl | 6of12 | 2of12 | 2of12inf |
+|---|---|---|---|---|
+| Size (Words) | 22,000 | 32,000 | 41,000 | 82,000 |
+| Number of Sources | 3 | 12 | 12 | 12 |
+| American English | Y | Y | Y | Y |
+| British English | – | – | – | – |
+| Ordinary words | Y | Y | Y | Y |
+| Inflections | – | – | – | Y |
+| Hyphenations | Y | Y | Y | – |
+| Phrases | Y | Y | – | – |
+| Names | Y | Y | – | – |
+| Abbreviations | Y | Y | – | – |
+| Acronyms | Y | Y | – | – |
+| Prefixes/Suffixes | – | – | – | – |
+| Signature words | – | Y | Y | * |
+| Neologisms | – | – | – | Y |
+| Annotations | Y | Y | Y | Y |
+
+A * in the "Signature Words" row means that signature words associated
+with some other list may be present, but there are no signature words
+associated specifically with that list.
+
+I initially tried two different ways of winnowing the 12dicts data to
+produce lists of common words. Both produced interesting results.
+One list, the 6of12 list, contained all words and phrases
+listed in 6 of the 12 dictionaries. One way of describing this list
+is that it contains those words and phrases which a (seeming) majority
+of lexicographers believe are relevant to people learning English,
+and/or to everyday usage. This list contained about 32,000 words and
+phrases. The other list, the 2of12 list, was more inclusive in that it
+included words listed in as few as two of the source dictionaries, but
+less inclusive in that it excluded items of various sorts, including
+multi-word phrases, proper names and abbreviations. This list contained
+about 41,000 words. It was likely more suitable for use in areas
+like spell checking or word games than the 6of12 list. (Honesty
+compels me to admit that neither of these lists is, by itself, a good
+choice for spell checking, due to the absence of inflections, proper
+names, Roman numerals, etc.)
+
+A third list, 2of12inf.txt, developed later, was of a rather different
+character, and is discussed in a later section.
+
+A more precise description of the criteria by which the above lists
+were composed is as follows:
+
+A much smaller set of words (49) was added to the 2of12 list. These
+were of two sorts:
+
+These annotations are:
+
+| Annotation | Meaning |
+|---|---|
+| : | The word is an otherwise unmarked abbreviation. This suffix always occurs before any other suffix. |
+| & | The word is primarily a non-American usage. |
+| # | The word is generally held to be a variant or less preferred form of another word. |
+| = | Roughly, this indicates a "second class" word, as described below. |
+| < | This form of a word is held to be the primary form by fewer dictionaries than some other form of the word. |
+| ^ | This form of the word was selected as the most commonly listed of a set of variant spellings. |
+| ~ | This form of a word is one of a set of variant spellings, none of which was clearly preferred. |
+| + | The word is a signature word. |
+
+The reasons a word might be marked with the = annotation are:
+
+The words in the 2of12 list are not annotated.
+
+The 2of12inf list is of a rather different character from the two
+original "classic" lists. Conceptually, it is simple. It consists of
+all the unhyphenated words in the 2of12 list, plus their inflections,
+amounting to about 82,000 words. This list may
+be more useful than the other lists for applications like word games.
+It was created to help Kevin Atkinson in his Aspell and SCOWL projects
+(for which, follow these links). Unlike the 6of12 and
+2of12 lists, this list was not based exclusively on the contents of my
+12 source dictionaries, and for this reason it has, I feel, less
+authority than the other classic 12dicts lists. It also probably has a
+significantly higher error rate than the other lists, for reasons
+explained below.
+
+The criteria defining the 2of12inf list are as follows:
+
+Though the 2of12inf list still consists mostly of very common words,
+criteria 3 through 5 above cause the 2of12inf list to contain a greater
+proportion of unfamiliar and unusual words than the other classic
+12dicts lists.
+
+The 2of12inf list was not derived directly from the 12 source
+dictionaries. The starting point was a subset of Kevin Atkinson's
+AGID list, a list of words, parts of speech and inflections derived
+from public-domain sources, notably Moby Words and WordNet. (See the
+file agid.txt in the 12dicts archive, which is a copy of the AGID
+"readme", for more information on the antecedents of AGID.) 2of12inf
+was created by a process of editing the AGID subset to remove spurious
+entries and those which reflected a more esoteric English vocabulary
+than the other 12dicts lists, and to add inflections which AGID failed
+to identify.
+This process required significantly less effort than would have been
+needed to derive the list directly from the source dictionaries.
+Unfortunately, a side effect of the process was that the result is
+probably somewhat less reliable than the other 12dicts lists.
+In particular, Moby Words is notoriously unreliable, and I find it
+unlikely that I have successfully identified all the spurious
+inflections its use has introduced. It would be nice to
+release another edition of 2of12inf which is not derived from AGID,
+and therefore not "infected" by Moby Words, but I haven't done so in 15
+years, and so it probably won't happen.
+
+After the first version of the 2of12inf list was released, I replaced
+one of the source dictionaries, officially an international dictionary
+but in actuality rather British in its orientation, with a more
+American dictionary by the same publisher. It was not practical
+(nor necessarily desirable) for me to go through the list removing
+inflections endorsed only by the superseded dictionary. For this
+reason, the 2of12inf list has a slightly more international character
+than the other 12dicts lists. It is not altogether clear that this
+is a bad thing.
+
+Ideally, the 2of12inf list would contain only inflections listed in
+one of the 12dicts source dictionaries. This proved not to be
+practical. The reason for this has to do with the nature of these
+sources, which are mostly ESL dictionaries. An ESL dictionary might
+well list the word esophagus, but, because an English learner is
+unlikely to need to talk about this organ in the plural, it will
+probably not bother to list the plural form esophagi. For words of
+this sort, I therefore needed to obtain their inflections from other
+sources. Obviously, the decisions on when to include additional
+inflections were judgment calls, as were the choices of which
+inflections to add.
+
+Adjectival inflections (comparatives and superlatives) proved to be
+an especially annoying problem. Only 2 of my 12 source dictionaries
+provided remotely reliable information of this sort. In fact, such
+information is sparse and inconsistent in most dictionaries of any
+size. I relied on a small set of additional dictionaries for this
+information, which was mostly disjoint from the sources for plurals
+and verb forms. Several of these sources were Scrabble®-related,
+and therefore inclined to include forms of little plausibility such
+as iller/illest or fertiler/fertilest.
+Accordingly, I ended up rejecting some of the documented inflections on
+grounds of implausibility. I have no doubt that, in the process, I made
+a number of errors of both inclusion and exclusion and, in any case,
+many of the forms listed have no connection with any of the 12dicts
+source dictionaries.
+
+One additional problem in the creation of the 2of12inf list was that
+of "uncountable" nouns and their plurals. Some English dictionaries,
+especially ESL dictionaries, as well as other linguistic sources
+attest to the existence of nouns which cannot be counted or used in
+the plural. Examples of such nouns include mud, rayon, oregano,
+chess, fairness, wisdom, aluminum, training, materialism
+and chickenpox. This is an entirely commonsense notion, but a
+difficulty is the fact that the boundary between the countable and the
+uncountable is extremely vague and ill-defined. For example, the word
+coffee is ordinarily uncountable, but not when ordering in a
+restaurant, as is the word symmetry, except in physics or math.
+In general, it is possible to contrive a context where use of the
+plural of any noun whatsoever is reasonable.
+
+An alternate position, therefore, is that in fact no nouns are
+uncountable, and that any noun which is not already plural possesses
+a plural. This position is especially useful in the context of word
+games, where words such as zeals and anthraxes
+may produce large scores. For this reason, the official Scrabble
+dictionaries list words such as thens, onces and
+mankinds, which most people find
+rather implausible. The fact that the 2of12inf list might well be
+useful in gaming contexts, together with the fact that the boundary
+between countable and uncountable nouns is so ill-defined, served as
+a powerful argument for inclusion of all plural forms, whether
+commonly used or not, while its derivation from ESL sources argued
+for including only the plurals of countable nouns, however
+distinguished.
+
+As I prepared the list for release, I was unable to resolve this
+dilemma, and adopted a
+compromise. The 2of12inf list includes all plurals, but with the
+plurals of uncountable nouns marked, making it easy to remove them
+if they are not wanted. That left the issue of how to establish
+countability. Six of my source dictionaries included information
+on countability, which was adequate to decide the status of most of
+the included nouns. As for the rest, as usual, I used my best
+judgment. I will confess to occasionally overriding the source
+dictionaries when I believed they were clearly incorrect. (For
+instance, I chose not to mark the word hatreds as an
+uncountable plural, in defiance of the opinion of all my sources,
+on the grounds that it has been used in too many news stories from
+Bosnia to be considered unusual.) It is interesting to note that
+most of the plurals I added from auxiliary sources were of words
+considered uncountable. I also note that at some point after the
+release of the 2of12inf list, I decided that it would have been better
+to have left the Scrabble plurals out, and, while I was not
+comfortable with removing them, no list I've created
+since then which lists inflections includes them.
+
+The difficulties listed above, and the fact that I was forced to
+exercise personal judgment frequently in creating it, emphasize a
+fundamental difference between this list and the other classic 12dicts
+lists. I have tried to make the 6of12 and 2of12 lists reflect only the
+source dictionaries, and to keep my own judgments and opinions out of
+the picture (except for my addition of signature words). This has
+proved impossible to achieve for the 2of12inf list, which accordingly
+represents a less authoritative and more arbitrary collection.
+Additionally, the 2of12inf list has undergone less proofreading and
+validation than the other lists, and I suspect the error rate is
+somewhat higher than the idealistic goal of 0.02% I adopted for this
+project. Nevertheless, I hope it may prove to be
+of some use and interest.
+
+I wish to offer my special thanks to Kevin Atkinson, for supplying me
+with the AGID list, and for encouraging me to add the inflections. Of
+course, any errors that remain in the 2of12inf list are my own
+responsibility, and should not be blamed on Kevin, AGID, or even on
+Moby.
+
+The 3esl list represents another attempt to produce an English "core
+vocabulary" list. It is about 2/3 of the size of the 6of12 list,
+which it resembles in terms of the sorts of words included.
+
+The 3esl list is a far more subjective list than any of the classic
+12dicts lists. It was compiled from 3 small ESL dictionaries, using
+the same criteria for eligibility as the 6of12 list. I started with
+a list composed of all words from the smallest of the 3 sources, plus
+all words contained in both of the others. This list was then edited
+in the following ways:
+
+All of these changes were quite subjective in nature, and quite
+numerous. Probably more than 10% of the candidate words were added
+or removed in this way. For this reason, it is pointless to speak
+of signature words for this list; the composition of the list is too
+arbitrary for the term to make any sense. (I will note that the list
+is still not entirely arbitrary, as I added only words found in
+some form in one of the sources, and removed no words present in two
+of the sources other than duplicates. Thus, words like front
+page were not added, no matter how familiar, and words such
+as lugubrious were not removed, despite clearly not being
+part of anyone's "core vocabulary".)
+
+Like the 6of12 list, the 3esl list marks lower-case abbreviations
+with a ":" suffix, to prevent them from being mistaken for regular
+English words.
+
+One final note on this list. The 3esl list contains about 1500 words
+not present in the 6of12 list. Because these two lists have the same
+rules for the kinds of words included, one could easily combine
+the two to produce a slightly larger list including a number of words
+whose omission from 6of12 is rather surprising. Be warned that in a
+few cases, the spelling chosen for words with multiple spellings is
+different in the two lists, and I would recommend that the duplicates
+be removed. (I'll be happy to provide a list of the duplicates if
+anyone wants one.)
+
+| | 2of4brif | 3of6game | 5d+2a | 3of6all |
+|---|---|---|---|---|
+| Size (Words) | 60,000 | 65,000 | 68,000 | 83,000 |
+| Number of Sources | 4 | 6 | 7 (+5 minor) | 6 |
+| American English | Some | Y | Y | Y |
+| British English | Y | Y | Y | Y |
+| Ordinary words | Y | Y | Y | Y |
+| Inflections | Y | Y | – | Y |
+| Hyphenations | – | – | – | Y |
+| Phrases | – | – | – | Y |
+| Names | – | – | Y | Y |
+| Abbreviations | – | – | – | Y |
+| Acronyms | – | – | Y | Y |
+| Prefixes/Suffixes | – | – | – | Y |
+| Signature words | – | Y | – | Y |
+| Neologisms | – | Y | – | – |
+| Annotations | – | Y | – | Y |
+
+All of the classic 12dicts lists are unabashedly oriented towards
+American English. After receiving a few expressions of interest in a
+British English list, I put together the 2of4brif list. This list
+was compiled from 4 large "international" ESL dictionaries, published
+by British publishers. To this American, they are more British than
+they are international; quite possibly, they seem more American than
+international to British readers. It is interesting to note that,
+although there were only a third as many sources for this list as for
+the 12dicts lists, these dictionaries resembled each other far more
+closely than their American counterparts, which could mean that the
+2of4brif list is as good an approximation of a "core" British English
+vocabulary as the 6of12 list is for American English. (Or, alternatively,
+it may simply mean that my choice of sources was too narrow.)
+
+The criteria for inclusion in this list were basically those of the
+2of12inf list. In particular, inflections are included for all words,
+but hyphenated words, contractions, phrases, proper names and
+abbreviations are all excluded. One important difference between
+the two is the way in which inflections were determined for inclusion.
+The 2of12inf list includes some inflections found in one (or even none)
+of its sources. Further, as discussed in detail above,
+it includes plurals for words which are not normally
+considered to have plurals. The 2of4brif list differs in both of
+these regards. It includes only inflections endorsed by two or more
+of the sources, specifically excluding any plural forms for nouns
+listed as uncountable.
+
+The 2of4brif list includes no signature words as such. I made a small
+number of adjustments for consistency, such as making sure that
+-ise and -ize spellings were equally
+represented, and adding plurals for ordinal numbers. (Why
+fourteenth would be defined as a fraction, but not
+seventeenth, I must simply regard as a mystery.) These
+edits were so few, and so clearly harmless, that I have not marked
+them.
+
+Prospective users of the 2of4brif list should realize that it was
+compiled by an American. If my sources contained any glaring errors
+(and most dictionaries have a few), I might well not have noticed,
+and perpetuated them in the list. The fact that two citations were
+required is some protection against such an event, but no guarantee.
+
+As the 2of4brif list is very similar in makeup to the 2of12inf list,
+a user who wants a larger, more international list than either could
+reasonably merge the two. If you do this, you should remove the
+unusual plurals (marked with a "%") from the 2of12inf list in the
+process, for consistency.
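The merge suggested above is simple to script. The sketch below is one possible reading of it, under stated assumptions: both lists are supplied as sequences of one-entry strings, "%" appears only as a trailing annotation, and the only other annotation to strip from 2of12inf entries is the "!" neologism marker. The function name `merge_lists` and the sample entries are hypothetical.

```python
def merge_lists(brif_entries, inf_entries):
    """Merge 2of4brif-style and 2of12inf-style entries into one sorted list.

    Entries from the second list whose trailing annotations include "%"
    (unusual plurals) are dropped; a trailing "!" is stripped.
    """
    merged = {entry.strip() for entry in brif_entries if entry.strip()}
    for entry in inf_entries:
        entry = entry.strip()
        word = entry.rstrip("%!")      # assumed annotation characters
        suffix = entry[len(word):]
        if not word or "%" in suffix:
            continue                   # skip blanks and marked plurals
        merged.add(word)
    return sorted(merged)

# Toy entries, not taken from the real lists.
merged = merge_lists(["colour", "cat"], ["cat", "muds%", "color", "selfie!"])
print(merged)
```

Using a set removes the words the two lists share, so each word appears once in the merged result.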
+
+Note that I have deprecated the 2of4brif list. I
+believe that any applications of this list would be better off using
+the 3of6game list in its place.
+ + + + + + + + ++I created the 5d+2a list (originally called 5desk) in an attempt to do +a better /usr/dict/words +(the failings of which were a large part of my motivation for doing +12dicts in the first place). +The sorts of words admitted are the same sorts that /usr/dict/words +traditionally contains. Though somewhat larger in size than many +versions of +/usr/dict/words, this is still a short word list, striving for +inclusion +of words one is likely to encounter rather than the complete jargon of +every possible scientific, artistic or occult endeavor. +
+ + + + + + + + +The original 5desk list was assembled primarily +from five "desk +dictionaries". It +was augmented by words from five minor sources, including a "vocabulary +builder" and a collection of proper names. It excluded +prefixes, suffixes, phrases, hyphenated words, contractions and most +abbreviations and acronyms. There was no requirement for multiple +listings; all qualifying words from each of the sources were included. +Inflections of included words were not included themselves except when +irregular, or separately defined. Variant and non-American spellings +were not excluded, and no signature words were added.
+ + + + + + + + +Words commonly considered to be +abbreviations/acronyms were included +if they contained at least one upper case character, and were defined +with an explicit part of speech. This excluded items like Mr +and +Feb, which are abbreviations in the classic sense, +but allowed words +like DNA and ATM, which are +used far more frequently than that +which they abbreviate. While there is a trend in modern dictionaries +to list such words as nouns (or occasionally verbs, adverbs, etc.), +it is a trend in progress, and rather inconsistently applied. For +this reason, the set of such words in the 5desk list is somewhat +incoherent, including SPCA but not PETA, +AIDS but not SAD, +KGB +but +not CIA, and PDQ but not ASAP.
+ + + + + + + + +When version 6 of 12dicts was released, the 5desk +list was +augmented by adding qualifying words from two advanced learner's ESL +dictionaries, and as a result renamed to 5d+2a.txt. Both of the +additional dictionaries had a strongly international vocabulary, +causing the new list to have a less American and more cosmopolitan +character. This increased the size of the list by about 20% to about +68,000 words.
+ + + + + + + + +One class of commonly-used words is regrettably +absent from the 5desk +list, because I was unable to find a satisfactory source for them. +This is the class of commercial names such as Exxon, Tylenol, +Pepsi and Chevy. This is probably +forgivable, +as this class of names is as ephemeral and transitory as teenage slang. +The one-time household words Kool, Ovaltine, Philco +and +Ipana serve now only as answers to trivia questions, +with modern wonders like Starbucks, Google, Ritalin +and TiVo taking their place on the tongues of the +trendy.
The 5d+2a list contains no signature words. I did take the liberty of adding the personal names of around thirty well-known individuals, mostly statesmen and politicians. Though the original 5desk list contained many such names from all periods of human history, I have not found a useful source to bring the list into the twenty-first century. At the same time, I felt that distributing a list full of names that did not include Cheney and Obama was not reasonable. So I compromised by adding a few names whose historical significance was clear to me, until such time as a better source than my own memories of the last 15 years can be found.
The 5d+2a list has clearly moved beyond any "core vocabulary" concept. It includes quite esoteric words (ogee, pleonastic), very uncommon spellings (thiamine, yuppy), and obscure geographical and historical names (Paricutin, Nevelson). Like the traditional /usr/dict/words, it is frequently inconsistent and arbitrary, but I hope at the least I have avoided including spelling errors, and overlooking the stuff of everyday conversation. Perhaps it will be useful as a compromise between basic lists such as 3esl, and truly massive lists like Mendel Cooper's ENABLE.
 | 2+2+3cmn | 2+2+3frq | 2+2+3lem |
---|---|---|---|
Size (Words) | 25,000 | 34,000 | 84,000 |
Number of Sources | 21 | 21 | 21 |
American English | Y | Y | Y |
British English | Some | Some | Y |
Ordinary words | Y | Y | Y |
Inflections | Some | Some | Y |
Hyphenations | Some | Some | Y |
Phrases | – | – | – |
Names | Some | Some | – |
Abbreviations | Some | Some | – |
Acronyms | Some | Some | – |
Prefixes/Suffixes | – | – | – |
Signature words | Y | * | * |
Neologisms | A few | A few | Y |
Annotations | Y | Y | Y |
A * in the "Signature Words" row means that signature words associated with some other list may be present, but there are no signature words associated specifically with that list.
The list 2+2+3lem.txt contains the words in the 2of12inf, 2of4brif and 3of6game lists. Also, the new words from the neol2016.txt list have been added, marked with a "!" if they would not have otherwise been included. (Marking the new words permits them to be removed if it is preferred for this list to be in synch with the other 12dicts lists.) Furthermore, some high-frequency hyphenated words from 2of12.txt and 3of6all have been added. These words were originally added to the lemmatized frequency list (see below), and I liked the results so much that I added them to this list as well. Finally, British forms of words in the 2of12inf list not already in the other lists have been added. Words marked with a % in the 2of12inf list ("Scrabble plurals") have however been omitted.
In the previous version of 12dicts, the 2+2+3lem list was called 2+2lemma. The only significant changes were the addition of new words, and switching from "+" to "!" to mark neologisms in the list.
The 2+2+3lem list is not formatted as a simple list of words. It is composed of entries of 1 or 2 lines each. The first line contains a headword, and the second line, which is indented if present, contains an alphabetized list of related words. A simple example:
funny
    funnier, funnies, funniest, funnily, funniness
The list of related words contains three sorts of entries.
1. Inflections.
2. Variant spellings.
3. Words formed with certain suffixes.
In addition to true variant spellings such as grey for gray and thru for through, item 2 also includes words which, though pronounced differently, are clearly variants of the headword. Thus, hooray is considered a variant of hurrah (but mere synonyms like furze and gorse remain independent).
Item 3 is based on a small list of suffixes, producing closely and consistently related words. These suffixes are -ful, -ish, -less, -like, -ly, -most and -ness. -ally is also allowed, if there is no -al word to apply the -ly suffix to. (For instance, basically is considered to be derived from basic, because there is no word basical.) When one of these suffixes is used in an unusual way, the resulting word is considered independent. For instance, likely is not considered to be derived from like, nor bashful from bash. There are some rather difficult questions here, such as how closely slavish is related to slave, or sluggish to slug. In general, I have chosen the course of least surprise by treating such pairs as independent.
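The suffix rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the procedure actually used to build the list: the function name `candidate_base`, the `lexicon` set and the hand-maintained `independent` set are all hypothetical, and the naive suffix stripping ignores spelling changes (such as happily from happy).

```python
# Suffixes named in the text; -ally is handled as a special case below.
SUFFIXES = ("ful", "ish", "less", "like", "ly", "most", "ness")


def candidate_base(word, lexicon, independent=frozenset()):
    """Return the headword `word` might be filed under, or None.

    `lexicon` is a set of known words. `independent` is a hypothetical
    hand-curated set of words (likely, bashful, ...) that look derived
    but are treated as headwords in their own right.
    """
    if word in independent:
        return None
    if word.endswith("ally"):
        al_word = word[:-2]              # basically -> basical
        if al_word in lexicon:
            return al_word               # plain -ly applied to the -al word
        stem = word[:-4]                 # basically -> basic
        if stem in lexicon:
            return stem                  # -ally allowed: no -al word exists
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in lexicon:
            return stem
    return None


lexicon = {"basic", "logical", "slow", "hope", "like"}
print(candidate_base("basically", lexicon))
print(candidate_base("logically", lexicon))
print(candidate_base("likely", lexicon, independent={"likely"}))
```

The `independent` set stands in for the judgment calls described above (slavish, sluggish and the like), which no mechanical rule can settle.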
Here are some other notes on the determination of what words are related.
Certain uses of the suffixes -ed and -s are treated as inflections, even though technically they are not. Thus, talented is treated as derived from talent, and optics from optic.
Words ending with the suffix -ability/ibility are treated as relatives of the corresponding -able/ible word.
Sometimes, the choice of which variant to treat as the headword is somewhat arbitrary. I have consistently chosen an American spelling over a British spelling here. This has some effect on the number of headwords. I treat cheque as a variant of check, whereas, to an observer with a British bias, they would no doubt be separate headwords.
No distinction is made of different meanings of the same word, even when they are so different that dictionaries list them separately. wind the noun and wind the verb are considered as a single word, as are second the adjective, second the noun and second the verb.
It may sometimes happen that two different words have the same inflection (putting derives both from putt and put; holier relates to holey as well as holy), or that an inflection is a headword in its own right (as with wound, the past tense of wind, or crooked, the past tense of crook). These situations are noted in the 2+2+3lem list as cross-references to the alternate headword. There are two specific situations which might not be obvious where inflections are treated as different words. These occur when a present participle or a -ness word has a plural inflection, as with meaning and weakness. Such words are always made headwords, even when the relationship to the original root is very close. Here is an example showing how cross-references are indicated:
base
    based, baseless, basely, baseness, baser, bases -> [basis], basest, basing
Almost always, a given word has only one cross-reference - the biggest exception is the incredible tangle shown in the example below:
slue -> [slough]
    slew -> [slay, slew, slough], slewed, slewing, slews -> [slew, slough], slued, slues -> [slough], sluing
where 4 uncommon words mostly pronounced sloo have become thoroughly confused.
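For anyone wanting to read the list programmatically, the entry format and cross-reference notation described above can be parsed along the following lines. This is a minimal sketch rather than an official reader: the function names are hypothetical, it assumes each entry's related words sit on a single indented line, and it does not treat the "!" neologism marker specially.

```python
def split_top_level(text):
    """Split on commas that are not inside a [...] cross-reference."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_item(text):
    """Split 'bases -> [basis]' into ('bases', ['basis'])."""
    word, arrow, refs = text.partition("->")
    targets = []
    if arrow:
        targets = [r.strip() for r in refs.strip().strip("[]").split(",") if r.strip()]
    return word.strip(), targets


def parse_lemma_entries(lines):
    """Return a list of dicts: headword, its cross-references, related words."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: related words belonging to the previous headword.
            if not entries:
                raise ValueError("related-word line before any headword")
            for item in split_top_level(line):
                word, refs = parse_item(item)
                entries[-1]["related"][word] = refs
        else:
            head, refs = parse_item(line)
            entries.append({"headword": head, "see": refs, "related": {}})
    return entries


SAMPLE = """\
base
    based, baseless, basely, baseness, baser, bases -> [basis], basest, basing
slue -> [slough]
    slew -> [slay, slew, slough], slewed, slewing, slews -> [slew, slough], slued, slues -> [slough], sluing
"""

for entry in parse_lemma_entries(SAMPLE.splitlines()):
    print(entry["headword"], entry["see"], sorted(entry["related"]))
```

Keeping the cross-references as plain lists of headword strings leaves it to the caller to decide whether to follow them.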
 | neol2016 | 2of5core | 6phrase |
---|---|---|---|
Size (Words) | 600 | 4,700 | 22,000 |
Number of Sources | 0 | 5 | 6 |
American English | Y | Y | Y |
British English | A little | Y | Y |
Ordinary words | Y | Y | – |
Inflections | Y | – | – |
Hyphenations | Y | A few | – |
Phrases | Y | A few | Y |
Names | Y | A few | A few |
Abbreviations | Y | A few | A few |
Acronyms | Y | A few | – |
Prefixes/Suffixes | – | – | – |
Signature words | – | – | * |
Neologisms | Y | – | – |
Annotations | Y | – | Y |
A * in the "Signature Words" row means that signature words associated with some other list may be present, but there are no signature words associated specifically with that list.
It may have occurred to some to wonder about how something like the 12dicts project came to be (though I assume that anyone who bothers to download this archive must already have some idea that such a project could be of interest).
Many years ago, there was a post to the sci.crypt Usenet newsgroup, on the subject of creating PGP passphrases using randomly selected entries from a supplied list of very short words. (If this sounds interesting, follow this link for an expanded version of the post.) The word list, which was extracted from /usr/dict/words on some UNIX system, seemed to me ill-suited to its intended purpose. It included arcane acronyms (bstj, fmc), misspellings (diety, ouvre) and words of amazing obscurity (bhoy, kombu). I decided I could do better, and eventually did. This caused me to start downloading English word lists, of which there were many, from the Internet. I was not impressed by the overall quality of these lists, and the few which were high-quality were all-inclusive, burying the everyday words under a mountain of archaisms and esoterica.
This was a long time ago, and an Internet search for word lists now turns up lists of higher quality than back then (thanks in part to the influence of 12dicts), so I will limit myself to two brief criticisms of the various lists available at that time. First, they contained far too many misspellings and typos, and had obviously never been proofread. Additionally, their approach to vocabulary was scattershot, omitting common words while adding a random selection of highly technical words, often associated with UNIX and academic computer science. (My favorite is the list which included bremsstrahlung, but omitted log and beer.) Due to my original purpose of finding a list of short, common words, I found this sort of thing particularly frustrating.
One result of my frustration with this situation was my working with Mendel Cooper on ENABLE, a large Scrabble®-oriented list, which was close to unique in having an active caretaker who was clearly concerned with quality, and in being oriented towards American rather than British English. But ENABLE was an all-encompassing list and, even if it had been complete at the time I started my search for a list of common words, it would not have been what I wanted for that reason. (The ENABLE web site is no longer online, but a Google search will turn up places where you can still download it.)
I finally decided that only starting from scratch with a systematic approach was likely to get me what I was looking for, and that dictionaries intended for non-native speakers of English were the best possible source for words that are in some cases so familiar that we never think of them. This has led to the 12dicts lists, which I hope have managed to avoid the flaws recited above.
Click the following links to CAAPR and ABCD if interested.
When I released the first version of 12dicts in 1999, I assumed I was done with it. It hasn't worked out that way. I now think I'm pretty much done with it again, though an occasional update to neol20xx.txt might be called for. Perhaps in ten more years I'll have reached version 9, and be laughing uncontrollably at the thought that I might have finished earlier, but for the present I don't see what else might be both useful and fun to add.
Feel free to send comments, suggestions, inquiries and/or large sums of money to me at 12dicts@pobox.com. (Actually, the bit about money is a joke. Do not send me even small amounts of money; 12dicts is free wordware.) After making this request in previous versions, I have been delighted to see the interest in these lists for projects ranging from interactive games to literacy programs. And I have been particularly pleased to occasionally hear of first-year Computer Science assignments specifying a 12dicts list rather than /usr/dict/words for their input. Keep up the good work, and do let me know what you're doing. (Oh, and please put "12dicts" in the subject line when you email me. This will allow me to easily notice your mail even if it is misclassified by an overzealous filter as spam. Speaking of spam, the publication of my email address in this package has led to a marked increase in the amount of spam I receive and, ironically, much of it contains subject lines which appear to have been extracted at random from my own lists. This is a use of 12dicts of which I do not approve!)
The 12dicts lists were compiled by Alan Beale. I explicitly release them to the public domain, but request acknowledgment of their use. (Actually, the dependency of the 2of12inf list and the 2+2+3 lists on AGID prevents their release into the public domain. However, I do not impose any additional requirements on their use beyond those imposed by AGID and its sources, as described in agid.txt.)
- Alan Beale -