Re: vocab sizes
> Can anyone help me with the following questions and/or
> furnish me with references: What languages have the 
> largest vocabularies, what are the sizes,
The problem with these questions is that they are very vague.  When
you make them specific enough to be answerable, the questions
themselves tend to lose interest.
An early problem you run into is defining the meaning of a "word."  Is
"jump" the "same word" as "jumps" and "jumped"?  If you count them the
same, then what about "swim", "swam", and "swum"?  If you count every
differently spelled form as a different word, then doesn't that
artificially inflate the number of words in a language?  And how do
you compare the counts with languages in which the words are not
similarly inflected?  For example, since Japanese has no noun plurals,
by one counting method it probably has about half as many nouns as
English.
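To make the two counting methods concrete, here is a minimal Python
sketch.  The lemma table is hand-written for illustration only (an
assumption, not a real morphological analyzer), and the function names
are hypothetical.

```python
# Hypothetical, hand-written lemma table.  A real count would need a
# morphological analyzer for the language in question.
LEMMAS = {
    "jumps": "jump", "jumped": "jump",
    "swam": "swim", "swum": "swim",
}

def count_surface_forms(words):
    """Count every distinct spelling as a separate word."""
    return len(set(w.lower() for w in words))

def count_lemmas(words):
    """Count inflected forms together under their base form."""
    return len(set(LEMMAS.get(w.lower(), w.lower()) for w in words))

words = ["jump", "jumps", "jumped", "swim", "swam", "swum"]
print(count_surface_forms(words))  # 6
print(count_lemmas(words))         # 2
```

The same six tokens give a count of 6 or 2 depending on the
definition, which is the whole difficulty in miniature.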
Is a "bank" that holds money the same word as a "bank" that holds a
river?  How many different "words" are in the sentence "The hot dog
ate a hot dog"?
>  how are
> vocabulary sizes measured (spoken words, literary sources,
> official dictionaries, etc.), and how reliable are the
> measures?
Yes.  Any of these.  Of course, they will all give different answers,
because a measurement is meaningless until you define the measuring
instrument.
> 	From what I had read, English was thought to have in
> excess of 500,000 words and Classical Arabic was second
> with about 350 K. Recently, a friend told me that English
> is now thought to have about 1.5 M words, Russian nearly 1 M
> and French about 500 K.  She also stated that the vocabulary
> sizes of all speakers of natural languages are measured in the 
> 100's of thousands of words.  Even including transparent compounds,
> derivatives and inflected forms, these numbers seem up to an
> order of magnitude larger than the estimates I have seen in the 
> older literature from my college days.   Is my knowledge 
> out of date?  
Webster's Third New International Dictionary has about 450,000 entries.
The OED has substantially more, but half the entries would have been
long since forgotten if it were not for the OED itself.  Is a word
"in" a language if no one uses it any more?
I know a substantial fraction of the words in Webster's 3rd--I'm not
sure what the fraction is, but half is probably within an order of
magnitude.  (This would be an interesting experiment....)  So my
vocabulary is maybe 200,000 words.  This is a VERY rough estimate.
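One way the experiment might be run is to check a random sample of
entries and scale up.  The sketch below assumes the 450,000-entry
figure above; the function name, the sample size, and the stand-in
"reader" are all hypothetical.

```python
import random

def estimate_vocabulary(total_entries, sample_size, knows_entry, seed=0):
    """Estimate how many dictionary entries a person knows.

    knows_entry takes an entry index and returns True if the person
    knows that entry.  (Assumption: in practice this is a human
    answering honestly, not a function.)
    """
    rng = random.Random(seed)
    sample = rng.sample(range(total_entries), sample_size)
    known = sum(1 for i in sample if knows_entry(i))
    # Scale the known fraction of the sample up to the whole dictionary.
    return total_entries * known / sample_size

# Made-up reader who knows exactly the even-numbered entries (half).
estimate = estimate_vocabulary(450_000, 1_000, lambda i: i % 2 == 0)
print(round(estimate))
```

With a sample of 1,000 entries the result should land near 225,000,
give or take sampling error, and it still inherits every ambiguity
about what counts as "knowing" an entry.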
I also know a few hundred words that are not in Webster's.  Many are
from my profession, many are from other areas of interest to me, and
some are just new.  (I have absolutely no idea why the word "geas" is
not in there.)  Multiply that by everyone else's specialties, and we
can add at least a few tens of thousands of words that are not in
Webster's, but are in common use in some segment of
society.  There are also some words used only within my small circle
of family and friends, and nowhere else--should these count?
So essentially, I think the question as it stands is not meaningful.
It can be replaced by questions such as "How many differently spelled
words occur in a random sample of one million words from the most
popular newspaper in the given language?"  This question still
needs a few ambiguities cleaned out of it before it can be answered
properly--but it just doesn't have the zing of the original question.
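A rough Python sketch of that narrower question follows.  It rests on
several assumptions of mine: a "word" is a run of ASCII letters with
optional internal apostrophes, the sample is simply the first
sample_size tokens rather than a true random sample, and the function
name is hypothetical.

```python
import re

def distinct_words(text, sample_size, fold_case=True):
    """Count distinct spellings among the first sample_size tokens."""
    # Assumption: a word is a run of ASCII letters, optionally with
    # internal apostrophes.  Other languages need other rules.
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    tokens = tokens[:sample_size]
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return len(set(tokens))

text = "The bank by the river bank. The hot dog ate a hot dog."
print(distinct_words(text, 1_000_000))                   # 8
print(distinct_words(text, 1_000_000, fold_case=False))  # 9
```

Even on two sentences, the answer shifts with a single choice about
capitalization, and nothing here separates the two kinds of "bank" or
the two kinds of "hot dog".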
Next week:  How big is a ball?
- References:
- vocab sizes
- From: cbmvax!uunet!violet.berkeley.edu!chalmers (John H. Chalmers Jr.)