Sunday, April 12, 2009

How many words are there in a language?

In a recent discussion, the question came up of whether a language's vocabulary could be tallied (briefly addressed at Language Log a while back, and at FEL). I have no firm answer to that (and it's logically independent of whether or not you can estimate the proportion of the vocabulary coming from a given language - that's a sampling problem). But, notwithstanding the bizarre if occasionally entertaining acrimony of that discussion, it's actually a rather interesting question.

Clearly, any given speaker of a language - and hence any finite set of speakers - can know only a finite number of morphemes, even if you include proper names, nonce borrowings, etc. ("Words" is a different matter - if you choose to define compounds as words, some languages in principle have productive systems defining potentially infinitely many words. The technical vocabulary of chemists in English is one such case, if I recall rightly.) Equally clearly, it's practically impossible to be sure that you've enumerated all the morphemes known by even a single speaker, let alone a whole community; even if you trust (say) the OED to have done that for some subset of English speakers (which you probably shouldn't), you're certainly not likely to find any dictionary that comprehensive for most languages. Does that mean you can't count them?

Not necessarily. You don't always have to enumerate things to estimate how many of them there are, any more than a biologist has to count every single earthworm to come up with an earthworm population estimate. Here's one quick and dirty method off the top of my head, sketched in code after the list (obviously indebted to Mandelbrot's discussion of coastline measurement):
  • Get a nice big corpus representative of the speech community in question. ("Representative" is a difficult problem right there, but let's assume for the sake of argument that it can be done.)
  • Find the lexicon size required to account for the first page, then the first two pages, then the first three, and so on.
  • Graph the lexicon size for the first n pages against n.
  • Find a model that fits the observed distribution.
  • See what the limit of the lexicon size as n tends to infinity, if any, would be according to this model.
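
Here is a minimal sketch of the first three steps in Python - purely illustrative, with stand-in assumptions throughout: a crude regex tokenizer, a 300-token "page", and word tokens standing in for morphemes (a real study would need a morphological analyzer):

```python
import re

def vocab_growth(text, page_size=300):
    """Return (pages_seen, cumulative_lexicon_size) pairs.

    Tokens stand in for morphemes here; a real study would run a
    morphological analyzer rather than this crude regex tokenizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % page_size == 0:  # one "page" = page_size tokens
            curve.append((i // page_size, len(seen)))
    return curve

# Hypothetical usage; "corpus.txt" is an assumed file name.
# with open("corpus.txt", encoding="utf-8") as f:
#     for pages, lexicon in vocab_growth(f.read()):
#         print(pages, lexicon)
```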


A bit of Googling reveals that this rather simplistic idea is not original. On p. 20 of An Introduction to Lexical Statistics, you can see just such a graph. An article behind a paywall (Fan 2006) has an abstract indicating that for large enough corpora you get a power law (the familiar Heaps'-law pattern of vocabulary growth).
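
For what it's worth, here is one way the curve-fitting step might look, assuming NumPy and using invented data points: fit a power law V(n) = K·n^β (the form the Fan abstract suggests) by least squares in log-log space.

```python
import numpy as np

# Invented (pages, lexicon size) points, for illustration only.
curve = [(1, 150), (2, 260), (4, 440), (8, 740), (16, 1240), (32, 2070)]
n = np.array([pages for pages, _ in curve], dtype=float)
v = np.array([size for _, size in curve], dtype=float)

# Fit log V = log K + beta * log n, i.e. the power law V(n) = K * n**beta.
beta, log_k = np.polyfit(np.log(n), np.log(v), 1)
print(f"V(n) = {np.exp(log_k):.1f} * n^{beta:.2f}")

# Any beta > 0 means V(n) has no finite limit as n grows -
# exactly the puzzle taken up in the next paragraph.
```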

But if it's a power law, then (since the power obviously has to be positive) that would predict no limit as n tends to infinity. How can that be, if, for the reasons discussed above, the lexicon of any finite group of speakers must be finite? My first reaction was that that would mean the model must be inapplicable for sufficiently large corpus sizes. But actually, it doesn't imply that necessarily: any finite group of speakers can also only generate a finite corpus. If the lexicon size tends to infinity as the corpus size does, then that just means your model predicts that, if they could talk for infinitely long, your speaker community would eventually make up infinitely many new morphemes - which might in some sense be a true counterfactual, but wouldn't help you estimate what the speakers actually know at any given time. In that case, we're back to the drawing board: you could substitute in a corpus size corresponding to the estimated number of morphemes that all speakers in a given generation would use in their lifetimes, but you're not going to be able to estimate that with much precision.

The main application for a lexicon size estimate - let's face it - is for language chauvinists to be able to boast about how "ours is bigger than yours". Does this result dash their hopes? Not necessarily! If the vocabulary growth curve for Language A turns out to increase faster with corpus size than the vocabulary growth curve for Language B, then for any large enough comparable pair of samples, the Language A sample will normally have a bigger vocabulary than the Language B one, and speakers of Language A can assuage their insecurities with the knowledge that, in this sense, Language A's vocabulary is larger than Language B's, even if no finite estimate is available for either of them. Of course, the number of morphemes in a language says nothing about its expressive power anyway - a language with a separate morpheme for "not to know", like Ancient Egyptian, has a morpheme with no single-morpheme English equivalent, but that doesn't let it express anything English can't - but that's a separate issue.

OK, that's enough musing for tonight. Over to you, if you like this sort of thing.

21 comments:

D. Sky Onosson said...

Sidenote of limited interest: Korean also has a morpheme meaning "to not know", as well as one expressing "to not exist". I find that interesting, though of what significance, I'm not sure...

Glen Gordon said...

In the article, Lameen states: "In a recent discussion, the question came up of whether a language's vocabulary could be tallied [...] I have no firm answer to that (and it's logically independent of whether or not you can estimate the proportion of the vocabulary coming from a given language - that's a sampling problem)."

Self-contradiction. I'm amazed that you still can't admit that there are an infinite number of ways to "count" a language's vocabulary and that all these methods are purely arbitrary. Thus, so too must be the statements built on this randomness, like "90% of Tok Pisin is of English origin". These are phrases that somehow find themselves in academic literature unchallenged, perhaps because logic is so often confused with diplomacy and an often unmerited Confucian respect for academic hierarchies.

Further, your own suggestion of how this may be done is quite impossible for proto-languages like Proto-Germanic, where, yet again, careless authors will state unquantifiable things like "About one-third of the Proto-Germanic vocabulary is of non-IE origin" without it being possible to logically substantiate the stated quantity by any real statistical method at our disposal. Surely you don't believe that a proto-language can be sampled too. Surely you must see how arbitrary it all is.

Such things are far too subjective and unacademic to be taken seriously. No one will ever succeed in fully obfuscating the problem with impressive but irrelevant statistical terms and procedures either.

It's like the old adage goes: "40% of all statistics are made up."

Glen Gordon said...

D. Sky Onosson, FYI, Cantonese mou (冇) "there isn't; not have" or Finnish et "you ... not". Negative morphemes come in all sorts of varieties. A fun topic in itself, for sure.

Lameen Souag الأمين سواق said...

As I said from the start, this post addresses the question of how and whether you can non-arbitrarily measure the size of the lexicon. How best to estimate the proportion of words of a given origin in it is an equally interesting but logically independent problem (for example, 50% of all integers are even*, even though there are infinitely many integers), which I'm still thinking about.

However, if you want to stick to discussing how people actually measure this - namely, by checking the proportion in a given dictionary - then that's a much simpler issue. The problem there is that we know that the proportion of loanwords is normally lowest in common words and gets higher and higher as more uncommon words get included; thus it's not appropriate to compare proportions for two languages based on dictionaries of different sizes. But that doesn't mean the whole idea is invalid; if the goal is to compare different languages to each other (as it was in the original discussion), it just means that you have to make it clear in advance which arbitrary size threshold you're using.

Anybody familiar with both languages can see that, for example, Icelandic contains far fewer loanwords than English; that means that if you pick any pair of Icelandic and English dictionaries of about the same size (or, strictly speaking, with about the same frequency cutoff), you expect the English one to contain more loans than the Icelandic one, even though the proportion of loans in the English one might be (say) 40% for a small dictionary and 70% for a large one. No doubt you could frustrate this by creating a "dictionary" of Icelandic containing only hand-picked loanwords, and another "dictionary" of English containing only Germanic words (just as you could make any poll inaccurate by picking only people who answer "yes" and ignoring all the rest); but as long as the words in your dictionary were selected based on frequency, not on etymology, that's not an issue.
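
A toy sketch of that comparison, with invented word lists and etymology tags (real tags would come from etymological dictionaries); the only point is that the cutoff has to be the same for both languages:

```python
def loan_proportion(ranked_entries, cutoff):
    """Share of loanwords among the `cutoff` most frequent entries.

    ranked_entries: (word, is_loan) pairs, most frequent first.
    """
    top = ranked_entries[:cutoff]
    return sum(is_loan for _, is_loan in top) / len(top)

# Invented toy data, ordered by descending frequency.
english = [("the", False), ("people", True), ("very", True), ("animal", True)]
icelandic = [("og", False), ("vera", False), ("hús", False), ("banani", True)]

# Compare the two languages at the SAME cutoff.
for cutoff in (2, 4):
    print(cutoff, loan_proportion(english, cutoff), loan_proportion(icelandic, cutoff))
```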


* In case any maths geeks are reading this, I know it's loose phrasing - what I mean by it is that the limit as n->infinity of [the number of even numbers <= n] / n is 1/2.
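
In symbols (the same statement, read as a natural density):

$$\lim_{n\to\infty}\frac{\#\{\,k\le n : k\ \text{even}\,\}}{n}=\frac{1}{2}$$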

David Marjanović said...

I find that interesting, though of what significance, I'm not sure...

Several Greek philosophers would have had their minds blown by it. :-)

I'm amazed that you still can't admit that there are an infinite number of ways to "count"

Estimate.

a language's vocabulary and all these methods are purely arbitrary.

They all oversimplify the issue, but by no means are they all purely arbitrary.

Thus, so too must be the statements built on this randomness, like "90% of Tok Pisin is of English origin".

Depends on what error bar you ascribe to that number. If it's supposed to mean, say, "between 85% and 95% of Tok Pisin is of English origin", I can't see a problem. If it's supposed to mean "between 89.95% and 90.05% of Tok Pisin is of English origin", and I were asked to peer-review a paper that contained this statement, I'd require the authors to explain the method before resubmission (and of course I'd expect the method to turn out to be flawed).

Further, your own suggestion on how this may be done is quite impossible for protolanguages like Proto-Germanic where, yet again, careless authors will state unquantifiable things like "About one-third of the Proto-Germanic vocabulary is of non-IE origin" without it possible to logically substantiate the stated quantity by any real statistical method at our disposal.

Isn't it completely obvious how this is meant?

Obviously, it's shorthand for "about one-third of those Proto-Germanic lexemes that have been reconstructed so far appear to lack an IE origin based on evidence I'll hopefully provide in the next sentence, assuming of course that the reconstructions are correct in the first place". And that's not a statement I have a problem with. Yes, I agree that such sloppy wording could still be misunderstood and should therefore still be cleared up before publication. But automatically taking it literally is just silly.

Glen Gordon said...

Without a conscious awareness of weasel words, those who profess to understand linguistics are in fact lost.

Here are the facts:
1. The term language is arbitrary.
2. Determining vocabulary size is arbitrary.
3. Geographical boundaries of a language are arbitrary.
4. Dialectal boundaries of a language are arbitrary.
5. Related terms like dialect and code are equally arbitrary.
6. Wave Theory proponents already understand this and have adapted. (So should you.)
7. The term even number is defined by math-logic, but vocabulary cannot be.
8. An infinite set of even numbers differs in nature from an infinite set of vocabulary, thus comparison between the two is invalid.

---

"90% of Tok Pisin vocabulary is derived from English."In the above statement, the terms Tok Pisin and English describe two kinds of language and are thus arbitrary (ie. incapable of being logically defined in terms of geographical scope, clear dialectal boundaries, etc.). The term vocabulary, also without defined scope, is vague. Since the above statement already contains three confirmed weasel words, "90%" is completely meaningless in numerous ways. (In other words, David doesn't understand that whether taken "literally" or not, this statement is void of any informative value whatsoever.)

Honest authors and logicians shun invalid statistics, choosing instead to explicitly convey to the reader any opinions or assumptions on which further hypotheses might be based.

We should overtly reframe the above into a statement of opinion, for example "*In my opinion* the *overwhelming majority* of Tok Pisin vocabulary is of English origin", thereby avoiding the abuse of semantic vagaries to support an untenable point of view, just as the sophist Gongsun Long attempted to do over 2000 years ago (i.e. in the 白馬論, the "White Horse Discourse").

Lameen Souag الأمين سواق said...

Arbitrariness - by which you seem to mean the existence of unclear boundary cases - is really not a problem; you simply state explicitly somewhere what you are choosing to count as part of the language and how you're calculating your results, and then other authors wanting to compare their figures with yours can see whether the figures are comparable or are based on different criteria. Certainly an article saying that "90% of Tok Pisin vocabulary is derived from English" ought to say which dictionary or corpus this was based on, and how big it was; but even without that background, knowing that a (presumably frequency-based) dictionary of Tok Pisin exists, 90% of whose entries are of English origin, is more informative than just saying "a lot". (For one thing, I can't imagine finding 90% Germanic entries in any frequency-based dictionary of English large enough to be called a dictionary, whereas I can certainly imagine finding "a lot".)

If you want to deal with arbitrariness in this sense more systematically, I suggest reading up on fuzzy set theory (the article that got the ball rolling is Zadeh 1965; Fuzzy Logic and its Uses seems like an OK introduction). Concepts not having well-defined boundaries can be dealt with logically too - they're just messier.
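
For concreteness, a minimal sketch of the fuzzy-set idea - the membership grades are invented, and the "sigma count" (summing grades) is one standard fuzzy notion of cardinality:

```python
# Grades of membership in the fuzzy set "word of English", between
# 0 (clearly out) and 1 (clearly in). All values invented for illustration.
membership = {
    "house": 1.0,          # core native vocabulary
    "schadenfreude": 0.6,  # established but marked borrowing
    "amirite": 0.3,        # marginal slang
    "xyzzy": 0.0,          # not a word at all
}

def sigma_count(fuzzy_set):
    """Fuzzy cardinality as the sum of membership grades (the sigma count)."""
    return sum(fuzzy_set.values())

print(sigma_count(membership))  # 1.9, rather than a crisp 2, 3, or 4
```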

Oh yeah, good White Horse link - I always liked that paradox, though it sure doesn't translate well.

David Marjanović said...

Concepts not having well-defined boundaries can be dealt with logically too - they're just messier.

That, of course, was my point; I should have stated it more clearly.

Glen Gordon said...

On my own blog on this subject, I already ever-so-clearly cited Suzanne Romaine's Language, education, and development (1992), p. 145, which explains concisely why these statistics are effectively useless. Suzanne Romaine is Merton Professor of English Language in the University of Oxford. Take it or leave it.

Also note Raymond Hickey, a professor of linguistics at the Universität Duisburg-Essen, who suggests the same in Legacies of Colonial English, on page 474. Nothing at all suggests that these learned authors are somehow mad, but everything in these permitted comments, including the recent troll attack against me by Carlos Quiles on a previous blog entry, indicates rather the irrationality of the blog author, Lameen Souag.

So far be it from me to "mandate" careful reasoning and sound judgement to people. No doubt, this dishevelled commentbox mob, cowardly hiding for the most part in anonymity, will suggest that Romaine and Hickey don't know what they're talking about simply because their educated understanding of the subject doesn't fit into the narrow views of this particular expression of nutty groupthink. Please, do carry on with your collective insanity. :-)

D. Sky Onosson said...

Huh? There are *two* comments by *one* poster in this thread who doesn't have a blogger profile, and *zero* anonymous comments. I don't see anyone attacking anyone else here, why not keep that "discussion" tied to the one thread that's already gone off the rails?

Glen Gordon said...

And why don't you address the references I just cited? The ones by Romaine and Hickey on "unreliable" statistics? Why don't you try that?

D. Sky Onosson said...

Perhaps I might, if I have time to get to them (3 jobs, 2 kids, and school doesn't usually leave me much time). I can certainly appreciate your useful citations and references, but it's hard to take them seriously when you follow them up with a rant aimed at no one and everyone at the same time.

I'm just trying to encourage a little civility on this thread, if possible. I for one don't enjoy being thrown into a "collective" "dishevelled ... mob" for no good reason.

Glen Gordon said...

"Perhaps I might, if I have time to get to them (3 jobs, 2 kids, and school doesn't usually leave me much time)."But you have the time to make pointless comments though? How strange. I repeat, comment on my references to Hickey and Romaine's above. Your attempt to distract the topic with more shallow complaints about me are transparent.

D. Sky Onosson said...

I haven't made any shallow complaints about you, nor do I plan to.

I'll refrain from making any more off-topic posts for the benefit of all.

Lameen Souag الأمين سواق said...

"Not entirely reliable" is a very long way from "effectively useless", and in any case both sources are referring to the unreliability of the etymologies themselves, not of the idea of counting the proportion of words from a given source in a dictionary. Tok Pisin is a bit inconvenient in that respect, since German and English are so closely related; the obvious solution would be to state what percentage of words could equally plausibly come from either language, rather than assign them arbitrarily to one or the other. That said, neither of the examples of ambiguous etymologies that Romaine gives strike me as ambiguous; Tok Pisin contrasts d with t and e with a, so both those items (at least in their present forms) are more likely English in origin.

Glen Gordon said...

If my toaster is "not entirely reliable" and has the potential of either making normal toast or exploding into flames and burning my house down, I hardly would call it "useful".

Maybe your idea of mathematics and science is "fuzzier" than mine, but I remember a time when consistency had value and a lack of consistency did not. There is simply no means to verify a proposed etymology without a time machine, so these statistics, as you well know by now, are, yes, completely, utterly useless.

It's the "unverifiable" part that tips us off that we're not dealing with real science. For similar reasons, glottochronology is not respected either but it, of course, doesn't stop the lexicostatistic cult from pushing their unproven/unprovable views.

Even sadder, there are so many other ways of proving a point through careful deductive reasoning that lexicostatistics doesn't even need to be invoked in the first place.

Glen Gordon said...

PS: Tok Pisin has different dialects and forms, which adds to the uselessness of Mihalic and others' statistics. Some forms are well known to be more anglicized than others, and great variation exists, to the point of mutual unintelligibility.

David Marjanović said...

If my toaster is "not entirely reliable" and has the potential of either making normal toast or exploding into flames and burning my house down, I hardly would call it "useful"."Not everything that doesn't work is a metaphor."

("Nicht alles, was hinkt, ist ein Vergleich." One of the only two or three intelligent things Wolfgang Schüssel has ever said.)

David Marjanović said...

Why has blogger.com started to eat my empty lines behind <i> tags?

test
test

David Marjanović said...

OK, so <i> permanently switches off the software's automatic insertion of <p> tags, and <br> is then required (<p> being forbidden, stupidly enough).

test

test

test

mcgees.org said...

But if it's a power law, then (since the power obviously has to be positive) that would predict no limit as n tends to infinity

It can probably be bounded easily enough, though. Presumably one reaches a corpus size at which, at any given rate of page generation, a thousand new words would not appear before the predicted heat death of the universe.

But that's really beside the point, as what we are looking at is a model, not a measure. It is likely meaningless to even find the cardinality of one person's vocabulary -- for instance, I had an exchange recently in which someone claimed not to know a word, then found he had used it several times in his writing.

But at a certain point, just find a number and say "less than this number plus a thousand". :-p