Wednesday, September 29, 2010

Small vocabularies, or lazy linguists?

In Guy Deutscher's new book The Language Glass (which I'll be reviewing on this blog sometime soon) he claims (p. 110) that "Linguists who have described languages of small illiterate societies estimate that the average size of their lexicons is between three thousand and five thousand words." This would be rather interesting, if verified - but this statement is not sourced at the back, and is in any case too vague (what counts as "small"?) to be relied on as it stands. Does anyone have any idea where he might have got this figure?

I haven't found his source, but Bonny Sands et al.'s paper "The Lexicon in Language Attrition: The Case of N|uu" gives a nice table of Khoisan dictionaries' sizes, ranging from 1,400 for N|uu to < 6,000 for Khwe and 24,500 for Khoekhoegowab. She prudently concludes "The correlation between linguist-hours in the field and lexicon size is so close that no conclusions about lexical attrition can be drawn" - the outlier, Khoekhoegowab, is not only the biggest of the lot (with over 250,000 speakers), but had its dictionary written by a team including a native speaker over the course of twenty years. Given that "2,000 - 5,000 word forms (in English) may cover 90-97% of the vocabulary used in spoken discourse (Adolphs & Schmitt 2004)", it is not surprising that it should take disproportionately long to move beyond the 5,000 word range. However, she also points out that "Gravelle (2001) reports finding only 2,300 dictionary entries in Meyah (Papuan) after 16 years of study", suggesting that some languages may simply have unusually small vocabularies. Along similar lines, Gertrud Schneider-Blum's talk "Don’t waste words – some aspects of the Tima lexicon" suggested that the Tima language of Kordofan had an unusually small number of nouns due to extensive polysemy and use of idioms (I can't remember any figures, nor indeed whether she gave any).
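
One way to see why the first few thousand words come quickly and the rest come slowly is a rough simulation. The sketch below is only an illustration under assumptions that are not in any of the sources cited here: it supposes that word frequencies follow a Zipf-like law (frequency proportional to 1/rank), that tokens are heard independently, and that the "true" lexicon has a hypothetical 20,000 words. It is not meant to reproduce the Adolphs & Schmitt figures, only the general shape of diminishing returns.

```python
def zipf_probabilities(vocab_size, exponent=1.0):
    """Token probability for each rank 1..vocab_size under a Zipf-like law."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def coverage(probs, top_k):
    """Share of running text accounted for by the top_k most frequent words."""
    return sum(probs[:top_k])


def expected_distinct_words(probs, n_tokens):
    """Expected number of different words met in n_tokens independent draws."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


# Hypothetical "true" lexicon size; an assumption for illustration only.
VOCAB_SIZE = 20_000
probs = zipf_probabilities(VOCAB_SIZE)

# How much running text the most frequent words account for.
for k in (1_000, 2_000, 5_000, 10_000):
    print(f"top {k:>6} words cover {coverage(probs, k):.0%} of tokens")

# How many distinct words a fieldworker could expect to have met so far.
for n in (10_000, 100_000, 1_000_000):
    words = expected_distinct_words(probs, n)
    print(f"{n:>9} tokens heard -> about {words:.0f} distinct words")
```

Under these assumptions, each doubling of the amount of speech recorded yields less than a doubling of the words collected, which is at least consistent with dictionary size tracking linguist-hours so closely.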

I'd be interested to see other discussions of the issue of differences in lexicon size and explanations for them. My Kwarandzyey dictionary (in progress) so far stands at about 2000 words - it would be encouraging to think that I might already have done more than half the vocabulary, but I very much doubt it!


GamesWithWords said...

You'd have to define "word," right?

But you would expect that a language spoken over a wide geographic area by cultures with a great deal of cultural and technological complexity (i.e., lots of stuff to talk about) would have more words than a language spoken in a geographically restricted range with relatively few things to talk about.

I'm not calling anyone unsophisticated or primitive, but we've probably got an order of magnitude more job titles than many of these languages have people. We've probably got more fictitious nationalities in science fiction than these languages have people. We probably have more plant names (this comes from the geographic distribution) than they have plants. There's nothing really mysterious here.

But the point about the correlation between the size of a lexicon and the amount of time spent studying that language is well taken.

anggarrgoon said...

Andy Pawley gave a talk about this once.
Time spent on the language was part of the equation, but there were many other things too.

. What counts as a 'headword' in the dictionary?
. How much (productive) derivational morphology is there? [and how are those forms listed in the dictionary?]
. How much polysemy is there?
. How much use is made of compounding (as opposed to other word formation processes, or phrasal descriptor compounds)?
. How much use is made of specialised technical vocabulary? (for example, in English, musical terms tend to be their own lexemes, but in Bardi they are special senses of more general words.)
. Is there lots of loan (near-)synonymy?

Caspar Jordan said...

Are you aware of Kenneth Hale and David Nash's research on Damin, the "secret" register of Lardil (an Australian Aboriginal language) spoken by initiated men? I have no idea how trustworthy the research is, but it suggests that this register uses only several hundred lexical items to express more or less anything that can be expressed in the non-secret register. This is said to be achieved through extreme polysemy.
Definitely interesting!

Lameen Souag said...

GamesWithWords: that all sounds very well, and literacy in particular probably allows for a much bigger notional vocabulary simply by making the transmission of rare words no longer depend on actually knowing someone who uses them. But how many "things" there are to talk about is very much in the eye of the beholder - there are few limits on how finely you can divide up the body or the natural world if you feel like it, and none on how many imaginary entities you can decide to talk about.

Claire: thanks! That talk sounds really interesting - wish I had been there.

Caspar: I've heard a little bit about that, yes - don't know much about it, but I can well believe that what's effectively an artificial language would have a much smaller vocabulary. Getting it to express everything sounds like the hard part!

marie-lucie said...

Caspar Jordan: "secret register"

I have read the same thing about other "secret languages". One aspect is that secret languages (which are usually secret vocabularies rather than whole languages) are used for specific purposes, and therefore they don't contain vocabulary which is irrelevant to those purposes. Another aspect is that they often use regular words with different meanings. These are not the only features of secret languages, but these features can be observed in slang, for instance.

marie-lucie said...

I have been studying an "exotic" language for many years. When I started I spoke to a linguist who had done a little work on it, who told me that there were very few minimal pairs in the language. As time went on, I found dozens of minimal pairs, including threesomes and sometimes foursomes. The previous linguist just had not had the opportunity to learn very many words.

David Marjanović said...

Did that previous linguist make grandiose claims about the sound system of that language...?

leoboiko said...

Haspelmath has pointed out that the word/morpheme distinction doesn’t work the same way for all languages.

I can picture a lexicographer claiming a language has few words when he’s looking at objects closer to morphemes.

leoboiko said...

Blogspot eats that URL in my browser, so here it is as a link: The indeterminacy of word segmentation and the nature of morphology and syntax