Showing posts with label vocabulary. Show all posts

Saturday, March 26, 2016

Lexical gaps in diglossia: When you can't write what you know

"Write what you know" is what they tell aspiring writers. If you're an English speaker who made it through high school, you should be able to do just that, on any topic that you know anything about (although the spelling might need work.) In Algeria (as in other Arabic-speaking countries) diglossia makes it a little more complicated. You may have mastered the grammar perfectly, and gotten a great score on your high school exams. You may be an excellent plumber, or a great fisherman, or an expert carpenter - and you certainly have no problem talking about any of these things in Darja (dialectal Arabic). But try to write about any of those fields, and you're almost guaranteed to run into the limits of your Fusha vocabulary (standard Arabic).

You don't even have to get all that specialised to run into difficulties. If you're Algerian, all of the items listed below should be familiar to you from daily life - some of the words might be different in your region, but you almost certainly still know a Darja word with the appropriate meaning. But how many of them can you name in Fusha? (No fair using a dictionary, especially since you're unlikely to have a Darja-Fusha dictionary.)

  • تورنيفيس (screwdriver)
  • لومبرياج (clutch of a car)
  • زربوط (spinning top)
  • حرّايق (nettle)
  • مشيمشة (loquat)
  • بلاّرج (stork)
  • أكليل، أزير (rosemary)
  • رعف (to have a nosebleed)
  • زبر (to prune)
  • ددّش (to toddle)
  • هترف (to sleep-talk)
In the very unlikely event that you did know all of these in Fusha, ask yourself for each one: if you used this word in an article, how many readers do you think would understand? Granted, a couple of them are trick questions - cases where the Fusha word is basically the same as the Darja one. But the main point stands even in those cases: you probably didn't know that before checking it, and for at least one of those words, I can confirm from personal experience that there are professors teaching in Arabic, and journalists working in Arabic, who didn't know that either. In Algeria (though not necessarily in other countries, like Egypt), the default assumption is always that a Darja word is wrong until proven otherwise.

It's understandable that Algerians (and quite likely other Arabic speakers) tend not to know these words in Fusha. How often do any of them come up in journalism, or religion, or poetry, or any of the other contexts in which people are most frequently exposed to Fusha? But what it means is that even well-educated Algerians don't know enough Fusha to adequately describe their daily life, much less to write all they know. In effect, compared to their Darja abilities, they're suffering from a Fusha-specific language deficiency that limits what they can write about. If you agree with me that it would be nice to see more good Algerian novels, or even more Algerian DIY handbooks, then that's a problem.

Sunday, February 14, 2016

Gravitational waves and lexical diffusion

Recently, the detection of gravitational waves made headlines all over the world. These waves were only hypothesised a century ago, and have literally never been consciously experienced by a human being before. Apart from a few physics fans, most people had (still have?) never heard of them. That means that, this month, millions of people all over the world are learning, for the first time, how to say "gravitational waves" in their own language, entirely as a result of media coverage.

For the official languages of First World countries, determining how to say "gravitational waves" was simply a matter of looking it up in a dictionary, or consulting a physicist; the groundwork had been laid long since for terms such as the following (morphemes separated by dots). In most European languages, even the term for "gravity" and/or "gravitational" had been borrowed wholesale from Latin, and all that had needed doing was to translate "wave" and add appropriate inflectional morphology:

Of course, Latin is not the only classical language; Japanese, for one, had opted to coin a term out of morphemes borrowed from Chinese:
  • Japanese: 重力波 (jū.ryoku.ha; weight-force-wave)
In the Third World, the naming problem is a little less straightforward. There are plenty of physicists speaking Arabic, for example, but it cannot even automatically be assumed that an Arabic-speaking physicist will be capable of talking about physics in Arabic; many not only work but even teach in a foreign language. Nevertheless, the mere fact of a language being extensively used in media and teaching guarantees that it already contains expressions for "gravity" and "gravitational", not to speak of "wave", and makes it probable that they had already been combined, as in the following expressions:

For languages not lucky enough to enjoy official status, the issue poses more difficulties. The BBC's Hausa service heroically managed to coin or find a term for "gravitational waves" in Hausa - a language with no specific word for "wave" - but one wonders how physicists, to say nothing of ordinary speakers, feel about it...

What about Berber? Well, in principle the relevant words have been coined, probably more than once. If we go with Mazed's (2003) Amawal amatu n tfizikt tatrart, "gravitational waves" should be timdeswalin tizayzayanin. You may not be unduly surprised to hear that this gets zero hits on Google. There are dictionaries of proposed terminology for Tamazight (pan-Berber), but there are no fully Berber-language newspapers, and nobody teaching physics in Berber. Very likely the Berber-language radio/TV stations have spoken about this news, but if so, my experience of Algeria's Radio 2 suggests that they probably just switched into French to express it - and if they did use the neologisms, chances are virtually none of their listeners understood them.

What about Siwi, or Korandje? Come on - who are we kidding? If a speaker of either wanted to speak about gravitational waves, they would simply use the Arabic term (or possibly the French or English one). Nothing in the structure of these languages prevents them from coining the terminology for this - but the fact that these languages have no media or educational system of their own, and are spoken by communities too small to include any professional physicists, makes it extremely unlikely that their speakers will do so, and even less likely that any such coinages will be successful.

The moral is obvious: for a language's speakers to effectively be able to talk about the full range of topics associated with the modern world without resorting to code-switching or nonce borrowing, they need mass schooling and mass media in that language.

Which brings me to another recent news item: it appears that Morocco's Minister of Education, Rachid Belmokhtar, plans to start teaching scientific and technical subjects in French, even in secondary school (1 2). The most obvious disadvantage of such a policy is that it makes it impossible for students doing badly in French to understand these subjects, thus reducing even further their already limited chances. But its implications for Standard Arabic in Morocco bear considering too: this decision condemns an important part of its vocabulary to local oblivion.

Monday, July 14, 2014

Northern Songhay comparative wordlists

Linguistically, the northern and southern shores of the Sahara have remained surprisingly distinct, and most Saharan groups are easily identifiable as outposts of one or the other. Occasionally, however, a greater degree of language mixture is found. Nowhere is trans-Saharan language mixture more prominent than in Northern Songhay, a group of languages spoken in Niger, Mali, and Algeria that combine a Songhay base with an enormous Berber superstratum. The group includes Korandjé, a southwestern Algerian language I've been working on for a few years now.

Following an inquiry I recently received, I've been comparing Korandjé data to the Northern Songhay comparative wordlist in Rueck and Christiansen (1999). In the spirit of open data, you can view the wordlist (with a few remaining gaps to be filled) here: Korandjé 380-word list for Northern Songhay lexical comparison. Draft version, 14 July 2014. The results should be treated as provisional, since the Tasawaq part of this wordlist in particular appears a bit unreliable and since a few gaps remain in the Korandjé and even Tadaksahak lists, but are nevertheless interesting.

Counting cognates makes it very clear that Korandjé is the outlier, as might be expected based on geography:

              Korandjé  Tadaksahak      Tagdal     Tabarog     Tasawaq
Korandjé             -         139         140         141         152
Tadaksahak         139           -         242         238         214
Tagdal             140         242           -         304         237
Tabarog            141         238         304           -         229
Tasawaq            152         214         237         229           -
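
One rough way to make "outlier" concrete is to average, for each variety, the number of items it shares with the other four. The sketch below uses only the counts from the table above; the averaging itself is an illustrative choice made here, not a method from Rueck and Christiansen, and the function names are invented for the example.

```python
# Shared-item counts copied from the table above (each pair listed once).
varieties = ["Korandjé", "Tadaksahak", "Tagdal", "Tabarog", "Tasawaq"]
shared = {
    ("Korandjé", "Tadaksahak"): 139,
    ("Korandjé", "Tagdal"): 140,
    ("Korandjé", "Tabarog"): 141,
    ("Korandjé", "Tasawaq"): 152,
    ("Tadaksahak", "Tagdal"): 242,
    ("Tadaksahak", "Tabarog"): 238,
    ("Tadaksahak", "Tasawaq"): 214,
    ("Tagdal", "Tabarog"): 304,
    ("Tagdal", "Tasawaq"): 237,
    ("Tabarog", "Tasawaq"): 229,
}


def count(a, b):
    """Look up the symmetric shared-item count for two varieties."""
    return shared[(a, b)] if (a, b) in shared else shared[(b, a)]


def mean_shared(variety):
    """Average number of items this variety shares with each other variety."""
    others = [v for v in varieties if v != variety]
    return sum(count(variety, v) for v in others) / len(others)


means = {v: mean_shared(v) for v in varieties}
outlier = min(means, key=means.get)  # variety with the lowest average

for name, value in sorted(means.items(), key=lambda item: item[1]):
    print(f"{name:<12}{value:8.2f}")
print("Lowest average:", outlier)
```

Korandjé averages 143 shared items, while every other variety averages over 200, which is the pattern the table shows at a glance.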

The other three Northern Songhay varieties (treating Tagdal+Tabarog as one variety) form a linkage, which, following Wolff and Alidou's suggestion, we might label Azawagh Songhay - from west to east: Tadaksahak, Tagdal+Tabarog, then Tasawaq. On this wordlist Korandjé is clearly closest to Tasawaq, but that's only because Korandjé and Tasawaq have both kept more Songhay vocabulary, a fact irrelevant for subgrouping. The only innovation in vocabulary that Korandjé and Tasawaq share to the exclusion of the rest is the borrowing of numerals from 5 up from Arabic, and if you look at the sound correspondences it's clear that Tasawaq and Korandjé each borrowed their current numerals separately from different dialects of Arabic. Tadaksahak, Tagdal, and Tabarog all show almost the same number of items shared with Korandjé due to common borrowing from Berber, and most of that is due to shared borrowings of widespread Berber words that could easily have happened independently. The use of a Berber form originally meaning "weaver" for "spider" in Korandjé and Tadaksahak alone is striking, but very likely coincidental.

Another way to look at this is to note that 188 of the 332 items are shared across all of Azawagh Songhay, whereas only 108 are shared across all of Azawagh Songhay plus Korandjé. Of the latter, only 9 are Berber or Arabic loans, while 99 are Songhay retentions:

eye, ear, mouth, head, hair, neck, milk, belly, foot, hand, skin, blood, urine, liver, person, man, woman, owner, name, dog, cow, donkey, (venomous) snake, louse, meat, fat, stick, grass, rope, salt, pot, pit (hole), iron, fire, smoke, ashes, night, sun, day, yesterday, wind, water, stone, one, two, hot, cold, long, old, lots, red, black, white, dry, full, what, where, near, far, and, sit down, stand up, lie down, sleep, bite, eat, drink, suck, laugh, cry, see, hear, know, love, give, steal, hide, give birth, die, kill, walk, run, fall, wash, pierce, hit, tie, do, sew, bury, sandals, horse, truth, falsehood, finish, dig, stand, find.
This list is dominated by basic, rarely loaned words: nearly half of it overlaps with the Leipzig-Jakarta list. However, more culturally specific shared retentions such as "iron", "owner", "cow", "donkey", "horse", "pot", "sew", and "sandals" remind us that the split of Northern Songhay is after all rather recent (much more so, in fact, than these words alone might suggest).

These pan-Northern retentions, however, by no means exhaust the Songhay lexicon of Northern Songhay. Korandjé alone retains some 183 list items of Songhay origin, at least 135 of them shared with Tasawaq, while for many words (e.g. "four", "green"), only Tasawaq has kept Songhay forms. Well over 227 items have Songhay equivalents in at least one Azawagh Songhay variety, and more than 241 have equivalents either in Azawagh Songhay or in Korandjé. If the even more conservative (but extinct) Emghedesie variety were added to the list, that number would no doubt be even larger. Proto-Northern Songhay certainly had a significantly larger Songhay lexicon than any of its descendants does.


[Later addendum]: Removing all words with Arabic-derived Korandjé forms from the list makes no difference to the classification; the table ends up like this:

              Korandjé  Tadaksahak      Tagdal     Tabarog     Tasawaq
Korandjé             -         135         136         138         142
Tadaksahak         135           -         188         186         174
Tagdal             136         188           -         231         188
Tabarog            138         186         231           -         181
Tasawaq            142         174         188         181           -

Thursday, December 26, 2013

Does Arabic have the most words? Don't believe the hype.

For some time, I've been hearing rumours (from Arabs, of course) that Arabic has the largest number of words of any language. Recently I found one vector for this rumour: Comparison of the Number of Words in Languages of the World, a poster put together by Azzam Aldakhil which has the merit of at least giving the sources for its figures, namely Muʕjam ʕAjā'ib al-Lughah by Shawqī Ḥamādah, 2000. (In a follow-up comment he gives the page numbers, 83-84.) This poster claims that "Arabic has 25 times as many words as English".

Unfortunately for this claim, if you go to the book cited, what you actually find is a calculation of the number of possible roots in Arabic, without regard to whether or not the root actually has a meaning. Such a count includes huge numbers of unused roots such as بزح bzḥ or قذب qḏb, while at the same time lumping together all words derived from the same root; كتاب book, كاتب writer, and مكتب office are three words, but only one root. The result of such a calculation might tell us something about the potential for expanding Arabic, but absolutely nothing about the state of the Arabic language. And since in practice both Arabic and the languages it is being compared to on that poster allow arbitrarily long words without real roots, if only in loanwords, it doesn't even tell us much about its potential.
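
To see how far apart "possible roots" and "words" are as things to count, here is a minimal sketch. It assumes, purely for illustration, 28 consonant letters and roots of exactly three consonants with repetition allowed; the cited book's own calculation is not reproduced here and may well differ. The three words and their shared root are the ones mentioned above.

```python
from collections import defaultdict

# Assumption for illustration only: 28 consonant letters, three-consonant
# roots, repetition allowed. This is not the cited book's calculation.
ALPHABET_SIZE = 28
possible_triliteral_roots = ALPHABET_SIZE ** 3

# The three words from the text, each tagged with its root (k-t-b).
words = {"كتاب": "كتب", "كاتب": "كتب", "مكتب": "كتب"}

# Group the words by root, as a root-based count implicitly does.
words_by_root = defaultdict(list)
for word, root in words.items():
    words_by_root[root].append(word)

print("Possible three-consonant combinations:", possible_triliteral_roots)
print("Words in the sample:", len(words))
print("Roots in the sample:", len(words_by_root))
```

The first figure is pure combinatorics and says nothing about which combinations are in use; the second and third differ because a root count collapses every derived word into one entry.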

Both the number of Classical Arabic roots with actual meanings and the number of words can be estimated from the classic dictionaries: according to Sakhr's statistics, there seem to be around 10,000 roots, and up to 200,000 distinct words. Roots don't play such a major role in the lexicography of most non-Semitic languages, so it's difficult to compare the number of roots cross-linguistically. But in terms of words, that would be slightly fewer than English (250,000 in the OED, although the poster cites 600,000) and slightly higher than French (over 100,000 excluding proper nouns, according to the Académie Française).

However, such comparisons can hardly fail to be misleading. For one thing, English is much more hospitable towards dialectal and colloquial usages than Arabic is – the OED is full of words marked as Scottish or Northern or slang or whatnot, the equivalents of which would never be accepted by an Arabic dictionary. For another thing, the whole enterprise of counting words across languages runs into apparently insuperable problems, especially when it comes to compounds, which Arabic dictionaries do not normally treat as words. If you include compounds, then compound-friendly languages like German or Turkish or Inuktitut are automatically going to beat all the rest – and all the available statistics that I've seen for, say, English happen to include compounds.

So the best answer is that we don't really know, and that word count, even if we could measure it better, is not a very good measure of a language's expressive power anyway. Some missing words make a genuine difference, as I've discussed here before. But is English really missing out by not having distinct words for male camels (جمل) vs. female camels (ناقة)? Is Arabic really missing out by not having a special word for cornpone, or for scones?

Tuesday, November 05, 2013

APiCS online, ASJP

Any readers interested in pidgins, creoles, or mixed languages (one of those things is not like the others!) will want to know that the data for the Atlas of Pidgin and Creole Languages, APiCS, is finally online and publicly browsable. Think of it as WALS for pidgins and creoles, basically – lots of pretty maps, with the nice bonus that language-internal variation in features like word order can be represented proportionally by a pie graph instead of having to choose a single value per language.

Also released lately is the data underlying the ASJP (Automated Similarity Judgement Program). The program's results themselves remain thoroughly unreliable as a guide to classification – as of the latest version, it auto-classifies Songhay with Masa (Chadic), Berber with East Chadic, Kanuri with various Biu-Mandara (Chadic) languages (and not with Teda-Daza), Turkic with some New Guinea language named Kuot, and Hebrew with Tigre and Tigrinya against the rest of Semitic. For low-level subgroupings they aren't always too bad, though – their Berber tree has become surprisingly plausible. In any event, having the data, you can analyse it yourself, or try running your own algorithms if you feel up to it...
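
If you do want to try your own algorithm, a common starting point for automated wordlist comparison is edit distance normalised by word length. The sketch below is a generic implementation of that idea, not ASJP's actual pipeline; the function names are invented here, and the sample strings are ordinary English test words rather than data from any wordlist.

```python
def levenshtein(a, b):
    """Minimum number of single-symbol insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(
                previous[j] + 1,         # delete symbol_a
                current[j - 1] + 1,      # insert symbol_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalised_distance(a, b):
    """Edit distance divided by the longer length: 0 = identical, 1 = nothing shared."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Placeholder strings, not real wordlist entries.
print(levenshtein("kitten", "sitting"))
print(round(normalised_distance("kitten", "sitting"), 3))
```

Averaging such distances over a whole wordlist gives a crude similarity score for a pair of languages. A measure like this sees only surface resemblance, so it cannot tell inheritance from borrowing or chance.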

Sunday, April 12, 2009

How many words are there in a language?

In a recent discussion, the question came up of whether a language's vocabulary could be tallied (briefly addressed at Language Log a while back, and at FEL). I have no firm answer to that (and it's logically independent of whether or not you can estimate the proportion of the vocabulary coming from a given language - that's a sampling problem). But, notwithstanding the bizarre if occasionally entertaining acrimony of that discussion, it's actually a rather interesting question.

Clearly, any given speaker of a language - and hence any finite set of speakers - can know only a finite number of morphemes, even if you include proper names, nonce borrowings, etc. ("Words" is a different matter - if you choose to define compounds as words, some languages in principle have productive systems defining potentially infinitely many words. The technical vocabulary of chemists in English is one such case, if I recall rightly.) Equally clearly, it's practically impossible to be sure that you've enumerated all the morphemes known by even a single speaker, let alone a whole community; even if you trust (say) the OED to have done that for some subset of English speakers (which you probably shouldn't), you're certainly not likely to find any dictionary that comprehensive for most languages. Does that mean you can't count them?

Not necessarily. You don't always have to enumerate things to estimate how many of them there are, any more than a biologist has to count every single earthworm to come up with an earthworm population estimate. Here's one quick and dirty method off the top of my head (obviously indebted to Mandelbrot's discussion of coastline measurement):
  • Get a nice big corpus representative of the speech community in question. ("Representative" is a difficult problem right there, but let's assume for the sake of argument that it can be done.)
  • Find the lexicon size required to account for the 1st page, then the first 2 pages, then the first 3, and so on.
  • Graph the lexicon size for the first n pages against n.
  • Find a model that fits the observed distribution.
  • See what the limit of the lexicon size as n tends to infinity, if any, would be according to this model.
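
The steps above can be sketched in a few lines. This is a toy illustration under loud assumptions: whitespace-separated tokens stand in for morphemes, four made-up "pages" stand in for a representative corpus, and a power law V = K * n^beta is used as one candidate model among many. The function names are invented for the example.

```python
import math


def growth_curve(pages):
    """Return [(n, lexicon size needed to account for the first n pages)]."""
    seen = set()
    curve = []
    for n, page in enumerate(pages, start=1):
        # Crude assumption: each whitespace-separated token is one lexical item.
        seen.update(page.split())
        curve.append((n, len(seen)))
    return curve


def fit_power_law(curve):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


# Toy stand-in for a corpus: each string is one "page".
pages = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog met",
    "the cat saw the dog run",
]

curve = growth_curve(pages)
K, beta = fit_power_law(curve)
print(curve)
print(f"fitted exponent beta = {beta:.2f}")
```

With a real corpus you would plot the curve and compare several candidate models before trusting any extrapolation. Four points are far too few to say anything, and the fit here only shows the mechanics.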


A bit of Googling reveals that this rather simplistic idea is not original. On p. 20 of An Introduction to Lexical Statistics, you can see just such a graph. An article behind a pay wall (Fan 2006) has an abstract indicating that for large enough corpora you get a power law.

But if it's a power law, then (since the power obviously has to be positive) that would predict no limit as n tends to infinity. How can that be, if, for the reasons discussed above, the lexicon of any finite group of speakers must be finite? My first reaction was that that would mean the model must be inapplicable for sufficiently large corpus sizes. But actually, it doesn't imply that necessarily: any finite group of speakers can also only generate a finite corpus. If the lexicon size tends to infinity as the corpus size does, then that just means your model predicts that, if they could talk for infinitely long, your speaker community would eventually make up infinitely many new morphemes - which might in some sense be a true counterfactual, but wouldn't help you estimate what the speakers actually know at any given time. In that case, we're back to the drawing board: you could substitute in a corpus size corresponding to the estimated number of morphemes that all speakers in a given generation would use in their lifetimes, but you're not going to be able to estimate that with much precision.
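
The argument can be written out compactly. The symbols are introduced here for illustration and are not taken from the cited article: V(n) is the lexicon size needed for a corpus of size n, K and β are the fitted constants, and N is the (finite) corpus size you decide to plug in.

```latex
V(n) = K\,n^{\beta}, \quad K > 0,\ \beta > 0
\;\Longrightarrow\; \lim_{n \to \infty} V(n) = \infty ,
\qquad \text{whereas for any finite } N: \quad V(N) = K\,N^{\beta} < \infty .

\frac{dV}{V} = \beta\,\frac{dN}{N}
```

The second line shows where the imprecision comes from: under this model, a given relative error in the assumed corpus size N carries over to the lexicon estimate, scaled by β.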

The main application for a lexicon size estimate - let's face it - is for language chauvinists to be able to boast about how "ours is bigger than yours". Does this result dash their hopes? Not necessarily! If the vocabulary growth curve for Language A turns out to increase faster with corpus size than the vocabulary growth curve for Language B, then for any large enough comparable pair of samples, the Language A sample will normally have a bigger vocabulary than the Language B one, and speakers of Language A can assuage their insecurities with the knowledge that, in this sense, Language A's vocabulary is larger than Language B's, even if no finite estimate is available for either of them. Of course, the number of morphemes in a language says nothing about its expressive power anyway - a language with a separate morpheme for "not to know", like ancient Egyptian, has a morpheme for which English has no equivalent morpheme, but that doesn't let it express anything English can't - but that's a separate issue.
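
As a sketch of that comparison, suppose both growth curves have been fitted with power laws V = K * n^beta. That is an assumption, and the parameter values below are invented purely for illustration. If Language A has the larger exponent, there is a corpus size beyond which its curve stays above Language B's, whatever the constants are.

```python
def crossover(K_a, beta_a, K_b, beta_b):
    """Corpus size beyond which K_a * n**beta_a exceeds K_b * n**beta_b.

    Solves K_a * n**beta_a = K_b * n**beta_b for n; requires beta_a > beta_b.
    """
    if beta_a <= beta_b:
        raise ValueError("curve A must grow faster than curve B")
    return (K_b / K_a) ** (1.0 / (beta_a - beta_b))


# Hypothetical fitted parameters, made up for this example.
K_a, beta_a = 20.0, 0.60   # Language A: smaller constant, faster growth
K_b, beta_b = 40.0, 0.50   # Language B: larger constant, slower growth

n_star = crossover(K_a, beta_a, K_b, beta_b)
print(f"A overtakes B at a corpus size of about {n_star:.0f}")
```

Below that size B's sample has the larger vocabulary, and above it A's does, which is why the claim in the text only holds "for any large enough comparable pair of samples".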

OK, that's enough musing for tonight. Over to you, if you like this sort of thing.