Monday, May 07, 2018

Some notes on PCA

(Exploratory notes, written to be readable to linguists but posted in the hope of feedback from geneticists and/or statisticians - in my previous incarnation as a mathmo, I was much more interested in pure than applied....)

Given the popularity of Principal Component Analysis (PCA) in population genetics, it's worth a historical linguist's while to have some idea of how it works and how it's applied there. This popularity might also suggest at first glance that the method has potential for historical linguistics; that possibility may be worth exploring, but it seems more promising as a tool for investigating synchronic language similarity.

Before we can do PCA, of course, we need a data set. Usually, though not always, population geneticists use SNPs - single nucleotide polymorphisms. The genome can be understood as a long "text" in a four-letter "alphabet"; a SNP is a position in that text where the letter used varies between copies of the text (ie between individuals). For each of m individuals, then, you check the value of each of a large number n of selected SNPs. That gives you an m by n data matrix of "letters". You then need to turn this from letters into numbers you can work with. As far as I understand, the way they do that (rather wasteful, but geneticists have such huge datasets they hardly care) is to pick a standard value for each SNP, and replace each letter with 1 if it's identical to that value, and 0 if it isn't. For technical convenience, they sometimes then "normalize" this: for each cell, subtract the mean value of its (SNP) column (so that the column mean ends up as 0), then rescale so that each column has the same variance.
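To make the encoding step concrete, here is a minimal sketch in Python. The genotypes are invented, and taking the first individual's letter as the "standard" value for each SNP is purely my assumption for illustration; only the mean-subtraction half of the normalization is shown.

```python
# Toy data: 4 individuals (one string each) x 5 SNPs (one letter each).
# These letters are invented for illustration, not real genotypes.
letters = [
    "ACGTA",
    "ACGTC",
    "TCGAA",
    "TCAAC",
]

# Assumption: use the first individual's letter as the "standard" for each SNP.
standard = letters[0]

# Replace each letter with 1 if it matches the standard value, 0 otherwise.
data = [[1 if ch == std else 0 for ch, std in zip(row, standard)]
        for row in letters]

# Centre each SNP: subtract that SNP's mean across all individuals,
# so that every SNP ends up with mean 0.
m = len(data)      # number of individuals
n = len(data[0])   # number of SNPs
snp_means = [sum(data[i][j] for i in range(m)) / m for j in range(n)]
centred = [[data[i][j] - snp_means[j] for j in range(n)] for i in range(m)]

for row in data:
    print(row)
```

Note how much is thrown away already: which of the three non-standard letters an individual has no longer matters once everything is reduced to 0 or 1.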

Using this data matrix, you then create a covariance matrix by multiplying the data matrix by its own transpose, divided by the number of markers: in the resulting table, each cell gives a measure of the relationship between a pair of individuals. Assuming simple 0/1 values as described above, each cell will in fact give the proportion of SNPs for which the two individuals both have the same value as the chosen standard. Within linguistics, lexicostatistics offers fairly comparable tables; there, the equivalent of SNPs is lexical items on the Swadesh list, but rather than "same value as the standard", the criterion is "cognate to each other" (or, in less reputable cases, "vaguely similar-looking").
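A small Python sketch of that multiplication may help, again on an invented 0/1 matrix (4 individuals by 5 SNPs); the function name `relatedness` is mine, not a standard term.

```python
def relatedness(data):
    """Multiply an individuals-by-SNPs matrix by its own transpose,
    then divide by the number of SNPs."""
    n = len(data[0])
    return [[sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in data]
            for row_i in data]

# Invented 0/1 data: 1 means "same letter as the chosen standard".
data = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
]

rel = relatedness(data)
for row in rel:
    print(row)
```

With unnormalized 0/1 values, a product a * b is 1 only when both individuals match the standard at that SNP. Individuals 0 and 1 both do so at four of the five SNPs, so their cell comes out as 4/5; individuals 2 and 3 share only one such SNP, giving 1/5.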

Now, there is typically a lot of redundancy in the data and hence in the relatedness matrix too: in either case, the value of a given cell is fairly predictable from the value of other cells. (If individuals X and Y are very similar, and X is very similar to Z, then Y will also be very similar to Z.) PCA is a tool for identifying these redundancies by finding the covariance matrix's eigenvectors: effectively, rotating the axes in such a way as to get the data points as close to the axes as possible. Each individual is a data point in a space with as many dimensions as there are SNP measurements; for us 3D creatures, that's very hard to visualise graphically! But by picking just the two or three eigenvectors with the highest eigenvalues - ie, the axes contributing most to the data - you can graphically represent the most important parts of what's going on in just a 2D or 3D plot. If two individuals cluster together in such a plot, then they share a lot of their genome - which, in human genetics, is in itself a reliable indicator of common ancestry, since mammals don't really do horizontal gene transfer. (In linguistics, the situation is rather different: sharing a lot of vocabulary is no guarantee of common ancestry unless that vocabulary is particularly basic.) You then try to interpret that fact in terms of concepts such as geographical isolation, founder events, migration, and admixture - the latter two corresponding very roughly to language contact.
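For readers who want to see what "finding the eigenvectors with the highest eigenvalues" can look like in practice, here is a rough Python sketch using power iteration with deflation. That is one simple technique among several, not necessarily what any genetics package does. The 4 x 4 similarity matrix is made up: individuals A and B are very similar, as are C and D, with only weak similarity across the two pairs.

```python
import math

def normalise(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def mat_vec(matrix, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: keep applying the matrix and renormalising; the
    vector drifts towards the eigenvector with the largest eigenvalue."""
    v = normalise([float(i + 1) for i in range(len(matrix))])
    for _ in range(iterations):
        v = normalise(mat_vec(matrix, v))
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v

def deflate(matrix, eigenvalue, v):
    """Subtract an already-found component so the next one can be found."""
    size = len(v)
    return [[matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(size)]
            for i in range(size)]

# Hypothetical similarity matrix for four individuals A, B, C, D.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.2, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

value1, pc1 = top_eigenpair(similarity)
value2, pc2 = top_eigenpair(deflate(similarity, value1, pc1))

# Each individual's coordinates on the first two axes.
for name, x, y in zip("ABCD", pc1, pc2):
    print(name, round(x, 2), round(y, 2))
```

On this toy matrix, the first axis gives all four individuals the same sign (what they have in common), while the second gives A and B one sign and C and D the other: that is the clustering you would see in a 2D plot.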

The most striking thing about all this, for me as a linguist, is how much data is getting thrown away at every stage of the process. That makes sense for geneticists, given that the dataset is so much bigger and simpler than what human language offers comparativists: one massive multi-gigabyte cognate per individual, made up of a four-letter universal alphabet! Historical linguists are stuck with a basic lexicon rarely exceeding a few thousand words, none of which need be cognate across a given language pair, and an "alphabet" (read: phonology) differing drastically from language to language - alongside other clues, such as morphology, that don't have any immediately obvious genetic counterpart but again have a comparatively small information content.

Nevertheless, there is one obvious readily available class of linguistic datasets to which one could be tempted to apply PCA, or just eigenvector extraction: lexicostatistical tables. For Semitic, someone with more free time than I have could readily construct one from Militarev 2015, or extract one from the supplemental PDFs (why PDFs?) in Kitchen et al. 2009. Failing that, however, a ready-made lexicostatistical similarity matrix is available for nine Arabic dialects, in Schulte & Seckinger 1985, p. 23/62. Its eigenvectors can easily be found using R; basically, the overwhelmingly dominant PC1 (eigenvalue 8.11) measures longitude, while PC2 (eigenvalue 0.19) sharply separates the sedentary Maghreb from the rest. This tells us two interesting things: within this dataset, Arabic looks overwhelmingly like a classic dialect continuum, with no sharp boundaries; and insofar as it divides up discontinuously at all, it's the sedentary Maghreb varieties that stand out as having taken their own course. The latter point shows up clearly on the graphs: plotting PC2 against PC1, or even PC3, we see a highly divergent Maghreb (and to a lesser extent Yemen) vs. a relatively homogeneous Mashriq. (One might imagine that this reflects a Berber substratum, but that is unlikely here; few if any Berber loans make it onto the 100-word Swadesh list.) All of this corresponds rather well to synchronic criteria of mutual comprehensibility, although a Swadesh list is only a very indirect measure of that. But it doesn't tell us much about historical events, beyond the null hypothesis of continuous contact in rough proportion to distance; about all you need to explain this particular dataset is a map.

(NEW: and with PC3:)

Wednesday, April 04, 2018

Songhay crows and Korandje ravens

In Niamey, where I went last week for a workshop on Songhay as a cross-border language, the crows do something I've never seen them do in any other country: they come to the window and start tapping on the glass, like something out of Edgar Allan Poe. The reaction of my fellow attendees taught me a new Songhay word - gaaru-gaaru "pied crow" (Heath 1998) - which in turn revealed a new Korandje etymology. In Korandje, "raven" is gạḍi. The shift of intervocalic *d to r in mainstream Songhay is well-established (Nicolaï 1981). But the vowels are more interesting.

Korandje ạ usually derives from *ar or *or. In several inherited Songhay words, however, ạ seems to derive from *a not followed by *r: thus kạṣ-əw "rough" < kas-ow, bạzu "skin bucket, waterbag" < baasu, hạmu "meat" < *hamu, kə̣kkạbu "key" < *karkabu. Yet *a otherwise usually yields a in similar contexts: contrast gani "louse" < *gani, akama "wheat" < *alkama, dzam-a "do it" < *dam-a. It looks as though the vowel in the following syllable is what makes the difference: if it's rounded, you get ạ, otherwise you get a (though one or two exceptions suggest that the story may be more complicated: notably, "difficult" is gab-ə̣w < *gab-ow). Assuming this rule, *gaadu should regularly have yielded gaaru in mainstream Songhay and gạḍu in Korandje.

What we actually get, however, is gạḍi. Why? Well, Korandje has a rule of final high vowel deletion phrase-internally: if a word ends in i or u, its final vowel will be deleted unless it comes before a pause, ie most of the time. (Basically the opposite of Classical Arabic.) In a number of words, this seems to have led to confusion between original -i, -u, and consonant-final words. For instance, ạṣạnkri "skink" comes from Berber asrmkal, which should regularly have yielded ạṣạmkər; the i is unetymological (Souag 2015). In effect, speakers must have been hypercorrecting final high vowels - a fact which suggests that, if Korandje survives, it may be on its way towards phonologically losing them altogether, much as Classical Arabic did with final short vowels.

Monday, March 19, 2018

English spelling traces in Algerian placenames

Going east of Algiers along the coast, the names of two little port towns stand out. Their inhabitants know them as جنّات /d͡ʒənnat/ (sometimes جنّاد /d͡ʒənnad/) and دلّس /dalləs/ (or الدّلّس /ddalləs/). Those names would normally be transcribed in French as *Djennat (if not *Djennette) and *Delless. Yet in French - and hence, given the region's colonial history, in most Western languages - they are in fact written as Djinet and Dellys; the latter at least is very often even (mis)pronounced accordingly as /dɛlis/. French i and y are both normally pronounced /i/; why on earth would Frenchmen write the schwa /ə/ of these names in this way, when French has a schwa and normally writes it as e?

The most likely answer is that they didn't. Rather, they adopted or adapted these placenames' spelling from English - specifically, from the widely translated work of Thomas Shaw, an English reverend and Oxford fellow who spent several years in Algeria in the early 1700s, a century before France occupied Algiers. He spelt the two towns' names as Jinnett and Dellys respectively - a spelling which, in English, yields the almost exactly correct pronunciations /d͡ʒɪnɛt/ and /dɛlɪs/.

Shaw's book was translated into French by 1743, and the translator retained the English spellings of both names. In a later edition no doubt prompted by the French invasion (1830), Jinnett got amended to Djinnett - someone had finally got around to noticing that English j is pronounced like French dj, not like French j. The doubled letters, useful for indicating vowel quality in English but serving no purpose in French, were lost within a decade, as seen in Eyriès (1839). But the i of Djinet, and the y of Dellys, remained to testify to a period when French geographers relied on an English traveller to tell them about Algeria - and to confirm most colonists' lack of interest in how the locals pronounced these names.

Saturday, March 17, 2018

Good speaking is not good writing

There's an article by Nathan Robinson that's been going around recently titled "Jordan Peterson: The Intellectual We Deserve". After pages of apparently reasonable criticisms of his subject, the author delivers what he seems to think is his coup de grâce:
Even now, however, I am being too generous to Jordan Peterson’s intellect. I have been presenting him at his most comprehensible and polished. I have not been giving you the full experience of actually listening to him talk. Sitting through a Jordan Peterson lecture is very different to watching a rapid-fire television interview. Below, please find a fully-transcribed portion of 17 minutes of Peterson’s speech.[...] (NOTE: UNDER NO CIRCUMSTANCES ATTEMPT TO READ THE ENTIRETY OF THE FOLLOWING PASSAGE. READ AS MUCH AS YOU CAN BEFORE YOU BEGIN TO FEEL WEARY, THEN SCROLL QUICKLY TO THE END.)
Just to stack the scales a bit further, the transcription features no paragraphing. Nevertheless, I did read it - much quicker than watching some random video for 17 minutes! - and, rather anticlimactically, found a perfectly coherent and reasonably entertaining (if very likely unfair) parenting anecdote, obviously intended to illustrate the importance of setting boundaries. I rubbed my eyes and thought "How is it that an intelligent, well-educated native speaker of English can apparently not only see this transcript as an incoherent mess but also assume all his readers will? Am I crazy, or is he?"

The answer is simple: good speaking is not the same thing as good writing. Take a great talk, one that keeps a non-academic audience riveted, and transcribe it verbatim; it will almost always look rambling and repetitive on the page, unless you're already accustomed to reading such transcripts (part of the job for a descriptive linguist, but a rare experience for most people). That's simply the nature of the medium, and adequately explains the expected audience reaction. Maybe it even explains the author's reaction, if the only context he ever encounters long talks in is academia.

One of the author's main points - a valid one, I think - is that academics need to communicate better with the public for everyone's sake:

[...] he is popular partly because academia and the left have failed spectacularly at helping make the world intelligible to ordinary people, and giving them a clear and compelling political vision.
If so, the first step is to learn appropriate discourse strategies. You don't talk to confused young people on YouTube as if you were addressing a learned seminar, much less writing an article. Nathan Robinson surely realises this himself - but, by going for cheap laughs at the expense of a perfectly ordinary example of spoken language, he's not only weakening his main point but encouraging the very blindness to orality that makes it difficult for many academics to communicate with the public. Academics can surely do better - let a thousand learned YouTube channels bloom! - but not without (re)learning how to talk to the people they want to talk to.

Monday, March 12, 2018

Qaswarah revisited: a Qur'anic hapax in Modern South Arabian

A long time ago, I posted some rather speculative musings on the minor mystery of the allegedly Ethiopic word qaswarah قسورة in the Qur'ān, usually considered to mean "lion". An anonymous commenter years later came up with a much better but still rather speculative idea:
Research substantiates that both “lion” and “hunter” are plausible according to analyses of Proto-Highland Eastern Cushitic wherein “kas” is to stab, pierce or cut and the suffix of “wara” creates “agent nouns”. In modern “Ethiopic” languages such as Tigrinya and Ge’ez (as well as in some other African languages) the word “Wagatwara” means “hunter” and in earlier etymons of this word the “g” is rendered a “q” and the “t” is rendered an “s”.

But just now, looking through a Hobyot vocabulary (Nakano 2013:215), I came across an entry that makes all this discussion unnecessary. In Hobyot, "panther" is ḳáyṣ̂ər, with a plural ḳaṣ̂áwrət - clearly related to the term used in the Qur'ān, and clearly (given the ṣ̂) not borrowed from Arabic. The meaning corresponds closely enough to most commentators' consensus on qaswarah, while the location - in the extreme south of Arabia - helps explain why the term might have been associated in their minds with Ethiopia. In fact, the irregular correspondence of Hobyot ṣ̂ to Arabic s would suggest a loan into Arabic, rather than common inheritance, even if we didn't know how much this word puzzled the commentators.

Incidentally, the minority interpretation "archers" is presumably based on Persian, where -var added to a noun means "possessor of" - presumably, Arabic qaus "bow" + Persian -var would yield "bowman", and the feminine suffix -ah would form the plural as so often with nouns of profession. In light of the Hobyot form, it also should be clear that the majority of commentators were right to reject this interpretation.

Thursday, February 15, 2018

"Don't impose on me a language that isn't a vehicle of science": the Salhi scandal

Two years ago, the Algerian state finally decided to make Tamazight (Berber) an official language. In practice, this has not by any means implied giving it the same status as Arabic (much less as French). It has encouraged an expansion of Tamazight teaching, which is being extended to all wilayas (provinces) rather than just the ones with large numbers of Berber speakers. But Tamazight lessons - unlike Arabic, French, or English - remain completely optional. Most parents have no desire for their children to learn Tamazight, and were regularly complaining even before the question arose that the curriculum was too packed. Nevertheless, the very idea that Tamazight might someday be a required school subject seems to have been enough to drive at least one MP - the now-notorious Naima Salhi - into a ranting fury.

I've been reluctant to post about the Naima Salhi scandal, since it's obviously being used by this nonentity as a way to inflate her public profile. But when I heard the actual words of her paranoid rant against Berber, I realized I had to. Her words, thankfully, have been overwhelmingly repudiated by her peers. But her "reasoning" is a perfect specimen of a linguistic ideology that many people all over the world subscribe to, with a few instructive twists coming from the diglossic context of Algeria. As such, it's worth a closer look. Here's what she said, translated from - dialectal - Algerian Arabic into English:

"So don't impose on me a language - it's not a language anyway - don't impose on me a language that isn't a vehicle of science; don't impose on me a language that isn't recognized, isn't understood by people outside; what good is it to me? Study science with it? It doesn't have - it isn't a vehicle of science. Study technology with it? It isn't a vehicle of technology. Go abroad with it, to speak to people abroad? They don't know it and don't understand it. For God's sake, what good is it to us?
When it comes to the Arabic language - and oh, what a language! - which is the world language, which more than a billion people speak, they say we won't study it; a language which has billions of books, and billions of manuscripts, and billions of - everything - you say you won't study it and don't need it. Then you bring me a dead language, which doesn't have letters, and doesn't have meanings, and doesn't have words - you want to hold me back with it so you can make progress - and you go off, and eventually you get to the point, and you tell me: Me, I'm studying English, and I'm studying German, and Spanish, and Turkish, and you all don't know them. You're going to hold me back with this?
My little daughter was studying in a private school where most of them were Kabyles. She naturally learned the language with them, because her classmates' parents taught them to speak Kabyle, so it would continue and spread. So my daughter, with the best of intentions, learned with them. She'd come and speak it, and I never asked her "Why?" I didn't shut her up; I left her free to do as she likes. But now that we've gotten to the point where it's obligatory, I told her: Say another word in Kabyle (Berber) and I'll kill you, I'll discipline you if you say another word.
And I'm saying it plainly and challenging everyone: When we were going by intentions / naive, we didn't say a thing; now that it's become "push me and I'll step on you", don't push me and I won't step on you. Now we're going to make it about who's stronger? And the most for the stronger one? The majority is stronger. You'd have been better off leaving it down to intentions. Now that you think you're so smart and coming out with insults against us, now I'll insult you.
People like me, and people who are real men, and those who don't accept humiliation and aren't used to it, and whose family aren't used to it, won't accept from you something like this. And I now forbid my children from pronouncing a single word in Tamazight. I mean the Frenchified Kabyle made by the MAK and the treasonous terrorist MAK movement. And we need to demand that the MAK is a terrorist movement."
Let's pass over the bizarre misconceptions and factual errors for now (it doesn't have words???), and go to the heart of the matter. It's not an unusual phenomenon anywhere to find speakers of a majority language objecting to having to learn a supposedly useless minority language - look at Swedish in Finland, or Welsh in Wales, or even Irish in Ireland. In this case, however, diglossia introduces a further twist, making her very examples undermine her ideas.

She presents Kabyle as useless for what seem like bluntly utilitarian reasons: it's only spoken by other Algerians and it won't help you study science and technology. Yet most Algerians spend most of their lives in Algeria, and most people anywhere don't study science and technology past high school. By her own testimony, Kabyle is widely enough spoken that her daughter could pick it up in a private school even in a non-Kabyle area. Had her daughter failed to do so, she would presumably have had fewer friends, and found herself excluded from routine social interactions. Yet somehow, for Salhi, that fact doesn't even register as relevant to the question of the language's usefulness. The dialectal Arabic she's speaking is not taught in any school, and the idea of teaching it would no doubt drive her to even greater fury. Dialectal Arabic is by far the most widely used language in Algeria, without which she would find herself deaf and dumb in her own country - just ask any Kabyle outside Kabylie whether it's worth learning - yet that doesn't enter into her definition of "useful" either. A language is "useful", in fact, only if its presence in daily life is so limited as to make it useless in most contexts. Only then can speaking it be a valuable accomplishment that gives you access to coveted jobs, rather than a routine ability that remains invisible until you run into someone who lacks it. Only then is it an appropriate subject for study.

But Tamazight activism threatens to upset that basic rule. If Tamazight ever does become part of compulsory education, that would lead to children studying and getting graded on a language that some of them already speak. How hideously unfair! The Kabyle-speaking children won't need it, and the Arabic-speaking children won't want it. Clearly the only possible explanation for such a move is that Kabyle speakers want to give themselves an unfair advantage at school, and handicap the Arabic speakers. (/sarcasm) The idea that there might be another side to this - that Kabyle speakers would still have to learn dialectal Arabic on their own as they always have, getting no extra credit for that effort, whereas Arabic speakers would be getting government help in learning Kabyle - doesn't even seem to cross her mind.

Monday, February 12, 2018

Shocked by Arabic?

In the course of the recent media furore in France sparked by Mennel Ibtissem's rendition of "Hallelujah", a TV journalist named Isabelle Morini-Bosc managed to spark her own micro-furore by remarking:
«Pas le voile, pas la chanson en arabe, même si je trouve que par les temps qui courent, ça ne s'imposait peut-être pas nécessairement, mais en revanche ce qu'elle a posté oui, ça me choque, sur les attentats de Nice, ça me choque». (video)

"Not [Mennel's] veil, not the song in Arabic - even though I find that, in these times, it may not necessarily have been essential - but what she posted, yes, that shocks me, on the Nice attacks, that shocks me."

The controversy was, of course, over the parenthetical remark and the scope of its implications. Most listeners understood "these times" as an allusion to the threat of terrorism, and the whole remark as asserting that singing in Arabic was inappropriate because Arabic is associated with terrorism - an implication which naturally provoked some outrage. She responded that "I like French songs on the programme because phonetically that's how you can tell whether someone is articulating or not [...] She could have been Serbo-Croatian and singing in Serbo-Croatian and I'd have said the same". A plausible-sounding justification on its own, but difficult to reconcile with her original wording - why "in these times"? And why was she commenting specifically on the Arabic, when the song had been in both English and Arabic? All things considered, it seems rather more likely that the listeners' interpretation was correct.

However, the really interesting thing about her original sentence is not so much the parenthetical remark as the contrastive focus. She explicitly asserts that Mennel's veil and her singing in Arabic do not shock her; apparently, she is too broad-minded to worry (much) about those little things. But in the process of making that assertion, she presupposes that her audience, less cosmopolitan than herself, might reasonably expect her to be shocked by both of those things. The implicit message has two sides to it: it's better not to let yourself be shocked by people singing in Arabic on a French TV show - but it's also perfectly normal to be shocked by it. Hmm...

Thursday, January 11, 2018

Tokenistic Tifinagh #fail 2

The Algerian government recently decided to make the Amazigh New Year (really the Julian New Year) - coming up tomorrow - an official holiday. This holiday is actually traditional in a lot of Arabic-speaking areas too, in Algeria and across North Africa - and its origins are of course Roman - but over the past few decades it has been reinterpreted as an Amazigh holiday rather than a North African one, and the government made it official specifically as a gesture towards Amazigh identity. In non-Amazigh areas, this creates some quandaries, as illustrated by the announcement below by the government of the wilaya (province) of Blida...
The Algerian flag in the middle is flanked on all sides by easily recognizable signs of Amazigh identity - the letter aza, the abzim pins, etc. - none of which are particularly associated with Blida (even though there are still small Berber communities in the mountains above Blida, not to mention Kabyle migrants). The main text is in Arabic, but there is one line of Berber in Arabic script - تفاسكا ن يناير tfaska n Yennayer "holiday of Yennayer", using a word for "holiday" that in a Kabyle context amounts to a modern neologism - and two lines written in Tifinagh, whose geometric shapes add yet another easily recognizable symbol of Berber identity. If you try to read those lines, though, they turn out in each case to be simple transcriptions (not translations) of the line of Arabic above them:

"Celebration of the Amazigh New Year"
احتفالية رأس السنة الأمازيغية iḥtifāliyyat ra's as-sanah al-'amāzīɣiyyah
 ⴰⵃⵜⴼⴰⵍⵉⴰ ⵔⴰⵙ ⴰⵍⵙⵏⴰ ⴰⵍⴰⵎⴰⵣⵉⵖⵉⴰ aḥtfalia ras alsna alamaziɣia

"Algerian and proud of my Amazigh identity"
جزائري وبأمازيغيتي أفتخر jazā'irī wabi'amāzīɣiyyatī 'aftaxir
ⵊⵣⴰⵉⵔⵉ ⵡⴱⴰⵎⴰⵣⵉⵖⵉⵜⵉ ⴰⴼⵜⵅⵔ jzairi wbamaziɣiti aftxr

It's arguably not quite as bad as the Oran case we saw last time; at least this transcription doesn't randomly discard letters.  Nevertheless, the message it sends is once again clear: nobody involved in the making of this official, centralized celebration of Amazigh identity speaks Berber, or thought it would be worthwhile to get someone who does speak it to help them out.  If the Algerian government seriously wants to make Tamazight official throughout the country, it's got a long way to go...


PS (update 19/01/2018): Not worth a whole post, but I just came across yet another example:
العمال يطالبو... | وزارة الفقر والسّعادة has:
ارحل ...ارحل ....ارحل
بالعربية : ارحل
بالامازيغية : ⴷⴹⴳⴰⴳⴹ
بالفرنسية : Dégage
بالانجليزية : Get out
ⴷⴹⴳⴰⴳⴹ is dḍgagḍ, where ḍ happens to look just like an e; explanation is hopefully superfluous...

Thursday, January 04, 2018

Taleb unintentionally proves Lebanese comes from Arabic

So Taleb has jumped back on his hobbyhorse with yet another post on Lebanese not being Arabic; see my previous posts Why "Levantine" is Arabic, not Aramaic: Part 1, Part 2, Part 3, Zombie hypotheses and the Zeitgeist, On finding the sources of shared items. The funniest thing about this one is that he's been helpful enough to provide a wordlist (for his dialect, I presume) that - despite a number of typos, almost all of which increase the apparent similarity between Levantine and non-Arabic Semitic languages - should be enough all by itself to prove to anyone in doubt that Lebanese is clearly descended primarily from Arabic, with very little Aramaic influence and even less from Canaanite/Phoenician. Unfortunately, he wasn't as helpful on the grammar, not bothering to include equivalents from other Semitic languages for the pronouns and verbal conjugations...
But I don't have all day to spend beating this dead horse, and doing etymology properly takes time. So let's just have a quick look at the first page of his wordlist (well, probably the second one - the real first one seems to be missing), and leave the other pages as an exercise for the reader.

Out of these 39 words, 18 seem to be unambiguously Arabic in origin - either they share specific sound changes with Arabic to the exclusion of the rest of Semitic, or they use a root not used in the appropriate meaning elsewhere in Semitic. Only two look like being Aramaic rather than Arabic in origin (and the evidence in both cases is fairly weak): "hand" and the patently non-basic vocabulary word "image". (Taleb would add a third, zalame "man", but this word has an at least equally plausible Arabic etymology, making it ambiguous at best.) The remaining 19 words are ambiguous, and could in principle derive from more than one Semitic language - but even there, the situation is not symmetrical; all 19 could derive from Arabic, whereas no more than 11 of them could derive from Aramaic. The unambiguous cases give the following ratio: 18 Arabic : 2 Aramaic : 0 everything else. On that basis, we should therefore expect 90% of the ones ambiguous between Arabic and Aramaic (ie all but one) to derive from Arabic, not from Aramaic, and all of the ones ambiguous between Arabic and another Semitic language but not Aramaic to derive from Arabic. For details, see the following table:

1 goat Arabic does not share Canaanite+Aramaic+Ugaritic *nC > CC; does not share Akkadian *ʕa > e
2 god Arabic / Aramaic shows innovative gemination of the l, attested only in Arabic and some dialects of Syriac
3 good innovative the Arabic etymology is obvious, but the root is pan-Semitic so we may generously assume that it could in principle have derived from some other branch
4 grass Arabic does not share Aramaic and Phoenician *ś > s ; does share Arabic *ś > š
5 grind Arabic / Canaanite does not share Akkadian *aħa > ê ; does not share Aramaic CaCVC > CCVC
6 hair Arabic / Ugaritic does not share Aramaic and Phoenician *ś > s ; does share Arabic *ś > š ; does not share Akkadian loss of *ʕ
7 hand Aramaic although a change of *yad > *īd is natural enough that it could easily have happened independently in Arabic...
8 hare Arabic / Canaanite / Aramaic / Akkadian no distinctive innovations
9 he-goat Arabic / Canaanite / Aramaic no distinctive innovations
10 head Arabic / Ugaritic does not share Canaanite *aʔ > *ā > ō nor Aramaic *aʔ > ī nor Akkadian *aʔ > ē ; the form rās (with loss of the glottal stop) is well-attested in early Arabic dialects
11 hear Arabic does not share Aramaic and Phoenician *s > š (I'm going with Huehnergard's reconstruction of proto-Semitic sibilants here). Note that the correct Syriac form is šmaʕ, not sma3 ; likewise the Hebrew
12 heart Arabic The initial glottal stop (still pronounced q in, for example, Alawite dialects) can only be explained from the Arabic form, which is a lexical innovation replacing original *libb
13 honey Arabic 3asal is clearly Arabic, and – as I've pointed out before – dabs is attested in Classical Arabic as well as in Hebrew and Aramaic
14 horn Arabic / Canaanite / Aramaic / Akkadian / Ugaritic no distinctive innovations
15 horse Arabic Syriac ḥsan 'strong' has s, not ṣ, but even if it were cognate, the Classical Arabic and Levantine form still share a semantic shift unattested in Aramaic
16 house Arabic / Canaanite / Aramaic / Ugaritic Akkadian can be ruled out, since it shows a shift *ay > ī which never happened in Levantine.
17 hundred Arabic / Canaanite / Aramaic / Akkadian / Ugaritic The only innovation here, ʔ > y, is not shared with any of the ancient languages in question
18 hunger Arabic Even assuming jūʕ has cognates elsewhere in Semitic, the change g > j is specific to Arabic
19 hunt Arabic / Canaanite / Aramaic / Akkadian / Ugaritic The only innovation here, use of the D-stem, is not shared with any of the ancient languages
20 image Aramaic Since when is 'image' basic vocabulary? But yes, assuming we can trust the transcription, it shares the aw with Aramaic
21 inside Arabic / Aramaic Mixed signal here: the meaning looks like Aramaic, but the sound shift g > j is Arabic not Aramaic. In reality, the word *jaww must originally have meant 'inside' in Arabic too; it lost this meaning in Classical Arabic, but kept it in many of the dialects
22 iron Arabic
23 kidney Arabic / Canaanite / Aramaic / Akkadian / Ugaritic The only innovation here, *y > w, is not shared with any of the ancient languages (but _is_ shared with many other modern Arabic dialects...)
24 kill Arabic / Canaanite Does not share Aramaic CaCVC > CCVC
25 king Arabic / Canaanite / Aramaic / Ugaritic Since when is 'king' basic vocabulary?
26 knee Arabic Shares a unique innovation with Arabic – the metathesis brk > rkb
27 know Arabic
28 laugh Arabic Shares a unique innovation with Arabic – the sound shift *ɬ' > ḍ (which came relatively late in Arabic – later than Sibawayh, even – and never happened in any other Semitic language). I can't speak for Amioun, but in general Levantine has ḍaḥak; if Amioun does have ḍaḥaq, the fact that it didn't become *ḍaḥaʔ suggests that the *k > q happened there only after the regular shift *q > ʔ, and hence has nothing to do with the Canaanite or Ugaritic forms.
29 leg innovative The alleged Ugaritic form is nonsense – Ugaritic had no j sound, and the dictionary of Del Olmo Lete and Sanmartin reveals no appropriate Ugaritic form. It is true that the Levantine form seems to be shared with Ethiopic and some Yemeni dialects, but not with any ancient language of the Fertile Crescent.
30 lion Arabic A very problematic choice as 'basic vocabulary'.
31 live Arabic / Canaanite / Aramaic Except that the Levantine form is clearly 'alive', not 'live', making the whole comparison problematic....
32 love Arabic The Arabic is of course mistranscribed - in his terms, it should be 2a7abba, whereas the Hebrew and Aramaic forms really do have a h.
33 make Arabic
34 man innovative 'zalame' is etymologically problematic – both Arabic and Aramaic etymologies have been proposed. 'rejjel' is of course from Arabic. dakar is 'male', not 'man'.
35 many Arabic
36 meat Arabic This shares a specific semantic shift with Arabic to the exclusion of the rest of Semitic : « staple food » > « meat »
37 milk Arabic / Ugaritic The root is common to several Semitic languages, but the use of the passive pattern fa3īl in this word is unique to Arabic
38 month Arabic Pretty sure the normal Levantine form is shahr, not sha7r, not that it makes any difference to the etymology – and for sure Syriac 'moon' below is sahrā, not šahrā.
39 moon Arabic

Saturday, December 23, 2017

Tokenistic Tifinagh #fail

In Oran (Algeria) when I was there a few days ago, political party posters were everywhere, advertising the recent local elections. Oran is nowhere near any major Berber-speaking region (though it has attracted a significant Kabyle Berber minority), and such posters – along with a few telecom ads – were almost the only publicly visible mark of Berber on its linguistic landscape. Their bilingualism is a token gesture towards the government's pious aspiration to make Tamazight (Berber) a national language, emanating from the centre rather than from the regions where it's actually spoken.

Among these, the FLN posters in particular caught my attention. Right under the Arabic name of the party, they included a line in Tifinagh (the Berber “heritage” script) that I couldn’t make head or tail of: ⵔⴵⵏⵜⴷⴻ ⵉⴱⵔⴰⵜⵉⴵⵏ ⴰⵜⵉⴵⵏⴰⵍⴻ. Transcribed, this reads rğntde ibratiğn atiğnale – which makes no sense; it’s not even possible in Berber to have e (schwa) at the end of a word.

It wasn’t until I started looking at my pictures on the flight back that the penny dropped. Just substitute o for ğ, and you get rontde ibration ationale. Restoring the capital and accented letters (neither of which Tifinagh has), you get Front de Libération Nationale. When the order came from on high to add Tamazight to the poster, some supremely indifferent functionary in the local FLN office must have literally downloaded a Tifinagh keyboard, typed in the French name of the party, and stuck it on the poster.
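The decoding step can be sketched in a few lines of Python. This is only an illustration: the letter values in the table below are the ones implied by the transcription given above, and the names `TIFINAGH_TO_LATIN` and `transliterate` are invented for the example rather than taken from any real keyboard layout or library.

```python
# Letter values as implied by the transcription in the post (an assumption,
# not a full Tifinagh transliteration scheme).
TIFINAGH_TO_LATIN = {
    "ⵔ": "r", "ⴵ": "ğ", "ⵏ": "n", "ⵜ": "t", "ⴷ": "d", "ⴻ": "e",
    "ⵉ": "i", "ⴱ": "b", "ⴰ": "a", "ⵍ": "l",
}


def transliterate(text):
    """Replace each Tifinagh letter with its Latin value; keep everything else."""
    return "".join(TIFINAGH_TO_LATIN.get(ch, ch) for ch in text)


poster_line = "ⵔⴵⵏⵜⴷⴻ ⵉⴱⵔⴰⵜⵉⴵⵏ ⴰⵜⵉⴵⵏⴰⵍⴻ"
transcribed = transliterate(poster_line)

# The one substitution that makes the French name visible: o for ğ.
recovered = transcribed.replace("ğ", "o")

print(transcribed)
print(recovered)
```

What comes out still lacks the capital letters and the accented é, which is exactly the gap described above.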

Most likely, this functionary was an Arabic speaker. In fairness, though, plenty of Kabyle speakers would have little idea how to render “National Liberation Front” into Kabyle. The officially acceptable way of doing so relies on neologisms developed by activists and familiar mainly to other activists, despite the gradually expanding efforts of teachers and broadcasters – and such activists are especially unlikely to be members of the FLN, given its general reluctance to promote Tamazight. What everyone actually calls it in practice (in Kabyle and in Arabic alike) is “FLN”.

I didn’t notice any similarly clearcut fails on other parties’ posters – though some didn’t bother with Tamazight at all, and at least one, the PT, opted for Latin characters instead. I did see a similar case on a jewelry shop, though, which prominently advertises ⴰⵔⴳⴻⵏⵜ argent, next to a picture of a recognizably Kabyle earring:

It's striking that both cases are based on French, rather than Arabic - even though the normal Kabyle word for "silver" is actually an Arabic loan, lfeṭṭa from الفضة. For some, apparently, the only really important thing about Tamazight is that it's not Arabic...

Sunday, December 10, 2017

Jerusalem's suppletive gentilic

Jerusalem stands out among Arab cities today not only culturally and religiously, but morphologically as well. In Modern Standard Arabic, the city of Jerusalem is al-Quds القدس, and the gentilic suffix is -ī (properly -iyy), but "Jerusalemite" is Maqdisī مقدسي rather than the expected *Qudsī (though the latter is attested as a personal name). As a general cross-linguistic rule of thumb, morphological irregularities are most likely with older, more basic words. Yet this type of irregularity is rather unusual, even among the region's oldest and most prominent cities: Dimashq (Damascus) yields Dimashqī (Damascene), Baghdād yields Baghdādī, Makkah (Mecca) yields Makkī... How did it arise?

It turns out that, in the early Muslim era, it was formed in a perfectly regular way. In his masterwork, the medieval geographer Al-Maqdisī (d. 991) calls his hometown Bayt al-Maqdis بيت المقدس ("house of holiness"), a title now largely supplanted by al-Quds ("the holy"). It survives to the present in certain religious contexts or as a poetic synonym, not only in Arabic but in Kabyle Berber as well: H. Genevois ("Croyances") notes a traditional popular belief that the souls of the dead gather in Bit Elmeqdes, corresponding exactly to Al-Maqdisī's boast that Jerusalem is "the site of the Day of Judgement, and from it is the Resurrection, and to it is the Gathering" (عرصة القيامة ومنها النشر وإليها الحشر).

A quick search of Alwaraq's heritage library suggests that the shorter name "al-Quds" became popular around the period of the Crusades, when Jerusalem was as much a subject of dispute as now. The earliest attestation I can spot on a cursory search (excluding a work falsely attributed to al-Wāqidī) is a mention by the Andalusi traveller Ibn Jubayr (1185), who notes that "between [Kerak] and al-Quds is a day's march or so, and it is the best location in Palestine" (بينه وبين القدس مسيرة يوم أو اشف قليلاً، وهو سرارة أرض فلسطين). Very likely a longer search would yield slightly older attestations. By the time of the next major Palestinian writer I notice in the collection - Al-Ṣafadī (d. 1363) - al-Quds had clearly become the unmarked term for the town; it recurs constantly in his work.

The name Bayt al-Maqdis was thus replaced in practice by the shorter and catchier name al-Quds a good 800 years ago, yet the corresponding gentilic continues to preserve the older name. Since 1967, the Israeli government has imposed a third name as its official term for the city in Arabic: Ūrshalīm, a transcription of the Syriac name used in Christian liturgical contexts which provoked "furious ridicule" from residents (Segev 2007:492). Since this usage remains entirely unknown to most Arabic speakers, it is unlikely to have much impact on Arabic usage. Yet the timing of the shift from Bayt al-Maqdis to al-Quds reminds us that political upheaval impacts placenames as well as people's lives.

Monday, December 04, 2017

Tifinagh and place of articulation

The order of the Latin alphabet we use is a matter of historical chance; if it ever made sense, the reasons behind it were lost millennia ago. Many other writing systems, however, have tried to order their letters in a less arbitrary fashion. The most prominent successes for this approach are found in and around India, where scripts are usually ordered by place of articulation - ie, by how far back in the mouth they are pronounced - as in Devanagari: a..., ka kha ga gha ŋa, ca cha ja jha ña, ṭa ṭha ḍa ḍha ṇa... (After a couple of sound changes, this order ultimately also yields that of the Japanese kana: a, ka, sa (< ca), ta na, ha (< pa) ma, ya ra wa n.) In Arabic, the normal order of letters reflects a partial reordering by shape rather than by sound (thus ب ت ث are all grouped together, whereas in the older order they were far apart from one another). However, for technical purposes such as traditional phonetics and Qur'an recitation, one occasionally also finds the place-of-articulation order: indeed, the earliest Arabic dictionary (Kitāb al-`Ayn) used it (ع ح هـ خ غ ق ك ج ش ض ص س ز ط ت د ظ ذ ث ر ل ن ف ب م و ي ا ء).

Tifinagh, the traditional script of the Tuareg people of the Sahara, seems not to have any established traditional ordering. However, if you organize its letters by place of articulation, an obvious pattern emerges:

This table represents Tifinagh as used at Imi-n-Taborăq in Mali, as recorded by Elghamis (2011:64-65). (Note that w is a labio-velar sound; for obvious reasons, I've chosen to place it in the velar column rather than the labial one. Also, the letter put in the laryngeal plosive slot actually just indicates the presence of a final vowel, although there are reasons to suspect that it once represented a glottal stop.) There is a lot of regional variation in Tifinagh, but one thing stands out: in every variety, everything on the right side of the thick line - ie, everything velar or further back - is consistently formed exclusively out of dots, except for g - and even that is often composed of a combination of dots and lines. Throughout much of Tuareg, original g tends to be palatalized to [ɟ], and some dialects - like this one - have lost the distinction altogether.

How this distribution emerged is unclear for the moment. It is noteworthy, however, that dot letters did not exist in Tifinagh's ancestor, Libyco-Berber as used in the pre-Roman and early Roman periods (with rare, doubtful exceptions). Two of the dot letters have clear Libyco-Berber origins; ⴾ (k, three dots in a triangle) was originally ⥤ (k, a rightwards open arrow), while : (w) was originally =. Based on these two alone, one might suppose a sort of regular form shift of = to :, in which case the development might simply be coincidental. ⵗ (ɣ) may derive from the rarely attested ÷, whose value (q?) is speculative, while ... (x) is simply a rotation of ɣ. :: (q) had no Libyco-Berber equivalent, and is perhaps historically a visual "ligature" of ɣ and + (t) - the word-final cluster *ɣt becomes qq in Tuareg. The final vowel sign · might derive from classical ☰, which had the same function; alternatively, one might derive it from or the dot occasionally used to separate words, and suppose that classical ☰ actually yielded ⵂ (h), in which case the extra dot needs to be explained.

It's not impossible that Tifinagh users at some stage made a conscious link between back consonants and dots. But even if the distribution is just a coincidence, it should still be useful for anyone seeking to memorise the script.

Sunday, October 29, 2017

Butterfly-collecting: the history of an insult

Chomsky's barb about butterfly-collecting has echoed in the ears of descriptive linguists for decades, and is sometimes blamed for the withering away of field linguistics over the late 20th century. The earliest published version I could track down via Google is:
"You can also collect butterflies and make many observations. If you like butterflies, that’s fine; but such work must not be confounded with research, which is concerned to discover explanatory principles of some depth and fails if it does not do so." (Chomsky 1979:57)
So I was surprised to find a similar statement attributed to the eminent early 20th century physicist Ernest Rutherford, quoted by Dyson (2006:179) as saying "Physics is the only real science; the rest are butterfly-collecting." How did this metaphor make its way into linguistics?

For a start, it appears that Dyson's version is somewhat inexact. The Rutherford quote appears to belong to the oral tradition of physics, rather than deriving from any publication of his; the earliest version that I can find on Google Books is from Baker (1942:96):

"These ideas are crystallized in the statement, attributed to Rutherford, that science consists of physics and stamp- collecting. This is an epigram intended to mean that particular objects are uninteresting : it is the extreme view-point of a general analytical scientist."
The shift from stamps to butterflies came decades later, first attested only in 1974. In fact, the derisive comparison to butterfly collecting seems likely to have seeped into linguistics not from physics but from, of all subjects, anthropology. Edmund Leach (1961:2) makes it the central metaphor of his assault on Radcliffe-Brown:
"Radcliffe-Brown maintained that the objective of social anthropology was the 'comparison of social structures'. [...] Comparison is a matter of butterfly collecting — of classification, of the arrangement of things according to their types and subtypes. The followers of Radcliffe-Brown are anthropological butterfly collectors and their approach to their data has certain consequences."
Anthropologists would reuse the metaphor in debates over the distinction between different types of comparison in linguistics itself, whether endorsing it like Lehman (1964:387) or rebutting the criticism like Sarana (1965:29). From there it seems to have been taken up by Chomskyan linguists as an argument against Bloomfield's "discovery procedures", if I am correctly interpreting the incomplete fragment of Ferber and Lynd (1971) that I can find on Google Books:
"These procedures, which are largely a matter of classification, have been uncharitably called "butterfly-collecting" in the manner of pre-Darwinian biology: they account for a detailed "external" description of each language (what Chomsky [...]"
Geoffrey Leech (1969:4) deploys the same metaphor against rhetoric:
"Connected to this is a second weakness of traditional rhetoric - what I am tempted to call its 'train-spotting' or 'butterfly-collecting' attitude to style. This is the frame of mind in which the identification, classification and labelling of specimens of given stylistic devices becomes an end in itself [...]"
The redeployment of this argument to belittle descriptive work in general, rather than particular approaches, seems to be attributable to David DeCamp (1971:158), criticizing sociolinguistics from a Chomskyan perspective:
"The weakest theory is a 'functional' model, which only relates outputs from the black box to inputs, e. g. a grammar which would generate all and only the sentences of a language; the goal of much scientific research is to replace such a functional model with a 'structural' model, one that makes the stronger claim of describing what is actually in the black box. Mendel's 'genes' were only a functional model of genetics; the research on the DNA and RNA molecules has yielded a model that is much more nearly structural. Thus one branch of biology has at last become a true science; general linguistics is approaching that status; sociolinguistics is still in the pre-theoretical, butterfly-collecting stage, with no theory of its own and uncertain whether it has any place in general linguistic theory."
He then clarifies (ibid:170) that:
"'Butterfly collecting' is simply the collection of a whole lot of information toward the day when somebody can produce a formal theory. Now this is valuable, this is useful. We need a lot of empirical data collection also. I certainly would not want to imply by this that in this I'm saying that there is not an importance to the kinds of things that the Urban Language Survey is doing at CAL, or Bill Labov's work in New York. This is immensely important. What I am saying is that although it is necessary, it is not sufficient. We've got enough data now; it is about time to guide further research by means of some sort of a theory."
So, if we have to blame one person for reducing descriptive linguistics to butterfly collecting, it looks like it would be David DeCamp, at least until someone tracks down an earlier citation. But that misses a broader point: the disparaging comparison of data gathering to butterfly collecting seems to have become rather pervasive across a variety of disciplines in the late 20th century - including biology itself, which may well be part of where DeCamp got it from. All the way back in 1964, Theodosius Dobzhansky - who had been an ardent butterfly collector before becoming a prominent evolutionary biologist - comments sarcastically that:
"The notion has gained some currency that the only worthwhile biology is molecular biology. All else is "bird watching" or "butterfly collecting." Bird watching and butterfly collecting are occupations manifestly unworthy of serious scientists!" (Dobzhansky 1964:443)
Had he lived to see molecular biology turn to such quintessentially descriptive, list-making pursuits as the Human Genome Project, he would surely have enjoyed having the last laugh.

(If you have any earlier citations bearing on the history of this metaphor in linguistics, please tell me below!)

Tuesday, October 24, 2017

Siwi on Wikipedia

I am not a big fan of Wikipedia, despite its usefulness. To contribute good material to it - and there is a lot of wonderful material there - is to make an article look reassuringly reliable. That appearance of reliability then makes the article prime prey for anybody with an ideological or even commercial agenda to push: one little edit, and their propaganda is integrated into the same text, gaining credibility from its context, and getting copied over and over and over. Nevertheless, the insistent niggling itch of knowing that "someone is wrong on the internet" eventually got to me, and last month I ended up massively expanding the article Siwi language - including a fairly extensive section on Siwi oral literature. Suggestions or comments are welcome, although I make no promises.

Thursday, October 12, 2017

Shoes in Songhay and West Chadic: towards an etymology

The proto-Songhay word for "(pair of) shoes, sandals" is *tàgmú (Zarma tà:mú, Kandi tà:mú, Gao taam-i, Hombori tà:mí, Kikara tă:m, Djenne taam, Tadaksahak taɣmú, Korandje tsaɣmmu). It is evidently related to a less widely attested verb *tàgmá "step on" (Zarma tà:mú, Gao taama, Hombori tà:mà, Djenne taam). (Velar stop codas are lost in all of Songhay except the Northern branch, leaving behind either compensatory lengthening or a w; see Souag 2012.)
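As a toy illustration of the coda rule just mentioned, here is a minimal Python sketch. It models only the compensatory-lengthening outcome (not the w outcome), and it makes a crude assumption that any k or g not followed by a vowel is a coda. The function name `lose_velar_coda` and the vowel inventory are invented for the example; ":" marks length, as in the transcriptions above.

```python
import re
import unicodedata

# Plain and tone-marked vowels used in the forms cited here (an assumed,
# deliberately small inventory).
VOWELS = "aeiouàáèéìíòóùú"

# A velar stop counts as a coda if it is NOT followed by a vowel,
# ie it stands before a consonant or at the end of the word.
VELAR_CODA = re.compile(rf"[kg](?![{VOWELS}])")


def lose_velar_coda(form):
    """Drop a velar stop coda, lengthening the preceding vowel (written ':')."""
    form = unicodedata.normalize("NFC", form)  # keep tone-marked vowels as single characters
    return VELAR_CODA.sub(":", form)


print(lose_velar_coda("tàgmú"))  # proto-Songhay "shoes"
print(lose_velar_coda("tàk"))    # a bare verb root ending in a velar stop
```

Applied to *tàgmú, the rule gives a form matching the Zarma and Kandi reflexes cited above. It says nothing about the final vowels or tones, which vary from one variety to another.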

In Hausa, the word for "shoe, boot, sandal" is tà:kàlmí: (borrowed directly into the Songhay (Dendi) variety of Djougou as tàkăm). Within Hausa, this likewise corresponds to a verb tá:kà: "step on". The two-way similarity is striking, but if there was borrowing, which way did it go? A cognate set in Schuh (2008) casts some light on the question.

Hausa belongs to the West Chadic family, in which the best comparison to Hausa "shoe" seems to be Bole tàkà(:), with no obvious cognates within its own subgroup, Bole-Tangale. (Ngamo tà:hò looks similar, but Ngamo h seems normally to correspond to Bole p, not k.) For "step on", however, Schuh points to a potential cognate set in a slightly more distantly related West Chadic subgroup, Bade. In this subgroup, we have Gashua Bade tà:gɗú, Western Bade tàgɗú, Ngizim tàkɗú, which Schuh analyses as *tàk- plus an unproductive verbal extension -ɗu supported by Bade-internal evidence, eg tə̀nkùku "press" vs. tə̀nkwàkùɗu "massage". Within Bole-Tangale, one might speculate that Gera tàndə̀- is cognate, but Gera seems to be known only from short wordlists, so that would be difficult to show.

So the comparative evidence provides some support for the idea that Hausa tá:kà: "step on" goes back to proto-West Chadic. If tà:kàlmí: "shoe" could be regularly derived from this verb within Chadic, then the answer would appear clear: Songhay borrowed it from Chadic. However, while Hausa frequently forms deverbal nouns with a suffix -i: (Newman 2000:157), there seems to be no plausible language-internal explanation for the -lm-. In Songhay, on the other hand, a suffix -mi forming nouns from verbs (sometimes -m-ey with a former plural suffix stuck on) is reasonably well-attested: Gao (Heath 1999:97) dey "buy" vs. dey-mi "purchase (n.)", key "weave" vs. key-mi "weaving", Kikara (Heath 2005:97-98) kà:rù "go up" vs. kàr-mɛ̂y "going up", húná "live" vs. hùnà-mɛ̀y "long life". A shift *-mi to *-mu seems natural enough, especially since a few Songhay varieties actually have reflexes of "shoe" with a final -i in any case; so the Songhay form looks kind of like it could be **tàg "step on" plus deverbal -mí. To top it off, deverbal noun-forming suffixes in -r- are widely attested in Songhay, and Zarma attests a combined suffix -àr-mì: zànjì "break" vs. zànjàrmì "shard", bágú "break" vs. bàgàrmì "piece of debris" (Tersis 1981:244). If we treat the Hausa form as a borrowing from Songhay, we can then analyse it as **tàg "step on" plus deverbal -àr-mí. But before we get carried away, we should note that within Songhay there's no motivation for analysing the -mu / -mi in "shoe" as a suffix; the verb and the noun differ (if at all) only in the final vowel.

So what to make of all this? So far, the scenario that suggests itself is something like the following:

  1. Songhay borrows a verb *tàk "step on" from West Chadic (or vice versa?).
  2. Songhay internally forms a deverbal noun *tàk-mí "shoe" (there is no reconstructible contrast between *k and *g in coda position in proto-Songhay), alongside a variant *tàk-àr-mí.
  3. Hausa borrows this as tà:kàlmí:.
  4. Songhay replaces *tàk with a denominal verb formed from "shoe" (which becomes internally unanalysable): *tàgm-á. This step has possible internal motivations: in most of Songhay, final velar stops disappeared leaving behind only compensatory lengthening on the preceding vowel, and the resulting form tà: would have been homophonous with the much commoner verb "receive, take".
  5. Djougou Dendi, a heavily Hausa-influenced, somewhat creolized Songhay variety spoken in Benin, borrows the Hausa form as tàkăm.

Further Chadic comparative data may yet turn out to bear upon this etymology, but one thing seems clear: these two families have been affecting each other for a long time.

Friday, September 15, 2017

Berber and not so Berber words in Tunisian Arabic

Not too long ago I finished reading Lotfi Sayahi's Diglossia and Language Contact: Language Variation and Change in North Africa. The book is a valuable contribution to the study of synchronic language contact between Tunisian Arabic, Standard Arabic, and French in Tunisia, with some coverage of the rest of the region as well. Unfortunately, when it briefly looks at Berber lexical influence on Arabic (pp. 135, 187), reflecting joint work with Zouhir Gabsi, its conclusions are rather over-hasty. Since this book is likely to become a standard point of departure for English speakers studying language contact in North Africa, I think it's worth correcting the record here even at the risk of being pedantic:
  • fakru:n "turtle" and ferzazzu "wasp" really are Berber, though the -u:n suffix in the former was first added in dialectal Arabic (almost all Berber varieties have forms similar to Kabyle ifker/ikfer).
  • garžu:ma "throat" is a very difficult word to etymologize, but may ultimately be Berber (compare Tuareg a-gurzăy), although it does bring to mind Romance forms such as French gorge.
  • karmu:s "fig" is clearly derived from karm-a "fig tree", which is definitely not Berber, and seems to come from a narrowing of the meaning of Classical Arabic كرم karm "orchard" (see the brief discussion in Behnstedt & Woidich 2011:491). The suffix -u:s might theoretically be Berber, I suppose, but probably not; it's not widely attested across Berber, and it fits well with the widespread dialectal Arabic pattern of augmentatives in -u:-.
  • sebsi: "pipe" is from Turkish sipsi.
  • bu-telli:s "monster/nightmare" ("sleep paralysis", to be precise) is a compound involving bu- "possessor of" (originally "father of") plus telli:s (a kind of rug). The latter is well-attested within Arabic in the Middle East as well as in North Africa; its etymology is controversial, but it may derive from Latin trilicium "triple-twilled fabric".
  • ḍabbu:ṭ "axilla" (ie "armpit") is evidently an expressive formation from Arabic إبط 'ibṭ. The widespread Berber word for this is rather taddeɣt (from which we get Maghrebi Arabic dəɣdəɣ "tickle").
  • dagdag "to shatter" is a reduplicated form from Arabic دقّ daqqa "pulverize".

I don't have the time to check the rest of the reduplicated verbs he cites (tartar "to mutter", dardar "to muddy", maxmax "to nibble", maṣmaṣ "to rinse", sɛksɛk "to flow", tɛftɛf "to graze", and wɛdwɛd "to talk nonsense"), but maxmax and maṣmaṣ include phonemes with no regular proto-Berber sources, and I doubt any of them is really Berber in origin.

I don't mean to pick on the authors; notwithstanding this brief lapse, it's a good book, and worth reading. But I do want to hammer home to every linguist the message that etymology needs to be done properly. If you want to do etymology in a North African dialect, don't just assume that any word you don't recognize from Modern Standard Arabic or French is a Berber loanword; check other regional languages (especially Turkish), check existing publications on the subject, check the distribution of the word across different Berber and Arabic varieties. Etymology may not be a very trendy subject, but that doesn't mean it's easy.

Monday, August 28, 2017

Street math and diglossia

In "Mathematics in the streets and in schools" (Carraher et al. 1985), child street vendors were given a paper and pencil and asked to calculate multiplications that they had, in fact, already done in their heads in the course of selling their wares. The results were often sobering, as in the following case:
Informal test
Customer: OK, I'll take three coconuts (at the price of Cr$ 40.00 each). How much is that?
Child: (Without gestures, calculates out loud) 40, 80, 120.

Formal test
Child solves the item 40 x 3 and obtains 70. She then explains the procedure 'Lower the zero; 4 and 3 is 7'.

As you can see, the children were perfectly capable of doing (some!) multiplication their own way, but when faced with school-style problems, this ability frequently deserted them. Confronted with a piece of paper, they attempted to apply the algorithm they had learned at school, without so much as checking their answers against the algorithm they had mastered as part of their daily life. In daily life, conversely, they presumably weren't getting much out of the multiplication algorithm they had learnt at school, even though it would let them tackle a much wider range of multiplication problems. School-learning that stays at school, and never affects real life despite having an obvious potential to be useful there: it's an educator's nightmare.

What this immediately reminded me of is diglossia. In a schoolroom or an essay, you obediently attempt to use Standard Arabic, and all the grammatical rules and vocabulary you learned for it. Almost anywhere else, you carefully avoid it, even while claiming to accept that Standard Arabic is correct and that what you actually make very sure to speak is wrong. To me, that seems to send a fundamentally problematic message: that what you learn in school is not supposed to be useful outside of some limited institutional contexts. I hope that's not the message most people get from it, but it would be great to know for sure. I don't suppose anyone knows of a study addressing the question?

Thursday, August 24, 2017

*-min-: an Algonquian morpheme that went global

American English was born in the clearing of the eastern woodlands, where British settlers encountered native Americans mostly speaking Algonquian languages. The same is true, mutatis mutandis, of Canadian French. If either language can be said to have a native American substratum at all, it's Algonquian. This substratum is hardly conspicuous, manifesting itself almost exclusively in loanwords. If the Algonquian languages had vanished without record, as most of the pre-Indo-European languages of Europe did, could anything at all be said about their morphology on the basis of this influence?

It turns out that there's at least one bound morpheme that shows up in quite a few loanwords: *-min- "berry, fruit". But it manifests itself more clearly in French than in English, where it has been obscured by a number of irregular developments.

Today, French barely survives in the upper Midwest; but before Jefferson's purchase of the Louisiana Territory, France claimed the whole of this vast area, and attempted to back up its ambitions with a handful of missionaries and settlers. There, up among the Illinois near Peoria, French speakers encountered two quite unfamiliar fruits, and adopted their names from the Myaamia-Illinois language:

English missed the chance to borrow a local term for the pawpaw - the English word derives from papaya, a fruit originating much further south - but adopted a reflex of the same word for "persimmon", along with several other terms containing this morpheme. Unfortunately, most are fairly obscure (although no more so than "asimine"), and no two show the same form of the morpheme:
  • persimmon; cf. Virginia Algonquian putchamins (Smith), pushenims (Strachey), apparently reconstructed by Siebert as pessi:min (cf. Skeat 1908; although that looks rather implausible given the Illinois form).
  • hominy (because it's made from corn); cf. Virginia Algonquian ustatahamen (Smith), vshvccohomen (Strachey) and other forms.
  • chinquapin (a kind of chestnut); cf. Virginia Algonquian chechinquamins (Smith), checinqwamins (Strachey).
  • saskatoon (a berry); cf. Cree misâskwatômin ᒥᓵᐢᑲᐧᑑᒥᐣ.
  • pembina (a kind of cranberry); cf. Cree nîpiniminân ᓃᐱᓂᒥᓈᐣ.
The prospects are not that encouraging, but combining the English and French evidence, an alert etymologist just might be able to spot the *-min- morpheme, and hence guess that Algonquian had head-final compounds. Thankfully, in North America, such hyper-speculative substrate chasing is hardly necessary; Algonquian is a fairly well-documented family. In other parts of the world, though, such approaches may occasionally prove effective.

Tuesday, August 22, 2017

What's wrong with the obvious analysis of waš bih واش بيه?

In the Algerian Arabic dialect I grew up speaking, "what's wrong with him?" is waš bi-h? واش بيه. (Further west, in Oran and in Morocco, it's the more classical sounding ma-leh? ما له.) When the object is a pronoun, as it usually is, waš bi-h? can readily be understood as waš "what?" and bi-, the form of "with" (otherwise b) used before pronominal suffixes (in this case, -h "him"). But substitute a noun, and this historically correct interpretation becomes synchronically untenable: we say waš bi jedd-ek? "what's wrong with you (lit. your grandfather)?" واش بي جدّك, whereas "with your grandfather" would be b-jedd-ek بجدّك. Nor can we cleft it with the relative/focus marker lli اللي: *waš lli bi jedd-ek? (*"what is it that's wrong with you?") is totally ungrammatical, while *waš lli b-jedd-ek? does not have the appropriate meaning (in fact, out of context, it makes no sense at all). This tells us that, whatever its origins, waš bi- can no longer be analysed as "what?" plus a preposition "with"; it has to be treated as a morphosyntactic unit in its own right. In particular, this bi- cannot be used to form an adverbial - it only forms a predicate - so it can hardly be treated as a preposition. Nevertheless, it continues to take the prepositional pronominal suffixes: "what's wrong with me?" is waš bi-yya? واش بيَّ, not *waš bi-ni.

The independent unity of waš bi-? becomes a lot clearer when the construction is borrowed into another language, as has happened in the Berber variety of Tamezret in southern Tunisia. The stories recorded there by Hans Stumme shortly before 1900 are a bit hard to read, but provide probably the single most extensive published corpus of material in Tunisian Berber. These texts furnish many examples of aš bi-, although Tamezret Berber has neither aš to mean "what?" (that would be matta) nor bi- to mean "with" (that would be s). Many of these look just like Arabic: aš bi-k "what's wrong with you? (m.)" (p. 14, l. 11); aš bi-kum "what's wrong with you (pl.)?" (p. 27, l. 26); aš bi-h "what's wrong with him?" (p. 14, l. 3); and even, with a noun, aš bi iryazen "what's wrong with men?" (p. 41, l. 5). But the similarity is somewhat deceptive; in some cases, this construction takes Berber rather than Arabic pronominal suffixes, as illustrated by aš bi-ṯ "what's wrong with her?" (p. 25, l. 21) instead of Arabic aš bi-ha, and by aš bi-m "what's wrong with you (f.)?" (p. 10, l. 5). Unfortunately, the texts do not provide a complete paradigm - further documentation is needed! But judging by the available data, all cells but 3m.sg. match well with the Berber paradigm:

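|        | Algerian Arabic | Tamezret  | Tamezret, direct objects | Tamezret, objects of prepositions |
|--------|-----------------|-----------|--------------------------|-----------------------------------|
| 2m.sg. | waš bi-k        | aš bi-k   | -ak                      | -k                                |
| 2f.sg. | waš bi-k        | aš bi-m   | -am                      | -m                                |
| 2m.pl. | waš bi-kum      | aš bi-kum | -akum / -awem            | -kum                              |
| 3m.sg. | waš bi-h        | aš bi-h   | -ṯ                       | -s                                |
| 3f.sg. | waš bi-ha       | aš bi-ṯ   | -ṯ                       | -s                                |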

The 2m.sg. and 2m.pl. suffixes are quasi-identical between Tamezret Berber and Arabic, facilitating the borrowing; for the second person, neither language clearly distinguishes direct object forms from objects of prepositions. The third person, however, distinguishes the two in Berber but not in Arabic, and 3f.sg. suggests that the object in this construction is treated as a direct object, not as the object of a preposition, contrary to the situation seen for Arabic. This fits Berber-internal patterns; throughout Berber, nonverbal predicators (Aikhenvald's "semi-verbs") typically take the direct object pronominal paradigm, and assign absolutive case to their arguments. The perfect agreement of the most frequently used cells in this paradigm between Arabic and Berber surely facilitated the borrowing of this item, but within Berber the paradigm got rebuilt on a largely Berber basis. In morphology, etymology is not destiny!

Saturday, July 22, 2017

Can slur avoidance be taken too far?

I was rather flabbergasted to read an otherwise good post on Language Log seriously suggesting that racial slurs are so painful they should be coyly asterisked out even in careful lexicographical explanations of why they should not be used. I do not pretend to any expertise on the impact of the specific slur in question there - I'd prefer to hear more black linguists' comments on that - but much of the argument they make is general, not specific:
If you take the standard linguistic analysis of slurs, though, the word’s power does not come from mere taboo [...] The word literally has as part of its semantic content an expression of racial hate, and its history has made that content unavoidably salient. It is that content, and that history, that gives this word (and other slurs) its power over and above other taboo expressions. It is for this reason that the word is literally unutterable for many people, and why we (who are white [...]) avoid it here.

Yes, even here on Language Log. There seems to be an unfortunate attitude — even among those whose views on slurs are otherwise similar to our own — that we as linguists are somehow exceptions to the facts surrounding slurs discussed in this post. In Geoffrey Nunberg’s otherwise commendable post on July 13, for example, he continues to mention the slur (quite abundantly), despite acknowledging the hurt it can cause. We think this is a mistake. We are not special; our community includes members of oppressed groups (though not nearly enough of them), and the rest of us ought to respect and show courtesy to them.

Anglo culture has a long tradition of scrupulously avoiding certain words in order to respect and show courtesy towards, in particular, women and children - people who were thought of as weaker and more emotional than adult men, and in need of their protection. Politeness is great, but if you treat people like they're made of glass, you're not only patronizing them, you're excluding them - you're implying that there are some discussions they just can't handle. (The term "white knight" comes to mind.)

This is ironic in general - people who have made it through serious oppression tend to be pretty tough, though everyone has their vulnerabilities. It's doubly ironic within an academic context, in that a core academic skill is the ability to confront and (if necessary) rebut personally threatening arguments without getting carried away by one's immediate reactions. In order to master North African historical linguistics, I've had to read works by colonial generals and OAS terrorists who fought and killed to subjugate my ancestors, and whose attitudes often colour their work; most people working on marginalized languages will have had similar experiences. If I can deal with that, do you really expect me to be incapacitated by some professor's cautious mention of, say, the word "raghead"? Words certainly can hurt, but slurs have enough power as they stand without adding the power of absolute taboo on top.