Tuesday, May 29, 2018

Zenaga dialectal reflexes of ʔ, :

For the purposes of Berber historical linguistics, arguably the most important thing about Zenaga is its thoroughgoing retention of the glottal stop. Some Zenaga glottal stops derive from *q, corresponding to ɣ elsewhere in Berber, but many derive from *ʔ, lost without trace in most Berber varieties. When a rather carefully transcribed new source of dialectal Zenaga data comes to light, it thus seems logical to start by seeing how the glottal stop is reflected there. For convenience, I restrict this first pass to two of Ahmadou Ismail's wordlists: body parts, and herding vocabulary. The results are fairly clear.

In general, Taine-Cheikh's Vʔ corresponds regularly to Ismail's V:, with the length clearly marked, as distinct from Taine-Cheikh's short V, which Ismail consistently transcribes short. Thus:

Gloss Ismail Taine-Cheikh
young camel awāra äwaʔräh
waterbag āga äʔgäh
moustache āya aʔyäh
donkey m. ājji aʔž(ž)iy
donkey f. tājil taʔž(ž)əL
beard tāmmart taʔmmärt
camels īyman iʔymän
cows tiššīđan ətšiʔđaʔn / ətšiʔđän
lamb hīmmar iẕ̌iʔmär
donkey foal īgiyu iʔgiyi
shoulder(blade) tūṛiḍ toʔṛuḌ
donkeys ūjjayan uʔž(ž)äyän
shoulder(blade)s tūrdin tuʔṛäđän
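The regularity of this correspondence can be checked mechanically. The following is a minimal sketch, assuming the forms are typed exactly as in the table above and that Ismail's long vowels are always written with a macron; the helper names (`long_vowels`, `glottal_stops`) are illustrative only. It counts glottal stops in each Taine-Cheikh form and macron vowels in the matching Ismail form, and lists the rows where the counts differ.

```python
import unicodedata

# (gloss, Ismail, Taine-Cheikh) triples copied from the table above.
# The two Taine-Cheikh variants for "cows" are entered as separate rows.
PAIRS = [
    ("young camel", "awāra", "äwaʔräh"),
    ("waterbag", "āga", "äʔgäh"),
    ("moustache", "āya", "aʔyäh"),
    ("donkey m.", "ājji", "aʔž(ž)iy"),
    ("donkey f.", "tājil", "taʔž(ž)əL"),
    ("beard", "tāmmart", "taʔmmärt"),
    ("camels", "īyman", "iʔymän"),
    ("cows", "tiššīđan", "ətšiʔđaʔn"),
    ("cows", "tiššīđan", "ətšiʔđän"),
    ("lamb", "hīmmar", "iẕ̌iʔmär"),
    ("donkey foal", "īgiyu", "iʔgiyi"),
    ("shoulder(blade)", "tūṛiḍ", "toʔṛuḌ"),
    ("donkeys", "ūjjayan", "uʔž(ž)äyän"),
    ("shoulder(blade)s", "tūrdin", "tuʔṛäđän"),
]

MACRON = "\u0304"  # combining macron, exposed by NFD decomposition


def long_vowels(form):
    """Count macron-marked (long) vowels in a transcribed form."""
    return unicodedata.normalize("NFD", form).count(MACRON)


def glottal_stops(form):
    """Count glottal stops in a transcribed form."""
    return form.count("ʔ")


# Rows where the number of long vowels differs from the number of glottal stops.
mismatches = [
    (gloss, ismail, tc)
    for gloss, ismail, tc in PAIRS
    if long_vowels(ismail) != glottal_stops(tc)
]

for row in mismatches:
    print("count mismatch:", row)
```

With the forms as typed here, the only row flagged should be the "cows" variant with two glottal stops, whose second glottal stop stands before a word-final consonant.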

There are only two contexts where this correspondence does not hold.  In the context / _C#, if C is a stop or fricative, Ismail retains the glottal stop; if C is a sonorant, it disappears without affecting vowel length.  (More examples of this context would be useful to confirm the exact conditioning.)

spring taniʔđ täniʔḏ
cow taššiʔđ täšši
head iʔf iʔf
camel ayyim äyiʔm
camel f. tayyimt täyi(ʔ)mt

Word-finally, the variety Taine-Cheikh describes has no overtly realised glottal stops (*ʔ > Ø / _#); the contrast, however, is maintained, since all originally vowel-final words now end in h (*V > Vh / _#). In Ismail's dialect, the latter change never happened:

waterbag āga äʔgäh
moustache āya aʔyäh
young camel awāra äwaʔräh
stomach taxṣa taḫs(s)äh
goat tikši təkših
ewe tīyyi tīyih

Nevertheless, the two classes have not completely merged; final *i remains i, but final *iʔ becomes u:

billy-goat ahayu äẕ̌äyi
mouth immu əmmi
tooth awkšu äwkši
tongue itšu ətši
donkey foal īgiyu iʔgiyi
calf īrku īrki

In the variety Taine-Cheikh describes, long vowels derive not from *Vʔ but from *Vh (ultimately *Vβ). Given that vowel length can be a reflex of a former glottal stop in Ismail's dialect, the next thing we need to check is what happens to *Vh there; it turns out that there too it yields long vowels:

small cattle tākšin tākšən
calf īrku īrki
ewe tīyyi tīyih
nostril tīnhart tīnẕ̌ärt
nose tīnharin tīnẕ̌ärän

The regularity of these correspondences is a testimony to the accuracy of both parties' work, and confirms the value of Zenaga as a data source for Berber historical phonology.

Monday, May 28, 2018

A "crazy rule" in Zenaga

As part of what seems to be a solo documentation effort, Ahmadou Ismail has been posting some very interesting tidbits on Zenaga (in Arabic). The dialect reflected differs in some ways from the one reflected in Catherine Taine-Cheikh's publications. One of the more conspicuous differences is in the fate of proto-Berber *z. For Taine-Cheikh, *z > ẕ̌ in general (a slightly lowered ž), but *zt > Z (a tautosyllabic geminate zz). In Ahmadou Ismail's dialect, *zt > zz as with Taine-Cheikh, but otherwise *z > h, eg tihigrarin "tarawih prayers" vs. Taine-Cheikh's təẕ̌əgrärən, hīmmar "lamb" vs. Taine-Cheikh's iẕ̌iʔmär, awahiđ̣ "rooster" vs. äwäẕ̌uđ̣, yahinha "he sold" vs. yäžžənẕ̌äh. This leads to systematic alternations between h and zz; synchronically, Ismail's dialect of Zenaga has the "crazy rule" ht > zz. This is nicely illustrated by "he knew" (Taine-Cheikh: yuʔgäẕ̌) plus the direct object personal pronoun clitics:
  • "he knew me": yūgah-i
  • "he knew you m.": yūgah-ku
  • "he knew you f.": yūgah-kam
  • "he knew him": yūgaz-zu
  • "he knew her": yūgaz-zað
  • "he knew us": yūgah-ānag
  • "he knew you m.pl.": yūgah-kūn
  • "he knew you f.pl.": yūgah-kimmið
  • "he knew them m.": yūgaz-zin
  • "he knew them f.": yūgaz-zincað (maybe; not quite sure how چَّٰ is supposed to be read)
For forms without assimilation, compare, as posted by someone else on the same group (Omar Sidi Mohamed), "he was owned by" (Taine-Cheikh yənšäg):
  • "he was owned by me": yiššag-i
  • "he was owned by you m.": yiššak-ku
  • "he was owned by you f.": yiššak-kam
  • "he was owned by him": yiššak-tu
  • "he was owned by her": yiššak-tað
  • "he was owned by us": yiššag-ānag
  • "he was owned by you m.pl.": yiššak-kūn
  • "he was owned by you f.pl.": yiššak-kamað
  • "he was owned by them m.": yiššak-tan
  • "he was owned by them f.": yiššak-tinyað
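The synchronic rule ht > zz can be written down as a tiny string rewrite. This is a hedged sketch, not an analysis of the full paradigm: it assumes that the underlying t-initial clitics are those seen in the yiššak- forms (-tu, -tað), that the same clitics apply in Ismail's dialect, and it deliberately ignores the g > k devoicing visible in the second paradigm. The function name `attach` and the `CLITICS` table are illustrative.

```python
# Object clitics as they surface after a stem that triggers no h-assimilation.
# Only the singular forms attested in both paradigms above are included.
CLITICS = {
    "me": "i",
    "you m.": "ku",
    "you f.": "kam",
    "him": "tu",
    "her": "tað",
}


def attach(stem, clitic):
    """Join stem and clitic with a hyphen, applying ht > zz at the boundary."""
    if stem.endswith("h") and clitic.startswith("t"):
        # h + t merge into geminate zz, written across the hyphen as z-z
        return stem[:-1] + "z-z" + clitic[1:]
    return stem + "-" + clitic


for gloss, clitic in CLITICS.items():
    print("he knew " + gloss + ":", attach("yūgah", clitic))
```

Under these assumptions the sketch reproduces the five singular forms of "he knew" listed above; the plural clitics are left out because the two paradigms disagree on their shape.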

Tuesday, May 22, 2018


Ever since she got interviewed on TV ten days ago, the 19-year-old president of the student union at Université Paris-Sorbonne, Maryam Pougetoux, has been making headlines - not for anything she said, but simply for wearing a hijab while she said it. In the name of defending freedom and feminism, the Minister of the Interior himself had the gall to criticise this brave young Frenchwoman as "marking her difference from French society". But as a historical linguist watching all this, I found myself wondering: where does the name "Pougetoux" come from? It turns out it can be traced several thousand years back:

In the course of this long history, no fewer than three different diminutive suffixes have accreted onto the original root (although I'm not quite sure about the identity of that -oux). I wonder whether that generalizes: do words meaning "hill" tend to accrete more and more diminutive suffixes as they develop over time?

Tuesday, May 08, 2018

Songhay viewed through PCA

Playing around a bit more with PCA, I decided to apply the method* to a dataset I've worked with more extensively: Songhay, a compact language family spoken mainly in Niger and Mali. On a hundred-word list (Swadesh with a few changes), randomly choosing one form in cases of synonymy and including borrowings, I get the following table of lexical cognate percentages:

Tabelbala Tadaksahak Tagdal In-Gall Timbuktu Djenne Kikara Hombori Zarma Djougou
Tabelbala 1 0.678 0.67 0.687 0.636 0.667 0.625 0.622 0.616 0.602
Tadaksahak 0.678 1 0.857 0.8 0.63 0.635 0.567 0.576 0.58 0.586
Tagdal 0.67 0.857 1 0.857 0.632 0.649 0.579 0.588 0.582 0.588
In-Gall 0.687 0.8 0.857 1 0.65 0.667 0.598 0.606 0.6 0.606
Timbuktu 0.636 0.63 0.632 0.65 1 0.979 0.773 0.808 0.79 0.778
Djenne 0.667 0.635 0.649 0.667 0.979 1 0.753 0.789 0.771 0.768
Kikara 0.625 0.567 0.579 0.598 0.773 0.753 1 0.835 0.814 0.823
Hombori 0.622 0.576 0.588 0.606 0.808 0.789 0.835 1 0.838 0.867
Zarma 0.616 0.58 0.582 0.6 0.79 0.771 0.814 0.838 1 0.808
Djougou 0.602 0.586 0.588 0.606 0.778 0.768 0.823 0.867 0.808 1

Running this through R again to get its eigenvectors, the first two principal components are easily interpretable:
  • PC1 (eigenvalue=7.3) separates Songhay into three low-level subgroups - Western, Eastern, and Northern, in that order - with an obvious longitude effect: it traces a line eastward all the way down the Niger river, jumps further east to In-Gall, and then proceeds back westward through the Sahara.
  • PC2 (eigenvalue=1.1) measures the level of Berber/Tuareg influence.
All the other eigenvectors have eigenvalues lower than 0.4, and are thus much less significant.
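For readers who want to reproduce the leading components without R, here is a minimal sketch in plain Python. It is not the code used for the post: it swaps in hand-rolled power iteration with deflation (so that no third-party package is needed), and the function names and starting vector are arbitrary choices. The matrix is copied from the table above.

```python
import math

NAMES = ["Tabelbala", "Tadaksahak", "Tagdal", "In-Gall", "Timbuktu",
         "Djenne", "Kikara", "Hombori", "Zarma", "Djougou"]

# Lexical cognate proportions, copied from the table above.
SIM = [
    [1, 0.678, 0.67, 0.687, 0.636, 0.667, 0.625, 0.622, 0.616, 0.602],
    [0.678, 1, 0.857, 0.8, 0.63, 0.635, 0.567, 0.576, 0.58, 0.586],
    [0.67, 0.857, 1, 0.857, 0.632, 0.649, 0.579, 0.588, 0.582, 0.588],
    [0.687, 0.8, 0.857, 1, 0.65, 0.667, 0.598, 0.606, 0.6, 0.606],
    [0.636, 0.63, 0.632, 0.65, 1, 0.979, 0.773, 0.808, 0.79, 0.778],
    [0.667, 0.635, 0.649, 0.667, 0.979, 1, 0.753, 0.789, 0.771, 0.768],
    [0.625, 0.567, 0.579, 0.598, 0.773, 0.753, 1, 0.835, 0.814, 0.823],
    [0.622, 0.576, 0.588, 0.606, 0.808, 0.789, 0.835, 1, 0.838, 0.867],
    [0.616, 0.58, 0.582, 0.6, 0.79, 0.771, 0.814, 0.838, 1, 0.808],
    [0.602, 0.586, 0.588, 0.606, 0.778, 0.768, 0.823, 0.867, 0.808, 1],
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def power_iteration(m, iters=500):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair of m."""
    v = [1.0 + 0.1 * i for i in range(len(m))]  # arbitrary starting vector
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return dot(v, matvec(m, v)), v  # Rayleigh quotient of the unit vector


def deflate(m, eigenvalue, v):
    """Remove a known eigenpair from a symmetric matrix."""
    n = len(m)
    return [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
            for i in range(n)]


ev1, pc1 = power_iteration(SIM)
ev2, pc2 = power_iteration(deflate(SIM, ev1, pc1))

print("eigenvalue 1:", round(ev1, 2), " eigenvalue 2:", round(ev2, 2))
# List the varieties in order of their PC1 loading.
for name, a, b in sorted(zip(NAMES, pc1, pc2), key=lambda r: r[1]):
    print(name, round(a, 3), round(b, 3))
```

The printed eigenvalues can be compared with the figures quoted above. The sign of each eigenvector is arbitrary, so only the relative ordering and grouping of the loadings is meaningful.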

The resulting cluster patterns have a strikingly shallow time depth; as in the Arabic example in my last post, this method's results correspond well to criteria of synchronic mutual intelligibility (Western Songhay is much easier for Eastern Songhay speakers to understand than Northern is), but the method completely fails to pick up on the deeper historical tie between Northern Songhay and Western Songhay (they demonstrably form a subgroup as against Eastern). It's nice how the strongest contact influence shows up as a PC, though; it would be worth exploring how good this method is at identifying contact more generally.

* Strictly speaking, this may not quite count as PCA - I'm starting from a similarity matrix generated non-numerically, rather than turning the lexical data into binary numeric data and letting that produce a similarity matrix.

Update, following Whygh's comment below: here's what SplitsTree gives based on the same table:

Monday, May 07, 2018

Some notes on PCA

(Exploratory notes, written to be readable to linguists but posted in the hope of feedback from geneticists and/or statisticians - in my previous incarnation as a mathmo, I was much more interested in pure than applied....)

Given the popularity of Principal Component Analysis (PCA) in population genetics, it's worth a historical linguist's while to have some idea of how it works and how it's applied there. This popularity might also suggest at first glance that the method has potential for historical linguistics; that possibility may be worth exploring, but it seems more promising as a tool for investigating synchronic language similarity.

Before we can do PCA, of course, we need a data set. Usually, though not always, population geneticists use SNPs - single nucleotide polymorphisms. The genome can be understood as a long "text" in a four-letter "alphabet"; a SNP is a position in that text where the letter used varies between copies of the text (ie between individuals). For each of m individuals, then, you check the value of each of a large number n of selected SNPs. That gives you an m by n data matrix of "letters", with one row per individual and one column per SNP. You then need to turn this from letters into numbers you can work with. As far as I understand, the way they do that (rather wasteful, but geneticists have such huge datasets they hardly care) is to pick a standard value for each SNP, and replace each letter with 1 if it's identical to that value, and 0 if it isn't. For technical convenience, they sometimes then "normalize" this: for each cell, subtract the mean value of its (SNP) column (so that the column mean ends up as 0), then rescale so that each column has the same variance.
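As a concrete illustration of the encoding step, here is a minimal sketch with individuals as rows and SNPs as columns (the m by n layout described above). The four five-letter "genomes" and the reference string are invented toy data, and the function names are hypothetical; real pipelines work with diploid genotypes and far larger matrices.

```python
import math

# Hypothetical toy data: 4 individuals (rows) x 5 SNP positions (columns).
GENOTYPES = ["ACGTA", "ACGTT", "TCGAA", "TGCAA"]
STANDARD = "ACGTA"  # hypothetical "standard" letter for each SNP


def encode(genotypes, standard):
    """Replace each letter with 1 if it matches the standard, else 0."""
    return [[1 if letter == ref else 0 for letter, ref in zip(row, standard)]
            for row in genotypes]


def normalize(matrix):
    """Centre each SNP column on 0 and rescale it to unit variance."""
    m, n = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / m for j in range(n)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in matrix) / m)
           for j in range(n)]
    # Every column varies here (that is what makes it a SNP), so no sd is 0.
    return [[(row[j] - means[j]) / sds[j] for j in range(n)] for row in matrix]


binary = encode(GENOTYPES, STANDARD)
scaled = normalize(binary)

for row in binary:
    print(row)
```

The sketch uses the population variance (dividing by m); dividing by m - 1 instead would only change the scale, not the resulting pattern.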

Using this data matrix, you then create a covariance matrix by multiplying the data matrix by its own transpose, divided by the number of markers: in the resulting table, each cell gives a measure of the relationship between a pair of individuals. Assuming simple 0/1 values as described above, each cell will in fact give the proportion of SNPs for which the two individuals both have the same value as the chosen standard. Within linguistics, lexicostatistics offers fairly comparable tables; there, the equivalent of SNPs is lexical items on the Swadesh list, but rather than "same value as the standard", the criterion is "cognate to each other" (or, in less reputable cases, "vaguely similar-looking").
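To make the "proportion of shared standard values" reading concrete, here is a small sketch on an invented 0/1 matrix (4 individuals by 5 SNPs; the data and the function name are hypothetical). It multiplies the matrix by its own transpose and divides by the number of markers, without the optional normalization step.

```python
# Hypothetical 0/1 data: rows are individuals, columns are SNPs.
BINARY = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
]


def relatedness(matrix):
    """Return (X times X-transpose) divided by the number of markers."""
    n_markers = len(matrix[0])
    return [[sum(a * b for a, b in zip(row_i, row_j)) / n_markers
             for row_j in matrix]
            for row_i in matrix]


R = relatedness(BINARY)
for row in R:
    print([round(x, 2) for x in row])
```

Each off-diagonal cell is the fraction of SNPs at which both individuals carry the standard value. Note that the diagonal is not 1 in general: it is simply the fraction of SNPs at which that individual matches the standard, which is one reason the normalization step is often applied first.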

Now, there is typically a lot of redundancy in the data and hence in the relatedness matrix too: in either case, the value of a given cell is fairly predictable from the value of other cells. (If individuals X and Y are very similar, and X is very similar to Z, then Y will also be very similar to Z.) PCA is a tool for identifying these redundancies by finding the covariance matrix's eigenvectors: effectively, rotating the axes in such a way as to get the data points as close to the axes as possible. Each individual is a data point in a space with as many dimensions as there are SNP measurements; for us 3D creatures, that's very hard to visualise graphically! But by picking just the two or three eigenvectors with the highest eigenvalues - ie, the axes contributing most to the data - you can graphically represent the most important parts of what's going on in just a 2D or 3D plot. If two individuals cluster together in such a plot, then they share a lot of their genome - which, in human genetics, is in itself a reliable indicator of common ancestry, since mammals don't really do horizontal gene transfer. (In linguistics, the situation is rather different: sharing a lot of vocabulary is no guarantee of common ancestry unless that vocabulary is particularly basic.) You then try to interpret that fact in terms of concepts such as geographical isolation, founder events, migration, and admixture - the latter two corresponding very roughly to language contact.

The most striking thing about all this, for me as a linguist, is how much data is getting thrown away at every stage of the process. That makes sense for geneticists, given that the dataset is so much bigger and simpler than what human language offers comparativists: one massive multi-gigabyte cognate per individual, made up of a four-letter universal alphabet! Historical linguists are stuck with a basic lexicon rarely exceeding a few thousand words, none of which need be cognate across a given language pair, and an "alphabet" (read: phonology) differing drastically from language to language - alongside other clues, such as morphology, that don't have any immediately obvious genetic counterpart but again have a comparatively small information content.

Nevertheless, there is one obvious readily available class of linguistic datasets to which one could be tempted to apply PCA, or just eigenvector extraction: lexicostatistical tables. For Semitic, someone with more free time than I have could readily construct one from Militarev 2015, or extract one from the supplemental PDFs (why PDFs?) in Kitchen et al. 2009. Failing that, however, a ready-made lexicostatistical similarity matrix is available for nine Arabic dialects, in Schulte & Seckinger 1985, p. 23/62. Its eigenvectors can easily be found using R; basically, the overwhelmingly dominant PC1 (eigenvalue 8.11) measures longitude, while PC2 (eigenvalue 0.19) sharply separates the sedentary Maghreb from the rest. This tells us two interesting things: within this dataset, Arabic looks overwhelmingly like a classic dialect continuum, with no sharp boundaries; and insofar as it divides up discontinuously at all, it's the sedentary Maghreb varieties that stand out as having taken their own course. The latter point shows up clearly on the graphs: plotting PC2 against PC1, or even PC3, we see a highly divergent Maghreb (and to a lesser extent Yemen) vs. a relatively homogeneous Mashriq. (One might imagine that this reflects a Berber substratum, but that is unlikely here; few if any Berber loans make it onto the 100-word Swadesh list.) All of this corresponds rather well to synchronic criteria of mutual comprehensibility, although a Swadesh list is only a very indirect measure of that. But it doesn't tell us much about historical events, beyond the null hypothesis of continuous contact in rough proportion to distance; about all you need to explain this particular dataset is a map.

(NEW: and with PC3:)