Jabal al-Lughat: genetics

Monday, May 07, 2018

Some notes on PCA

(Exploratory notes, written to be readable to linguists but posted in the hope of feedback from geneticists and/or statisticians - in my previous incarnation as a mathmo, I was much more interested in pure than applied....)

Given the popularity of Principal Component Analysis (PCA) in population genetics, it's worth a historical linguist's while to have some idea of how it works and how it's applied there. This popularity might also suggest at first glance that the method has potential for historical linguistics; that possibility may be worth exploring, but it seems more promising as a tool for investigating synchronic language similarity.

Before we can do PCA, of course, we need a data set. Usually, though not always, population geneticists use SNPs - single nucleotide polymorphisms. The genome can be understood as a long "text" in a four-letter "alphabet"; a SNP is a position in that text where the letter used varies between copies of the text (ie between individuals). For each of m individuals, then, you check the value of each of a large number n of selected SNPs. That gives you an m by n data matrix of "letters", with one row per individual and one column per SNP. You then need to turn this from letters into numbers you can work with. As far as I understand, the way they do that (rather wasteful, but geneticists have such huge datasets they hardly care) is to pick a standard value for each SNP, and replace each letter with 1 if it's identical to that value, and 0 if it isn't. For technical convenience, they sometimes then "normalize" this: for each cell, subtract the mean value of its (SNP) column (so that the column mean ends up as 0), then rescale so that each column has the same variance.
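To make the encoding step concrete, here is a minimal sketch in Python. The three "individuals", their four-letter genotypes, and the choice of reference letters are all invented for illustration; real pipelines work on far larger data and may differ in detail.

```python
# Toy genotype "texts": one string per individual, one letter per SNP.
# The data and the reference letters are invented for illustration.
genotypes = {
    "ind1": "ACGT",
    "ind2": "ACGA",
    "ind3": "TCGA",
}
reference = "ACGT"  # hypothetical "standard" letter for each SNP


def encode(letters, reference):
    """Replace each letter with 1 if it matches the reference, else 0."""
    return [1 if a == r else 0 for a, r in zip(letters, reference)]


def normalize(matrix):
    """Centre each SNP on 0 and rescale it to unit variance.

    Rows are individuals, so each SNP is one column of the matrix.
    """
    m = len(matrix)
    out_columns = []
    for col in zip(*matrix):
        mean = sum(col) / m
        sd = (sum((x - mean) ** 2 for x in col) / m) ** 0.5
        # A SNP on which nobody differs carries no information; leave it at 0.
        out_columns.append([(x - mean) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*out_columns)]


X = [encode(g, reference) for g in genotypes.values()]  # 0/1 data matrix
Z = normalize(X)                                        # normalized version
```

Here X holds the raw 0/1 values and Z the normalized ones; the two middle SNPs, on which all three toy individuals agree, end up as all zeros in Z.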

Using this data matrix, you then create a covariance matrix by multiplying the data matrix by its own transpose, divided by the number of markers: in the resulting table, each cell gives a measure of the relationship between a pair of individuals. Assuming simple 0/1 values as described above, each cell will in fact give the proportion of SNPs for which the two individuals both have the same value as the chosen standard. Within linguistics, lexicostatistics offers fairly comparable tables; there, the equivalent of SNPs is lexical items on the Swadesh list, but rather than "same value as the standard", the criterion is "cognate to each other" (or, in less reputable cases, "vaguely similar-looking").
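As a rough illustration of that multiplication, the sketch below builds the individual-by-individual table from a small invented 0/1 matrix (rows are individuals, columns are SNPs). It works on the raw 0/1 values, without any normalization, so each cell can be read directly as a proportion.

```python
# Toy 0/1 matrix: rows are individuals, columns are SNPs (invented data).
X = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]


def relatedness(X):
    """Multiply X by its own transpose and divide by the number of markers."""
    n = len(X[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in X]
        for row_i in X
    ]


C = relatedness(X)
for row in C:
    print(row)
```

The cell for individuals 1 and 2 comes out as 0.75: they both carry the reference value at three of the four SNPs. The table is symmetric, as any such pairwise table must be.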

Now, there is typically a lot of redundancy in the data and hence in the relatedness matrix too: in either case, the value of a given cell is fairly predictable from the value of other cells. (If individuals X and Y are very similar, and X is very similar to Z, then Y will also be very similar to Z.) PCA is a tool for identifying these redundancies by finding the covariance matrix's eigenvectors: effectively, rotating the axes in such a way as to get the data points as close to the axes as possible. Each individual is a data point in a space with as many dimensions as there are SNP measurements; for us 3D creatures, that's very hard to visualise graphically! But by picking just the two or three eigenvectors with the highest eigenvalues - ie, the axes accounting for most of the variation in the data - you can graphically represent the most important parts of what's going on in just a 2D or 3D plot. If two individuals cluster together in such a plot, then they share a lot of their genome - which, in human genetics, is in itself a reliable indicator of common ancestry, since mammals don't really do horizontal gene transfer. (In linguistics, the situation is rather different: sharing a lot of vocabulary is no guarantee of common ancestry unless that vocabulary is particularly basic.) You then try to interpret that fact in terms of concepts such as geographical isolation, founder events, migration, and admixture - the latter two corresponding very roughly to language contact.
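For readers who want to see the eigenvector step without a statistics package, here is a small sketch using power iteration, which is one simple way (not the one real software normally uses) to find the leading eigenvectors of a symmetric matrix. The 4 by 4 similarity matrix is invented: two pairs of individuals, similar within each pair and dissimilar across pairs.

```python
def top_eigenpair(A, iterations=1000):
    """Power iteration: dominant eigenvalue and unit eigenvector of a
    symmetric matrix with non-negative eigenvalues."""
    n = len(A)
    # Arbitrary non-uniform start; it must not be orthogonal to the target.
    v = [1.0 + 0.1 * i for i in range(n)]
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(Av, v))
    return eigenvalue, v


def deflate(A, eigenvalue, v):
    """Subtract one component so the next run finds the following one."""
    n = len(A)
    return [[A[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
            for i in range(n)]


# Invented similarity matrix: individuals 0 and 1 form one group, 2 and 3 another.
C = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.9],
    [0.1, 0.2, 0.9, 1.0],
]

value1, pc1 = top_eigenpair(C)
value2, pc2 = top_eigenpair(deflate(C, value1, pc1))

# Each individual's coordinates in a 2D plot of PC1 against PC2.
for i, (x, y) in enumerate(zip(pc1, pc2)):
    print(f"individual {i}: PC1 = {x:+.3f}, PC2 = {y:+.3f}")
```

On this toy matrix the two eigenvalues come out close to 2.2 and 1.6. PC1 gives everyone much the same coordinate, while PC2 has one sign for individuals 0 and 1 and the opposite sign for 2 and 3, so the two pairs land in separate clusters on the plot.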

The most striking thing about all this, for me as a linguist, is how much data is getting thrown away at every stage of the process. That makes sense for geneticists, given that the dataset is so much bigger and simpler than what human language offers comparativists: one massive multi-gigabyte cognate per individual, made up of a four-letter universal alphabet! Historical linguists are stuck with a basic lexicon rarely exceeding a few thousand words, none of which need be cognate across a given language pair, and an "alphabet" (read: phonology) differing drastically from language to language - alongside other clues, such as morphology, that don't have any immediately obvious genetic counterpart but again have a comparatively small information content.

Nevertheless, there is one obvious readily available class of linguistic datasets to which one could be tempted to apply PCA, or just eigenvector extraction: lexicostatistical tables. For Semitic, someone with more free time than I have could readily construct one from Militarev 2015, or extract one from the supplemental PDFs (why PDFs?) in Kitchen et al. 2009. Failing that, however, a ready-made lexicostatistical similarity matrix is available for nine Arabic dialects, in Schulte & Seckinger 1985, p. 23/62. Its eigenvectors can easily be found using R; basically, the overwhelmingly dominant PC1 (eigenvalue 8.11) measures longitude, while PC2 (eigenvalue 0.19) sharply separates the sedentary Maghreb from the rest. This tells us two interesting things: within this dataset, Arabic looks overwhelmingly like a classic dialect continuum, with no sharp boundaries; and insofar as it divides up discontinuously at all, it's the sedentary Maghreb varieties that stand out as having taken their own course. The latter point shows up clearly on the graphs: plotting PC2 against PC1, or even PC3, we see a highly divergent Maghreb (and to a lesser extent Yemen) vs. a relatively homogeneous Mashriq. (One might imagine that this reflects a Berber substratum, but that is unlikely here; few if any Berber loans make it onto the 100-word Swadesh list.) All of this corresponds rather well to synchronic criteria of mutual comprehensibility, although a Swadesh list is only a very indirect measure of that. But it doesn't tell us much about historical events, beyond the null hypothesis of continuous contact in rough proportion to distance; about all you need to explain this particular dataset is a map.
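To put a rough number on "overwhelmingly dominant", the sketch below turns the two eigenvalues quoted above into shares of the total. It rests on an assumption not stated in the text: that the similarity matrix has 1 (ie 100% similarity) in each of its nine diagonal cells, so that the eigenvalues add up to 9.

```python
n_dialects = 9                            # from the text
eigenvalues = {"PC1": 8.11, "PC2": 0.19}  # from the text

# Assumption: each dialect is 100% similar to itself, so the diagonal is all
# 1s and the eigenvalues sum to the number of dialects.
total = n_dialects * 1.0

shares = {pc: value / total for pc, value in eigenvalues.items()}
remainder = 1.0 - sum(shares.values())

for pc, share in shares.items():
    print(f"{pc}: {share:.1%} of the total")
print(f"all other components together: {remainder:.1%}")
```

Under that assumption PC1 accounts for about 90% of the total and PC2 for about 2%, leaving under 8% for the remaining seven components combined.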

(NEW: and with PC3:)

Friday, September 06, 2013

Y-chromosomes and language shift in North Africa

The other day I finally came across an easy-to-follow comparative presentation of North African genetic data, on Wikipedia of all things: Y-DNA haplogroups by populations of North Africa. I'm no geneticist, and welcome input from better-informed readers, but here's what that data looks like at first glance to a historical linguist.

As you might know, a man gets his Y-chromosome exclusively from his father (his mother doesn't have one). In North Africa, your ethnic/tribal/familial/etc identity – an important predictor of your language – is likewise traditionally supposed to be inherited from your father, not your mother. So it's illuminating to compare them.

A haplotype called E-M81 (or E1b1b, E3b) is frequent in Northwest Africa, and is held by large majorities of the Berber-speaking populations examined in Morocco or in the western/central Sahara; it is much less frequent in the Middle East. It seems reasonable to associate this haplotype with the spread of Berber. By contrast, haplotype J1 is very frequent in the Arabian Peninsula, but gets rarer and rarer as you go west; it seems reasonable to associate this haplotype with the Arab expansion. (Neither Berbers nor Arabs were ever completely homogeneous, so other, less frequent haplotypes may also be associated with one or the other of these events.)

The table gives four Algerian populations: Oran, Algiers, Tizi-Ouzou (Kabyle), and Mozabites. Mozabites, as might be expected, have a really high frequency of E-M81 (87%) and a really low frequency of J1 (1.5%). The other three, however, all have about 45% E-M81 (45%, 43%, 47% respectively) – in terms of the frequency of this presumably Berber marker, there is almost no difference between the Arabic speakers of Algiers and Oran and the Berber speakers of Tizi-Ouzou. In terms of the frequency of the originally Arab J1, the difference is hardly greater – 23% in Oran and Algiers vs. 16% in Tizi-Ouzou. Since we aren't sure about the historical interpretation of the rest of the haplotypes found, it may be more useful to consider the ratios of "Berber" E-M81 to "Arab" J1: roughly 2:1 for Oran and Algiers vs. 3:1 for Tizi-Ouzou (and, on the percentages just quoted, about 58:1 for Mozabites).
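The ratios are simple enough to check by hand, but for anyone who wants to redo them with other populations from the table, here is a minimal sketch. The percentages are the ones quoted in the paragraph above, taken at face value; with 87% and 1.5%, the Mozabite ratio works out to about 58:1.

```python
# (E-M81 %, J1 %) as quoted in the text; nothing else is assumed.
frequencies = {
    "Oran": (45.0, 23.0),
    "Algiers": (43.0, 23.0),
    "Tizi-Ouzou": (47.0, 16.0),
    "Mozabites": (87.0, 1.5),
}

# Ratio of the presumably "Berber" marker to the presumably "Arab" one.
ratios = {name: e_m81 / j1 for name, (e_m81, j1) in frequencies.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.1f}:1")
```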

What this tentatively tells us, in brief, is that:

In Algeria, plenty of Berber fathers adopted Arabic; if you are an Arabic speaker, you're very likely patrilineally Berber. (No surprise there!)
In Kabylie, a fair number of Arab fathers adopted Berber; if you are a Kabyle speaker, you may well nonetheless be patrilineally Arab. (Many readers will be surprised by this, but they shouldn't be: read about the history of the Sebaou valley in and after the Turkish period sometime, for example, let alone the more controversial example of the maraboutic families.)
Arabic was more likely to be adopted where more Arabs had come in, even though genetically, Arabs remained a minority. (In other words, Arabisation wasn't just about language shift.)
It's really rare for an outsider man to become Mozabite. (No surprise there either.)

A slightly different language shift situation is indicated by the comparison of Arab and Berber groups on Djerba (southern Tunisia). They do indeed differ on the frequency of J1 – the "Arabs" have it at 8.7%, while the Berbers have none at all. The Arabic speakers of Djerba appear to be genetically less Arab than the Kabyle speakers of Tizi-Ouzou! But, more importantly, we have what looks like a classic case of elite-led language shift: in this case, unlike Kabylie, the groups that incorporated Arab men simply ended up considering themselves Arab, while the ones that didn't stayed Berber. (I almost said kept speaking Berber, but actually, many Berber speakers of Djerba have been shifting to Arabic.)

Finally, one Berber-speaking population stands out radically in this table: Siwa. There is no significant presence of E-M81 there, and not much J1 either. The haplotypes best represented there are R1b – usually associated with Western Europe and, for some reason, with Chadic speakers – and B2a1a, usually associated with central and eastern sub-Saharan Africa. R1b has a reasonable frequency in Kabylie and Niger Tuareg, and to a lesser extent in Egypt, so we might suppose that it reflects the oasis' Berber roots, or that it reflects immigration from the east; we'd need non-Tuareg Libyan Berber genetic data to test that hypothesis. B, however, isn't common anywhere else in North Africa; does it derive from the slave trade, or from some older population of the region? Again, I think more data from Libya will be needed to make sense of this.