Tuesday, May 08, 2018

Songhay viewed through PCA

Playing around a bit more with PCA, I decided to apply the method* to a dataset I've worked with more extensively: Songhay, a compact language family spoken mainly in Niger and Mali. On a hundred-word list (Swadesh with a few changes), randomly choosing one form in cases of synonymy and including borrowings, I get the following table of lexical cognate proportions (1 = 100% cognate):

           Tabelbala  Tadaksahak Tagdal     In-Gall    Timbuktu   Djenne     Kikara     Hombori    Zarma      Djougou
Tabelbala  1          0.678      0.67       0.687      0.636      0.667      0.625      0.622      0.616      0.602
Tadaksahak 0.678      1          0.857      0.8        0.63       0.635      0.567      0.576      0.58       0.586
Tagdal     0.67       0.857      1          0.857      0.632      0.649      0.579      0.588      0.582      0.588
In-Gall    0.687      0.8        0.857      1          0.65       0.667      0.598      0.606      0.6        0.606
Timbuktu   0.636      0.63       0.632      0.65       1          0.979      0.773      0.808      0.79       0.778
Djenne     0.667      0.635      0.649      0.667      0.979      1          0.753      0.789      0.771      0.768
Kikara     0.625      0.567      0.579      0.598      0.773      0.753      1          0.835      0.814      0.823
Hombori    0.622      0.576      0.588      0.606      0.808      0.789      0.835      1          0.838      0.867
Zarma      0.616      0.58       0.582      0.6        0.79       0.771      0.814      0.838      1          0.808
Djougou    0.602      0.586      0.588      0.606      0.778      0.768      0.823      0.867      0.808      1

Running this through R again to get its eigenvectors, I find the first two principal components easy to interpret:
  • PC1 (eigenvalue=7.3) separates Songhay into three low-level subgroups - Western, Eastern, and Northern, in that order - with an obvious longitude effect: it traces a line eastward all the way down the Niger River, jumps further east to In-Gall, and then proceeds back westward through the Sahara.
  • PC2 (eigenvalue=1.1) measures the level of Berber/Tuareg influence.
All the other eigenvectors have eigenvalues lower than 0.4, and are thus much less significant.
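
For anyone who would rather not use R, the same eigendecomposition can be sketched in Python. This is only an illustration, assuming NumPy is installed. It applies `numpy.linalg.eigh` directly to the similarity table above, with no centring, on the assumption that this matches what was done in R (the post does not show the R call). The variable names are arbitrary.

```python
import numpy as np

names = ["Tabelbala", "Tadaksahak", "Tagdal", "In-Gall", "Timbuktu",
         "Djenne", "Kikara", "Hombori", "Zarma", "Djougou"]

# Cognate proportions copied from the table above (rows and columns in the same order).
S = np.array([
    [1,     0.678, 0.67,  0.687, 0.636, 0.667, 0.625, 0.622, 0.616, 0.602],
    [0.678, 1,     0.857, 0.8,   0.63,  0.635, 0.567, 0.576, 0.58,  0.586],
    [0.67,  0.857, 1,     0.857, 0.632, 0.649, 0.579, 0.588, 0.582, 0.588],
    [0.687, 0.8,   0.857, 1,     0.65,  0.667, 0.598, 0.606, 0.6,   0.606],
    [0.636, 0.63,  0.632, 0.65,  1,     0.979, 0.773, 0.808, 0.79,  0.778],
    [0.667, 0.635, 0.649, 0.667, 0.979, 1,     0.753, 0.789, 0.771, 0.768],
    [0.625, 0.567, 0.579, 0.598, 0.773, 0.753, 1,     0.835, 0.814, 0.823],
    [0.622, 0.576, 0.588, 0.606, 0.808, 0.789, 0.835, 1,     0.838, 0.867],
    [0.616, 0.58,  0.582, 0.6,   0.79,  0.771, 0.814, 0.838, 1,     0.808],
    [0.602, 0.586, 0.588, 0.606, 0.778, 0.768, 0.823, 0.867, 0.808, 1],
])

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order,
# so reorder to put the largest first.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

print("eigenvalues:", np.round(eigvals, 2))
for name, pc1, pc2 in zip(names, eigvecs[:, 0], eigvecs[:, 1]):
    print(f"{name:<11} PC1={pc1:+.3f}  PC2={pc2:+.3f}")
```

The sign of each eigenvector is arbitrary, so the loadings may come out flipped relative to R's output; only the ordering and relative spacing of the varieties along each component matter.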

The resulting cluster patterns have a strikingly shallow time depth; as in the Arabic example in my last post, this method's results correspond well to criteria of synchronic mutual intelligibility (Western Songhay is much easier for Eastern Songhay speakers to understand than Northern is), but it completely fails to pick up on the deeper historical tie between Northern Songhay and Western Songhay (they demonstrably form a subgroup as against Eastern). It's nice how the strongest contact influence shows up as a PC, though; it would be worth exploring how good this method is at identifying contact more generally.


* Strictly speaking, this may not quite count as PCA - I'm starting from a similarity matrix generated non-numerically, rather than turning the lexical data into binary numeric data and letting that produce a similarity matrix.
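
To make that distinction concrete, here is a minimal Python sketch of the other route: code each variety's cognate class for each meaning as a 0/1 vector, and let the dot products of those vectors produce the similarity matrix. The three varieties and their cognate classes are invented purely for illustration; they are not Songhay data, and the function names are hypothetical.

```python
# Toy data, invented for illustration only (NOT real Songhay forms):
# one cognate class label per meaning for each hypothetical variety.
cognate_classes = {
    "VarietyA": {"water": 1, "fire": 1, "dog": 1, "two": 1},
    "VarietyB": {"water": 1, "fire": 2, "dog": 1, "two": 1},
    "VarietyC": {"water": 2, "fire": 2, "dog": 1, "two": 3},
}
meanings = ["water", "fire", "dog", "two"]

# One binary feature per (meaning, cognate class) pair attested in the data.
features = sorted({(m, row[m]) for row in cognate_classes.values() for m in meanings})

def encode(row):
    """Return the 0/1 vector marking which (meaning, class) pairs a variety has."""
    return [1 if row[m] == c else 0 for (m, c) in features]

binary = {name: encode(row) for name, row in cognate_classes.items()}

def similarity(x, y):
    """Shared cognates as a proportion of the meanings compared."""
    shared = sum(a * b for a, b in zip(x, y))
    return shared / len(meanings)

names = list(binary)
for a in names:
    print(a, [round(similarity(binary[a], binary[b]), 2) for b in names])
```

Standard PCA would go on to centre the binary columns and work from their covariance; this sketch stops at showing that the uncentred dot products reproduce the cognate proportions.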

Update, following Whygh's comment below: here's what SplitsTree gives based on the same table:

2 comments:

Whygh said...

Here's what a .nex file looks like for this data. I converted the similarities to distances by subtracting them from 1, and opened the .nex file in SplitsTree. It shows something like four branches: ( (E,W), (Tabelbala, other N) ), without much reticulation.

#nexus

BEGIN Taxa;
DIMENSIONS ntax=10;
TAXLABELS
[1] 'Tabelbala'
[2] 'Tadaksahak'
[3] 'Tagdal'
[4] 'In-Gall'
[5] 'Timbuktu'
[6] 'Djenne'
[7] 'Kikara'
[8] 'Hombori'
[9] 'Zarma'
[10] 'Djougou'
;
END; [Taxa]

BEGIN Distances;
DIMENSIONS ntax=10;
FORMAT labels=left diagonal triangle=both;
MATRIX
[1] 'Tabelbala' 0 0.322 0.33 0.313 0.364 0.333 0.375 0.378 0.384 0.398
[2] 'Tadaksahak' 0.322 0 0.143 0.2 0.37 0.365 0.433 0.424 0.42 0.414
[3] 'Tagdal' 0.33 0.143 0 0.143 0.368 0.351 0.421 0.412 0.418 0.412
[4] 'In-Gall' 0.313 0.2 0.143 0 0.35 0.333 0.402 0.394 0.4 0.394
[5] 'Timbuktu' 0.364 0.37 0.368 0.35 0 0.021 0.227 0.192 0.21 0.222
[6] 'Djenne' 0.333 0.365 0.351 0.333 0.021 0 0.247 0.211 0.229 0.232
[7] 'Kikara' 0.375 0.433 0.421 0.402 0.227 0.247 0 0.165 0.186 0.177
[8] 'Hombori' 0.378 0.424 0.412 0.394 0.192 0.211 0.165 0 0.162 0.133
[9] 'Zarma' 0.384 0.42 0.418 0.4 0.21 0.229 0.186 0.162 0 0.192
[10] 'Djougou' 0.398 0.414 0.412 0.394 0.222 0.232 0.177 0.133 0.192 0
;
END; [Distances]

BEGIN st_Assumptions;
disttransform=NeighborNet;
splitstransform=EqualAngle;
SplitsPostProcess filter=dimension value=4;
autolayoutnodelabels;
END; [st_Assumptions]
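
The conversion Whygh describes (distance = 1 - similarity) is easy to script. What follows is a minimal Python sketch, not the way the file above was actually produced. It uses only the first three varieties from the table to stay short, the function name `distances_block` is made up for illustration, and the block layout simply mirrors the file above.

```python
def distances_block(names, sim):
    """Build a NEXUS Distances block from a symmetric similarity matrix."""
    lines = [
        "BEGIN Distances;",
        f"DIMENSIONS ntax={len(names)};",
        "FORMAT labels=left diagonal triangle=both;",
        "MATRIX",
    ]
    for i, (name, row) in enumerate(zip(names, sim), start=1):
        # distance = 1 - similarity, printed to three decimals
        dists = " ".join(f"{1 - s:.3f}" for s in row)
        lines.append(f"[{i}] '{name}' {dists}")
    lines += [";", "END; [Distances]"]
    return "\n".join(lines)

# First three varieties from the similarity table in the post.
names = ["Tabelbala", "Tadaksahak", "Tagdal"]
sim = [
    [1.0,   0.678, 0.67],
    [0.678, 1.0,   0.857],
    [0.67,  0.857, 1.0],
]

block = distances_block(names, sim)
print(block)
```

Passing all ten names and the full table would give a Distances block with the same values as the one above, apart from trailing zeros.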

Lameen Souag الأمين سواق said...

Thanks! I'll have to play around with SplitsTree. I think the main reason it's missing the N-W link is the loanwords in N.