Here’s an interesting recent controversy. While I know absolutely nothing about the subject, the ideas and questions raised are interesting, so here’s a quick summary of the different opinions.
In one corner – Rao et al.
In a high-profile Brevia published in Science a week ago, Rao et al. suggest that
- the degree of randomness in linguistic systems is significantly different from that of nonlinguistic systems, and
- the degree of randomness of the script of the Indus civilization is similar to that of other linguistic systems – in particular, Sumerian and Old Tamil. (The similarity to Old Tamil is particularly striking because it seems to support the somewhat controversial opinion of some “that the Indus peoples spoke and wrote a Dravidian language” – here I’m quoting from Farmer et al.’s rebuttal.)
Based on this, their claim is that the Indus script encodes some kind of linguistic structure, in stark contradiction to some well-known work by Farmer et al. arguing that the Indus script is “a simple nonlinguistic sign system common in the ancient world”. Unsurprisingly, this has set off a number of critical responses, and it’s always fun to see discussion and debate of this sort go on.
Rao et al. quantify the degree of randomness of a sequence of units, or “tokens” (for example, words or characters in English), by computing its conditional entropy, a standard measure of randomness in information theory. Roughly speaking, this quantity measures how flexibly different tokens can be ordered: in a nonlinguistic system where tokens are ordered at random – what Rao et al. call a “Type 1 nonlinguistic system” – the conditional entropy is high, while in a nonlinguistic system where a given token must be followed by another specific token – a “Type 2 nonlinguistic system” – the conditional entropy is low. Intuitively, it is perhaps not surprising that linguistic systems fall somewhere in between. Rao et al. verify this by computing the conditional entropy for a few different linguistic systems, as well as for two synthetic nonlinguistic systems (type 1 and type 2), and use the result to support their first claim. They then compute the conditional entropy for sequences of signs from the Indus script and – surprise, surprise – find that it falls somewhere in between the type 1 and type 2 nonlinguistic systems, just like the linguistic systems they studied. They use this to support their second claim.
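To make the idea concrete, here is a minimal Python sketch of how such a number could be computed. It is not Rao et al.'s procedure: the function name `conditional_entropy`, the eight-letter toy alphabet, the 0.7 "follow the rule" probability and the plain plug-in estimate from adjacent pairs are all my own assumptions, chosen only to show a fully random sequence, a fully deterministic one, and something in between.

```python
import math
import random
from collections import Counter

def conditional_entropy(tokens):
    """Plug-in estimate, in bits, of H(next token | current token).

    Uses counts of adjacent pairs (a, b):
        H = - sum over (a, b) of p(a, b) * log2 p(b | a)
    """
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    total = len(pairs)
    h = 0.0
    for (a, b), n in pair_counts.items():
        p_pair = n / total                # empirical p(a, b)
        p_b_given_a = n / first_counts[a]  # empirical p(b | a)
        h -= p_pair * math.log2(p_b_given_a)
    return h

rng = random.Random(0)        # fixed seed so the run is repeatable
alphabet = list("ABCDEFGH")   # hypothetical toy sign inventory
length = 20000

# "Type 1"-style toy data: every token drawn independently at random.
type1 = [rng.choice(alphabet) for _ in range(length)]

# "Type 2"-style toy data: each token is always followed by the same token.
type2 = [alphabet[i % len(alphabet)] for i in range(length)]

# Something in between: usually follow the fixed rule, sometimes pick at random.
mixed = [alphabet[0]]
for _ in range(length - 1):
    if rng.random() < 0.7:
        nxt = alphabet[(alphabet.index(mixed[-1]) + 1) % len(alphabet)]
    else:
        nxt = rng.choice(alphabet)
    mixed.append(nxt)

h1 = conditional_entropy(type1)
h2 = conditional_entropy(type2)
hm = conditional_entropy(mixed)
print(f"random: {h1:.3f} bits, deterministic: {h2:.3f} bits, mixed: {hm:.3f} bits")
```

With eight equally likely tokens the random sequence sits close to the maximum of log2(8) = 3 bits, the deterministic one sits at zero, and the mixed one lands between the two. Note that the mixed sequence is not language either; it only shows how the statistic behaves between the two extremes.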
In the other corner – Farmer et al., Liberman, Pereira, Shalizi, Sproat, and others.
Farmer et al. – whose work Rao et al.’s paper contradicts – have written a pretty strong response. Among other things, Farmer et al. claim that their original work from 2004 “awakened resistance from Indian nationalists and researchers whose entire careers have been linked to the Indus-script thesis, one of whom is listed as coauthor of [Rao et al.'s] study”; and, “if [Rao et al.'s] paper had been properly peer reviewed it would not have been published.” Ouch. Here are their main critiques of this work:
- Rao et al. used “synthetic” type 1 and type 2 nonlinguistic data in their calculations – that is, they created it according to certain rules. In a sense, these are designed to represent two different extremes on the “conditional entropy spectrum”, and as such it is not surprising that linguistic systems fall somewhere in between. Other nonlinguistic systems might, as well – so, claim #1 is unsubstantiated.
- The idea that the Indus signs are in some linguistic way related to Old Tamil does not make sense historically: for example, “the first attestation of Old Tamil came nearly two thousand years after the Indus civilization disappeared”.
Others have weighed in on this as well, including Mark Liberman, Fernando Pereira, Cosma Shalizi, and Richard Sproat. In particular, Liberman, Shalizi and Sproat have come up with simple counter-examples to Rao et al.’s argument: nonlinguistic datasets that show at least qualitatively similar behavior to Rao et al.’s linguistic datasets. At least for now, Pereira’s comment that language is “a system… carrying lots of specific information that cannot be captured by a single statistic” seems to hold.



