metadatta.

Entries categorized as ‘Statistics’

Determining Linguistic Structures Using Entropy?

May 2, 2009 · 2 Comments

Here’s an interesting recent controversy. While I know absolutely nothing about the subject, the ideas and questions raised are interesting, so here’s a quick summary of the different opinions.

In one corner – Rao et al.
In a high-profile Brevia published in Science a week ago, Rao et al. suggest that

  1. the degree of randomness in linguistic systems is significantly different from that of nonlinguistic systems, and
  2. the degree of randomness of the script of the Indus civilization is similar to that of other linguistic systems – in particular, Sumerian and Old Tamil. (The similarity to Old Tamil is particularly striking because it seems to support the somewhat controversial opinion of some “that the Indus peoples spoke and wrote a Dravidian language” – here I’m quoting from Farmer et al.’s rebuttal.)

Based on this, their claim is that the Indus script encodes some kind of linguistic structure, in stark contradiction to some well-known work by Farmer et al. arguing that the Indus script is “a simple nonlinguistic sign system common in the ancient world”. Unsurprisingly, this has set off a number of critical responses, and it’s always fun to see discussion and debate of this sort go on.

The way Rao et al. quantify the degree of randomness of any given sequence of units, or “tokens” (for example, words or characters in English) is by computing the conditional entropy, a standard measure of randomness in information theory. Simplistically, this quantity is a measure of how flexibly different tokens can be ordered: in a nonlinguistic system where different tokens are ordered at random – what Rao et al. call a “Type 1 nonlinguistic system” – the conditional entropy is high, while in a nonlinguistic system where a given token must be followed by another specific token – a “Type 2 nonlinguistic system” – the conditional entropy is low. Intuitively, it is perhaps not surprising that linguistic systems fall somewhere in between: Rao et al. verify this by computing the conditional entropy for a few different linguistic systems, as well as two synthetic nonlinguistic systems (type 1 and type 2). They use this to support their first claim. Furthermore, they compute the conditional entropy for sequences of signs from the Indus script and – surprise, surprise – find that it falls somewhere in between the type 1 and type 2 nonlinguistic systems, just like the other linguistic systems they studied. They use this to support their second claim.
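
To get a feel for the numbers, here's a minimal Python sketch of the idea (mine, not Rao et al.'s code). It estimates the conditional entropy of the next token given the current one from simple bigram counts, which may well differ from the estimator used in the paper. The eight-letter alphabet, the sequence length, and the 70% "follow the rule" probability are all made-up assumptions for illustration.

```python
import math
import random
from collections import Counter


def conditional_entropy(tokens):
    """Estimate H(next token | current token) in bits from bigram counts."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(x for x, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (x, y), count in pair_counts.items():
        p_xy = count / n                       # joint probability p(x, y)
        p_y_given_x = count / first_counts[x]  # conditional probability p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h


random.seed(0)                # fixed seed so the run is repeatable
alphabet = list("abcdefgh")   # hypothetical 8-sign "script"
length = 20000

# "Type 1": every sign is drawn at random, independent of the previous one.
type1 = [random.choice(alphabet) for _ in range(length)]

# "Type 2": each sign is always followed by the same next sign (a rigid cycle).
type2 = [alphabet[i % len(alphabet)] for i in range(length)]

# Something in between: follow the cycle 70% of the time, otherwise jump at random.
idx = 0
mixed = [alphabet[idx]]
for _ in range(length - 1):
    if random.random() < 0.7:
        idx = (idx + 1) % len(alphabet)
    else:
        idx = random.randrange(len(alphabet))
    mixed.append(alphabet[idx])

h1 = conditional_entropy(type1)
h2 = conditional_entropy(type2)
hm = conditional_entropy(mixed)
print(f"type 1 (random): {h1:.3f} bits")
print(f"type 2 (rigid):  {h2:.3f} bits")
print(f"mixed:           {hm:.3f} bits")
```

The rigid sequence should come out at zero bits, the random one near log2(8) = 3 bits, and the mixed one somewhere in between, even though none of the three is a language.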

In the other corner – Farmer et al., Liberman, Pereira, Shalizi, Sproat, and others.
Farmer et al. – whose work Rao et al. contradict – have written a pretty strong response to Rao et al.’s paper. Among other things, Farmer et al. claim that their original work from 2004 “awakened resistance from Indian nationalists and researchers whose entire careers have been linked to the Indus-script thesis, one of whom is listed as coauthor of [Rao et al.'s] study”; and, “if [Rao et al.'s] paper had been properly peer reviewed it would not have been published.” Ouch. Here are their main critiques of this work:

  • Rao et al. used “synthetic” type 1 and type 2 nonlinguistic data in their calculations – that is, they created it according to certain rules. In a sense, these are designed to represent two different extremes on the “conditional entropy spectrum”, and as such it is not surprising that linguistic systems fall somewhere in between. Other nonlinguistic systems might, as well – so, claim #1 is unsubstantiated.
  • The idea that the Indus signs are in some linguistic way related to Old Tamil does not make sense historically: for example, “the first attestation of Old Tamil came nearly two thousand years after the Indus civilization disappeared”.

Others have weighed in on this as well, including Mark Liberman, Fernando Pereira, Cosma Shalizi, and Richard Sproat. In particular, Liberman, Shalizi and Sproat have come up with simple counter-examples to Rao et al.’s argument: nonlinguistic datasets that behave, at least qualitatively, like Rao et al.’s linguistic datasets. At least for now, Pereira’s comment that language is “a system… carrying lots of specific information that cannot be captured by a single statistic” seems to hold.

Categories: Academia · Artificial Intelligence · Interdisciplinary · Mathematics · Models · Papers · Physics · Science · Social Science · Statistics

Funny Journal Content

January 29, 2007 · 1 Comment

1. A candidate for the funniest journal title/paper graphic…
Here’s a cute paper: rolling a single molecule at the atomic scale. The authors look at C44H24, a molecule possessing two triptycene ‘wheels’ (with three ‘paddles’ each) and thus two intramolecular degrees of freedom when adsorbed on a metal surface (the independent rotation of each wheel), and push it along with an STM tip. Interestingly, the STM current is a good indicator of what kind of motion the molecule is undergoing (‘rolling’ versus ‘hopping’). What I find most amusing is that the molecule was previously used to construct a ‘molecular wheelbarrow’, a result which was published in Tetrahedron Letters – probably the funniest journal title I’ve come across – and includes the following priceless graphic:

0.gif

2. Can a biologist fix a radio? Or, what one scientist learned while studying apoptosis
Speaking of funny papers, this paper by Yuri Lazebnik (via Structure+Strangeness) is great. Here’s an excerpt, dealing with the question of how a biologist would fix a radio, knowing only that it is a box meant to play music:

How would we begin? First, we would secure funds to obtain a large supply of identical functioning radios in order to dissect and compare them to the one that is broken. We would eventually find how to open the radios and will find objects of various shape, color, and size. We would describe and classify them into families according to their appearance. We would describe a family of square metal objects, a family of round brightly colored objects with two legs, round-shaped objects with three legs and so on. Because the objects would vary in color, we will investigate whether changing the colors affects the radio’s performance. Although changing the colors would have only attenuating effects (the music is still playing but a trained ear of some people can discern some distortion), this approach will produce many publications and result in a lively debate.

3. Formation of a nematic fluid at high fields in Sr3Ru2O7:
I had quite a lengthy post on electronic liquid crystals in 2-dimensional electron gases (e.g. GaAs/AlGaAs heterostructures) a while back, and briefly noted that:

Scientists in Europe have measured a large magnetoresistive anisotropy in the correlated electron oxide strontium ruthenate (Sr3Ru2O7) near the ‘metamagnetic quantum critical point’, indicating the formation of a new quantum nematic phase. This is strikingly similar to the transport anisotropy in 2DEGs I’ve been talking about… in particular, both show strong sensitivity to disorder – and the authors claim that the formation of this phase is tuned by the divergence in the quasiparticle effective mass near this critical point. One can only wonder what other kinds of systems could yield such behavior as well.

This European work is now one of the feature papers for the online Journal Club for Condensed Matter Physics, with a far more in-depth (yet very readable) commentary by Catherine Kallin of McMaster University in Canada.

Categories: Academia · Biophysics · Carbon Nanotubes · Condensed Matter Physics · Electronic Liquid Crystals · Interdisciplinary · Nanoscale Science · Nanotechnology · Papers · Physics · Quantum Mechanics · STM · Science · Statistics · Technology · Websites

Pseudo-book review: Edward Tufte

January 26, 2007 · 4 Comments

tufte.jpg
(from ‘The Visual Display of Quantitative Information’, Edward Tufte)

Perhaps best known in some circles for his scathing critique of Microsoft PowerPoint, Edward Tufte is the Leonardo da Vinci of data, as the New York Times put it, and his self-published books (the newly released Beautiful Evidence or the all-time classic The Visual Display of Quantitative Information) are quite elegant.

Tufte isn’t just about making things look pretty – the epilogue of the latter book (excerpted above) says it best: “what is to be sought in designs for the display of information is the clear portrayal of complexity… that is, the revelation of the complex.” There are more books, too, but those are the two that I came across recently, and the thing is, he really means it. This man is in the business of taking data, getting rid of everything extraneous, superfluous, and distracting, presenting it in the most honest and unassuming form possible, and doing it in as accessible and user-friendly a way as possible. And you know what? Among other things, this is the business of science, too – to take good data, and force it to reveal its secrets. Although Tufte comes from a social sciences background, I think his work is invaluable to any experimentalist, at the very least.

Categories: Book Review · Design · Interdisciplinary · Media · People · Skepticism · Statistics

Bayesian Humor

January 6, 2007

I was googling Bayes’ Theorem, loosely motivated by this post (think Bayesian spam filtering) and came across this website. Here are some excerpts:

Your friends and colleagues are talking about something called “Bayes’ Theorem” or “Bayes’ Rule”, or something called Bayesian reasoning. They sound really enthusiastic about it, too, so you google and find a webpage about Bayes’ Theorem and…

It’s this equation. That’s all. Just one equation. The page you found gives a definition of it, but it doesn’t say what it is, or why it’s useful, or why your friends would be interested in it. It looks like this random statistics thing.

So you came here. Maybe you don’t understand what the equation says. Maybe you understand it in theory, but every time you try to apply it in practice you get mixed up trying to remember the difference between p(a|x) and p(x|a), and whether p(a)*p(x|a) belongs in the numerator or the denominator. Maybe you see the theorem, and you understand the theorem, and you can use the theorem, but you can’t understand why your friends and/or research colleagues seem to think it’s the secret of the universe. Maybe your friends are all wearing Bayes’ Theorem T-shirts, and you’re feeling left out. Maybe you’re a girl looking for a boyfriend, but the boy you’re interested in refuses to date anyone who “isn’t Bayesian”. What matters is that Bayes is cool, and if you don’t know Bayes, you aren’t cool.

or this fabulous Q&A:

Q. What is the Bayesian Conspiracy?
A. The Bayesian Conspiracy is a multinational, interdisciplinary, and shadowy group of scientists that controls publication, grants, tenure, and the illicit traffic in grad students. The best way to be accepted into the Bayesian Conspiracy is to join the Campus Crusade for Bayes in high school or college, and gradually work your way up to the inner circles. It is rumored that at the upper levels of the Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.
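
For anyone who genuinely does get p(a|x) and p(x|a) mixed up, here's a tiny Python sketch of the theorem in the spam-filtering setting mentioned above. The function name and all three probabilities are invented for illustration; they don't come from the quoted website or from any real filter.

```python
def posterior(prior_a, p_x_given_a, p_x_given_not_a):
    """Bayes' theorem: p(a|x) = p(a) * p(x|a) / p(x)."""
    numerator = prior_a * p_x_given_a  # p(a) * p(x|a) goes in the numerator
    # p(x) is the total probability of seeing x, with or without a.
    evidence = numerator + (1 - prior_a) * p_x_given_not_a
    return numerator / evidence


# Made-up numbers: a = "message is spam", x = "message contains some word".
p_spam = 0.2              # assumed prior fraction of spam
p_word_given_spam = 0.5   # assumed p(x|a)
p_word_given_ham = 0.01   # assumed p(x|not a)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(f"p(word | spam) = {p_word_given_spam:.3f}")
print(f"p(spam | word) = {p_spam_given_word:.3f}")
```

The point of printing both numbers is just that they differ: how often spam contains the word is not the same as how often a message containing the word is spam.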

Categories: General · Humor · Statistics · Websites