metadatta.

Entries from May 2009

Managing Information on the Web, Wolfram-Style

May 15, 2009

While Stephen Wolfram may not be the most, um, orthodox figure in the scientific community (see, for example, Steven Levy’s bio, or Cosma Shalizi’s review of the modestly-titled A New Kind of Science), I don’t think anyone doubts the usefulness of Mathematica and the various things associated with it (e.g. MathWorld and the Demonstrations Project). And now apparently his latest production – WolframAlpha, Wolfram’s new Mathematica-based search engine – will be released to the public this Monday. It looks quite interesting.

Finding useful information on the internet can be difficult and incredibly annoying, particularly for scientists or anyone in search of statistics of some sort. Google and Wikipedia, while useful, can often be inefficient or yield inadequate results. Many new search engines tailored to various interests seem to have emerged recently, but I am not aware of any current tools that satisfactorily tackle this particular (non-trivial) problem. One solution for anyone interested in biology is bionumbers, a searchable database of useful biological facts and data taken straight from the literature — but I think it’s quite clear that a more general and comprehensive solution (which WolframAlpha purports to be) would be very cool.

Judging from Wolfram’s promo video and reviews on pcworld, techreview and semantic universe, Alpha seems to be bionumbers made significantly more powerful and comprehensive. You probably won’t want to use it over Google to find movie times or track your favorite celebrities’ love lives; but you will want to use it to find various kinds of quantitative information: various metrics of the weather in Springfield, MA on the day David Ortiz was born, the location and sequence of some gene, the flowfield over a particular airfoil, the current position of the International Space Station, or data on blood cholesterol and potassium levels of middle-aged male smokers, for example. I look forward to pushing the limits of this tool, and it looks very useful.

Not to be outmatched, Google recently announced plans to implement a similar kind of service using publicly-available data. I’m not sure when they will be releasing it, though, or how it will compare to WolframAlpha.

Categories: Artificial Intelligence · Computing · General · Media · Technology · Websites

Determining Linguistic Structures Using Entropy?

May 2, 2009 · 2 Comments

Here’s an interesting recent controversy. While I know absolutely nothing about the subject, the ideas and questions raised are worth a look, so here’s a quick summary of the different opinions.

In one corner – Rao et al.
In a high-profile Brevia published in Science a week ago, Rao et al. suggest that

  1. the degree of randomness in linguistic systems is significantly different from that of nonlinguistic systems, and
  2. the degree of randomness of the script of the Indus civilization is similar to that of other linguistic systems – in particular, Sumerian and Old Tamil. (The similarity to Old Tamil is particularly striking because it seems to support the somewhat controversial opinion of some “that the Indus peoples spoke and wrote a Dravidian language” – here I’m quoting from Farmer et al.’s rebuttal.)

Based on this, their claim is that the Indus script encodes some kind of linguistic structure, in stark contradiction to some well-known work by Farmer et al. arguing that the Indus script is “a simple nonlinguistic sign system common in the ancient world”. Unsurprisingly, this has set off a number of critical responses, and it’s always fun to see discussion and debate of this sort go on.

Rao et al. quantify the degree of randomness of any given sequence of units, or “tokens” (for example, words or characters in English), by computing the conditional entropy, a standard measure of randomness in information theory. Put simply, this quantity is a measure of how flexibly different tokens can be ordered: in a nonlinguistic system where different tokens are ordered at random – what Rao et al. call a “Type 1 nonlinguistic system” – the conditional entropy is high, while in a nonlinguistic system where a given token must be followed by another specific token – a “Type 2 nonlinguistic system” – the conditional entropy is low. Intuitively, it is perhaps not surprising that linguistic systems fall somewhere in between: Rao et al. verify this by computing the conditional entropy for a few different linguistic systems, as well as two synthetic nonlinguistic systems (type 1 and type 2). They use this to support their first claim. Furthermore, they compute the conditional entropy for sequences of signs from the Indus script and – surprise, surprise – find that it falls somewhere in between the type 1 and type 2 nonlinguistic systems, just like the other linguistic systems they studied. They use this to support their second claim.
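To make the quantity a bit more concrete, here is a rough Python sketch of a conditional entropy calculation on toy data. It uses the simplest possible estimate, counting adjacent token pairs and plugging the frequencies into H(next | current) = −Σ p(current, next) log₂ p(next | current). This is only my illustration: the function name, the eight-letter alphabet, the sequence lengths and the 50/50 mixing rule are all made up, and it is not Rao et al.’s actual procedure or data.

```python
import math
import random
from collections import Counter


def conditional_entropy(tokens):
    """Plug-in estimate of H(next | current), in bits, from adjacent pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (first, _second), count in pair_counts.items():
        p_pair = count / total                      # p(current, next)
        p_next_given_first = count / first_counts[first]  # p(next | current)
        entropy -= p_pair * math.log2(p_next_given_first)
    return entropy


rng = random.Random(0)           # fixed seed so the toy data is repeatable
alphabet = list("abcdefgh")      # hypothetical 8-token "sign list"
index = {token: i for i, token in enumerate(alphabet)}
n = 20000

# Toy stand-in for a "Type 1" system: every token drawn independently at random.
random_seq = [rng.choice(alphabet) for _ in range(n)]

# Toy stand-in for a "Type 2" system: each token is always followed by the same one.
rigid_seq = [alphabet[i % len(alphabet)] for i in range(n)]

# An arbitrary mixture (not a language): half the time follow the fixed
# successor, half the time pick any token at random.
mixed_seq = [rng.choice(alphabet)]
for _ in range(n - 1):
    prev = mixed_seq[-1]
    if rng.random() < 0.5:
        mixed_seq.append(alphabet[(index[prev] + 1) % len(alphabet)])
    else:
        mixed_seq.append(rng.choice(alphabet))

h_random = conditional_entropy(random_seq)
h_rigid = conditional_entropy(rigid_seq)
h_mixed = conditional_entropy(mixed_seq)

print(f"random: {h_random:.3f} bits")
print(f"rigid:  {h_rigid:.3f} bits")
print(f"mixed:  {h_mixed:.3f} bits")
```

With eight tokens the random sequence should come out close to the maximum of log₂ 8 = 3 bits and the rigid one at 0 bits, with the mixture somewhere in between. Bear in mind that a pair-counting estimate like this is sensitive to how much data you have relative to the number of distinct tokens, so treat it as a sketch of the idea rather than a usable analysis.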

In the other corner – Farmer et al., Liberman, Pereira, Shalizi, Sproat, and others.
Farmer et al. – whose work Rao et al.’s paper contradicts – have written a pretty strong response to it. Among other things, Farmer et al. claim that their original work from 2004 “awakened resistance from Indian nationalists and researchers whose entire careers have been linked to the Indus-script thesis, one of whom is listed as coauthor of [Rao et al.'s] study”; and, “if [Rao et al.'s] paper had been properly peer reviewed it would not have been published.” Ouch. Here are their main critiques of this work:

  • Rao et al. used “synthetic” type 1 and type 2 nonlinguistic data in their calculations – that is, they created it according to certain rules. In a sense, these are designed to represent two different extremes on the “conditional entropy spectrum”, and as such it is not surprising that linguistic systems fall somewhere in between. Other nonlinguistic systems might, as well – so, claim #1 is unsubstantiated.
  • The idea that the Indus signs are in some linguistic way related to Old Tamil does not make sense historically: for example, “the first attestation of Old Tamil came nearly two thousand years after the Indus civilization disappeared”.

Others have weighed in on this as well, including Mark Liberman, Fernando Pereira, Cosma Shalizi, and Richard Sproat. In particular, Liberman, Shalizi and Sproat have come up with simple counterexamples to Rao et al.’s argument, showing instances of nonlinguistic datasets that show at least qualitatively similar behavior to Rao et al.’s linguistic datasets. At least for now, Pereira’s comment that language is “a system… carrying lots of specific information that cannot be captured by a single statistic” seems to hold.

Categories: Academia · Artificial Intelligence · Interdisciplinary · Mathematics · Models · Papers · Physics · Science · Social Science · Statistics