The Enron Corpus

Here’s a good one from the department of unintended consequences.

In 2003, the Federal Energy Regulatory Commission released a million and a half Enron email messages to the public. FERC was rightly taken to task by many for releasing the entire set: a large number were personal emails of innocent employees, some containing information like social security numbers, and only a small fraction of the messages contained anything incriminating. FERC did backtrack a bit and took down the messages containing social security numbers and employee performance evaluations. But the rest remain publicly accessible.

Putting the ethics of how the messages were obtained and released aside, the entry of this corpus into the public record has been a real boon to various branches of computer science, particularly information retrieval and knowledge discovery. There has been lots of interesting academic work with the Enron corpus: everything from spam filters to social network analysis to information retrieval to linguistic analysis. I’ve also seen demos of various commercial enterprise search products using it, and have read about some other commercial products which have used it in R&D. Researchers have never had anything like this: a full corporate email database captured in the wild.

And it should also be a cautionary tale: don’t put anything in your work email that you wouldn’t want anyone in the world to read. (My previous barometer for what to put in work email was to ask myself if I would mind one of the IT guys reading it.)

Obscure Economic Indicators

Slate has a great but infrequent series featuring obscure economic indicators, like the number of shipping containers in L.A., or regional parking rates. (I posted another favorite last year, about how the number of Harvard B-School grads on Wall Street indicates that the stock market will do poorly.)

A quick search turns up a similar story in Inc. magazine from a few years ago; their examples are not as good, but they do quote the famous Providence Mayor Buddy Cianci.

A friend who went to a data mining conference last week mentioned that someone there had done some work indicating that the frequency of blog postings can be correlated with the unemployment rate. The last number he came up with was within 0.3% of the Fed’s number, and he had it six weeks before the Fed released theirs.
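
I don't know how that researcher actually did it, but for the curious, here is a minimal Python sketch of what checking a leading indicator like this might look like: a plain Pearson correlation, plus a version that shifts one series forward so that earlier blog counts are compared against later unemployment figures. The function names and every number below are made up for illustration; none of it comes from the conference talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(leading, lagging, lag):
    """Correlate leading[t] with lagging[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(leading, lagging)
    # Drop the last `lag` points of the leading series and the first
    # `lag` points of the lagging series so the two line up.
    return pearson(leading[:-lag], lagging[lag:])


if __name__ == "__main__":
    # Hypothetical monthly figures, invented purely for this example.
    blog_posts = [120, 135, 150, 160, 155, 170, 190, 210]
    unemployment = [4.9, 5.0, 5.0, 5.2, 5.4, 5.5, 5.4, 5.7]
    for lag in range(3):
        r = lagged_correlation(blog_posts, unemployment, lag)
        print(f"lag {lag} months: r = {r:.2f}")
```

A high correlation at a positive lag would only hint that blog activity moves ahead of the official number; it says nothing about why, and with a handful of data points it could easily be coincidence.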