The Enron Corpus

Here’s a good one from the department of unintended consequences.

In 2003, the Federal Energy Regulatory Commission released a million and a half Enron email messages to the public. FERC was very rightly taken to task by many for releasing the entire set of emails, because a large number were personal emails of innocent employees, some of which contained information like social security numbers, and only a small fraction of which contained anything incriminating. FERC did backtrack a bit and took down the messages containing social security numbers and employee performance evaluations. But the rest remain publicly accessible.

Putting the ethics of how the messages were obtained and released aside, the entry of this corpus into the public record has really been a boon to various branches of computer science, particularly information retrieval and knowledge discovery. There has been lots of interesting academic work with the Enron corpus: everything from spam filters to social network analysis to information retrieval to linguistic analysis. I’ve also seen demos of various commercial enterprise search products using it, and have read about some other commercial products which have used it in R&D. Really, researchers have never had anything like this: a full corporate email database captured in the wild.

And it should also be a cautionary tale to not put anything in your work email that you wouldn’t want anyone in the world to read. (My previous barometer for what to put in work email was to ask myself if I would mind one of the IT guys reading it.)

3 thoughts on “The Enron Corpus”

  1. Always assume that anything you say or write is potentially accessible to anyone in the world. (Anything said in a room with a window, for example, can be picked up from outside using a laser interferometer reflected off the window.)

    The only privacy anywhere is inside your own head, and even then, you may well talk in your sleep.

  2. Ah, paranoia. The world gets bigger, and each person’s personal world gets more and more insular in reaction. Maybe this is not making sense. It seems I have this constant battle raging lately between a) freeing myself to express and just be who I am and b) pressures to be extraordinarily careful about what I say and do, and–perhaps more importantly–how I say/do it. It certainly complicates connections with other people, even if the expanding world makes it more and more possible to connect with all sorts of different people. I guess the result is more connections, but more superficial ones. Maybe I should have started this comment with a “wandering tangent alert.” Paranoia!!

  3. Terri, you make perfect sense. Ten years ago — or really, even fewer than that — I probably wouldn’t have been recognized by people who’ve gotten to know me more recently. The false sense of security of the Internets, a climate that was less Patriot Act-informed, and a general sense of recklessness and what-the-hecklessness emboldened me to take chances I wouldn’t consider today. And, I think (this is certainly no-duh-ish), deciding to keep an online journal can render one hypersensitive to what the results of any given public confession might be. It’s possible that the me of today feels she has a lot more to lose by not at least weighing the consequences of my blatherings before each post. It’s also probable that I felt bolder in my early blog attempts because I didn’t take the risk of telling anyone I actually knew that I was doing such a thing.

    Bluh bluh bluh — who’s for a round of le cadavre exquis? (Me! Me!)

