How search engines work

In a fit of further narcissism, I was looking at my word stats (the core narcissism here of course is simply having a blog to have word stats on), and I realized that they’re a good demonstration of how most search engines work. Since search is my specialty and it’s somehow one of those things that’s ubiquitous but still esoteric, like how TVs work, or the subjunctive mood, I thought I’d explain.

What are the important words in my blog? The ones I use most frequently, right? Nope. Exactly the opposite. The words I use most frequently are “the, to, a, of, and, i, in, that”, etc. It’s the ones I use least that mean the most: “secularist, cuttlefish, smugness, chowdah, lexicon, tearjerker, bowtie”. So, most search engines try to exploit this in some way, and the rarity of words across the whole document set is factored into the (basically bogus) relevance scores you sometimes see accompanying your results.
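To make the rare-words idea concrete, here is a rough sketch of the textbook version of it, usually called inverse document frequency. This is not how any particular engine does it: the function names and the three toy "posts" are made up for illustration, and real engines layer smoothing, length normalization and plenty else on top.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each word to log(N / number of documents containing it).

    A word that shows up in every document scores 0; the rarer the
    word, the higher the score.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # set() so a word counts once per document, however often it repeats
        doc_freq.update(set(text.lower().split()))
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}


def score(query, text, idf):
    """Sum, over the query words, of (times the word occurs in text) * (its rarity)."""
    term_counts = Counter(text.lower().split())
    return sum(term_counts[w] * idf.get(w, 0.0) for w in query.lower().split())


# Hypothetical stand-ins for blog posts.
posts = [
    "the cuttlefish is the smartest thing in the sea",
    "the smugness of the secularist",
    "a bowtie and a bowl of chowdah in the winter",
]

idf = inverse_document_frequency(posts)
ranked = sorted(posts, key=lambda p: score("the cuttlefish", p, idf), reverse=True)
```

Since “the” appears in all three posts, its weight comes out to zero, so a search for “the cuttlefish” is decided entirely by “cuttlefish”, and the first post lands on top.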

This approach ultimately sucks, though, and it’s worth noting that Google more or less threw this way of doing things out. But for document sets smaller than, say, the whole web, it usually works as well as anything else. And Google’s way of doing things doesn’t really work outside of the public web.

Terri just told the cat “let’s have some furry devotion”. I think that means it’s bedtime.


Just a note on the snottiness of my last post: I wasn’t really suggesting that it’s common knowledge, even for Cantabrigians, that e.e. cummings wrote a novel called The Enormous Room, but I would think that if you were writing a review of a restaurant, you might inquire into the source of the name…