Does anyone know of a good way to search for old blog entries on the web? I’ve just been looking at some of the available blog search resources, and few of them appear to provide any serious advanced search options. The couple of major ones I’ve found that do (after an admittedly cursory look) are Google and Ice Rocket. Both, however, appear to be broken, at least when it comes to dates. I’ve tried them on three different browsers, on Mac and PC, and in each case the date menus seem to be frozen. It’s very weird. They give you the option of entering a specific time range but won’t accept the actual dates. Maybe I’m just having a bad tech day, but it’s as if there’s some conceptual glitch across the web vis-à-vis blogs and time.
Most blog search engines are geared toward searching the current blogosphere, but there should be a way to research older content. My first thought was that blog search engines crawl RSS feeds, most of which do not transmit the entirety of a blog’s content, just the most recent entries. That would pose a problem for archival search.
Does anyone know what would be the best way to go about finding, say, old blog entries containing the keywords “new orleans superdome” from late August to late September 2005? Is it best to just stick with general web search and painstakingly comb through for blogs? If we agree that blogs have become an important kind of cultural document, then surely there should be a way to find them more than a month after they’ve been written.
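To make the request concrete, here is a minimal sketch of the kind of query I have in mind, assuming someone already held an archive of entries with reliable publication dates (which is exactly the part that seems to be missing). The function name `find_entries`, the dictionary fields, and the sample entries are all made up for illustration; they do not come from any real search service.

```python
from datetime import date

def find_entries(entries, keywords, start, end):
    """Return entries dated within [start, end] that contain every keyword.

    Matching is a naive, case-insensitive substring test, so "new" would
    also match "news". A real search engine would tokenize properly.
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for entry in entries:
        # Skip anything outside the requested time range.
        if not (start <= entry["published"] <= end):
            continue
        text = (entry["title"] + " " + entry["body"]).lower()
        if all(k in text for k in wanted):
            hits.append(entry)
    # Oldest first, so the results read as a timeline.
    return sorted(hits, key=lambda e: e["published"])

# Hypothetical archive: placeholder entries invented for this example.
sample_archive = [
    {"title": "Placeholder post A",
     "body": "Shelter conditions at the New Orleans Superdome",
     "published": date(2005, 9, 1)},
    {"title": "Placeholder post B",
     "body": "New Orleans Superdome reopens",
     "published": date(2006, 9, 25)},
    {"title": "Placeholder post C",
     "body": "A recipe for gumbo",
     "published": date(2005, 9, 10)},
    {"title": "Placeholder post D",
     "body": "superdome roof damage in new orleans",
     "published": date(2005, 8, 30)},
]

results = find_entries(
    sample_archive,
    ["new", "orleans", "superdome"],
    start=date(2005, 8, 25),
    end=date(2005, 9, 30),
)
for entry in results:
    print(entry["published"], entry["title"])
```

The filter itself is trivial. The hard part is the archive: if a search engine only ever sees what the RSS feed currently carries, the older entries never make it into `entries` in the first place.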
useful rss
Hi. I’m Jesse, the latest member to join the staff here at the Institute. I’m interested in network effects, online communities, and emergent behavior. Right now I’m interested in the tools we have available to control and manipulate RSS feeds. My goal is to collect a wide variety of feeds and tease out the threads that are important to me. In my experience, mechanical aggregation gives you quantity and diversity, but not quality and focus. So I did a quick investigation of the tools that exist to manage and manipulate feeds.
Sites like MetaFilter and Technorati skim the most popular topics in the blogosphere. But what sort of tools exist to help us narrow our focus? There are two tools that we can use right now: tag searching/filtering and keyword searching. Tag searches (on Technorati) and tag filtering (on MetaFilter) drill down to specific areas, like “books” or “books and publishing.” A casual search on MetaFilter was a complete failure, but Technorati, with its combination of tags and keyword search results, produced good material.
There is also Google Blog Search. As Google puts it, you can ‘find blogs on your favorite topics.’ PageRank works, so PageRank applied to blogs should work too. Unfortunately, it returns too many pages that, while ranked highly across the Internet as a whole, are either off topic or sit outside the sub-spheres of the topic you actually care about. For example, I searched for “gourmet food” and found one of the premier food blogs on the fourth page, just below Carpundit. Google Blog Search fails here because it can’t get small enough to understand the relationships in the blogosphere, and relies more heavily on text retrieval algorithms that sabotage the results.
Finally, let’s talk about aggregators. Here there is more human involvement, because you select the sites you’re interested in reading. This creates a personalized network of sites that are related, if only by your personal interest. The problem is that you get whatever the authors want to write about. Managing a large collection of feeds can be tiresome when you’re looking for specific information. Bloglines has a search function that allows you to find keywords inside your subscriptions, then treat the results as a feed. This neatly combines hand-picked sources with keyword or tag harvesting. The result: a slice from your trusted collection of authors about a specific topic.
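The idea of searching inside your own subscriptions can be sketched in a few lines. This is not how Bloglines does it (I have no idea what their implementation looks like); it is just a rough illustration using Python’s standard library, assuming plain RSS 2.0 documents. The function name `keyword_slice` and the two sample feeds are invented for the example.

```python
import xml.etree.ElementTree as ET

def keyword_slice(feed_documents, keyword):
    """Collect (feed title, item title, link) for items mentioning keyword.

    feed_documents is a list of RSS 2.0 documents as strings, standing in
    for a hand-picked set of subscriptions.
    """
    needle = keyword.lower()
    matches = []
    for xml_text in feed_documents:
        channel = ET.fromstring(xml_text).find("channel")
        if channel is None:
            continue  # not an RSS 2.0 document, skip it
        feed_title = channel.findtext("title", default="")
        for item in channel.findall("item"):
            title = item.findtext("title", default="")
            description = item.findtext("description", default="")
            # Naive case-insensitive match on title plus description.
            if needle in (title + " " + description).lower():
                link = item.findtext("link", default="")
                matches.append((feed_title, title, link))
    return matches

# Two hypothetical subscriptions, written inline so the sketch runs offline.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Notes on publishing</title><description>Books and the web</description><link>http://example.com/a/1</link></item>
<item><title>Weekend hike</title><description>Trail report</description><link>http://example.com/a/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Library news</title><description>New books on networked publishing</description><link>http://example.com/b/1</link></item>
<item><title>Gardening</title><description>Tomato season</description><link>http://example.com/b/2</link></item>
</channel></rss>"""

book_slice = keyword_slice([FEED_A, FEED_B], "books")
for feed_title, item_title, link in book_slice:
    print(feed_title, "|", item_title, "|", link)
```

The list that comes back is the “slice”: it could be written out again as a feed of its own, which is the step that makes the Bloglines approach so handy.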
What can we envision for the future of RSS? Affinity mapping and personalized recommendation systems could augment the tag/keyword search functionality to automatically generate a slice from a small network of trusted blogs. We could automatically harvest whole swaths of linked entries for offline reading in a bounded hypertext environment. We could repost and remix feed content on the fly based on text-processing algorithms. And we’ll have to deal with the dissolving identity and trust relationships that are a natural consequence of these innovations.