Category Archives: google

six blind men and an elephant

Thomas Mann, author of The Oxford Guide to Library Research, has published an interesting paper (pdf available) examining the shortcomings of search engines and the continued necessity of librarians as guides for scholarly research. It revolves around the case of a graduate student investigating tribute payments and the Peloponnesian War. A Google search turns up nearly 80,000 web pages and 700 books. An overwhelming retrieval with little in the way of conceptual organization and only the crudest of tools for measuring relevance. But, with the help of the LC Catalog and an electronic reference encyclopedia database, Mann manages to guide the student toward a manageable batch of about a dozen highly germane titles.
Summing up the problem, he recalls a charming old fable from India:

Most researchers – at any level, whether undergraduate or professional – who are moving into any new subject area experience the problem of the fabled Six Blind Men of India who were asked to describe an elephant: one grasped a leg and said “the elephant is like a tree”; one felt the side and said “the elephant is like a wall”; one grasped the tail and said “the elephant is like a rope”; and so on with the tusk (“like a spear”), the trunk (“a hose”) and the ear (“a fan”). Each of them discovered something immediately, but none perceived either the existence or the extent of the other important parts – or how they fit together.
Finding “something quickly,” in each case, proved to be seriously misleading to their overall comprehension of the subject.
In a very similar way, Google searching leaves remote scholars, outside the research library, in just the situation of the Blind Men of India: it hides the existence and the extent of relevant sources on most topics (by overlooking many relevant sources to begin with, and also by burying the good sources that it does find within massive and incomprehensible retrievals). It also does nothing to show the interconnections of the important parts (assuming that the important can be distinguished, to begin with, from the unimportant).

Mann believes that books will usually yield the highest quality returns in scholarly research. A search through a well-tended library catalog (controlled vocabularies, strong conceptual categorization) will necessarily produce a smaller, and therefore less overwhelming, set of returns than a search engine (books do not proliferate at the same rate as web pages). And those returns, pound for pound, are more likely to be relevant to the topic:

Each of these books is substantially about the tribute payments – i.e., these are not just works that happen to have the keywords “tribute” and “Peloponnesian” somewhere near each other, as in the Google retrieval. They are essentially whole books on the desired topic, because cataloging works on the assumption of “scope-match” coverage – that is, the assigned LC headings strive to indicate the contents of the book as a whole….In focusing on these books immediately, there is no need to wade through hundreds of irrelevant sources that simply mention the desired keywords in passing, or in undesired contexts. The works retrieved under the LC subject heading are thus structural parts of “the elephant” – not insignificant toenails or individual hairs.
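To make Mann’s point about “scope-match” concrete, here is a toy sketch (invented records and headings, not the LC catalog’s actual data or interface) of the difference between keyword matching and retrieval under a controlled subject heading:

    # Toy illustration: keyword matching vs. scope-match subject retrieval.
    # The records and headings below are invented for illustration only.
    catalog = [
        {"title": "The Athenian Tribute Lists",
         "subjects": ["Delian League", "Finance, Public -- Greece -- Athens"]},
        {"title": "A Survey of Ancient Greek History",
         "subjects": ["Greece -- History"]},   # mentions tribute only in passing
    ]

    def keyword_search(records, *terms):
        # crude full-text-style match: a record qualifies if all the terms
        # merely appear somewhere in its title or subject strings
        def text(r):
            return (r["title"] + " " + " ".join(r["subjects"])).lower()
        return [r["title"] for r in records if all(t.lower() in text(r) for t in terms)]

    def subject_search(records, heading):
        # scope-match: a record qualifies only if catalogers judged the book
        # as a whole to be about the assigned heading
        return [r["title"] for r in records if heading in r["subjects"]]

    print(keyword_search(catalog, "greece"))          # both titles surface
    print(subject_search(catalog, "Delian League"))   # only the book about the topic itself

The point of the sketch is only that the subject heading encodes a human judgment about what the whole book is about, which no amount of keyword co-occurrence supplies.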

If nothing else, this is a good illustration of how libraries, if used properly, can still be much more powerful than search engines. But it’s also interesting as a librarian’s perspective on what makes the book uniquely suited for advanced research. That is: a book is substantial enough to be a “structural part” of a body of knowledge, the idea being that “whole books” serve as rungs on a ladder toward knowing something. Books are a kind of conceptual architecture that, until recently, has been distinctly absent on the Web (though from the beginning certain people and services have endeavored to organize the Web meaningfully). Mann’s study captures the anxiety felt at the prospect of the book’s decline (the great coming blindness), and also the librarian’s understandable dread at having to rethink his or her entire way of organizing things.
It’s possible, however, to agree with the diagnosis and not the prescription. True, librarians have gotten very good at organizing books over time, but that’s not necessarily how scholarship will be produced in the future. David Weinberger ponders this:

As an argument for maintaining human expertise in manually assembling information into meaningful relationships, this paper is convincing. But it rests on supposing that books will continue to be the locus of worthwhile scholarly information. Suppose more and more scholars move onto the Web and do their thinking in public, in conversation with other scholars? Suppose the Web enables scholarship to outstrip the librarians? Manual assemblages of knowledge would retain their value, but they would no longer provide the authoritative guide. Then we will have either of two results: We will have to rely on “‘lowest common denominator’ and ‘one search box/one size fits all’ searching that positively undermines the requirements of scholarly research”…or we will have to innovate to address the distinct needs of scholars….My money is on the latter.

As I think is mine. Although I would not rule out the possibility of scholars actually participating in the manual assemblage of knowledge. Communities like MediaCommons could to some extent become their own libraries, vetting and tagging a wide array of electronic resources, developing their own customized search frameworks.
There’s much more in this paper than I’ve discussed, including a lengthy treatment of folksonomies (Mann sees them as a valuable supplement to, but not a substitute for, controlled taxonomies). Generally speaking, his articulation of the big challenges facing scholarly search and librarianship in the digital age is well worth the read, although I would argue with some of the conclusions.

cache me if you can

Over at Teleread, David Rothman has a pair of posts about Google’s new desktop RSS reader and a couple of new technologies for creating “offline web applications” (Google Gears and Adobe Apollo), tying them all together into an interesting speculation about the possibility of offline networked books. This would mean media-rich, hypertextual books that could cache some or all of their remote elements and be experienced whole (or close to whole) without a network connection, leveraging the local power of the desktop. Sophie already does this to a limited degree, caching remote movies for a brief period after unplugging.
Electronic reading is usually prey to a million distractions and digressions, but David’s idea suggests an interesting alternative: you take a chunk of the network offline with you for a more sustained, “bounded” engagement. We already do this with desktop email clients and RSS readers, which allow us to browse networked content offline. This could be expanded into a whole new way of reading (and writing) networked books: having your own copy of a networked document. It seems this should be a basic requirement for any dedicated e-reader worth its salt: the ability to run rich web applications with an “offline” option.
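As a rough sketch of the general idea (not how Google Gears, Apollo, or Sophie actually implement it; the example URL is hypothetical), taking a networked document offline amounts to walking its pages and pulling their remote assets into a local cache:

    # offline_cache.py -- a minimal sketch of caching a networked document's
    # remote assets for offline reading; not any product's actual mechanism.
    import os
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    class AssetCollector(HTMLParser):
        """Collect the src/href attributes of embedded media and stylesheets."""
        def __init__(self):
            super().__init__()
            self.assets = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("img", "video", "audio", "source", "script") and "src" in attrs:
                self.assets.append(attrs["src"])
            elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
                self.assets.append(attrs["href"])

    def cache_page(url, cache_dir="offline_cache"):
        os.makedirs(cache_dir, exist_ok=True)
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        with open(os.path.join(cache_dir, "index.html"), "w", encoding="utf-8") as f:
            f.write(html)                      # keep a local copy of the page itself
        collector = AssetCollector()
        collector.feed(html)
        for asset in collector.assets:
            asset_url = urllib.parse.urljoin(url, asset)
            name = os.path.basename(urllib.parse.urlparse(asset_url).path) or "asset"
            try:
                urllib.request.urlretrieve(asset_url, os.path.join(cache_dir, name))
            except OSError:
                pass                           # best-effort: skip unreachable assets

    if __name__ == "__main__":
        cache_page("https://example.com/networked-book/chapter1.html")

The hard parts, of course, are everything this sketch ignores: rewriting links to point at the cache, deciding how much of the network to pull in, and keeping the copy in sync once you reconnect.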

the people’s card catalog (a thought)

New partners and new features. Google has been busy lately building up Book Search. On the institutional end, Ghent, Lausanne and Mysore are among the most recent universities to hitch their wagons to the Google library project. On the user end, the GBS feature set continues to expand, with new discovery tools and more extensive “about” pages gathering a range of contextual resources for each individual volume.
Recently, they extended this coverage to books that haven’t yet been digitized, substantially increasing the findability, if not yet the searchability, of thousands of new titles. The about pages are similar to Amazon’s, which supply book browsers with things like concordances, “statistically improbable phrases” (tags generated automatically from distinct phrasings in a text), textual statistics, and, best of all, hot-linked lists of references to and from other titles in the catalog: a rich bibliographic network of interconnected texts (Bob wrote about this fairly recently). Google’s pages do much the same thing but add other valuable links to retailers, library catalogs, reviews, blogs, scholarly resources, Wikipedia entries, and other relevant sites around the net (an example). Again, many of these books are not yet full-text searchable, but collecting these resources in one place is highly useful.
It makes me think, though, how sorely an open source alternative to this is needed. Wikipedia already has reasonably extensive articles about various works of literature. Library Thing has built a terrific social architecture for sharing books. And there are a great number of other freely accessible resources around the web: scholarly database projects, public domain e-libraries, CC-licensed collections, library catalogs.
Could this be stitched together into a public, non-proprietary book directory, a People’s Card Catalog? A web page for every book, perhaps in wiki format, with detailed bibliographic profiles, history, links, citation indices, social tools, visualizations, and ideally a smart graphical interface for browsing it. In a network of books, each title ought to have a stable node to which resources can be attached and from which discussions can branch. So far Google is leading the way in building this modern bibliographic system, and stands to turn the card catalog of the future into a major advertising cash nexus. Let them do it. But couldn’t we build something better?
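To make the thought slightly more concrete, here is a hypothetical sketch of the record that might sit behind each book’s “stable node”; every field name is invented for illustration, not an existing schema:

    # A hypothetical record for one node in a "people's card catalog".
    # Field names are invented for illustration; this is not a real schema.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class BookNode:
        title: str
        authors: List[str]
        identifiers: Dict[str, str] = field(default_factory=dict)    # e.g. ISBN, LCCN, OCLC number
        subject_headings: List[str] = field(default_factory=list)    # controlled-vocabulary terms
        cites: List[str] = field(default_factory=list)               # links out to other book nodes
        cited_by: List[str] = field(default_factory=list)            # links back from other book nodes
        external_resources: List[str] = field(default_factory=list)  # library records, reviews, wiki pages
        discussions: List[str] = field(default_factory=list)         # branching conversation threads

    # a minimal example entry (the heading shown is illustrative, not verified LC data)
    node = BookNode(
        title="The Oxford Guide to Library Research",
        authors=["Thomas Mann"],
        subject_headings=["Library research"],
    )

The interesting design question is less the record itself than who gets to edit each field: the citation links might be machine-harvested, the subject headings imported from library data, and the discussions left entirely to the community.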

belgian news sites don cloak of invisibility

In an act of stunning shortsightedness, a consortium of 19 Belgian newspapers has sued and won a case against Google for copyright infringement in its News Search engine. Google must now remove all links, images and cached pages of these sites from its database or else face fines. Similar lawsuits from other European papers are likely to follow soon.
The main beef in the case (all explained in greater detail here) is Google’s practice of deep linking to specific articles, which bypasses ads on the newspapers’ home pages and reduces revenue. The other complaint concerns Google’s caching of full articles for search purposes, copies that the newspapers contend could be monetized through a pay-for-retrieval service. Echoes of the Book Search lawsuits on this side of the Atlantic…
What the Belgians are in fact doing is rendering their papers invisible to a potentially global audience. Instead of lashing out against what is essentially a free advertising service, why not rethink your own ad structure to account for the fact that more and more readers today are coming through search engines and not your front page? While you’re at it, rethink the whole idea of a front page. Or better yet, join forces with other newspapers, form your own federated search service and beat Google at its own game.

national archives sell out

This falls into the category of deeply worrying. In a move reminiscent of last year’s shady Smithsonian-Showtime deal, the U.S. National Archives has signed an agreement with Footnote.com to digitize millions of public domain historical records — stuff ranging from the papers of the Continental Congress to Mathew B. Brady’s Civil War photographs — and to make them available through a commercial website. They say the arrangement is non-exclusive, but it’s hard to see how this is anything but a terrible deal.
Here’s a picture of the paywall:

[image: nationalarchivespaywall.jpg – a screenshot of the Footnote.com paywall]

Dan Cohen has a good run-down of why this should set off alarm bells for historians (thanks, Bowerbird, for the tip). Peter Suber has the open access take: “The new Democratic Congress should look into this problem. It shouldn’t try to undo the Footnote deal, which is better than nothing for readers who can’t get to Washington. But it should try to swing a better deal, perhaps even funding the digitization and OA directly.” Absolutely. (Actually, they should undo it. Scrap it. Wipe it out.) Digitization should not become synonymous with privatization.
Elsewhere in mergers and acquisitions, the University of Texas at Austin is the newest partner in the Google library project.

unbound – google publishing conference at NYPL

Interesting bit of media industry theater. I’m here in the New York Public Library, one of the city’s great temples to the book, where Google has assembled representatives of the book business to undergo a kind of group massage. A pep talk for a timorous publishing industry that has only barely dipped its toes in the digital marketplace and can’t decide whether to regard Google as savior or executioner. The speaker roster is a mix of leading academic and trade publishers and diplomatic envoys from the techno-elite. Chris Anderson, Cory Doctorow, Seth Godin, and Tim O’Reilly have all spoken. The 800-pound elephant in the room is of course the lawsuits brought against Google (still in the discovery phase) by the Association of American Publishers and the Authors Guild over its library digitization program. Doctorow and O’Reilly broached it briefly and you could feel the pulse in the room quicken. Doctorow: “We’re a cabal come over from the west coast to tell you you’re all wrong!” A ripple of laughter, acid-laced. A little while ago Michael Holdsworth of Cambridge University Press pointed to statistics suggesting that Book Search is driving up their sales… Some grumble that the publishers’ panel was a little too hand-picked.
Google’s tactic here seems to be simultaneously to reassure the publishers and to instill an undercurrent of fear. Reassure them that releasing more of their books in a greater variety of forms will lead to more sales (true) — and frighten them that the technological train is speeding off without them (also true, though I say that without the ecstatic determinism of the Google folks. Jim Gerber, Google’s main liaison to the publishing world, opened the conference with a love song to Moore’s Law and the gigapixel camera, explaining that we’re within a couple decades’ reach of having a handheld device that can store all the content ever produced in human history — as if this fact alone should drive the future of publishing).
The event feels more like a workshop in web marketing than a serious discussion about the future of publishing. It’s hard to swallow all the marketing speak: “maximizing digital content capabilities,” “consolidation by market niche”; a lot of talk of “users” and “consumers,” but not a whole lot about readers. Publishers certainly have a lot of catching up to do in the area of online commerce, but barely anyone here is engaging with the bigger questions of what it means to be a publisher in the network era. O’Reilly sums up the publisher’s role as “spreading the knowledge of innovators.” That is more interesting, and O’Reilly is undoubtedly doing more than almost any commercial publisher to rethink the reading experience. But most of the discussion here is about a culture industry that seems more concerned with salvaging the industry part of itself than with honestly rethinking the cultural part. The last speaker just wrapped up and it’s cocktail time. More thoughts tomorrow.

has google already won?

Rich Skrenta, an influential computer industry insider, currently co-founder and CEO of Topix.net and formerly a big player at Netscape, thinks it has, crowning Google king of the “third age of computing” (IBM and Microsoft being the first and second). Just the other day, there was a bit of discussion here about whether Google is becoming a bona fide monopoly — not only by dint of its unrivaled search and advertising network, but through the expanding cloud of services that manage our various personal communication and information needs. Skrenta backs up my concern (though he mainly seems awed and impressed) that with time, reliance on these services (not just by individuals but by businesses and organizations of all sizes) could become so total that there will effectively be no other choice:

Just as Microsoft used their platform monopoly to push into vertical apps, expect Google to continue to push into lucrative destination verticals — shopping searches, finance, photos, mail, social media, etc. They are being haphazard about this now but will likely refine their thinking and execution over time. It’s actually not inconceivable that they could eventually own all of the destination page views too. Crazy as it sounds, it’s conceivable that they could actually end up owning the entire net, or most of what counts.

The meteoric ascendance of the Google brand — synonymous in the public mind with best, quickest, smartest — and the huge advantage the company has gained by becoming “the start page for the Internet” mean that its continued dominance is all but assured. “Google is the environment.” Others here think these predictions are overblown. To me they sound frighteningly plausible.

the ambiguity of net neutrality

The Times comes out once again in support of network neutrality, with hopes that the soon-to-be Democrat-controlled Congress will make decisive progress on that front in the coming year.
Meanwhile in a recent Wired column, Larry Lessig, also strongly in favor of net neutrality but at the same time hesitant about the robust government regulation it entails, does a bit of soul-searching about the landmark antitrust suit brought against Microsoft almost ten years ago. Then too he came down on the side of the regulators, but reflecting on it now he says he might have counseled differently had he known about the potential of open source (i.e. Linux) to rival the corporate goliath. He worries that a decade from now he may arrive at similar regrets, when alternative network strategies like community or municipal broadband may by then have emerged as credible competition to the telcos and cable companies. Still, seeing at present no “Linus Torvalds of broadband,” he decides to stick with regulation.
Network neutrality shouldn’t be trumpeted uncritically, and it’s healthy and right for leading advocates like Lessig to air their concerns. But I think he goes too far in saying he was flat-out wrong about Microsoft in the late 90s. Even with the remarkable success of Linux, Microsoft’s hegemony across personal and office desktops seems more or less unshaken a decade after the DOJ intervened.
Allow me to add another wrinkle. What probably poses a far greater threat to Microsoft than Linux is the prospect of a web-based operating system of the kind that Google is becoming, a development that can only be hastened by the preservation of net neutrality, since it lets Google continue to claim an outsized portion of last-mile bandwidth at a bargain rate, allowing it to grow and prosper all the more rapidly. What seems like an obvious good to most reasonable people might end up opening the door wider for the next Microsoft. This is not an argument against net neutrality, simply a consideration of the complexity of getting what we wish and fight for. Even if we win, there will be other fights ahead. United States vs. Google?

people-powered search (part 1)

Last week, the London Times reported that the Wikipedia founder, Jimbo Wales, was announcing a new search engine called “Wikiasari.” This search engine would incorporate a new type of social ranking system and would rival Google and Yahoo in potential ad revenue. When the news first got out, the blogosphere went into a frenzy, with many sites echoing inaccurate information – mostly out of excitement – and causing a good deal of confusion. Some sites even printed dubious screenshots of what they thought was the search engine.
Alas, there were no real screenshots and there was no search engine… yet. Yesterday, unable to make any sense of what was going on by reading the blogs, I looked through the developer mailing list and found this post by Jimmy Wales:

The press coverage this weekend has been a comedy of errors. Wikiasari was not and is not the intended name of this project… the London Times picked that off an old wiki page from back in the day when I was working on the old code base and we had a naming contest for it. […] And then TechCrunch ran a screenshot of something completely unrelated, thus unfortunately perhaps leading people to believe that something is already built and about to be unveiled. No, the point of the project is to build something, not to unveil something which has already been built.

And on the Wikia search webpage he explains why:

Search is part of the fundamental infrastructure of the Internet. And, it is currently broken. Why is it broken? It is broken for the same reason that proprietary software is always broken: lack of freedom, lack of community, lack of accountability, lack of transparency. Here, we will change all that.

So there is no Google-killer just yet, but something is brewing.
From the details that we have so far, we know that this new search engine will be funded by Wikia Inc., Wales’s for-profit, ad-driven MediaWiki hosting company. We also know that the search technology will be based on Nutch and Lucene – the same technology that powers Wikipedia’s search. And we know that the search engine will allow users to directly influence search results.
I found it interesting that on the Wikia “about” page, Wales suggests that he has yet to make up his mind on how things are going to work, so suggestions appear to be welcome.
Also, during the frenzy, I managed to find many interesting technologies that I think might be useful in making a new kind of search engine. Now that a dialog appears to be open and there is good reason to believe a potentially competitive search engine could be built, current experimental technologies might play an important role in the development of Wikia’s search. Some questions that I think might be useful to ponder are:
Can current social bookmarking tools, like del.icio.us, provide a basis for determining “high quality” sites? Would using Wikipedia and its engine for citing external sites make sense for determining “high quality” links? Would a Digg-like rating system produce spamless results, or simply lowbrow ones? Would a search engine dependent on tagging, but with no spider, be useful? But the question I am most interested in is whether large-scale manual indexing could lay the foundation for what might turn into the Semantic Web (Web 3.0). Or maybe just Web 2.5?
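As a purely hypothetical sketch (not Wikia’s actual design; the weights, fields and URLs are invented), here is roughly what letting community votes adjust a Lucene-style keyword score might look like:

    # A hypothetical sketch of "people-powered" re-ranking: blend a keyword
    # relevance score (as a Lucene/Nutch-style index might supply) with
    # community votes. Weights and fields are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class Result:
        url: str
        keyword_score: float   # relevance from the underlying full-text index
        upvotes: int = 0       # community endorsements, Digg/del.icio.us style
        downvotes: int = 0     # flags for spam or low-quality pages

    def adjusted_score(r: Result, vote_weight: float = 0.3) -> float:
        # damp the vote signal so a burst of votes cannot completely swamp
        # textual relevance -- one crude defense against ballot-stuffing
        net = r.upvotes - r.downvotes
        return (1 - vote_weight) * r.keyword_score + vote_weight * (net / (abs(net) + 10))

    def rerank(results):
        return sorted(results, key=adjusted_score, reverse=True)

    results = [
        Result("http://example.org/tribute-lists", 0.82, upvotes=40, downvotes=2),
        Result("http://example.com/spam-farm", 0.91, upvotes=1, downvotes=55),
    ]
    for r in rerank(results):
        print(r.url, round(adjusted_score(r), 3))

Whether any such formula survives contact with real spammers is exactly the kind of open question raised above.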
The most obvious and most difficult challenge for Wikia, besides coming up with a good name and solid technology, will be dealing with the sheer size of the internet.
I’ve found that open-source communities are never as large or as strong as they appear. Wikipedia is one of the largest and most successful online collaborative projects, yet just over 500 people make over 50% of all edits and about 1400 make about 75% of all edits. If Wikia’s new search engine does not attract a large group of users to help index the web early on, this project will not survive; a strong online community, possibly of a magnitude we’ve never seen before, might be necessary to ensure that people-powered search is of any use.

future of the filter

An article by Jon Pareles in the Times (December 10th, 2006) brings to mind some points that have been raised here throughout the year. One is the “corporatization” of user-generated content; the other is what to do with all the material resulting from the constant production/dialogue that is taking place on the Internet.
Pareles summarizes the acquisitions of MySpace by Rupert Murdoch’s News Corporation and of YouTube by Google with remarkable clarity:

What these two highly strategic companies spent more than $2 billion on is a couple of empty vessels: brand-named, centralized repositories for whatever their members decide to contribute.

As he puts it, this year will be remembered as the year in which old-line media, online media and millions of individual web users agreed. I wouldn’t use the term “agreed,” but they definitely came together as the media giants saw the financial possibilities of the individual self-expression generated on the Web. As usually happens with independent creative products, much of the art that originates on websites such as MySpace and YouTube borrows freely and gets distributed and promoted outside the traditional for-profit mechanisms. As Pareles says, “it’s word of mouth that can reach the entire world.” Nonetheless, the new acquisitions will bring a profit for some while the rest supply material for free. But problems arise when part of that production uses copyrighted material. While we have artists fighting immorally to extend copyright laws, we have Google paying copyright holders for material used on YouTube, but also fighting them.
The Internet has allowed for the democratization of creation and distribution; it has given the anonymous a public presence while providing virtual meeting places for all groups of people. The flattening of the wax cylinder into a portable, engraved surface that produced sound when played with a needle brought the music hall, the clubs and the cabarets into the home, but it also gave rise to the entertainment business. Now the CD burner, the MP3, and online tools have brought the recording studio into the home. Interestingly enough, far from promoting isolation, the Internet has generated dialogue. YouTube is not merely a place for watching dubious videos; it is also a repository of individual reactions. Something similar is happening with film, photography and books. But what to do with all that? Pareles sees the proliferation of blogs and user-generated playlists as a sort of filter from which the media moguls are profiting: “Selection, a time-consuming job, has been outsourced. What’s growing is the plentitude not just of user-generated content, but also of user-filtered content.” But he adds, “Mouse-clicking individuals can be as tasteless, in the aggregate, as entertainment professionals.” What is going to happen as private companies become the holders of those filters?