why google and yahoo love wikipedia

From Dan Cohen’s excellent Digital Humanities Blog comes a discussion of the Wikipedia story that Cohen claims no one seems to be writing about, namely the question of why Google and Yahoo give so much free server space and bandwidth to Wikipedia. Cohen points out that there’s more going on here than just the open source ethos of these tech companies: in fact, the two companies are becoming increasingly dependent on Wikipedia as a resource, both as something to repackage for commercial use (in sites such as Answers.com), and as a major component in the programming of search algorithms. Cohen writes:
Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11” and “Laura”). You would still run into some disambiguation problems (“Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization.
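To make Cohen’s example concrete, here is a minimal sketch of the idea in Python. It does not call Yahoo’s Term Extraction service; it swaps in a plain word count, and the two "entries" are short made-up stand-ins rather than real Wikipedia text. The function names, the tiny stopword list, and the "senior"/"junior" labels are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "is",
             "his", "he", "with", "as", "on", "for", "by", "at"}

def terms(text):
    """Split text into lowercase word tokens and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]

def distinguishing_terms(entry_a, entry_b):
    """Return (terms found only in entry_a, terms found only in entry_b)."""
    a, b = Counter(terms(entry_a)), Counter(terms(entry_b))
    only_a = {t for t in a if t not in b}
    only_b = {t for t in b if t not in a}
    return only_a, only_b

def classify(document, only_a, only_b, labels=("senior", "junior")):
    """Label a document by which set of distinguishing terms it hits more."""
    counts = Counter(terms(document))
    score_a = sum(counts[t] for t in only_a)
    score_b = sum(counts[t] for t in only_b)
    if score_a == score_b:
        return "ambiguous"  # shared terms such as "Iraq" cannot decide
    return labels[0] if score_a > score_b else labels[1]

# Made-up stand-ins for the two encyclopedia entries.
SENIOR_ENTRY = ("George Bush was president when the Berlin Wall fell. "
                "His wife is Barbara. Iraq, Saddam Hussein and Dick Cheney "
                "figure in his term.")
JUNIOR_ENTRY = ("George Bush was president on September 11. "
                "His wife is Laura. Iraq, Saddam Hussein and Dick Cheney "
                "figure in his term.")

only_sr, only_jr = distinguishing_terms(SENIOR_ENTRY, JUNIOR_ENTRY)

print(classify("Bush spoke about the Berlin Wall with Barbara at his side.",
               only_sr, only_jr))   # senior
print(classify("Laura Bush recalled September 11.", only_sr, only_jr))  # junior
print(classify("Bush and Dick Cheney discussed Iraq and Saddam Hussein.",
               only_sr, only_jr))   # ambiguous
```

The third call shows the disambiguation problem Cohen mentions: terms that appear in both entries carry no signal, so a document that contains only those terms cannot be assigned to either president. A real system would presumably weight terms by frequency and work with phrases rather than single words.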
Cohen’s observation is a valuable reminder that all of the discussion of Wikipedia’s accuracy and usefulness as an academic tool is really only skimming the surface of how and why the open-source encyclopedia is reshaping the way knowledge is made and accessed. Ultimately, the question of whether or not Wikipedia should be used in the classroom might be less important than whether, or how, it is used in the boardroom, by companies whose function is to repackage, reorganize and return “the people’s knowledge” to the people at a tidy profit.

7 thoughts on “why google and yahoo love wikipedia”

  1. dave munger

    Isn’t the point of all this open-source stuff precisely that: all this knowledge is incredibly valuable, and locking it up under copyright prevents people from gaining access to that knowledge?
    If Wikipedia were developed by a corporation, then its knowledge would be sold, instead of given away. Search engines would either be more expensive or less effective than they are now.

  2. bob stein

    i’m not a fan of corporate control of knowledge, much less capitalism, but i don’t understand the thrust of the question being raised. what is the problem if google and yahoo add sufficient value to open-source materials such that people want to access them through these for-profit sites? just as with the adam stacey/gamma issue, a crucial goal of open-source is to enable people to transform knowledge in new and useful ways. as long as capitalism is the dominant mode, corporations are going to play in the sandbox. if you don’t want to play with them, you’ll have to work in a more restricted, proprietary space.

  3. lisa lynch

    yes, corporations are going to play in the sandbox. the question is whether they gradually take over the sandbox and control the distribution of toys. In this instance, what Cohen is pointing to is the fact that Google has a vested interest in keeping Wikipedia going, something that gets lost in the debate over whether “we” (citizen, scholars, citizen-scholars) should trust Wikipedia or not. So the question becomes (as I think Cohen is suggesting) whether the past month’s debate about the “relevance” of Wikipedia — and about its accuracy relative to, say, Britannica — is somehow missing the point. And we also need to wonder if the specific needs of Google and/or Yahoo will, in time, begin to determine the shape of Wikipedia.

  4. dave munger

    Lisa, I think you’re right in that we need to consider carefully the direction in which Wikipedia is developing. Even now rules are being created which limit who can edit and create new posts, and in the future Wikipedia may become even more restrictive. If Yahoo and Google control the purse strings, then perhaps content on Wikipedia will be influenced by them.
    OTOH, isn’t this always the way it is with organizations formed for the “public good” — libraries, museums, concert halls, and the like? Big donors get rooms, collections, box seats named after them. If R.J. Reynolds donates a wing to a hospital, it’s a safe bet that not much research on lung cancer will be done there.
    I think the key is not to turn down the money, but to make sure that Reynolds isn’t the only corporate sponsor of hospitals out there. There should be dozens of Wikipedia-like sites on the Web, each working in its own way to bring as much knowledge as possible to the public domain.

  5. lisa lynch

    I couldn’t agree more with that last point, Dave. I’m curious about whether, for example, The Digital Universe, which has been billed as an alternative to Wikipedia, is going to be able to take root. Unlike Wikipedia, Digital Universe is based on a for-profit model in which anyone can access one layer of content but subscribers get a different, more expanded layer.
    Given that, just today, Jimmy Wales announced that Wikipedia might start accepting ads, the standard notion of open source, freely accessible information is going to be changing in many ways. Some of these changes, as you correctly note, will be the predictable result of corporate money colliding with the “organizations formed for the public good.” But there’s some interesting interdependence going on here that makes the situation a bit different: corporate donors don’t need hospital wings in the way that Google needs Wikipedia, and thus Google might have an interest in making sure alternative Wikis aren’t tangling up its search algorithms.

  6. Parker

    Seems to me that what Google and Yahoo! are doing is using Wikipedia as the cheapest R & D on the planet. They supply some bandwidth and server space–both of which are cheap to share when dealing in bulk like both companies do–and in return get both better search capabilities and content that is (we hope) good enough to keep people coming back to search again.
    Wikipedia AdSense will be very disconcerting at first, don’t you think?
