Category Archives: Libraries, Search and the Web

the book is reading you

I just noticed that Google Book Search requires users to be logged in on a Google account to view pages of copyrighted works.
[image: Google Book Search account login prompt]
They provide the following explanation:

Why do I have to log in to see certain pages?
Because many of the books in Google Book Search are still under copyright, we limit the amount of a book that a user can see. In order to enforce these limits, we make some pages available only after you log in to an existing Google Account (such as a Gmail account) or create a new one. The aim of Google Book Search is to help you discover books, not read them cover to cover, so you may not be able to see every page you’re interested in.

So they’re tracking how much we’ve looked at and capping our number of page views. Presumably a bone tossed to publishers, who I’m sure will continue suing Google all the same (more on this here). There’s also the possibility that publishers have requested information on who’s looking at their books — geographical breakdowns and stats on click-throughs to retailers and libraries. I doubt, though, that Google would share this sort of user data. Substantial privacy issues aside, that’s valuable information they want to keep for themselves.
That’s because “the aim of Google Book Search” is also to discover who you are. It’s capturing your clickstreams, analyzing what you’ve searched for and the terms you’ve used to get there. The book is reading you. Substantial privacy issues aside (it seems more and more that’s where we’ll be leaving them), Google will use this data to refine its search algorithms and, who knows, might even develop some sort of personalized recommendation system similar to Amazon’s — you know, where the computer lists other titles that might interest you based on what you’ve read, bought or browsed in the past (a system that works only if you are logged in). It’s possible Google is thinking of Book Search as the cornerstone of a larger venture that could compete with Amazon.
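Just to make the mechanism concrete (this is pure speculation about the plumbing, not anything Google has disclosed), capping page views per account is trivial once readers are logged in: a counter keyed on user and book. A minimal sketch in Python, with every name and limit invented:

```python
# Hypothetical sketch of per-account page-view capping for a copyrighted title.
# Nothing here reflects Google's actual system; names and limits are invented.
from collections import defaultdict

PAGE_VIEW_LIMIT = 20  # invented cap per user per book

# (user_id, book_id) -> set of page numbers already shown
views = defaultdict(set)

def can_view(user_id: str, book_id: str, page: int) -> bool:
    """Allow a page only while the user stays under the per-book cap."""
    seen = views[(user_id, book_id)]
    if page in seen:                      # already served; no extra cost
        return True
    if len(seen) >= PAGE_VIEW_LIMIT:      # cap reached for this user and book
        return False
    seen.add(page)                        # log the view (the "clickstream")
    return True
```

The interesting part, of course, isn't the cap itself but the log it leaves behind.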
There are many ways Google could eventually capitalize on its books database — that is, beyond the contextual advertising that is currently its main source of revenue. It might turn the scanned texts into readable editions, hammer out licensing agreements with publishers, and become the world’s biggest ebook store. It could start a print-on-demand service — a Xerox machine on steroids (and the return of Google Print?). It could work out deals with publishers to sell access to complete online editions — a searchable text to go along with the physical book — as Amazon announced it will do with its Upgrade service. Or it could start selling sections of books — individual pages, chapters etc. — as Amazon has also planned to do with its Pages program.
Amazon has long served as a valuable research tool for books in print, so much so that some university library systems are now emulating it. Recent additions to the Search Inside the Book program such as concordances, interlinked citations, and statistically improbable phrases (where distinctive terms in the book act as machine-generated tags) are especially fun to play with. Although first and foremost a retailer, Amazon feels more and more like a search system every day (and its A9 engine, though seemingly always on the back burner, is also developing some interesting features). On the flip side, Google, though a search system, could start feeling more like a retailer. In either case, you’ll have to log in first.
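If you're curious how "statistically improbable phrases" might work under the hood, the basic idea is to rank phrases that turn up far more often in one book than in a background body of text. Amazon hasn't published its method, so the toy sketch below only illustrates that general frequency-ratio idea:

```python
# Toy illustration of "statistically improbable phrases": rank two-word phrases
# that are unusually frequent in one text relative to a background corpus.
# This is not Amazon's method, just the general frequency-ratio idea.
import re
from collections import Counter

def bigrams(text: str) -> Counter:
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

def improbable_phrases(book: str, corpus: str, top: int = 10):
    book_counts, corpus_counts = bigrams(book), bigrams(corpus)
    scores = {
        phrase: count / (corpus_counts[phrase] + 1)   # +1 smooths unseen phrases
        for phrase, count in book_counts.items()
        if count > 2                                   # ignore one-off pairs
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]
```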

who owns the network?

Susan Crawford recently floated the idea of the internet network (see comments 1 and 2) as a public trust that, like America’s national parks or seashore, requires the protection of the state against the undue influence of private interests.

…it’s fine to build special services and make them available online. But broadband access companies that cover the waterfront (literally — are interfering with our navigation online) should be confronted with the power of the state to protect entry into this self-owned commons, the internet. And the state may not abdicate its duty to take on this battle.

Others argue that a strong government hand will create as many problems as it fixes, and that only true competition between private, municipal and grassroots parties — across not just broadband, but multiple platforms like wireless mesh networks and satellite — can guarantee a free net open to corporations and individuals in equal measure.
Discussing this around the table today, Ray raised the important issue of open content: freely available knowledge resources like textbooks, reference works, scholarly journals, media databases and archives. What are the implications of having these resources reside on a network that increasingly is subject to control by phone and cable companies — companies that would like to transform the net from a many-to-many public square into a few-to-many entertainment distribution system? How open is the content when the network is in danger of becoming distinctly less open?

digital universe and expert review

The notion of expert review has been tossed around in the open-content community for a long time. Philosophically, those who lean towards openness tend to sneer at the idea of formalized expert review, trusting in the multiplied consciousness of the community to maintain high standards through less formal processes. Wikipedia is obviously the most successful project in this mode. The informal process has the benefit of speed, and avoids bureaucracy — something which raises the barrier to entry and keeps out people who just don’t have the time to deal with ‘process.’
The other side of that coin is the belief that experts and editors encourage civil discourse at a high level; without them you’ll end up with mob rule and lowest common denominator content. Editors encourage higher quality writing and thinking. Thinking and writing better than others is, in a way, the definition of expert. In addition, editors and experts tend to have a professional interest in the subject matter, as well as access to better resources. These are exactly the kind of people who are not discouraged by higher barriers to entry, and they are, by extension, the people that you want to create content on your site.
Larry Sanger thinks that, anyway. A Wikipedia co-founder, he gave an interview on news.com about a project that plans to create a better Wikipedia, using a combination of open content development and editorial review: The Digital Universe.

You can think of the Digital Universe as a set of portals, each defined by a topic, such as the planet Mars. And from each portal, there will be links to the best resources on the Web, including a lot of resources of different kinds that are prepared by experts and the general public under the management of experts. This will include an encyclopedia, as well as public domain books, participatory journalism, forums of various kinds and so forth. We’ll build a community of experts and an online collaborative network of independent organizations, each of which has authority over its own discipline to select material and to build resources that are together displayed through a single free-information platform.

I have experience with the editor model from my time at About.com. The About.com model is based on ‘guides’—nominal (and sometimes actual) experts on a chosen topic (say NASCAR, or anesthesiology)—who scour the internet, find good resources, and write articles and newsletters to facilitate understanding and keep communities up to date. The guides were overseen by a bevy of editors, who tended mostly to enforce the quotas for newsletters and set the line on quality. About.com has its problems, but it was novel and successful during its time.
The Digital Universe model is an improvement on the single guide model; it encourages a multitude of people to contribute to a reservoir of content. Measured by available resources, the Digital Universe model wins, hands down. As with all large, open systems, emergent behaviors will add even more to the system in ways that we cannot predict. The Digital Universe will have its own identity and quality, which, according to the blueprint, will be further enhanced by expert editors, shaping the development of a topic and polishing it to a high gloss.
Full disclosure: I find the idea of experts “managing the public” somehow distasteful, but I am compelled by the argument that this will bring about a better product. Sanger’s essay on eliminating anti-elitism from Wikipedia clearly demonstrates his belief in the ‘expert’ methodology. I am willing to go along, mindful that we should be creating material that not only leads people to the best resources, but also allows them to engage more critically with the content. This is what experts do best. However, I’m pessimistic about experts mixing it up with the public. There are strong, and as I see it, opposing forces in play: an expert’s reputation vs. public participation, industry cant vs. plain speech, and one expert opinion vs. another.
The difference between Wikipedia and the Digital Universe comes down, fundamentally, to the importance placed on authority. We’ll see what shape the Digital Universe takes as the stresses of maintaining an authoritative process clash with the anarchy of the online public. I think we’ll see that adopting authority as your rallying cry is a volatile position in a world of empowered authorship and a universe of alternative viewpoints.

questions about blog search and time

Does anyone know of a good way to search for old blog entries on the web? I’ve just been looking at some of the available blog search resources and few of them appear to provide any serious advanced search options. The couple of major ones I’ve found that do (after an admittedly cursory look) are Google and Ice Rocket. Both, however, appear to be broken, at least when it comes to dates. I’ve tried them on three different browsers, on Mac and PC, and in each case the date menus seem to be frozen. It’s very weird. They give you the option of entering a specific time range but won’t accept the actual dates. Maybe I’m just having a bad tech day, but it’s as if there’s some conceptual glitch across the web vis-à-vis blogs and time.
Most blog search engines are geared toward searching the current blogosphere, but there should be a way to research older content. My first thought was that blog search engines crawl RSS feeds, most of which do not transmit the entirety of a blog’s content, just the most recent posts. That would pose a problem for archival search.
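It's easy enough to test that hunch against any given feed: a typical RSS file exposes only the latest dozen or so items, so a crawler that reads feeds alone never sees the archives. A rough sketch using Python's standard library (the feed URL is just a placeholder):

```python
# Rough check of how little of a blog an RSS feed actually exposes.
# Uses only the standard library; the URL is an arbitrary placeholder.
import urllib.request
import xml.etree.ElementTree as ET

def feed_items(url: str):
    with urllib.request.urlopen(url) as response:
        root = ET.parse(response).getroot()
    # RSS 2.0 keeps entries under channel/item; most feeds carry only 10-20 of them.
    return [(item.findtext("title"), item.findtext("pubDate"))
            for item in root.iter("item")]

if __name__ == "__main__":
    items = feed_items("http://example.com/feed.xml")
    print(f"{len(items)} items in the feed; everything older is invisible to a feed-only crawler")
```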
Does anyone know what would be the best way to go about finding, say, old blog entries containing the keywords “new orleans superdome” from late August to late September 2005? Is it best to just stick with general web search and painstakingly comb through for blogs? If we agree that blogs have become an important kind of cultural document, then surely there should be a way to find them more than a month after they’ve been written.

librivox — free public domain books read aloud by volunteers

Just read a Dec. 16th Wired article about Canadian Hugh McGuire’s brilliant new venture, Librivox. Librivox is creating and distributing free audiobooks by asking volunteers to create audio files of works of literature in the public domain. The files are hosted on the Internet Archive and are available in MP3 and OGG formats.
Thus far, Librivox — which has only been up for a few months — has recorded about 30 titles, relying on dozens of volunteers. The website promotes the project as the “acoustical liberation of the public domain” and claims that the ultimate goal is to liberate all public domain works of literature. For now, titles cataloged on the website include L. Frank Baum’s The Wizard of Oz, Joseph Conrad’s The Secret Agent and the U.S. Constitution.
Using Librivox couldn’t be easier: clicking on an entry brings you to a screen that links to a Wikipedia entry on the book in question, the Project Gutenberg e-text of the book, an alternate Zip file of the book, and the Librivox audio version, available chapter by chapter with the name of each volunteer reader noted prominently next to the chapter information.
I listened to parts of about a half-dozen book chapters to get a sense of the quality of the recordings, and I was impressed. The volunteers have obviously chosen books they are passionate about, and the recordings are lively, quite clear and easy to listen to. As a regular audiobook listener, I was struck by the fact that while most literary audiobooks are read by authors who tend to work hard at conveying a sense of character, the Librivox selections seemed to convey, more than anything, the reader’s passion for the text itself; i.e., for the written word. Here at the Institute we’ve been spending a fair amount of time trying to figure out when a book loses its book-ness, and I’d argue that while some audiobooks blur the boundary between book and performance, the Librivox books remind us that a book reduced to a stream of digitally produced sound can still be very much a book.
The site’s definitely worth a visit, and, if you’ve got a decent voice and a few spare hours, there’s information about how to become a volunteer reader yourself. And finally, don’t miss the list of other audiolit projects on the lower right-hand corner of the homepage: there are many voices out there, reading many books — including Japanese Classical Literature For Bedtime, if you’re so inclined.

google book search debated at american bar association

Last night I attended a fascinating panel discussion at the American Bar Association on the legality of Google Book Search. In many ways, this was the debate made flesh. Making the case against Google were high-level representatives from the two entities that have brought suit, the Authors’ Guild (Executive Director Paul Aiken) and the Association of American Publishers (VP for legal counsel Allan Adler). It would have been exciting if Google, in turn, had sent representatives to make their case, but instead we had two independent commentators, law professor and blogger Susan Crawford and Cameron Stracher, also a law professor and writer. The discussion was vigorous, at times heated — in many ways a preview of arguments that could eventually be aired (albeit under a much stricter clock) in front of federal judges.
The lawsuits in question center around whether Google’s scanning of books and presenting tiny snippet quotations online for keyword searches is, as they claim, fair use. As I understand it, the use in question is the initial scanning of full texts of copyrighted books held in the collections of partner libraries. The fair use defense hinges on this initial full scan being the necessary first step before the “transformative” use of the texts, namely unbundling the book into snippets generated on the fly in response to user search queries.
[image: a sample of Google Book Search snippets]
…in case you were wondering what snippets look like
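And in case you were also wondering how such snippets might be generated on the fly, the general technique is simply keyword-in-context excerpting: index the full text, then cut a small window around each match at query time. The sketch below illustrates only that generic idea, not Google's actual system:

```python
# Illustrative keyword-in-context snippet generation over a full text.
# This shows the general technique only; it is not Google's implementation.
import re

def snippets(text: str, query: str, window: int = 60, limit: int = 3):
    """Return up to `limit` short excerpts surrounding matches of `query`."""
    out = []
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        out.append("..." + text[start:end].replace("\n", " ") + "...")
        if len(out) == limit:
            break
    return out
```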
At first, the conversation remained focused on this question, and during that time it seemed that Google was winning the debate. The plaintiffs’ arguments seemed weak and a little desperate. Aiken used carefully scripted language about not being against online book search, just wanting it to be licensed, quipping “we’re just throwing a little gravel in the gearbox of progress.” Adler was a little more strident, calling Google “the master of misdirection,” using the promise of technological dazzlement to turn public opinion against the legitimate grievances of publishers (of course, this will be settled by judges, not by public opinion). He did score one good point, though, saying Google has betrayed the weakness of its fair use claim in the way it has continually revised its description of the program.
Almost exactly one year ago, Google unveiled its “library initiative” only to re-brand it several months later as a “publisher program” following a wave of negative press. This, however, did little to ease tensions and eventually Google decided to halt all book scanning (until this past November) while they tried to smooth things over with the publishers. Even so, lawsuits were filed, despite Google’s offer of an “opt-out” option for publishers, allowing them to request that certain titles not be included in the search index. This more or less created an analog to the “implied consent” principle that legitimates search engines caching web pages with “spider” programs that crawl the net looking for new material.
In that case, there is a machine-to-machine communication taking place, and web page owners are free to post instructions (a robots.txt file or a “noarchive” meta tag) telling spiders not to index or cache their pages, or can simply place certain content behind a firewall. By offering an “opt-out” option to publishers, Google enables essentially the same sort of communication. Adler’s point (and this was echoed more succinctly by a smart question from the audience) was that if Google’s fair use claim is so air-tight, then why offer this middle ground? Why all these efforts to mollify publishers without actually negotiating a license? (I am definitely concerned that Google’s efforts to quell what probably should have been an anticipated negative reaction from the publishing industry will end up undercutting its legal position.)
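The web-caching analogy is worth spelling out, because the machine-to-machine communication really is that simple: a site owner publishes a robots.txt file, and a well-behaved crawler checks it before fetching or caching a page. Python's standard library even includes a parser for these files; the example URLs below are, of course, made up:

```python
# How the web's implied-consent regime works in practice: a crawler consults
# robots.txt before fetching. The URLs below are only examples.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# A publisher-style opt-out is just a rule the crawler agrees to honor.
if rp.can_fetch("ExampleBot", "http://example.com/private/book-page.html"):
    print("allowed to fetch and index")
else:
    print("owner has opted out; skip this page")
```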
Crawford came back with some nice points, most significantly that the publishers were trying to make a pretty egregious “double dip” into the value of their books. Google, by creating a searchable digital index of book texts — “a card catalogue on steroids,” as she put it — and even generating revenue by placing ads alongside search results, is making a transformative use of the published material and should not have to seek permission. Google had a good idea. And it is an eminently fair use.
And it’s not Google’s idea alone; they just had it first and are using it to gain a competitive advantage over their search engine rivals, who, in their turn, have tried to get in on the game with the Open Content Alliance (which, incidentally, has decided not to make a stand on fair use as Google has, and is doing all its scanning and indexing in the context of license agreements). Publishers, too, are welcome to build their own databases and to make them crawl-able by search engines. Earlier this week, Harper Collins announced it would be doing exactly that with about 20,000 of its titles. Aiken and Adler say that if anyone can scan books and make a search engine, then all hell will break loose and millions of digital copies will be leaked onto the web. Crawford shot back that this lawsuit is not about net security issues, it is about fair use.
But once the security cat was let out of the bag, the room turned noticeably against Google (perhaps due to a preponderance of publishing lawyers in the audience). Aiken and Adler worked hard to stir up anxiety about rampant ebook piracy, even as Crawford repeatedly tried to keep the discussion on course. It was very interesting to hear, right from the horse’s mouth, that the Authors’ Guild and AAP both are convinced that the ebook market, tiny as it currently is, is within a few years of exploding, pending the release of some sort of iPod-like gadget for text. At that point, they say, Google will have gained a huge strategic advantage off the back of appropriated content.
Their argument hinges on the fourth determining factor in the fair use exception, which evaluates “the effect of the use upon the potential market for or value of the copyrighted work.” So the publishers are suing because Google might be cornering a potential market!!! (Crawford goes further into this in her wrap-up) Of course, if Google wanted to go into the ebook business using the material in their database, there would have to be a licensing agreement, otherwise they really would be pirating. But the suits are not about a future market, they are about creating a search service, which should be ruled fair use. If publishers are so worried about the future ebook market, then they should start planning for business.
To echo Crawford, I sincerely hope these cases reach the court and are not settled beforehand. Larger concerns about Google’s expansionist program aside, I think they have made a very brave stand on the principle of fair use, the essential breathing space carved out within our over-extended copyright laws. Crawford reminded the room that intellectual property is NOT like physical property, over which the owner has nearly unlimited rights. Copyright is a “temporary statutory monopoly” originally granted (“with hesitation,” Crawford adds) in order to incentivize creative expression and the production of ideas. The internet scares the old-guard publishing industry because it poses so many threats to the security of their product. These threats are certainly significant, but they are not the subject of these lawsuits, nor are they Google’s, or any search engine’s, fault. The rise of the net should not become a pretext for limiting or abolishing fair use.

wikipedia update: author of seigenthaler smear confesses

According to a Dec. 11 New York Times article, Daniel Brandt, a book indexer who runs the site Wikipedia Watch, helped to flush out the man who posted the false biography of USA Today and Freedom Forum founder John Seigenthaler on Wikipedia. After Brandt discovered that the post had been made from a small delivery company in Nashville, the man in question — 38-year-old Brian Chase — sent a letter of apology to Seigenthaler and resigned from his job as operations manager at the company.
According to the Times, Chase claims that he didn’t realize that Wikipedia was used as a serious research tool: he posted the information to shock a co-worker who was familiar with the Seigenthaler family. Seigenthaler, who complained in a USA Today editorial last week about the protections afforded to the “volunteer vandals” who post anonymously in cyberspace, told the New York Times that he would not seek damages from Chase.
Responding to the fallout from Seigenthaler’s USA Today editorial, Wikipedia founder Jimmy Wales changed Wikipedia’s policies so that users must now register before creating new articles. But, as Brandt shows, it takes work to remain anonymous in cyberspace. Though I’m not sure that I believe Chase’s professed astonishment that anyone would take his post seriously (why else would it shock his co-worker?), it seems clear that he didn’t think what he was doing was so outrageous that he ought to make a serious effort to hide his tracks.
Meanwhile, Wales has become somewhat irked by Seigenthaler’s continuing attacks on Wikipedia. Posting to the threaded discussion of the issue on the mailing list of the Association of Internet Researchers, Wikipedia’s founder expressed exasperation about Seigenthaler’s telling the Associated Press this morning that “Wikipedia is inviting [more regulation of the internet] by its allowing irresponsible vandals to write anything they want about anybody.” Wales wrote:
*sigh* Facts about our policies on vandalism are not hard to come by. A statement like Seigenthaler’s, a statement that is egregiously false, would not last long at all at Wikipedia.
For the record, it is just absurd to say that Wikipedia allows “irresponsible vandals to write anything they want about anybody.”
–Jimbo

the poetry archive – nice but a bit mixed up

Last week U.K. Poet Laureate Andrew Motion and recording producer Richard Carrington rolled out The Poetry Archive, a free (sort of) web library that aims to be “the world’s premier online collection of recordings of poets reading their work” — “to help make poetry accessible, relevant and enjoyable to a wide audience.” The archive naturally focuses on British poets, but offers a significant selection of English-language writers from the U.S. and the British Commonwealth countries. Seamus Heaney is serving as president of the archive.
For each poet, a few streamable mp3s are available, including some rare historic recordings dating back to the earliest days of sound capture, from Robert Browning to Langston Hughes. The archive also curates a modest collection of children’s poetry, and invites teachers to use these and other recordings in the classroom, also providing tips for contacting poets so schools, booksellers and community organizations (again, this is focused on Great Britain) can arrange readings and workshops. Some of this advice seems useful, but it reads more like a public relations/education services page on a publisher’s website. Is this a public archive or a poets’ guild?
The Poetry Archive is a nice resource as both historic repository and contemporary showcase, but the mission seems a bit muddled. They say they’re an archive, but it feels more like a CD store.
[image: screenshot of The Poetry Archive website]
Throughout, the archive seems an odd mix of public service and professional leverage for contemporary poets. That’s all well and good, but it could stand a bit more of the former. Beyond the free audio offerings (which are quite skimpy), CDs are available for purchase that include a much larger selection of recordings. The archive is non-profit, and they seem to be counting in significant part on these sales to maintain operations. Still, I would add more free audio, and focus on selling individual recordings and playlists as downloads — the iTunes model. Having streaming teasers and for-sale CDs as the only distribution models seems wrong-headed, and a bit disingenuous if they are to call themselves an archive. It would also be smart to sell subscriptions to the entire archive, with institutional rates for schools. Podcasting would also be a good idea — a poem a day to take with you on your iPod, weaving poetry into daily life.
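Mechanically, podcasting would ask very little of them: it's just an RSS feed whose items carry audio enclosures, which a few lines of code can generate. A minimal sketch of that idea (all titles, URLs and file sizes are invented placeholders):

```python
# Minimal sketch of a poem-a-day podcast feed: RSS 2.0 items with audio
# enclosures. All URLs, titles and sizes are invented placeholders.
from xml.sax.saxutils import escape

def podcast_feed(episodes):
    items = "\n".join(
        f"""  <item>
    <title>{escape(title)}</title>
    <enclosure url="{escape(mp3_url)}" length="{size}" type="audio/mpeg"/>
  </item>""" for title, mp3_url, size in episodes
    )
    return f"""<?xml version="1.0"?>
<rss version="2.0">
<channel>
  <title>Poem of the Day</title>
  <link>http://example.org/poems</link>
  <description>One recorded poem each day</description>
{items}
</channel>
</rss>"""

print(podcast_feed([("Ozymandias", "http://example.org/audio/ozymandias.mp3", 2048000)]))
```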
There’s a growing demand on the web for the spoken word, from audiobooks and podcasts to performed poetry. The archive would probably do a lot better if it made more of its collection free, and at the same time provided a greater variety of ways to purchase recordings.

tipping point?

An article by Eileen Gifford Fenton and Roger C. Schonfeld in this morning’s Inside Higher Ed claims that over the past year, libraries have accelerated the transition towards purchasing only electronic journals, leaving many publishers of print journals scrambling to make the transition to an online format:
Faced with resource constraints, librarians have been required to make hard choices, electing not to purchase the print version but only to license electronic access to many journals — a step more easily made in light of growing faculty acceptance of the electronic format. Consequently, especially in the sciences, but increasingly even in the humanities, library demand for print has begun to fall. As demand for print journals continues to decline and economies of scale of print collections are lost, there is likely to be a tipping point at which continued collecting of print no longer makes sense and libraries begin to rely only upon journals that are available electronically.
According to Fenton and Schonfeld, this imminent “tipping point” will be a good thing for larger publishing houses which have already begun to embrace an electronic-only format, but smaller nonprofit publishers might “suffer dramatically” if they don’t have the means to convert to an electronic format in time. If they fail, and no one is positioned to help them, “the alternative may be the replacement of many of these journals with blogs, repositories, or other less formal distribution models.”
Fenton and Schonfeld’s point that electronic distribution might substantially change the format of some smaller journals echoes other expressions of concern about the rise of “informal” academic journals and repositories, mainly voiced by scientists who worry about the decline of peer review. Most notably, the Royal Society of London issued a statement on Nov. 24 warning that peer-reviewed scientific journals were threatened by the rise of “open access journals, archives and repositories.”
According to the Royal Society, the main problem in the sciences is that government and nonprofit funding organizations are pressing researchers to publish in open-access journals, in order to “stop commercial publishers from making profits from the publication of research that has been funded from the public purse.” While this is a noble principle, the Society argued, it undermines the foundations of peer review and compels scientists to publish in formats that might be unsustainable:
The worst-case scenario is that funders could force a rapid change in practice, which encourages the introduction of new journals, archives and repositories that cannot be sustained in the long term, but which simultaneously forces the closure of existing peer-reviewed journals that have a long-track record for gradually evolving in response to the needs of the research community over the past 340 years. That would be disastrous for the research community.
There’s more than a whiff of resistance to change in the Royal Society’s citing of 340 years of precedent; more to the point, however, their position statement downplays the depth of the fundamental opposition between the open access movement in science and traditional journals. As Roger Chartier notes in a recent issue of Critical Inquiry, “Two different logics are at issue here: the logic of free communication, which is associated with the ideal of the Enlightenment that upheld the sharing of knowledge, and the logic of publishing based on the notion of author’s rights and commercial gain.”
As we’ve discussed previously on if:book, the fate of peer review in the electronic age is an open question: as long as peer review is tied to the logic of publishing, its fate will be determined at least as much by the still-evolving market for electronic distribution as by the needs of the various research communities which have traditionally valued it as a method of assessment.