google offers public domain downloads

Google announced today that it has made free downloadable PDFs available for many of the public domain books in its database. This is a good thing, but there are several problems with how they’ve done it. The main problem is that these PDFs aren’t actually text; they’re simply strings of images from the scanned library books. As a result, you can’t select and copy text, nor can you search the document unless, of course, you do it online in Google. So while public access to these books is a big win, Google still has us locked into the system if we want to take advantage of these books as digital texts.
A small note about the public domain. Editions are key. A large number of books scanned so far by Google have contents in the public domain, but are in editions published after the cut-off (I think we’re talking 1923 for most books). Take this 2003 Signet Classic edition of Darwin’s The Origin of Species. Clearly a public domain text, but the book is in “limited preview” mode on Google because the edition contains an introduction written in 1958. Copyright experts out there: is it just this that makes the book off limits? Or is the whole edition somehow copyrighted?
Other responses from Teleread and Planet PDF, which has some detailed suggestions on how Google could improve this service.

8 thoughts on “google offers public domain downloads”

  1. Eddie A. Tejeda

    Your small note on editions being key is very interesting.
    I’ve not looked into this very deeply, but does Google always provide the public domain version of a text if it exists? Or do they scan what is available to them and provide the “limited preview” on texts which might be open but were republished later?
    Is it possible to side step this issue by providing the original text of books, which are public, even if scanned from a later edition, by removing publisher specific content, like the introduction?
    Also, I do not understand what you mean by: “[these] restrictions..[are] not legal code but digital code.” Is Google allowed to show texts of later editions, but chooses not to? Does that restriction make sense? Doesn’t that leave Google open to competition in the most trivial area? Is that where the demand is?

  2. bowerbird

    ben said:
    > The main thing is that
    > these PDFs aren’t actually text,
    > they’re simply strings of images
    > from the scanned library books.
    ok, so we need to get to work doing
    o.c.r. on them. that isn’t _nearly_
    as hard as scanning them. geez, ben,
    do you expect google just to hand you
    everything, on a nice silver platter?
    use your head.
    remember that google is a _business_,
    operating in a space where they have
    _competitors_, who are _not_ offering
    a ton of money to do all this scanning
    which benefits society…
    if google were simply to release the text,
    their competitors could just incorporate
    that text into their own search engines.
    is that fair? i don’t think so.
    i agree that it would be very nice to have
    this text in digital form, and thus i am
    willing to put energy where my desires are,
    by joining a collaborative distributed effort
    to turn these scans into digital text, so that
    it can be used in a variety of ways…
    but i refuse to ungratefully complain that google
    hasn’t given me _lemonade_, when they have just
    dropped a ton of lemons on my front lawn.
    instead, i’m gonna send them a thank-you note,
    and then prepare myself to do some squeezing…
    > Google still has us locked into
    > the system if we want to take advantage
    > of these books as digital texts.
    don’t be ridiculous. they haven’t
    “locked” anyone into a darn thing.
    stop bellyaching, and show initiative;
    we’ve got a lot of lemons to squeeze…
    -bowerbird

  3. ben vershbow

    Eddie said:

    Is it possible to side step this issue by providing the original text of books, which are public, even if scanned from a later edition, by removing publisher specific content, like the introduction?

    Maybe in theory, but that could get dangerous for Google, which is already being sued left and right as it is. Their fair use claim for building the books database will be much stronger if they bend over backwards to respect copyright everywhere else.
    And your last question. You were right to be confused — the point I was making was confused and I’ve taken it out (and changed the title of the post). Sometimes you write fast and it all makes sense at the time… ah, blogs.
    That excised text, for the record:

    In a way, Google is creating an extra catch with its digital editions of these print editions. The restrictions here are not legal code but digital code. The system itself is proprietary. Naturally, Google has every right to do this, but it’s important to keep in mind what public domain means, and to be aware of the layers (of law and code) through which one accesses it.

    And point taken, Bowerbird, about communal efforts to make use of these PDFs. Let’s all download GOCR and get cracking.

  4. bowerbird

    ben said:
    > And point taken, Bowerbird, about
    > communal efforts to make use of these PDFs.
    i’m glad you could tell i am good-humored. :+)
    > Let’s all download GOCR and get cracking.
    that’s probably not a good idea, unless somehow
    gocr has made some tremendous leaps and bounds
    that i am unaware of. the o.c.r. task is simple,
    and can be done by people with good o.c.r. apps
    (the best is abbyy finereader) as a batch process.
    i’m guessing that brewster kahle will facilitate
    a community-wide o.c.r. effort via the o.c.a.
    where the collaborative effort is necessary is
    in the clean-up of the o.c.r. results.
    i’ve prototyped a program to help people do that,
    and i’ll release it to the public down the line,
    but for the short-term, i recommend that people
    go over to distributed proofreaders and help them.
    > http://www.pgdp.net
    realize, however, that their workflow processes
    are absolutely horrendous, so don’t get ingrained
    that that is “the” way to do things. it’s not.
    still, d.p. is good practice at collaborating…
    -bowerbird

  5. Jesse Wilbur

    Despite the fact that Google is a business and is NOT out to serve the public interest, they are getting these books from public universities, and are scanning texts of public domain works. Doesn’t this put things more squarely in the realm of public interest? In this light, Google is providing public material taken from public institutions, but only searchable from within their own system. This is a clear example of privatization of the commons, as far as I can see.
    I’m not arguing that Google’s proffered service isn’t good – it is a big pile of juicy lemons. But I am suggesting that in exchange for the access to all these public domain and (especially) non-public domain works, Google could provide the searchable text for public domain works, retaining the right to withhold searchable text for works that haven’t passed out of copyright. Public lemonade from public lemons. This, to me, seems to be well within reason for Ben to suggest.

  6. Sylvia Thornley

    I read your post with interest and felt compelled to post a response to tell you about http://www.ultrapedia.com – we publish the Recognized version of a scanned public domain book.
    So how does the Recognized version of a book differ from the book formats already on Google Book Search? We preserve the original format of the book, including graphics and photos; so the book appears as though it was recently created on a computer, not printed centuries ago. Rather different from scanned images of a book or plain text!
    We’ve been digitizing books for years as a hobby and when I first looked at GBS, I thought it was great but….
    http://www.ultrapedia.com – is a full text and full retrieval search engine focused on delivering the Recognized version of a scanned book that is out of copyright; that you can read online or download in PDF format.
    The system is still being developed and we’re still ironing out the kinks; for example, the books aren’t spellchecked yet. The next phase is to publish several different versions of the books. These include:
    • Recognized version of the complete book – all pages – rather than single pages like those indexed at the moment.
    • Recognized and layered version of the complete book.
    • Recognized, layered, and spellchecked version of the complete book – the ones online at the moment aren’t spellchecked.
    • Recognized version of the complete book with a copy of the original PDF embedded page-for-page in its own independent layer.
    • Collated version of the original PDF and the recognized version in a twin-view window for side-by-side correction.
    • Integrating our own public domain library into the mix.

  7. bowerbird

    good for you, sylvia!
    you still have a lot of work to do,
    but the journey of a thousand miles
    and all that rot…
    contact me if you’d like to chat,
    as i’m doing something quite similar.
    i’m at bowerbird@aol.com
    ***
    and as this old thread has been revived,
    i will also note that, as of this month,
    google is now making the o.c.r. results
    — i.e., the text — publicly available
    for the books it considers public domain.
    (as we know, they judge conservatively.)
    they haven’t made it _convenient_ yet —
    there’s no facility for downloading the
    text on a full-book basis, just by-page
    — and it appears their decision is based
    on concerns around _accessibility_ for
    the visually-impaired, not a more-general
    notion that public-domain text _belongs_
    to the public. and the quality of the o.c.r
    is _abysmal_… but nonetheless, the text
    is now available to us. meaning that i was
    wrong when i said they wouldn’t release text.
    i’m very happy to have been wrong in this case.
    -bowerbird

  8. wanderer

    Thanks, bowerbird – we have created a beta site at http://www.pointmore.com as a jumping off point to introduce a browseable version of the Ultrapedia Library alongside the traditional Ultrapedia Search Engine. To help keep track of the fusion (or is it confusion) of these two vast information resources we have also incorporated the Ultrapedia Blog, and the Ultrapedia Support Forum.
    To sum up: the Ultrapedia Search Engine has been designed to deliver single pages from the Ultrapedia Library, and the Ultrapedia Library Browser has been designed so you can browse through the library and download books, if you are a registered user.
