Actually a good idea. I always wondered why Google didn't use OCR software to read text in images instead of forcing everyone to use text-based navigation. I am guessing the re-release is to get the software up to par for search.

That's got scope FAR beyond just reading text in images.

Think if Google could generate a kind of screenshot of a page, and OCR that. Instant removal of all hidden text. Could give a better idea of page structure and relative font sizes (and therefore the importance of different areas of text) without trying to analyse all the HTML/CSS going on in the background.
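A rough sketch of the hidden-text part of that idea, in Python. It assumes the screenshot has already been OCR'd into a plain string (`ocr_text`), so no OCR engine is called here. The names `hidden_words` and `words` are made up for illustration, and a real version would need fuzzy matching to tolerate OCR misreads.

```python
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects every text node in the markup, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def words(text):
    """Lower-cased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hidden_words(html, ocr_text):
    """Words present in the markup but absent from the OCR'd rendering.

    ocr_text is assumed to be the OCR output for a screenshot of the page.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    markup_words = words(" ".join(parser.chunks))
    return markup_words - words(ocr_text)


# Hypothetical page: the second block is hidden by CSS, so it never
# shows up in the rendered screenshot.
page = '<p>Cheap flights</p><div style="display:none">casino poker pills</div>'
print(sorted(hidden_words(page, "Cheap flights")))  # ['casino', 'pills', 'poker']
```

Anything the markup contains but the rendering doesn't show gets flagged, with no need to interpret the CSS that hides it.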

Not sure how practical that idea is, but with the comment in the story that they are looking for top notch OCR engineers, and the mention of multi-column layouts etc, they are looking at more than just scanning in some books....

Reading text in navigation graphics is extremely tricky

due to the huge variety of non-standard fonts (and their idiosyncratic modifications by designers) being used. Moreover, all OCR software requires a critical mass of graphical text input to learn to convert it reliably.

It's actually what captchas are all about: while there are some algos around that may enable you to auto-convert them, last time I looked they were limited to a very restricted spectrum of captcha generation algos and not too reliable even then.

So I'd expect Goo to focus this project on their public domain books digitization instead.


Unless I'm missing something here, the only benefit to this is that Tesseract is open source. It is:

  • "not nearly as accurate" as commercial programs
  • "will perform poorly on multi-column material"
  • "doesn't do well on gray scale or color documents"

    Benefit to most folks? Not much. Benefit to Google? No desk licenses.

  • Not free as in really free

    I believe the license says it's for research purposes only. If you were planning on using it for commercial purposes at work, sorry you're out of luck.

    I'm doing some OCR work right now, and all my stuff is gray scale. And a lot of it's tables and multicolumn. So, nice try, but those two things make it another useless OSS project, not another OSS killer app.

    For public domain texts

    geared to a general audience, i.e. probably single column, no tables, etc. (think the works of Shakespeare, Conan Doyle, Mark Twain, etc.) it might just about do the trick even with a less than optimal performance rating.

    Of course, Goo isn't exactly famous for their willingness to pump serious money into technology, hardware and software licenses, vide their prevailing reliance on thousands of cheap Linux boxes rather than going for mainframe tech. (Yes, they seem to be heading that way now, but after how long?)

    Could well be a corporate culture thing.

    And yes: this particular OCR thingy seems to be quite useless to anyone else. Beta, beta all the way. Then again, they may simply be linkbaiting again ... :-)

    I thought they already did some OCR. It was years ago when I saw it. I keep thinking they had it on Froogle. I vaguely remember a Levi's ad where the text in the picture was highlighted.

    Keep the conspiracy theories simple. More evidence of a serious interest in captchas.



    Or maybe Google just found something way better, so they are throwing out a reminder for this junk to misdirect the competition.

    part of a bigger picture

    such as described here

    Nice summary

    of Google's activities:

    I don't believe Google is doing well in most of its brand expansions. Gtalk beat by Skype, Orkut beat by Myspace, Gvideo beat by YouTube, Blogger beat by Typepad, Google Finance, Froogle, blogsearch, etc. all beat by other companies. So I am not scared based upon an analysis of their past success record. Others have shown that Google is beatable outside the core of text search.

    To be fair, he might have mentioned AdWords and AdSense on the plus side, but that doesn't refute his basic argument.

    But this is certainly beyond dispute: "Google has many many many enemies these days and not just Yahoo, Microsoft, eBay, Amazon, etc."

