Bobinas P4G

Conversation

Notices

  1. ilja (ilja@ilja.space)'s status on Sunday, 13-Feb-2022 20:28:09 UTC
    in reply to
    • Amolith
    @amolith I have no experience with it, but there's an OCR bot on fedi that uses https://github.com/tesseract-ocr/tesseract, and its docs say it also does PDF.

    (bot repo is https://github.com/Lynnesbian/OCRbot/ )
    In conversation Sunday, 13-Feb-2022 20:28:09 UTC from ilja.space permalink

    Attachments

    1. GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)
    2. GitHub - Lynnesbian/OCRbot: An OCR (Optical Character Recognition) bot for Mastodon (and compatible) instances
    • Amolith (amolith@nixnet.social)'s status on Sunday, 13-Feb-2022 20:28:10 UTC
      Does anyone know good tools for applying OCR to a scanned document?

      I don't want any organisation, tagging, cloud, etc. features, just something akin to imagemagick that reads the PDF, does the OCR magick, then spits out the OCRed PDF to a separate file.
      In conversation Sunday, 13-Feb-2022 20:28:10 UTC permalink
      Tagomago repeated this.
    • Dan Jones (danjones000@fedi.absturztau.be)'s status on Monday, 14-Feb-2022 08:28:13 UTC
      in reply to
      • Amolith

      @amolith

      I used to do this for a living.

      I built a cloud based document management system that would take scanned pages, OCR them, and store them as PDFs.

      We used Google Cloud Vision, which was overkill, but my CEO had a hard-on for Google.

      Tesseract should be all you need: https://guides.library.illinois.edu/c.php?g=347520&p=4121426

      That said, may I suggest using DjVu instead of PDF? DjVu is a better archival format. It's much simpler and usually results in smaller file sizes. Many PDF viewers already support it. But I don't know exactly what your use case is, so that may not be an option.

      In conversation Monday, 14-Feb-2022 08:28:13 UTC permalink

      Attachments

      1. LibGuides: Introduction to OCR and Searchable PDFs: Using Tesseract
        from Scholarly Commons
        Learn OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract with this guide.
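
For reference, the Tesseract route suggested in this thread can be sketched in Python as below. This is a minimal sketch under assumptions, not something taken from any of the posts: it assumes the `tesseract` command-line tool is installed and on PATH, that the scan is already an image file (Tesseract reads images such as PNG or TIFF rather than PDFs, so a scanned PDF would first have to be converted to page images with another tool), and that English is the right language. The function names `build_tesseract_command` and `ocr_to_pdf` and the file names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> list[str]:
    """Build the argument list for: tesseract <image> <output_base> -l <lang> pdf

    The trailing "pdf" asks Tesseract to write a searchable PDF.
    Tesseract appends ".pdf" to output_base itself.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_pdf(image: Path, output_base: Path, lang: str = "eng") -> Path:
    """Run Tesseract on one image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Append ".pdf" to the base name rather than replacing an existing suffix.
    return output_base.with_name(output_base.name + ".pdf")
```

Calling `ocr_to_pdf(Path("scan.png"), Path("scan_ocr"))` would then write the result to a separate file, `scan_ocr.pdf`, and leave the original scan untouched, which is roughly the "read, OCR, write to a new file" behaviour asked for above.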

Bobinas P4G is a social network. It runs on GNU social, version 2.0.1-beta0, available under the GNU Affero General Public License.

All Bobinas P4G content and data are available under the Creative Commons Attribution 3.0 license.