Conversation

Notices

ilja (ilja@ilja.space)'s status on Sunday, 13-Feb-2022 20:28:09 UTC ilja
in reply to
- Amolith
@amolith I have no experience with it, but there's an OCR bot on fedi that uses https://github.com/tesseract-ocr/tesseract, and its docs say it also does PDF

(bot repo is https://github.com/Lynnesbian/OCRbot/ )
In conversation Sunday, 13-Feb-2022 20:28:09 UTC from ilja.space permalink
Attachments
1. GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)
2. GitHub - Lynnesbian/OCRbot: An OCR (Optical Character Recognition) bot for Mastodon (and compatible) instances
- Amolith (amolith@nixnet.social)'s status on Sunday, 13-Feb-2022 20:28:10 UTC Amolith
  
  Does anyone know good tools for applying OCR to a scanned document?
  
  I don't want any organisation, tagging, cloud, etc. features, just something akin to imagemagick that reads the PDF, does the OCR magick, then spits out the OCRed PDF to a separate file.
  
  In conversation Sunday, 13-Feb-2022 20:28:10 UTC permalink
  
  Tagomago repeated this.
- Dan Jones (danjones000@fedi.absturztau.be)'s status on Monday, 14-Feb-2022 08:28:13 UTC Dan Jones
  in reply to
  - Amolith
  @amolith
  I used to do this for a living.
  I built a cloud based document management system that would take scanned pages, OCR them, and store them as PDFs.
  We used Google Cloud Vision, which was overkill, but my CEO had a hard-on for Google.
  Tesseract should be all you need: https://guides.library.illinois.edu/c.php?g=347520&p=4121426
  Although, may I suggest using DjVu instead of PDF? DjVu is a better archival format. It’s much simpler and usually results in smaller file sizes. Many PDF viewers already support it. But I don’t know exactly what your use case is, so that may not be an option.
  In conversation Monday, 14-Feb-2022 08:28:13 UTC permalink
  Attachments
  1. LibGuides: Introduction to OCR and Searchable PDFs: Using Tesseract
    
    from Scholarly Commons
    
    Learn OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract with this guide.
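
For anyone landing on this thread later, here is a minimal sketch of the Tesseract route suggested above, in Python. It assumes the `tesseract` binary is installed and on your PATH, and that the scan is already an image file (for example a TIFF or PNG); as far as I know Tesseract reads images rather than PDFs, so a scanned PDF would need its pages converted to images first. The function names and file names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Build the command line for: tesseract IMAGE OUTPUTBASE -l LANG pdf.

    The trailing "pdf" is a Tesseract config name that asks for a
    searchable PDF (page image plus a hidden text layer).
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_to_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the PDF it should have written.

    Assumes the tesseract binary is available; raises
    subprocess.CalledProcessError if the command fails.
    """
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
    # Tesseract appends the extension to the output base itself.
    return Path(f"{output_base}.pdf")
```

Calling `ocr_to_pdf("scan.tif", "scan_ocr")` should leave the original file alone and write the result to `scan_ocr.pdf`, which is roughly the "read, OCR, spit out a separate file" behaviour asked for at the top of the thread.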
