AUR (en) - ocrmypdf

Package Details: ocrmypdf 16.10.0-1

Git Clone URL:	https://aur.archlinux.org/ocrmypdf.git (read-only)
Package Base:	ocrmypdf
Description:	A tool to add an OCR text layer to scanned PDF files, allowing them to be searched
Upstream URL:	https://github.com/ocrmypdf/OCRmyPDF
Licenses:	MPL2
Submitter:	dreuter
Maintainer:	fbrennan (pigmonkey)
Last Packager:	pigmonkey
Votes:	129
Popularity:	2.63
First Submitted:	2014-01-27 11:36 (UTC)
Last Updated:	2025-02-28 03:49 (UTC)

Dependencies (21)

ghostscript
img2pdf (img2pdf-git^AUR)
pngquant
python (python37^AUR, python311^AUR, python310^AUR)
python-deprecation
python-importlib_resources
python-packaging
python-pdfminer
python-pikepdf
python-pillow (python-pillow-simd-git^AUR)
python-pluggy
python-reportlab
python-rich
python-tqdm
tesseract (tesseract-git^AUR)
unpaper (unpaper-git^AUR)
python-build (make)
python-hatch-vcs (make)
python-installer (make)
python-wheel (make)
jbig2enc^AUR (jbig2enc^AUR, jbig2enc-git^AUR) (optional) – Better compression algorithm; results in smaller PDF files

Required by (5)

Sources (1)

https://files.pythonhosted.org/packages/source/o/ocrmypdf/ocrmypdf-16.10.0.tar.gz

Pinned Comments

fbrennan commented on 2023-05-12 22:54 (UTC)

The flag was invalid and has been removed with no action taken as no new version was released. There's nothing to do for this package; no new release has been made. Rebuild, as @eclairevoyant has said.

Latest Comments

marlemion commented on 2018-11-06 09:18 (UTC) (edited on 2018-11-06 09:32 (UTC) by marlemion)

@bsdice: Thanks, but that did not help. Same error. On another machine, ocrmypdf is working. So it must be some issue on that machine...

BTW, ocrmypdf had been working for ages on that machine, but I had to hold back leptonica for other reasons, so it was stuck at a certain version for some time.

Found the problem: I had installed python2-jmespath-0.9.3-2, which installs /usr/bin/jp2.py. For some reason, Python picked up this jp2.py instead of /usr/lib/python3.x/site-packages/jp2.py. After removing python2-jmespath-0.9.3-2, it works. Still, this behaviour is irritating.
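
A likely explanation for the lookup order: when Python runs a script such as /usr/bin/ocrmypdf, it puts the script's own directory first on sys.path, so a stray jp2.py in /usr/bin is found before the copy in site-packages. The sketch below is only an illustration of how one might check which file an import resolves to. The helper names and the stand-in module name shadow_demo_mod are made up for this example and are not part of ocrmypdf or img2pdf.

```python
import importlib
import importlib.util
import os
import sys
import tempfile


def module_origin(name):
    """Return the file that `import name` would load, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def demo_shadowing():
    """Show that a file in the first sys.path entry wins the import lookup."""
    with tempfile.TemporaryDirectory() as script_dir:
        # Hypothetical stand-in for a stray /usr/bin/jp2.py.
        shadow = os.path.join(script_dir, "shadow_demo_mod.py")
        with open(shadow, "w") as fh:
            fh.write("# empty module used only for this demonstration\n")
        # Mimic what Python does for a script: its directory goes first.
        sys.path.insert(0, script_dir)
        try:
            importlib.invalidate_caches()
            origin = module_origin("shadow_demo_mod")
            return origin is not None and (
                os.path.realpath(origin) == os.path.realpath(shadow)
            )
        finally:
            sys.path.remove(script_dir)


if __name__ == "__main__":
    # On an affected machine this would print /usr/bin/jp2.py when run
    # from a script located in /usr/bin.
    print(module_origin("jp2"))
    print(demo_shadowing())
```

Calling module_origin("jp2") from the same environment as the failing command should show which jp2.py is being picked up.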

bsdice commented on 2018-11-06 09:16 (UTC)

@marlemion: Replace aur/img2pdf-git 0.2.1.r8.geedf73e-1 with the normal img2pdf 0.3.1-1 and see what happens. Something like: pacman -Rd img2pdf-git ; pacman -S --asdeps img2pdf

marlemion commented on 2018-11-06 08:57 (UTC)

I would like to update to the most recent version of ocrmypdf. Builds fine, but throws this error:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 11, in <module>
    load_entry_point('ocrmypdf==7.2.1', 'console_scripts', 'ocrmypdf')()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 484, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2725, in load_entry_point
    return ep.load()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2343, in load
    return self.resolve()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2349, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/lib/python3.7/site-packages/ocrmypdf/__main__.py", line 36, in <module>
    from ._pipeline import build_pipeline
  File "/usr/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 26, in <module>
    import img2pdf
  File "/usr/lib/python3.7/site-packages/img2pdf.py", line 28, in <module>
    from jp2 import parsejp2
ImportError: cannot import name 'parsejp2' from 'jp2' (/usr/bin/jp2.py)

img2pdf-git has been rebuilt. No effect.

fbrennan commented on 2018-10-02 03:52 (UTC)

I think lossy mode should still be selectable because it's only dangerous in certain situations and leads to really small files otherwise. It just shouldn't be default.

jbarlow commented on 2018-10-01 18:33 (UTC)

@bsdice: I'm aware of the JBIG2 6/8 issue. However, I never intended to enable lossy mode. I attribute the issue to the misleading help text of jbig2enc. I had to inspect the jbig2enc source to confirm that it would indeed select lossy encoding.

In any case, it is an easy fix to switch to lossless JBIG2, which still gets better results than CCITT G4, so I will do that in the next release. I haven't decided whether I will keep lossy mode.

Generally, it is best to report upstream issues to upstream, since users on distributions other than Arch Linux are affected. It so happens that I subscribe to the AUR comments, but ocrmypdf is deployed in a lot of places I don't follow.

@fbrennan: I recommend just waiting till the next version.

fbrennan commented on 2018-10-01 10:42 (UTC)

Should the PKGBUILD be changed to reflect the possible danger of jbig2enc?

bsdice commented on 2018-09-29 22:03 (UTC)

Here is a cautionary note for people using this AUR package for archival purposes:

The default of ocrmypdf is --optimize 1 ("do safe, lossless optimizations"). If you have jbig2enc installed, this means b/w documents will be re-encoded from CCITT G4 to JBIG2 in so-called "symbol mode", see https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/exec/jbig2enc.py#L42

Unfortunately, D. Kriesel has shown that JBIG2 can alter the contents of documents, e.g. by changing a "6" into an "8" due to their similarity at low resolution. In the aftermath, the German BSI (https://www.bsi.bund.de/DE/Publikationen/TechnischeRichtlinien/tr03138/index_htm.html), the Swiss KOST (https://kost-ceco.ch/cms/index.php?id=312,569,0,0,1,0), and maybe others have issued statements forbidding JBIG2 altogether for archiving legally relevant documents. Instead, they recommend continuing to use lossless CCITT G4 compression.

Users of this package should therefore use this tool with "--optimize 0" (do not optimize) until further notice. Upstream should use jbig2 only at "--optimize 4" ("do dangerous aggressive lossy optimizations"), which does not exist at this point.

The drawback of G4 is of course larger file sizes, but I prefer that over having to doubt every scanned document and wonder whether the numbers or letters are really what was printed in the original.
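
For anyone who wants to follow this advice from a script, here is a minimal sketch that always passes an explicit --optimize 0 to ocrmypdf. The function names and file names are placeholders invented for this example. It also assumes that the jbig2enc package installs an executable called jbig2, which is how the sketch checks whether the encoder is present at all.

```python
import shutil
import subprocess


def jbig2enc_available():
    """Return True if a `jbig2` executable is on PATH (assumed binary name)."""
    return shutil.which("jbig2") is not None


def build_command(input_pdf, output_pdf, optimize=0):
    """Build an ocrmypdf command line with an explicit --optimize level."""
    return ["ocrmypdf", "--optimize", str(optimize), input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf):
    """Run ocrmypdf with optimization disabled; raises if the tool fails."""
    subprocess.run(build_command(input_pdf, output_pdf), check=True)


if __name__ == "__main__":
    if jbig2enc_available():
        print("jbig2 found on PATH; passing --optimize 0 to skip re-encoding")
    # Placeholder file names; replace with real paths before running.
    print(" ".join(build_command("scan.pdf", "scan-ocr.pdf")))
```

Building the argument list separately from running it makes it easy to print and review the exact command before it touches any archival scans.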

sagittarius commented on 2018-09-06 08:52 (UTC) (edited on 2018-09-06 08:54 (UTC) by sagittarius)

@fbrennan No worries. We're sure you're doing your best and I'm very glad of it. And the least I can do is to report some issues as a user. My very little contribution. So thank you fbrennan. BTW, problem solved: v7.04 works great ;-)