Package Details: ocrmypdf 16.7.0-1

Git Clone URL: https://aur.archlinux.org/ocrmypdf.git
Package Base: ocrmypdf
Description: A tool to add an OCR text layer to scanned PDF files, allowing them to be searched
Upstream URL: https://github.com/ocrmypdf/OCRmyPDF
Licenses: MPL2
Submitter: dreuter
Maintainer: fbrennan (pigmonkey)
Last Packager: pigmonkey
Votes: 125
Popularity: 3.53
First Submitted: 2014-01-27 11:36 (UTC)
Last Updated: 2024-12-10 05:10 (UTC)

Pinned Comments

fbrennan commented on 2023-05-12 22:54 (UTC)

The flag was invalid and has been removed with no action taken, because no new version has been released. There is nothing to do for this package. Rebuild, as @eclairevoyant has said.

Latest Comments


marlemion commented on 2018-11-06 08:57 (UTC)

I would like to update to the most recent version of ocrmypdf. It builds fine, but running it throws this error:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 11, in <module>
    load_entry_point('ocrmypdf==7.2.1', 'console_scripts', 'ocrmypdf')()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 484, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2725, in load_entry_point
    return ep.load()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2343, in load
    return self.resolve()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2349, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/lib/python3.7/site-packages/ocrmypdf/__main__.py", line 36, in <module>
    from ._pipeline import build_pipeline
  File "/usr/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 26, in <module>
    import img2pdf
  File "/usr/lib/python3.7/site-packages/img2pdf.py", line 28, in <module>
    from jp2 import parsejp2
ImportError: cannot import name 'parsejp2' from 'jp2' (/usr/bin/jp2.py)

img2pdf-git has been rebuilt. No effect.
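
The last line of the traceback shows `jp2` being resolved from /usr/bin/jp2.py rather than from site-packages, which may mean that another file is shadowing the module img2pdf expects. A minimal sketch for checking where a top-level module name resolves, using only the standard library (the helper name `module_origin` is made up for illustration):

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # 'jp2' is the module named in the traceback; a path outside
    # site-packages would suggest that a different file is shadowing it.
    print(module_origin("jp2"))
```

The result depends on `sys.path`. When a script located in /usr/bin is executed, Python places that directory first on the search path, so the check is most informative when run under the same conditions as the failing command.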

fbrennan commented on 2018-10-02 03:52 (UTC)

I think lossy mode should still be selectable because it's only dangerous in certain situations and leads to really small files otherwise. It just shouldn't be default.

jbarlow commented on 2018-10-01 18:33 (UTC)

@bsdice: I'm aware of the JBIG2 6/8 issue. However, I never intended to enable lossy mode. I attribute the issue to the help text of jbig2enc being misleading. I had to inspect the jbig2enc source to confirm that it would indeed select lossy encoding.

In any case, it is an easy fix to switch to lossless JBIG2, which still gets better results than CCITT G4, so I will do that in the next release. I haven't decided whether I will keep lossy mode.

Generally it is best to report upstream issues to upstream, since users on platforms other than Arch Linux are affected. It so happens that I subscribe to the AUR comments, but ocrmypdf is deployed in a lot of places I don't follow.

@fbrennan: I recommend just waiting till the next version.

fbrennan commented on 2018-10-01 10:42 (UTC)

Should the PKGBUILD be changed to reflect the possible danger of jbig2enc?

bsdice commented on 2018-09-29 22:03 (UTC)

Here is a cautionary note for people using this AUR for archival purposes:

The default of ocrmypdf is --optimize 1 ("do safe, lossless optimizations"). If you have jbig2enc installed, this means b/w documents will be re-encoded from CCITT G4 to JBIG2 in so-called "symbol mode", see https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/exec/jbig2enc.py#L42

Unfortunately, it has been shown by D. Kriesel that JBIG2 is able to alter the contents of documents, e.g. by changing a "6" into an "8" due to their similarity at low resolution. In the aftermath, the German BSI (https://www.bsi.bund.de/DE/Publikationen/TechnischeRichtlinien/tr03138/index_htm.html), the Swiss KOST (https://kost-ceco.ch/cms/index.php?id=312,569,0,0,1,0), and maybe others have issued statements forbidding JBIG2 altogether for archiving legally relevant documents. Instead, it is recommended to keep using lossless CCITT G4 compression.

Users of this package should therefore use this tool with "--optimize 0" (do not optimize) until further notice. Upstream should use jbig2 only at "--optimize 4" ("do dangerous aggressive lossy optimizations"), which does not exist at this point.

The drawback of G4 is of course larger file sizes, but I prefer that over having to doubt, for every scanned document, whether the numbers and letters are really what was printed in the original.
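
The recommendation above amounts to passing `--optimize 0` explicitly on every run. A minimal sketch of how that could be wrapped in a script, assuming `ocrmypdf` is on the PATH; the helper name `build_ocrmypdf_command` and the file names are placeholders, not part of ocrmypdf:

```python
import shlex
from typing import List


def build_ocrmypdf_command(input_pdf: str, output_pdf: str, optimize: int = 0) -> List[str]:
    """Build an ocrmypdf command line with an explicit --optimize level (0 disables optimization)."""
    return ["ocrmypdf", "--optimize", str(optimize), input_pdf, output_pdf]


if __name__ == "__main__":
    # scan.pdf and scan_ocr.pdf are placeholder file names.
    cmd = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")
    # Print the command for inspection; it is not executed here.
    print(shlex.join(cmd))
```

To actually run the conversion, the resulting list can be passed to `subprocess.run(cmd, check=True)`.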

sagittarius commented on 2018-09-06 08:52 (UTC) (edited on 2018-09-06 08:54 (UTC) by sagittarius)

@fbrennan No worries. We're sure you're doing your best and I'm very glad of it. The least I can do as a user is to report some issues. My very little contribution. So thank you fbrennan. BTW, problem solved: v7.0.4 works great ;-)

fbrennan commented on 2018-09-04 14:33 (UTC)

@sagittarius Sorry, I am doing my best. I am new at this. I updated python-pikepdf; updating that package should solve your problem. I'll make sure that this never happens again. I forgot how strict it is about package versions.

@jbehmel Your question has already been answered. GitHub archives are unusable for AUR packages without hacks, because they don't include the .git directory, which is required by python-setuptools.

jbehmel commented on 2018-09-03 13:55 (UTC)

Hey, I was just wondering why you are not using this link: https://github.com/jbarlow83/OCRmyPDF/archive/v7.0.4.tar.gz