Package Details: ocrmypdf 16.8.0-1

Git Clone URL: https://aur.archlinux.org/ocrmypdf.git (read-only)
Package Base: ocrmypdf
Description: A tool to add an OCR text layer to scanned PDF files, allowing them to be searched
Upstream URL: https://github.com/ocrmypdf/OCRmyPDF
Licenses: MPL2
Submitter: dreuter
Maintainer: fbrennan (pigmonkey)
Last Packager: pigmonkey
Votes: 125
Popularity: 3.28
First Submitted: 2014-01-27 11:36 (UTC)
Last Updated: 2025-01-07 20:27 (UTC)

Pinned Comments

fbrennan commented on 2023-05-12 22:54 (UTC)

The flag was invalid and has been removed with no action taken, since no new version has been released. There is nothing to do for this package; rebuild it, as @eclairevoyant has said.

Latest Comments

ginkel commented on 2020-10-26 10:56 (UTC)

ocrmypdf currently fails to work with the recently updated python-pdfminer package. Downgrading the package to python-pdfminer-20200726-1 works around the issue for now.

pkg_resources.DistributionNotFound: The 'pdfminer.six!=20200720,<=20200726,>=20191110' distribution was not found and is required by ocrmypdf
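
The requirement string in that error can be read as three bounds on a date-style version number. As a minimal sketch (not how pkg_resources itself evaluates specifiers), the following plain-Python check treats each version as an integer date stamp. The helper name `satisfies` and the sample version 20201018 are illustrative assumptions, not taken from the package.

```python
import operator

# Two-character operators come first so that "<=" is not mistaken for "<".
OPERATORS = [
    ("!=", operator.ne),
    ("<=", operator.le),
    (">=", operator.ge),
    ("==", operator.eq),
    ("<", operator.lt),
    (">", operator.gt),
]

# Requirement copied from the error message above.
SPEC = "!=20200720,<=20200726,>=20191110"


def satisfies(version, spec=SPEC):
    """Return True if a date-style version meets every clause in spec.

    Hypothetical helper: it only understands plain integer date stamps
    such as '20200726', not general Python version strings.
    """
    value = int(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS:
            if clause.startswith(symbol):
                if not compare(value, int(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unrecognised clause: {clause!r}")
    return True


if __name__ == "__main__":
    for candidate in ("20200517", "20200720", "20200726", "20201018"):
        print(candidate, satisfies(candidate))
```

Under these assumptions, 20200726 passes the check, while 20200720 (explicitly excluded) and anything newer than 20200726 do not, which matches the downgrade workaround described above.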

pigmonkey commented on 2020-10-19 12:42 (UTC)

I still use the package, so I'm happy to continue updating or to step back. No preference.

fbrennan commented on 2020-10-18 23:02 (UTC)

Hello all.

I'm back to using Arch if pigmonkey no longer wants to maintain this package. :-)

But I think they've done a good job, so I can also just hand the package over to them. I could also do nothing, but now that I'm back, it can be confusing who is responsible for pushing updates.

Which would you prefer?

pigmonkey commented on 2020-10-14 22:36 (UTC)

tesseract-data-osd is included with the standard tesseract Arch package.

Looking at the "Required By" section of the tesseract-data-eng package, it does not appear that it is common for other Arch packages to list it as a dependency.

If this is confusing for users, I think it would be acceptable to add it as an optional dependency, so that there is an indication at the end of the install that another package might be needed. But it may be weird for non-English speakers if the package has an optional dependency on the English language pack, but not on whatever data pack is needed for the user's native language. I don't really want a 106-item optdepends array for every possible language pack.

jbarlow commented on 2020-10-14 07:07 (UTC)

OCRmyPDF assumes English unless a language is specified, for example with -l fra. So strictly speaking it works, but you have to pass the option every time. The test suite also assumes English is installed. I believe most package managers have added an explicit dependency on tesseract-data-eng, or whatever it's called on that system, but some have not.
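
For anyone scripting this, here is a rough sketch of passing the language on every call. It only builds and runs the command line; the function names and file names are made up for illustration, and it assumes ocrmypdf is on PATH and the matching tesseract data package is installed.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language=None):
    """Build the argument list for an ocrmypdf call.

    Hypothetical helper. If language is None, no -l option is passed
    and OCRmyPDF falls back to English, as described above.
    """
    command = ["ocrmypdf"]
    if language:
        command += ["-l", language]
    command += [input_pdf, output_pdf]
    return command


def run_ocrmypdf(input_pdf, output_pdf, language=None):
    """Run ocrmypdf, raising if the executable is missing or the run fails."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf was not found on PATH")
    command = build_ocrmypdf_command(input_pdf, output_pdf, language)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Example file names are placeholders.
    print(build_ocrmypdf_command("scan.pdf", "scan-ocr.pdf", language="fra"))
```

Wrapping the call like this is one way to avoid retyping the option; a shell alias would do the same job.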

I did poll users whether to default to the system language based on locale, but surprisingly non-English users didn't like the idea.

OCRmyPDF does assume tesseract-data-osd is installed so that should be a dependency if Arch breaks that out as a separate package.

pigmonkey commented on 2020-10-13 16:51 (UTC)

Tesseract does require a data package to be installed, but it does not have to be English. If a language is not specified, Tesseract does assume English, hence the error.

I don't think it's appropriate to include tesseract-data-eng as a dependency since that might not be the user's language.

ioan commented on 2020-10-13 13:45 (UTC)

ocrmypdf test.pdf test2.pdf
Tesseract failed to report available languages. Output from Tesseract:

Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
List of available languages (1):
osd

looks like it needs eng data by default
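
A quick way to confirm which language data is installed before running OCR is to ask Tesseract directly. The sketch below is only an assumption-laden illustration: it calls tesseract --list-langs and parses output shaped like the listing above; the function names are hypothetical, and other Tesseract versions may format the header line differently.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` style output.

    Assumes a header line starting with 'List of available languages'
    followed by one language code per line, as in the error above.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.startswith("List of available")]


def installed_languages():
    """Ask the local tesseract binary which languages it can load."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some versions print the list to stderr, so both streams are read.
    return parse_list_langs(result.stdout + result.stderr)


if __name__ == "__main__":
    sample = "List of available languages (1):\nosd\n"
    languages = parse_list_langs(sample)
    if "eng" not in languages:
        print("eng data is missing; install it or pass another language with -l")
```

With the sample listing, only osd is found, which is why the default English run fails.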

jorges commented on 2020-08-05 19:49 (UTC)

Thanks for the explanation! I just got rid of python-pdfminer.six from AUR and downgraded python-pdfminer to 20200517-1. OCRmyPDF works and all is well!

pigmonkey commented on 2020-07-29 17:39 (UTC)

It's a little convoluted, but here is what I think is happening:

The confusingly named python-pdfminer from community that we use is in fact python-pdfminer.six. You can verify that by looking at its PKGBUILD. The AUR python-pdfminer.six is basically the same package, except it pulls from PyPI instead of GitHub and is on an outdated version (20200124 instead of community's 20200720).

OCRmyPDF claims to support 20200720, but that version of python-pdfminer{,.six} dropped PDFTextExtractionNotAllowed. This apparently was unintentional and was reversed in 20200726. But as of now 20200726 has not been officially tagged.

So, we need to wait for upstream python-pdfminer.six to make 20200726 official, and then wait for the community maintainer to update the python-pdfminer package to 20200726. And then we need to wait for upstream OCRmyPDF to release a new version that notes support for 20200726. Then I can update this package and everything will be copacetic.

In the meantime, you can downgrade the community python-pdfminer package to the previous version, or run the much older version provided by the AUR python-pdfminer.six package.