I'm trying to use pdf2image and it seems I need something called propeller:
(sum_env) C:\Users\antoi\Documents\Programming\projects\summarizer>python ocr.py -i fr13_idf.pdf
Traceback (most recent call last):
File "c:\Users\antoi\Documents\Programming\projects\summarizer\sum_env\lib\site-packages\pdf2image\pdf2image.py", line 165, in __page_count
proc = Popen(["pdfinfo", pdf_path], stdout=PIPE, stderr=PIPE)
File "C:\Python37\lib\subprocess.py", line 769, in __init__
restore_signals, start_new_session)
File "C:\Python37\lib\subprocess.py", line 1172, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ocr.py", line 53, in <module>
pdfspliterimager(image_path)
File "ocr.py", line 32, in pdfspliterimager
pages = convert_from_path("document-page%s.pdf" % i, 500)
File "c:\Users\antoi\Documents\Programming\projects\summarizer\sum_env\lib\site-packages\pdf2image\pdf2image.py", line 30, in convert_from_path
page_count = __page_count(pdf_path, userpw)
File "c:\Users\antoi\Documents\Programming\projects\summarizer\sum_env\lib\site-packages\pdf2image\pdf2image.py", line 169, in __page_count
raise Exception('Unable to get page count. Is poppler installed and in PATH?')
Exception: Unable to get page count. Is poppler installed and in PATH?
I tried this link, but the download it points to didn't solve my problem.
pdf2image is only a wrapper around poppler (not propeller!). To use the module, you need to have poppler-utils installed on your machine and available in your PATH.
The procedure is linked in the project's README in the "How to install" section.
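A quick way to confirm this diagnosis is to ask Python whether the Poppler executables are visible on PATH. This is only a minimal sketch using the standard library: the helper name missing_poppler_tools is made up for illustration, and pdfinfo and pdftoppm are the Poppler command-line tools (the traceback in the question shows pdf2image calling pdfinfo).

```python
import shutil

# Poppler command-line tools; the traceback above shows pdf2image invoking pdfinfo.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(tools=POPPLER_TOOLS, path=None):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


missing = missing_poppler_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Poppler tools found on PATH")
```

If anything is reported as missing, pdf2image will most likely fail with the same "Is poppler installed and in PATH?" error until Poppler is installed or its bin folder is added to PATH.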
First, download Poppler from here, then extract it. In your code, just pass poppler_path=r'C:\Program Files\poppler-0.68.0\bin' (for example), like below:
from pdf2image import convert_from_path
images = convert_from_path("mypdf.pdf", 500, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i, image in enumerate(images):
    fname = 'image' + str(i) + '.png'
    image.save(fname, "PNG")
Now it's done. With this trick there is no need to edit environment variables. Let me know if you have any problems.
Both pdf2image and pdftotext require Poppler as their backend,
so you have to install it:
conda install -c conda-forge poppler
Then the error will be resolved.
If it still doesn't work for you, you can follow
http://blog.alivate.com.au/poppler-windows/ to install the library.
Poppler is not installed properly.
With apt-get, you can install the correct package:
sudo apt-get install poppler-utils
On Windows, to solve PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?:
Install Chocolatey: https://chocolatey.org/install
Then install Poppler using choco:
choco install poppler
Poppler in path for pdf2image
When working with pdf2image, there are dependencies that need to be satisfied:
Installation of pdf2image
pip install pdf2image
Installation of python-dateutil
pip install python-dateutil
Installation of Poppler
Specifying Poppler path in environment variable (system path)
Installing Poppler on Windows
Go to https://github.com/oschwartz10612/poppler-windows/releases/
Under release v21.11.0-0, open Assets
Download Release-21.11.0-0.zip
Adding Poppler to path
Extract Release-21.11.0-0.zip, for example under C:\Users\UserName\Downloads
Add the bin folder of the extracted Poppler directory (the folder that contains pdfinfo.exe and pdftoppm.exe), not the .zip file itself, to the
Path system variable under Environment Variables
Specifying poppler path in code
pages = convert_from_path(filepath, poppler_path=r"actualpoppler_path")
If anyone still has this error on Windows, I solved the problem as follows:
Download the latest Poppler binary for Windows from Poppler for Windows
Unzip it onto the C drive, e.g. C:\poppler-0.68.0
Specify the Poppler path like this:
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
ROOT_DIR = os.path.abspath(os.curdir)
# Path of the pdf
PDF_file = ROOT_DIR + r"\PdfToImage\src\2.pdf"
'''
Part #1 : Converting PDF to images
'''
# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500, poppler_path=r'C:\poppler-0.68.0\bin')
On a Mac, if you have brew installed, that is the way to go.
brew install poppler
It takes several minutes to install all the dependencies, but pdf2image will work afterwards.
This repeats an answer here, and the answer is also in a comment on this page. I'm adding it because it took me a while to find the correct solution for Macs.
On Windows
Install Poppler for Windows
500 = DPI of the rendered images
path = the folder that contains the PDF files
pip install pdf2image
import os
from pdf2image import convert_from_path

path = r'C:\ABC\FEF\KLH\pdf_extractor\output\break'

def spliting_pdf2img(path):
    for file in os.listdir(path):
        if file.lower().endswith(".pdf"):
            pages = convert_from_path(os.path.join(path, file), 500, poppler_path=r'C:\ABC\DEF\Downloads\poppler-0.68.0\bin')
            base = os.path.splitext(file)[0]
            # Number the output files so multi-page PDFs do not overwrite a single .jpg
            for i, page in enumerate(pages, start=1):
                page.save(os.path.join(path, "%s_%d.jpg" % (base, i)), 'JPEG')
On Linux/Ubuntu
Install the packages below in the terminal:
sudo apt-get update
sudo apt-get install poppler-utils
import os
from pdf2image import convert_from_path

path = '/path/to/pdf/folder'  # placeholder: the folder that contains the PDF files

def spliting_pdf2img(path):
    for file in os.listdir(path):
        if file.lower().endswith(".pdf"):
            # No poppler_path needed: apt-get puts the Poppler tools on PATH
            pages = convert_from_path(os.path.join(path, file), 500)
            base = os.path.splitext(file)[0]
            # Number the output files so multi-page PDFs do not overwrite a single .jpg
            for i, page in enumerate(pages, start=1):
                page.save(os.path.join(path, "%s_%d.jpg" % (base, i)), 'JPEG')
I'm working on a Mac in Visual Studio Code and I encountered this error. I followed the install instructions and was able to verify the packages were installed, but the error persisted when running in VS Code.
Even though I had python.condaPath and python.pythonPath specified in my settings.json, it wasn't until I activated the conda environment inside the VS Code integrated terminal itself
conda activate my_env
that the error went away.
Bizarre.
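After downloading Poppler, do this:
import os
# Append Poppler's bin folder to PATH instead of replacing PATH entirely
os.environ["PATH"] += os.pathsep + r"C:.....\poppler-xxxxxxx\bin"
This makes the Poppler binaries visible to the running process. It worked for me.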
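A slightly more defensive variant of the same idea keeps the existing PATH entries and avoids adding the same folder twice. This is just a sketch; add_to_path is a hypothetical helper name, not something pdf2image provides.

```python
import os


def add_to_path(directory, environ=os.environ):
    """Prepend directory to PATH unless it is already listed (hypothetical helper)."""
    entries = [e for e in environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.insert(0, directory)
    environ["PATH"] = os.pathsep.join(entries)
    return environ["PATH"]
```

You would call it once, before the first convert_from_path call, with the bin folder of your own Poppler download as the argument.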
I had the same problem on my Mac.
I solved it by changing poppler_path from poppler_path='/usr/bin'
to poppler_path='/usr/local/bin'.
You can also print all the places where Poppler might be on your Mac
by running echo $PATH in the Terminal, then try each entry as poppler_path.
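Instead of trying each PATH entry by hand, you could let Python look for the Poppler executable in every PATH folder. A rough sketch, assuming the tool you are after is pdftoppm (one of the Poppler command-line tools); dirs_containing is a made-up helper name.

```python
import os


def dirs_containing(tool, path=None):
    """List PATH directories that contain an executable file named tool."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, tool)
        if directory and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(directory)
    return hits


print(dirs_containing("pdftoppm"))
```

Any folder it prints is a candidate value for poppler_path. An empty list suggests Poppler is not on the PATH that this Python process sees.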
I had the same issue on Mac using Visual Studio Code and a conda environment.
I found out that I could run the code from the command line, however not from VS code. I then printed the environment variables when running from the command line and in VS code using:
print(os.environ)
When I compared the two, I noticed that the "PATH" variable was different. My conda environment was not in the "PATH" variable in VS Code. I think this means that VS Code was not correctly activating my conda environment. I therefore took my "PATH" from the command line and set it in my launch.json environment variables. Then the problem was fixed.
"configurations": [
    {
        "name": "Python: Current File",
        "type": "python",
        "request": "launch",
        "python": "/Users/<username>/miniconda3/envs/<env_name>/bin/python",
        "env": {
            "PATH": "<PATH STRING from command line>"
        },
        "program": "${file}"
    }
]
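Comparing two long PATH strings by eye is error-prone, so here is a small sketch that lists the entries present in one PATH but missing from the other. The function name path_only_in_first is hypothetical; you would paste in the two strings printed by os.environ["PATH"] in the terminal and in VS Code.

```python
import os


def path_only_in_first(first, second, sep=os.pathsep):
    """Return entries present in the first PATH string but absent from the second."""
    second_entries = set(second.split(sep))
    return [entry for entry in first.split(sep) if entry and entry not in second_entries]
```

Calling path_only_in_first(terminal_path, vscode_path) should then show the conda environment's bin folder if that is what VS Code is missing.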
I more or less followed the steps from one of the previously posted answers, except that I had to add the path to the environment variables. Passing the path to pdf2image.convert_from_path didn't work for me. So, if anyone still has this error on Windows, I solved the problem as follows:
Download the latest Poppler binary for Windows from Poppler for Windows
Unzip it onto the C drive, e.g. C:\poppler-0.68.0
Add the Poppler path to the environment variables
I had the same issue, but I fixed it in my Django project by changing the working directory.
First, store the PDF file inside your media directory.
Then change the current directory to this media directory (where the PDF file is stored).
This is the code snippet from my Django project that converts the first page of a .pdf to a .jpg:
import os
from PIL import Image
from pdf2image import convert_from_path

def convert_pdf_2_image(uploaded_image_path, uploaded_image, img_size):
    project_dir = os.getcwd()
    os.chdir(uploaded_image_path)
    file_name = str(uploaded_image).replace('.pdf', '')
    output_file = file_name + '.jpg'
    pages = convert_from_path(uploaded_image, 200)
    # Only the first page is saved
    for page in pages:
        page.save(output_file, 'JPEG')
        break
    # Resize while still in the media directory, where output_file was written
    img = Image.open(output_file)
    # Image.LANCZOS is the same filter as the older Image.ANTIALIAS name, which Pillow 10 removed
    img = img.resize(img_size, Image.LANCZOS)
    img.save(output_file)
    os.chdir(project_dir)
    return output_file
Related
I am attempting to use the PDFPlumber library, which uses Wand's image format. However, upon trying to run:
from wand.image import Image
I get this error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/wand/api.py", line 151, in <module>
libraries = load_library()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/wand/api.py", line 140, in load_library
raise IOError('cannot find library; tried paths: ' + repr(tried_paths))
OSError: cannot find library; tried paths: ['/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWandHDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWandHDRI-2.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7HDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7HDRI-2.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7.Q8.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7.Q8HDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7.Q8HDRI-2.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7.Q16.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7.Q16HDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-7.Q16HDRI-2.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-6.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-6HDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-6HDRI-2.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-Q16.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-Q16HDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-Q16HDRI-2.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-Q8.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-Q8HDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-Q8HDRI-2.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-6.Q16.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-6.Q16HDRI.dylib', '/opt/homebrew/opt/imagemagick@6/lib/lib/libMagickWand-6.Q16HDRI-2.dylib']
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/wand/api.py", line 177, in <module>
'Try to install:\n ' + msg)
ImportError: MagickWand shared library not found.
You probably had not installed ImageMagick library.
Try to install:
brew install freetype imagemagick
I first tried installing normally:
pip3 install wand
brew install imagemagick
Then, I tried using the method listed here, and tried the following:
pip3 install wand
brew uninstall imagemagick
brew install imagemagick@6
brew unlink imagemagick && brew link imagemagick@6
export MAGICK_HOME="/opt/homebrew/opt/imagemagick@6/"
export PATH="/opt/homebrew/opt/imagemagick@6/bin:$PATH"
but am still getting the same error.
I also tried the solutions listed here and confirmed that I am running 64-bit python 3.7 as mentioned here. How can I fix this? I'm especially confused because after running:
cd /opt/homebrew/opt/imagemagick@6/lib
ls
I can see that /opt/homebrew/opt/imagemagick@6/lib/libMagickWand-6.Q16.dylib is where Wand expects it to be (listed in the tried paths in the error above):
ImageMagick libMagickCore-6.Q16.7.dylib libMagickWand-6.Q16.a
libMagick++-6.Q16.9.dylib libMagickCore-6.Q16.a libMagickWand-6.Q16.dylib
libMagick++-6.Q16.a libMagickCore-6.Q16.dylib libMagickWand-6.Q16.la
libMagick++-6.Q16.dylib libMagickCore-6.Q16.la pkgconfig
libMagick++-6.Q16.la libMagickWand-6.Q16.7.dylib
I faced the same issue when I tried running Wand on an M1 Mac, even though the same steps worked on an x86 system. The solution that worked for me was to install ImageMagick via brew in x86 mode:
alias brew86="arch -x86_64 /usr/local/bin/brew"
brew86 install imagemagick
# get imagemagick installation path
brew86 info imagemagick
export MAGICK_HOME=/usr/local/Cellar/imagemagick/7.1.0-49_1
export PATH="$MAGICK_HOME/bin:$PATH"
I also faced the same issue on an M1 Mac. I only set the environment variables
like this after checking the output of the brew info imagemagick command. There was no need to reinstall imagemagick under arch -x86_64, at least for me on an M1 Mac running macOS Monterey 12.6
export MAGICK_HOME=/opt/homebrew/Cellar/imagemagick/7.1.0-51
export PATH="$MAGICK_HOME/bin:$PATH"
and it is working fine.
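Before importing Wand, it can help to check that MAGICK_HOME really points at a folder whose lib subfolder holds the MagickWand library. The sketch below assumes the layout suggested by the tried paths in the question's traceback (prefix, then lib, then a libMagickWand*.dylib file on macOS); find_magickwand_libs is a hypothetical helper, not part of Wand.

```python
import glob
import os


def find_magickwand_libs(magick_home):
    """Return libMagickWand shared libraries found in <magick_home>/lib."""
    pattern = os.path.join(magick_home, "lib", "libMagickWand*.dylib")
    return sorted(glob.glob(pattern))


home = os.environ.get("MAGICK_HOME")
if home:
    libs = find_magickwand_libs(home)
    print(libs if libs else "no libMagickWand*.dylib under " + os.path.join(home, "lib"))
else:
    print("MAGICK_HOME is not set")
```

An empty result may mean MAGICK_HOME points one level too deep or too shallow, for example at the lib folder itself rather than at its parent.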
I have installed the pytesseract module in my venv and want to extract text from a German file
by executing this script from
pytesseract and setting the language to German
import cv2
import pytesseract
try:
from PIL import Image
except ImportError:
import Image
print(pytesseract.image_to_string(Image.open('test.jpg')))
print(pytesseract.image_to_string(Image.open('test.jpg'), lang='ger'))
which gives me
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Error opening data file C:\\Program Files (x86)\\Tesseract-OCR/tessdata/ger.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language \'ger\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
I have found the language data on [tessdoc/Data-Files](https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md)
So far I have only found a guide for Linux: How do I install a new language pack for Tesseract on 16.04
Where do I need to move the language files in my pytesseract site-packages folder to get the script working?
There are two ways.
1. Install the corresponding Tesseract package for your language:
apt-get install tesseract-ocr-YOUR_LANG_CODE
For example, in my case it was Bengali, so I installed:
apt-get install tesseract-ocr-ben
Or, to install all languages:
apt-get install tesseract-ocr-all
This worked for me in an Ubuntu environment.
2. The other way is mentioned in the error message itself. Add an environment variable TESSDATA_PREFIX that points to the language pack. You can download the language pack from here: https://github.com/tesseract-ocr/tessdata .
Once you have downloaded the data pack, you can also set the environment variable programmatically:
import os
# Assign through os.environ so the change is also visible in os.environ, which os.putenv alone does not update
os.environ['TESSDATA_PREFIX'] = 'path/to/your/tessdata/file'
Best way I've found:
Download and install tesseract-ocr-w64-setup-v5.0.0-rc1.20211030.exe.
Open https://github.com/tesseract-ocr/tessdata and download your language. For example, for Farsi download fas.traineddata.
Copy the downloaded file to the tessdata folder of the Tesseract-OCR installation, for example: C:\Program Files\Tesseract-OCR\tessdata
Don't forget to use the traineddata name for the language. For Farsi, I use lang='fas'.
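To double-check the last two steps, you could verify from Python that the traineddata file for your language code is really in the tessdata folder before calling pytesseract. This is a sketch with a hypothetical helper name (has_language); the folder is whatever your own installation uses.

```python
import os


def has_language(tessdata_dir, lang):
    """True if <lang>.traineddata exists in the given tessdata directory."""
    return os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
```

For example, has_language(r"C:\Program Files\Tesseract-OCR\tessdata", "fas") should return True after the copy step. Note that the tessdata repository names the German file deu.traineddata, so for German the language code to check and pass as lang is 'deu' rather than 'ger'.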
I found a guide for this on a German site: Python Texterkennung: Bild zu Text mit PyTesseract in Windows
I am trying to do the following in Python 3.7.1 on Windows
import sqlite3
but I get the following error message
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "c:\programdata\anaconda3\lib\sqlite3\__init__.py", line 23, in <module>
from sqlite3.dbapi2 import *
File "c:\programdata\anaconda3\lib\sqlite3\dbapi2.py", line 27, in <module>
from _sqlite3 import *
ImportError: DLL load failed: The specified module could not be found.
I have searched for a solution to the problem for quite a while now, to no avail. I have also successfully run pip install pysqlite3 at the Anaconda prompt, but the import still fails. What should I do?
I got this working on windows by downloading: the sqlite3 dll (find your system version)
And placing it into the folder: C:\Users\YOURUSER\Anaconda3\DLLs
(Depending on how you installed Anaconda, this may have to be placed into
the following folder: C:\ProgramData\Anaconda3\DLLs)
According to @alireza-taghdisian, you can locate the exact path of
your conda environments (where you need to copy the sqlite3 dll) by typing
conda info --envs at your Anaconda prompt.
Locate the sqlite3.dll file. In my case it was in following folder
C:\Users\Admin\anaconda3\Library\bin
where C:\Users\Admin\anaconda3 is the folder where Anaconda was installed
Add this to PATH in environment variables, and it should work then.
Try copying the sqlite3.dll from the
C:\Users\YOURUSER\anaconda3\Library\bin
folder to
C:\Users\YOURUSER\Anaconda3\DLLs
Please check https://github.com/jupyter/notebook/issues/4332
I added anaconda root/Library/bin to my PATH and now it works!
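Since the answers above disagree on whether sqlite3.dll belongs in DLLs or in Library\bin, a quick check of both folders under your Python prefix shows what is actually there. This is a sketch only; find_sqlite_dll is a made-up name, and it assumes the two folder names mentioned in the answers above.

```python
import os
import sys


def find_sqlite_dll(prefix=sys.prefix):
    """Return existing sqlite3.dll paths under the folders named in the answers above."""
    candidates = [
        os.path.join(prefix, "DLLs", "sqlite3.dll"),
        os.path.join(prefix, "Library", "bin", "sqlite3.dll"),
    ]
    return [path for path in candidates if os.path.isfile(path)]


print(find_sqlite_dll())
```

If the DLL only shows up under Library\bin, either copying it to DLLs or adding Library\bin to PATH (as described above) should make it findable.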
Add CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1 to your environment variables.
before executing the program, enter conda activate in your shell.
I had tried all of the above solutions, but on my system I found that
Python was installed in C:\Python27, so there is a DLLs folder at C:\Python27\DLLs.
I put sqlite3.dll in that DLLs folder.
Maybe this solution will help you, because it completely depends on where you installed Python.
Happy coding :)
I put sqlite3.dll in the Scripts folder of my Python venv and it still wouldn't work. I suspected it was a path problem.
(In my case: E:\Virtual_Env\mini_zinc\env\Scripts)
I found that in my case I had messed up the installation in a virtual env, somehow using an Anaconda Python kernel within a Python venv.
I reinstalled the Python venv, checked that the Python version in the new env was correct (not the Anaconda Python), then proceeded with Jupyter Notebook (or JupyterLab), and it works fine.
I was able to resolve this issue by putting the sqlite3.dll file in C:\Users\<USERID>\AppData\Local\conda\conda\envs\<ENV NAME>\DLLs.
Download sqlite3.dll file from https://www.sqlite.org/download.html
or copy it from C:\ProgramData\Anaconda3\DLLs\
I found @elgsantos's answer useful. But for those who are new to Python and conda like me, I would like to add a few details.
1- I use miniconda3 for creating new environments.
2- Interestingly, I have two conda installation paths on my computer: the first one (the obvious one) is "C:\Users\taghdisian\miniconda3". The second one is "C:\Users\taghdisian\AppData\Local\r-miniconda". The latter is the primary path, and its envs folder is where you need to copy the sqlite3 files. I copied them to "C:\Users\taghdisian\AppData\Local\r-miniconda\envs\sdr3.9\DLLs", where sdr3.9 is one of my conda virtual environments.
You can locate the exact path of your conda environments (where you need to copy sqlite3) by typing conda info --envs at your Anaconda prompt.
I hope this helps.
I got the same error while launching Jupyter Notebook from a conda prompt other than the "base" environment.
Resolved by installing the sqlite package:
(nlpenv) C:\Users\arunk>conda install sqlite
then launching:
(nlpenv) C:\Users\arunk>jupyter notebook
I am running Python 3.6 in a venv on 64 bit Windows 10 inside PyCharm. Here are the steps I performed:
Open PyCharm and start a new project using Python 3.6 as the venv.
Download the PythonMagick wheel file for Python 3.6 from this source: PythonMagick wheel file
Open the terminal in PyCharm and run:
pip install PythonMagick-0.9.19-cp36-cp36m-win_amd64.whl
Download ghostscript from here: Ghostscript 9.25 for Windows (64 bit) and run the exe file.
Add the ghostscript directory C:\Program Files\gs\gs9.25\bin to the user PATH environment variable.
Now I run the sample file from here
import PythonMagick
if __name__ == "__main__":
    pdf = 'a.pdf'
    p = PythonMagick.Image()
    p.read(pdf)
    p.write('doc.jpg')
I get the following error:
RuntimeError: Magick: UnableToOpenConfigureFile `delegates.xml' @ warning/configure.c/GetConfigureOptions/714
How do I fix this error?
When installing PythonMagick in a venv, apparently you also need to add an environment variable called MAGICK_HOME so that Magick can find the config files.
Add the following to the user variables:
MAGICK_HOME = %your-project-dir%\venv\Lib\site-packages\PythonMagick
Then restart PyCharm.
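If you are unsure which folder to use for MAGICK_HOME, you could search the venv for the file named in the error message. This is a generic sketch (find_config_file is a hypothetical helper); where delegates.xml actually lives depends on how the wheel was packaged.

```python
import os


def find_config_file(root, name="delegates.xml"):
    """Walk root and return every path whose file name matches name."""
    matches = []
    for folder, _dirs, files in os.walk(root):
        if name in files:
            matches.append(os.path.join(folder, name))
    return matches
```

Running find_config_file on your project's venv\Lib\site-packages folder should list any delegates.xml it contains; the folder that holds the file is the candidate for MAGICK_HOME.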
Python 3.5 on Windows 10, 32-bit box; all I want to do is run this:
import quandl
import pandas as pd
import html5lib
import lxml
# retrieve web page with list of 50 states
fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
But for the life of me I can't seem to get a properly installed lxml, which is required by pd.read_html. Following advice from several online sources I have MinGW installed in my system and I have also added the following to C:\Python35-32\Lib\distutils\distutils.cfg:
[build]
compiler=mingw32
I have MinGW installed and included in PATH. I have tried installing lxml using both pip3 as well as the binaries found at Unofficial Windows Binaries for Python Extension Packages.
Here's all installed packages:
['beautifulsoup4==4.4.1', 'cffi==1.6.0', 'cryptography==1.3.2', 'cycler==0.10.0', 'cython==0.24', 'html5lib==0.9999999', 'idna==2.1', 'inflection==0.3.1', 'lxml==3.4.4', 'matplotlib==1.5.1', 'more-itertools==2.2', 'ndg-httpsclient==0.4.0', 'numpy==1.11.0', 'pandas-datareader==0.2.1', 'pandas==0.18.1', 'pip==8.1.2', 'pyasn1==0.1.9', 'pycparser==2.14', 'pyopenssl==16.0.0', 'pyparsing==2.1.4', 'python-dateutil==2.5.3', 'pytz==2016.4', 'quandl==3.0.1', 'requests-file==1.4', 'requests==2.10.0', 'scikit-learn==0.17.1', 'setuptools==18.2', 'six==1.10.0']
As shown above, lxml==3.4.4 appears to be installed, however when I try to run the line containing pd.read_html I get the following error message:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\Jose Manuel\AppData\Local\Programs\Python\Python35-32\lib\site-packages\pandas\io\html.py", line 874, in read_html
parse_dates, tupleize_cols, thousands, attrs, encoding)
File "C:\Users\Jose Manuel\AppData\Local\Programs\Python\Python35-32\lib\site-packages\pandas\io\html.py", line 726, in _parse
parser = _parser_dispatch(flav)
File "C:\Users\Jose Manuel\AppData\Local\Programs\Python\Python35-32\lib\site-packages\pandas\io\html.py", line 685, in _parser_dispatch
raise ImportError("lxml not found, please install it")
ImportError: lxml not found, please install it
Your help is very much appreciated
I have been struggling with this today. I found, elsewhere on stackoverflow.com, this two-part and quick solution, which resulted in python no longer complaining when I tried to use lxml:
Go to this repository and download a version which matches your Python installation (the version number, and 32- vs 64-bit). I use Python 3.5.1 64-bit, installed on Windows 10, so on that page I chose lxml-3.6.0-cp35-cp35m-win_amd64.whl. You say you have 32-bit Python, so use a version that matches that (like lxml-3.6.0-cp35-cp35m-win32.whl).
My download directory is D:\Downloads. Python must be in your PATH environment variable for the next step to work. At a command prompt, type a command like the following, changing "D:\Downloads" to the path of your download directory:
python -m pip install "D:\Downloads\lxml-3.6.0-cp35-cp35m-win_amd64.whl"
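If you are not sure which wheel matches your installation, Python can report its own version tag and bitness. A small sketch, assuming CPython (hence the "cp" prefix); wheel_tags is a made-up helper name.

```python
import struct
import sys


def wheel_tags():
    """Return (python_tag, bitness) for the running interpreter, e.g. ('cp35', 64)."""
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bitness = struct.calcsize("P") * 8  # pointer size in bits: 32 or 64
    return python_tag, bitness


print(wheel_tags())
```

A tag of cp35 with 32 bits corresponds to a cp35 win32 wheel, while cp35 with 64 bits corresponds to a cp35 win_amd64 wheel.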