I have installed the pytesseract module in my venv and want to extract text from a German file
by running this script from
pytesseract with the language set to German:
import cv2
import pytesseract
try:
    from PIL import Image
except ImportError:
    import Image
print(pytesseract.image_to_string(Image.open('test.jpg')))
print(pytesseract.image_to_string(Image.open('test.jpg'), lang='ger'))
which gives me
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Error opening data file C:\\Program Files (x86)\\Tesseract-OCR/tessdata/ger.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language \'ger\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
I have found the language data on tessdoc/Data-Files (https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md).
So far I have only found a guide for Linux: "How do I install a new language pack for Tesseract on 16.04".
Where do I need to move the language files in my pytesseract site-packages to get the script working?
There are two ways.
1. Install the corresponding tesseract package for your language -
apt-get install tesseract-ocr-YOUR_LANG_CODE
for example- in my case it was Bengali so I installed -
apt-get install tesseract-ocr-ben
or for installing all languages -
apt-get install tesseract-ocr-all
This worked for me in an Ubuntu environment.
2. The other way is mentioned in the error message itself. Add an environment variable TESSDATA_PREFIX that points to the language pack. You can download the language pack from here: https://github.com/tesseract-ocr/tessdata .
Once you have downloaded the data pack, you can also set the environment variable programmatically:
import os
os.putenv('TESSDATA_PREFIX', 'path/to/your/tessdata/file')
Best way I've found:
Download and install tesseract-ocr-w64-setup-v5.0.0-rc1.20211030.exe.
Open https://github.com/tesseract-ocr/tessdata and download your language. For example, for Farsi download fas.traineddata.
Copy the downloaded file to the tessdata folder of the Tesseract-OCR installation, for example: C:\Program Files\Tesseract-OCR\tessdata
Don't forget to use the traineddata name for the language. For Farsi, I use lang='fas'.
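Before calling pytesseract, it can help to confirm that the language file is really in the folder Tesseract reads. The following is a minimal sketch using only the standard library; `available_languages` and `has_language` are hypothetical helper names, not part of pytesseract, and the folder you pass in is whatever your own installation uses. Note that Tesseract names its German data file deu.traineddata, so the code to pass is lang='deu' rather than 'ger'.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def has_language(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in tessdata_dir."""
    return lang in available_languages(tessdata_dir)
```

For example, has_language(r"C:\Program Files\Tesseract-OCR\tessdata", "deu") should return True once deu.traineddata has been copied into that folder.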
I found a guide for this on a German site: "Python Texterkennung: Bild zu Text mit PyTesseract in Windows".
Let me explain what I want to do.
The list of libraries I want installed is listed in a .txt file.
My script reads the list from the file sequentially. If a library isn't installed, the script installs it via pip; if it is already installed, the script checks the version and updates it if necessary.
I searched for this but didn't find how to do it. Can you offer any help or guidance?
Yes, you can. Try this; here is an example with one hard-coded module:
import subprocess
import sys
# Names of all packages installed for the interpreter that runs this script
get_pckg = subprocess.check_output([sys.executable, '-m', 'pip', 'freeze'])
installed_packages = [r.decode().split('==')[0] for r in get_pckg.split()]
required_packages = ['shopifyAPI']  # Change this to read the names from your file
for packg in required_packages:
    if packg in installed_packages:
        pass
    else:
        print('installing package')
        # Use the same interpreter's pip so the package lands in the right environment
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', packg])
First I fetch all installed modules, then I check whether each required module is installed; if it is not, the script installs it.
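The hard-coded list above can be replaced by names read from the .txt file. The sketch below assumes one package name per line with optional # comments, and Python 3.8 or newer for importlib.metadata; `read_requirements`, `installed_version`, `pip_command` and `sync` are hypothetical helper names. It relies on `pip install --upgrade`, which installs a missing package and updates an existing one, so no separate version comparison is written out.

```python
import subprocess
import sys
from importlib import metadata


def read_requirements(path):
    """Return package names from a text file, skipping blank lines and # comments."""
    names = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            if line:
                names.append(line)
    return names


def installed_version(name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def pip_command(name):
    """Build a pip call for the current interpreter; --upgrade installs or updates."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", name]


def sync(path):
    """Install or upgrade every package listed in the file at path."""
    for name in read_requirements(path):
        print(name, "currently installed:", installed_version(name))
        subprocess.check_call(pip_command(name))
```

Calling sync("libraries.txt") (the file name is only an example) would then process the list in order.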
Yes, you can. The os module can run commands programmatically. Since I don't know what your file looks like, I assume you can read it and run the command for each entry in turn.
import os
os.system("pip install <module>")
Use the following to install a library programmatically.
import pip
try:
    # pip.main was removed in pip 10, so this only works with older pip versions
    pip.main(["install", "pandas"])
except SystemExit as e:
    pass
I'm trying to use pdf2image and it seems I need something called propeller :
(sum_env) C:\Users\antoi\Documents\Programming\projects\summarizer>python ocr.py -i fr13_idf.pdf
Traceback (most recent call last):
File "c:\Users\antoi\Documents\Programming\projects\summarizer\sum_env\lib\site-packages\pdf2image\pdf2image.py", line 165, in __page_count
proc = Popen(["pdfinfo", pdf_path], stdout=PIPE, stderr=PIPE)
File "C:\Python37\lib\subprocess.py", line 769, in __init__
restore_signals, start_new_session)
File "C:\Python37\lib\subprocess.py", line 1172, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ocr.py", line 53, in <module>
pdfspliterimager(image_path)
File "ocr.py", line 32, in pdfspliterimager
pages = convert_from_path("document-page%s.pdf" % i, 500)
File "c:\Users\antoi\Documents\Programming\projects\summarizer\sum_env\lib\site-packages\pdf2image\pdf2image.py", line 30, in convert_from_path
page_count = __page_count(pdf_path, userpw)
File "c:\Users\antoi\Documents\Programming\projects\summarizer\sum_env\lib\site-packages\pdf2image\pdf2image.py", line 169, in __page_count
raise Exception('Unable to get page count. Is poppler installed and in PATH?')
Exception: Unable to get page count. Is poppler installed and in PATH?
I tried this link, but the download it offers didn't solve my problem.
pdf2image is only a wrapper around poppler (not propeller!). To use the module, you need to have poppler-utils installed on your machine and in your PATH.
The procedure is linked in the project's README in the "How to install" section.
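The traceback in the question shows pdf2image trying to start the pdfinfo program, so a quick way to test the setup is to ask Python whether it can find that executable. This is a minimal sketch with the standard library only; `find_pdfinfo` is a hypothetical helper name, not part of pdf2image.

```python
import shutil


def find_pdfinfo(poppler_path=None):
    """Return the full path of Poppler's pdfinfo executable, or None if it is not found.

    poppler_path: optional directory to search instead of the PATH variable.
    """
    return shutil.which("pdfinfo", path=poppler_path)
```

If find_pdfinfo() returns None while find_pdfinfo(<your Poppler bin folder>) returns a path, Poppler is installed but its bin folder is not in PATH.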
First, download Poppler and extract it. Then, in your code, just add poppler_path=r'C:\Program Files\poppler-0.68.0\bin' (for example), like below:
from pdf2image import convert_from_path
images = convert_from_path("mypdf.pdf", 500, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i, image in enumerate(images):
    fname = 'image' + str(i) + '.png'
    image.save(fname, "PNG")
Now it's done. With this trick there is no need to add environment variables. Let me know if you have any problems.
The pdf2image and pdftotext libraries both require Poppler as a backend,
so you have to install it:
conda install -c conda-forge poppler
Then the error will be resolved.
If it still doesn't work for you, you can follow
http://blog.alivate.com.au/poppler-windows/ to install this library.
The problem is that poppler is not installed properly.
You can get the correct package with this command:
sudo apt-get install poppler-utils
For windows; to solve PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH? :
Install chocolatey https://chocolatey.org/install
then install poppler using choco:
choco install poppler
Poppler in path for pdf2image
When working with pdf2image, there are dependencies that need to be satisfied:
Installation of pdf2image
pip install pdf2image
Installation of python-dateutil
pip install python-dateutil
Installation of Poppler
Specifying Poppler path in environment variable (system path)
Installing Poppler on Windows
Go to https://github.com/oschwartz10612/poppler-windows/releases/
Under Release 21.11.0-0 Latest v21.11.0-0
Go to Assets 3
Download Release-21.11.0-0.zip
Adding Poppler to path
Extract the downloaded archive, for example C:\Users\UserName\Downloads\Release-21.11.0-0.zip
Add the bin folder of the extracted directory (not the .zip file itself) to the system
Path variable under Environment Variables
Specifying poppler path in code
pages = convert_from_path(filepath, poppler_path=r"actualpoppler_path")
If anyone still has this error on Windows, I solved the problem by:
Download the Latest binary of Poppler for Windows from Poppler for Windows
Unzip it into C drive like C:\poppler-0.68.0
Specify the Poppler path like this:
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
ROOT_DIR = os.path.abspath(os.curdir)
# Path of the pdf
PDF_file = ROOT_DIR + r"\PdfToImage\src\2.pdf"
'''
Part #1 : Converting PDF to images
'''
# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500, poppler_path=r'C:\poppler-0.68.0\bin')
FOR MAC, if you have brew installed, that is the way to go.
brew install poppler
Takes several minutes to install all the dependencies, but pdf2image will work afterwards.
This is a repeat of an answer here and the answer is also in a comment on this page. Adding this answer b/c it took me a while to find the correct solution FOR MACs.
In Windows
Install Poppler for Windows.
500 = DPI of the rendered images
path = the folder that contains the PDF files
pip install pdf2image
import os
from pdf2image import convert_from_path

path = r'C:\ABC\FEF\KLH\pdf_extractor\output\break'

def spliting_pdf2img(path):
    for file in os.listdir(path):
        if file.lower().endswith(".pdf"):
            pages = convert_from_path(os.path.join(path, file), 500, poppler_path=r'C:\ABC\DEF\Downloads\poppler-0.68.0\bin')
            # Number the output files so later pages do not overwrite earlier ones
            for i, page in enumerate(pages):
                page.save(os.path.join(path, file.lower().replace(".pdf", "_%d.jpg" % i)), 'JPEG')
In Linux/Ubuntu
Install the packages below in the Ubuntu/Linux terminal:
sudo apt-get update
sudo apt-get install poppler-utils
import os
from pdf2image import convert_from_path

path = '/path/to/pdf/folder'  # placeholder: replace with the folder that contains your PDF files

def spliting_pdf2img(path):
    for file in os.listdir(path):
        if file.lower().endswith(".pdf"):
            # No poppler_path needed: apt puts the Poppler tools on PATH
            pages = convert_from_path(os.path.join(path, file), 500)
            # Number the output files so later pages do not overwrite earlier ones
            for i, page in enumerate(pages):
                page.save(os.path.join(path, file.lower().replace(".pdf", "_%d.jpg" % i)), 'JPEG')
I'm working on a mac in Visual Studio Code and I encountered this error. I followed the install instructions and was able to verify the packages were installed but the error persisted when running in VSC.
Even though I had my python.condaPath and python.pythonPath specified in my settings.json, it wasn't until I activated the conda environment inside the VSC integrated terminal itself with
conda activate my_env
that the error went away..
Bizarre.
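After downloading poppler, do this:
import os
# Append the Poppler bin folder instead of overwriting the whole PATH
os.environ["PATH"] += os.pathsep + r"C:.....\poppler-xxxxxxx\bin"
This adds Poppler to the environment of the running script. I hope it works for you; it worked for me.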
I had the same problem on my Mac.
I solved it by changing the poppler_path from poppler_path='/usr/bin'
to poppler_path='/usr/local/bin'.
You can also print all the places where poppler might be on your Mac
by running echo $PATH in the Terminal, and then try each entry as poppler_path.
I had the same issue on Mac using Visual Studio Code and a conda environment.
I found out that I could run the code from the command line, however not from VS code. I then printed the environment variables when running from the command line and in VS code using:
print(os.environ)
When I compared the two, I noticed that the "PATH" variable was different. My conda environment was not in the "PATH" variable in VS Code. I think this means that VS Code was not correctly activating my conda environment. I therefore took my "PATH" from the command line and set it in my launch.json environment variables. Then the problem was fixed.
"configurations": [
    {
        "name": "Python: Current File",
        "type": "python",
        "request": "launch",
        "python": "/Users/<username>/miniconda3/envs/<env_name>/bin/python",
        "env": {
            "PATH": "<PATH STRING from command line>"
        },
        "program": "${file}"
    }
]
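Comparing two long PATH strings by eye is error-prone, so the comparison described above can be sketched in a few lines. `missing_path_entries` is a hypothetical helper name; the two strings would be the PATH values printed from the command line and from VS Code.

```python
import os


def missing_path_entries(reference_path, other_path, sep=os.pathsep):
    """Return entries of reference_path that do not appear in other_path, in order."""
    other = set(filter(None, other_path.split(sep)))
    seen = set()
    missing = []
    for entry in reference_path.split(sep):
        # Skip empty entries and duplicates within the reference string
        if entry and entry not in other and entry not in seen:
            seen.add(entry)
            missing.append(entry)
    return missing
```

Any directory this returns is known to the command line but not to the editor, which is where a missing conda environment would show up.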
I more or less followed the steps from one of the previously posted answers, except that I had to add the path to the environment variables. Passing the path to pdf2image.convert_from_path didn't work for me. So, if anyone still has this error on Windows, I solved the problem as follows:
Download the latest binary of Poppler for Windows from Poppler Windows
Unzip it into the C drive, like C:\poppler-0.68.0
Specify the Poppler path in the environment variables
I had the same issue, but I fixed it in my Django project by changing the working directory.
First you need to store the PDF file inside your media directory.
Then you need to change your current directory to this media directory (where the PDF file is stored).
This is the code snippet from my Django project where I convert a .pdf file to .jpg:
import os
from PIL import Image
from pdf2image import convert_from_path

def convert_pdf_2_image(uploaded_image_path, uploaded_image, img_size):
    project_dir = os.getcwd()
    os.chdir(uploaded_image_path)
    file_name = str(uploaded_image).replace('.pdf', '')
    output_file = file_name + '.jpg'
    pages = convert_from_path(uploaded_image, 200)
    # Only the first page is saved
    for page in pages:
        page.save(output_file, 'JPEG')
        break
    # Resize while still inside the media directory, where output_file was written
    img = Image.open(output_file)
    # Image.LANCZOS is the same filter as ANTIALIAS, which newer Pillow versions removed
    img = img.resize(img_size, Image.LANCZOS)
    img.save(output_file)
    os.chdir(project_dir)
    return output_file
So I just found out that python modules os, shutil are no longer available to clone from: https://pypi.python.org/
and after some research I did not find any replacement..
So the question is: how do I move files from dir/fileX -> /dir2/fileX
AND / OR is it possible to rename the folder, if needed, to 'move' all the files in it?
Using Linux / macOS
Thank you.
Tl;dr: os and shutil are part of the standard library in Python 3.x.
shutil ships with Python by default, so there is nothing to install. You can use shutil.move(src, dest) to recursively move a folder from one directory to another.
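Both cases from the question (moving dir/fileX to dir2/fileX, and renaming a folder so that all its files move with it) can be sketched with the standard library alone. `move_file` and `rename_dir` are hypothetical helper names, and the paths in the usage note are only examples.

```python
import os
import shutil


def move_file(src, dst_dir):
    """Move src into dst_dir (created if needed) and return the new path."""
    os.makedirs(dst_dir, exist_ok=True)
    return shutil.move(src, os.path.join(dst_dir, os.path.basename(src)))


def rename_dir(old, new):
    """Rename a directory; every file inside moves with it."""
    os.rename(old, new)
    return new
```

For example, move_file("dir/fileX", "dir2") leaves the file at dir2/fileX, and rename_dir("dir", "dir2") moves the whole folder in one step, provided dir2 does not exist yet.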
I searched how to predict my own digit image using Google TensorFlow.
I used 64bit Red Hat Linux.
I installed Python3.4.3, other related development environments and TensorFlow version 0.6.0.
Then, I tried to write code for digit prediction.
Firstly, I need to read images in my python program.
So, I searched how to read images in python then I found OpenCV (http://opencv.org/)
I installed OpenCV(version:3.1.0) using cmake.
After installing OpenCV, I tried to import cv2(OpenCV function) in order to read image.
But I cannot import cv2, and an ImportError occurred as follows.
ImportError: No module named 'cv2'
I tried to solve this problem by changing default PYTHONPATH.
For example:
export PYTHONPATH=/usr/local/python/lib/python3.4/site-packages:$PYTHONPATH
I tried to add some code in my python program.
import sys
sys.path.append('/usr/local/python/lib/python3.4')
But the above two steps cannot solve the problem ImportError: No module named 'cv2'.
So, I searched for how to solve this problem and tried many other ways, but without success.
How can I import cv2 in my python program?
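One way to narrow down this kind of ImportError is to check which interpreter is running and whether it can locate the module at all, since bindings built from source are only visible to an interpreter whose search path includes the folder they were installed into. This is a minimal diagnostic sketch, not a fix; `locate_module` and `describe_environment` are hypothetical helper names.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def describe_environment(name):
    """Collect the interpreter, the module location and the search path for inspection."""
    return {
        "interpreter": sys.executable,
        "module_file": locate_module(name),
        "search_path": list(sys.path),
    }
```

If describe_environment("cv2")["module_file"] is None, the folder that contains the compiled cv2 file is not among the listed search paths for that interpreter.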
The OpenCV installation steps were as follows:
>>> yum install cmake
** Download the latest OpenCV version from its official site **
>>> cd /directory of OpenCV/
>>> mkdir release
>>> cd release
>>> cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_PYTHON_SUPPORT=ON ..
>>> make && make install
>>> echo "/usr/local/lib" >> /etc/ld.so.conf.d/opencv.conf
>>> ldconfig
>>> echo "PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig" >> /etc/bash.bashrc
>>> echo "export PKG_CONFIG_PATH" >> /etc/bash.bashrc
This doesn't directly answer your question—perhaps an OpenCV expert can help with that—but TensorFlow includes reasonably comprehensive support for manipulating images. If your images are in JPEG format you can use tf.image.decode_jpeg() to convert them to tensors; likewise tf.image.decode_png() supports PNG images.