Text recognition of an image with Tesseract - linux

I would like to create a pdf file with text recognition from a scanned image.
But I don't want the original image in the PDF file, just plain text. The text should be visible so that it can be read, but the font doesn't matter that much.
This Tesseract command does almost what I want, but the text is invisible.
tesseract -c textonly_pdf=1 test.tif test pdf
How can I make the text visible?
Can I create a pdf file with another command-line or python tool?
I'm running Tesseract in Ubuntu.

Here is a snippet from a Python script I wrote a year ago (on Windows) that extracts the recognized text into a DataFrame, which you can then save to CSV or other formats.
import cv2
import pytesseract as pya
# Path to the Tesseract executable (Windows install; adjust for your system)
pya.pytesseract.tesseract_cmd = r'D:\Programs\Tesseract_OCR\tesseract.exe'
imgcv = cv2.imread('foo.jpg')
# text_df holds the extracted text, confidence and so on
text_df = pya.image_to_data(imgcv, output_type='data.frame')
# rows with conf == -1 are layout rows (blocks, paragraphs, lines), not words
text_df = text_df[text_df.conf != -1]
# keep only words recognized with a confidence above 50
text_df = text_df[text_df.conf > 50]
# mean confidence of the remaining words
conf = text_df['conf'].mean()
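To get from that word table back to readable plain text, the words can be regrouped by line. The following is only a sketch: it assumes the dict-of-lists form that image_to_data returns with output_type=Output.DICT (columns page_num, block_num, par_num, line_num, conf, text), and the helper name words_to_lines and the sample values are made up for illustration.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=50.0):
    """Rebuild text lines from image_to_data-style output (a dict of lists).

    Words whose confidence is not above min_conf are dropped, which also
    removes the layout rows that Tesseract reports with conf == -1.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if conf <= min_conf or not str(word).strip():
            continue
        # A line is identified by its page, block, paragraph and line numbers.
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(str(word))
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hypothetical sample in the same shape as the real output. With pytesseract:
# data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
sample = {
    "page_num": [1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
    "conf": [-1, 96, 91, 30, 88],
    "text": ["", "Hello", "world", "nois3", "again"],
}
print("\n".join(words_to_lines(sample)))
```

The resulting lines could then be written to a text file or handed to any PDF-writing tool of your choice.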

Related

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your PDF to images first.
By "very massive PDF" I'm assuming you mean a PDF with lots of pages. This is not an issue. You can use the pdf2image library. Its convert_from_path function has an output_folder argument that lets you specify the folder where the generated images are saved. The docs describe it as:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one instead of your pdf to work with PyTesseract. If you don't assign the returned list of images from convert_from_path you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory you can use the returned pages directly, like so:
pages = convert_from_path(pdf_path)
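A middle ground between the two approaches is to convert and OCR a few pages at a time, so only one batch of images is in memory at once. This is a rough sketch, not the only way to do it: the helper names page_batches and ocr_pdf_in_batches are hypothetical, and it assumes that pdf2image, poppler, pytesseract and Tesseract are installed. It relies on the first_page and last_page arguments of convert_from_path and on pdfinfo_from_path for the page count.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=10, dpi=200):
    """Yield the OCR text of each page, converting batch_size pages at a time."""
    # Imported lazily so page_batches can be used without these packages.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = int(pdfinfo_from_path(pdf_path)["Pages"])
    for first, last in page_batches(total_pages, batch_size):
        # Only the pages of the current batch are rendered and kept in memory.
        pages = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for page in pages:
            yield pytesseract.image_to_string(page)


# Page ranges for a 25-page document processed 10 pages at a time.
print(list(page_batches(25, 10)))
```

Calling `for text in ocr_pdf_in_batches("big.pdf"): ...` (with a file name of your own) would then process the document batch by batch.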
For example, here is my code (Python 3.9, macOS Big Sur):
import cv2
import pytesseract
from pdf2image import convert_from_path
from fonctions_images import *  # my own helper module (provides del_lines)
path = '/Users/yves/documents_1/'
fichier = path + 'TOUTOU.pdf'
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
# render every page of the PDF at 500 dpi
images = convert_from_path(fichier, 500, transparent=True, grayscale=True, poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
for image in images:
    image.save(path + "image.png", format="png")
    img = cv2.imread(path + "image.png")  # load the page image
    img = del_lines(path, img)  # remove the lines
    img = cv2.imread(path + "img_final_bin_1.png")
    # custom_config is not defined in this snippet; it is set elsewhere in the script
    # OCR a fixed region of the page in French
    d = pytesseract.image_to_data(img[3820:4050, 2340:4000], lang='fra', config=custom_config, output_type='data.frame')

How to convert/export .pptx to pdf using python on linux machine?

I would like to convert/export a .pptx file to PDF, but it seems this is not possible on Linux.
I tried the python-pptx library, but the most it can do is extract the text from the PowerPoint file:
from pptx import Presentation
prs = Presentation("myfile.pptx")
# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []
for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                text_runs.append(run.text)
I would simply like to convert the file to PDF. Does anyone know an easy way to do it?
Regards,
Leonardo
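One commonly used route on Linux is to let LibreOffice do the conversion in headless mode and call it from Python. The sketch below assumes LibreOffice is installed and that its launcher is on the PATH as soffice (on some systems the name is libreoffice instead); the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(pptx_path, out_dir, binary="soffice"):
    """Build the LibreOffice command line that converts one file to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(pptx_path),
    ]


def pptx_to_pdf(pptx_path, out_dir="."):
    """Convert pptx_path to PDF and return the expected output path."""
    subprocess.run(build_convert_command(pptx_path, out_dir), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(pptx_path).stem + ".pdf")


# Show the command that would be run for the file from the question.
print(" ".join(build_convert_command("myfile.pptx", "out")))
```

Rendering fidelity depends on the fonts available on the machine, so the result may differ slightly from what PowerPoint would produce.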

Scan datamatrix codes from pdf file and save them to csv

The task: scan the datamatrix codes in a PDF file and save them to a CSV file.
Expected final result:
010466010514027621)ZPTsFWoUgqe,91009492ZCUruNv8/rQRlZyH/mZhkRY11D5aW4aLjpVn3DVxFIi7l9gV/pvguWxiVnpTRI0SFkNx1dPavcQYjiQ6DCSnNw==
I cannot work out how to structure this code.
I started studying libraries for working with PDF files, specifically PyPDF2, but ran into a problem: PyPDF2 finds absolutely nothing in the file. I also tried to find the sequence in the raw content of the PDF file, but could not make sense of it.
Please help me with any part of this code (except for writing to CSV).
Ideally the information would be extracted from the PDF without rendering it to an image, since there are a large number of codes and processing speed matters.
If anyone knows the structure of PDF files, please tell me whether it is possible to get the location of each module (black square) of the datamatrix code, and whether that could be translated into the final form.
I would be grateful for any information. Thank you.
You can use my solution:
import cv2
import fitz  # PyMuPDF
from pylibdmtx import pylibdmtx
def reader(pdf, csv):
    pdf_file = fitz.open(pdf)
    # binary append mode, because pylibdmtx returns the payload as bytes
    with open(csv, 'ab') as csv_file:
        for current_page_index in range(len(pdf_file)):
            for img_index, img in enumerate(pdf_file.get_page_images(current_page_index)):
                image = fitz.Pixmap(pdf_file, img[0])  # img[0] is the image xref
                if image.height > 50:  # skip small images
                    image.save("1.png")
                    code_img = cv2.imread('1.png')
                    # add a white border so the decoder can locate the code
                    border = cv2.copyMakeBorder(code_img, 10, 10, 10, 10, cv2.BORDER_CONSTANT, None, value=[255, 255, 255])
                    decoded = pylibdmtx.decode(border)
                    if decoded:  # decode returns an empty list when nothing is found
                        csv_file.write(decoded[0].data)
                        csv_file.write(b'\n')

pytesseract image_to_string not pulling strings, but no Error

I am using the image_to_string function in the pytesseract package to convert multiple parts of a single picture file to string. All parts are working except for this image:
Here is the script that I am using to convert it:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('image.png')
text = pytesseract.image_to_string(im)
print(text)
Which gives the output:
—\—\—\N—\—\—\—\—\N
I have tried breaking up the image into smaller parts as well as processing the image as a jpg and as png. What can I do to have it output the values in the image?
Using a different page segmentation mode instead of the default one seems to work:
text = pytesseract.image_to_string(im, config='--psm 6')
According to the Tesseract wiki, mode 6 assumes a single uniform block of text. I tried other modes, but only this one worked.
For the other page segmentation modes, see the Tesseract wiki page on improving the quality of the output.
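Since the right mode depends on the image, it can help to run several modes and compare the results side by side. A minimal sketch, assuming pytesseract and Tesseract are installed; try_psm_modes is a hypothetical helper, the list of modes is only a sample, and the ocr parameter exists so that the loop can be exercised without Tesseract.

```python
def try_psm_modes(image, modes=(3, 4, 6, 7, 11), ocr=None):
    """Run OCR once per page segmentation mode and return {mode: text}."""
    if ocr is None:
        # Imported lazily; this path needs pytesseract and Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    results = {}
    for mode in modes:
        results[mode] = ocr(image, config=f"--psm {mode}").strip()
    return results


# Stand-in OCR function for demonstration only: it pretends that
# just mode 6 manages to read the image.
def fake_ocr(image, config=""):
    return "123 456\n" if config == "--psm 6" else "\n"


for mode, text in try_psm_modes(None, modes=(3, 6), ocr=fake_ocr).items():
    print(mode, repr(text))
```

With a real image, `try_psm_modes(Image.open('image.png'))` would use pytesseract.image_to_string for each mode.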

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and then automatically renames the files it recognizes from a dictionary. However, PyPDF2 returns only empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2
# File name
file = 'sample.pdf'
# Open file
with open(file, "rb") as f:
    # Read in file
    pdfReader = PyPDF2.PdfFileReader(f)
    # Check number of pages
    number_of_pages = pdfReader.numPages
    print(number_of_pages)
    # Get first page
    pageObj = pdfReader.getPage(0)
    # Extract text from page 1
    text = pageObj.extractText()
    print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace the print(text) by repr(text) for files it doesn't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse: it recognized 140 out of 800 files before enhancing and only 110 afterwards.
The PDFs are machine-readable/searchable, because I can copy and paste text from them into Notepad. I tested some files with "pdfminer", and it does show some text, but it also throws a lot of errors. If possible, I would like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Does anyone have any suggestions? Your help is very much appreciated!
I had the same issue and fixed it using another Python library called slate.
Fortunately, I found a fork that works in Python 3.6.5:
import slate3k as slate
with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
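If you want to keep PyPDF2 as the primary reader and only switch libraries for the files it cannot handle, one option is to treat a whitespace-only result as a failure and fall through to the next extractor. This is a sketch under assumptions: the helper names are hypothetical, the PyPDF2 calls are the legacy 1.26 API from the question, and it assumes slate.PDF behaves like a list of page strings.

```python
def is_blank(text):
    """True when the extracted text contains nothing but whitespace."""
    return not text or not text.strip()


def extract_with_fallback(path, extractors):
    """Try each (name, extractor) pair in order; return the first non-blank result."""
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A failing extractor should not stop the remaining ones.
            continue
        if not is_blank(text):
            return name, text
    return None, ""


def pypdf2_first_page(path):
    import PyPDF2  # legacy 1.26 API, as used in the question
    with open(path, "rb") as f:
        return PyPDF2.PdfFileReader(f).getPage(0).extractText()


def slate_all_pages(path):
    import slate3k as slate  # assumption: slate.PDF is a list of page strings
    with open(path, "rb") as f:
        return "\n".join(slate.PDF(f))


# Example wiring; nothing is imported until an extractor is actually called.
EXTRACTORS = [("PyPDF2", pypdf2_first_page), ("slate3k", slate_all_pages)]
```

`extract_with_fallback('sample.pdf', EXTRACTORS)` returns the name of the library that produced text along with the text itself, which makes it easy to count how many files needed the fallback.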
