Scan datamatrix codes from pdf file and save them to csv - python-3.x

A task:
Scan datamatrix codes from pdf file and save them to csv.
File
Final result:
010466010514027621)ZPTsFWoUgqe,91009492ZCUruNv8/rQRlZyH/mZhkRY11D5aW4aLjpVn3DVxFIi7l9gV/pvguWxiVnpTRI0SFkNx1dPavcQYjiQ6DCSnNw==
I cannot form the structure of this code in my head.
I started to study libraries for working with pdf files, specifically PyPDF2, but ran into a problem. PyPDF2 finds absolutely nothing in the file. I tried to find the sequence in the code of the pdf file but did not understand anything.
Please help me with any piece of this code (except for writing to csv).
It may be possible to extract information from the PDF without rendering into an image, since large amounts of codes and code speed play a role.
If there are people who know the structure of pdf, tell me if it will be possible to draw out the location of each pixel (black square) of the datamatrix code and will it be possible to translate all this into the final form.
I would be grateful for any information. Thank you.

You can use my solution:
import fitz, cv2, argparse
from pylibdmtx import pylibdmtx
def reader(pdf, csv):
pdf_file = fitz.open(pdf)
csv_file = open(csv, 'ab')
for current_page_index in range(len(pdf_file)):
for img_index,img in enumerate(pdf_file.get_page_images(current_page_index)):
image = fitz.Pixmap(pdf_file, img[0])
if image.height>50:
image.save("1.png")
img = cv2.imread('1.png')
border = cv2.copyMakeBorder(img, 10, 10, 10, 10, cv2.BORDER_CONSTANT, None, value = [255, 255, 255])
csv_file.write(pylibdmtx.decode(border)[0].data)
csv_file.write(b'\n')
csv_file.close()

Related

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your pdf to images first.
By "very massive PDF" I'm assuming you mean a pdf with lots of pages. This is not an issue. You can use pdf2image library (see the docs here). The method convert_from_path has an output_folder argument that lets you specify the folder where all your generated images will be saved:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one instead of your pdf to work with PyTesseract. If you don't assign the returned list of images from convert_from_path you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory you can use the returned pages directly, like so:
pages = convert_from_path(pdf_path)
for example, my code :
Python : 3.9
Macos: BigSur
from PIL import Image
from fonctions_images import *
from pdf2image import convert_from_path
path='/Users/yves/documents_1/'
fichier =path+'TOUTOU.pdf'
images = convert_from_path(fichier,500, transparent=True,grayscale=True,poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
for v in range(0,len(images)):
image=images[v]
image.save(path+"image.png", format="png")
test=path+"image.png"
img = cv2.imread(test) # to store image in memory
img = del_lines(path,img) # to supprime the lines
img = cv2.imread(path+"img_final_bin_1.png")
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
d=pytesseract.image_to_data(img[3820:4050,2340:4000], lang='fra',config=custom_config,output_type='data.frame')

Is there a way to save an adjusted image from a named Window in opencv in Python?

Just a general question - I made a contrast slider, so I could open images and adjust their brightness and/or contrast to what best fits my needs. However, how can I save those settings resp. overwrite the image with those settings once I close the window?
Unless I am misunderstanding your question, I think you just need to make sure you are saving the adjusted image to a different file path. The code below might help you. You can insert the datetime into the file name if you want a new image to be saved each time.
import cv2
from datetime import datetime
def some_adjustment(img):
"""
your adjustment
"""
return img
path = "img.jpg"
img = cv2.imread(path)
adjusted_img = some_adjustment(img)
cv2.imshow('adjusted_img', adjusted_img)
cv2.waitKey(0)
now = datetime.now()
dt_string = now.strftime("%m.%d.%Y.%H.%M.%S")
cv2.imwrite(f"adjusted_{dt_string}.png", adjusted_img)
cv2.destroyAllWindows()

Raw Images from rawpy darker than their thumbnails

I'm wanting to convert '.NEF' to '.png' using the rawpy, imageio and opencv libraries in Python. I've tried a variety of flags in rawpy to produce the same image that I see when I just open the NEF, but all of the images that output are extremely dark. What am I doing wrong?
My current version of the code is:
import rawpy
import imageio
from os.path import *
import os
import cv2
def nef2png(inputNEFPath):
parent, filename = split(inputNEFPath)
name, _ = splitext(filename)
pngName = str(name+'.png')
tempFileName = str('temp%s.tiff' % (name))
with rawpy.imread(inputNEFPath) as raw:
rgb = raw.postprocess(gamma=(2.222, 4.5),
no_auto_bright=True,
output_bps=16)
imageio.imsave(join(parent, tempFileName), rgb)
image = cv2.imread(join(parent, tempFileName), cv2.IMREAD_UNCHANGED)
cv2.imwrite(join(parent, pngName), image)
os.remove(join(parent, tempFileName))
I'm hoping to get to get this result:
https://imgur.com/Q8qWfwN
But I keep getting dark outputs like this:
https://imgur.com/0jIuqpQ
For the actual file NEF, I uploaded them to my google drive if you want to mess with it: https://drive.google.com/drive/folders/1DVSPXk2Mbj8jpAU2EeZfK8d2HZM9taiH?usp=sharing
You're not doing anything wrong, it's just that the thumbnail was generated by Nikon's proprietary in-camera image processing pipeline. It's going to be hard to get the exact same visual output from an open source tool with an entirely different set of algorithms.
You can make the image brighter by setting no_auto_bright=False. If you're not happy with the default brightening, you can play with the auto_bright_thr parameter (see documentation).

Vader Sentiment with multiple PDF

I have recently merged 20 pdf in 1 pdf via adobe. I have import the pdf in python with this code.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file = open ('/Users/cj/Desktop/PEI.pdf','rb')
newfile=open('rjtjj.txt','w')
pdf_reader= PdfFileReader (pdf_file)
pdf_writer= PdfFileWriter()
print(pdf_reader.numPages)
n=pdf_reader.getNumPages()
for i in range(0, n-1):
# pdf_writer.addPage(pdf_reader.getPage(i))
gft=pdf_reader.getPage(i)
newfile.write(gft.extractText())
pdf_file.close()
newfile.close()
I'm trying to use Vadersentiment to analyse the pdf. What i want to do is analyse individually the 20 pdf that are merged into 1.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
with open('rjtjj.txt', 'r') as f:
for line in f.read().split("\n"):
vs=analyzer.polarity_scores(line)
I know my code is wrong, because it only gives me the first line of the entire pdf. I am new to this, i would really appreciate your help.
Thank you
Your problem really isn't about Vader sentiment analysis -- it is about correct extraction of text from a PDF.
Postscript's forth interpreter is Turing-complete, so some PDF documents are "hard" to parse. You didn't post your PDF so we can only guess at the issue. You might try using poppler's pdftotext command line utility instead. Ubuntu calls the package "poppler-utils"; on mac you would use brew install poppler. Running through pdf2ps & ps2ascii will sometimes offer different, and helpful, results.
If you continue to find it difficult to retrieve proper text from the PDF, you may want to contact whoever produced the PDF and settle on supplying the same information in a revised format.

pytesseract image_to_string not pulling strings, but no Error

I am using the image_to_string function in the pytesseract package to convert multiple parts of a single picture file to string. All parts are working except for this image:
Here is the script that I am using to convert it:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('image.png')
text = pytesseract.image_to_string(im)
print(text)
Which gives the output:
—\—\—\N—\—\—\—\—\N
I have tried breaking up the image into smaller parts as well as processing the image as a jpg and as png. What can I do to have it output the values in the image?
Using a different page segmentation instead of the default one seems to work.
text = pytesseract.image_to_string(im,config ='--psm 6'))
According to the tesseract wiki, option 6 assumes a single uniform block of text. I tried with other options but only this one worked.
To check for other page segmentation methods read the tesseract wiki on how to improve quality of an image.

Resources