How do I save my files at 300 dpi using Pillow(PIL)? - python-3.x

I am opening an image file using the Pillow (PIL) library and saving it again under a different name. But when I save the image under the new name, my original 300 DPI file becomes a 72 DPI file. I tried adding dpi=(300, 300), but still no success.
See my code:
from PIL import Image
image = Image.open('image-1.jpg')
image.save('image-2.jpg' , dpi=(300, 300))
My original file(image-1.jpg)
https://www.dropbox.com/s/x7xj6hyoemv3t94/image_info_1.jpg?raw=1
My copied file(image-2.jpg)
https://www.dropbox.com/s/dpcnkfozefobopn/image_info_2.jpg?raw=1
Notice how they still have the same image size: 8.45.

Thanks to @HansHirse, who explained that the metadata (the EXIF information) was missing. I saved the image with the original EXIF info and it worked:
from PIL import Image
image = Image.open('image-1.jpg')
exif = image.info['exif']
image.save('image-2.jpg' , exif=exif)
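Note that image.info['exif'] raises a KeyError when the source file has no EXIF block. A minimal sketch that carries over both the DPI and the EXIF data only when they are present might look like this. The function name save_with_metadata is hypothetical, and the sketch assumes the source is a JPEG that Pillow can open:

```python
from PIL import Image


def save_with_metadata(src_path, dst_path):
    """Re-save an image while carrying over its DPI and EXIF block, if present."""
    with Image.open(src_path) as image:
        save_kwargs = {}
        dpi = image.info.get("dpi")    # missing when the file stores no density
        exif = image.info.get("exif")  # raw EXIF bytes; missing when there is no EXIF
        if dpi is not None:
            save_kwargs["dpi"] = dpi
        if exif is not None:
            save_kwargs["exif"] = exif
        image.save(dst_path, **save_kwargs)
```

Usage would be save_with_metadata('image-1.jpg', 'image-2.jpg'). Building the keyword arguments conditionally avoids passing None for a missing field.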

Related

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your pdf to images first.
By "very massive PDF" I'm assuming you mean a pdf with lots of pages. This is not an issue. You can use pdf2image library (see the docs here). The method convert_from_path has an output_folder argument that lets you specify the folder where all your generated images will be saved:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one, instead of your PDF, with PyTesseract. If you don't keep a reference to the list of images returned by convert_from_path, you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory, you can use the returned pages directly, like so:
pages = convert_from_path(pdf_path)
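A different way to keep memory bounded, sketched here as an alternative to output_folder, is to convert the PDF a few pages at a time with the first_page and last_page arguments of convert_from_path. The helper names page_batches and ocr_pdf_in_batches, the batch size, and the DPI are assumptions for illustration:

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=10, dpi=300):
    """Yield (page_number, text) pairs, holding at most one batch of pages in memory."""
    # Imported here so the batching helper above works without these packages installed.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for page_number, page in enumerate(pages, start=first):
            yield page_number, pytesseract.image_to_string(page)
```

Because ocr_pdf_in_batches is a generator, each batch of images can be garbage-collected once its pages have been consumed.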
For example, here is my code:
Python: 3.9
macOS: Big Sur
import cv2
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
from fonctions_images import *  # my own helper module (provides del_lines)

path = '/Users/yves/documents_1/'
fichier = path + 'TOUTOU.pdf'
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
# custom_config is not defined in this snippet; it has to be set before the loop
images = convert_from_path(fichier, 500, transparent=True, grayscale=True,
                           poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
for image in images:
    image.save(path + "image.png", format="png")
    img = cv2.imread(path + "image.png")  # load the page into memory
    img = del_lines(path, img)  # remove the lines
    img = cv2.imread(path + "img_final_bin_1.png")
    # OCR a fixed region of the page (rows 3820-4050, columns 2340-4000)
    d = pytesseract.image_to_data(img[3820:4050, 2340:4000], lang='fra',
                                  config=custom_config, output_type='data.frame')

Scan datamatrix codes from pdf file and save them to csv

A task:
Scan datamatrix codes from pdf file and save them to csv.
File
Final result:
010466010514027621)ZPTsFWoUgqe,91009492ZCUruNv8/rQRlZyH/mZhkRY11D5aW4aLjpVn3DVxFIi7l9gV/pvguWxiVnpTRI0SFkNx1dPavcQYjiQ6DCSnNw==
I cannot work out how to structure this code.
I started to study libraries for working with PDF files, specifically PyPDF2, but ran into a problem: PyPDF2 finds absolutely nothing in the file. I tried to find the sequence in the raw content of the PDF file but could not make sense of it.
Please help me with any piece of this code (except for writing to csv).
It may be possible to extract the information from the PDF without rendering it into an image; this matters because there are many codes and speed plays a role.
If anyone knows the structure of PDF, tell me whether it is possible to extract the location of each pixel (black square) of the datamatrix code, and whether all this can be translated into the final form.
I would be grateful for any information. Thank you.
You can use my solution:
import fitz  # PyMuPDF
import cv2
from pylibdmtx import pylibdmtx

def reader(pdf, csv):
    pdf_file = fitz.open(pdf)
    csv_file = open(csv, 'ab')
    for current_page_index in range(len(pdf_file)):
        for img_index, img in enumerate(pdf_file.get_page_images(current_page_index)):
            image = fitz.Pixmap(pdf_file, img[0])  # img[0] is the xref of the embedded image
            if image.height > 50:  # only process images taller than 50 px
                image.save("1.png")
                img = cv2.imread('1.png')
                # pad the image with a 10 px white border before decoding
                border = cv2.copyMakeBorder(img, 10, 10, 10, 10, cv2.BORDER_CONSTANT, None, value=[255, 255, 255])
                csv_file.write(pylibdmtx.decode(border)[0].data)
                csv_file.write(b'\n')
    csv_file.close()
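One thing to watch: the sample code in the question contains a comma, so writing the raw bytes straight to the file means a CSV reader will see two columns. A small sketch using the standard csv module, which quotes such values, could look like this. The function name append_codes and the choice of UTF-8 are assumptions, not part of the answer above:

```python
import csv


def append_codes(csv_path, payloads):
    """Append decoded DataMatrix payloads (bytes) to a CSV file, one code per row."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for payload in payloads:
            # errors="replace" keeps the row writable even if a payload is not valid UTF-8.
            writer.writerow([payload.decode("utf-8", errors="replace")])
```

In reader() this would replace the two csv_file.write calls, for example by collecting each pylibdmtx.decode(border)[0].data into a list and passing it to append_codes at the end.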

Is there a way to ignore EXIF orientation data when loading an image with PIL?

I'm getting some unwanted rotation when loading images using PIL. I'm loading image samples and their binary masks, so this is causing issues. I'm attempting to convert the code to use OpenCV instead, but this is proving sticky. I haven't seen any relevant arguments in the documentation under Image.load(), but I'm hoping there's a workaround I just haven't found...
There is, but I haven't written it all up. Basically, if you load an image with EXIF "Orientation" field set, you can get that parameter.
First, a quick test. Take this image from the Pillow GitHub source, Pillow-7.1.2/Tests/images/hopper_orientation_6.jpg, and run jhead on it; you can see the EXIF orientation is 6:
jhead /Users/mark/StackOverflow/PillowBuild/Pillow-7.1.2/Tests/images/hopper_orientation_6.jpg
File name : /Users/mark/StackOverflow/PillowBuild/Pillow-7.1.2/Tests/images/hopper_orientation_6.jpg
File size : 4951 bytes
File date : 2020:04:24 14:00:09
Resolution : 128 x 128
Orientation : rotate 90 <--- see here
JPEG Quality : 75
Now do that in PIL:
from PIL import Image
# Load that image
im = Image.open('/Users/mark/StackOverflow/PillowBuild/Pillow-7.1.2/Tests/images/hopper_orientation_6.jpg')
# Get all EXIF data
e = im.getexif()
# Specifically get orientation
print(e.get(0x0112))
# prints 6
Now click on the source and you can work out how your image has been rotated and undo it.
Or, you could be completely unprofessional ;-) and create a function called SneakilyRemoveOrientationWhileNooneIsLooking(filename) and shell out (subprocess) to exiftool and remove the orientation with:
exiftool -Orientation= image.jpg
The author's "much simpler solution", detailed in the comment above, is misleading, so I just want to clear that up.
Pillow does not automatically apply EXIF orientation transformation when reading an image. However, it has a method to do so: PIL.ImageOps.exif_transpose(image)
OpenCV automatically applies EXIF orientation when reading an image. You can disable this behavior by using the IMREAD_IGNORE_ORIENTATION flag.
I believe the author's true intention was to apply the EXIF orientation rather than ignore it, which is exactly what his solution accomplished.
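To make the difference concrete, here is a minimal sketch that returns both versions of an image: the pixels exactly as stored (which is what Image.open gives you) and the pixels with the EXIF orientation applied via ImageOps.exif_transpose. The function name load_raw_and_upright is hypothetical:

```python
from PIL import Image, ImageOps


def load_raw_and_upright(path):
    """Return (pixels as stored, pixels with the EXIF orientation applied)."""
    with Image.open(path) as im:
        raw = im.copy()                        # Pillow does not rotate on open
        upright = ImageOps.exif_transpose(im)  # applies the Orientation tag, if any
    return raw, upright
```

For a sample and its binary mask, the important part is to treat both files the same way: either use the raw pixels for both or the transposed pixels for both.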

Changing exif data without re-compressing JPEG image

I am writing a Python 3 CLI tool to fix the creation dates of photos in a library (see here).
I use Pillow to load and save the image and piexif to handle exif data retrieval/modification.
The problem I have is that I only want to change the EXIF data in the pictures and not recompress the whole image. It seems that Pillow save can't do that.
My question is:
Is there a better EXIF library I could use to only play with the EXIF data (so far I have tried py3exiv2, pexif and piexif)?
If not, is there a way to tell Pillow to only change the EXIF of the image without recompressing when saving?
Thanks!
Here is the code I use to change the creation date so far:
# Get original exif data
try:
    exif_dict = piexif.load(obj.path)
except (KeyError, piexif._exceptions.InvalidImageDataError):
    logger.debug('No exif data for {}'.format(obj.path))
    return
# Change creation date in exif_dict
date = obj.decided_stamp.strftime('%Y:%m:%d %H:%M:%S').encode('ascii')
try:
    exif_dict['Exif'][EXIF_TAKE_TIME_ORIG] = date
except (KeyError, piexif._exceptions.InvalidImageDataError):
    return
exif_bytes = piexif.dump(exif_dict)
# Save new exif
im = Image.open(obj.path)
im.save(obj.path, 'jpeg', exif=exif_bytes)
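In your case, I don't think you need Pillow at all. piexif.insert writes the new EXIF bytes into the existing JPEG file, so the image data is not re-encoded: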
exif_bytes = piexif.dump(exif_dict)
piexif.insert(exif_bytes, obj.path)

Wand Image from PDF doesn't apply resizing

I'm using Wand in a Django project to generate thumbnails from different kinds of files, e.g. PDF. The whole thumbnail generation process is done in memory: the source file comes from a request and the thumbnail is saved to a temporary file, then a Django FileField saves the image in the correct path. But the generated thumbnail keeps the initial size. This is my code:
# self.content is a Django model FileField that has not been saved yet,
# so the file inside is still in memory (from the request)
with image.Image(file=self.content.file, format="png") as im:
    im.resize(200, 200)
    name = self.content.file.name
    self.temp = tempfile.NamedTemporaryFile()
    im.save(file=self.temp)
    # self.thumbnail is a FileField, which then saves the image
    self.thumbnail = InMemoryUploadedFile(self.temp, None, name + ".png", 'image/png', 0, 0, None)
Do you have any idea what is happening? Could it be a bug? I've already reported it as an issue on the Wand GitHub page.
The problem comes from the fact that your PDF has more than one page. If you only resize the first page (which is the one you want to display), it works. Try adding the following line after your with statement:
im = image.Image(image=im.sequence[0])
But I agree with you that your version should work as well.
