Converting TIFF file to JPEG when using Imagecodecs Python library gives an error - python-3.x

I am using the Imagecodecs Python library to convert TIFF files to JPEG files and one of my images are causing an error in the program. When it gets to the file I get the following error. Anyone know why I would be getting this?
Exception has occurred: ValueError
TIFF_DECODE: Sorry, can not handle image
PNG_DECODE: not a PNG image
GIF_DECODE: DGifCloseFile returned 'Data is not in GIF format'
WEBP_DECODE: WebPGetFeatures returned VP8_STATUS_BITSTREAM_ERROR
JPEG8_DECODE: Not a JPEG file: starts with 0x49 0x49
JPEG12_DECODE: Not a JPEG file: starts with 0x49 0x49
JPEGSOF3_DECODE: decode_jpegsof3 returned 'JPEG signature 0xFFD8FF not found'
JPEG2K_DECODE: not a J2K or JP2 data stream
JPEGLS_DECODE: charls_jpegls_decoder_read_header returned Invalid JPEG-LS stream, the leading start byte (0xFF) for a JPEG marker was not found
JPEGXR_DECODE: PKCodecFactory_CreateDecoderFromBytes returned WMP_errUnsupportedFormat
JPEGXL_DECODE: DecodeBrunsli returned 0
LERC_DECODE: lerc_getBlobInfo returned Failed
NUMPY_DECODE: not a numpy array
The code that is causing this is:
imwrite(filepath[:-4] + '.jpg', imread(filepath)[:,:,:3].copy())

Related

Can not build sphinx excerpts while keeping original langunge text

I need to build a excerpt from Arabic text and keep the original language for display purpose of the excerpt. But the problem is if I feed the Arabic text direct to BuildExcerpt function it gave the following error.
'{"دولة": "فلسطين", "مصدر": "وزارة الاقتصاد الوطني", "رقم الشركة": "563420595", "اسم الشركة": "شركة حسان الغرابلي وشركاه للتجارة العامة", "عنوان الشركة": "غزة - الشجاعية", "نوع الشركة": "شركة مسجلة", "تاريخ التسجيل": "1994-07-03", "الهاتف": "", "راس مال الشركة": "0 دينار أردني", "مفوضون": "الشركاء مجتمعين"}'
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 5: invalid start byte
As a workaround I used unidecode module and feed the converted text to BuildExcerpt. Then the original language encoding is missing and can not rebuild it again from the excerpt. See the output below.
[' ... -03", "lhtf": "", "rs ml lshrk#": "<b>0</b> dynr \'rdny", "mfwDwn": "lshrk mjtm ... ']
Is there way I can keep the original language encoding for the excerpt?

Python error upon exif data extraction via Pillow module: invalid continuation byte

I am writing a piece of code to extract exif data from images using Python. I downloaded the Pillow module using pip3 and am using some code I found online:
from PIL import Image
from PIL.ExifTags import TAGS
imagename = "path to file"
image = Image.open(imagename)
exifdata = image.getexif()
for tagid in exifdata:
tagname = TAGS.get(tagid, tagid)
data = exifdata.get(tagid)
if isinstance(data, bytes):
data = data.decode()
print(f"{tagname:25}: {data}")
On some images this code works. However, for images I took on my Olympus camera I get the following error:
GPSInfo : 734
Traceback (most recent call last):
File "_pathname redacted_", line 14, in <module>
data = data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 30: invalid continuation byte
When I remove the data = data.decode() part, I get the following:
GPSInfo : 734
PrintImageMatching : b"PrintIM\x000300\x00\x00%\x00\x01\x00\x14\x00\x14\x00\x02\x00\x01\x00\x00\x00\x03\x00\xf0\x00\x00\x00\x07\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x0b\x008\x01\x00\x00\x0c\x00\x00\x00\x00\x00\r\x00\x00\x00\x00\x00\x0e\x00P\x01\x00\x00\x10\x00`\x01\x00\x00 \x00\xb4\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x01\xff\x00\x00\x00\x02\x01\x83\x00\x00\x00\x03\x01\x83\x00\x00\x00\x04\x01\x83\x00\x00\x00\x05\x01\x83\x00\x00\x00\x06\x01\x83\x00\x00\x00\x07\x01\x80\x80\x80\x00\x10\x01\x83\x00\x00\x00\x00\x02\x00\x00\x00\x00\x07\x02\x00\x00\x00\x00\x08\x02\x00\x00\x00\x00\t\x02\x00\x00\x00\x00\n\x02\x00\x00\x00\x00\x0b\x02\xf8\x01\x00\x00\r\x02\x00\x00\x00\x00 \x02\xd6\x01\x00\x00\x00\x03\x03\x00\x00\x00\x01\x03\xff\x00\x00\x00\x02\x03\x83\x00\x00\x00\x03\x03\x83\x00\x00\x00\x06\x03\x83\x00\x00\x00\x10\x03\x83\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\t\x11\x00\x00\x10'\x00\x00\x0b\x0f\x00\x00\x10'\x00\x00\x97\x05\x00\x00\x10'\x00\x00\xb0\x08\x00\x00\x10'\x00\x00\x01\x1c\x00\x00\x10'\x00\x00^\x02\x00\x00\x10'\x00\x00\x8b\x00\x00\x00\x10'\x00\x00\xcb\x03\x00\x00\x10'\x00\x00\xe5\x1b\x00\x00\x10'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
ResolutionUnit : 2
ExifOffset : 230
ImageDescription : OLYMPUS DIGITAL CAMERA
Make : OLYMPUS CORPORATION
Model : E-M10MarkII
Software : Version 1.2
Orientation : 1
DateTime : 2020:02:13 15:02:57
YCbCrPositioning : 2
YResolution : 350.0
Copyright :
XResolution : 350.0
Artist :
How should I fix this problem? Should I use a different Python module?
I did some digging and figured out the answer to the problem I posted about. I originally postulated that the rest of the metadata was in the byte data:
b"PrintIM\x000300\x00\x00%\x00\x01\x00\x14\x00\x14\x00\x02\x00\x01\x00\x00\x00\x03\x00\xf0\x00\x00\x00\x07\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x0b\x008\x01\x00\x00\x0c\x00\x00\x00\x00\x00\r\x00\x00\x00\x00\x00\x0e\x00P\x01\x00\x00\x10\x00`\x01\x00\x00 \x00\xb4\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x01\xff\x00\x00\x00\x02\x01\x83\x00\x00\x00\x03\x01\x83\x00\x00\x00\x04\x01\x83\x00\x00\x00\x05\x01\x83\x00\x00\x00\x06\x01\x83\x00\x00\x00\x07\x01\x80\x80\x80\x00\x10\x01\x83\x00\x00\x00\x00\x02\x00\x00\x00\x00\x07\x02\x00\x00\x00\x00\x08\x02\x00\x00\x00\x00\t\x02\x00\x00\x00\x00\n\x02\x00\x00\x00\x00\x0b\x02\xf8\x01\x00\x00\r\x02\x00\x00\x00\x00 \x02\xd6\x01\x00\x00\x00\x03\x03\x00\x00\x00\x01\x03\xff\x00\x00\x00\x02\x03\x83\x00\x00\x00\x03\x03\x83\x00\x00\x00\x06\x03\x83\x00\x00\x00\x10\x03\x83\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\t\x11\x00\x00\x10'\x00\x00\x0b\x0f\x00\x00\x10'\x00\x00\x97\x05\x00\x00\x10'\x00\x00\xb0\x08\x00\x00\x10'\x00\x00\x01\x1c\x00\x00\x10'\x00\x00^\x02\x00\x00\x10'\x00\x00\x8b\x00\x00\x00\x10'\x00\x00\xcb\x03\x00\x00\x10'\x00\x00\xe5\x1b\x00\x00\x10'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
That assumption wasn't correct. Although the above is metadata, it simply isn't the metadata I am looking for (in my case the FocalLength attribute). Rather it appears to be Olympus specific metadata. The answer to my solution was to find all the metadata. I found a piece of code that worked very well in Stack Overflow: In Python, how do I read the exif data for an image?.
I used the following code by Nicolas Gervais:
import os,sys
from PIL import Image
from PIL.ExifTags import TAGS
for (k,v) in Image.open(sys.argv[1])._getexif().items():
print('%s = %s' % (TAGS.get(k), v))
I replaced sys.argv[1] with the path name to the image file.
Alternate Solution
As MattDMo mentioned, there are also specific libraries for reading EXIF data in Python. One that I found that look promising is ExifRead which can be download by typing the following in the terminal:
pip install ExifRead

Python converting images in TIF format to PNG

I am running the code below to convert files from tifs to PNGs but the converted images isn't converting properly the new files are disrupted and with lines in half of the file as can be seen in the attached image. Any idea why would this happen please?.
import os
from PIL import Image
yourpath = os.getcwd()
for root, dirs, files in os.walk(yourpath, topdown=False):
for name in files:
print(os.path.join(root, name))
if os.path.splitext(os.path.join(root, name))[1].lower() == ".tif":
if os.path.isfile(os.path.splitext(os.path.join(root, name))[0] + ".png"):
print ("A png file already exists for %s" % name)
# If a PNG is *NOT* present, create one from the tiff.
else:
outfile = os.path.splitext(os.path.join(root, name))[0] + ".png"
try:
im = Image.open(os.path.join(root, name))
print("Generating jpeg for %s" % name)
im.thumbnail(im.size)
im.save(outfile, "png", quality=100)
except Exception as e:
print (e)
Running tiffinfo gives the following details:
TIFF Directory at offset 0x8 (8)
Image Width: 1024 Image Length: 1024
Resolution: 5000, 5000 pixels/cm
Bits/Sample: 8
Sample Format: unsigned integer
Compression Scheme: LZW
Photometric Interpretation: RGB color
Samples/Pixel: 3
Rows/Strip: 1
Planar Configuration: separate image planes
ImageDescription: <?xml version="1.0" encoding="UTF-8"?><!-- Warning: this comment is an OME-XML metadata block, which contains crucial dimensional parameters and other important metadata. Please edit cautiously (if at all), and back up the original data before doing so. For more information, see the OME-TIFF web site: https://docs.openmicroscopy.org/latest/ome-model/ome-tiff/. --><OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Creator="OME Bio-Formats 5.9.2" UUID="urn:uuid:5d7dfdcc-e0e9-4899-9aa2-64717b405232" xsi:schemaLocation="http://www.openmicroscopy.org/Schemas/OME/2016-06 http://www.openmicroscopy.org/Schemas/OME/2016-06/ome.xsd"><Image ID="Image:0" Name="p1_Bromophenol_1mM - Mark_and_Find 001/Position026 (RGB rendering)"><Pixels BigEndian="true" DimensionOrder="XYCZT" ID="Pixels:0" PhysicalSizeX="2.0" PhysicalSizeXUnit="┬Ám" PhysicalSizeY="2.0" PhysicalSizeYUnit="┬Ám" PhysicalSizeZ="1.0" PhysicalSizeZUnit="┬Ám" SizeC="3" SizeT="1" SizeX="1024" SizeY="1024" SizeZ="1" TimeIncrement="1.0" TimeIncrementUnit="s" Type="uint8"><Channel ID="Channel:0:0" Name="red" SamplesPerPixel="3"><LightPath/></Channel><TiffData FirstC="0" FirstT="0" FirstZ="0" IFD="0" PlaneCount="1"><UUID FileName="Position026 (RGB rendering) - 1024 x 1024 x 1 x 1 - 3 ch (8 bits).tif">urn:uuid:5d7dfdcc-e0e9-4899-9aa2-64717b405232</UUID></TiffData></Pixels></Image></OME>
Software: OME Bio-Formats 5.9.2

Pillow : return a NoneType when extracting EXIF ​metadata

I tried to extract Exif metadata from a picture with Pillow.
When I import my picture on GIMP or XnView, the software returns to me Exif metadata :
EXIF metadata on GIMP
However, when I run my Python script like this :
from PIL import Image
from PIL.ExifTags import TAGS
def get_exif():
i = Image.open('./Datatest_img/DAFANCH96_023MIC07633_L.jpg')
info = i._getexif()
return {TAGS.get(tag): value for tag, value in info.items()}
print(get_exif())
the script returns to me an error as if the image did not contain EXIF ​​metadata :
Traceback (most recent call last):
File "test_exif.py", line 17, in <module>
print(get_exif())
File "test_exif.py", line 15, in get_exif
return {TAGS.get(tag): value for tag, value in info.items()}
AttributeError: 'NoneType' object has no attribute 'items'
I also tried printing .info in my script the code return :
{None: (200, 200)}
and I running exiftool in command Line, the terminal print :
$ exiftool DAFANCH96_023MIC07633_L.jpg
ExifTool Version Number : 11.99
File Name : DAFANCH96_023MIC07633_L.jpg
Directory : .
File Size : 791 kB
File Modification Date/Time : 2020:05:27 22:46:56+02:00
File Access Date/Time : 2020:05:28 10:54:31+02:00
File Inode Change Date/Time : 2020:05:27 22:46:57+02:00
File Permissions : rw-r--r--
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
JFIF Version : 1.01
Resolution Unit : inches
X Resolution : 200
Y Resolution : 200
Image Width : 4096
Image Height : 2944
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 1
Image Size : 4096x2944
Megapixels : 12.1
Anyone have an idea ? Does anyone know what's going on ? thanks.

How to read binary data in pyspark

I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b
using pyspark.
import array
from io import StringIO
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(features):
a = array.array('f')
a.frombytes(features)
return a.tolist()
def byte_mapper(bytes):
return str(bytes)
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just product_id is selected from the rdd using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The output for product_id is
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on s3.
The file in each row has first 10 bytes for product_id next 4096 bytes as image_features
I'm able to extract all the 4096 image features but facing issue when reading the first 10 bytes and converting it into proper readable format.
EDIT:
Finally, the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10. Chaging to :
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
Should work.
Actually you can find this in the provided code from the web site you downloaded the binary file:
for i in range(4096):
feature.append(struct.unpack('f', f.read(4))) # <-- so 4096 * 4
Old answer:
I think the issue comes from your byte_mapper function.
That's not the correct way to convert bytes to string. You should be using decode:
bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"
print(bytes.decode("utf-8"))
# output: '1582480311'
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
That means product_id string contains non-utf8 characters. If you don't know the input encoding, it's difficult to convert into strings.
However, you may want to ignore those characters by adding option ignore to decode function:
bytes.decode("utf-8", "ignore")

Resources