Python error upon exif data extraction via Pillow module: invalid continuation byte - python-3.x

I am writing a piece of code to extract exif data from images using Python. I downloaded the Pillow module using pip3 and am using some code I found online:
from PIL import Image
from PIL.ExifTags import TAGS
imagename = "path to file"
image = Image.open(imagename)
exifdata = image.getexif()
for tagid in exifdata:
    tagname = TAGS.get(tagid, tagid)
    data = exifdata.get(tagid)
    if isinstance(data, bytes):
        data = data.decode()
    print(f"{tagname:25}: {data}")
On some images this code works. However, for images I took on my Olympus camera I get the following error:
GPSInfo : 734
Traceback (most recent call last):
File "_pathname redacted_", line 14, in <module>
data = data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 30: invalid continuation byte
When I remove the data = data.decode() part, I get the following:
GPSInfo : 734
PrintImageMatching : b"PrintIM\x000300\x00\x00%\x00\x01\x00\x14\x00\x14\x00\x02\x00\x01\x00\x00\x00\x03\x00\xf0\x00\x00\x00\x07\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x0b\x008\x01\x00\x00\x0c\x00\x00\x00\x00\x00\r\x00\x00\x00\x00\x00\x0e\x00P\x01\x00\x00\x10\x00`\x01\x00\x00 \x00\xb4\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x01\xff\x00\x00\x00\x02\x01\x83\x00\x00\x00\x03\x01\x83\x00\x00\x00\x04\x01\x83\x00\x00\x00\x05\x01\x83\x00\x00\x00\x06\x01\x83\x00\x00\x00\x07\x01\x80\x80\x80\x00\x10\x01\x83\x00\x00\x00\x00\x02\x00\x00\x00\x00\x07\x02\x00\x00\x00\x00\x08\x02\x00\x00\x00\x00\t\x02\x00\x00\x00\x00\n\x02\x00\x00\x00\x00\x0b\x02\xf8\x01\x00\x00\r\x02\x00\x00\x00\x00 \x02\xd6\x01\x00\x00\x00\x03\x03\x00\x00\x00\x01\x03\xff\x00\x00\x00\x02\x03\x83\x00\x00\x00\x03\x03\x83\x00\x00\x00\x06\x03\x83\x00\x00\x00\x10\x03\x83\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\t\x11\x00\x00\x10'\x00\x00\x0b\x0f\x00\x00\x10'\x00\x00\x97\x05\x00\x00\x10'\x00\x00\xb0\x08\x00\x00\x10'\x00\x00\x01\x1c\x00\x00\x10'\x00\x00^\x02\x00\x00\x10'\x00\x00\x8b\x00\x00\x00\x10'\x00\x00\xcb\x03\x00\x00\x10'\x00\x00\xe5\x1b\x00\x00\x10'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00"
ResolutionUnit : 2
ExifOffset : 230
ImageDescription : OLYMPUS DIGITAL CAMERA
Make : OLYMPUS CORPORATION
Model : E-M10MarkII
Software : Version 1.2
Orientation : 1
DateTime : 2020:02:13 15:02:57
YCbCrPositioning : 2
YResolution : 350.0
Copyright :
XResolution : 350.0
Artist :
How should I fix this problem? Should I use a different Python module?

I did some digging and figured out the answer to the problem I posted about. I originally postulated that the rest of the metadata was in the byte data:
b"PrintIM\x000300\x00\x00%\x00\x01\x00\x14\x00\x14\x00\x02\x00\x01\x00\x00\x00\x03\x00\xf0\x00\x00\x00\x07\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x0b\x008\x01\x00\x00\x0c\x00\x00\x00\x00\x00\r\x00\x00\x00\x00\x00\x0e\x00P\x01\x00\x00\x10\x00`\x01\x00\x00 \x00\xb4\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x01\xff\x00\x00\x00\x02\x01\x83\x00\x00\x00\x03\x01\x83\x00\x00\x00\x04\x01\x83\x00\x00\x00\x05\x01\x83\x00\x00\x00\x06\x01\x83\x00\x00\x00\x07\x01\x80\x80\x80\x00\x10\x01\x83\x00\x00\x00\x00\x02\x00\x00\x00\x00\x07\x02\x00\x00\x00\x00\x08\x02\x00\x00\x00\x00\t\x02\x00\x00\x00\x00\n\x02\x00\x00\x00\x00\x0b\x02\xf8\x01\x00\x00\r\x02\x00\x00\x00\x00 \x02\xd6\x01\x00\x00\x00\x03\x03\x00\x00\x00\x01\x03\xff\x00\x00\x00\x02\x03\x83\x00\x00\x00\x03\x03\x83\x00\x00\x00\x06\x03\x83\x00\x00\x00\x10\x03\x83\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\t\x11\x00\x00\x10'\x00\x00\x0b\x0f\x00\x00\x10'\x00\x00\x97\x05\x00\x00\x10'\x00\x00\xb0\x08\x00\x00\x10'\x00\x00\x01\x1c\x00\x00\x10'\x00\x00^\x02\x00\x00\x10'\x00\x00\x8b\x00\x00\x00\x10'\x00\x00\xcb\x03\x00\x00\x10'\x00\x00\xe5\x1b\x00\x00\x10'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
That assumption wasn't correct. Although the above is metadata, it isn't the metadata I was looking for (in my case the FocalLength attribute); it appears to be Olympus-specific metadata. The solution was to read all of the metadata. I found a piece of code that works very well in the Stack Overflow question "In Python, how do I read the exif data for an image?".
I used the following code by Nicolas Gervais:
import os, sys
from PIL import Image
from PIL.ExifTags import TAGS
for (k, v) in Image.open(sys.argv[1])._getexif().items():
    print('%s = %s' % (TAGS.get(k), v))
I replaced sys.argv[1] with the path name to the image file.
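A likely reason this works where the first script did not: Image.getexif() returns only the base IFD, in which ExifOffset is merely a pointer to the Exif sub-IFD that holds FocalLength, whereas _getexif() merges that sub-IFD into the result. The sketch below shows the same idea with the public API, assuming a recent Pillow release that provides Exif.get_ifd(). The helper names printable and dump_exif are made up for illustration, and binary values such as PrintImageMatching are summarised rather than decoded.

```python
def printable(value, limit=16):
    """Return a display string for an EXIF value without raising on binary data."""
    if isinstance(value, bytes):
        try:
            text = value.decode("utf-8")
        except UnicodeDecodeError:
            # Vendor blobs such as PrintImageMatching are not text; summarise them.
            return f"<{len(value)} bytes, starts with {value[:limit].hex()}>"
        return text.rstrip("\x00")
    return str(value)


def dump_exif(path):
    """Print base-IFD tags plus the Exif sub-IFD tags of one image."""
    from PIL import Image            # imported lazily; requires Pillow
    from PIL.ExifTags import TAGS

    with Image.open(path) as image:
        exif = image.getexif()
        tags = dict(exif)
        # 0x8769 is the ExifOffset tag, which points at the Exif sub-IFD.
        tags.update(exif.get_ifd(0x8769))
    for tag_id, value in tags.items():
        print(f"{str(TAGS.get(tag_id, tag_id)):25}: {printable(value)}")
```

Calling dump_exif("path to file") should then list FocalLength alongside the base tags if the camera recorded it.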
Alternate Solution
As MattDMo mentioned, there are also dedicated libraries for reading EXIF data in Python. One that looks promising is ExifRead, which can be installed by typing the following in the terminal:
pip install ExifRead

Related

Cannot build Sphinx excerpts while keeping the original language text

I need to build an excerpt from Arabic text and keep the original language so that the excerpt can be displayed. The problem is that if I feed the Arabic text directly to the BuildExcerpt function, it gives the following error.
'{"دولة": "فلسطين", "مصدر": "وزارة الاقتصاد الوطني", "رقم الشركة": "563420595", "اسم الشركة": "شركة حسان الغرابلي وشركاه للتجارة العامة", "عنوان الشركة": "غزة - الشجاعية", "نوع الشركة": "شركة مسجلة", "تاريخ التسجيل": "1994-07-03", "الهاتف": "", "راس مال الشركة": "0 دينار أردني", "مفوضون": "الشركاء مجتمعين"}'
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 5: invalid start byte
As a workaround I used the unidecode module and fed the converted text to BuildExcerpt. However, the original-language text is then lost and cannot be rebuilt from the excerpt. See the output below.
[' ... -03", "lhtf": "", "rs ml lshrk#": "<b>0</b> dynr \'rdny", "mfwDwn": "lshrk mjtm ... ']
Is there a way I can keep the original language encoding for the excerpt?

Converting TIFF file to JPEG when using Imagecodecs Python library gives an error

I am using the Imagecodecs Python library to convert TIFF files to JPEG files, and one of my images is causing an error in the program. When it gets to that file I get the following error. Does anyone know why I would be getting this?
Exception has occurred: ValueError
TIFF_DECODE: Sorry, can not handle image
PNG_DECODE: not a PNG image
GIF_DECODE: DGifCloseFile returned 'Data is not in GIF format'
WEBP_DECODE: WebPGetFeatures returned VP8_STATUS_BITSTREAM_ERROR
JPEG8_DECODE: Not a JPEG file: starts with 0x49 0x49
JPEG12_DECODE: Not a JPEG file: starts with 0x49 0x49
JPEGSOF3_DECODE: decode_jpegsof3 returned 'JPEG signature 0xFFD8FF not found'
JPEG2K_DECODE: not a J2K or JP2 data stream
JPEGLS_DECODE: charls_jpegls_decoder_read_header returned Invalid JPEG-LS stream, the leading start byte (0xFF) for a JPEG marker was not found
JPEGXR_DECODE: PKCodecFactory_CreateDecoderFromBytes returned WMP_errUnsupportedFormat
JPEGXL_DECODE: DecodeBrunsli returned 0
LERC_DECODE: lerc_getBlobInfo returned Failed
NUMPY_DECODE: not a numpy array
The code that is causing this is:
imwrite(filepath[:-4] + '.jpg', imread(filepath)[:,:,:3].copy())

Pillow: returns NoneType when extracting EXIF metadata

I tried to extract EXIF metadata from a picture with Pillow.
When I open my picture in GIMP or XnView, the software shows me EXIF metadata:
(Screenshot: EXIF metadata in GIMP)
However, when I run my Python script like this:
from PIL import Image
from PIL.ExifTags import TAGS
def get_exif():
    i = Image.open('./Datatest_img/DAFANCH96_023MIC07633_L.jpg')
    info = i._getexif()
    return {TAGS.get(tag): value for tag, value in info.items()}
print(get_exif())
the script raises an error, as if the image did not contain EXIF metadata:
Traceback (most recent call last):
File "test_exif.py", line 17, in <module>
print(get_exif())
File "test_exif.py", line 15, in get_exif
return {TAGS.get(tag): value for tag, value in info.items()}
AttributeError: 'NoneType' object has no attribute 'items'
I also tried printing .info in my script; it returns:
{None: (200, 200)}
When I run exiftool on the command line, the terminal prints:
$ exiftool DAFANCH96_023MIC07633_L.jpg
ExifTool Version Number : 11.99
File Name : DAFANCH96_023MIC07633_L.jpg
Directory : .
File Size : 791 kB
File Modification Date/Time : 2020:05:27 22:46:56+02:00
File Access Date/Time : 2020:05:28 10:54:31+02:00
File Inode Change Date/Time : 2020:05:27 22:46:57+02:00
File Permissions : rw-r--r--
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
JFIF Version : 1.01
Resolution Unit : inches
X Resolution : 200
Y Resolution : 200
Image Width : 4096
Image Height : 2944
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 1
Image Size : 4096x2944
Megapixels : 12.1
Does anyone know what's going on? Thanks.
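The exiftool listing above shows only JFIF and file-level fields, which suggests (though nothing in the question confirms it) that the file carries no EXIF segment, in which case _getexif() returns None. A minimal sketch of a guard against that case follows; name_tags is a hypothetical helper introduced for illustration, and get_exif mirrors the function from the question.

```python
def name_tags(info, names):
    """Map numeric tag ids to names; a missing EXIF block (None) becomes {}."""
    return {names.get(tag, tag): value for tag, value in (info or {}).items()}


def get_exif(path):
    """Return the named EXIF tags of a JPEG, or an empty dict if it has none."""
    from PIL import Image          # imported lazily; requires Pillow
    from PIL.ExifTags import TAGS

    with Image.open(path) as image:
        # _getexif() may return None when the file has no EXIF segment.
        return name_tags(image._getexif(), TAGS)
```

With this guard the script prints {} for such a file instead of raising AttributeError.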

How to read binary data in pyspark

I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b
using pyspark.
import array
from io import StringIO
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()
def byte_mapper(bytes):
    return str(bytes)
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just product_id is selected from the rdd using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The output for product_id is
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on s3.
Each row of the file has the first 10 bytes for product_id and the next 4096 bytes as image_features.
I'm able to extract all 4096 image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.
EDIT:
It turns out the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10, because each of the 4096 features is stored as a 4-byte float. Changing it to:
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
should work.
You can see this in the code provided on the website the binary file was downloaded from:
for i in range(4096):
    feature.append(struct.unpack('f', f.read(4)))  # <-- so 4096 * 4
Old answer:
I think the issue comes from your byte_mapper function.
That's not the correct way to convert bytes to string. You should be using decode:
bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"
print(bytes.decode("utf-8"))
# output: '1582480311'
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
That means the product_id bytes are not valid UTF-8. If you don't know the input encoding, it's difficult to convert them into strings.
However, you may want to ignore the undecodable bytes by passing the "ignore" error handler to decode:
bytes.decode("utf-8", "ignore")
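Putting the corrected record length and the decode call together, one record could be parsed as sketched below. This assumes the product ids are plain ASCII (as the sample id 1582480311 suggests) and that the floats are stored in the machine's native byte order, as in the struct.unpack('f', ...) reference code. The name parse_record is hypothetical.

```python
import array
import struct

ID_BYTES = 10
N_FEATURES = 4096
RECORD_LENGTH = ID_BYTES + 4 * N_FEATURES  # 10 + 4096 * 4 = 16394 bytes


def parse_record(record):
    """Split one fixed-length record into (product_id, list of floats)."""
    product_id = record[:ID_BYTES].decode("ascii")
    features = array.array("f")          # 4-byte native floats
    features.frombytes(record[ID_BYTES:])
    return product_id, features.tolist()


# Synthetic record for illustration only: an id followed by 4096 floats.
sample = b"1582480311" + struct.pack(f"{N_FEATURES}f", *([0.5] * N_FEATURES))
product_id, features = parse_record(sample)
print(product_id, len(features))
```

In the Spark job this would be applied as sc.binaryRecords("s3://bucket/image_features.b", RECORD_LENGTH).map(parse_record).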

"wave.Error: unknown format: 3" after using librosa.resample. Is there anything wrong with the output of librosa?

I have a .wav file with a sample rate of 44.1 kHz, and I want to resample it to 16 kHz using librosa.resample. The output .wav sounds great and is 16 kHz, but I get an error when I try to read it with wave.open.
This problem is quite similar to mine:
Opening a wave file in python: unknown format: 49. What's going wrong?
This is my code:
import wave
import librosa
if __name__ == "__main__":
    input_wav = '1d13eeb2febdb5fc41d3aa7db311fa33.wav'
    output_wav = 'result.wav'
    y, sr = librosa.load(input_wav, sr=None)
    print(sr)
    y = librosa.resample(y, orig_sr=sr, target_sr=16000)
    librosa.output.write_wav(output_wav, y, sr=16000)
    wave.open(output_wav)
I get an error at the last step, wave.open(output_wav).
The exception is the following:
Traceback (most recent call last):
File "/Users/range/Code/PycharmProjects/Speaker/test.py", line 204, in <module>
wave.open(output_wav)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/wave.py", line 499, in open
return Wave_read(f)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/wave.py", line 163, in __init__
self.initfp(f)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/wave.py", line 143, in initfp
self._read_fmt_chunk(chunk)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/wave.py", line 260, in _read_fmt_chunk
raise Error('unknown format: %r' % (wFormatTag,))
wave.Error: unknown format: 3
I don't know why wave.open can't read the .wav file, and I need to resample the audio for my further work.
I wonder whether librosa.output.write_wav changed the type of the .wav file.
So I had to write the resample function myself. Fortunately, it works.
This is my code:
import wave
import numpy as np
def resample(input_wav, output_wav, tar_fs=16000):
    audio_file = wave.open(input_wav, 'rb')
    audio_data = audio_file.readframes(audio_file.getnframes())
    # np.frombuffer replaces the deprecated binary mode of np.fromstring
    audio_data_short = np.frombuffer(audio_data, np.short)
    src_fs = audio_file.getframerate()
    dtype = audio_data_short.dtype
    audio_len = len(audio_data_short)
    audio_time_max = 1.0*(audio_len-1) / src_fs
    src_time = 1.0 * np.linspace(0, audio_len, audio_len) / src_fs
    # built-in int replaces np.int, which newer NumPy releases no longer provide
    tar_len = int(audio_time_max*tar_fs)
    tar_time = 1.0 * np.linspace(0, tar_len, tar_len) / tar_fs
    # linear interpolation onto the target time grid
    output_signal = np.interp(tar_time, src_time, audio_data_short).astype(dtype)
    with wave.open(output_wav, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(tar_fs)
        f.writeframes(output_signal)
I hope you can help me understand what goes wrong when resampling the .wav file with librosa, and I'm glad if my code helps other people who have the same problem. :)
I was working on a project and had the same error, so I dug in a bit and found that the issue is due to the default way in which librosa writes the wave file using write_wav() in the output module.
The problem is that the samples are stored as "Floating Point PCM": WAV format tag 3 denotes IEEE floating-point data, whereas the wave module only reads integer PCM (format tag 1).
You can change the encoding easily by using SoX. SoX is a cross-platform command-line utility which you can use to control specifics like the encoding format.
For example, you would do something like this to convert to 16-bit signed-integer encoding:
sox audio.wav -b 16 -e signed-integer modified_audio.wav
(For Linux users): Here is an alternative to SoX, since I couldn't use it. I successfully converted the file with ffmpeg in the terminal using the command:
ffmpeg -i input_wav.wav -ar 44100 -ac 1 -acodec pcm_s16le output_wav.wav
where -ar sets the audio sample rate and -ac the number of audio channels (use -ar 16000 instead to keep the 16 kHz rate).
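If an external tool is inconvenient, the same float-to-integer conversion can be sketched with the standard library alone. The helper write_pcm16 below is hypothetical and assumes mono samples in the range [-1.0, 1.0], such as the array returned by librosa.resample for a mono file. It writes 16-bit integer PCM, which wave.open can read back.

```python
import io
import struct
import wave


def write_pcm16(target, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as 16-bit integer PCM."""
    frames = b"".join(
        # Clip to [-1, 1], scale to the int16 range, pack little-endian.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(target, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)          # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(frames)


# Round trip through an in-memory file to show that wave can read the result.
buffer = io.BytesIO()
write_pcm16(buffer, [0.0, 0.5, -1.0])
buffer.seek(0)
with wave.open(buffer) as reader:
    print(reader.getframerate(), reader.getsampwidth(), reader.getnframes())
```

In the question's script, write_pcm16(output_wav, y) could stand in for the librosa.output.write_wav call, at the cost of a slow pure-Python loop for long recordings.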
