Repairing corrupted JPEG images from character replacement

Recently I got some corrupted JPEG images after mistakenly running this command:
~$> sed -i 's/;/_/g' *
After that, in the working directory and its subdirectories, every 0x3B byte in the JPEG images became 0x5F. Viewer apps display the images as corrupted, like this:
corrupted image sample
I could not identify which bytes should be recovered, and when I tried to check the images for warning/error flags with toolkits such as ExifTool, they just reported OK, since the corrupted JPEGs are not broken badly enough to fail to open in a viewer.
The images need to be repaired, since there is no backup copy of them, but I don't know where to start. Simply replacing every 0x5F with 0x3B again won't work, because some 0x5F bytes may be legitimate: with n candidate 0x5F bytes there are 2^n combinations, far too many for trial-and-error replacement. I've just started parsing the Huffman tables in the JPEG header, hoping to identify where the Huffman-coded stream conflicts with the binary data, but I'm not sure this will work.
How can I recover the images in this situation? I appreciate your help.

There appear to be 57 occurrences of 0x5F in your corrupted image. If you can't find a better way, you could "eyeball" the effect of replacing the incorrect bytes fairly quickly like this:
open the image in binary mode and read it all with JPEG = open('PdQpR.jpg','rb').read()
use offsets = [m.start() for m in re.finditer(b'_', JPEG)] to find the byte offsets of the 57 occurrences
display the image with cv2.imdecode() and cv2.imshow(), then enter a loop accepting keypresses with cv2.waitKey()
p = move to previous one of 57 occurrences
n = move to next one of 57 occurrences
SPACE = toggle between 0x5f and 0x3b
s = save current state
q = quit
I had a quick attempt at this but haven't had much success using it yet:
#!/usr/bin/env python3
import re
import cv2
import numpy as np

# Load the image into a mutable buffer
filename = 'PdQpR.jpg'
JPEG = bytearray(open(filename, 'rb').read())

# Find the byte offsets of all the underscores
offsets = [m.start() for m in re.finditer(b'_', JPEG)]
N = len(offsets)

index = 0
while True:
    # Show user which entry we are at
    print(f'{index}/{N}: n=next, p=previous, space=toggle, q=quit')
    # Decode and display the JPEG
    im = cv2.imdecode(np.frombuffer(JPEG, dtype=np.uint8), cv2.IMREAD_COLOR)
    cv2.imshow(filename, im)
    key = cv2.waitKey(0)
    # n = next offset
    if key == ord('n'):
        index = (index + 1) % N
    # p = previous offset
    elif key == ord('p'):
        index = (index - 1) % N
    # q = quit
    elif key == ord('q'):
        break
    # space = toggle between underscore and semicolon
    elif key == ord(' '):
        if JPEG[offsets[index]] == ord('_'):
            print(f'{index}/{N}: Toggling to ;')
            JPEG[offsets[index]] = ord(';')
        else:
            print(f'{index}/{N}: Toggling to _')
            JPEG[offsets[index]] = ord('_')
Note: Toggling some bytes between '_' and ';' results in illegal images; cv2.imdecode() then returns None and cv2.imshow() raises an error. Ideally you would wrap these in a try/except and back out the last change when that happens. I didn't do that yet.
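If you wanted to automate that back-out, a minimal sketch could look like the following (it reuses JPEG, offsets, cv2, and np from the script above; I'm assuming a failed decode shows up as a None return or a cv2.error, which is how OpenCV typically signals this):

def toggle_and_check(JPEG, offset):
    # Toggle one byte and revert the change if the image no longer decodes
    old = JPEG[offset]
    JPEG[offset] = ord(';') if old == ord('_') else ord('_')
    try:
        im = cv2.imdecode(np.frombuffer(JPEG, dtype=np.uint8), cv2.IMREAD_COLOR)
    except cv2.error:
        im = None
    if im is None:
        JPEG[offset] = old  # back out the last change
    return im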
Note: I didn't implement the save function; it would just be something like open('corrected.jpg', 'wb').write(JPEG)

Related

How to convert Hex to original file format?

I have a .tgz file that was formatted as shellcode; it looks like this (hex):
"\x1F\x8B\x08\x00\x44\x7A\x91\x4F\x00\x03\xED\x59\xED\x72.."
It was generated this way (python3):
import os

def main():
    dump_src = "MyPlugin.tgz"
    fc = ""
    try:
        with open(dump_src, 'rb') as fd:
            fcr = fd.read()
            for byte in bytearray(fcr):
                fc += "\\x{:02x}".format(byte)
    except:
        fcr = dump_src
        for byte in bytearray(fcr):
            fc += "\\x{:02x}".format(byte)
    print(fc)
    # failed attempt:
    fcback = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
    print(fcback)

if __name__ == "__main__":
    main()
How can I convert this back to the original tgz archive?
Edit: the failed attempt in the last section outputs this:
b'\x8b\x00\x10]\x03\x93o0\x85%\xe2!\xa4H\xf1Fi\xa7\x15\xf61&\x13N\xd9[\xfag\x11V\x97\xd3\xfb%\xf7\xe3\\\xae\xc2\xff\xa4>\xaf\x11\xcc\x93\xf1\x0c\x93\xa4\x1b\xefxj\xc3?\xf9\xc1\xe8\xd1\xd9\x01\x97qB"\x1a\x08\x9cO\x7f\xe9\x19\xe3\x9c\x05\xf2\x04a\xaa\x00A,\x15"RN-\xb6\x18K\x85\xa1\x11\x83\xac/\xffR\x8a\xa19\xde\x10\x0b\x08\x85\x93\xfc]\x8a^\xd2-T\x92\x9a\xcc-W\xc7|\xba\x9c\xb3\xa6V0V H1\x98\xde\x03#\x14\'\n 1Y\xf7R\x14\xe2#\xbe*:\xe0\xc8\xbb\xc9\x0bo\x8bm\xed.\xfd\xae\xef\x9fT&\xa1\xf4\xcf\xa7F\xf4\xef\xbb"8"\xb5\xab,\x9c\xbb\xfc3\x8b\xf5\x88\xf4A\x0ek%5eO\xf4:f\x0b\xd6\x1bi\xb6\xf3\xbf\xf7\xf9\xad\xb5[\xdba7\xb8\xf9\xcd\xba\xdd,;c\x0b\xaaT"\xd4\x96\x17\xda\x07\x87& \xceH\xd6\xbf\xd2\xeb\xb4\xaf\xbd\xc2\xee\xfc\'3zU\x17>\xde\x06u\xe3G\x7f\x1e\xf3\xdf\xb6\x04\x10A\x04\x10A\x04\x10A\x04\x10A\xff\x9f\xab\xe8(\x00'
And when I output it to a file (e.g. via python3 main.py > MyFile.tgz) the file is corrupted.
Since you know the format of the data (each byte is encoded as a string of 4 characters in the format "\xAB") it's easy to revert the conversion and get the original bytes again. It'll only take one line of Python code:
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
This uses:
range(start, stop, step) with step 4 to iterate in groups of 4 characters through your string
slicing to get each group of 2 hexadecimal digits
int(x, base) to convert the hexadecimal string to an integer
a generator expression to immediately pass the converted elements to:
bytes() to create a bytes object with the data
The variable data is now of type bytes and you could directly write it to a file (to decompress with an external zip program), or pass it to zlib.decompress() (to further process it in Python).
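As an aside (not part of the original answer): since the string is just hex digit pairs with \x prefixes, bytes.fromhex() can do the same conversion once the prefixes are stripped:

fc = "\\x1f\\x8b\\x08\\x00"                 # example input in the question's format
data = bytes.fromhex(fc.replace('\\x', ''))
print(data)                                 # b'\x1f\x8b\x08\x00'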
UPDATE (follow-up on the comments and updated question):
Firstly, I have tested the above code and it does result in the same bytes as the input. Are you really sure that the example output in your question is the actual result of the code in your question? Please try to be careful when copying code and/or output. A few remarks:
Your code is not properly formatted, so I cannot run it without making modifications. And when I have made modifications to the code, I might run different code than you do, yielding different results. So next time please copy-paste your exact (working, tested) code without modifications.
The format string in your code uses lowercase hexadecimal format, and your first example output uses uppercase. So that output cannot be from this code.
I don't have access to your file "MyPlugin.tgz", but when I test your code with another .tgz file (after fixing the IndentationErrors), my output is correct. It starts with \x1f\x8b as expected (this is the magic number in the gzip header). I can't explain why your output is different...
Secondly, it seems like you don't fully understand how bytes and string representations work. When you write print(fcback), a string representation of the Python object fcback (in this case a bytes object) is printed. The string representation of a bytes object is not the same as the binary data! When printing a bytes object, each byte that corresponds to a printable ASCII character is replaced by that character, other bytes are escaped (similar to the formatted string that your code generates). Also, it starts with b' and ends with '.
You cannot print binary data to your terminal and then pipe the output to a file. This will result in a different file. The correct way to write the data to a file is using file.write(data) in your Python code.
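A small illustration of the difference (my own example values, not the asker's data):

data = b'\x1f\x8b\x00ABC'
print(data)               # prints the repr b'\x1f\x8b\x00ABC' - text, not raw bytes
with open('out.bin', 'wb') as f:
    f.write(data)         # writes exactly the six raw bytes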
Here's a fully working example:
def binary_to_text(data):
"""Convert a bytes object to a formatted text string."""
text = ""
for byte in data:
text += "\\x{:02x}".format(byte)
return text
def text_to_binary(text):
"""Convert a formatted text string to a bytes object."""
return bytes(int(text[i+2:i+4], 16) for i in range(0, len(text), 4))
def main():
# Read the binary data from input file:
with open('MyPlugin.tgz', 'rb') as input_file:
input_data = input_file.read()
# Convert binary to text (based on your original code):
text = binary_to_text(input_data)
print(text[0:100])
# Convert the text back to binary:
output_data = text_to_binary(text)
print(output_data[0:100])
# Write the binary data back to a file:
with open('MyPlugin-restored.tgz', 'wb') as output_file:
output_file.write(output_data)
if __name__ == '__main__':
main()
Note that I only print the first 100 elements to keep the output short. Also notice that the second print-statement prints a much longer text. This is because the first print gets 100 characters (which are printed "as is"), while the second print gets 100 bytes (of which most bytes are escaped, causing the output to be longer).

Erasing part of a text file in Python

I have a really big text file on my hard disk. It contains around 8 million JSON objects separated by commas, and I want to remove the last one; however, because the file is so big I cannot do it in a regular editor (Notepad++, Sublime, Visual Studio Code, ...). So I decided to use Python, but I have no clue how to erase part of an existing file with it. Any kind of help would be appreciated.
P.S: My file has such a structure:
json1, json2, json3, ...
where each json looks like {"a":"something", "b":"something", "c":"something"}
The easiest way would be to make the file content valid JSON by enclosing it in [ and ] so that it becomes a list of dicts. After removing the last item from the list, you can dump it back to a string and strip the first and last characters (the enclosing [ and ], which your original format doesn't have):
import json
with open('file.txt', 'r') as r, open('newfile.txt', 'w') as w:
w.write(json.dumps(json.loads('[%s]' % r.read())[:-1])[1:-1])
Since you only want the last JSON object removed, a much more efficient method is to identify the last valid JSON object in the file and truncate the file at the position of that object's preceding comma.
This can be accomplished by seeking and reading backwards from the end of the file, one relatively small chunk at a time, splitting each chunk by { (since it marks the beginning of a JSON object), and prepending the fragments one at a time to a buffer until the buffer parses as a JSON object (this lets the code handle nested dict structures). At that point, find the preceding comma in the preceding fragment, prepend everything from that comma onwards to the buffer, and finally seek to where the buffer starts and truncate the file:
import json

chunk_size = 1024
with open('file.txt', 'rb+') as f:
    f.seek(-chunk_size, 2)
    buffer = ''
    while True:
        fragments = f.read(chunk_size).decode().split('{')
        f.seek(-chunk_size * 2, 1)
        i = len(fragments)
        for fragment in fragments[:0:-1]:
            i -= 1
            buffer = '{%s%s' % (fragment, buffer)
            try:
                json.loads(buffer)
                break
            except ValueError:
                pass
        else:
            buffer = fragments[0] + buffer
            continue
        break
    next_fragment = fragments[i - 1]
    # if we don't have a comma in the preceding fragment and it is already the first
    # fragment, we need to read backwards a little more
    if i == 1 and ',' not in fragments[0]:
        f.seek(-2, 1)
        next_fragment = f.read(2).decode() + next_fragment
    buffer = next_fragment[next_fragment.rindex(','):] + buffer
    f.seek(-len(buffer.encode()), 2)
    f.truncate()
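One caveat if you want to try this on a small sample: the initial f.seek(-chunk_size, 2) raises OSError when the file is shorter than chunk_size, so shrink chunk_size first. A hypothetical sanity check:

# Build a tiny sample in the same comma-separated format (made-up data):
with open('file.txt', 'w') as f:
    f.write('{"a": "x"}, {"b": {"nested": true}}, {"c": "last"}')

# Run the script above with chunk_size = 16 (the sample is only 50 bytes);
# afterwards file.txt contains: {"a": "x"}, {"b": {"nested": true}}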

Degree character (°) encoding/decoding

I am on the Windows platform and I use Python 3.
I have a text file which contains degree characters (°).
I want to read the whole text file, do some processing, and write it back with the modifications. Here is a sample of my code:
with io.open('myTextFile.txt', encoding='ASCII') as f:
    for item in allItem:
        i = 0
        myData = pd.DataFrame(data=np.zeros((n, 1)))
        for line in f:
            myRegex = "(AD" + item + ")"
            if re.match(myRegex, line):
                myData.loc[i, 0] = line
                i += 1
        myData = myData[(myData.T != 0).any()]
        myData = myData.append(pd.DataFrame(["\n"], index=[myData.index[-1] + 1]))
        myData = myData[0].map(lambda x: x.strip()).to_frame()
        myData.to_csv('myModifiedTextFile.txt', header=False, index=False,
                      mode='a', quoting=csv.QUOTE_NONE, escapechar=' ',
                      encoding='ASCII')
However, I am getting Unicode errors although I tried specifying the encoding:
'ascii' codec can't decode byte 0xe9 in position 512: ordinal not in range(128)
ASCII is not very useful here, since it only knows 128 characters, the ones you can find in the ASCII table. Notice there is no degree sign among them. I am unsure what the actual encoding of your file is – Unicode and the commonly used Windows code pages (1250/1252) have the degree sign at 0xB0.
I assume that in your file there is a degree sign at position 512 and it is causing the error. If so, you need to be more specific with your encoding argument: figure out which code page/encoding was used to save the file, and confirm it by looking up that code page and checking that it has the degree sign at 0xE9.
If there is a different character at position 512 ("é" is a good candidate, since it sits at 0xE9 in the common Windows code pages), then simply specify an encoding like cp1250, cp1252, or cp1257.
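If you are unsure which code page you have, you could inspect the offending byte and compare candidate decodings directly (a sketch; the file name is taken from the question):

with open('myTextFile.txt', 'rb') as f:
    raw = f.read()
print(hex(raw[512]))  # the byte the ASCII codec choked on
for enc in ('cp1250', 'cp1252', 'cp1257', 'latin_1'):
    print(enc, raw[505:520].decode(enc, errors='replace'))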

Unicode manipulation and garbage '[]' characters

I have a 4 GB text file which I can't even load to view, so I'm trying to split it up, but I need to manipulate the data a bit at a time.
The problem is I'm getting these garbage white vertical rectangular characters, and I can't search for what they are in a search engine because they won't paste, nor can I get rid of them.
They look like these square brackets '[]' but without the small amount of space in the middle.
Their Unicode values differ, so I can't just select one value and get rid of it.
I want to get rid of all of these rectangles.
Two more questions.
1) Why are there any Unicode characters here (in the image below) at all? I decoded them. What am I missing? Note: later on I get string output that looks like a normal string, such as 'code1234' etc., but those Unicode exceptions are there as well.
2) Can you see why larger end values raise the exception list index out of range? This only happens towards the end of the range and it isn't constant, i.e. if end is 100 then maybe the last 5 throw that exception, but if end is 1000 then ONLY the last, say, 10 throw it.
Some code:
from itertools import islice

def read_from_file(file, start, end):
    with open(file, 'rb') as f:
        for line in islice(f, start, end):
            data.append(line.strip().decode("utf-8"))
    for i in range(len(data) - 1):
        try:
            if '#' in data[i]:
                a = data.pop(i)
                mail.append(a)
            else:
                print(data[i], data[i].encode())
        except Exception as e:
            print(str(e))

data = []
mail = []
read_from_file('breachcompilationuniq.txt', 0, 10)
Some Output:
Image link here as it won't let me format after pasting.
There's also this stuff later on, I don't know what these are either.
It appears that you have a text file which is not in the default encoding assumed by Python (UTF-8), but which nevertheless uses byte values in the range 128-255. Try:
f = open(file, encoding='latin_1')
content = f.read()
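latin_1 can decode anything because it maps every byte value 0-255 one-to-one to the first 256 Unicode code points (whether it is the right encoding for your data is a separate question). A quick demonstration:

raw = bytes(range(256))            # every possible byte value
text = raw.decode('latin_1')       # never raises, one character per byte
assert text.encode('latin_1') == raw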

Read numpy data from GZip file over the network

I am attempting to download the MNIST dataset and decode it without writing it to disk (mostly for fun).
request_stream = urlopen('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz')
zip_file = GzipFile(fileobj=request_stream, mode='rb')
with zip_file as fd:
    magic, numberOfItems = struct.unpack('>ii', fd.read(8))
    rows, cols = struct.unpack('>II', fd.read(8))
    images = np.fromfile(fd, dtype='uint8')  # < here be dragons
    images = images.reshape((numberOfItems, rows, cols))
    return images
This code fails with OSError: obtaining file position failed, an error that seems to be ungoogleable. What could the problem be?
The problem seems to be that what gzip and similar modules provide aren't real file objects (unsurprisingly), while numpy's np.fromfile() attempts to read through the actual FILE* pointer, so this cannot work.
If it's OK to read the entire file into memory (which it might not be), then this can be worked around by reading all non-header information into a bytearray and deserializing from that:
rows, cols = struct.unpack('>II', fd.read(8))
b = bytearray(fd.read())
images = np.frombuffer(b, dtype='uint8')
images = images.reshape((numberOfItems, rows, cols))
return images
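Put together, a self-contained version of the round trip might look like this (a sketch based on the question's URL and IDX header layout; I haven't verified it against the live server):

import struct
from gzip import GzipFile
from urllib.request import urlopen

import numpy as np

def load_idx_images(url):
    with GzipFile(fileobj=urlopen(url), mode='rb') as fd:
        magic, number_of_items = struct.unpack('>ii', fd.read(8))
        rows, cols = struct.unpack('>II', fd.read(8))
        # Read the remaining payload into memory, then deserialize it:
        b = bytearray(fd.read())
        images = np.frombuffer(b, dtype='uint8')
    return images.reshape((number_of_items, rows, cols))

images = load_idx_images('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz')
print(images.shape)  # expected: (10000, 28, 28)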
