How can I decode a .bin into a .pdf - python-3.x

I extracted an embedded object from an excel spreadsheet that was a pdf but the excel zip file saves embedded objects as binary files.
I am trying to read the binary file and return it to it's original format as a pdf. I took some code from another question with a similar issue but when i try opening the pdf adobe gives error "can't open because file is damaged...not decoded correctly.."
Does anyone know of a way to do this?
with open('oleObject1.bin','rb') as f:
binaryData = f.read()
print(binaryData)
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
fout.write(base64.decodebytes(binaryData))
Link to the object file on github

Thanks Ryan, I was able to see what you are talking about. Here is solution for future reference.
str1 = b'%PDF-' # Begin PDF
str2 = b'%%EOF' # End PDF
with open('oleObject1.bin', 'rb') as f:
binary_data = f.read()
print(binary_data)
# Convert BYTE to BYTEARRAY
binary_byte_array = bytearray(binary_data)
# Find where PDF begins
result1 = binary_byte_array.find(str1)
print(result1)
# Remove all characters before PDF begins
del binary_byte_array[:result1]
print(binary_byte_array)
# Find where PDF ends
result2 = binary_byte_array.find(str2)
print(result2)
# Subtract the length of the array from the position of where PDF ends (add 5 for %%OEF characters)
# and delete that many characters from end of array
print(len(binary_byte_array))
to_remove = len(binary_byte_array) - (result2 + 5)
print(to_remove)
del binary_byte_array[-to_remove:]
print(binary_byte_array)
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
fout.write(binary_byte_array)

The bin file contains a valid PDF. There is no decoding required. The bin file though does have bytes before and after the PDF that need to be trimmed.
To get the first byte look for the first occurrence of string %PDF-
To get the final byte look for the last %%EOF.
Note, I do not know what "format" the leading/trailing bytes are, that are added by Excel. The solution above obliviously would not work if either of the ascii strings above could also be in the leading/trailing data.

You should try using a python library that allows you to write pdf files like reportlab or pyPDF

Related

Python how to extract the xml part from xml.p7m file

I have to extract information from a xml.p7m (Italian invoice with digital signature function, I think at least.).
The extraction part is already done and works fine with the usual xml from Italy, but since we get those xml.p7m too (which I just recently discovered), I'm stuck, because I can't figure out how to deal with those.
I just want the xml part so I start with those splits to remove the signature part:
with open(path, encoding='unicode_escape') as f:
txt = '<?xml version="1.0"' + re.split('<?xml version="1.0"',f.read())[1]
txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"
So what I'm stuck with now is that there are still parts like this in the xml:
""" <Anagrafica>
<Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>
</Anagraf♦♥èica>"""
which makes the xml not well formed, obviously and the data extraction is not working.
I have to use unicode_escape to open the file and remove those lines, because otherwise I would get an error because those signature parts can't be encoded in utf-8.
If I encode this part, I get:
b' <Anagrafica>\n <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>\n </Anagraf\xe2\x99\xa6\xe2\x99\xa5\xc3\xa8ica>'
Anyone an idea on how to extract only the xml part from the xml?
Btw the xml should be: but if I open the xml, there are already characters that don't belong to the utf-8 charset or something?
Edit:
The way I did it at first was really not optimal. There was to much manual work, so I searched further for a real solution and found this:
from OpenSSL._util import (
ffi as _ffi,
lib as _lib,
)
def removeSignature(fileString):
p7 = crypto.load_pkcs7_data(crypto.FILETYPE_ASN1, fileString)
bio_out =crypto._new_mem_buf()
res = _lib.PKCS7_verify(p7._pkcs7, _ffi.NULL, _ffi.NULL, _ffi.NULL, bio_out, _lib.PKCS7_NOVERIFY|_lib.PKCS7_NOSIGS)
if res == 1:
return(crypto._bio_to_string(bio_out).decode('UTF-8'))
else:
errno = _lib.ERR_get_error()
errstrlib = _ffi.string(_lib.ERR_lib_error_string(errno))
errstrfunc = _ffi.string(_lib.ERR_func_error_string(errno))
errstrreason = _ffi.string(_lib.ERR_reason_error_string(errno))
return ""
What I'm doing now is checking the xml if it's allready in proper xml format, or if it has to be decoded at first, after that I remove the signature and form the xml tree, so I can do the xml stuff I need to do:
if filePath.lower().endswith('p7m'):
logger.infoLog(f"Try open file: {filePath}")
with open(filePath, 'rb') as f:
txt = f.read()
# no opening tag to find --> no xml --> decode the file, save it, and get the text
if not re.findall(b'<',txt):
image_64_decode = base64.decodebytes(txt)
image_result = open(path + 'decoded.xml', 'wb') # create a writable image and write the decoding result
image_result.write(image_64_decode)
image_result.close()
txt = open(path + 'decoded.xml', 'rb').read()
# try to parse the string
try:
logger.infoLog("Try parsing the first time")
txt = removeSignature(txt)
ET.fromstring(txt)
I had a similar problem, some chars in file were not decoded correctly.
It was caused by a BOM file type.
You can try to use utf-8-sig encoding to read the file, like this:
with open(path, encoding='utf-8-sig') as f:
...
The easiest system to use is openssl:
C:\OpenSSL-Win64\bin\openssl.exe smime -verify -noverify -in **your.xml.p7m** -inform DER -out **your.xml**

Colab OSError: [Errno 36] File name too long when reading a docx2text file

I am studying NLP techniques and while I have some experience with .txt files, using .docx has been troublesome. I am trying to use regex on strings, and since I am using a word document, this is my approach:
I will use textract to get a docx to txt and get the bytes to strings:
import textract
my_text = textract.process("1337.docx")
my_text = text.decode("utf-8")
I read the file:
def load_doc(filename):
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
return text
I then try and do some regexs such as remove all numbers and etc, and when executing it in the main:
def regextest(doc):
...
...
text = load_doc(my_text)
tokens = regextest(text)
print(tokens)
I get the exception:
OSError: [Errno 36] File name too long: Are you buying a Tesla?\n\n\n\n - I believe the pricing is...(and more text from te file)
I know I am transforming my docx file to a text file and then, when I read the "filename", it is actually the whole text. How can I preserve the file and make it work? How would you guys approach this?
It seems that you are using the contents of the file - my_text as the filename parameter to load_doc and hence the error.
I would think that you rather want to use one of the actual file names as a parameter, possibly '1337.docx' and not the contents of this file.

How to read many files have a specific format in python

I am a little bit confused in how to read all lines in many files where the file names have format from "datalog.txt.98" to "datalog.txt.120".
This is my code:
import json
file = "datalog.txt."
i = 97
for line in file:
i+=1
f = open (line + str (i),'r')
for row in f:
print (row)
Here, you will find an example of one line in one of those files:
I need really to your help
I suggest using a loop for opening multiple files with different formats.
To better understand this project I would recommend researching the following topics
for loops,
String manipulation,
Opening a file and reading its content,
List manipulation,
String parsing.
This is one of my favourite beginner guides.
To set the parameters of the integers at the end of the file name I would look into python for loops.
I think this is what you are trying to do
# create a list to store all your file content
files_content = []
# the prefix is of type string
filename_prefix = "datalog.txt."
# loop from 0 to 13
for i in range(0,14):
# make the filename variable with the prefix and
# the integer i which you need to convert to a string type
filename = filename_prefix + str(i)
# open the file read all the lines to a variable
with open(filename) as f:
content = f.readlines()
# append the file content to the files_content list
files_content.append(content)
To get rid of white space from file parsing add the missing line
content = [x.strip() for x in content]
files_content.append(content)
Here's an example of printing out files_content
for file in files_content:
print(file)

How to convert Hex to original file format?

I have a .tgz file that was formatted as shell code, it looks like this (Hex):
"\x1F\x8B\x08\x00\x44\x7A\x91\x4F\x00\x03\xED\x59\xED\x72.."
It was generated this way (python3):
import os
def main():
dump_src = "MyPlugin.tgz"
fc = ""
try:
with open(dump_src, 'rb') as fd:
fcr = fd.read()
for byte in bytearray(fcr):
fc += "\\x{:02x}".format(byte)
except:
fcr = dump_src
for byte in bytearray(fcr):
fc += "\\x{:02x}".format(byte)
print(fc)
# failed attempt:
fcback = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
print (fcback)
if __name__ == "__main__":
main()
How can I convert this back to the original tgz archive?
Edit: failed attempt in the last section outputs this:
b'\x8b\x00\x10]\x03\x93o0\x85%\xe2!\xa4H\xf1Fi\xa7\x15\xf61&\x13N\xd9[\xfag\x11V\x97\xd3\xfb%\xf7\xe3\\\xae\xc2\xff\xa4>\xaf\x11\xcc\x93\xf1\x0c\x93\xa4\x1b\xefxj\xc3?\xf9\xc1\xe8\xd1\xd9\x01\x97qB"\x1a\x08\x9cO\x7f\xe9\x19\xe3\x9c\x05\xf2\x04a\xaa\x00A,\x15"RN-\xb6\x18K\x85\xa1\x11\x83\xac/\xffR\x8a\xa19\xde\x10\x0b\x08\x85\x93\xfc]\x8a^\xd2-T\x92\x9a\xcc-W\xc7|\xba\x9c\xb3\xa6V0V H1\x98\xde\x03#\x14\'\n 1Y\xf7R\x14\xe2#\xbe*:\xe0\xc8\xbb\xc9\x0bo\x8bm\xed.\xfd\xae\xef\x9fT&\xa1\xf4\xcf\xa7F\xf4\xef\xbb"8"\xb5\xab,\x9c\xbb\xfc3\x8b\xf5\x88\xf4A\x0ek%5eO\xf4:f\x0b\xd6\x1bi\xb6\xf3\xbf\xf7\xf9\xad\xb5[\xdba7\xb8\xf9\xcd\xba\xdd,;c\x0b\xaaT"\xd4\x96\x17\xda\x07\x87& \xceH\xd6\xbf\xd2\xeb\xb4\xaf\xbd\xc2\xee\xfc\'3zU\x17>\xde\x06u\xe3G\x7f\x1e\xf3\xdf\xb6\x04\x10A\x04\x10A\x04\x10A\x04\x10A\xff\x9f\xab\xe8(\x00'
And when I output it to a file (e.g. via python3 main.py > MyFile.tgz) the file is corrupted.
Since you know the format of the data (each byte is encoded as a string of 4 characters in the format "\xAB") it's easy to revert the conversion and get the original bytes again. It'll only take one line of Python code:
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
This uses:
range(start, stop, step) with step 4 to iterate in groups of 4 characters through your string
slicing to get each group of 2 hexadecimal digits
int(x, base) to convert the hexadecimal string to an integer
a generator expression to immediately pass the converted elements to:
bytes() to create a bytes object with the data
The variable data is now of type bytes and you could directly write it to a file (to decompress with an external zip program), or pass it to zlib.decompress() (to further process it in Python).
UPDATE (follow-up on the comments and updated question):
Firstly, I have tested the above code and it does result in the same bytes as the input. Are you really sure that the example output in your question is the actual result of the code in your question? Please try to be careful when copying code and/or output. A few remarks:
Your code is not properly formatted, so I cannot run it without making modifications. And when I have made modifications to the code, I might run different code than you do, yielding different results. So next time please copy-paste your exact (working, tested) code without modifications.
The format string in your code uses lowercase hexadecimal format, and your first example output uses uppercase. So that output cannot be from this code.
I don't have access to your file "MyPlugin.tgz", but when I test your code with another .tgz file (after fixing the IndentationErrors), my output is correct. It starts with \x1f\x8b as expected (this is the magic number in the gzip header). I can't explain why your output is different...
Secondly, it seems like you don't fully understand how bytes and string representations work. When you write print(fcback), a string representation of the Python object fcback (in this case a bytes object) is printed. The string representation of a bytes object is not the same as the binary data! When printing a bytes object, each byte that corresponds to a printable ASCII character is replaced by that character, other bytes are escaped (similar to the formatted string that your code generates). Also, it starts with b' and ends with '.
You cannot print binary data to your terminal and then pipe the output to a file. This will result in a different file. The correct way to write the data to a file is using file.write(data) in your Python code.
Here's a fully working example:
def binary_to_text(data):
"""Convert a bytes object to a formatted text string."""
text = ""
for byte in data:
text += "\\x{:02x}".format(byte)
return text
def text_to_binary(text):
"""Convert a formatted text string to a bytes object."""
return bytes(int(text[i+2:i+4], 16) for i in range(0, len(text), 4))
def main():
# Read the binary data from input file:
with open('MyPlugin.tgz', 'rb') as input_file:
input_data = input_file.read()
# Convert binary to text (based on your original code):
text = binary_to_text(input_data)
print(text[0:100])
# Convert the text back to binary:
output_data = text_to_binary(text)
print(output_data[0:100])
# Write the binary data back to a file:
with open('MyPlugin-restored.tgz', 'wb') as output_file:
output_file.write(output_data)
if __name__ == '__main__':
main()
Note that I only print the first 100 elements to keep the output short. Also notice that the second print-statement prints a much longer text. This is because the first print gets 100 characters (which are printed "as is"), while the second print gets 100 bytes (of which most bytes are escaped, causing the output to be longer).

Writing to files in ASCII with Python3, not UTF8

I have a program that I created with two sections.
The first one copies a text file with an integer in the middle of the file name in this format.
file = "Filename" + "str(int)" + ".txt"
the user can create as many copies of the file that they would like.
The second part of the program is what I am having the problem with. There is an integer at the very bottom of the file that is to correspond with the integer in the file name. After the first part is done, I open each file one at a time in "r+" read/write format. So I can file.seek(1000) to about where the integer is in the file.
Now in my opinion the next part should be easy. I should just simply have to write str(int) into the file right here. But it wasn't that easy. It worked just fine doing it like that in Linux at home, but at work on Windows it proved difficult. What I ended up having to do after file.seek(1000) is write to the file using Unicode UTF-8. I accomplished this with this code snippet of the rest of the program. I will document it so that it is able to be understood what is going on. Instead of having to write this in Unicode, I would love to be able to write this in good old regular English ASCII characters. Eventually this program will be expanded to include a lot more data at the bottom of each file. Having to write the data in Unicode is going to make things extremely difficult. If I just write the data without turning it into Unicode this is the result. This string is supposed to say #2 =1534, instead it says #2 =ㄠ㌵433.
If someone can show me what I am doing wrong that would be great. I would love to just use something like file.write('1534') to write the data to the file instead of having to do it in Unicode UTF-8.
while a1 < d1 :
file = "file" + str(a1) + ".par"
f = open(file, "r+")
f.seek(1011)
data = f.read() #reads the data from that point in the file into a variable.
numList= list(str(a1)) # "a1" is the integer in the file name. I had to turn the integer into a list to accomplish the next task.
replaceData = '\x00' + numList[0] + '\x00' + numList[1] + '\x00' + numList[2] + '\x00' + numList[3] + '\x00' #This line turns the integer into Utf 8 Unicode. I am by no means a Unicode expert.
currentData = data #probably didn't need to be done now that I'm looking at this.
data = data.replace(currentData, replaceData) #replaces the Utf 8 string in the "data" variable with the new Utf 8 string in "replaceData."
f.seek(1011) # Return to where I need to be in the file to write the data.
f.write(data) # Write the new Unicode data to the file
f.close() #close the file
f.close() #make sure the file is closed (sometimes it seems that this fails in Windows.)
a1 += 1 #advances the integer, and then return to the top of the loop
This is an example of writing to a file in ASCII. You need to open the file in byte mode, and using the .encode method for strings is a convenient way to get the end result you want.
s = '12345'
ascii = s.encode('ascii')
with open('somefile', 'wb') as f:
f.write(ascii)
You can obviously also open in rb+ (read and write byte mode) in your case if the file already exists.
with open('somefile', 'rb+') as f:
existing = f.read()
f.write(b'ascii without encoding!')
You can also just pass string literals with the b prefix, and they will be encoded with ascii as shown in the second example.

Resources