I have to extract information from an xml.p7m file (an Italian invoice with a digital signature, at least I think so).
The extraction part is already done and works fine with the usual XML from Italy, but since we also get those xml.p7m files (which I only recently discovered), I'm stuck, because I can't figure out how to deal with them.
I just want the XML part, so I start with these splits to remove the signature part:
with open(path, encoding='unicode_escape') as f:
    txt = '<?xml version="1.0"' + re.split('<?xml version="1.0"',f.read())[1]
    txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"
So what I'm stuck with now is that there are still parts like this in the xml:
""" <Anagrafica>
<Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>
</Anagraf♦♥èica>"""
which obviously makes the XML not well formed, so the data extraction does not work.
I have to use unicode_escape to open the file and remove those lines, because otherwise I get an error: the signature parts can't be decoded as utf-8.
If I encode this part, I get:
b' <Anagrafica>\n <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>\n </Anagraf\xe2\x99\xa6\xe2\x99\xa5\xc3\xa8ica>'
Does anyone have an idea how to extract only the XML part from the xml.p7m?
Btw the xml should be: but if I open the xml, there are already characters that don't belong to the utf-8 charset or something?
Edit:
The way I did it at first was really not optimal. There was too much manual work, so I searched further for a real solution and found this:
from OpenSSL import crypto
from OpenSSL._util import (
    ffi as _ffi,
    lib as _lib,
)

def removeSignature(fileString):
    # parse the DER-encoded PKCS#7 container
    p7 = crypto.load_pkcs7_data(crypto.FILETYPE_ASN1, fileString)
    bio_out = crypto._new_mem_buf()
    # write the signed content to bio_out without verifying the signature
    res = _lib.PKCS7_verify(p7._pkcs7, _ffi.NULL, _ffi.NULL, _ffi.NULL, bio_out, _lib.PKCS7_NOVERIFY|_lib.PKCS7_NOSIGS)
    if res == 1:
        return crypto._bio_to_string(bio_out).decode('UTF-8')
    else:
        errno = _lib.ERR_get_error()
        errstrlib = _ffi.string(_lib.ERR_lib_error_string(errno))
        errstrfunc = _ffi.string(_lib.ERR_func_error_string(errno))
        errstrreason = _ffi.string(_lib.ERR_reason_error_string(errno))
        return ""
What I'm doing now is checking whether the file already contains proper XML or whether it has to be base64-decoded first. After that I remove the signature and build the XML tree, so I can do the XML processing I need:
if filePath.lower().endswith('p7m'):
    logger.infoLog(f"Try open file: {filePath}")
    with open(filePath, 'rb') as f:
        txt = f.read()
    # no opening tag to find --> no xml --> decode the file, save it, and get the text
    if not re.findall(b'<',txt):
        image_64_decode = base64.decodebytes(txt)
        image_result = open(path + 'decoded.xml', 'wb') # create a writable image and write the decoding result
        image_result.write(image_64_decode)
        image_result.close()
        txt = open(path + 'decoded.xml', 'rb').read()
    # try to parse the string
    try:
        logger.infoLog("Try parsing the first time")
        txt = removeSignature(txt)
        ET.fromstring(txt)
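The base64 check above could also be done in memory, without writing a temporary decoded.xml. This is only a sketch: normalize_p7m_bytes is a hypothetical helper name, and the sample bytes are made up to stand in for a real signed file.

```python
import base64


def normalize_p7m_bytes(raw: bytes) -> bytes:
    """Return the raw container bytes, base64-decoding them first if needed.

    Same heuristic as above: if there is no '<' anywhere, assume the whole
    file is base64 text rather than a binary container with embedded XML.
    """
    if b'<' not in raw:
        return base64.decodebytes(raw)
    return raw


# Made-up stand-in for a binary .p7m file that embeds XML.
der_like = b'\x30\x82<?xml version="1.0"?><FatturaElettronica/>\x00\x01'
# The same content as it would look if it had been base64-wrapped.
wrapped = base64.encodebytes(der_like)

print(normalize_p7m_bytes(wrapped) == der_like)
```

The result can then be passed to removeSignature directly.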
I had a similar problem, some chars in file were not decoded correctly.
It was caused by a BOM (byte order mark) at the start of the file.
You can try to use utf-8-sig encoding to read the file, like this:
with open(path, encoding='utf-8-sig') as f:
    ...
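To see what utf-8-sig changes, here is a small sketch on made-up bytes (not a real invoice). It only helps when the stray characters come from a BOM; it does not remove a signature.

```python
import codecs

# Made-up XML bytes with a UTF-8 BOM in front.
raw = codecs.BOM_UTF8 + '<?xml version="1.0"?><a>è</a>'.encode('utf-8')

with_bom = raw.decode('utf-8')         # BOM survives as U+FEFF
without_bom = raw.decode('utf-8-sig')  # BOM is stripped

print(repr(with_bom[0]))
print(without_bom[:5])
```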
The easiest approach is to use openssl:
C:\OpenSSL-Win64\bin\openssl.exe smime -verify -noverify -in your.xml.p7m -inform DER -out your.xml
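If you want to call that command from Python, a thin wrapper around subprocess is one option. This is a sketch: build_openssl_command and extract_xml are hypothetical helper names, and it assumes an openssl executable is on the PATH (or that you pass its full path).

```python
import subprocess


def build_openssl_command(src, dst, openssl='openssl'):
    """Build the same smime command line as above as an argument list."""
    return [openssl, 'smime', '-verify', '-noverify',
            '-in', src, '-inform', 'DER', '-out', dst]


def extract_xml(src, dst, openssl='openssl'):
    """Run openssl; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_openssl_command(src, dst, openssl), check=True)


print(build_openssl_command('your.xml.p7m', 'your.xml'))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.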
I am studying NLP techniques and while I have some experience with .txt files, using .docx has been troublesome. I am trying to use regex on strings, and since I am using a word document, this is my approach:
I will use textract to get a docx to txt and get the bytes to strings:
import textract
my_text = textract.process("1337.docx")
my_text = my_text.decode("utf-8")
I read the file:
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
I then try and do some regexs such as remove all numbers and etc, and when executing it in the main:
def regextest(doc):
    ...
    ...

text = load_doc(my_text)
tokens = regextest(text)
print(tokens)
I get the exception:
OSError: [Errno 36] File name too long: Are you buying a Tesla?\n\n\n\n - I believe the pricing is...(and more text from the file)
I know I am transforming my docx file to a text file and then, when I read the "filename", it is actually the whole text. How can I preserve the file and make it work? How would you guys approach this?
It seems that you are passing the contents of the file (my_text) as the filename parameter to load_doc, hence the error.
I would think that you want to pass an actual file name instead, possibly '1337.docx', and not the contents of that file. Since textract has already given you the text, you may not need load_doc at all.
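A minimal sketch of the idea, with the textract call replaced by a made-up bytes value so it runs without the library. The body of regextest here is only a guess at "remove all numbers"; the original function was not shown.

```python
import re


def regextest(text):
    """Hypothetical stand-in for the question's function: drop digits, tokenize."""
    return re.sub(r"\d+", "", text).split()


# Stand-in for textract.process("1337.docx"), which returns bytes.
raw = b"Are you buying a Tesla 3?\n\n - I believe the pricing is 35000"

my_text = raw.decode("utf-8")   # already the document text, not a file name
tokens = regextest(my_text)     # so pass it straight to the regex step
print(tokens)
```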
I extracted an embedded object from an Excel spreadsheet. It was a PDF, but the Excel zip file saves embedded objects as binary files.
I am trying to read the binary file and return it to its original format as a PDF. I took some code from another question with a similar issue, but when I try opening the PDF, Adobe gives the error "can't open because file is damaged...not decoded correctly.."
Does anyone know of a way to do this?
import base64
import os

with open('oleObject1.bin','rb') as f:
    binaryData = f.read()
print(binaryData)
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(base64.decodebytes(binaryData))
Thanks Ryan, I was able to see what you are talking about. Here is the solution for future reference.
import os

str1 = b'%PDF-' # Begin PDF
str2 = b'%%EOF' # End PDF
with open('oleObject1.bin', 'rb') as f:
    binary_data = f.read()
print(binary_data)
# Convert BYTE to BYTEARRAY
binary_byte_array = bytearray(binary_data)
# Find where PDF begins
result1 = binary_byte_array.find(str1)
print(result1)
# Remove all characters before PDF begins
del binary_byte_array[:result1]
print(binary_byte_array)
# Find where PDF ends
result2 = binary_byte_array.find(str2)
print(result2)
# Delete everything after the %%EOF marker (5 characters long).
# Slicing from the front avoids `del x[-0:]`, which would empty the
# whole array when nothing follows the marker.
print(len(binary_byte_array))
del binary_byte_array[result2 + len(str2):]
print(binary_byte_array)
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(binary_byte_array)
The bin file contains a valid PDF. There is no decoding required. The bin file though does have bytes before and after the PDF that need to be trimmed.
To get the first byte look for the first occurrence of string %PDF-
To get the final byte look for the last %%EOF.
Note: I do not know what "format" the leading/trailing bytes added by Excel are in. The solution above obviously would not work if either of the ASCII strings above could also appear in the leading/trailing data.
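The trimming described here (first %PDF-, last %%EOF) can be sketched as a small function. extract_pdf is a hypothetical name, and the sample bytes are made up to imitate a wrapper around a PDF; the same caveat about the markers appearing in the wrapper data applies.

```python
def extract_pdf(blob: bytes) -> bytes:
    """Return the bytes from the first %PDF- up to and including the last %%EOF."""
    start = blob.find(b'%PDF-')
    end = blob.rfind(b'%%EOF')
    if start == -1 or end == -1 or end < start:
        raise ValueError('no embedded PDF found')
    return blob[start:end + len(b'%%EOF')]


# Made-up wrapper bytes around a tiny fake PDF body with two %%EOF markers.
sample = (b'\x02\x00OLE-header'
          + b'%PDF-1.4\n1 0 obj\n%%EOF\nxref\n%%EOF'
          + b'\x00trailer-bytes')

pdf = extract_pdf(sample)
print(pdf)
```

Using rfind for the end marker matters for PDFs that were saved incrementally, since those contain more than one %%EOF.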
You should try using a python library that allows you to write pdf files like reportlab or pyPDF
Environment: Python 3.
There are many files; some of them are encoded in gbk, others in utf-8.
I want to extract all the jpg URLs with a regular expression.
For example, s.html is encoded in gbk.
tree = open("/tmp/s.html","r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte
tree = open("/tmp/s.html","r",encoding="gbk").read()
pat = r"http://.+\.jpg"
result = re.findall(pat,tree)
print(result)
['http://somesite/2017/06/0_56.jpg']
It is a huge job to open every file with its own specified encoding. I want a smarter way to extract the jpg URLs from all the files.
I had a similar problem, and this is how I solved it.
In get_file_encoding(filename), I first check whether the file starts with a BOM (Byte Order Mark). If so, I take the encoding from the BOM, using the function get_file_bom_encoding(filename).
If that does not return a value, I get a list from the function get_all_file_encodings(filename).
This list contains every encoding in which the file can be opened without errors. For my purpose I needed just one and did not care about the rest, so I simply chose the first value of the list: file_encoding = str(encoding_list[0][0])
This is obviously not 100% accurate: an encoding that decodes without errors is not necessarily the right one. But it gives you either the correct encoding from the BOM or the first encoding in the list with which the file can be opened without errors.
Here is the code:
# -*- coding: utf-8 -*-
import codecs
def get_file_bom_encoding(filename):
    with open(filename, 'rb') as openfileobject:
        head = openfileobject.read(4)
    # Compare raw bytes. UTF-32 must be tested before UTF-16 because the
    # UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
    if head.startswith(codecs.BOM_UTF8): return 'utf_8'
    if head.startswith(codecs.BOM_UTF32_BE): return 'utf_32'
    if head.startswith(codecs.BOM_UTF32_LE): return 'utf_32'
    if head.startswith(codecs.BOM_UTF16_BE): return 'utf_16'
    if head.startswith(codecs.BOM_UTF16_LE): return 'utf_16'
    return ''

def get_all_file_encodings(filename):
    encoding_list = []
    encodings = ('utf_8', 'utf_16', 'utf_16_le', 'utf_16_be',
                 'utf_32', 'utf_32_be', 'utf_32_le',
                 'cp850' , 'cp437', 'cp852', 'cp1252', 'cp1250' , 'ascii',
                 'utf_8_sig', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp500',
                 'cp720', 'cp737', 'cp775', 'cp855', 'cp856', 'cp857',
                 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864',
                 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932',
                 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1251',
                 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257',
                 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213',
                 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp',
                 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004',
                 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
                 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5',
                 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9',
                 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15',
                 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic',
                 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
                 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004',
                 'shift_jisx0213'
                 )
    for e in encodings:
        try:
            fh = codecs.open(filename, 'r', encoding=e)
            fh.readlines()
        except UnicodeDecodeError:
            fh.close()
        except UnicodeError:
            fh.close()
        else:
            encoding_list.append([e])
            fh.close()
            continue
    return encoding_list

def get_file_encoding(filename):
    file_encoding = get_file_bom_encoding(filename)
    if file_encoding != '':
        return file_encoding
    encoding_list = get_all_file_encodings(filename)
    file_encoding = str(encoding_list[0][0])
    if file_encoding[-3:] == '_be' or file_encoding[-3:] == '_le':
        file_encoding = file_encoding[:-3]
    return file_encoding

def main():
    print('This Python script is only for import functionality, it does not run interactively')

if __name__ == '__main__':
    main()
I am sure that there are modules/packages which can do this more accurately, but this uses only standard packages (which was another requirement I had).
It does mean that you read each file multiple times, which is not a very fast way of doing things. You may be able to adapt this to suit your own particular problem, or even improve on it. The "reading multiple times" part is the first thing to look at: read the file once and stop as soon as one working encoding is found.
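One way to avoid the repeated reads is to load the bytes once and try each candidate on the in-memory data, stopping at the first success. This is a sketch, not part of the script above: first_working_encoding is a hypothetical name, and the short candidate list is an assumption you would replace with your own.

```python
def first_working_encoding(data, candidates=('utf_8', 'gbk', 'cp1252')):
    """Return the first candidate that decodes data without error, else ''."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return ''


# Made-up sample bytes; with a real file you would use open(path, 'rb').read().
gbk_bytes = '中文'.encode('gbk')
utf8_bytes = 'é'.encode('utf-8')

print(first_working_encoding(gbk_bytes))
print(first_working_encoding(utf8_bytes))
```

The order of the candidates matters: permissive single-byte encodings such as cp1252 or latin_1 accept almost any input, so they belong at the end.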
If they have mixed encoding, you could try one encoding and fall back to another:
# first open as binary
with open(..., 'rb') as f:
    f_contents = f.read()
try:
    contents = f_contents.decode('UTF-8')
except UnicodeDecodeError:
    contents = f_contents.decode('gbk')
...
If they are html files, you may also be able to find the encoding tag, or search them as binary with a binary regex:
contents = open(..., 'rb').read()
regex = re.compile(rb'http://.+\.jpg')
result = regex.findall(contents)
# now you'll probably want to `.decode()` each of the urls, but you should be able to do that pretty trivially with even the `ASCII` codec
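A runnable version of the binary-regex idea, on made-up page bytes rather than real files. The pattern is my own tightening (non-greedy, stops at quotes, whitespace, angle brackets and non-ASCII bytes), and find_jpg_urls is a hypothetical name.

```python
import re

# Bytes pattern: works on the raw file regardless of its text encoding.
JPG_URL = re.compile(rb'http://[^\s"\'<>\x80-\xff]+?\.jpg')


def find_jpg_urls(raw):
    """Return all .jpg URLs in raw bytes as str (the matches are pure ASCII)."""
    return [match.decode('ascii') for match in JPG_URL.findall(raw)]


# Made-up pages, one per encoding.
gbk_page = '<p>图片</p><img src="http://somesite/2017/06/0_56.jpg">'.encode('gbk')
utf8_page = ('<p>图片</p><img src="http://somesite/a.jpg">'
             '<img src="http://somesite/b.jpg">').encode('utf-8')

print(find_jpg_urls(gbk_page))
print(find_jpg_urls(utf8_page))
```

Because the pattern never needs to decode the surrounding text, the same call works for both the gbk and the utf-8 files.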
Though now that I think of it, you probably don't really want to use a regex to capture the links, as you'll then have to deal with HTML entities (such as &amp;). You may do better with something like pyquery.
Here's a quick example using pyquery
import pyquery

contents = open(..., 'rb').read()
pq = pyquery.PyQuery(contents)
images = pq.find('img')
for img in images:
    img = pyquery.PyQuery(img)
    src = img.attr('src')
    # skip <img> tags that have no src attribute
    if src and src.endswith('.jpg'):
        print(src)
Not on my computer with things installed, so mileage with these code samples may vary
How do I split a text string according to an explicit newline ('\n')?
Unfortunately, instead of a properly formatted csv file, I am dealing with a long string of text with "\n" where the newline would be. (example format: "A0,B0\nA1,B1\nA2,B2\nA3,B3\n ...") I thought a simple bad_csv_list = text.split('\n') would give me a list of the two-valued cells (example split ['A0,B0', 'A1,B1', 'A2,B2', 'A3,B3', ...]). Instead I end up with one cell and "\n" gets converted to "\\n". I tried copy-pasting a section of the string and using split('\n') and it works as I had hoped. The print statement for the file object tells me the following:
<_io.TextIOWrapper name='stats.csv' mode='r' encoding='cp1252'>
...so I suspect the problem is with the cp1252 encoding? Of note tho: Notepad++ says the file I am working with is "UTF-8 without BOM"... I've looked in the docs and around SO and tried importing io and codec and prepending the open statement and declaring encoding='utf8' but I am at a loss and I don't really grok text encoding. Maybe there is a better solution?
from sys import argv
# import io, codecs
filename = argv[1]
file_object = open(filename, 'r')
# file_object = io.open(filename, 'r', encoding='utf8')
# file_object = codecs.open(filename, 'r', encoding='utf8')
file_contents = file_object.read()
file_list = file_contents.split('\n')
print("1.) Here's the name of the file: {}".format(filename))
print("2.) Here's the file object info: {}".format(file_object))
print("3.) Here's all the files contents:\n{}".format(file_contents))
print("4.) Here's a list of the file contents:\n{}".format(file_list))
Any help would be greatly appreciated, thank you.
If it helps to explain what I am dealing with, here's the contents of the stats.csv file:
Albuquerque,749\nAnaheim,371\nAnchorage,828\nArlington,503\nAtlanta,1379\nAurora,425\nAustin,408\nBakersfield,542\nBaltimore,1405\nBoston,835\nBuffalo,1288\nCharlotte-Mecklenburg,647\nCincinnati,974\nCleveland,1383\nColorado Springs,455\nCorpus Christi,658\nDallas,675\nDenver,615\nDetroit,2122\nEl Paso,423\nFort Wayne,362\nFort Worth,587\nFresno,543\nGreensboro,563\nHenderson,168\nHouston,992\nIndianapolis,1185\nJacksonville,617\nJersey City,734\nKansas City,1263\nLas Vegas,784\nLexington,352\nLincoln,397\nLong Beach,575\nLos Angeles,481\nLouisville Metro,598\nMemphis,1750\nMesa,399\nMiami,1172\nMilwaukee,1294\nMinneapolis,992\nMobile,522\nNashville,1216\nNew Orleans,815\nNew York,639\nNewark,1154\nOakland,1993\nOklahoma City,919\nOmaha,594\nPhiladelphia,1160\nPhoenix,636\nPittsburgh,752\nPlano,130\nPortland,517\nRaleigh,423\nRiverside,443\nSacramento,738\nSan Antonio,503\nSan Diego,413\nSan Francisco,704\nSan Jose,363\nSanta Ana,401\nSeattle,597\nSt. Louis,1776\nSt. Paul,722\nStockton,1548\nTampa,616\nToledo,1171\nTucson,724\nTulsa,990\nVirginia Beach,169\nWashington,1177\nWichita,742
And the result from the split('\n'):
['Albuquerque,749\\nAnaheim,371\\nAnchorage,828\\nArlington,503\\nAtlanta,1379\\nAurora,425\\nAustin,408\\nBakersfield,542\\nBaltimore,1405\\nBoston,835\\nBuffalo,1288\\nCharlotte-Mecklenburg,647\\nCincinnati,974\\nCleveland,1383\\nColorado Springs,455\\nCorpus Christi,658\\nDallas,675\\nDenver,615\\nDetroit,2122\\nEl Paso,423\\nFort Wayne,362\\nFort Worth,587\\nFresno,543\\nGreensboro,563\\nHenderson,168\\nHouston,992\\nIndianapolis,1185\\nJacksonville,617\\nJersey City,734\\nKansas City,1263\\nLas Vegas,784\\nLexington,352\\nLincoln,397\\nLong Beach,575\\nLos Angeles,481\\nLouisville Metro,598\\nMemphis,1750\\nMesa,399\\nMiami,1172\\nMilwaukee,1294\\nMinneapolis,992\\nMobile,522\\nNashville,1216\\nNew Orleans,815\\nNew York,639\\nNewark,1154\\nOakland,1993\\nOklahoma City,919\\nOmaha,594\\nPhiladelphia,1160\\nPhoenix,636\\nPittsburgh,752\\nPlano,130\\nPortland,517\\nRaleigh,423\\nRiverside,443\\nSacramento,738\\nSan Antonio,503\\nSan Diego,413\\nSan Francisco,704\\nSan Jose,363\\nSanta Ana,401\\nSeattle,597\\nSt. Louis,1776\\nSt. Paul,722\\nStockton,1548\\nTampa,616\\nToledo,1171\\nTucson,724\\nTulsa,990\\nVirginia Beach,169\\nWashington,1177\\nWichita,742']
Why does it ADD a \ ?
dOh!!! ROYAL FACE PALM! I just wrote all this out and then realized that the file contains a literal backslash followed by "n", not a real newline, so all I needed to do was escape the backslash in the separator:
file_list = file_contents.split('\\n')
I'm gonna post this anyways so y'all can have a chuckle ^_^
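For anyone else who lands here, a short sketch of the difference, using the first three entries of stats.csv as a raw string (so the backslash stays a literal character, as it is in the file). The extra backslash in the printed list is only repr() escaping the literal backslash for display.

```python
# Raw string: each \n here is two characters, a backslash and an "n".
file_contents = r"Albuquerque,749\nAnaheim,371\nAnchorage,828"

no_split = file_contents.split('\n')    # real newline: nothing to split on
good_split = file_contents.split('\\n') # literal backslash + n

rows = [line.split(',') for line in good_split]
print(len(no_split))
print(rows)
```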
How to start creating my own filetype in Python ? I have a design in mind but how to pack my data into a file with a specific format ?
For example I would like my fileformat to be a mix of an archive ( like other format such as zip, apk, jar, etc etc, they are basically all archives ) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement for this is about doing all this with the default modules for Cpython, without external modules.
I know that this can be long to explain and do, but I can't see how to start this in Python 3.x with Cpython.
Try this:
from zipfile import ZipFile
import json

data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive containing a JSON file for the data (JSON is easy to read back in many languages). You can add files to the archive with myzip.write or myzip.writestr. You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
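Putting both halves together, a container with a settings section plus packed files might look like the sketch below. It uses only the standard library; write_container, read_container and the 'files/' prefix are names I made up for illustration, and an in-memory buffer stands in for a real file on disk.

```python
import io
import json
from zipfile import ZipFile, ZIP_DEFLATED


def write_container(fileobj, settings, files):
    """Write settings as digest.json and each payload under files/."""
    with ZipFile(fileobj, 'w', compression=ZIP_DEFLATED) as archive:
        archive.writestr('digest.json', json.dumps(settings))
        for name, payload in files.items():
            archive.writestr('files/' + name, payload)


def read_container(fileobj):
    """Return (settings dict, {name: bytes}) from a container."""
    with ZipFile(fileobj, 'r') as archive:
        settings = json.loads(archive.read('digest.json'))
        files = {name[len('files/'):]: archive.read(name)
                 for name in archive.namelist()
                 if name.startswith('files/')}
    return settings, files


buffer = io.BytesIO()  # stand-in for open('foo.filetype', 'wb')
write_container(buffer, {'version': 1, 'title': 'demo'}, {'a.txt': b'hello'})
buffer.seek(0)
settings, files = read_container(buffer)
print(settings, files)
```

Note that digest.json is still visible to any archive manager in this layout; it is only separated from the packed files by convention.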
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can no longer process the zip file.
Use this:
import base64
import gzip
import ast
def save(data):
    # use repr() so strings keep their quotes and can be parsed back
    data = "[{!r}]".format(data).encode()
    data = base64.b64encode(data)
    return gzip.compress(data)

def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
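An archive program may recognise the file as gzip and unpack the outer layer,
but all it will find inside is base64 text, which still has to be decoded to get at the data.
You can store any value that ast.literal_eval can parse back: strings, bytes, numbers, tuples, lists, dicts, sets, booleans and None.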
example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
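A quick round-trip check of the idea, kept in memory instead of writing files. This sketch assumes the value is embedded with repr() (the {!r} format), since a plain string would otherwise lose its quotes and fail to parse on load.

```python
import ast
import base64
import gzip


def save(data):
    """Serialize a Python literal: repr -> base64 -> gzip."""
    text = "[{!r}]".format(data).encode()
    return gzip.compress(base64.b64encode(text))


def load(blob):
    """Reverse of save(): gunzip -> base64-decode -> literal_eval."""
    text = base64.b64decode(gzip.decompress(blob)).decode()
    return ast.literal_eval(text)[0]


samples = [{"foo": "bar"}, "foo bar", b"foo bar", (1, 2.5, None)]
restored = [load(save(value)) for value in samples]
print(restored == samples)
```

Custom class instances cannot be stored this way, because ast.literal_eval only accepts literals.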
This may not be exactly what you asked for, but I think it may help you.
I faced a similar problem and ended up creating a zip file and then renaming its extension to my custom file format. It can still be opened with WinRAR, though.