xml.parsers.expat.ExpatError: not well-formed (invalid token) - python-3.x

When I use xmltodict to load the xml file below I get an error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
Here is my file:
<?xml version="1.0" encoding="utf-8"?>
<mydocument has="an attribute">
<and>
<many>elements</many>
<many>more elements</many>
</and>
<plus a="complex">
element as well
</plus>
</mydocument>
Source:
import xmltodict
with open('fileTEST.xml') as fd:
    xmltodict.parse(fd.read())
I am on Windows 10, using Python 3.6 and xmltodict 0.11.0
If I use ElementTree, it works:
tree = ET.ElementTree(file='fileTEST.xml')
for elem in tree.iter():
    print(elem.tag, elem.attrib)
mydocument {'has': 'an attribute'}
and {}
many {}
many {}
plus {'a': 'complex'}
Note: I might have encountered a newline problem.
Note 2: I compared two different files with Beyond Compare. It crashes on the file that is UTF-8 BOM encoded, and works on the plain UTF-8 file.
UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.
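A minimal check (a sketch, using the same file name as above) for whether the file actually starts with those BOM bytes:
import codecs

# does the file begin with the UTF-8 BOM (EF BB BF)?
with open('fileTEST.xml', 'rb') as f:
    has_bom = f.read(3) == codecs.BOM_UTF8

print('UTF-8 BOM present:', has_bom)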

I think you forgot to specify the encoding.
I suggest parsing the XML file first and serializing it back to a string before handing it to xmltodict:
import xml.etree.ElementTree as ET
import xmltodict
import json
tree = ET.parse('your_data.xml')
xml_data = tree.getroot()
# here you can change the encoding to the one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')
data_dict = dict(xmltodict.parse(xmlstr))
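As a small usage sketch (assuming the code above has run and data_dict holds the parsed document), you can inspect the result as JSON:
# pretty-print the parsed structure; json was imported above
print(json.dumps(data_dict, indent=2))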

In my case the file was being saved with a byte order mark, which is the default in Notepad++.
I re-saved the file as plain UTF-8 without the BOM.
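If you prefer to do the same thing from Python instead of Notepad++, a rough sketch (file name assumed) is to strip the BOM bytes and rewrite the file:
import codecs

# read the raw bytes, drop a leading UTF-8 BOM if present, and write the file back
with open('fileTEST.xml', 'rb') as f:
    raw = f.read()
if raw.startswith(codecs.BOM_UTF8):
    with open('fileTEST.xml', 'wb') as f:
        f.write(raw[len(codecs.BOM_UTF8):])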

Python 3
One Liner
data: dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))
Helper for .json and .xml
I wrote a small helper function to load .json and .xml files from a given path.
I thought it might come in handy for some people here:
import json
import xml.etree.ElementTree as ElementTree
import xmltodict

def load_json(path: str) -> dict:
    if path.endswith(".json"):
        print(f"> Loading JSON from '{path}'")
        with open(path, mode="r") as open_file:
            content = open_file.read()
        return json.loads(content)
    elif path.endswith(".xml"):
        print(f"> Loading XML as JSON from '{path}'")
        xml = ElementTree.tostring(ElementTree.parse(path).getroot())
        return xmltodict.parse(xml, attr_prefix="#", cdata_key="#text", dict_constructor=dict)
    print(f"> Loading failed for '{path}'")
    return {}
Notes
if you want to get rid of the # and #text markers in the json output, use the parameters attr_prefix="" and cdata_key=""
normally xmltodict.parse() returns an OrderedDict but you can change that with the parameter dict_constructor=dict
Usage
path = "my_data.xml"
data = load_json(path)
print(json.dumps(data, indent=2))
# OUTPUT
#
# > Loading XML as JSON from 'my_data.xml'
# {
#   "mydocument": {
#     "#has": "an attribute",
#     "and": {
#       "many": [
#         "elements",
#         "more elements"
#       ]
#     },
#     "plus": {
#       "#a": "complex",
#       "#text": "element as well"
#     }
#   }
# }
Sources
ElementTree.tostring()
ElementTree.parse()
xmltodict
json.dumps()

I had the same problem and I solved it by simply specifying the encoding in the open() call.
In this case it would be something like:
import xmltodict
with open('fileTEST.xml', encoding='utf8') as fd:
    xmltodict.parse(fd.read())

In my case, the issue was with the first 3 characters. So removing them worked:
import xmltodict
from xml.parsers.expat import ExpatError
with open('your_data.xml') as f:
    data = f.read()

try:
    doc = xmltodict.parse(data)
except ExpatError:
    doc = xmltodict.parse(data[3:])
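As a slightly more defensive variant (a sketch, not tested against your exact file), you can strip the BOM only when it is actually there, whether it shows up as the single character '\ufeff' or as the three mojibake characters 'ï»¿':
import xmltodict

with open('your_data.xml') as f:
    data = f.read()

# drop a decoded BOM ('\ufeff') or its mojibake form ('ï»¿') if present
for bom in ('\ufeff', 'ï»¿'):
    if data.startswith(bom):
        data = data[len(bom):]
        break

doc = xmltodict.parse(data)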

xmltodict seems unable to parse <?xml version="1.0" encoding="utf-8"?>
If you remove this line, it works.

Not specific to the original post, but for those who are running into the same error at a different line: I was able to fix it by correcting the XML/XHTML error.
In my case, the document I was working with had a text description containing a bare ampersand "&" instead of the escaped entity "&amp;", so I had to edit the file before running it through the parser.
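If editing the file by hand is not practical, a rough pre-processing sketch (the regex and file name are assumptions) is to escape bare ampersands before parsing:
import re
import xmltodict

with open('my_data.xml', encoding='utf-8') as f:
    text = f.read()

# escape '&' unless it already starts an entity like &amp; or a numeric reference &#...;
text = re.sub(r'&(?!amp;|lt;|gt;|quot;|apos;|#)', '&amp;', text)

doc = xmltodict.parse(text)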

Related

Python how to extract the xml part from xml.p7m file

I have to extract information from an xml.p7m file (an Italian electronic invoice with a digital signature, as far as I can tell).
The extraction part is already done and works fine with the usual xml from Italy, but since we get those xml.p7m too (which I just recently discovered), I'm stuck, because I can't figure out how to deal with those.
I just want the xml part so I start with those splits to remove the signature part:
with open(path, encoding='unicode_escape') as f:
    txt = '<?xml version="1.0"' + re.split('<?xml version="1.0"', f.read())[1]
    txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"
So what I'm stuck with now is that there are still parts like this in the xml:
""" <Anagrafica>
<Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>
</Anagraf♦♥èica>"""
which obviously makes the xml not well formed, and so the data extraction does not work.
I have to use unicode_escape to open the file and remove those lines, because otherwise I would get an error because those signature parts can't be encoded in utf-8.
If I encode this part, I get:
b' <Anagrafica>\n <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>\n </Anagraf\xe2\x99\xa6\xe2\x99\xa5\xc3\xa8ica>'
Anyone an idea on how to extract only the xml part from the xml?
Btw the xml should be: but if I open the xml, there are already characters that don't belong to the utf-8 charset or something?
Edit:
The way I did it at first was really not optimal. There was too much manual work, so I searched further for a real solution and found this:
from OpenSSL import crypto
from OpenSSL._util import (
    ffi as _ffi,
    lib as _lib,
)

def removeSignature(fileString):
    p7 = crypto.load_pkcs7_data(crypto.FILETYPE_ASN1, fileString)
    bio_out = crypto._new_mem_buf()
    res = _lib.PKCS7_verify(p7._pkcs7, _ffi.NULL, _ffi.NULL, _ffi.NULL, bio_out,
                            _lib.PKCS7_NOVERIFY | _lib.PKCS7_NOSIGS)
    if res == 1:
        return crypto._bio_to_string(bio_out).decode('UTF-8')
    else:
        errno = _lib.ERR_get_error()
        errstrlib = _ffi.string(_lib.ERR_lib_error_string(errno))
        errstrfunc = _ffi.string(_lib.ERR_func_error_string(errno))
        errstrreason = _ffi.string(_lib.ERR_reason_error_string(errno))
        return ""
What I'm doing now is checking whether the xml is already in proper xml format or has to be base64-decoded first; after that I remove the signature and build the xml tree, so I can do the xml work I need to do:
if filePath.lower().endswith('p7m'):
    logger.infoLog(f"Try open file: {filePath}")
    with open(filePath, 'rb') as f:
        txt = f.read()
    # no opening tag to find --> no xml --> decode the file, save it, and get the text
    if not re.findall(b'<', txt):
        image_64_decode = base64.decodebytes(txt)
        image_result = open(path + 'decoded.xml', 'wb')  # create a writable file and write the decoding result
        image_result.write(image_64_decode)
        image_result.close()
        txt = open(path + 'decoded.xml', 'rb').read()
    # try to parse the string
    try:
        logger.infoLog("Try parsing the first time")
        txt = removeSignature(txt)
        ET.fromstring(txt)
I had a similar problem: some chars in the file were not decoded correctly.
It was caused by a BOM at the start of the file.
You can try the utf-8-sig encoding to read the file, like this:
with open(path, encoding='utf-8-sig') as f:
    ...
The easiest system to use is openssl:
C:\OpenSSL-Win64\bin\openssl.exe smime -verify -noverify -in **your.xml.p7m** -inform DER -out **your.xml**
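The same command can be driven from Python, roughly like this (a sketch; openssl must be on the PATH and the paths are placeholders):
import subprocess

def strip_p7m_signature(p7m_path, xml_path):
    # shell out to openssl to verify and extract the signed XML content
    subprocess.run(
        ['openssl', 'smime', '-verify', '-noverify',
         '-in', p7m_path, '-inform', 'DER', '-out', xml_path],
        check=True,
    )

strip_p7m_signature('your.xml.p7m', 'your.xml')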

XML Parsing with Decode in Python

The XML File I'm trying to read starts with b':
b'<?xml version="1.0" encoding="UTF-8" ?><root><property_id type="dict"><n53987 type="int">54522</n53987><n65731 type="int">66266</n65731><n44322 type="int">44857</n44322><n11633 type="int">12148</n11633><n28192 type="int">28727</n28192><n69053 type="int">69588</n69053><n26529 type="int">27064</n26529><n4844 type="int">4865</n4844><n7625 type="int">7646</n7625><n54697 type="int">55232</n54697><n6210 type="int">6231</n6210><n26710 type="int">27245</n26710><n57915 type="int">58450</n57915
import xml.etree.ElementTree as etree
tree = etree.decode("UTF-8").parse("./property.xml")
How can I decode this file? And read the dict type afterwards?
So you can try this, but note that it returns an Element instance:
import ast
import xml.etree.ElementTree as etree

tree = None
with open("property.xml", "r") as xml_file:
    f = xml_file.read()

# convert string representation of bytes back to bytes
raw_xml_bytes = ast.literal_eval(f)

# read XML from raw bytes
tree = etree.fromstring(raw_xml_bytes)
Another way is to read the file, write the converted string version back out, and then re-read it; this returns an ElementTree instance. You can achieve it using the following:
tree = None
with open("property.xml", "r") as xml_file:
    f = xml_file.read()

# convert string representation of bytes back to bytes
raw_xml_bytes = ast.literal_eval(f)

# save the converted string version of the XML file
with open('output.xml', 'w') as file_obj:
    file_obj.write(raw_xml_bytes.decode())

# read saved XML file
with open('output.xml', 'r') as xml_file:
    tree = etree.parse(xml_file)
Opening the XML file in binary mode and reading it will return data of type bytes, which has a .decode() method (cf. https://docs.python.org/3/library/stdtypes.html#bytes.decode). You can do the following, using the appropriate encoding name:
my_xml_text = xml_file.read().decode('utf-8')
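Putting that together, a minimal sketch (assuming the file contains raw XML bytes rather than the textual b'...' representation shown in the question):
import xml.etree.ElementTree as etree

# open in binary mode so read() returns bytes, then decode and parse
with open('./property.xml', 'rb') as xml_file:
    my_xml_text = xml_file.read().decode('utf-8')

root = etree.fromstring(my_xml_text)
print(root.tag)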

How to read simple xml file with lxml with iso-8859-1 encoding using child.text function

I have xml files with the following simple structure:
<?xml version="1.0" encoding="iso-8859-1"?>
<DICTIONARY>
<Tag1>
Übung1 Übersetzung1
Übung2 Übersetzung2
Übung3 Übersetzung3
Übung4 Übersetzung4
Übung5 Übersetzung5
</Tag1>
<Tag2>
Übung6 Übersetzung6
Übung7 Übersetzung7
Übung8 Übersetzung8
Übung9 Übersetzung9
Übung10 Übersetzung10
</Tag2>
</DICTIONARY>
I wanted to read these files with lxml because of its simplicity. I used child.text to read the text parts, but the encoding does not seem to be passed to the output string. See code and output below.
I already used codecs to read the file with iso-8859-1, but it didn't change anything.
from lxml import etree
import codecs
def read_xml():
    taglist = []
    new_dicts = []
    with codecs.open("A:/test/test.txt", 'r',
                     encoding='iso-8859-1') as xmlfile:
        try:
            tree = etree.parse(xmlfile)
            loaded = True
            print("XML-encoding: ", tree.docinfo.encoding)
        except:
            loaded = False
            print("""No dictionary loaded or xml structure is missing! Please try again!""")
    if loaded:
        root = tree.getroot()
        for child in root:
            new_dict = {}
            tagname = child.tag
            taglist.append(tagname)
            print("Loading dictionary for tag: ", tagname)
            allstrings = child.text
            allstrings = allstrings.split("\n")
            for line in allstrings:
                if line != " " and line != "":
                    line = line.split("\t")
                    if line[0] != "" and line[1] != "":
                        enc_line0 = line[0]
                        enc_line1 = line[1]
                        new_dict.update({enc_line0: enc_line1})
            new_dicts.append(new_dict)
    return taglist, new_dicts

print(read_xml())
Output:
XML-encoding: iso-8859-1
Loading dictionary for tag: Tag1
Loading dictionary for tag: Tag2
(['Tag1', 'Tag2'], [{'Ã\x9cbung1': 'Ã\x9cbersetzung1', 'Ã\x9cbung2': 'Ã\x9cbersetzung2', 'Ã\x9cbung3': 'Ã\x9cbersetzung3', 'Ã\x9cbung4': 'Ã\x9cbersetzung4', 'Ã\x9cbung5': 'Ã\x9cbersetzung5'}, {'Ã\x9cbung6': 'Ã\x9cbersetzung6', 'Ã\x9cbung7': 'Ã\x9cbersetzung7', 'Ã\x9cbung8': 'Ã\x9cbersetzung8', 'Ã\x9cbung9': 'Ã\x9cbersetzung9', 'Ã\x9cbung10': 'Ã\x9cbersetzung10'}])
Whereas I expected to get an output like what print("Übung") gives, for example. What did I do wrong?
lxml works with binary files. Try to change
with codecs.open("A:/test/test.txt", 'r',
                 encoding='iso-8859-1') as xmlfile:
to
with codecs.open("A:/test/test.txt", 'rb',
                 encoding='iso-8859-1') as xmlfile:
OK, I did not find a proper solution, but by converting everything to UTF-8, I had no problems with the further steps, such as comparing words with umlauts from the dictionary against other strings.
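For what it's worth, a small sketch (same file path assumed) of the alternative described in the answer to the multi-file question further below: let lxml read and decode the file itself, so that the encoding declared in the XML is honoured and child.text comes back as a proper Python string:
from lxml import etree

# pass the file name; lxml applies the declared iso-8859-1 encoding itself
tree = etree.parse("A:/test/test.txt")
print(tree.docinfo.encoding)
for child in tree.getroot():
    print(child.tag, repr(child.text)[:60])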

Parse multiple xml files in Python

I am stuck with a problem here. I want to parse multiple xml files that share the same structure. I was already able to collect the locations of all files and save them into three different lists, since there are three different types of xml structures. Now I want to create three functions (one for each list) that loop through the lists and parse the information I need. Somehow I am not able to do it. Could anybody here give me a hint how to do it?
import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys
#### Get the location of each XML file and save them into a list ####
all_xml_list =[]
def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)

for files in locate('*.xml', r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)
#### Create lists by GameDay Events ####
xml_GameDay_Player = [x for x in all_xml_list if 'Player' in x]
xml_GameDay_Team = [x for x in all_xml_list if 'Team' in x]
xml_GameDay_Match = [x for x in all_xml_list if 'Match' in x]
The XML file looks like this:
<sports-content xmlns:imp="url">
<sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
<sports-title>player-statistics-165483</sports-title>
</sports-metadata>
<sports-event>
<event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
<team>
<team-metadata id="O_17" team-key="17">
<name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
</team-metadata>
<player>
<player-metadata player-key="33201" uniform-number="1">
<name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
</player-metadata>
<player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
<rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
<rating rating-type="grade" rating-value="2.2" />
<rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
<rating rating-type="bemeister" rating-value="16.04086" />
<player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
<stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
<stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
<stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
</player-stats-soccer>
</player-stats>
</player>
</team>
</sports-event>
</sports-content>
I want to extract everything that is within the "player-metadata" tag, the "player-stats" (stats-coverage) tag, and the "player-stats-soccer" tag.
Improving on @Gnudiff's answer, here is a more resilient approach:
import os
from glob import glob
from lxml import etree
xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.
<?xml version="1.0" encoding="UTF-8"?>
UTF-8 is a common encoding for XML files (it also is the default), but in reality it can be anything. It's impossible to predict and it's very bad practice to hard-code your program to expect a certain encoding.
XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.
This is a good example of doing it wrong:
# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))
What happens here is this:
Python opens filename as a text file f
f.read() returns a string
etree.XML() parses that string and creates a DOM object tree
Doesn't sound so wrong, does it? But if the XML is like this:
<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>
then the DOM you will end up with will be:
Player
    @nickname="MÃ¤xchen"
You have just destroyed the data. And unless the XML contained an "extended" character like ä, you would not even have noticed that this approach is borked. This can easily slip into production unnoticed.
There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.
tree = etree.parse('some_filename.xml')
This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.
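If you already have to work with an open file for some reason, a small sketch of the equivalent (placeholder file name) is to open it in binary mode and hand the file object to the parser:
# binary mode: the parser sees the raw bytes and applies the declared encoding itself
with open('some_filename.xml', 'rb') as f:
    tree = etree.parse(f)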
This won't be a complete solution for your particular case, because this is a bit of a task to do and I also don't have a keyboard at hand (working from a tablet).
In general, you can do it several ways, depending on whether you really need all the data or only a specific subset, and whether you know all the possible structures in advance.
For example, one way:
from lxml import etree
Playerdata = []
for F in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            pass  # do stuff with player data
        Playerdata.append(row)
This is adapted from my existing script, however, it is more tailored to extracting only a specific subset of xml. If you need all data, it would probably be better to use some xml tree walker.
file_get_contents is a small helper function:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
XPath is a powerful language for finding nodes within xml.
Note that depending on the XPath expression you use, the result may be either an xml node, as in the "for player in ..." statement, or a string, as in the "row['player'] = ..." statement.
You can use the XML ElementTree library. First install it with pip install lxml, then follow the code structure below:
import xml.etree.ElementTree as ET
import os
my_dir = "your_directory"
for fn in os.listdir(my_dir):
    tree = ET.parse(os.path.join(my_dir, fn))
    root = tree.getroot()
    btf = root.find('tag_name')
    btf.text = new_value  # modify the value of the tag to new_value, whatever you want to put
    tree.write(os.path.join(my_dir, fn))
If you still need a detailed explanation, go through this link:
https://www.datacamp.com/community/tutorials/python-xml-elementtree

Custom filetype in Python 3

How do I start creating my own filetype in Python? I have a design in mind, but how do I pack my data into a file with a specific format?
For example, I would like my file format to be a mix of an archive (like other formats such as zip, apk, jar, etc.; they are basically all archives) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement is to do all of this with the default modules of CPython, without external modules.
I know that this can take long to explain and do, but I can't see how to start this in Python 3.x with CPython.
Try this:
from zipfile import ZipFile
import json
data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive with a json file inside (that's easy to read back in many languages) for your data. You can add more files to the archive with myzip.write() or myzip.writestr(). You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can then no longer process the zip file.
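One way to keep extra settings without breaking the zip structure (a sketch of an alternative, not what the edit above does) is to store them in the archive comment, which zipfile exposes as a bytes attribute:
from zipfile import ZipFile
import json

settings = json.dumps({'version': 1, 'author': 'me'}).encode()

# store serialized settings in the zip comment; the archive stays a valid zip
with ZipFile('foo.filetype', 'a') as myzip:
    myzip.comment = settings

with ZipFile('foo.filetype', 'r') as myzip:
    print(json.loads(myzip.comment))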
Use this:
import base64
import gzip
import ast
def save(data):
    data = "[{!r}]".format(data).encode()  # use repr() so strings and bytes survive the round trip
    data = base64.b64encode(data)
    return gzip.compress(data)

def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
This might look like it could be opened with an archive program,
but it cannot, because it is gzip compressed and base64 encoded, and it has to be decoded first to be accessed.
Also, you can store almost any literal type of variable in it!
example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be exactly what your question asks for, but I think it may help you.
I faced a similar problem... and ended up creating a zip file and then renaming the extension to my custom file format... but it can still be opened with WinRAR.
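Roughly, that approach looks like this (a sketch; the directory name and extension are placeholders):
import os
import shutil

# zip up a directory, then rename the archive to the custom extension
archive = shutil.make_archive('mydata', 'zip', 'path/to/content_dir')
os.replace(archive, 'mydata.myext')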
