I am stuck on a problem. I want to parse multiple XML files that share the same structure. I have already collected the path of every file and sorted the paths into three lists, since there are three different types of XML structure. Now I want to write three functions (one per list), each of which loops through its list and parses the information I need. Somehow I cannot get it to work. Can anybody give me a hint on how to do it?
import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys
#### Get the location of each XML file and save them into a list ####
all_xml_list = []
def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
for files in locate('*.xml', r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)
#### Create lists by GameDay Events ####
xml_GameDay_Player = [x for x in all_xml_list if 'Player' in x]
xml_GameDay_Team = [x for x in all_xml_list if 'Team' in x]
xml_GameDay_Match = [x for x in all_xml_list if 'Match' in x]
The XML file looks like this:
<sports-content xmlns:imp="url">
<sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
<sports-title>player-statistics-165483</sports-title>
</sports-metadata>
<sports-event>
<event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
<team>
<team-metadata id="O_17" team-key="17">
<name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
</team-metadata>
<player>
<player-metadata player-key="33201" uniform-number="1">
<name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
</player-metadata>
<player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
<rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
<rating rating-type="grade" rating-value="2.2" />
<rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
<rating rating-type="bemeister" rating-value="16.04086" />
<player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
<stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
<stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
<stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
</player-stats-soccer>
</player-stats>
</player>
</team>
</sports-event>
</sports-content>
I want to extract everything inside the <player-metadata>, <player-stats> and <player-stats-soccer> elements.
Improving on @Gnudiff's answer, here is a more resilient approach:
import os
from glob import glob
from lxml import etree
xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}
# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break
def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None
# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)
    for player in tree.xpath('.//player'):
        # "@name" selects an attribute of the current element
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.
<?xml version="1.0" encoding="UTF-8"?>
UTF-8 is a common encoding for XML files (it also is the default), but in reality it can be anything. It's impossible to predict and it's very bad practice to hard-code your program to expect a certain encoding.
XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.
This is a good example of doing it wrong:
# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
tree = etree.XML(file_get_contents('some_filename.xml'))
What happens here is this:
Python opens filename as a text file f
f.read() returns a string
etree.XML() parses that string and creates a DOM object tree
Doesn't sound so wrong, does it? But if the XML is like this:
<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>
then the DOM you will end up with will be:
Player
    @nickname="MÃ¤xchen"
You have just destroyed the data. And unless the XML contained an "extended" character like ä, you would not even have noticed that this approach is borked. This can easily slip into production unnoticed.
There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.
tree = etree.parse('some_filename.xml')
This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.
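To make the difference concrete, here is a minimal sketch. It uses the standard-library xml.etree.ElementTree as a stand-in for lxml (an assumption made so the snippet runs without extra packages), a throwaway file name, and cp1252 as the assumed platform default encoding.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# A small UTF-8 encoded document with a non-ASCII character in it.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<Player nickname="Mäxchen"/>'
).encode("utf-8")

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "player.xml")  # hypothetical file name
    with open(path, "wb") as handle:
        handle.write(xml_bytes)
    # Correct: the parser reads the bytes itself and honours the declaration.
    nickname = ET.parse(path).getroot().get("nickname")

# What a text-mode read produces when the default encoding is cp1252
# (assumed here; it is a common default on Windows).
garbled = xml_bytes.decode("cp1252")

print(nickname)
print(garbled)
```

The first value keeps the ä intact, while the decoded text already contains the damaged nickname before any XML parsing has happened.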
This won't be a complete solution for your particular case, because it is a fair amount of work and I am on a tablet without a keyboard.
In general, you can do it several ways, depending on whether you really need all the data or only a specific subset, and whether you know all the possible structures in advance.
For example, one way:
from lxml import etree
Playerdata = []
for F in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}
        # selecting an attribute returns a list of strings
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            pass  # do stuff with player data
        Playerdata.append(row)
This is adapted from my existing script, however, it is more tailored to extracting only a specific subset of xml. If you need all data, it would probably be better to use some xml tree walker.
file_get_contents is a small helper function:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
XPath is a powerful language for finding nodes within XML.
Note that depending on the XPath expression you use, the result may be either a list of XML nodes, as in the "for player in..." statement, or a list of strings, as in the "row['player']=" statement.
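If installing lxml is not an option, roughly the same extraction can be sketched with the standard library alone. The sample below is a trimmed-down, hypothetical version of the player file from the question, and extract_players is a name invented for this illustration. Note that ElementTree reports namespaced attributes such as imp:duels-won with the namespace URI in braces.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the shape of the question's player file.
SAMPLE = """<sports-content xmlns:imp="url">
  <sports-event>
    <team>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" nickname="Mäxchen" />
        </player-metadata>
        <player-stats minutes-played="90" score="0">
          <player-stats-soccer imp:duels-won="1" imp:passes="32" />
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>"""

# Elements whose attributes should be collected, relative to <player>.
PATHS = (
    "player-metadata",
    "player-metadata/name",
    "player-stats",
    "player-stats/player-stats-soccer",
)

def extract_players(root):
    """Return one dict of attributes per <player> element."""
    rows = []
    for player in root.iter("player"):
        row = {}
        for path in PATHS:
            node = player.find(path)
            if node is not None:
                # later elements overwrite earlier ones on a name clash
                row.update(node.attrib)
        rows.append(row)
    return rows

players = extract_players(ET.fromstring(SAMPLE))
print(players[0])
```

For real files, ET.parse(filename).getroot() would replace ET.fromstring(SAMPLE), and the loop over the Player list stays the same as in the code above.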
You can use the xml.etree.ElementTree library. It ships with Python, so nothing needs to be installed (lxml is a separate third-party package with a similar API). Then follow the code structure below:
import xml.etree.ElementTree as ET
import os
my_dir = "your_directory"
for fn in os.listdir(my_dir):
    tree = ET.parse(os.path.join(my_dir, fn))
    root = tree.getroot()
    btf = root.find('tag_name')
    # new_value is a placeholder: set it to whatever text the tag should contain
    btf.text = new_value
    tree.write(os.path.join(my_dir, fn))
If you still need a detailed explanation, go through this link:
https://www.datacamp.com/community/tutorials/python-xml-elementtree
Related
Below is the block that I want to extract from a file, i.e. everything starting with <XYZ> and ending with </XYZ>; there may be any number of newlines in between.
<XYZ>
<beta1>aaaaa</beta1>
<beta>aaaaa</beta>
<beta0>aaaaa</beta0>
<identity>key01_adent</identity>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
</XYZ>
f=open('D:\\pyth_project\\policy.xml', 'r')
read_object=f.read()
f.close()
print(re.findall("<XYZ>\n+.*\n</XYZ>",read_object))
You shouldn't use regular expressions for XML-like files. You can use lxml instead.
from lxml import etree
tree = etree.parse('D:\\pyth_project\\policy.xml')
# tag names are case-sensitive; iter() also matches the root element itself
for xyz in tree.iter('XYZ'):
    print(etree.tostring(xyz).decode())
See How to find recursively for a tag of XML using LXML? for more information.
As said in other answers, if you are dealing with XML syntax there are better solutions than a simple regex.
But if you really want to use regex, this is how you can do it:
import re
with open('yourfile', 'r') as f:
    read_object = f.read()
print(re.findall(r"<XYZ>.*?</XYZ>", read_object, flags=re.DOTALL))
The re.DOTALL flag allows the . special character to match also newlines (by default, it matches all characters except newlines).
The *? is the non-greedy version of *, matching as few characters as possible. So if you have multiple <XYZ>...</XYZ> tags each one will be a separate match.
The assumption here is that you don't have nested <XYZ>...</XYZ> tags. If you have nested tags, better use lxml as in @blueteeth's answer.
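A small sketch of the difference between the greedy and the non-greedy quantifier, using a made-up input string with two <XYZ> blocks:

```python
import re

# Hypothetical input with two separate <XYZ> blocks.
text = (
    "<root>\n"
    "<XYZ>\n<identity>key01_adent</identity>\n</XYZ>\n"
    "<XYZ>\n<identity>key02</identity>\n</XYZ>\n"
    "</root>"
)

# Non-greedy: stops at the first closing tag, so each block is its own match.
blocks = re.findall(r"<XYZ>.*?</XYZ>", text, flags=re.DOTALL)

# Greedy: runs to the last closing tag and swallows both blocks at once.
greedy = re.findall(r"<XYZ>.*</XYZ>", text, flags=re.DOTALL)

print(len(blocks), len(greedy))
```

With the non-greedy pattern the two blocks come back separately; with the greedy one they are merged into a single match.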
The following sample shows how to read the key01_adent value where stuff is the imaginary xml document
import xml.etree.ElementTree as ET
input = '''
<stuff>
<XYZ>
<beta1>aaaaa</beta1>
<beta>aaaaa</beta>
<beta0>aaaaa</beta0>
<identity>key01_adent</identity>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
</XYZ>
</stuff>'''
stuff = ET.fromstring(input)
lst = stuff.findall('./XYZ')  # direct <XYZ> children of the root
print('count:', len(lst))
for item in lst:
    print('identity = {}'.format(item.find('identity').text))
Each <XYZ> item can have any number of child elements; find() returns the first match, so I expect the <identity> tag to be unique within an item.
I can't figure out how to work with YAML files, I have a db.yaml file with this content
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
My program reads the genre name and the link to the top 100 from this file, then it scrapes the web page for song names and adds them to a dictionary
def load_yaml_file(self):
    with open(self.yaml_file, "r") as file_content:
        # safe_load: recent PyYAML versions require an explicit Loader for yaml.load()
        self.data = yaml.safe_load(file_content)
def get_genres_and_links(self):
    for genre, link in self.data.get("beatport_links").items():
        self.beatport_links[genre] = link
Now I have a list with contents like this
["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
I would like my program to update db.yaml file with contents from this list (append to it), so in the end I would like db.yaml to look like this:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded:
  Adam_Beyer_-_Rome_Future_(Original_Mix)
  Veerus_-_Wheel_(Original_Mix)
How can I do that?
You don't need your get_genres_and_links; you can update self.data directly by doing:
self.data['downloaded'] = some_data
The problem is that in your expected output the value for the key downloaded is a multiline plain scalar and not a list. Although some_data = ' '.join(["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]) will get you the string value, it is almost impossible to get PyYAML to output the plain scalar multiline and non-compact (reading it back is trivial). Instead I would look at dumping to a literal block style scalar and joining the list using "\n".join(). The output would then look like:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded: |-
  Adam_Beyer_-_Rome_Future_(Original_Mix)
  Veerus_-_Wheel_(Original_Mix)
(you can get rid of the dash after | by appending a newline after joining the list items).
If it is acceptable for your output to look like:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded:
- Adam_Beyer_-_Rome_Future_(Original_Mix)
- Veerus_-_Wheel_(Original_Mix)
then things are easier and doing:
self.data['downloaded'] = ["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
with open('some_file', 'w') as fp:
    yaml.safe_dump(self.data, fp)
would be enough.
In any event, if you do this kind of loading, modifying, dumping, then you should look seriously at ruamel.yaml (disclaimer: I am the author of that package). Not only does it implement the newer YAML 1.2, it also preserves comments, tags, special ids, key-order when doing this kind of round-tripping. It also has built-in support for literal style block scalars. And apart from that its default .load() is safe.
When I use xmltodict to load the xml file below I get an error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
Here is my file:
<?xml version="1.0" encoding="utf-8"?>
<mydocument has="an attribute">
<and>
<many>elements</many>
<many>more elements</many>
</and>
<plus a="complex">
element as well
</plus>
</mydocument>
Source:
import xmltodict
with open('fileTEST.xml') as fd:
    xmltodict.parse(fd.read())
I am on Windows 10, using Python 3.6 and xmltodict 0.11.0
If I use ElementTree it works:
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='fileTEST.xml')
for elem in tree.iter():
    print(elem.tag, elem.attrib)
mydocument {'has': 'an attribute'}
and {}
many {}
many {}
plus {'a': 'complex'}
Note: I might have encountered a new line problem.
Note2: I used Beyond Compare on two different files.
It crashes on the file that is UTF-8 BOM encoded, and works on the UTF-8 file.
UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.
I think you forgot to define the encoding type.
I suggest that you try to initialize that xml file to a string variable:
import xml.etree.ElementTree as ET
import xmltodict
import json
tree = ET.parse('your_data.xml')
xml_data = tree.getroot()
#here you can change the encoding type to be able to set it to the one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')
data_dict = dict(xmltodict.parse(xmlstr))
In my case the file was being saved with a Byte Order Mark, as is the default with Notepad++.
I resaved the file as plain UTF-8 without the BOM.
Python 3
One Liner
data: dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))
Helper for .json and .xml
I wrote a small helper function to load .json and .xml files from a given path.
I thought it might come in handy for some people here:
import json
from xml.etree import ElementTree  # makes the name ElementTree available
import xmltodict

def load_json(path: str) -> dict:
    if path.endswith(".json"):
        print(f"> Loading JSON from '{path}'")
        with open(path, mode="r") as open_file:
            content = open_file.read()
        return json.loads(content)
    elif path.endswith(".xml"):
        print(f"> Loading XML as JSON from '{path}'")
        xml = ElementTree.tostring(ElementTree.parse(path).getroot())
        return xmltodict.parse(xml, attr_prefix="#", cdata_key="#text", dict_constructor=dict)
    print(f"> Loading failed for '{path}'")
    return {}
Notes
if you want to get rid of the # and #text markers in the json output, use the parameters attr_prefix="" and cdata_key=""
normally xmltodict.parse() returns an OrderedDict but you can change that with the parameter dict_constructor=dict
Usage
path = "my_data.xml"
data = load_json(path)
print(json.dumps(data, indent=2))
# OUTPUT
#
# > Loading XML as JSON from 'my_data.xml'
# {
#   "mydocument": {
#     "#has": "an attribute",
#     "and": {
#       "many": [
#         "elements",
#         "more elements"
#       ]
#     },
#     "plus": {
#       "#a": "complex",
#       "#text": "element as well"
#     }
#   }
# }
Sources
ElementTree.tostring()
ElementTree.parse()
xmltodict
json.dumps()
I had the same problem and I solved it just by specifying the encoding in the open function.
In this case it would be something like:
import xmltodict
with open('fileTEST.xml', encoding='utf8') as fd:
    xmltodict.parse(fd.read())
In my case, the issue was with the first 3 characters. So removing them worked:
import xmltodict
from xml.parsers.expat import ExpatError
with open('your_data.xml') as f:
    data = f.read()
try:
    doc = xmltodict.parse(data)
except ExpatError:
    doc = xmltodict.parse(data[3:])
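The three characters removed above are most likely the UTF-8 BOM (EF BB BF) after it has been decoded with a single-byte code page. The sketch below illustrates that under two assumptions: cp1252 stands in for the platform default encoding, and the standard-library ElementTree stands in for xmltodict (both sit on top of expat).

```python
import codecs
import xml.etree.ElementTree as ET

# The same kind of document, saved as "UTF-8 with BOM".
raw = codecs.BOM_UTF8 + (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<mydocument has="an attribute"/>'
)

# Decoded with a single-byte code page, the BOM becomes three junk characters.
as_cp1252 = raw.decode("cp1252")
junk = as_cp1252[:3]

try:
    ET.fromstring(as_cp1252)
    junk_parses = True
except ET.ParseError:
    junk_parses = False

# Option 1: "utf-8-sig" strips the BOM while decoding.
root_from_text = ET.fromstring(raw.decode("utf-8-sig"))

# Option 2: hand the undecoded bytes to the parser, which understands the BOM.
root_from_bytes = ET.fromstring(raw)

print(repr(junk), junk_parses, root_from_text.attrib, root_from_bytes.tag)
```

In file terms, option 1 corresponds to open(path, encoding='utf-8-sig') and option 2 to open(path, 'rb'); either avoids slicing off a fixed number of characters.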
xmltodict seems to not be able to parse <?xml version="1.0" encoding="utf-8"?>
If you remove this line, it works.
Not specific to the original post but for those who are also running into the same error at a different line, I was able to fix it by correcting the XML/XHTML error.
In my case, the document I was working with had a text description containing a bare ampersand "&" instead of "&amp;", so to fix my issue I had to edit the file first before running it through the parser.
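A minimal sketch of that failure and of one way to avoid it when you generate the XML yourself. The description text is made up, and the standard-library ElementTree is used here instead of xmltodict:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

description = "Fish & chips"  # hypothetical text with a bare ampersand

# A bare "&" makes the document ill-formed, so the parser rejects it.
try:
    ET.fromstring("<text>" + description + "</text>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# escape() turns "&", "<" and ">" into entities before the text is embedded.
escaped = escape(description)
fixed = ET.fromstring("<text>" + escaped + "</text>")

print(raw_ok, escaped, fixed.text)
```

After parsing, the entity is turned back into a plain "&", so the application sees the original text.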
I want to extract text from word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.
Running the below code I saw that paragraphs inserted in "track changes" mode return an empty Paragraph.text
import docx
doc = docx.Document('C:\\test track changes.docx')
for para in doc.paragraphs:
    print(para)
    print(para.text)
Is there a way to retrieve the text in revisioned inserts (w:ins elements) ?
I'm using python-docx 0.8.6, lxml 3.4.0, python 3.4, Win7
Thanks
I was having the same problem for years (maybe as long as this question existed).
By looking at the code of "etienned" posted by @yiftah and the attributes of Paragraph, I have found a solution to retrieve the text after accepting the changes.
The trick was to get p._p.xml to get the XML of the paragraph and then using "etienned" code on that (i.e retrieving all the <w:t> elements from the XML code, which contains both regular runs and <w:ins> blocks).
Hope it can help the souls lost like I was:
from docx import Document
from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # iter() replaces getiterator(), which was removed in Python 3.9
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = Document("Hello.docx")
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")
Not directly using python-docx; there's no API support yet for tracked changes/revisions.
It's a pretty tricky job, which you'll discover if you search on the element names, perhaps 'open xml w:ins' for a start, that brings up this document as the first result:
https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx
If I needed to do something like that in a pinch I'd get the body element using:
body = document._body._body
and then use XPath on that to return the elements I wanted, something vaguely like this aircode:
from docx.text.paragraph import Paragraph
inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)
You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.
opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html
the below code from Etienne worked for me, it's working directly with the document's xml (and not using python-docx)
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
I needed a quick solution to make text surrounded by "smart tags" visible to docx's text property, and found that the solution could also be adapted to make some tracked changes visible.
It uses lxml.etree.strip_tags to remove surrounding "smartTag" and "ins" tags, and promote the contents; and lxml.etree.strip_elements to remove the whole "del" elements.
import docx
import lxml.etree

def para2text(p, quiet=False):
    if not quiet:
        unsafeText = p.text
    lxml.etree.strip_tags(p._p, "{*}smartTag")
    lxml.etree.strip_elements(p._p, "{*}del")
    lxml.etree.strip_tags(p._p, "{*}ins")
    safeText = p.text
    if not quiet:
        if safeText != unsafeText:
            print()
            print('para2text: unsafe:')
            print(unsafeText)
            print('para2text: safe:')
            print(safeText)
            print()
    return safeText

docin = docx.Document(filePath)  # filePath: path to the .docx file
for para in docin.paragraphs:
    text = para2text(para)
Beware that this only works for a subset of "tracked changes", but it might be the basis of a more general solution.
If you want to see the xml for a docx file directly: rename it as .zip, extract the "document.xml", and view it by dropping into chrome or your favourite viewer.
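The same inspection can be done from Python without renaming anything, since a .docx file is a zip package. The sketch below uses only the standard library; the miniature package built in memory is a hypothetical stand-in for a real document, and accepted_text is a name invented for this example. It relies on the fact that deleted text is stored in w:delText rather than w:t.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml from a .docx package."""
    with zipfile.ZipFile(docx_file) as package:
        return package.read("word/document.xml")

def accepted_text(document_xml):
    """Return one string per paragraph, as if all changes were accepted."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter(W + "p"):
        # w:t holds kept and inserted text; deleted text sits in w:delText
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs

# Hypothetical miniature package, just enough to exercise the functions.
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p>'
    '<w:r><w:t>Hello </w:t></w:r>'
    '<w:ins><w:r><w:t>new </w:t></w:r></w:ins>'
    '<w:del><w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", SAMPLE_XML)

result = accepted_text(read_document_xml(buffer))
print(result)
```

For a real file, pass its path to read_document_xml instead of the buffer. Like the approaches above, this only covers simple insertions and deletions, and nested paragraphs (for example in text boxes) would need extra handling.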
How do I start creating my own file type in Python? I have a design in mind, but how do I pack my data into a file with a specific format?
For example, I would like my file format to be a mix of an archive (like zip, apk, jar and so on, which are basically all archives) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement is to do all this with the default modules of CPython, without external modules.
I know that this can take long to explain and to do, but I can't see how to start this in Python 3.x with CPython.
Try this:
from zipfile import ZipFile
import json
data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive containing a JSON file for the data (which is easy to read back in many languages). You can add files to the archive with myzip.write or myzip.writestr. You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
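Putting the two halves together, a container with a settings section and a set of packed files could be sketched as follows. The member names settings.json and files/ as well as the function names are choices made for this example, not part of any standard.

```python
import io
import json
import zipfile

def write_container(target, settings, files):
    """Write settings (a dict) and files (name -> bytes) into one archive."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("settings.json", json.dumps(settings))
        for name, payload in files.items():
            archive.writestr("files/" + name, payload)

def read_container(source):
    """Return (settings, files) as written by write_container."""
    prefix = "files/"
    with zipfile.ZipFile(source) as archive:
        settings = json.loads(archive.read("settings.json"))
        files = {
            name[len(prefix):]: archive.read(name)
            for name in archive.namelist()
            if name.startswith(prefix)
        }
    return settings, files

# Round trip through an in-memory buffer; a file path works the same way.
buffer = io.BytesIO()
write_container(buffer, {"version": 1}, {"a.txt": b"hello"})
settings, files = read_container(buffer)
print(settings, files)
```

This does not hide the settings from an archive manager: settings.json is an ordinary member of the zip file and can be opened like any other.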
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can no longer process the zip file.
Use this:
import base64
import gzip
import ast
def save(data):
    # {!r} writes the repr, so plain strings keep their quotes for literal_eval
    data = "[{!r}]".format(data).encode()
    data = base64.b64encode(data)
    return gzip.compress(data)
def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
The file may look as if an archive program could open it, but the payload is Base64-encoded, so it has to be decoded before its contents can be read.
You can also store any value that has a Python literal representation in it!
Example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be exactly what you asked for, but I think it may help you.
I faced a similar problem and ended up creating a zip file and then renaming its extension to my custom file format. It can still be opened with WinRAR, though.