How do I read the text of specific child nodes in ElementTree?

I'm processing XML files with ElementTree that have about 5000 of these "asset" nodes per file:
<asset id="83">
    <name/>
    <tag>0</tag>
    <vin>3AKJGLBG6GSGZ6917</vin>
    <fleet>131283</fleet>
    <type id="0">Standard</type>
    <subtype/>
    <exsid/>
    <mileage>0</mileage>
    <location>B106</location>
    <mileoffset>0</mileoffset>
    <enginehouroffset>0</enginehouroffset>
    <radioaddress/>
    <mfg/>
    <inservice>04 Apr 2017</inservice>
    <inspdate/>
    <status>1</status>
    <opstatus timestamp="1491335031">unknown</opstatus>
    <gps>567T646576</gps>
    <homeloi/>
</asset>
I need:
    the value of the id attribute on the asset node
    the text of the vin node
    the text of the gps node
How can I read the text of the 'vin' and 'gps' child nodes directly without having to iterate over all of the child nodes?
for asset_xml in root.findall("./assetlist/asset"):
    print(asset_xml.attrib['id'])
    for asset_xml_children in asset_xml:
        if (asset_xml_children.tag == 'vin'):
            print(str(asset_xml_children.text))
        if (asset_xml_children.tag == 'gps'):
            print(str(asset_xml_children.text))

You can evaluate an XPath expression relative to each asset element to get vin and gps directly, without looping over the children:
for asset_xml in root.findall("./assetlist/asset"):
    print(asset_xml.attrib['id'])
    vin = asset_xml.find("vin")
    print(str(vin.text))
    gps = asset_xml.find("gps")
    print(str(gps.text))
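A related option is findtext(), which returns the child's text in one call and returns None when the child is missing, so there is no separate .text lookup that could fail. The sketch below is only an illustration: the <root> wrapper element and the second asset (id 84, with no gps node) are made up for the example, while the ./assetlist/asset path comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <root> wrapper and asset 84 are invented for illustration.
SAMPLE = """<root><assetlist>
<asset id="83"><name/><vin>3AKJGLBG6GSGZ6917</vin><gps>567T646576</gps></asset>
<asset id="84"><name/><vin>1XKAD49X0DJ000001</vin></asset>
</assetlist></root>"""


def read_assets(root):
    """Return (id, vin, gps) tuples; a missing child yields None."""
    rows = []
    for asset in root.findall("./assetlist/asset"):
        # findtext() looks up the first matching child and returns its text.
        rows.append((asset.get("id"), asset.findtext("vin"), asset.findtext("gps")))
    return rows


rows = read_assets(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

An empty element such as <gps/> comes back as an empty string rather than None, so the two cases can still be told apart.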

Related

NetworkX problem with label and id when reading and writing GML

I have the following example, where I create a graph programmatically, write it to a GML file and read the file into a graph again.
I want to be able to use the graph loaded from file in place of the programmatically created one:
import networkx as nx
g = nx.Graph()
g.add_edge(1,4)
nx.write_gml(g, "test.gml")
gg = nx.read_gml("test.gml", label="label")
print(gg.edges(data=True))
The contents of test.gml are as follows:
graph [
  node [
    id 0
    label "1"
  ]
  node [
    id 1
    label "4"
  ]
  edge [
    source 0
    target 1
  ]
]
Nodes 1 and 4 from the Python code are now represented by two nodes with IDs 0 and 1 and labels "1" and "4".
After reading the file, I now have to access node 4 as follows:
gg['4']
Instead of
g[4]
for the original graph.
I could of course make sure to cast every node to string before looking up the node, but this is not practical for huge graphs.
An alternative would be to programmatically create (yet another) graph that is identical to g but with integer keys, but this is even more cumbersome.
What should I do?

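Try passing destringizer=int, so that string values such as the labels "1" and "4" are converted back to integers when the file is read:
nx.read_gml(fpath, destringizer=int)
Reference:
https://networkx.org/documentation/stable/reference/readwrite/generated/networkx.readwrite.gml.read_gml.html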

read couple of xml file and saved them as list of list

I have 112 XML files, each containing one paragraph, like this (this is one sample out of the 112):
<?xml version='1.0' encoding='UTF-8'?>
<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">
<edu id="e1"><![CDATA[Yes, it's annoying and cumbersome to separate your rubbish properly all the time.]]></edu>
<edu id="e2"><![CDATA[Three different bin bags stink away in the kitchen and have to be sorted into different wheelie bins.]]></edu>
<edu id="e3"><![CDATA[But still Germany produces way too much rubbish]]></edu>
<edu id="e4"><![CDATA[and too many resources are lost when what actually should be separated and recycled is burnt.]]></edu>
<edu id="e5"><![CDATA[We Berliners should take the chance and become pioneers in waste separation!]]></edu>
<adu id="a1" type="opp"/>
<adu id="a2" type="opp"/>
<adu id="a3" type="pro"/>
<adu id="a4" type="pro"/>
<adu id="a5" type="pro"/>
<edge id="c6" src="e1" trg="a1" type="seg"/>
<edge id="c7" src="e2" trg="a2" type="seg"/>
<edge id="c8" src="e3" trg="a3" type="seg"/>
<edge id="c9" src="e4" trg="a4" type="seg"/>
<edge id="c10" src="e5" trg="a5" type="seg"/>
<edge id="c1" src="a1" trg="a5" type="reb"/>
<edge id="c2" src="a2" trg="a1" type="sup"/>
<edge id="c3" src="a3" trg="c1" type="und"/>
<edge id="c4" src="a4" trg="c3" type="add"/>
</arggraph>
I want to read each of them in Python, gather the text of every "edu" element in each file, and save the result as a list of lists, like this:
[[Yes, it's annoying and cumbersome to separate your rubbish properly all the time., Three different bin bags stink away in the kitchen and have to be sorted into different wheelie bins., But still Germany produces way too much rubbish
,and too many resources are lost when what actually should be separated and recycled is burnt , We Berliners should take the chance and become pioneers in waste separation!] , [
next XML content] ,[next, XML content],...
]]
I have tried it this way. First I saved all of the parsed files as a list in myList:
import os
import xml.etree.ElementTree as ET

myList = []
myEdgesList=[]
# read every XML file under path
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)
then:
ParaList=[]
EduList=[]
for k in myList:
    a=k.findall('.//edu')
    for l in a:
        EduList.append(l.text)
ParaList.append(EduList)
but the result only gives me a flat list of all sentences (576) and not a list of 112 paragraphs.
Can someone help me?

Assuming that myList is a list of parsed XML documents, moving ParaList.append(EduList) inside the main for loop should fix it for you. You also need to reset the EduList once per document, so also move EduList=[] inside the main loop:
ParaList=[]
for k in myList:
    EduList=[]
    a=k.findall('.//edu')
    for l in a:
        EduList.append(l.text)
    ParaList.append(EduList)
Now the extracted content of each XML document is appended to ParaList once per document.
A better way to write this code is to use a list comprehension to process the matching elements:
ParaList=[]
for k in myList:
    ParaList.append([l.text for l in k.findall('.//edu')])
Or you could even do it in one line using a nested list comprehension:
ParaList = [[l.text for l in k.findall('.//edu')] for k in myList]
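To check the nested comprehension without the 112 files on disk, here is a small self-contained sketch. The two documents are shortened, made-up stand-ins for the real corpus files, and they are parsed from strings with ET.fromstring instead of being read with os.walk.

```python
import xml.etree.ElementTree as ET

# Two shortened, hypothetical documents standing in for the real XML files.
DOCS = [
    '<arggraph id="micro_b001">'
    '<edu id="e1"><![CDATA[Yes, it\'s annoying.]]></edu>'
    '<edu id="e2"><![CDATA[But still worth it.]]></edu>'
    '<adu id="a1" type="opp"/>'
    '</arggraph>',
    '<arggraph id="micro_b002">'
    '<edu id="e1"><![CDATA[One sentence only.]]></edu>'
    '</arggraph>',
]


def paragraphs(roots):
    """Build one inner list of edu texts per parsed document."""
    return [[edu.text for edu in root.findall(".//edu")] for root in roots]


para_list = paragraphs(ET.fromstring(doc) for doc in DOCS)
print(len(para_list))  # one entry per document
print(para_list)
```

The CDATA wrappers disappear during parsing, so each inner list holds plain strings.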

Beautiful Soup findAll() doesn't find the first one

I'm working on a coreference-resolution system based on neural networks for my Bachelor's thesis, and I have a problem when I read the corpus.
The corpus is already preprocessed, and I only need to read it to do my work. I use Beautiful Soup 4 to read the XML files of each document that contain the data I need.
The files look like this:
<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE markables SYSTEM "markables.dtd">
<markables xmlns="www.eml.org/NameSpaces/markable">
<markable id="markable_102" span="word_390" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="indefnp" type="" entity="NO" nb="UNK" def="INDEF" sentenceid="19" lemmata="premia" pos="nn" head_pos="word_390" wikipedia="" mmax_level="markable"/>
<markable id="markable_15" span="word_48..word_49" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="defnp" type="" entity="NO" nb="SG" def="DEF" sentenceid="3" lemmata="Grozni hegoalde" pos="nnp nn" head_pos="word_48" wikipedia="Grozny" mmax_level="markable"/>
<markable id="markable_101" span="word_389" grammatical_role="sbj" coref_set="set_21" coref_type="named entities" visual="none" rel_type="coreferential" sub_type="exact repetition" np_form="ne_o" type="enamex" entity="LOC" nb="SG" def="DEF" sentenceid="19" lemmata="Mosku" pos="nnp" head_pos="word_389" wikipedia="" mmax_level="markable"/>
...
I need to extract all the spans here, so I try to do it with this code (Python 3):
...
from bs4 import BeautifulSoup
...
file1 = markables+filename+"_markable_level.xml"
xml1 = open(file1) #markable
soup1 = BeautifulSoup(xml1, "html5lib") #markable
...
...
for markable in soup1.findAll('markable'):
    try:
        span = markable.contents[1]['span']
        print(span)
        spanA = span.split("..")[0]
        spanB = span.split("..")[-1]
        ...
(I omitted most of the code, since it is about 500 lines.)
python3 aurreprozesaketaSTM.py
train
--- 28.329787254333496 seconds ---
&&&&&&&&&&&&&&&&&&&&&&&&& egun.06-1-p0002500.2000-06-01.europa
word_48..word_49
word_389
word_385..word_386
word_48..word_52
...
If you compare the XML file with the output, you can see that word_390 is missing.
I get almost all the data that I need, then preprocess everything, build the system with neural networks, and finally I get scores and so on.
But as I lose the first word of each document, my system's accuracy is a bit lower than it should be.
Can anyone help me with this? Any idea where the problem is?

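You are parsing XML with html5lib, which is an HTML parser and is not supported for parsing XML. The Beautiful Soup documentation lists lxml's XML parser as "the only currently supported XML parser", so install lxml and pass "xml" instead of "html5lib" as the parser name:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser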
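A likely reason the first span goes missing is that an HTML parser does not treat <markable .../> as self-closing, so each element ends up nested inside the previous one and markable.contents[1] points at the following element. That is an inference from the output, not something confirmed in the thread. As a cross-check that needs no extra packages, the sketch below reads the span attributes with the standard library's ElementTree instead of Beautiful Soup (a different tool from the one in the question). The sample is trimmed to the id and span attributes, and the XML declaration and DOCTYPE are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed sample: only the id and span attributes from the question are kept.
SAMPLE = """<markables xmlns="www.eml.org/NameSpaces/markable">
<markable id="markable_102" span="word_390"/>
<markable id="markable_15" span="word_48..word_49"/>
<markable id="markable_101" span="word_389"/>
</markables>"""

# The file declares a default namespace, so lookups need a prefix mapping.
NS = {"m": "www.eml.org/NameSpaces/markable"}


def read_spans(xml_text):
    """Return (first_word, last_word) for every markable, in document order."""
    root = ET.fromstring(xml_text)
    spans = []
    for markable in root.findall("m:markable", NS):
        span = markable.get("span")
        first, _, last = span.partition("..")
        # A single-word span has no "..", so its last word equals its first.
        spans.append((first, last or first))
    return spans


spans = read_spans(SAMPLE)
for first, last in spans:
    print(first, last)
```

With a real XML parser the attribute is read from each element itself, so the first markable (word_390) is included.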

Specific string sorting [Python 2.7]

I am fairly new to Python, and I was trying to sort this string in a certain way (it was taken from a database):
6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y
This is the standard format for these types of strings:
MDR_REPORT_KEY|DEVICE_EVENT_KEY|IMPLANT_FLAG|DATE_REMOVED_FLAG|DEVICE_SEQUENCE_NO|DATE_RECEIVED|BRAND_NAME|GENERIC_NAME|MANUFACTURER_D_NAME|MANUFACTURER_D_ADDRESS_1|MANUFACTURER_D_ADDRESS_2|MANUFACTURER_D_CITY|MANUFACTURER_D_STATE_CODE|MANUFACTURER_D_ZIP_CODE|MANUFACTURER_D_ZIP_CODE_EXT|MANUFACTURER_D_COUNTRY_CODE|MANUFACTURER_D_POSTAL_CODE|EXPIRATION_DATE_OF_DEVICE|MODEL_NUMBER|CATALOG_NUMBER|LOT_NUMBER|OTHER_ID_NUMBER|DEVICE_OPERATOR|DEVICE_AVAILABILITY|DATE_RETURNED_TO_MANUFACTURER|DEVICE_REPORT_PRODUCT_CODE|DEVICE_AGE_TEXT|DEVICE_EVALUATED_BY_MANUFACTUR
Is there any way I can print out this string sorted with the specific datatype next to the value?
For example as an output I would like to have
Report key: 6392079
Device sequence number: 1.0
Date received: 03/09/2017
Brand name: PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP
and so on for the other values. I think I would need to use the "|" as a divider to separate the data, but I'm not sure how to. I also cannot rely on the index number, because there are many variations of the string above which are all different lengths.
Also, as you can see in the string, some of the data such as device_event_key, implant_flag and date_removed_flag are absent, but the corresponding empty vertical bars are still there.
Any help would be greatly appreciated, thanks.

@nsortur, you can try the code below to get the output.
It uses a list comprehension, the zip() function, and the split() and join() string methods. Note that it drops empty fields before pairing them with the keys, so the labels line up only because the fields skipped in this record are empty; a record with a non-empty DEVICE_EVENT_KEY would shift them.
You can run the code online at
http://rextester.com/MBDXB29573 (the code works with both Python 2 and Python 3).
string1 = "6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y"
keys = ["Report key", "Device sequence number", "Date received", "Brand name"]
# keep only the non-empty fields, with surrounding whitespace removed
values = [field.strip() for field in string1.split("|") if field.strip()]
output = "\n".join([key + ": " + str(value) for key, value in zip(keys, values)])
print(output)
Output:
Report key: 6392079
Device sequence number: 1.0
Date received: 03/09/2017
Brand name: PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP

Use zip to merge the two lists into tuple pairs:
data = '6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y'
format = 'MDR_REPORT_KEY|DEVICE_EVENT_KEY|IMPLANT_FLAG|DATE_REMOVED_FLAG|DEVICE_SEQUENCE_NO|DATE_RECEIVED|BRAND_NAME|GENERIC_NAME|MANUFACTURER_D_NAME|MANUFACTURER_D_ADDRESS_1|MANUFACTURER_D_ADDRESS_2|MANUFACTURER_D_CITY|MANUFACTURER_D_STATE_CODE|MANUFACTURER_D_ZIP_CODE|MANUFACTURER_D_ZIP_CODE_EXT|MANUFACTURER_D_COUNTRY_CODE|MANUFACTURER_D_POSTAL_CODE|EXPIRATION_DATE_OF_DEVICE|MODEL_NUMBER|CATALOG_NUMBER|LOT_NUMBER|OTHER_ID_NUMBER|DEVICE_OPERATOR|DEVICE_AVAILABILITY|DATE_RETURNED_TO_MANUFACTURER|DEVICE_REPORT_PRODUCT_CODE|DEVICE_AGE_TEXT|DEVICE_EVALUATED_BY_MANUFACTUR'
for label, value in zip(format.split('|'), data.split('|')):
    # strip() removes the stray space in front of values such as " 1.0"
    print("%s: %s" % (label.replace('_', ' ').capitalize(), value.strip()))
This outputs:
Mdr report key: 6392079
Device event key:
Implant flag:
Date removed flag:
Device sequence no: 1.0
Date received: 03/09/2017
Brand name: PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP
Generic name: INSULIN INFUSION PUMP / SENSOR AUGMENTED
Manufacturer d name: MEDTRONIC MINIMED
Manufacturer d address 1: 18000 DEVONSHIRE STREET
Manufacturer d address 2:
Manufacturer d city: NORTHRIDGE
Manufacturer d state code: CA
Manufacturer d zip code: 91325
Manufacturer d zip code ext:
Manufacturer d country code: US
Manufacturer d postal code: 91325
Expiration date of device:
Model number: MMT-723LNAH
Catalog number: MMT-723LNAH
Lot number:
Other id number:
Device operator: 0LP
Device availability: R
Date returned to manufacturer: 01/29/2014
Device report product code: OYC
Device age text:
Device evaluated by manufactur: Y
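If the values are needed by name rather than only printed, the same zip() pairing can feed a dict. This is a minimal sketch, not part of the original answers: it uses just the first seven columns of the header and record from the question to stay short, and parse_record is a hypothetical helper name.

```python
# First seven columns only, taken from the header and record in the question.
HEADER = ("MDR_REPORT_KEY|DEVICE_EVENT_KEY|IMPLANT_FLAG|DATE_REMOVED_FLAG|"
          "DEVICE_SEQUENCE_NO|DATE_RECEIVED|BRAND_NAME")
RECORD = ("6392079|||| 1.0|03/09/2017|"
          "PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP")


def parse_record(header, record):
    """Pair each column name with its value; empty fields stay as ''."""
    names = header.split("|")
    values = [value.strip() for value in record.split("|")]
    return dict(zip(names, values))


fields = parse_record(HEADER, RECORD)

# Print only the fields that actually carry a value.
for name, value in fields.items():
    if value:
        print("%s: %s" % (name.replace("_", " ").capitalize(), value))
```

Because empty fields are kept in the dict, positions never shift between records; they are only skipped at print time. The printed order follows the header on Python 3.7 and later, where dicts preserve insertion order.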

This can be achieved with the split() method of str. split('|') returns empty strings for the empty values between two | characters, so every field keeps its position, and a dict can then map each attribute name to the index of its value:
query = '6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y'
def get_detail(str_):
    # index of each field within the split record
    key_finder = {'Report Key': 0, 'Device Sequence Number': 4, 'Date Received': 5, 'Brand Name': 6}
    split_by = [field.strip() for field in str_.split('|')]
    print('Report Key : {}'.format(split_by[key_finder['Report Key']]))
    print('Device Seq Num : {}'.format(split_by[key_finder['Device Sequence Number']]))
    print('Date Received : {}'.format(split_by[key_finder['Date Received']]))
    print('Brand Name : {}'.format(split_by[key_finder['Brand Name']]))
>>> get_detail(query)
Report Key : 6392079
Device Seq Num : 1.0
Date Received : 03/09/2017
Brand Name : PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP
This works because the split string is indexed from 0, so the Report Key value is at index 0 of the split list, and so on for the other values. Each index is looked up in the dict key_finder, which stores the position of every value.

How to get All thread ids and names of a process

I wrote a program in C# that lists all running processes in Windows. For each process, I also want to list all of its running threads (both name and ID). I can't find any function in the Windows API that lists thread names. How can I do it?
Example: please look at this picture:
lh4.googleusercontent.com/HwP6dpts5uRPJIElH7DgUd3x95aQKO36tynkfsaDMBbM=w607-h553-no
In the image, I want to list:
FireFox ID: 123
Google Chrome ID 456
...
Explorer ID 789
Documents ID 654
Temp ID 231
...
Thank you!

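You can use the System.Diagnostics namespace and then use:
Process[] processlist = Process.GetProcesses();
foreach(Process theprocess in processlist){
    Console.WriteLine("Process: {0} ID: {1}", theprocess.ProcessName, theprocess.Id);
}
Each Process also exposes a Threads collection of ProcessThread objects, whose Id property gives the thread ID; ProcessThread has no name property.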
