Processing XML in Python

I'm trying to get all sub-elements in an XML file, but something is going wrong.
import xml.etree.ElementTree as ET
# Raw string so the backslashes in the Windows path are taken literally
tree = ET.parse(r'C:\Users\f6792150\Documents\profile.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
Here is what I got:
INFO {}
INFO {}
INFO {}
INFO {}
INFO {}
INFO {}
I was expecting to get all children inside the INFO tag (e.g. TICKER, NAME, ADDRESS, PHONE, etc.), but they don't appear. Below is the XML file I'm using:
<?xml version="1.0"?>
<collection shelf = 'profile'>
<INFO>
<TICKER>AAPL</TICKER>
<NAME> Apple Inc.</NAME>
<ADDRESS>1 Infinite Loop;Cupertino, CA 95014;United State</ADDRESS>
<PHONE>408-996-1010</PHONE>
<WEBSITE>http://www.apple.com</WEBSITE>
<SECTOR>Technology</SECTOR>
<INDUSTRY>Consumer Electronics</INDUSTRY>
<FULL_TIME>100,000</FULL_TIME>
<BUS_SUMM>Apple</BUS_SUMM>
<SOURCE>https://finance.yahoo.com/quote/AAPL/profile?p=AAPL</SOURCE>
</INFO>
<INFO>
<TICKER>T</TICKER>
<NAME> AT and T Inc.</NAME>
<ADDRESS>208 South Akard Street;Dallas, TX 75202;United States</ADDRESS>
<PHONE>210-821-4105</PHONE>
<WEBSITE>http://www.att.com</WEBSITE>
<SECTOR>Communication Services</SECTOR>
<INDUSTRY> Telecom Services</INDUSTRY>
<FULL_TIME>254,000</FULL_TIME>
<BUS_SUMM>at and t</BUS_SUMM>
<SOURCE>https://finance.yahoo.com/quote/T/profile?p=T</SOURCE>
</INFO>
</collection>
Cheers!

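Iterating over root only yields its direct children, which are the INFO elements themselves (and they have no attributes, hence the empty dicts). To reach the elements inside each INFO, try something like:
for child in root.findall('./INFO//'):
    print(child.tag, child.text)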
Output:
TICKER AAPL
NAME Apple Inc.
ADDRESS 1 Infinite Loop;Cupertino, CA 95014;United State
PHONE 408-996-1010
etc.
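If the goal is to keep the fields of each INFO block together, one option is to build one dict per record. This is only a minimal sketch, assuming a trimmed copy of the XML from the question; the names xml_text and records are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (two fields per record).
xml_text = """<?xml version="1.0"?>
<collection shelf="profile">
  <INFO>
    <TICKER>AAPL</TICKER>
    <NAME> Apple Inc.</NAME>
  </INFO>
  <INFO>
    <TICKER>T</TICKER>
    <NAME> AT and T Inc.</NAME>
  </INFO>
</collection>"""

root = ET.fromstring(xml_text)

records = []
for info in root.findall("INFO"):
    # One dict per <INFO>; strip() removes the stray leading spaces in the data.
    records.append({child.tag: (child.text or "").strip() for child in info})

for record in records:
    print(record)
```

Each dict maps a tag name to its text, so a single field can then be read with something like records[0]["TICKER"].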

Related

How to parse XML namespaces in Python 3 and Beautiful Soup 4?

I am trying to parse XML with BS4 in Python 3.
For some reason, I am not able to parse namespaces. I tried to look for answers in this question, but it doesn't work for me and I don't get any error message either.
Why does the first part work, but the second does not?
import requests
from bs4 import BeautifulSoup
input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
<wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
<wb:capitalCity>Oranjestad</wb:capitalCity>
<wb:longitude>-70.0167</wb:longitude>
<wb:latitude>12.5167</wb:latitude>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:region id="NA" iso2code="NA">Aggregates</wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
<wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
<wb:capitalCity />
<wb:longitude />
<wb:latitude />
</wb:country>
</wb:countries>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(input, 'xml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)

This looks like non-conforming XML in which two documents are mixed together. In strict mode, the XML parser expects a namespace prefix to be declared before it is used, and the itunes prefix is never declared here. Use the lxml parser instead to get your expected result from this mix:
soup = BeautifulSoup(xml_string, 'lxml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# Also working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
Note: Avoid using the names of Python built-ins (such as input) or keywords as variable names; shadowing them can have unwanted effects on the rest of your code.
If the second document is separate, use:
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Example
from bs4 import BeautifulSoup
xml_string = """
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(xml_string, 'xml')
# Working
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Otherwise, you have to declare a namespace for your item; then you can still use the XML parser:
<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
...
When a namespace is defined for an element, all child elements with
the same prefix are associated with the same namespace.
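Once the prefix is declared, the same lookup can also be done without Beautiful Soup. The following is a rough sketch that swaps in the standard library's ElementTree (not the parser used above) and reuses the placeholder namespace URI from the snippet above; the names item_xml and ns are chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened <item> with the itunes prefix declared on the element itself.
item_xml = """<item xmlns:itunes="http://www.w3.org/TR/html4/">
  <title>Some string</title>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <itunes:duration>7845</itunes:duration>
</item>"""

# Map the prefix used in the search paths to the declared namespace URI.
ns = {"itunes": "http://www.w3.org/TR/html4/"}

item = ET.fromstring(item_xml)
subtitle = item.findtext("itunes:subtitle", namespaces=ns)
duration = int(item.findtext("itunes:duration", namespaces=ns))
print(subtitle, duration)
```

The prefix in the ns mapping only has to match the prefix used in the search path; what ElementTree actually compares is the namespace URI.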

python xml write with using namespace

Are these two roots the same?
The order of the attributes in the root element changed in the output XML.
If they are different from each other, how can I fix it?
#Python 3.7
import xml.etree.ElementTree as ET
ET.register_namespace('xsi', "http://www.w3.org/2001/test")
ET.register_namespace('', "http://www.test.com/test/test/test")
tree = ET.parse('test.xml')
tree.write("test1.xml", encoding='utf-8', xml_declaration=True)
#input XML root
<root xmlns:xsi="http://www.w3.org/2001/test" schemaVersion="2.8" xmlns="http://www.test.com/test/test/test" labelVersion="1" xsi:schemaLocation="http://www.test.com/test/test/test ..\Schema\CLIFSchema.xsd" name="test.xml">
#output XML root
<root xmlns="http://www.test.com/test/test/test" xmlns:xsi="http://www.w3.org/2001/test" labelVersion="1" name="test.xml" schemaVersion="2.8" xsi:schemaLocation="http://www.test.com/test/test/test ..\Schema\CLIFSchema.xsd">
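Attribute order carries no meaning in XML, so the two roots should be equivalent to any conforming parser. ElementTree before Python 3.8 wrote attributes in sorted order, which would explain the change seen on 3.7; from 3.8 on, write() keeps the original order. A quick way to check the equivalence is to parse both roots and compare them. This is a sketch only: the roots are shortened (the xsi:schemaLocation attribute is left out) and the variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/test"
DEFAULT = "http://www.test.com/test/test/test"

# Shortened versions of the input and output roots from the question.
input_root = ET.fromstring(
    f'<root xmlns:xsi="{XSI}" schemaVersion="2.8" xmlns="{DEFAULT}" '
    f'labelVersion="1" name="test.xml"/>'
)
output_root = ET.fromstring(
    f'<root xmlns="{DEFAULT}" xmlns:xsi="{XSI}" labelVersion="1" '
    f'name="test.xml" schemaVersion="2.8"/>'
)

# attrib is a plain dict, so the comparison ignores attribute order.
same = (input_root.tag == output_root.tag
        and input_root.attrib == output_root.attrib)
print(same)
```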

How do I read the text of specific child nodes in ElementTree?

I'm processing XML files with ElementTree that have about 5000 of these "asset" nodes per file:
<asset id="83">
<name/>
<tag>0</tag>
<vin>3AKJGLBG6GSGZ6917</vin>
<fleet>131283</fleet>
<type id="0">Standard</type>
<subtype/>
<exsid/>
<mileage>0</mileage>
<location>B106</location>
<mileoffset>0</mileoffset>
<enginehouroffset>0</enginehouroffset>
<radioaddress/>
<mfg/>
<inservice>04 Apr 2017</inservice>
<inspdate/>
<status>1</status>
<opstatus timestamp="1491335031">unknown</opstatus>
<gps>567T646576</gps>
<homeloi/>
</asset>
I need:
- the value of the id attribute on the asset node
- the text of the vin node
- the text of the gps node
How can I read the text of the 'vin' and 'gps' child nodes directly without having to iterate over all of the child nodes?
for asset_xml in root.findall("./assetlist/asset"):
    print(asset_xml.attrib['id'])
    for asset_xml_children in asset_xml:
        if asset_xml_children.tag == 'vin':
            print(str(asset_xml_children.text))
        if asset_xml_children.tag == 'gps':
            print(str(asset_xml_children.text))

You can run an XPath query relative to each asset element to get vin and gps directly, without looping over the children:
for asset_xml in root.findall("./assetlist/asset"):
    print(asset_xml.attrib['id'])
    vin = asset_xml.find("vin")
    print(str(vin.text))
    gps = asset_xml.find("gps")
    print(str(gps.text))
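Note that find() returns None when the child is missing, so vin.text would raise an AttributeError for such an asset. A possible variation uses findtext() with a default instead. This is a sketch under assumptions: the root/assetlist wrapper is inferred from the path in the question, and the second asset (id 84) is a hypothetical record added to show the missing-child case.

```python
import xml.etree.ElementTree as ET

# Wrapper elements are assumed from the "./assetlist/asset" path;
# asset 84 is hypothetical and deliberately has no <gps> child.
xml_text = """<root><assetlist>
  <asset id="83"><vin>3AKJGLBG6GSGZ6917</vin><gps>567T646576</gps></asset>
  <asset id="84"><vin>HYPOTHETICAL000000</vin></asset>
</assetlist></root>"""

root = ET.fromstring(xml_text)

rows = []
for asset in root.findall("./assetlist/asset"):
    # findtext() returns the default instead of failing on a missing child.
    rows.append((
        asset.get("id"),
        asset.findtext("vin", default=""),
        asset.findtext("gps", default=""),
    ))

for row in rows:
    print(row)
```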

Parsing XML attribute with namespace python3

I have looked at the other question over Parsing XML with namespace in Python via 'ElementTree' and reviewed the xml.etree.ElementTree documentation. The issue I'm having is admittedly similar so feel free to tag this as duplicate, but I can't figure it out.
The line of code I'm having issues with is
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
My code is as follows:
# xml.etree.cElementTree was deprecated and later removed (Python 3.9); use ElementTree
import xml.etree.ElementTree as ET
tree = ET.parse('../../external_data/rss.xml')
root = tree.getroot()
instance_title = root.find('channel/title').text
instance_link = root.find('channel/link').text
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
instance_description = root.find('channel/description').text
instance_language = root.find('channel/language').text
instance_pubDate = root.find('channel/pubDate').text
instance_lastBuildDate = root.find('channel/lastBuildDate').text
The XML file:
<?xml version="1.0" encoding="windows-1252"?>
<rss version="2.0">
<channel>
<title>Filings containing financial statements tagged using the US GAAP or IFRS taxonomies.</title>
<link>http://www.example.com</link>
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
<description>This is a list of up to 200 of the latest filings containing financial statements tagged using the US GAAP or IFRS taxonomies, updated every 10 minutes.</description>
<language>en-us</language>
<pubDate>Mon, 20 Nov 2017 20:20:45 EST</pubDate>
<lastBuildDate>Mon, 20 Nov 2017 20:20:45 EST</lastBuildDate>
....
The attributes I'm trying to retrieve are in line 6; so 'href', 'type', etc.
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
Obviously, I've tried
instance_alink = root.find('{http://www.w3.org/2005/Atom}link').attrib
but that doesn't work because find() returns None. My thought is that it's looking for children but there are none. I can grab the attributes in the other lines of the XML, but not these for some reason. I've also played with ElementTree and lxml (but lxml won't load properly on Windows for whatever reason).
Any help is greatly appreciated, because the documentation seems sparse.

I was able to solve it with
alink = root.find('channel/{http://www.w3.org/2005/Atom}link').attrib
The issue was that I was looking for the tag {http://www.w3.org/2005/Atom}link at the same level as <channel>, where, of course, it doesn't exist.
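The same lookup can be written with a prefix map instead of the full {uri} notation, which some find easier to read. This is a minimal sketch, assuming a shortened copy of the feed from the question; the names rss_text, ns and attrs are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the feed from the question.
rss_text = """<rss version="2.0">
<channel>
<title>Example feed</title>
<link>http://www.example.com</link>
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
</channel>
</rss>"""

# Prefix used in the search path mapped to the namespace URI.
ns = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(rss_text)
alink = root.find("channel/atom:link", ns)

# Guard against None in case the element is absent from a feed.
attrs = dict(alink.attrib) if alink is not None else {}
print(attrs.get("href"), attrs.get("type"))
```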

How to get All thread ids and names of a process

I wrote a program in C# that lists all running processes in Windows. For each process, I also want to list all of its running threads (both name and ID). I can't find any function in the Windows API that lists thread names. How can I do it?
Example: please look at this picture:
lh4.googleusercontent.com/HwP6dpts5uRPJIElH7DgUd3x95aQKO36tynkfsaDMBbM=w607-h553-no
In the image, I want to list:
FireFox ID: 123
Google Chrome ID 456
...
Explorer ID 789
Documents ID 654
Temp ID 231
...
Thank you!

You can use the System.Diagnostics namespace and then use:
Process[] processlist = Process.GetProcesses();
foreach (Process theprocess in processlist) {
    Console.WriteLine("Process: {0} ID: {1}", theprocess.ProcessName, theprocess.Id);
}
