Removing the same element across all the nodes of an XML tree - python-3.x

For example's sake, this is the XML file that I'm working with:
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
<description>Liechtenstein has a lot of flowers.</description>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
<description>Singapore has a lot of street markets.</description>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
<description>Panama has a lot of great food.</description>
</country>
</data>
How would I write the code so that I can delete one child element (e.g. year or description) from each of the country nodes? For example, with the following code:
# To remove
# for country in root.findall('country'):
#     year = int(country.find('year').text)
#     if year > 2010:
#         root.remove(country)
# tree.write('sample.xml')
I can remove any country node whose year element has a text value greater than 2010. But that removes the entire country node, not just the year element. I know that I can remove a single child element of a node with the following:
# for country in root.findall('country'):
#     description_node = country.find('description')
#     if description_node.text == "Singapore has a lot of street markets.":
#         country.remove(description_node)
# tree.write('sample.xml')
But now I want to delete the description element, the year element, or the neighbor element from every country node present.

One option might be the following that uses .findall and .remove:
import xml.etree.ElementTree as ET

file = 'source.xml'
data = ET.parse(file)
for country in data.findall('country'):
    # findall() returns a list, so removing children inside the loop is safe
    for neighbor in country.findall('neighbor'):
        country.remove(neighbor)
    for year in country.findall('year'):
        country.remove(year)
    for description in country.findall('description'):
        country.remove(description)
ET.dump(data)
Output:
python yourscript.py
<data>
<country name="Liechtenstein">
<rank>1</rank>
<gdppc>141100</gdppc>
</country>
<country name="Singapore">
<rank>4</rank>
<gdppc>59900</gdppc>
</country>
<country name="Panama">
<rank>68</rank>
<gdppc>13600</gdppc>
</country>
</data>
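The three loops above differ only in the tag name, so they can be folded into one helper. This is a minimal sketch, not part of the original answer: the function name `remove_children` and the inline `SAMPLE` string (a shortened copy of the question's XML) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML, inlined so the sketch is self-contained
SAMPLE = """<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <year>2008</year>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
    <description>Liechtenstein has a lot of flowers.</description>
  </country>
  <country name="Singapore">
    <rank>4</rank>
    <year>2011</year>
    <neighbor name="Malaysia" direction="N"/>
    <description>Singapore has a lot of street markets.</description>
  </country>
</data>"""


def remove_children(root, parent_tag, tags_to_remove):
    """Hypothetical helper: drop direct children with the given tags
    from every parent_tag element. Returns how many were removed."""
    removed = 0
    # list(...) takes a snapshot, so the tree can be modified while looping
    for parent in list(root.iter(parent_tag)):
        for child in list(parent):
            if child.tag in tags_to_remove:
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = remove_children(root, "country", {"year", "description", "neighbor"})
remaining_tags = sorted({child.tag for country in root.iter("country") for child in country})
print(removed_count, remaining_tags)
```

Passing a set of tag names keeps the call site short when the list of elements to delete changes.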

In XSLT 3.0 you can do, for example:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="year[. > 2000]"/>
</xsl:transform>
The empty template rule causes elements that match the predicate to be removed; the xsl:mode instruction causes everything else to be retained.
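For comparison, the same conditional removal can be sketched with ElementTree, which is what the question uses. This is only an illustration under assumptions: the XML is inlined and shortened, and the third country ("Oldland", year 1999) is a made-up entry added so that one year element survives the filter.

```python
import xml.etree.ElementTree as ET

# Shortened sample; "Oldland" is a hypothetical entry, not from the question
SAMPLE = """<data>
  <country name="Liechtenstein"><rank>1</rank><year>2008</year></country>
  <country name="Singapore"><rank>4</rank><year>2011</year></country>
  <country name="Oldland"><rank>9</rank><year>1999</year></country>
</data>"""

root = ET.fromstring(SAMPLE)
for country in root.findall("country"):
    for year in country.findall("year"):
        # Mirrors the predicate year[. > 2000] from the stylesheet
        if int(year.text) > 2000:
            country.remove(year)

years_left = [y.text for y in root.iter("year")]
print(years_left)
```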

Related

Stripping whitespaces from an xml element using ElementTree

I'm having difficulty removing leading and trailing whitespace, as well as excessive whitespace between elements. For the sake of example, this is the XML document I'm currently running test cases on:
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
<description>Liechtenstein has a lot of flowers. </description>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
<description>Singapore has a lot of street markets.</description>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
<description> Panama has a lot of great food.</description>
</country>
</data>
Notice the excess whitespace at the end of the description for the country named "Liechtenstein", the excess whitespace between neighbor and description in the second country element, and the excess leading whitespace in the description of the third country node.
Every time I run my code:
# Remove whitespace for each element in the tree
for elem in root.iter():
    elem.text = elem.text.strip()
    elem.tail = elem.tail.strip()
I end up with the following error:
AttributeError: 'NoneType' object has no attribute 'strip'
The error occurs because elem.text and elem.tail are None when an element has no text or no trailing text, so check for None before calling strip():
import xml.etree.ElementTree as ET

file = 'source.xml'
root = ET.parse(file)
for elem in root.iter():
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()
# print XML with stripped out whitespace
ET.dump(root)
# pretty print XML with stripped out whitespace (ET.indent requires Python 3.9+)
ET.indent(root, space="\t", level=0)
ET.dump(root)
Output (stripped out whitespace):
<data><country name="Liechtenstein"><rank>1</rank><year>2008</year><gdppc>141100</gdppc><neighbor name="Austria" direction="E" /><neighbor name="Switzerland" direction="W" /><description>Liechtenstein has a lot of flowers.</description></country><country name="Singapore"><rank>4</rank><year>2011</year><gdppc>59900</gdppc><neighbor name="Malaysia" direction="N" /><description>Singapore has a lot of street markets.</description></country><country name="Panama"><rank>68</rank><year>2011</year><gdppc>13600</gdppc><neighbor name="Costa Rica" direction="W" /><neighbor name="Colombia" direction="E" /><description>Panama has a lot of great food.</description></country></data>
Output (pretty-printed with stripped out whitespace):
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E" />
<neighbor name="Switzerland" direction="W" />
<description>Liechtenstein has a lot of flowers.</description>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N" />
<description>Singapore has a lot of street markets.</description>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W" />
<neighbor name="Colombia" direction="E" />
<description>Panama has a lot of great food.</description>
</country>
</data>
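To see where the None values come from, here is a small sketch on a two-element fragment of the document. The helper name `strip_or_none` is an assumption introduced for illustration; it just wraps the None check in one place.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data><neighbor name="Austria" direction="E"/>'
    '<description> Panama has a lot of great food.</description></data>'
)

neighbor = root.find("neighbor")
description = root.find("description")

# An empty element has no text node at all, so .text is None rather than ""
neighbor_text = neighbor.text
# Nothing follows </description> before </data>, so .tail is None as well
description_tail = description.tail


def strip_or_none(value):
    """Hypothetical helper: strip a string, pass None through unchanged."""
    return value.strip() if value is not None else None


for elem in root.iter():
    elem.text = strip_or_none(elem.text)
    elem.tail = strip_or_none(elem.tail)

cleaned_description = description.text
print(repr(neighbor_text), repr(description_tail), repr(cleaned_description))
```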

placing child element of xml in variable

Hi, I'm new to XML and have never used it before. I'm trying to place two child elements in a variable each. Here's the XML data I'm using:
<?xml version="1.0"?>
<data>
<countries>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
</countries>
<cities>
<city id="1036323110">
<city>Katherine</city>
<country>Australia</country>
<capital>Australia</capital>
<population>1488</population>
</city>
</cities>
</data>
So I'm trying to get a variable that contains each child branch and this is what I've tried so far:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
country = root.find(".//countries")
city = root.find(".//cities")
Am I approaching this the right way? Thank you.
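The approach with find() looks workable. Below is a minimal sketch of it, assuming the opening tag is spelled countries so that it matches the closing tag and the path used in the question. The XML is inlined and shortened, and the variable names are only illustrative. Note that find() returns None when nothing matches, so it is worth checking before using the result.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's data, with matching <countries> tags
SAMPLE = """<data>
  <countries>
    <country name="Liechtenstein"><rank>1</rank><year>2008</year></country>
  </countries>
  <cities>
    <city id="1036323110">
      <city>Katherine</city>
      <population>1488</population>
    </city>
  </cities>
</data>"""

root = ET.fromstring(SAMPLE)

# Both branches are direct children of the root, so a plain tag name is enough
countries = root.find("countries")
cities = root.find("cities")
if countries is None or cities is None:
    raise ValueError("expected <countries> and <cities> under the root element")

# findall() with a bare tag name only looks at direct children
country_names = [c.get("name") for c in countries.findall("country")]
city_ids = [c.get("id") for c in cities.findall("city")]
print(country_names, city_ids)
```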

WP All Import creates duplicate product attributes

I'm using WP All Import to import new products from an XML file, or to update existing ones. The import works perfectly, but I noticed that some attributes are duplicated: they were created again after the first import, maybe for new items. I checked the XML file and it looks normal.
I want to import toners, and I'm using an attribute for color (Χρώμα in Greek). Now I have two different attributes named Χρώμα with the same colors. Looking for the CYAN color, I found 5 products under the first attribute and 11 products under the second one.
Any idea what I'm doing wrong?
Here is the xml from one product:
<product>
<id>1077</id>
<sku>TON-CLP320BK</sku>
<name>
Συμβατό Toner TON-CLP320BK για Samsung, CLT-K4072S, Black, 1.5K
</name>
<barcode>5202705407213</barcode>
<manufacturer>PREMIUM</manufacturer>
<descr>
<p>Συμβατό Toner TON-CLP320BK για Samsung, CLT-K4072S, Black, 1.5K</p><p></p><p>Συμβατά μοντέλα : </p><p>CLP325<br/>CLP320<br/>CLP320N<br/>CLP325W<br/>CLX3185FN<br/>CLX3185FN<br/>CLX318<br/>CLX318FN<br/>CLX3185FW<br/>CLX3185W</p>
</descr>
<availability>1</availability>
<dim1>40.0</dim1>
<dim2>9.5</dim2>
<dim3>18.5</dim3>
<weight>0.791</weight>
<tax>0.000</tax>
<stock_indicator>20</stock_indicator>
<minimum_quantity_to_order>1</minimum_quantity_to_order>
<RRP>15.00</RRP>
<url>
https://www.data-media.gr/product_det.asp?catid=263&subid=329&prid=1077
</url>
<thumb>
https://www.data-media.gr/photos/TON-CLP320BK.jpg
</thumb>
<image>
https://www.data-media.gr/photos/max/TON-CLP320BK.jpg
</image>
<volume>7030.000</volume>
<courier_weight>1.786</courier_weight>
<in_offer>0</in_offer>
<guarantee>12 μήνες</guarantee>
<group>
<id>4</id>
<name>Εκτυπωτές & Toner-Ink</name>
<category>
<id>263</id>
<name>Toner - Ribbon Μελάνια</name>
<subcategory>
<id>329</id>
<name>Toner</name>
</subcategory>
</category>
</group>
<filters>
<filter>
<name_id>9</name_id>
<name>Για Brand</name>
<value_id>8</value_id>
<value>SAMSUNG</value>
</filter>
<filter></filter>
<filter></filter>
<filter></filter>
</filters>
<price>10.16</price>
<price_without_offer>10.16</price_without_offer>
<retail_percent>20</retail_percent>
<retail_price>15.12</retail_price>
<vat>24</vat>
<pp>0</pp>
</product>

move an element to another element or create a new one if it does not exist using xslt-3

Using XSLT 3, I need to take the values of all content elements and move them into the title elements (if a title element already exists in a record, the content needs to be appended to it with a separator such as -). I have now posted my real data, since the solution below does not solve the problem when applied to something like:
example input:
<data>
<RECORD ID="31365">
<no>25099</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>021999</access>
<col>GS</col>
<call>889</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<title>1 title</title>
<content>1 content</content>
<sj>1956</sj>
</RECORD>
<RECORD ID="31366">
<no>25100</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>022004</access>
<col>GS</col>
<call>8764</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<sj>1956</sj>
<content>1 title</content>
</RECORD>
</data>
expected output:
<data>
<RECORD ID="31365">
<no>25099</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>021999</access>
<col>GS</col>
<call>889</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<title>1 title - 1 content</title>
<sj>1956</sj>
</RECORD>
<RECORD ID="31366">
<no>25100</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>022004</access>
<col>GS</col>
<call>8764</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<sj>1956</sj>
<title>1 title</title>
</RECORD>
</data>
With my attempt, I did not manage to move the elements; I just got an empty line where the content element had been, so please include the removal of blank lines in the suggested solution.
I believe the blank lines could be removed with
<xsl:template match="text()"/>
One way to achieve this is the following template. It uses XSLT 3.0 text value templates.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0" expand-text="true">
<xsl:output method="xml" indent="yes" />
<xsl:mode on-no-match="shallow-copy" />
<xsl:strip-space elements="*" /> <!-- Remove space between elements -->
<xsl:template match="RECORD">
<xsl:copy>
<xsl:copy-of select="@*" />
<title>{title[1]}{if (title[1]) then ' - ' else ''}<xsl:value-of select="content" separator=" " /></title>
<xsl:apply-templates select="node() except (title,content)" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Its output is as desired.
If you want to separate the <content> elements with a -, too, you can simplify the core <title> expression to
<xsl:value-of select="title|content" separator=" - " />
EDIT:
All I changed was replacing chapter with RECORD, and it's working fine with Saxon-HE 9.9.1.4J. The only difference in the output is that the title element is always at the first position, but that shouldn't matter. I also added a directive to remove space between elements.

Find whether a friends child element is present in a specific parent element such as 'Liechtenstein'

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
ss= 'Liechtenstein'
tag_names = set(t.tag for t in root.findall(".//*[@name=ss]/friends"))
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
<friends>
<frined name="arun" />
</friends>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
I would like to find out whether a friends child element is present in a specific parent element such as 'Liechtenstein'.
Here, tag_names is an empty set, but I expected it to contain friends.
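One likely cause is that ss inside the path string is literal text: the variable's value is never substituted into the expression, and ElementTree expects the attribute value in quotes. A minimal sketch of one way around this, assuming an f-string is acceptable for building the path, follows. The XML is inlined and shortened, and the helper name `has_child` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML (the "frined" spelling is kept as posted)
SAMPLE = """<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <friends><frined name="arun"/></friends>
  </country>
  <country name="Singapore"><rank>4</rank></country>
</data>"""

root = ET.fromstring(SAMPLE)
ss = "Liechtenstein"

# Interpolate the variable and quote it, as the [@attrib='value'] predicate requires
tag_names = {t.tag for t in root.findall(f".//*[@name='{ss}']/friends")}


def has_child(root, country_name, child_tag):
    """Hypothetical helper: True if the named country has a direct child_tag element."""
    path = f".//country[@name='{country_name}']/{child_tag}"
    # Compare with None explicitly; an Element without children is falsy
    return root.find(path) is not None


print(tag_names, has_child(root, "Liechtenstein", "friends"), has_child(root, "Singapore", "friends"))
```

This simple interpolation assumes the name contains no single quote; a name with quotes would need different handling, for example filtering with `country.get("name") == ss` in Python instead.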
