Removing XML Elements Based On Attribute Value Using Python Element Tree


I've been working with this much of the day but haven't been able to find a solution. I have a fairly large XML file that I need to strip of some data. The 'fields' are annotated using attributes by Id (field name) and Num (unique number for the field name).
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<Item Id="Type" Num="30">3</Item>
<Item Id="Version" Num="50">180</Item>
<Owner>
<Item Id="IdNumber" Num="20">00000000</Item>
<Item Id="race1" Num="160">01</Item>
<Item Id="race2" Num="161">88</Item>
<Item Id="race3" Num="162">88</Item>
<Dog>
<Item Id="Breed" Num="77">Mutt</Item>
<Item Id="Weight" Num="88">88</Item>
</Dog>
<Dog>
<Item Id="Breed" Num="77">Retriever</Item>
<Item Id="Weight" Num="88">77</Item>
</Dog>
</Owner>
<Owner>
<Item Id="IdNumber" Num="20">00033000</Item>
<Item Id="race1" Num="160">03</Item>
<Item Id="race2" Num="161">88</Item>
<Item Id="race3" Num="162">88</Item>
<Dog>
<Item Id="Breed" Num="77">Poodle</Item>
<Item Id="Weight" Num="88">21</Item>
</Dog>
</Owner>
</Data>
Here's the fairly simple Python code that I assumed would do the trick, but it isn't stripping out the data as expected.
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
# list of "Nums" to drop from each Owner and Dog
drops = ['160', '161', '162', '88']
# read in XML file
with open('dogowners.xml') as xmlin:
    tree = ET.parse(xmlin)
root = tree.getroot()
for x in root:
    for y in x:
        # checking to make sure it's an Owner
        if y.attrib:
            # if the value for attribute Num is in the list of drops then remove it
            if y.attrib["Num"] in drops:
                x.remove(y)
# finally output new tree
tree.write('output.xml')
output.xml
The issue I'm running into is that it's not removing all the drops/Nums listed. With this small XML, it only removes the first and third values at the Owner level, which is consistent with my large file, where it appears to remove only every other one.
<Data>
<Item Id="Type" Num="30">3</Item>
<Item Id="Version" Num="50">180</Item>
<Owner>
<Item Id="IdNumber" Num="20">00000000</Item>
<Item Id="race2" Num="161">88</Item>
<Dog>
<Item Id="Breed" Num="77">Mutt</Item>
<Item Id="Weight" Num="88">88</Item>
</Dog>
<Dog>
<Item Id="Breed" Num="77">Retriever</Item>
<Item Id="Weight" Num="88">77</Item>
</Dog>
</Owner>
<Owner>
<Item Id="IdNumber" Num="20">00033000</Item>
<Item Id="race2" Num="161">88</Item>
<Dog>
<Item Id="Breed" Num="77">Poodle</Item>
<Item Id="Weight" Num="88">21</Item>
</Dog>
</Owner>
</Data>
I feel like I might be missing something kind of obvious, but XML parsing is not my forte and I've been fighting with this for a couple of hours. Any help is greatly appreciated.

This seems like a good candidate for the identity transform pattern. The following will copy the XML document, but will exclude the Item elements that match the empty template at the end of the xsl string.
owner-dog.py
#!/usr/bin/env python3
from lxml import etree
# list of "Nums" to drop from each Owner and Dog
drops = ('160', '161', '162', '88')
# we turn it into an xsl attribute pattern:
# @Num = '160' or @Num = '161' or @Num = '162' or @Num = '88'
attr_vals = list(map(lambda n: f'@Num = \'{n}\'', drops))
attr_expr = ' or '.join(attr_vals)
xsl = etree.XML('''<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent='yes'/>
  <xsl:strip-space elements="*" />
  <!-- copy all nodes ... -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- ... except for any elements matching the following template -->
  <xsl:template match="//Item[ {attr_vals} ]" />
</xsl:stylesheet>'''.format(attr_vals=attr_expr))
transform = etree.XSLT(xsl)
with open('owner-dog.xml') as xml:
    print(transform(etree.parse(xml)))
Output
<?xml version="1.0"?>
<Data>
<Item Id="Type" Num="30">3</Item>
<Item Id="Version" Num="50">180</Item>
<Owner>
<Item Id="IdNumber" Num="20">00000000</Item>
<Dog>
<Item Id="Breed" Num="77">Mutt</Item>
</Dog>
<Dog>
<Item Id="Breed" Num="77">Retriever</Item>
</Dog>
</Owner>
<Owner>
<Item Id="IdNumber" Num="20">00033000</Item>
<Dog>
<Item Id="Breed" Num="77">Poodle</Item>
</Dog>
</Owner>
</Data>
Comparing the original XML with the output:
diff <(xmllint --format owner-dog.xml) <(./owner-dog.py)
1c1
< <?xml version="1.0" encoding="UTF-8"?>
---
> <?xml version="1.0"?>
7,9d6
< <Item Id="race1" Num="160">01</Item>
< <Item Id="race2" Num="161">88</Item>
< <Item Id="race3" Num="162">88</Item>
12d8
< <Item Id="Weight" Num="88">88</Item>
16d11
< <Item Id="Weight" Num="88">77</Item>
21,23d15
< <Item Id="race1" Num="160">03</Item>
< <Item Id="race2" Num="161">88</Item>
< <Item Id="race3" Num="162">88</Item>
26d17
< <Item Id="Weight" Num="88">21</Item>
29a21
>
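
For what it's worth, the every-other-one pattern is what you would expect from removing children from an element while a for loop is still iterating over it: each removal shifts the remaining children left, so the loop steps over the next one. The loop in the question also only goes two levels deep, so the Items inside Dog are never visited. Below is a minimal sketch of a plain ElementTree fix, assuming the same Num-based drop list. The inline XML and the drop_items helper are illustrative stand-ins, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Small stand-in for the question's file (assumption: same structure).
XML = """<Data>
<Item Id="Type" Num="30">3</Item>
<Owner>
<Item Id="IdNumber" Num="20">00000000</Item>
<Item Id="race1" Num="160">01</Item>
<Item Id="race2" Num="161">88</Item>
<Item Id="race3" Num="162">88</Item>
<Dog>
<Item Id="Breed" Num="77">Mutt</Item>
<Item Id="Weight" Num="88">88</Item>
</Dog>
</Owner>
</Data>"""

drops = {"160", "161", "162", "88"}


def drop_items(root, drops):
    """Remove every Item whose Num attribute is in drops, at any depth."""
    # Snapshot all elements first so removals cannot disturb the traversal.
    for parent in list(root.iter()):
        # list(parent) copies the child list; removing from the live list
        # while iterating over it is what skips every other element.
        for child in list(parent):
            if child.tag == "Item" and child.get("Num") in drops:
                parent.remove(child)
    return root


root = drop_items(ET.fromstring(XML), drops)
remaining = [item.get("Num") for item in root.iter("Item")]
print(remaining)  # prints ['30', '20', '77']
```

With a file on disk, the same helper could be applied to tree.getroot() before calling tree.write().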

Related

Removing element attributes from XML document using Linux tools

I have a file that contains lines similar to the following:
<Item Name="INV_LIST" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="45" MaximumLength="12" SynchronizedItemName="INVOICE_AMT" PromptDisplayStyle="First Record"/>
<Item Name="INVOICE_AMT_LIST" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="48" MaximumLength="22" SynchronizedItemName="" PromptDisplayStyle="First Record"/>
<Item Name="INV_LIST2" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="233" MaximumLength="12" SynchronizedItemName="INVOICE_AMT2" PromptDisplayStyle="First Record"/>
I want to run a Linux command like sed or awk, to remove the attribute MaximumLength and its value (it does not matter what it contains between the quotes) whenever there is a line that contains a SynchronizedItemName with a value. If the line contains SynchronizedItemName="", the line will remain untouched.
I want to end with the following:
<Item Name="INV_LIST" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="45" SynchronizedItemName="INVOICE_AMT" PromptDisplayStyle="First Record"/>
<Item Name="INVOICE_AMT_LIST" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="48" MaximumLength="22" SynchronizedItemName="" PromptDisplayStyle="First Record"/>
<Item Name="INV_LIST2" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="233" SynchronizedItemName="INVOICE_AMT2" PromptDisplayStyle="First Record"/>

You can try with this awk script:
{
    where = match($0, "SynchronizedItemName=\"\"")
    if (where != 0) print
    else {
        gsub(/MaximumLength=\"[0-9]*\"/, ""); print
    }
}
Given your input, I get output as:
<Item Name="INV_LIST" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="45" SynchronizedItemName="INVOICE_AMT" PromptDisplayStyle="First Record"/>
<Item Name="INVOICE_AMT_LIST" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="48" MaximumLength="22" SynchronizedItemName="" PromptDisplayStyle="First Record"/>
<Item Name="INV_LIST2" Justification="End" LowestAllowedValue="" DistanceBetweenRecords="0" Width="233" SynchronizedItemName="INVOICE_AMT2" PromptDisplayStyle="First Record"/>

If you have xmlstarlet available, you could use the ed command...
NOTE: The "-L" global option modifies the file in-place. Remove this if you do not want to modify the original input file.
xmlstarlet ed -L -d '//Item[normalize-space(@SynchronizedItemName)]/@MaximumLength' input.xml
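
If Python is an option, the same rule can be sketched with the standard library's ElementTree. This assumes the Item lines sit inside a well-formed document; the Items wrapper element and the shortened attribute lists below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element: the question shows only the Item lines.
XML = """<Items>
<Item Name="INV_LIST" Width="45" MaximumLength="12" SynchronizedItemName="INVOICE_AMT"/>
<Item Name="INVOICE_AMT_LIST" Width="48" MaximumLength="22" SynchronizedItemName=""/>
</Items>"""

root = ET.fromstring(XML)
for item in root.iter("Item"):
    # strip() approximates the emptiness check done by normalize-space()
    if item.get("SynchronizedItemName", "").strip():
        # Drop the attribute whatever its value is; no error if it is absent.
        item.attrib.pop("MaximumLength", None)

print(ET.tostring(root, encoding="unicode"))
```

Because this works on parsed attributes rather than on text, it does not depend on what characters appear between the quotes.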

Pankaj's code above put me on the right track.
His code worked with the small set of lines in my example and a few others I added, but on the larger file it stripped MaximumLength from every element that had it.
However, my solution is based on his idea, so thanks again, Pankaj.
Here is what I ended up with:
cat tempo | awk '{ ret = match($0, "SynchronizedItemName=\"[0-9A-Za-z_]+\"")
if (ret > 0) {
gsub(/MaximumLength=\"[0-9]*\"/, ""); print $0
}
else {
print $0
}
}' > tmpo_file && mv tmpo_file tempo
The last bit (the redirection), just prints the file to a temporary one and then replaces the original.

Make Searchview width to full width

I want the SearchView to occupy the full width of the action bar when clicked, but instead the other items are shown as well. This is the XML file of the view:
<menu xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto">
    <item
        android:id="@+id/action_settings"
        android:title="Settings"
        app:showAsAction="never"/>
    <item
        android:id="@+id/about"
        android:title="About"
        app:showAsAction="never"
        />
    <item
        android:id="@+id/edit"
        android:icon="@drawable/ic_baseline_sort_24"
        app:showAsAction="always"
        android:title="Edit"/>
    <item android:id="@+id/search"
        android:title="search_title"
        android:icon="@drawable/ic_search"
        app:showAsAction="always|collapseActionView"
        app:actionViewClass="androidx.appcompat.widget.SearchView" />
</menu>
In the Java file I am trying the following:
androidx.appcompat.widget.SearchView searchView = (androidx.appcompat.widget.SearchView)
item.getActionView();
searchView.setMaxWidth(Integer.MAX_VALUE);
It works like a regular search bar, with the other items still shown alongside it in the bar.

parsing the text in python of the given format

I want to parse a file which looks like this:
<item> <one-of> <item> deepa vats </item> <item> deepa <ruleref uri="#Dg-e_n_t41"/> </item> </one-of> <tag> out = "u-dvats"; </tag> </item>
<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>
The result should be username : out, for example:
deepa vats : u-dvats and maitha al owais : u-mal_owais
To extract the username I tried:
print ([j for i,j in re.findall(r"(<item>)\s*(.*?)\s*(?!\1)(?:</item>)",line)])
if len(list1) != 0:
print(list1[0].split("<item>")[-1])

You can parse the xml with objectify from lxml.
To parse an XML string you could use objectify.fromstring(). Then you can use dot notation or square bracket notation to navigate through the element and use the text property to get the text inside the element. Like so:
item = objectify.fromstring(item_str)
item_text = item.itemchild['anotherchild'].otherchild.text
From there you can manipulate the string and format it.
In this case I can see that you want the text inside item >> one-of >> item and the text inside item >> tag. In order to get it we could do something like this:
>>> from lxml import objectify
>>> item_str = '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>'
>>> item = objectify.fromstring(item_str)
>>> item_text = item['one-of'].item.text
>>> tag_text = item['tag'].text
>>> item_text
' maitha al owais '
>>> tag_text
' out = "u-mal_owais"; '
Since Python doesn't allow hyphens in identifiers, and since tag is already a property of the objectify element, you have to use bracket notation instead of dot notation in this case.
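
To get from there to the username : out pairs the question asks for, one option is to take the first inner item as the username and pull the quoted value out of the tag text with a regular expression. This is a rough sketch using the standard library's ElementTree rather than objectify, assuming each line is one well-formed item element; parse_line is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

lines = [
    '<item> <one-of> <item> deepa vats </item> <item> deepa '
    '<ruleref uri="#Dg-e_n_t41"/> </item> </one-of> '
    '<tag> out = "u-dvats"; </tag> </item>',
    '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> '
    '<item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> '
    '<tag> out = "u-mal_owais"; </tag> </item>',
]


def parse_line(line):
    """Return (username, out) for one <item> line; out is None if not found."""
    item = ET.fromstring(line)
    # The first <item> under <one-of> is treated as the username.
    name = item.find("one-of/item").text.strip()
    # The tag text looks like: out = "u-dvats";
    match = re.search(r'out\s*=\s*"([^"]*)"', item.findtext("tag", default=""))
    return name, match.group(1) if match else None


results = [parse_line(line) for line in lines]
for name, out in results:
    print(f"{name} : {out}")
# prints:
# deepa vats : u-dvats
# maitha al owais : u-mal_owais
```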

I suggest using BeautifulSoup:
import bs4
soup = bs4.BeautifulSoup(your_text, "lxml")
' '.join(x.strip() for x in soup.strings if x.strip())
#'deepa vats deepa out = "u-dvats"; maitha al owais doctor maitha maitha out = "u-mal_owais";'

xslt transformation in vba

I am trying to generate an XML document from a set of XPaths. This is related to Generating XML from xpath using xslt
I found a XSLT that may solve my problem in this answer to another question
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:my="my:my">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:key name="kNSFor" match="namespace" use="@of"/>
<xsl:variable name="vStylesheet" select="document('')"/>
<xsl:variable name="vPop" as="element()*">
<item path="/create/article/@type">richtext</item>
<item path="/create/article/@lang">en-us</item>
<item path="/create/article[1]/id">1</item>
<item path="/create/article[1]/description">bar</item>
<item path="/create/article[1]/name[1]">foo</item>
<item path="/create/article[1]/price[1]/amount">00.00</item>
<item path="/create/article[1]/price[1]/currency">USD</item>
<item path="/create/article[1]/price[2]/amount">11.11</item>
<item path="/create/article[1]/price[2]/currency">AUD</item>
<item path="/create/article[2]/id">2</item>
<item path="/create/article[2]/description">some name</item>
<item path="/create/article[2]/name[1]">some description</item>
<item path="/create/article[2]/price[1]/amount">00.01</item>
<item path="/create/article[2]/price[1]/currency">USD</item>
<namespace of="create" prefix="ns1:"
url="http://predic8.com/wsdl/material/ArticleService/1/"/>
<namespace of="article" prefix="ns1:"
url="xmlns:ns1='http://predic8.com/material/1/"/>
<namespace of="@lang" prefix="xml:"
url="http://www.w3.org/XML/1998/namespace"/>
<namespace of="price" prefix="ns1:"
url="xmlns:ns1='http://predic8.com/material/1/"/>
<namespace of="id" prefix="ns1:"
url="xmlns:ns1='http://predic8.com/material/1/"/>
</xsl:variable>
<xsl:template match="/">
<xsl:sequence select="my:subTree($vPop/@path/concat(.,'/',string(..)))"/>
</xsl:template>
<xsl:function name="my:subTree" as="node()*">
<xsl:param name="pPaths" as="xs:string*"/>
<xsl:for-each-group select="$pPaths" group-adjacent=
"substring-before(substring-after(concat(., '/'), '/'), '/')">
<xsl:if test="current-grouping-key()">
<xsl:choose>
<xsl:when test=
"substring-after(current-group()[1], current-grouping-key())">
<xsl:variable name="vLocal-name" select=
"substring-before(concat(current-grouping-key(), '['), '[')"/>
<xsl:variable name="vNamespace"
select="key('kNSFor', $vLocal-name, $vStylesheet)"/>
<xsl:choose>
<xsl:when test="starts-with($vLocal-name, '@')">
<xsl:attribute name=
"{$vNamespace/@prefix}{substring($vLocal-name,2)}"
namespace="{$vNamespace/@url}">
<xsl:value-of select=
"substring(
substring-after(current-group(), current-grouping-key()),
2
)"/>
</xsl:attribute>
</xsl:when>
<xsl:otherwise>
<xsl:element name="{$vNamespace/@prefix}{$vLocal-name}"
namespace="{$vNamespace/@url}">
<xsl:sequence select=
"my:subTree(for $s in current-group()
return
concat('/',substring-after(substring($s, 2),'/'))
)
"/>
</xsl:element>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="current-grouping-key()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:if>
</xsl:for-each-group>
</xsl:function>
</xsl:stylesheet>
I am trying to transform this in VBA using code from http://www.oreillynet.com/xml/blog/2005/04/transforming_xml_in_microsoft.html
I got an error with this code:
Private Sub Transform(sourceFile, stylesheetFile, resultFile)
Dim source As New MSXML2.DOMDocument30
Dim stylesheet As New MSXML2.DOMDocument30
Dim result As New MSXML2.DOMDocument30
' Load data.
source.async = False
source.Load sourceFile
' Load style sheet.
stylesheet.async = False
stylesheet.Load stylesheetFile
If (source.parseError.ErrorCode <> 0) Then
MsgBox ("Error loading source document: " & source.parseError.reason)
Else
If (stylesheet.parseError.ErrorCode <> 0) Then
MsgBox ("Error loading stylesheet document: " & stylesheet.parseError.reason)
Else
' Do the transform.
source.transformNodeToObject stylesheet, result
result.Save resultFile
End If
End If
End Sub
I also tried to use transformNode instead of transformNodeToObject. I got
object doesn't support this property or method
Private Sub Transform(sourceFile, stylesheetFile, resultFile)
Dim source As New MSXML2.DOMDocument60
Dim stylesheet As New MSXML2.DOMDocument60
Dim result As New MSXML2.DOMDocument60
' Load data
source.async = False
source.Load sourceFile
' Load style sheet.
stylesheet.async = False
stylesheet.Load stylesheetFile
If (source.parseError.ErrorCode <> 0) Then
MsgBox ("Error loading source document: " & source.parseError.reason)
Else
If (stylesheet.parseError.ErrorCode <> 0) Then
MsgBox ("Error loading stylesheet document: " & stylesheet.parseError.reason)
Else
' Do the transform.
source.transformNode (stylesheet)
'result.Save resultFile
End If
End If
End Sub

Groovy Node vs Node List

I am having difficulty adding a node deeper in an XML structure. I am missing something about the difference between a Node and a NodeList. Any help would be greatly appreciated.
def xml='''<Root id="example" version="1" archived="false">
<Item name="one" value="test"/>
<Item name="two" value="test2"/>
<Item name="three" value="test3"/>
<AppSettings Name="foo" Id="foo1">
<roles>foo</roles>
</AppSettings>
<AppSettings Name="bar" Id="bar1">
<Item name="blue" value=""/>
<Item name="green" value=""/>
<Item name="yellow" value=""/>
<Roles>
<Role id="A"/>
<Role id="B"/>
<Role id="C"/>
</Roles>
</AppSettings>
</Root>'''
root = new XmlParser().parseText(xml)
def appSettings = root.'AppSettings'.find{it.@Name == "bar"}.'Roles'
appSettings.appendNode('Role', [id: 'D'])
def writer = new StringWriter()
def printer = new XmlNodePrinter(new PrintWriter(writer))
printer.preserveWhitespace = true
printer.print(root)
String result = writer.toString()
println result
Error
groovy.lang.MissingMethodException: No signature of method: groovy.util.NodeList.appendNode() is applicable for argument types: (java.lang.String, java.util.LinkedHashMap) values: [Role, [id:D]]

This line here:
def appSettings = root.'AppSettings'.find{it.@Name == "bar"}.'Roles'
returns a NodeList (containing a single node), so you want to call appendNode on the contents of this list, not on the list itself.
This can be done either by:
appSettings*.appendNode('Role', [id: 'D'])
Which will call appendNode on every element of the list, or by:
appSettings[0]?.appendNode('Role', [id: 'D'])
Which will call appendNode on the first element of the list (only if there is one, thanks to the null-safe operator ?.).
