I'm working with an XML file that contains excess whitespace and empty columns that I need to delete. So far, I've been able to delete specific node objects and/or elements of a node. I've been told that with XML files you can't necessarily pinpoint or target whitespace in the file; you have to replace it instead. How would I go about that?
Listed below is my code to remove either an entire node or a specific element within a node object. For example's sake, let's assume we are using the following document:
https://docs.python.org/3/library/xml.etree.elementtree.html#modifying-an-xml-file
I'm using ElementTree, not pandas or the DOM.
import xml.etree.ElementTree as ET

# Assumption: the input is the same SampleData.xml that is written back below.
tree = ET.parse('SampleData.xml')
root = tree.getroot()

# To remove an entire node and all of its elements:
# for country in root.findall('country'):
#     # root.findall() returns a list, which avoids removal during traversal
#     description = country.find('description').text
#     if description == " Yurr ":
#         root.remove(country)

# To remove a specific element (node) within a node:
for country in root.findall('country'):
    description_node = country.find('description')
    # This removes the specific 'description' node from the .xml file.
    # The None check skips countries that have no description element.
    if (description_node is not None
            and description_node.text == 'Liechtenstein has a lot of flowers.'):
        country.remove(description_node)

tree.write('SampleData.xml')
In XSLT 3.0,
<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:strip-space elements="*"/>
    <xsl:output method="xml" indent="no"/>
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="year"/>
</xsl:transform>
will strip all indentation whitespace and remove elements named year. Is that what you're after?
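If you would rather stay within ElementTree, as the question does, a rough equivalent can be sketched in Python. This is only a sketch under assumptions: the helper names strip_whitespace and remove_empty_children are made up for illustration, "empty" is taken to mean no text, no attributes, and no children, and the document is assumed to have no mixed content in which whitespace matters.

```python
import xml.etree.ElementTree as ET


def strip_whitespace(elem):
    """Trim text and tail on every node; whitespace-only values become None."""
    for node in elem.iter():
        if node.text is not None:
            node.text = node.text.strip() or None
        if node.tail is not None:
            node.tail = node.tail.strip() or None


def remove_empty_children(elem):
    """Recursively drop children with no text, no attributes, and no children."""
    for child in list(elem):  # copy the list so removal is safe while looping
        remove_empty_children(child)
        has_text = bool((child.text or "").strip())
        if len(child) == 0 and not child.attrib and not has_text:
            elem.remove(child)


if __name__ == "__main__":
    sample = (
        "<data>\n  <country name='A'>\n    <rank> 1 </rank>\n"
        "    <description>   </description>\n  </country>\n</data>"
    )
    root = ET.fromstring(sample)
    strip_whitespace(root)
    remove_empty_children(root)
    print(ET.tostring(root, encoding="unicode"))
```

Elements such as neighbor in the linked sample carry attributes, so this definition of "empty" leaves them in place.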
I have an XML file that I'm trying to read with PowerShell. However when I read it, the output of some of the XML objects have the following characters in them: ​
I simply downloaded an XML file I needed from a third-party, which opens in Excel. Then I grab the columns I need and paste them into a new Excel Workbook. Then I map the fields with an XML Schema and then export it as an XML file, which I then use for scripting.
In the Excel spreadsheet my data looks clean, but then when I export it and run the PS script, these strange characters appear in the output. The characters even appear in the actual XML file after exporting. What am I doing wrong?
I tried using -Encoding UTF8, but I'm relatively new to PowerShell and am not sure how to appropriately apply it to my script. Appreciate any help!
PowerShell
$xmlpath = 'Path\To\The\File.xml'
[xml]$xmldata = (Get-Content $xmlpath)
$xmldata.applications.application.name
Example of Output
​ABC_DEF_GHI​.com​​
​JKL_MNO_PQRS​.com​
TUV_WXY_Z.com
AB_CD_EF_GH​.com
This is a prime example of why you shouldn't use the idiom [xml]$xmldata = (Get-Content $xmlpath) - as convenient as it is.[1] The problem is indeed one of character encoding: your file is UTF-8-encoded, but Windows PowerShell's Get-Content cmdlet interprets it as ANSI-encoded in the absence of a BOM - this answer explains the encoding part in detail. Thanks, choroba.
Instead, to ensure that the XML file's character encoding is interpreted correctly, use the following:
# Note: If you know that $xmlPath contains a *full*, native path,
# you don't need the Convert-Path call.
($xmlData = [xml]::new()).Load((Convert-Path -LiteralPath $xmlPath))
This delegates interpretation of the character encoding to the System.Xml.XmlDocument.Load .NET API method, which not only assumes the proper default for XML (UTF-8), but also respects any explicit encoding specification as part of the XML declaration, if present (e.g., <?xml version="1.0" encoding="iso-8859-1"?>)
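To see why the misread produces exactly those characters, here is a small Python illustration (it is not part of the PowerShell fix). It assumes the stray character is a zero-width space (U+200B), which is consistent with the ​ sequence in the output, and that the ANSI code page in question is Windows-1252.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, invisible in Excel

# The bytes as they would sit in a UTF-8 file: E2 80 8B for the ZWSP.
raw = f"ABC{ZWSP}.com".encode("utf-8")

# Reading UTF-8 bytes as Windows-1252 turns each byte into its own character.
misread = raw.decode("cp1252")

# Reading them as UTF-8 restores the single invisible character.
correct = raw.decode("utf-8")

if __name__ == "__main__":
    print(repr(misread))
    print(repr(correct))
    print(correct.replace(ZWSP, ""))
```

Decoding correctly still leaves the invisible character in the value, so it may be worth removing it after loading if the names are compared or used as keys.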
See also:
the bottom section of this answer for background information.
GitHub proposal #14505, which proposes introducing a New-Xml cmdlet that robustly parses XML files.
[1] If you happen to know the encoding of the input file ahead of time, you can get away with using Get-Content's -Encoding parameter in your original approach ([xml]$xmldata = (Get-Content -Encoding utf8 $xmlpath)), but the .Load()-based approach is much more robust.
I have a set of 100K XML-ish (more on that later) legacy files with a consistent structure - an Archive wrapper with multiple Date and Data pair records.
I need to extract the individual records and write them to individual text files, but am having trouble parsing the data due to illegal characters and random CR/space/tab leading and trailing data.
About the XML Files
The files are inherited from a retired system and can't be regenerated. Each file is pretty small (less than 5 MB).
There is one Date record for every Data record:
vendor-1-records.xml
<Archive>
<Date>10 Jan 2019</Date>
<Data>Vendor 1 Record 1</Data>
<Date>12 Jan 2019</Date>
<Data>Vendor 1 Record 2</Data>
(etc)
</Archive>
vendor-2-records.xml
<Archive>
<Date>22 September 2019</Date>
<Data>Vendor 2 Record 1</Data>
<Date>24 September 2019</Date>
<Data>Vendor 2 Record 2</Data>
(etc)
</Archive>
...
vendor-100000-records.xml
<Archive>
<Date>12 April 2019</Date>
<Data>Vendor 100000 Record 1</Data>
<Date>24 October 2019</Date>
<Data>Vendor 100000 Record 2</Data>
(etc)
</Archive>
I would like to extract each Data record and use the Date entry to define a unique file name, then write the contents of the Data record to that file, like so:
filename: vendor-1-record-1-2019-1Jan-10.txt contains
file contents: Vendor 1 record 1
(no tags, just the record terminated by CR)
filename: vendor-1-record-2-2019-1Jan-12.txt contains
file contents: Vendor 1 record 2
filename: vendor-2-record-1-2019-9Sep-22.txt contains
file contents: Vendor 2 record 1
filename: vendor-2-record-2-2019-9Sep-24.txt contains
file contents: Vendor 2 record 2
Issue 1: illegal characters in XML Data records
One issue is that the elements contain multiple characters that XML libraries such as etree stop on, including control characters, formatting characters, and various Alt+XXX-type characters.
I've searched online and found all manner of workaround and regex and search and replace scripts but the only thing that seems to work in Python is lxml's etree with recover=True.
However, that doesn't even always work because some of the files are apparently not UTF-8, so I get the error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Issue 2: Data records have random amounts of leading and trailing CRs and spaces
For the files I can parse with lxml.etree, the actual Data records are also wrapped in CRs and random spaces:
<Data>
(random numbers of CR + spaces and sometimes tabs)
*content<CR>*
(random numbers of CR + spaces and sometimes tabs)
</Data>
and therefore when I run
from lxml import etree

parser = etree.XMLParser(recover=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)
tags_needed = tree.iter('Data')
for it in tags_needed:
    print(it.tag, it.attrib)
I get a collection of empty Data tags (one for each data record in the file) like
Data {}
Data {}
Questions
Is there a more efficient language/module than Python's lxml for ignoring the illegal characters? As I said, I've dug through a number of cookbook blog posts, SE articles, etc for pre-processing the XML and nothing seems to really work - there's always one more control character/etc that hangs the parser.
SE suggested a post about cleaning XML which references an old Atlassian tool ( Stripping Invalid XML characters in Java). I did some basic tests and it seems like it might work but open to other suggestions.
I have not used regex with Python much - any suggestions on how to handle cleaning the leading/trailing CR/space/tab randomness in the Data tags? The actual record string I want in that Data tag also has a CR at the end and may contain tabs as well so I can't just search and replace. Maybe there is a regex way to pull that but my regex-fu is pretty weak.
For my issues 1 and 2, I kind of solved my own problem:
Issue 1 (parsing and invalid characters)
I ran the entire set of files through the Atlassian jar referenced in (Stripping Invalid XML characters in Java) with a batch script:
for %%f in (*.xml) do (
    java -jar atlassian-xml-cleaner-0.1.jar %%f > clean\%%~f
)
This utility standardized all of the XML files and made them parseable by lxml.
Issue 2 (CR, spaces, tabs inside the Data element)
This configuration for lxml stripped all whitespace and handled the invalid character issue
from lxml import etree
parser = etree.XMLParser(encoding='utf-8', recover=True, remove_blank_text=True)
tree = etree.parse(filepath, parser=parser)
With these two steps I'm now able to start extracting records and writing them to individual files:
# For each Date, the next sibling is its Data element; strip the tab/CR/whitespace:
for item in tree.findall('Date'):
    dt = parse_datestamp(item.text.strip())
    content = item.getnext().text.strip()
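parse_datestamp is not shown above. A minimal sketch of what it could look like follows, together with a hypothetical record_filename helper that builds names in the pattern from the question. It assumes only the two date styles that appear in the samples ("10 Jan 2019" and "22 September 2019") and an English locale for month names.

```python
from datetime import datetime

# Date styles seen in the sample files; extend this if others turn up.
DATE_FORMATS = ("%d %b %Y", "%d %B %Y")


def parse_datestamp(text):
    """Hypothetical helper: try each known format and return a datetime."""
    cleaned = " ".join(text.split())  # collapse stray CRs, tabs, and spaces
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def record_filename(vendor, record_no, stamp):
    """Build a name such as vendor-1-record-1-2019-1Jan-10.txt."""
    month = f"{stamp.month}{stamp.strftime('%b')}"
    return f"vendor-{vendor}-record-{record_no}-{stamp.year}-{month}-{stamp.day}.txt"


if __name__ == "__main__":
    print(record_filename(1, 1, parse_datestamp("10 Jan 2019")))
    print(record_filename(2, 1, parse_datestamp("\n  22 September 2019\t")))
```

Inside the loop, the stripped content would then be written to the file named by record_filename, with the vendor number taken from the source file name and the record number from a counter.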
I have been trying to make use of this module (dicttoxml) for some time now. I have many lists of dictionaries that I want to convert into XML format, and I want each list to essentially have its own 'table'. However, when I try something along the lines of:
xml = dicttoxml.dicttoxml(myList, root=False,
                          custom_root="MyName",
                          attr_type=False)
I get every dict displayed as an <item> type. Shouldn't this produce what the module's owner refers to as an "xml snippet" that also is identified by the custom_root name?
Essentially I want each list to have its own identifier but not be created as 'root'. Basically where the following would have each item number associated to a certain list. Either encapsulating the whole list or each dict in the list would be suitable, I believe.
<root>
<item1>
#dict info
</item1>
<item2>
#dict info
</item2>
</root>
I fixed my problem by using just the custom_root variable in my call and leaving root = True. Then, I stripped the leading
b'<?xml version="1.0" encoding="UTF-8" ?>'
by calling
xml.partition(b'<?xml version="1.0" encoding="UTF-8" ?>')[2]
From then on, I created a file with <root> </root> tags and appended the XML I had created between these tags.
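The same wrapping can be sketched without the string surgery. The following uses the standard library's xml.etree.ElementTree instead of dicttoxml, so it is a swapped-in technique and not dicttoxml's API. The function name lists_to_xml and the list names are hypothetical, and dict keys are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET


def lists_to_xml(named_lists):
    """Wrap each list of dicts in its own named element under one <root>."""
    root = ET.Element("root")
    for name, rows in named_lists.items():
        table = ET.SubElement(root, name)  # one wrapper element per list
        for row in rows:
            item = ET.SubElement(table, "item")
            for key, value in row.items():
                ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(lists_to_xml({"item1": [{"a": 1}], "item2": [{"b": "x"}]}))
```

Because the tree is built in one pass, there is no XML declaration to strip from each fragment and no separate file to splice fragments into.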
I have an XML file like the one shown below:
<?xml version="1.0" encoding="UTF-8"?>
<schools>
<city>Marshall</city>
<state>Maryland</state>
<highschool>
<schoolname>Marshalls</schoolname>
<department id="1">
<deptCode seq="1">D1</deptCode>
<deptName seq="2">Chemistry</deptName>
<deptHead seq="3">Henry Carl</deptHead>
<deptRank seq="4">L</deptRank>
</department>
<department id="2">
..
..
..
</highschool>
</schools>
In XSL I am copying the contents of department based on deptCode using
<xsl:copy-of select="*"/>
This produces result with all the attributes in the element tags.
Is it possible to ignore the attributes while using xsl:copy-of?
The desired result is like shown below
<deptCode>D1</deptCode>
<deptName>Chemistry</deptName>
<deptHead>Henry Carl</deptHead>
<deptRank>L</deptRank>
xsl:value-of works as required, but I would like to know whether it can be done with xsl:copy-of. As a note, in my requirement there are nearly 5 or 6 attributes for each element. Can someone please help? Thanks in advance.
regards
Udayakiran
xsl:value-of is working as required but I am trying to know if it can
be done within xsl:copy-of?
No. xsl:copy-of is a package deal, you cannot pick and choose. To avoid repetitive coding, use a template matching department/*.
I am not sure this is even possible using just Movable Type tags, but how do I display a random number within a certain range?
For example I have 10 images named 1~10 and every time I rebuild I want to display a random image from that range.
I use MT5.
Thank you in advance!
You can try my version of the MTCollate plugin with random filter. Original documentation is here: http://www.nonplus.net/software/mt/MTCollate.htm - difference is that it adds a sort="~" or "random" filter, but you'll probably be fine using the MTShuffleList block.
If you want to show one image out of ten, you could try this code:
<MTSetVarBlock name="imageID"><MTDate format="%S"></MTSetVarBlock>
<MTSetVarBlock name="imageID"><mt:GetVar name="imageID" op="div" value="6" sprintf="%d"></MTSetVarBlock>
<MTSetVar name="imageID" op="++">
src="/images/hoge<mt:GetVar name='imageID'>.jpg"
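The arithmetic behind this snippet can be checked with a short Python sketch. It assumes that op="div" followed by sprintf="%d" truncates to a whole number; image_id is a hypothetical name used only for illustration.

```python
def image_id(seconds):
    """Map the seconds of the rebuild time (0-59) to an image number (1-10)."""
    return seconds // 6 + 1  # 0-59 -> 0-9, then the ++ step shifts to 1-10


# Every possible seconds value lands on one of the ten image numbers.
ids = {image_id(s) for s in range(60)}

if __name__ == "__main__":
    print(sorted(ids))
```

The value depends on the second at which the rebuild runs, so it varies between rebuilds but is not random in the strict sense.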
You can actually do this with PHP if you're so inclined. Movable Type supports the ability to publish to PHP and you can just put the content you want to be randomized inside of a PHP block. All you need to do is change the published archive file type to "php" in the blog settings. Here is the MTML sample:
<?php
$images = array();
<mt:Asset id="1">
$images[] = '<mt:AssetURL/>';
</mt:Asset>
<mt:Asset id="2">
$images[] = '<mt:AssetURL/>';
</mt:Asset>
<mt:Asset id="3">
$images[] = '<mt:AssetURL/>';
</mt:Asset>
// array_rand() returns a random key, so index the array to get the URL itself
$selected_asset = $images[array_rand($images)];
?>
Just repeat the Asset tag for each of the specific assets you want. With ten Asset tags, that will generate ten operations to push each image asset's URL into the array. Alternatively, if you want to expose the last ten, you'd just use <mt:Assets lastn="10">.