Extract XML with xmllint --xpath

I am having trouble extracting "EXTRACT_THIS_PLEASE" from an XML file like the one below using xmllint --xpath. From some Googling, I understand that sed and awk should not be used for this. I also see that other XML parsers are usually recommended, but xmllint is the only one I seem to have on my RHEL system. I have tried various things and believe the issue has to do with white space.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<model-response-list xmlns="http://www.website.com/thing/link/linktothing/linklink" total-models="1" throttle="1" error="EndOfResults">
<model-responses>
<model mh="0x12345678">
<attribute id="0x12345">EXTRACT_THIS_PLEASE</attribute>
</model>
</model-responses>
</model-response-list>
EDIT: kjhughes and j_b, you guys are both wizards. Thank you so much. Could I also extract 0x12345678 from <model mh="0x12345678">? I am looking to do this 5000+ times and ultimately have a list of devices in rows or columns like this:
"0x12345678
EXTRACT_THIS_PLEASE
0x99999999
EXTRACT_THIS_PLEASE
0x11111111
NOTHING
0x33333333
EXTRACT_THIS_PLEASE
0x22222222
NOTHING"

This xmllint command line,
xmllint --xpath "//*[@id='0x12345']/text()" file.xml
will select
EXTRACT_THIS_PLEASE
as requested.
See also
Daniel Haley's answer showing how to use XML namespaces in xmllint.

Another option to extract the contents of the <attribute> element:
xmllint --xpath "//*[name()='attribute']/text()" x.xml
Output:
EXTRACT_THIS_PLEASE
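
For the follow-up in the question's EDIT (printing each mh value followed by the attribute text, or NOTHING when it is missing), here is a minimal sketch using Python's standard library instead of xmllint. It assumes Python is available on the system, which the question does not state. The function name extract_rows is hypothetical, and the namespace URI and the NOTHING placeholder are copied from the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample document in the question.
NS = {"m": "http://www.website.com/thing/link/linktothing/linklink"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<model-response-list xmlns="http://www.website.com/thing/link/linktothing/linklink" total-models="1" throttle="1" error="EndOfResults">
<model-responses>
<model mh="0x12345678">
<attribute id="0x12345">EXTRACT_THIS_PLEASE</attribute>
</model>
</model-responses>
</model-response-list>
"""


def extract_rows(xml_bytes, attr_id="0x12345"):
    """Return (mh, value) pairs; value is 'NOTHING' if the attribute is absent or empty."""
    root = ET.fromstring(xml_bytes)
    rows = []
    # The default namespace applies to every element, so each step needs the prefix.
    for model in root.iterfind(".//m:model", NS):
        attr = model.find(f"m:attribute[@id='{attr_id}']", NS)
        value = attr.text if attr is not None and attr.text else "NOTHING"
        rows.append((model.get("mh"), value))
    return rows


for mh, value in extract_rows(SAMPLE):
    print(mh)
    print(value)
```

To cover the 5000+ responses, the same function could be called once per file or per response body and the pairs collected into one list.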

Related

Excel: Export to XML - With XML in cells

I'm trying to export a spreadsheet that has some XML in some of the cells of the table.
ID (column A): 23455
FACT (column B) (this code is copied & pasted from a sample cell - they don't all have this simplicity or structure):
"<div class=""fact"">
<p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p>
</div>
"
I'd like to have XML like the following:
<record>
<ID>23455</ID>
<FACT><div class="fact"><p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p></div></FACT>
</record>
This is complex enough that I doubt that Excel's native XML schema export will work (that thing is persnickety enough that I can't get it to work with the simplest of data values).
My current thought is to write a Perl script, to read this as a CSV file and export XML. However, I've noticed that CSV does a poor job handling XML that's been "embedded" like this.
I'm hoping someone else might have a better suggestion for how to pull this information out.
Edit: Finally figured out the mistake I made with export. Can export and get the following:
<record>
<ID>23455</ID>
<FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
</FACT>
</record>
I think I can work with this: some regex and it might be good enough (although looking for every &lt; might put me at risk of killing a true less-than sign).
So I'm still open to suggestions.

Just posting this as the answer...
If you export the column as text you can get the following:
<record>
<ID>23455</ID>
<FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
</FACT>
</record>
In an XML editor I did a find and replace to get all the tags using the following regex: s/&lt;(\/?[\w\s="-_]+?)&gt;/<$1>/
It's a bit dangerous if there are actual less-than signs in the document, but you'd need a case where an escaped &lt; was followed by an optional / and text made up only of common tag symbols (="-_), and then an escaped &gt;. That is possible, but most equations are of the form X < Y < Z. Our content doesn't use <> all that much, so I can be fairly confident it won't catch the edge case.
I also "fixed" all the HTML (s/<b>/<b/>/ and s/<img (.*?)>/<img $1/>/) and checked parsing (theoretically an edge case would cause a parsing error).
And yes, I now have a doc in mixed DTD that will make all true XML peeps quake with horror, but I can work with it.
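
As an alternative to the regex, here is a minimal sketch that lets an XML parser do the unescaping. It assumes the exported FACT text is escaped markup as shown above and that each cell holds a single well-formed fragment. The function name embed_escaped_markup is hypothetical. Markup that is not well-formed (an unclosed <img>, or a genuine less-than sign in the text) raises a ParseError rather than being silently rewritten.

```python
import xml.etree.ElementTree as ET

# Export as shown above: the cell's markup arrives entity-escaped.
EXPORTED = """<record>
<ID>23455</ID>
<FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
</FACT>
</record>"""


def embed_escaped_markup(record_xml, tag="FACT"):
    """Replace the escaped text inside each <tag> element with real child elements."""
    record = ET.fromstring(record_xml)
    # Materialise the matches first so the tree is not modified while iterating.
    for holder in list(record.iter(tag)):
        # The parser has already decoded &lt; and &gt; in holder.text.
        fragment = ET.fromstring(holder.text.strip())
        holder.text = None
        holder.append(fragment)
    return ET.tostring(record, encoding="unicode")


print(embed_escaped_markup(EXPORTED))
```

Cells that fail to parse can be caught with try/except ET.ParseError and fixed by hand, which keeps the risky cases visible.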

Filtering a Block

I have multiple blocks of the below pattern
<APPLIANCE>
<ID>12233</ID>
<UUID>xxxx-xxxx-xxxx-xxxx-xxxxxxx</UUID>
<NAME>xxxxxxx</NAME>
<STATUS>Offline</STATUS>
</APPLIANCE>
<APPLIANCE>
<ID>12234</ID>
<UUID>xxxx-xxxx-xxxx-xxxx-xxxxxxx</UUID>
<NAME>yyyyy</NAME>
<STATUS>Offline</STATUS>
</APPLIANCE>
I want to extract the block with a particular ID and a particular name.
For example, the output should display:
<ID>12234</ID>
<NAME>yyyyy</NAME>
I want to do this using grep, sed, or awk.
Thanks.

This sed should work for you:
sed -n '/<ID>12234/,/<NAME>/{//p}' file
But you'd be better off using an XML parser such as xmllint or xmlstarlet to parse valid XML files.
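
To illustrate the parser-based route, here is a minimal sketch using Python's standard library rather than xmllint or xmlstarlet. It assumes the blocks have no XML declaration and no common parent, so it wraps them in a throwaway root element. The names find_appliance and ROOT are hypothetical.

```python
import xml.etree.ElementTree as ET

BLOCKS = """<APPLIANCE>
<ID>12233</ID>
<UUID>xxxx-xxxx-xxxx-xxxx-xxxxxxx</UUID>
<NAME>xxxxxxx</NAME>
<STATUS>Offline</STATUS>
</APPLIANCE>
<APPLIANCE>
<ID>12234</ID>
<UUID>xxxx-xxxx-xxxx-xxxx-xxxxxxx</UUID>
<NAME>yyyyy</NAME>
<STATUS>Offline</STATUS>
</APPLIANCE>"""


def find_appliance(blocks_text, wanted_id, wanted_name):
    """Return the serialized <ID> and <NAME> lines of the first matching block."""
    # Sibling blocks are not a well-formed document on their own, so add a root.
    root = ET.fromstring(f"<ROOT>{blocks_text}</ROOT>")
    for appliance in root.iter("APPLIANCE"):
        if (appliance.findtext("ID") == wanted_id
                and appliance.findtext("NAME") == wanted_name):
            return [
                ET.tostring(appliance.find(tag), encoding="unicode").strip()
                for tag in ("ID", "NAME")
            ]
    return []


for line in find_appliance(BLOCKS, "12234", "yyyyy"):
    print(line)
```

Unlike the sed range, this matches on the exact element text, so an ID such as 122340 would not be picked up by mistake.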

SchemaCrawler show table comment on html output

I'm calling SchemaCrawler in the following way:
call java -classpath ../_schemacrawler/lib/*;lib/* schemacrawler.Main -server=mysql -database=db_db -host=localhost -user=user -password=pwd -infolevel=maximum -command=brief -portablenames=false -tabletypes=TABLE -routines=.*\.X.*.* -routines=.*\.X.*.* -outputformat=html -o=html.html %*
It generates nice HTML output, but I would like to see the table COMMENT text. It appears for columns, but I cannot find a way to see the same for tables.
I guess it's related to the -noremarks option, but I have already tried it without success.
How should I proceed?

Dicttoxml Module tags

I have been trying to make use of this module for some time now. I have many lists of dictionaries that I want to convert into XML format, and I want each list to essentially have its own 'table'. However, when I try something along the lines of:
xml = dicttoxml.dicttoxml(myList, root = False,
custom_root = "MyName",
attr_type = False)
I get every dict displayed as an <item> type. Shouldn't this produce what the module's owner refers to as an "xml snippet" that also is identified by the custom_root name?
Essentially I want each list to have its own identifier but not be created as 'root'. Basically where the following would have each item number associated to a certain list. Either encapsulating the whole list or each dict in the list would be suitable, I believe.
<root>
<item1>
#dict info
</item1>
<item2>
#dict info
</item2>
</root>

I fixed my problem by using just the custom_root variable in my call and leaving root = True. Then, I stripped the leading
b'<?xml version="1.0" encoding="UTF-8" ?>'
by calling
xml.partition(b'<?xml version="1.0" encoding="UTF-8" ?>')[2]
From then on, I created a file with <root> </root> tags and appended the XML I created in between these tags.
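
For comparison, the same layout (one named wrapper per list inside a shared <root>) can be built directly with Python's standard library, with no declaration stripping needed. This is a minimal sketch that does not use dicttoxml. The function name lists_to_xml is hypothetical, and it assumes flat dictionaries whose keys are valid XML element names.

```python
import xml.etree.ElementTree as ET


def lists_to_xml(named_lists):
    """Build <root> with one child element per list, each holding <item> entries."""
    root = ET.Element("root")
    for list_name, dicts in named_lists.items():
        list_el = ET.SubElement(root, list_name)
        for entry in dicts:
            item_el = ET.SubElement(list_el, "item")
            for key, value in entry.items():
                # Assumes flat dicts; nested values would need a recursive helper.
                ET.SubElement(item_el, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


result = lists_to_xml({"MyName": [{"a": 1}, {"a": 2}]})
print(result)
```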

Copy elements using xsl:copy-of without attributes

I have an XML document like the one shown below:
<?xml version="1.0" encoding="UTF-8"?>
<schools>
<city>Marshall</city>
<state>Maryland</state>
<highschool>
<schoolname>Marshalls</schoolname>
<department id="1">
<deptCode seq="1">D1</deptCode>
<deptName seq="2">Chemistry</deptName>
<deptHead seq="3">Henry Carl</deptHead>
<deptRank seq="4">L</deptRank>
</department>
<department id="2">
..
..
..
</highschool>
</schools>
In XSL I am copying the contents from department based on deptCode using
<xsl:copy-of select="*"/>
This produces result with all the attributes in the element tags.
Is it possible to ignore the attributes while using xsl:copy-of?
The desired result is like shown below
<deptCode>D1</deptCode>
<deptName>Chemistry</deptName>
<deptHead>Henry Carl</deptHead>
<deptRank>L</deptRank>
xsl:value-of is working as required, but I am trying to find out whether it can be done within xsl:copy-of. As a note, in my requirement there are nearly 5 or 6 attributes for each element. Can someone please help? Thanks in advance.
regards
Udayakiran

xsl:value-of is working as required, but I am trying to find out whether it can
be done within xsl:copy-of.
No. xsl:copy-of is a package deal, you cannot pick and choose. To avoid repetitive coding, use a template matching department/*.
