I have an XML file that I need to output as text. How can I do that? What is the best and most efficient way to get text output via U-SQL?
INPUT XML:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>abstract.xml</title>
<link>download.wikimedia.org/enwiki/20171103</link>
<description>Wikimedia dump updates for enwiki</description>
<item>
<title>download.wikimedia.org/enwiki/20171103</title>
<link>download.wikimedia.org/enwiki/20171103</link>
<description>
<a href="download.wikimedia.org/enwiki/20171103/…" />
</description>
<pubDate>Sun, 05 Nov 2017 21:11:20 GMT</pubDate>
</item>
</channel>
</rss>
This is my XML, and I'm trying to retrieve its data with the code below.
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
@wiki = EXTRACT title string,
link string //abst string
FROM @input
USING new Microsoft.Analytics.Samples.Formats.Xml.XmlExtractor(
"doc",
new SQL.MAP<string,string> {
{"title","title" },
{"link","link" }
}
);
Microsoft provides a great example here:
https://github.com/Azure/usql/tree/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats
You can register the .DLL file as an assembly and reference it in your U-SQL project.
I've used it a lot in the past for extracting and outputting both JSON and XML.
Sample XML file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<configuration xmlns="http://www.jooq.org/xsd/jooq-codegen-3.17.0.xsd">
<!-- Configure the database connection here -->
<jdbc>
<driver>com.mysql.cj.jdbc.Driver</driver>
<url></url>
<user></user>
<password></password>
</jdbc>
<!-- onError SILENT can be used with MYSQLDatabase for Memsql -->
<!-- <onError>SILENT</onError> -->
<generator>
<generate>
<records>true</records>
<instanceFields>true</instanceFields>
<generatedAnnotation>true</generatedAnnotation>
<generatedAnnotationType>DETECT_FROM_JDK</generatedAnnotationType>
</generate>
Previously, this configuration generated jOOQ code with records that included columns. Now the code is generated, but the jOOQ columns are not present.
This looks like a regression in the MemSQLDatabase, probably introduced with jOOQ 3.16's support for computed columns: https://github.com/jOOQ/jOOQ/issues/13854
For example, my date is 12.04.2008 (DD.MM.YYYY), but when I export it, the XML file somehow contains 39550 instead.
My XML template looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DataImport>
<No>001-MK</No>
<CustNo>10016</CustNo>
<PostingDate>05.12.2005</PostingDate>
</DataImport>
</root>
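The value 39550 is consistent with a spreadsheet-style serial date, that is, a count of days from an epoch, which would mean the export wrote the underlying number rather than the formatted text. Here is a minimal Python sketch of the conversion, assuming an Excel-style 1900 date system with an effective epoch of 1899-12-30. The question does not say which tool produced the export, and the helper name is hypothetical.

```python
from datetime import date, timedelta

# Assumption: Excel-style 1900 date system. Using 1899-12-30 as the epoch
# accounts for Excel's fictitious 1900-02-29, so this is only valid for
# serials from 61 (1900-03-01) onward.
EXCEL_EPOCH = date(1899, 12, 30)

def serial_to_ddmmyyyy(serial: int) -> str:
    """Convert a day-count serial to a DD.MM.YYYY string (hypothetical helper)."""
    return (EXCEL_EPOCH + timedelta(days=serial)).strftime("%d.%m.%Y")

print(serial_to_ddmmyyyy(39550))  # 12.04.2008
```

If that assumption holds, the usual fix is on the export side: format the cell as text, or convert the value to a DD.MM.YYYY string before it is written to the XML.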
I'm trying to write an XML tree to disk using Python's xml.etree.ElementTree to reproduce an example document given to me. The target XML document has fields in it that look like:
<title>
This is a test of <br/> Hershey's <sup>&#174;</sup> chocolate factory machine <br/>
</title>
My problem is that whenever I write the text to disk using ElementTree's .write() method, I can't achieve the above output. Either the HTML tags get converted to &lt;br&gt;, or the trademark symbol (the &#174; stuff) shows up as the actual symbol. Is there a way to encode my text to get the above output (where the trademark is represented by the &#174; character reference but the HTML stays HTML)? I've tried different encoding options in the write method, but nothing seems to do the trick.
Edit: Here is a minimal working example. Take an input XML template file like:
<?xml version='1.0' encoding='UTF-8'?>
<document>
<title> Text to replace </title>
</document>
and we try to modify the text like so:
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
root = tree.getroot()
to_sub_text = "This is a test of <br/> Hershey's <sup>&#174;</sup> chocolate factory machine"
spot = root.find('title')
spot.text = to_sub_text
tree.write('example_mod.xml', encoding='UTF-8', xml_declaration=True)
This writes the following to the file:
<?xml version='1.0' encoding='UTF-8'?>
<document>
<title>This is a test of &lt;br/&gt; Hershey's &lt;sup&gt;&amp;#174;&lt;/sup&gt; chocolate factory machine</title>
</document>
As I said, the document I'm trying to replicate leaves those html tags as tags. My questions are:
Can I modify my code to do that?
Is doing this good practice, or would it have been better to leave the output as it currently is (in which case I need to talk to the team that asked for it in this form)?
The spot.text = to_sub_text assignment does not work. An element's text property contains plain text only. It is not possible to use it to add both text and subelements.
What you can do is to create a new <title> element object and append that to the root:
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
root = tree.getroot()
# Remove the old title element
old_title = root.find('title')
root.remove(old_title)
# Add a new title
new_title = "<title>This is a test of <br/> Hershey's <sup>®</sup> chocolate factory machine</title>"
root.append(ET.fromstring(new_title))
# Prettify output (requires Python 3.9)
ET.indent(tree)
# Use encoding='US-ASCII' to force output of character references for non-ASCII characters
tree.write('example_mod.xml', encoding='US-ASCII', xml_declaration=True)
Output in example_mod.xml:
<?xml version='1.0' encoding='US-ASCII'?>
<document>
<title>This is a test of <br /> Hershey's <sup>&#174;</sup> chocolate factory machine</title>
</document>
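The same mixed content can also be built without parsing a string, using an element's text and tail attributes. This is a minimal sketch of that alternative, not the approach used in the answer above. It relies on ET.tostring() defaulting to US-ASCII, so the non-ASCII trademark sign is written as a character reference.

```python
import xml.etree.ElementTree as ET

# Build <title> with mixed content: leading text lives in .text, and text
# that follows a child element lives in that child's .tail.
title = ET.Element("title")
title.text = "This is a test of "
br = ET.SubElement(title, "br")
br.tail = " Hershey's "
sup = ET.SubElement(title, "sup")
sup.text = "\u00ae"  # registered-trademark sign
sup.tail = " chocolate factory machine"

# Default encoding is US-ASCII, so U+00AE is emitted as &#174;.
mixed = ET.tostring(title).decode("ascii")
print(mixed)

# For contrast: markup assigned to .text is plain text and gets escaped.
plain = ET.Element("title")
plain.text = "a <br/> b"
escaped = ET.tostring(plain, encoding="unicode")
print(escaped)
```

The second part shows why the original spot.text assignment produced escaped tags: ElementTree treats everything in text as character data.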
Does any XSD edge case allow (unescaped) XML element content inside a text node? E.g., can you put a CDATA section inside an element declared as xs:string and have it validate (without declaring mixed content)?
If you have an element declared as a string, e.g.:
<?xml version="1.0" encoding="utf-8" ?>
<!--Created with Liquid Studio 2018 (https://www.liquid-technologies.com)-->
<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Root" type="xs:string" />
</xs:schema>
then it can contain CDATA, e.g.:
<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid Studio 2018 (https://www.liquid-technologies.com) -->
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C:\Temp\XSDFile2.xsd">
Optional Text
<![CDATA[
<someXmlData></someXmlData>
]]>
Optional Text
</Root>
As this passes through some parsers, it may get escaped back to the following, but both forms are valid and equivalent:
<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid Studio 2018 (https://www.liquid-technologies.com) -->
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="XSDFile2.xsd">
Optional Text
&lt;someXmlData&gt;&lt;/someXmlData&gt;
Optional Text
</Root>
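The equivalence of the two forms can be checked at the parser level. This is a minimal sketch using Python's xml.etree.ElementTree on simplified one-line versions of the documents above. It demonstrates parsing only, not XSD validation, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET

# Simplified versions of the two instance documents (namespaces omitted).
with_cdata = "<Root>Optional Text <![CDATA[<someXmlData></someXmlData>]]> Optional Text</Root>"
with_escapes = "<Root>Optional Text &lt;someXmlData&gt;&lt;/someXmlData&gt; Optional Text</Root>"

root_a = ET.fromstring(with_cdata)
root_b = ET.fromstring(with_escapes)

# Both forms parse to the same character data and have no child elements.
same_text = root_a.text == root_b.text
print(same_text, len(root_a), len(root_b))

# ElementTree does not preserve CDATA sections; it re-serializes with escapes.
reserialized = ET.tostring(root_a, encoding="unicode")
print(reserialized)
```

Because the parser sees only character data in both cases, a validator checking xs:string has nothing to distinguish them by.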
I want to export this as XML using JAXB, but I am getting only the last record, twice, when using an ArrayList in Java.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rows>
<row> <name>john</name><id>1</id><city>xyz</city></row>
<row> <name>monu</name><id>2</id><city>abc</city></row>
</rows>