Parsing Tableau xml does not preserve original file - python-3.x

I try to work programmatically on Tableau desktop file (which are just xml file in spite of their .twb extension). I have many problem with lxml which doesn't preserve original content. To facilitate the explanation, imagine you have a test.xml file which contain the following text:
<column caption='Choix Découpage' name='[Aujourd&apos;Hui Parameter (copy 2)]'>
<member name='Nb d&apos;annulations' default-format='n#,##0.00" annulations";-#,##0.00" annulations"' />
<run>
:</run>
<calculation formula='iif([FAC_TYPE] = &apos;Avoir&apos; , [Calculation_1378101492427309057], null)' />
<alias key='"Billetterie Ferroviaire"' value='Train ticketing' />
</column>
Now let's parse it:
tree = etree.parse('test.xml')
root = tree.getroot()
print(etree.tostring(root,pretty_print=True,).decode("utf-8"))
When you run the code we can notice:
' becomes "
é becomes é Edit: resolved for this part
&apos; becomes '
How could i preserve the original ? (It would help me a lot when i try to check the diff with git in spite of showing all the useless change that are operated automatically)
Edit: I notice an other problem, when i run the folowing code:
[node.attrib['key'] for node in root.xpath("//alias")]
I got the result: ['"Billetterie Ferroviaire"'] and I am now unable to query with xpath if i am looking for the node whose attribute "key" is the original "Billetterie Ferroviaire" (root.xpath('//[#key="Billetterie Ferroviaire"]) doesn't work)

Related

XBA parsing and update with Excel VBA

I'm trying to make an XML parser/updater through Excel VBA.
First of all, I have been going back and forth between Excel VBA and Python but it seemed like Excel VBA was a better option to me.
However, I am open to any method really so please let me know if anyone has a different suggestion that would work better.
So, what I want to do with this application.
Parse XML and note the information on Excel format
I need name and the value of each attributes along with the text value of each node
After getting the information in the Excel format, I want to be able to revise values and output back to the XML format
So, in a nutshell, I am really aiming for a XML editor I guess?
But I am stuck at a few issues from the startline.
Here's a brief implementation of the XML parsing portion:
'load xml document
Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
xmlDoc.async = False
xmlDoc.validateOnParse = False
xmlDoc.Load(xmlFilepath)
'get document elements
Set xmlDocElement = xmlDoc.DocumentElement
Debug.Print xmlDocElement.xml
For i = 0 To xmlDocElement.ChildNodes.Length - 1
Debug.Print xmlDocElement.ChildNodes(i).xml
For j = 0 To xmlDocElement.ChildNodes(i).Attributes.Length - 1
Debug.Print xmlDocElement.ChildNodes(i).Attributes.Item(j).Name
Debug.Print xmlDocElement.ChildNodes(i).Attributes.Item(j).Value
Next j
Debug.Print xmlDocElement.ChildNodes(i).Text
Next i
The above method works well more or less with an exception for two conditions, so far at least.
XML file cannot be loaded if the text includes &/>/<
XML file cannot be loaded if it includes more than 1 highest parent node.
Text including &/>/< sample:
<parenttag>
<childtag>I love mac&cheese</childtag>
</parenttag>
The answer I found online was quite conclusive:
Revise the text so that it does not use &/>/<.
But I cannot modify the text and need to keep the current format.
Any way to bypass this?
More than 1 highest parent node sample:
<parenttag>
<childtag>Text</childtag>
</parenttag>
<differenttag>
<childtag>Some other text</childtag>
</differenttag>
XML Load does not work with multiple parent tags in 1 XML file.
And again, I cannot modify the XML file content, so I need a way around the load error.
I also want to note that I have initially started this project
by reading XML file as a text and process line by line.
But, this did not work well with multi-line content
and thus trying to figure out a way to process XML file properly.
This question really includes multiple portions but I would really appreciate if I can get any help.
The issue is that any XML parser will only accept valid XML. And
<childtag>I love mac&cheese</childtag>
is just no valid XML. It should be encoded as
<childtag>I love mac&cheese</childtag>
So that is what you need to fix. You can only work with a standard (like XML standard) if everyone follow the XML standard rules and produces valid XML. Otherwise your code might look like XML but it is no XML (until it is valid).
Also multiple root elements is not allowed in XML. If it has multiple roots then it is no XML. So to get out of your issue the only thing you can do is fix those issues before loading the file into a parser. For example you can add a root tag to make your multiple parents become childs of that root:
<myroot>
<parenttag>
<childtag>Text</childtag>
</parenttag>
<differenttag>
<childtag>Some other text</childtag>
</differenttag>
</myroot>
And & that are not encoded yet need to be changed to & to make them valid.
The only other option is to write your own parser to parse that custom files which are not XML. But that will not be possible in 2 lines of code as you will need to develop a parser for your NON-XLM files.

How to use cTAKES from the command line?

I wonder how to use Apache cTAKES from the command line.
E.g. :
I have a file note.txt that contains some text like "Patient had
elevated blood sugar but tests confirm no diabetes. Patient's father had
adult onset diabetes."
I want to use the provided analysis engine
\apache-ctakes-3.2.2-bin\apache-ctakes-3.2.2\desc\ctakes-clinical-pipeline\desc\analysis_engine\AggregatePlaintextUMLSProcessor.xml
How can I get the analyse engine's output (viz. the annotations) using the
command line (i.e. without using graphical user interfaces such as UIMA CAS
Visual Debugger or the Collection Processing Engine)? I'd prefer to
use the provided JAR files rather than having to compile the code.
The question is fairly simple but I couldn't find the information in
cTAKES's README or
on Confluence.
Please try the following steps to use cTAKES CPE from the command line (the key class is "org.apache.uima.examples.cpe.SimpleRunCPE"):
Change directory to $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_processing_engine/
Copy test_plaintext.xml to another file (e.g., "test_plaintext_test.xml").
Edit "test_plaintext_test.xml" to set input directory; find "nameValuePair" with name = "InputDirectory", and set the value string to the input directory. The following example set the input directory as "$CTAKES_HOME/note_input":
<nameValuePair>
<name>InputDirectory</name>
<value>
<string>note_input</string>
</value>
</nameValuePair>
Similarly, edit "test_plaintext_test.xml" to set the output directory ("$CTAKES_HOME/result_output" in the following example):
<nameValuePair>
<name>OutputDirectory</name>
<value>
<string>result_output</string>
</value>
</nameValuePair>
Save "test_plaintext_test.xml" and change directory to $CTAKES_HOME/bin.
Copy runctakesCPE.sh to another file (e.g., "runctakesCPE_CLI.sh").
Edit "runctakesCPE_CLI.sh"; replace the last line ("java ...") to the following line ("USER" and "PW" should be replaced by your UMLS Username and Password, and the memory setting Xms and Xms may be adjusted based on the size of memory on your machine):
java -Dctakes.umlsuser=USER -Dctakes.umlspw=PW -cp $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/ -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms2g -Xmx3g org.apache.uima.examples.cpe.SimpleRunCPE $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_processing_engine/test_plaintext_test.xml
Save "runctakesCPE_CLI.sh", and then create the input directory ("$CTAKES_HOME/note_input") and the output directory ("$CTAKES_HOME/result_output").
Put your note.txt to the input directory (e.g., "$CTAKES_HOME/note_input/note.txt"), and then run "runctakesCPE_CLI.sh".
cTAKES CPE will start running under command line mode, and the resulting file will be generated in the output directory (e.g., "$CTAKES_HOME/result_output/note.txt.xml").
I actually used your note.txt to run the steps above and here are the first several lines of the generated note.txt.xml:
<?xml version="1.0" encoding="UTF-8"?><CAS version="2">
<uima.cas.Sofa _indexed="0" _id="3" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Patient had elevated blood sugar but tests confirm no diabetes. Patient's father had adult onset diabetes.
"/>
<org.apache.ctakes.typesystem.type.structured.DocumentID _indexed="1" _id="1" documentID="note.txt"/>
<uima.tcas.DocumentAnnotation _indexed="1" _id="10" _ref_sofa="3" begin="0" end="107" language="x-unspecified"/>
<org.apache.ctakes.typesystem.type.textspan.Segment _indexed="1" _id="15" _ref_sofa="3" begin="0" end="107" id="SIMPLE_SEGMENT"/>
<org.apache.ctakes.typesystem.type.textspan.Sentence _indexed="1" _id="21" _ref_sofa="3" begin="0" end="63" sentenceNumber="0"/>
Hope this helps :-)
java -Dctakes.umlsuser=USER -Dctakes.umlspw=PW -cp $CTAKES_HOME/lib/*;$CTAKES_HOME/desc/;$CTAKES_HOME/resources‌​/ -
Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms2g -Xmx3g to_replace $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_p‌​rocessing_engine/tes‌​t_plaintext_test.xml
replace "to_replace" with either
org.apache.ctakes.ytex.tools.RunCPE or
org.apache.ctakes.core.cpe.CmdLineCpeRunner

Dicttoxml Module tags

I have been trying to make use of this module for some time now. I have many lists of dictionaries, that I want to convert into xml format. However, I want each list to essentially have its own 'table'. However When I try doing something along the lines of:
xml = dicttoxml.dictoxml(myList, root = False,
custom_root = "MyName",
attr_type = False)
I get every dict displayed as an <item> type. Shouldn't this produce what the module's owner refers to as an "xml snippet" that also is identified by the custom_root name?
Essentially I want each list to have its own identifier but not be created as 'root'. Basically where the following would have each item number associated to a certain list. Either encapsulating the whole list or each dict in the list would be suitable, I believe.
<root>
<item1>
#dict info
</item1>
<item2>
#dict info
</item2>
</root>
I fixed my problem by using just the custom_root variable in my call and leaving root = True. Then, I stripped the leading
b'<?xml version="1.0" encoding="UTF-8" ?>'
by calling
xml.partition(b'<?xml version="1.0" encoding="UTF-8" ?>')[2]
From then on, I created a file with <root> </root> tags and had the xml i created appended in between these tags.

SlowCheetah transform ignores multiple conditions

I have a WCF configuration file that I am trying to transform with SlowCheetah. For development use, we want to include the MEX endpoints, but when we release the product, these endpoints should be removed on all services except one. The server for which it should be left has the following endpoint:
<endpoint address="MEX"
binding="mexHttpBinding"
contract="IMetadataExchange" />
The ones that should be removed are as follows:
<endpoint address="net.tcp://computername:8001/WCFAttachmentService/MEX"
binding="netTcpBinding"
bindingConfiguration="UnsecureNetTcpBinding"
name="WCFAttachmentServiceMexEndpoint"
contract="IMetadataExchange" />
The transform I am using is:
<service>
<endpoint xdt:Locator="Condition(contains(#address, 'MEX') and not(contains(#binding, 'mexHttpBinding')))" xdt:Transform="RemoveAll" />
</service>
However, when I run this, ALL MEX endpoints are removed from the config file including the one that I wish to keep. How do I make this work properly?
The Locator Condition expression that selects the nodes seems to be correct. If you had only the two endpoints you posted in your example, this expression will select the second endpoint.
According to the documentation the Transform attribute RemoveAll should "remove the selected element or elements." Based on the information you posted it's not working as expected, since the first element was not selected and was removed anyway. Based on this StackOverflow answer it seems to me that the issue is with Condition. I'm not sure if that's a bug (it's poorly documented), but you could try some alternative solutions:
1) Using XPath instead of Condition. The effective XPath expression that is applied to your configuration file as a result of the Condition expression is:
/services/service/endpoint[contains(#address, 'MEX') and not(contains(#binding, 'mexHttpBinding'))]
You should also obtain the same result using the XPath attribute instead of Condition:
<endpoint xdt:Locator="XPath(/services/service/endpoint[contains(#address, 'MEX')
and not(contains(#binding, 'mexHttpBinding'))])" xdt:Transform="RemoveAll" />
2) Using Match and testing an attribute such as binding. This is a simpler test, and would be IMO the preferred way to perform the match. You could select the nodes you want to remove by the binding attribute
<endpoint binding="netTcpBinding" xdt:Locator="Match(binding)" xdt:Transform="RemoveAll" />
3) UsingXPath instead of Match in case you have many different bindings and only want to eliminate only those which are not mexHttpBinding:
<endpoint xdt:Locator="XPath(/services/service/endpoint[not(#binding='mexHttpBinding'))" xdt:Transform="RemoveAll" />
4) Finally, you could try using several separate statements with Condition() or Match() to individually select the <endpoint> elements you wish to remove, and use xdt:Transform="Remove" instead of RemoveAll.

How to specify the name of the generate file to the e:worksheet function

We use the jboss seam-->excel module integration for generating excel sheets using e:worksheet. But the downloaded file name comes out as ExportUsers.jxl.xls, I would rather see this as ExportUsers.xls. How do I customize this information.
filename attribute of the e:workbook tag
<e:workbook filename="ExportUsers.xls" />
Take a look at the Seam excel documentation.
filename — The filename to use for the download. The value is a string. Please note that if you map the DocumentServlet to some pattern, this file extension must also match.

Resources