Excel: Export to XML - With XML in cells

I'm trying to export a spreadsheet that has some XML in some of the cells of the table.
ID (column A): 23455
FACT (column B) (this code is copied & pasted from a sample cell - they don't all have this simplicity or structure):
"<div class=""fact"">
<p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p>
</div>
"
I'd like to have XML like the following:
<record>
<ID>23455</ID>
<FACT><div class="fact"><p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p></div></FACT>
</record>
This is complex enough that I doubt that Excel's native XML schema export will work (that thing is persnickety enough that I can't get it to work with even the simplest of data values).
My current thought is to write a Perl script, to read this as a CSV file and export XML. However, I've noticed that CSV does a poor job handling XML that's been "embedded" like this.
I'm hoping someone else might have a better suggestion for how to pull this information out.
Edit: Finally figured out the mistake I made with export. Can export and get the following:
<record>
<ID>23455</ID>
<FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
</FACT>
</record>
I think I can work with this...some regex and it might be good enough (replacing every &lt; might put me at risk of killing a true less-than sign).
So I'm still open to suggestions.

Just posting this as the answer...
If you export the column as text you can get the following:
<record>
<ID>23455</ID>
<FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
</FACT>
</record>
In an XML editor I did a find and replace to turn all the escaped tags back into real tags, using the following regex: s/&lt;(\/?[\w\s="-_]+?)&gt;/<$1>/
It's a bit dangerous if there are actual less-than or greater-than signs in the document, but you'd need a case where it was &lt; /maybe and text with common tag symbols ="-_ &gt;. That is possible, but most equations are of the form X &lt; Y &lt; Z. Our content doesn't use &lt; and &gt; all that much, so I can be fairly confident it won't catch the edge case.
I also "fixed" all the HTML (s/<b>/<b/>/ and s/<img (.*?)>/<img $1/>/) and checked parsing (theoretically an edge case would cause a parsing error).
And yes, I now have a doc in mixed DTD that will make all true XML peeps quake with horror, but I can work with it.
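For anyone who would rather not rely on a regex, here is a minimal Python sketch of the same idea, assuming the text export produces entity-escaped markup in FACT as described above. The function name inline_fragment and the temporary wrapper element are made up for illustration. It lets the XML parser decode the cell text, re-parses that text as a fragment, and attaches the resulting nodes under FACT.

```python
import xml.etree.ElementTree as ET

# Sample record as it might come out of the text export (entity-escaped markup).
exported = """<record>
<ID>23455</ID>
<FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
</FACT>
</record>"""


def inline_fragment(record_xml, field="FACT"):
    """Replace the escaped markup inside <field> with real child elements."""
    record = ET.fromstring(record_xml)
    target = record.find(field)
    # The parser has already decoded &lt; and &gt;, so .text holds raw markup.
    # A hypothetical wrapper element lets us parse it as one document.
    fragment = ET.fromstring("<wrapper>" + (target.text or "") + "</wrapper>")
    target.text = fragment.text
    for child in list(fragment):
        target.append(child)
    return record


record = inline_fragment(exported)
print(ET.tostring(record, encoding="unicode"))
```

A cell containing a true less-than sign, or HTML that is not well-formed XML (an unclosed img tag, say), makes the second parse raise xml.etree.ElementTree.ParseError rather than silently mangling the text, so those cells still need a manual fix.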


XML parsing and update with Excel VBA

I'm trying to make an XML parser/updater through Excel VBA.
First of all, I have been going back and forth between Excel VBA and Python but it seemed like Excel VBA was a better option to me.
However, I am open to any method really so please let me know if anyone has a different suggestion that would work better.
Here is what I want to do with this application:
Parse the XML and record the information in an Excel sheet
I need the name and value of each attribute, along with the text value of each node
After getting the information into Excel, I want to be able to revise values and write them back out as XML
So, in a nutshell, I guess I am really aiming for an XML editor.
But I am stuck on a few issues right from the start.
Here's a brief implementation of the XML parsing portion:
'load xml document
Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
xmlDoc.async = False
xmlDoc.validateOnParse = False
xmlDoc.Load(xmlFilepath)
'get document elements
Set xmlDocElement = xmlDoc.DocumentElement
Debug.Print xmlDocElement.xml
For i = 0 To xmlDocElement.ChildNodes.Length - 1
    Debug.Print xmlDocElement.ChildNodes(i).xml
    For j = 0 To xmlDocElement.ChildNodes(i).Attributes.Length - 1
        Debug.Print xmlDocElement.ChildNodes(i).Attributes.Item(j).Name
        Debug.Print xmlDocElement.ChildNodes(i).Attributes.Item(j).Value
    Next j
    Debug.Print xmlDocElement.ChildNodes(i).Text
Next i
The above method works reasonably well, except in two cases (so far at least):
The XML file cannot be loaded if the text includes &/>/<
The XML file cannot be loaded if it includes more than one top-level parent node.
Text including &/>/< sample:
<parenttag>
<childtag>I love mac&cheese</childtag>
</parenttag>
The answer I found online was quite conclusive:
Revise the text so that it does not use &/>/<.
But I cannot modify the text and need to keep the current format.
Any way to bypass this?
More than 1 highest parent node sample:
<parenttag>
<childtag>Text</childtag>
</parenttag>
<differenttag>
<childtag>Some other text</childtag>
</differenttag>
XML Load does not work with multiple parent tags in 1 XML file.
And again, I cannot modify the XML file content, so I need a way around the load error.
I also want to note that I have initially started this project
by reading XML file as a text and process line by line.
But, this did not work well with multi-line content
and thus trying to figure out a way to process XML file properly.
This question really includes multiple portions but I would really appreciate if I can get any help.
The issue is that any XML parser will only accept valid XML. And
<childtag>I love mac&cheese</childtag>
is simply not valid XML. It should be encoded as
<childtag>I love mac&amp;cheese</childtag>
So that is what you need to fix. You can only work with a standard (like the XML standard) if everyone follows its rules and produces valid XML. Otherwise your file might look like XML, but it is not XML (until it is valid).
Also, multiple root elements are not allowed in XML. If a file has multiple roots, it is not XML. So the only way out is to fix those issues before loading the file into a parser. For example, you can add a root tag so that your multiple parents become children of that root:
<myroot>
<parenttag>
<childtag>Text</childtag>
</parenttag>
<differenttag>
<childtag>Some other text</childtag>
</differenttag>
</myroot>
And any & characters that are not yet encoded need to be changed to &amp; to make them valid.
The only other option is to write your own parser for those custom files, which are not XML. But that will not be possible in two lines of code, as you will need to develop a parser for your non-XML files.
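Since the question says Python is also an option, the pre-processing described above could be sketched like this. It is only a sketch: it assumes the file has no XML declaration or DOCTYPE at the top, the helper name load_loose_xml is made up, and myroot is the same arbitrary wrapper name used above. It only repairs bare ampersands; a stray < inside text would still need separate handling.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not already start an entity such as &amp; or &#38;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def load_loose_xml(text, root_name="myroot"):
    """Escape bare ampersands, wrap everything in one root, then parse."""
    fixed = BARE_AMP.sub("&amp;", text)
    return ET.fromstring(f"<{root_name}>{fixed}</{root_name}>")


sample = """<parenttag>
<childtag>I love mac&cheese</childtag>
</parenttag>
<differenttag>
<childtag>Some other text</childtag>
</differenttag>"""

root = load_loose_xml(sample)
for child in root:
    # Each original top-level element is now a child of the wrapper root.
    print(child.tag, "->", child.find("childtag").text)
```

The original file stays untouched on disk; only the in-memory copy is repaired. When writing back, you would serialize each child of the wrapper separately to restore the multi-root layout, and turn &amp; back into & if the consumer really expects the unescaped form.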

batch file extract numbers from text file with little information

This is related to my other two posts. I'm extracting text from a text file and analyzing it, and I've run into some problems. For a while I've been using a method that sets all the text between two other strings as a variable, but here is my situation. I need to extract the speed (numbers) from the string below: "etc...,query":{"ping":47855},"cmts":...etc. The problem is that the text cmts sometimes changes to something else, so really I need to extract all the numbers from this:
,query":{"ping":47855},"
One more thing that makes this difficult is that the characters }," are all over the file. Thank you for helping me! -Lucas EDG Programmer.
Here's the full file:
{"_id":53291,"ip":"158.69.22.95","domain":"jectile.com","port":25565,"url":"","date_add":1453897770,"status":1,"scan":1,"uptime":99.53,"last_update":1485436105,"geo":{"country":"US","country_name":"United States","city":"Lake Forest"},"info":{"name":" Jectile | jectile.com [1.8-1.11]\n Shoota (Call of Duty) \/ Zambies (Zombie Survival)","type":"FML","version":"1.10","plugins":[],"players":18,"max_players":420,"players_list":[],"map":"world","software":"BungeeCord 1.8.x, 1.9.x, 1.10.x, 1.11.x","avg_player_day":24.458333,"avg_load_day":5.8234,"platform":"MINECRAFT","icon":true},"counter":{"online":47871,"offline":228,"players":{"date":"2017-01-26","total":0},"last_offline":0,"query":{"ping":47855},"cmts":1},"rating":{"main":19.24,"difference":-0.64,"content_up":0.15,"K":0},"last":{"offline":1485415702,"online":1485436105},"chart":{"14:30":14,"14:40":16,"14:50":15,"15:00":18,"15:10":12,"15:20":13,"15:30":9,"15:40":9,"15:50":11,"16:00":12,"16:10":11,"16:20":11,"16:30":18,"16:40":25,"16:50":23,"17:00":27,"17:10":27,"17:20":23,"17:30":24,"17:40":26,"17:50":33,"18:00":31,"18:10":31,"18:20":32,"18:30":37,"18:40":38,"18:50":39,"19:00":38,"19:10":34,"19:20":33,"19:30":40,"19:40":36,"19:50":37,"20:00":38,"20:10":36,"20:20":38,"20:30":37,"20:40":37,"20:50":37,"21:00":34,"21:10":32,"21:20":33,"21:30":33,"21:40":29,"21:50":28,"22:00":26,"22:10":21,"22:20":24,"22:30":29,"22:40":22,"22:50":23,"23:00":27,"23:10":24,"23:20":26,"23:30":25,"23:40":28,"23:50":27,"00:00":32,"00:10":29,"00:20":33,"00:30":32,"00:40":31,"00:50":33,"01:00":40,"01:10":40,"01:20":40,"01:30":41,"01:40":45,"01:50":48,"02:00":43,"02:10":45,"02:20":46,"02:30":46,"02:40":43,"02:50":42,"03:00":39,"03:10":36,"03:20":44,"03:30":34,"03:40":0,"03:50":32,"04:00":35,"04:10":35,"04:20":33,"04:30":43,"04:40":37,"04:50":26,"05:00":31,"05:10":31,"05:20":27,"05:30":25,"05:40":26,"05:50":18,"06:00":13,"06:10":15,"06:20":17,"06:30":18,"06:40":17,"06:50":15,"07:00":16,"07:10":17,"07:20":16,"07:30":16,"07:40":18,"07:50":19,"08:00"
:14,"08:10":12,"08:20":12,"08:30":13,"08:40":17,"08:50":20,"09:00":18,"09:10":0,"09:20":0,"09:30":27,"09:40":18,"09:50":20,"10:00":15,"10:10":13,"10:20":12,"10:30":10,"10:40":10,"10:50":11,"11:00":13,"11:10":13,"11:20":16,"11:30":19,"11:40":17,"11:50":13,"12:00":10,"12:10":11,"12:20":12,"12:30":16,"12:40":15,"12:50":16,"13:00":14,"13:10":10,"13:20":13,"13:30":16,"13:40":16,"13:50":17,"14:00":20,"14:10":16,"14:20":16},"query":"ping","max_stat":{"max_online":{"date":1470764061,"players":129}},"status_query":"ok"}
By the way, the reason things change is that it looks at info from different servers.
Very similar to the answer I gave to your first question:
@Echo Off
Set/P var=<some.json
Set var=%var:*:{"ping":=%
Set var=%var:},=&:%
Echo=%var%
Timeout -1
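If a scripting language is available alongside batch, the file is plain JSON, so the value can also be read by key instead of by string surgery. A minimal Python sketch follows, assuming the counter -> query -> ping nesting shown in the full file holds for every server; the function name extract_ping is made up, and the sample below is a cut-down stand-in for the real file.

```python
import json


def extract_ping(json_text):
    """Return counter.query.ping from the server-status JSON."""
    data = json.loads(json_text)
    return data["counter"]["query"]["ping"]


# Cut-down sample with the same nesting as the full file in the question.
sample = '{"_id":53291,"counter":{"online":47871,"query":{"ping":47855},"cmts":1},"query":"ping"}'
print(extract_ping(sample))  # prints 47855
```

Because the lookup goes by key, it does not matter what follows the ping object (cmts or anything else) or how often }," appears elsewhere in the file.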

Does Saiku Analytics Plugin support 'Style' Attributes on Format String?

I am trying to apply some color formatting to my results using Saiku Analytics, without success. I have tried the recommendations in the following threads for applying formatting in the schema .xml
http://forums.pentaho.com/showthread.php?173272-Changing-Color-of-measures
http://forums.pentaho.com/showthread.php?73935-Calculated-Member-Format-String
as well as directly trying to apply custom format on the calculated members menu in Saiku Analytics.
Mondrian Documentation reads:
"The format string can even contain 'style' attributes which are interpreted specially by JPivot. If present, JPivot will render cells in color."
Apart from that, I haven't been able to find any evidence that Saiku supports 'style' as well. Any ideas?
EDIT:
Following the advice of this post
https://groups.google.com/a/saiku.meteorite.bi/forum/#!topic/user/Gb4AGqU87GY
I managed to apply formatting on my fields by changing my schema.
My calculated measure is the following:
<CalculatedMember name="Exclude Appraiser Role 2 Average Score" dimension="Measures" formula="IIF(([Measures].[Appraiser Role 2 Count],[Appraisee].[Appraisee]) &lt; 3,NULL,IIF([Measures].[Score_1] + [Measures].[Score_2] + [Measures].[Score_3] + [Measures].[Score_4] + [Measures].[Score_5] = 0,NULL,[Measures].[Total Question Score] / ([Measures].[Score_1] + [Measures].[Score_2] + [Measures].[Score_3] + [Measures].[Score_4] + [Measures].[Score_5])))">
<CalculatedMemberProperty name="FORMAT_STRING" expression="IIF([Measures].[Appraiser Role 2 Average Score] &lt; 3, &quot;|#,###.##|style=red&quot; , &quot;|#,###.##|style=green&quot;)"/>
</CalculatedMember>
My question now is the following: where can I find complete documentation (or at least more examples) of the formatting options that are available?
For example, I am aware of the 'arrow' and 'style' attributes, but what is the complete list of values they take? Which paradigm do they follow? The Mondrian documentation states that the format string follows Visual Basic formatting, but this hint didn't prove helpful at all.
EDIT 2: As expected, colors are HTML. Question remains about arrows and other possible available attributes.
You can read this : http://jira.meteorite.bi/browse/SKU-1112
The difficulty is that this formatting must be done by Saiku (not Mondrian), so Saiku must implement an interpreter for the formatting clauses, and this is not a simple task.

Labelling text using Notepad++ or any other tool

I have several .dat files containing information about hotel reviews, as below:
/*
<Author> simmotours
<Content> review......goes here
<Date>Nov 18, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>3
<Location>4
<Cleanliness>4
<Check in / front desk>4
<Service>4
<Business service>-1
*/
I want to classify the reviews into two groups, pos and neg, i.e. have two folders, pos and neg, each containing several files: reviews rated above 3 are classified as positive and those rated below 3 as negative.
How can I quickly and efficiently automate this process?
You could write a Python script to read the overall score. Do this by looping over the lines using readline(). Find the "Overall" score using some string parsing. Then move the file into the right directory. These are all very simple things to do in Python; just break the task down into steps and search for answers to those steps.
Notepad++ can do replacements with regular expressions. And allows the definition of macros. Use them to convert the file to an XML file. Check out the help file.
Then you can read it with any scripting language and do what you want.
Alternatively you could change the file to a form where you can load it into Excel and do the analysis there.
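The Python route from the first answer could look roughly like the sketch below. It rests on several assumptions not stated in the question: each .dat file holds a single review, a negative Overall value means "not rated" (as -1 seems to elsewhere in the sample), and files rated exactly 3 are left alone. The names classify and sort_reviews are made up, and files are copied rather than moved so the originals stay in place.

```python
import re
import shutil
from pathlib import Path

# Matches a line such as "<Overall>4" and captures the number.
OVERALL = re.compile(r"^<Overall>\s*(-?\d+(?:\.\d+)?)", re.MULTILINE)


def classify(text):
    """Return 'pos', 'neg', or None when the review should be skipped."""
    match = OVERALL.search(text)
    if match is None:
        return None
    score = float(match.group(1))
    if score < 0:  # assumption: negative values mean "no rating given"
        return None
    if score > 3:
        return "pos"
    if score < 3:
        return "neg"
    return None  # exactly 3: the question does not say where it belongs


def sort_reviews(src_dir, dest_dir):
    """Copy every .dat file in src_dir into dest_dir/pos or dest_dir/neg."""
    for path in Path(src_dir).glob("*.dat"):
        label = classify(path.read_text(encoding="utf-8", errors="replace"))
        if label is None:
            continue
        target = Path(dest_dir) / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)
```

Calling sort_reviews("reviews", "sorted") would then fill sorted/pos and sorted/neg; both directory names are placeholders.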

Not using colnames when reading .xls files with RODBC

I have another puzzling problem.
I need to read .xls files with RODBC. Basically I need a matrix of all the cells in one sheet, and then I use greps, strsplits etc. to get the data out. As each sheet contains multiple tables in a different order, with some text fields holding other options in between, I need something that works like readLines(), but for Excel sheets. I believe RODBC is the best way to do that.
The core of my code is the following function:
.read.info.default <- function(file, sheet) {
  fc <- odbcConnectExcel(file) # file connection
  tryCatch({
    x <- sqlFetch(fc,
                  sqtable = sheet,
                  as.is = TRUE,
                  colnames = FALSE,
                  rownames = FALSE
    )
  },
  error = function(e) {stop(e)},
  finally = close(fc)
  )
  return(x)
}
Yet, whatever I tried, it always takes the first row of the mentioned sheet as the variable names of the returned data frame. No clue how to get that solved. According to the documentation, colnames=FALSE should prevent that.
I'd like to avoid the xlsReadWrite package. Edit : and the gdata package. Client doesn't have Perl on the system and won't install it.
Edit:
I gave up and went with read.xls() from the xlsReadWrite package. Apart from the name problem, it turned out RODBC can't really read cells with special signs like slashes. A date in the format "dd/mm/yyyy" just gave NA.
Looking at the source code of sqlFetch, sqlQuery and sqlGetResults, I realized the problem is more than likely in the drivers. Somehow the first line of the sheet is seen as some column feature instead of an ordinary cell. So instead of colnames, they're equivalent to DB field names. And that's an option you can't set...
Can you use the Perl-based solution in the gdata package instead? That happens to be portable too...
