What structured text format is that?

I am getting a string formatted as follows:
a:2:
{
i:0;
a:0:{}s:2:"o1";
a:14:
{
s:8:"duration";
i:299;
s:12:"content_hash";
s:32:"af1e3e5707da79c2et96d8280db6343e";
s:12:"content_type";
s:9:"video/mp4";
s:9:"extension";
s:3:"mp4";
s:5:"width";
i:1280;
s:6:"height";
i:720;
s:6:"aspect";
d:1.7800000000000000266453525910037569701671600341796875;
s:15:"is_video_bucket";
i:1;
s:8:"revision";
i:0;
This comes from Tumblr's API.
It's all made of chunks that have a letter identifying the type, followed by the size:
s:8:"duration";
i:299;
is really: string (8 bytes): "duration"; integer 299.
But I'm not familiar with that structure; is it common?

I finally found it: it's the format produced by PHP's serialize() function (Tumblr's backend is PHP); I had never come across it before.
https://www.php.net/manual/en/function.serialize.php
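Whatever the format is called, the chunk grammar described above is straightforward to tokenize. Below is a minimal sketch in Java that handles only the scalar chunks (s, i, d) and ignores the nested a:N:{...} arrays; the class and method names are my own invention:

```java
import java.util.ArrayList;
import java.util.List;

public class PhpSerializeDemo {
    // Parse a flat run of scalar chunks like s:8:"duration";i:299;d:1.78;
    // Nested a:N:{...} array chunks are not handled here.
    static List<Object> parseScalars(String s) {
        List<Object> out = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            char type = s.charAt(pos);
            int colon = s.indexOf(':', pos);
            if (type == 's') {
                // s:<byteLength>:"<content>";
                int colon2 = s.indexOf(':', colon + 1);
                int len = Integer.parseInt(s.substring(colon + 1, colon2));
                String val = s.substring(colon2 + 2, colon2 + 2 + len); // skip opening quote
                out.add(val);
                pos = colon2 + 2 + len + 2; // skip closing quote and ';'
            } else if (type == 'i') {
                int semi = s.indexOf(';', colon);
                out.add(Long.parseLong(s.substring(colon + 1, semi)));
                pos = semi + 1;
            } else if (type == 'd') {
                int semi = s.indexOf(';', colon);
                out.add(Double.parseDouble(s.substring(colon + 1, semi)));
                pos = semi + 1;
            } else {
                throw new IllegalArgumentException("unsupported chunk type: " + type);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parseScalars("s:8:\"duration\";i:299;"));
    }
}
```

Handling the arrays would take a recursive descent over the a:N:{...} chunks, with the length prefix telling you how many key/value pairs to expect.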


String formatting and storing the values

I have a string similar to the one below:
"OPR_NAME:CODE=value,:DESC=value,:NUMBER=value,:INITIATOR=value,:RESP"
I am using StringTokenizer to split the string into tokens based on the delimiter ",:". I need the values of
CODE, DESC and NUMBER.
Can someone please tell me how to achieve this? The values may come in random order in my string.
For example, my string may just as well look like this:
"OPR_NAME:DESC=value,:NUMBER=value,:CODE=value,:INITIATOR=value,:RESP" and it should still be able to fetch the values.
I did the following to split the string into tokens:
StringTokenizer st = new StringTokenizer(str,",:");
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
But not sure how to store these tokens to just get the value of 3 fields as mentioned above.
Thanks !!
Okay, so what I meant is: detect where the "=" is and then apply a substring to get the value you want.
A rough example (note: call nextToken() only once per token, otherwise each line consumes three tokens):
String token = st.nextToken();
System.out.println(token.substring(token.indexOf('=') + 1));
Use split instead:
String[] parts = X.split(",:");
for (String x:parts) {
System.out.println(x.substring(x.indexOf('=')+1));
}
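To answer the storage part of the question (fetching CODE, DESC and NUMBER in any order), the split tokens can go into a Map keyed by field name. A rough sketch, assuming the values never contain ",:" themselves; the class and helper names are mine:

```java
import java.util.HashMap;
import java.util.Map;

public class FieldDemo {
    // Split on ",:" and store each NAME=value pair in a map.
    static Map<String, String> toMap(String input) {
        Map<String, String> fields = new HashMap<>();
        for (String part : input.split(",:")) {
            int eq = part.indexOf('=');
            if (eq < 0) continue;            // tokens like "RESP" carry no value
            String key = part.substring(0, eq);
            int colon = key.lastIndexOf(':');
            if (colon >= 0) key = key.substring(colon + 1); // drop the "OPR_NAME:" prefix
            fields.put(key, part.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> f =
            toMap("OPR_NAME:DESC=value1,:NUMBER=value2,:CODE=value3,:INITIATOR=value4,:RESP");
        System.out.println(f.get("CODE"));   // value3
        System.out.println(f.get("DESC"));   // value1
        System.out.println(f.get("NUMBER")); // value2
    }
}
```

Because lookups go by name, the order of the fields in the input string no longer matters.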

Treat all cells as strings while using the Apache POI XSSF API

I'm using the Apache POI framework for parsing large Excel spreadsheets. I'm using this example code as a guide: XLSX2CSV.java
I'm finding that cells that contain just numbers are implicitly being treated as numeric fields, while I wanted them to be treated always as strings. So rather than getting 1.00E+13 (which I'm currently getting) I'll get the original string value: 10020300000000.
The example code uses a XSSFSheetXMLHandler which is passed an instance of DataFormatter. Is there a way to use that DataFormatter to treat all cells as strings?
Or as an alternative: in the implementation of the interface SheetContentsHandler.cell method there is string value that is the cellReference. Is there a way to convert a cellReference into an index so that I can use the SharedStringsTable.getEntryAt(int idx) method to read directly from the strings table?
To reproduce the issue, just run the sample code on an xlsx file of your choice with a number like the one in my example above.
UPDATE: It turns out that the string value I get seems to match what you would see in Excel. So I guess that's going to be "good enough" generally. I'd expect the data I'm sent to "look right" and therefore it'll get parsed correctly. However, I'm sure there will be mistakes and in those cases it'd be nice if I could get at the raw string value using the streaming API.
To resolve this issue I created my own class based on XSSFSheetXMLHandler.
I copied that class, renamed it, and then in the endElement method changed the part of the code that formats the raw string:
case NUMBER:
String n = value.toString();
if (this.formatString != null && n.length() > 0)
thisStr = formatter.formatRawCellContents(Double.parseDouble(n), this.formatIndex, this.formatString);
else
thisStr = n;
break;
I changed it so that it would not format the raw string:
case NUMBER:
thisStr = value.toString();
break;
Now every number in my spreadsheet has its raw value returned rather than a formatted version.
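As for the side question about converting a cellReference such as "C5" into indexes: POI ships org.apache.poi.ss.util.CellReference for exactly that. Note, though, that the shared-strings table is indexed by the value stored inside the cell's XML, not by cell position, so a position index alone won't drive SharedStringsTable.getEntryAt. The letter-to-number conversion itself is simple enough to sketch without POI (class and method names are mine):

```java
public class CellRefDemo {
    // Convert a reference like "AB12" into 0-based {row, column}.
    // Column letters are a base-26 number where A=1 ... Z=26.
    static int[] toIndexes(String ref) {
        int i = 0;
        int col = 0;
        while (i < ref.length() && Character.isLetter(ref.charAt(i))) {
            col = col * 26 + (Character.toUpperCase(ref.charAt(i)) - 'A' + 1);
            i++;
        }
        int row = Integer.parseInt(ref.substring(i));
        return new int[] { row - 1, col - 1 };
    }

    public static void main(String[] args) {
        int[] rc = toIndexes("C5");
        System.out.println(rc[0] + "," + rc[1]); // row 4, column 2
    }
}
```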

Reading a text file into a structure

I'm trying to read some information from a .txt file and then store it in a structure, but I don't know how to pick out only the words I need.
The file is written through:
fstream write_messages("C:\\Messages.txt");
A line in the .txt looks like this:
18 [#deniseeee]: hello, how are you? 2016-04-26.23:37:58
So, the thing is that I have a list of structures
list<Message*> g_messages;
where
struct Message {
static unsigned int s_last_id; // keep track of IDs to assign it automatically
unsigned int id;
string user_name;
string content;
string hour;
Message(const string& a_user_name, const string& a_content) :
user_name(a_user_name), content(a_content), id(++s_last_id), hour(currentDateTime())
{
}
Message(){}
};
I want to read the file so that I can store the number into the id of the list, the word between "[#" and "]:" into the user_name, the next sentence into content and the date into the hour.
How can I do this?
Any help is appreciated
Thank you
The problem is: you are writing this file so that it is easy to read for a person, while you should write it so that it is easy to load by your program.
You are using spaces to separate your elements, while spaces can very well occur inside those elements.
I would suggest using a symbol that is unlikely to appear in the text, for example the pipe '|'. Also, I would write the fixed-size elements first:
18|2016-04-26.23:37:58|deniseeee|hello, how are you?
Then you can read entire line, split it on those pipe symbols and load into your struct fields.
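If the file format cannot be changed, the original line can still be parsed: the id is the first token, the user name sits between "[#" and "]:", and the timestamp is the last whitespace-free token. A sketch of that idea using a regular expression, written in Java purely for illustration (std::regex in C++ accepts essentially the same pattern):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MessageParseDemo {
    // id, then [#user], then free text, then a trailing timestamp token
    static final Pattern LINE =
        Pattern.compile("^(\\d+) \\[#([^\\]]+)\\]: (.*) (\\S+)$");

    public static void main(String[] args) {
        Matcher m = LINE.matcher("18 [#deniseeee]: hello, how are you? 2016-04-26.23:37:58");
        if (m.matches()) {
            System.out.println(m.group(1)); // id:        18
            System.out.println(m.group(2)); // user_name: deniseeee
            System.out.println(m.group(3)); // content:   hello, how are you?
            System.out.println(m.group(4)); // hour:      2016-04-26.23:37:58
        }
    }
}
```

The greedy (.*) leaves exactly one trailing token for the timestamp group, so message text containing spaces is handled correctly.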

Convert a text file to UTF8 in D

I'm attempting to use the Phobos standard library functions to read in any valid UTF file (UTF-8, UTF-16, or UTF-32) and get it back as a UTF-8 string (aka D's string). After looking through the docs, the most concise function I could think of to do so is
import std.file, std.utf;
string readToUTF8(in string filename)
{
try {
return readText(filename);
}
catch (UTFException e) {
try {
return toUTF8(readText!wstring(filename));
}
catch (UTFException) { // unnamed: D disallows shadowing the outer e
return toUTF8(readText!dstring(filename));
}
}
}
However, catching a cascading series of exceptions seems extremely hackish. Is there a "cleaner" way to go about it without relying on catching a series of exceptions?
Additionally, the above function returns the BOM (byte order mark) as a leading character in the resulting string if the source file was UTF-16 or UTF-32, which I would like to omit given that the result is UTF-8. Is there a way to omit it besides explicitly stripping it?
One of your questions answers the other: the BOM allows you to identify the exact UTF encoding used in the file.
Ideally, readText would do this for you. Currently, it doesn't, so you'd have to implement it yourself.
I'd recommend using std.file.read, casting the returned void[] to a ubyte[], then looking at the first few bytes to see if they start with a BOM, then cast the result to the appropriate string type and convert it to a string (using toUTF8 or to!string).
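The BOM test that answer describes is just a byte-prefix check. A sketch in Java for illustration (the D version over the ubyte[] from std.file.read has the same shape); note that UTF-32LE must be tested before UTF-16LE, because FF FE is a prefix of FF FE 00 00:

```java
public class BomDemo {
    // Return the encoding implied by a leading byte order mark, if any.
    static String detectEncoding(byte[] b) {
        if (b.length >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE
                && b[2] == 0 && b[3] == 0) return "UTF-32LE";
        if (b.length >= 4 && b[0] == 0 && b[1] == 0
                && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF) return "UTF-32BE";
        if (b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB
                && b[2] == (byte) 0xBF) return "UTF-8";
        if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) return "UTF-16LE";
        if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) return "UTF-16BE";
        return "no BOM (assume UTF-8)";
    }

    public static void main(String[] args) {
        // "A" encoded as UTF-16LE with its BOM
        System.out.println(detectEncoding(new byte[] { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 }));
    }
}
```

Once the encoding is known, the BOM bytes can be skipped before converting, which also answers the second question about stripping it.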

How can I achieve fast and effective String compression in Actionscript 3?

I have an Object which stores pairs for a find and replace that I perform on up to 1500 Strings at a time.
The Object is populated with pairs using a method that will accept a String and then store this as a property with the value being an automatically assigned base 36 number, like this:
function addShort(long:String):void
{
_pairs[long] = _nextShort;
}
_nextShort returns an automatically incremented counter converted via .toString(36), so running the above a few times might make _pairs look like this:
_pairs:Object = {
"class": "0",
"testing.objects.TestBlock": "1",
"skin.x": "2",
"skin.y": "3",
...........
"someString": "az1"
};
This Object could realistically end up being really large, having over a couple hundred pairs stored.
I then have a method that will take a "long" String (which will include the Strings I've given to addShort() previously) and return a new String where these have been replaced with their respective short value.
The method looks like this:
public function shorten(long:String):String
{
for(var i:String in _pairs)
long = long.split(i).join(_pairs[i]);
return long;
}
Nice and simple; however, I foresee a massive problem in a case where I might want to "shorten" 2000+ Strings while the _pairs Object holds over 500 pairs at the same time.
That ends up being 1,000,000 split/join passes in total, which obviously doesn't seem very efficient at all.
How can I improve this process significantly?
Based on comments from @kapep I realized what I needed is actually a compression library that will do this work for me.
I stumbled across an LZW compression class within a package called Calista, which works great.
I did notice that the compression was really slow, which is understandable, but if there are any suggestions for something quicker I'm open to them.
How about using Regular Expressions to replace the String patterns?
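The idea behind the regex suggestion is to make a single pass: build one alternation out of all the long keys and replace each match via a lookup, instead of one split/join pass per pair. A sketch in Java for illustration (ActionScript 3 can do the same with String.replace and a callback function); Pattern.quote protects keys like "skin.x" that contain metacharacters, and keys sharing a prefix should be ordered longest first, since alternation picks the first alternative that matches:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ShortenDemo {
    // Replace every long key in `input` with its short code in one pass.
    static String shorten(String input, Map<String, String> pairs) {
        // One alternation of all quoted keys, e.g. \Qclass\E|\Qskin.x\E
        String alternation = pairs.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(alternation).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(pairs.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>(); // preserves insertion order
        pairs.put("class", "0");
        pairs.put("skin.x", "2");
        System.out.println(shorten("class=Foo skin.x=10", pairs)); // 0=Foo 2=10
    }
}
```

This touches each input String once regardless of how many pairs exist, so the cost grows with the text size rather than with text size times pair count.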
