Protocol buffers handling very large String message?

I was finally able to write Protocol Buffers code over REST and did some comparisons with XStream, which we currently use.
Everything seems great; I only stumbled on one thing.
We have very large messages in one particular attribute, something like this:
message Data {
  optional string datavalue = 1;
}
The datavalue above is an extremely large text message, 512 KB to 5 MB in size.
Protocol Buffers deserializes it just fine, with superb performance compared to XStream.
However, I noticed that when I send this message over the wire (via REST), it takes longer to get a response: always about twice as long as with XStream.
I am thinking this might come from serialization time.
Google's documentation says Protocol Buffers is not designed to handle very large messages, although it can handle very large data sets.
I was wondering if anyone has an opinion, or maybe a solution, for my case above?
Thanks

I was benchmarking different serialization tools a while ago and noticed that the Protobuf Java library took about 1.7x as long to serialize strings as java.io.DataOutputStream did. When I looked into it, it seemed to be a weird artifact of how the JVM optimizes certain code paths. However, in my benchmarking, XStream was always slower, even with really long strings.
One quick thing to try is the format-compatible Protostuff library in place of Google's Protobuf library.
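For what it's worth, a minimal sketch of what that swap might look like, using Protostuff's runtime schema against a plain POJO (the Data class here just mirrors the message from the question):

import io.protostuff.LinkedBuffer;
import io.protostuff.ProtostuffIOUtil;
import io.protostuff.Schema;
import io.protostuff.runtime.RuntimeSchema;

public class ProtostuffSketch {
    // Plain POJO mirroring the Data message from the question.
    public static class Data {
        public String datavalue;
    }

    public static byte[] serialize(Data data) {
        Schema<Data> schema = RuntimeSchema.getSchema(Data.class);
        LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
        try {
            return ProtostuffIOUtil.toByteArray(data, schema, buffer);
        } finally {
            buffer.clear(); // buffers are meant to be reused; clear after each use
        }
    }
}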

I remember reading somewhere (I'm trying to locate the article) that protobuf does very well when you have a mix of binary and textual data types. When you are working purely with textual data, you could get better performance and size by compressing it.
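As a sketch of that, gzipping the serialized bytes before they go over the wire takes only the JDK; whether it helps depends on how compressible your text actually is:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class WireCompression {
    // Gzip the serialized message bytes before sending them over REST.
    public static byte[] gzip(byte[] serialized) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(serialized);
        }
        return out.toByteArray();
    }
}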

Related

Is there an advantage of sending binary packets over strings?

The title says it all. Basically, I'm using TCP for a client-server setup, and I'm wondering if there is an advantage to transforming strings to binary before sending the data over TCP.
Strings are binary data, or can at least easily be converted to a byte[], with:
static byte[] GetStringBytes(string str)
{
    // Copies the string's raw UTF-16 code units into a byte array.
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}
If you compress/encode the data you send, then whether it starts out life as a string or as binary data, you will most likely be sending the same total number of bytes.
No real advantage in a vast majority of cases. Also, binary data tends to be more platform dependent, so if you want to extend your client/server to a multi-platform environment, you're probably better off sticking with Strings.
Some data is natively encoded in binary, like DER.
If your application needs to send that kind of data, you can send the binary directly; a CA certificate, for example. If you send it as a string, such as Base64, it takes more bytes to represent.
At lower levels, like TCP, the information shared between two endpoints during handshaking is binary, because of size, performance, and fault tolerance.
But as an application developer who needs to evolve software agilely, there are lots of things other than size you need to consider, like compatibility and maintenance. Let's say you need to return some metadata along with the binary data; you might use JSON to represent it:
{
  "version": 1,
  "data": "base64 encoded data"
}
Things will be easier for clients, because there are tons of tools to help parse JSON, and it is easier to debug when you can see the data without special tools (it is human-readable). With a binary encoding (like DER), on the other hand, you have to update your protocol document so that clients can update their code for the new format, and you also need to consider backward/forward compatibility.
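As a sketch of that envelope in Java, using only the JDK's Base64 (a real service would build the JSON with a proper library rather than string concatenation):

import java.util.Base64;

public class JsonEnvelope {
    // Wrap a binary payload in the JSON envelope shown above.
    public static String wrap(int version, byte[] payload) {
        String encoded = Base64.getEncoder().encodeToString(payload);
        return "{\"version\": " + version + ", \"data\": \"" + encoded + "\"}";
    }
}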
That said, there are tools that help make binary encodings more compatible and maintainable, like Apache Thrift, Protocol Buffers, and Apache Avro.
Looking deeper at the performance aspect: unless you are simply comparing with and without the encode/decode step (where it is easy to see which is better), you must run load tests to check.
For example, some extremely fast JSON parsers can easily beat binary parsers that claim better performance.

Protocol buffers logging

In our business, we are required to log every request/response that comes to our server.
For the time being, we are using XML as the standard implementation.
The log files are used when we need to debug/trace some error.
I am kind of curious: if we switch to protocol buffers, since it is binary, what would be the best way to log requests/responses to a file?
For example:
FileOutputStream output = new FileOutputStream("/files/log.txt");
request.build().writeTo(output);
For anyone who has used protocol buffers in an application: how do you log your requests/responses, in case you need them for debugging purposes?
TL;DR: write debugging logs in text, write long-term logs in binary.
There are at least two ways you can do this logging (and maybe, in fact, you should do both):
Writing your logs in text format. This is good for debugging and quickly checking for problems with your eyes.
Writing your logs in binary format - this will make future analysis much quicker, since you can load the data using the same protocol buffers code and do all kinds of things with it.
Quite honestly, this is more or less the way this is done at the place this technology came from.
We use the ShortDebugString() method on the C++ object to write a human-readable version of all incoming and outgoing messages to a text file. ShortDebugString() returns a one-line version of the same string returned by the toString() method in Java. I'm not sure how easy it is to accomplish the same thing in Java.
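For what it's worth, the protobuf Java library has a TextFormat.shortDebugString() helper that appears to serve the same purpose; a minimal sketch (the use of java.util.logging is just for illustration):

import com.google.protobuf.Message;
import com.google.protobuf.TextFormat;
import java.util.logging.Logger;

public class TextLog {
    private static final Logger LOG = Logger.getLogger("wire");

    // Log a one-line, human-readable rendering of any protobuf message.
    public static void log(String direction, Message msg) {
        LOG.info(direction + ": " + TextFormat.shortDebugString(msg));
    }
}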
If you have competing needs for logging and performance, then I suppose you could dump your binary data to the file as-is, with perhaps each record preceded by a tag containing a timestamp and a length value so you'll know where each particular bit of data ends. But I hasten to admit this is very ugly. You will need to write a utility to read and analyze this file, and you will be helpless without that utility.
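If you do go that route, note that protobuf's Java API already handles the length part via writeDelimitedTo(), which writes a varint length prefix before the message, so only the timestamp tag needs writing by hand. A rough sketch (the class and record layout are my own invention):

import com.google.protobuf.Message;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BinaryLog implements AutoCloseable {
    private final DataOutputStream out;

    public BinaryLog(String path) throws IOException {
        this.out = new DataOutputStream(new FileOutputStream(path, true)); // append mode
    }

    // One record: 8-byte millisecond timestamp, then the length-prefixed message.
    public synchronized void append(Message msg) throws IOException {
        out.writeLong(System.currentTimeMillis());
        msg.writeDelimitedTo(out);
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}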
A more reasonable solution would be to dump your binary data in text form. I'm thinking of "lines" of text, again starting with whatever tagging information you find relevant, followed by some length information in decimal or hex, followed by as many hex bytes as needed to dump your buffer - thus you could end up with some fairly long lines. But since the file is line structured, you can use text-oriented tools (an editor in the simplest case) to work with it. Hex dumping essentially means you are using two bytes in the log to represent one byte of data (plus a bit of overhead). Heh, disk space is cheap these days.
If those binary buffers have a fairly consistent structure, you could even break out and label fields (or something like that) so your data becomes a little more human readable and, more importantly, better searchable. Of course it's up to you how much effort you want to sink into making your log records look pretty; but the time spent here may well pay off a little later in analysis.
If you have non-ASCII character strings in your messages, logging them via an implicit or explicit call to toString will escape the characters.
"오늘은 무슨 요일입니까?" becomes "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
If you want to retain the non-ASCII characters, use TextFormat.printer().escapingNonAscii(false).printToString(message).
See this answer for more details.

binary file formats: need for error correction?

I need to serialize some data in a binary format for efficiency (a data log where 10-100 MB files are typical), and I'm working out the formatting details. I'm wondering if I realistically need to worry about file corruption / error correction / etc.
What are circumstances where file corruption can happen? Should I be building robustness to corruption into my binary format? Or should I wrap my nonrobust-to-corruption stream of bytes with some kind of error correcting code? (any suggestions? I'm using Java) Or should I just not worry about this?
edit: the preliminary binary format, as I have it right now, contains a bunch of variable-length segments, so I am slightly worried that if I ever do have data corruption, then upon reading it back I could get out of sync, be unable to recover, and lose the rest of the file.
You should at least add a checksum. The bit error rate is good on modern hard drives, but this is not so for other media. Power loss during a write usually corrupts the end of the file. If the data is important, you will need error-correcting codes, triple and unbuffered writes, etc., to commit transactions.
EXE files do not have error correction, yet a single bit change can have drastic consequences.
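As a sketch of the checksum idea in Java: framing each record as length + CRC32 + payload lets a reader verify records one at a time and stop at the first corrupted one (this frame layout is made up for illustration, not a standard):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class FramedRecords {
    // Write one record: 4-byte length, 8-byte CRC32 of the payload, then the payload.
    public static void write(DataOutputStream out, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);
        out.writeInt(payload.length);
        out.writeLong(crc.getValue());
        out.write(payload);
    }

    // Read one record back, throwing if the checksum does not match.
    public static byte[] read(DataInputStream in) throws IOException {
        int length = in.readInt();
        long expected = in.readLong();
        byte[] payload = new byte[length];
        in.readFully(payload);
        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != expected) throw new IOException("corrupt record");
        return payload;
    }
}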
If a file is to be transferred over TCP, you may assume zero errors.
I have seen it happen once or twice that a file transferred over the Internet became corrupted. You can do error detection using a checksum, such as SHA256.
You might be interested in the notes on error detecting codes in HDF5. Where and what kind of checksum depends on how you are accessing and updating the data as well as what is a useful chunk to detect an error in.
I went with a Reed-Solomon encoding system. There's a fairly easy-to-use Java implementation of it in the Google zxing library.
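A rough sketch of driving that encoder, assuming GF(256) as used by QR codes (blocks in this field are limited to 255 symbols total, so larger data must be split into blocks first):

import com.google.zxing.common.reedsolomon.GenericGF;
import com.google.zxing.common.reedsolomon.ReedSolomonEncoder;

public class RsSketch {
    // Return the data symbols followed by `ecSymbols` parity symbols.
    public static int[] encode(byte[] data, int ecSymbols) {
        int[] block = new int[data.length + ecSymbols];
        for (int i = 0; i < data.length; i++) {
            block[i] = data[i] & 0xFF; // one byte per GF(256) symbol
        }
        new ReedSolomonEncoder(GenericGF.QR_CODE_FIELD_256).encode(block, ecSymbols);
        return block;
    }
}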

Why is the software world full of status codes?

Why did programmers ever start using status codes? I mean, I guess I could imagine this might be useful back in the days when a text string was an expensive resource. WAYYY back then. But even after we had megabytes of memory to work with, we continued to use them. What possible advantage could there be for obfuscating the meaning of an error message or status message behind a status code?
It's easy to provide different translations of a status code. Having to look up a string to find the translation in another language is a little silly.
Besides, status codes are often used in code, and typing:
var result = OpenFile(...);
if (result == "File not fond") {  // note the typo
    ...
}
cannot be detected as a mistake by the compiler, whereas
var result = OpenFile(...);
if (result == FILE_NOT_FOND) {  // the same typo is now an undefined identifier
    ...
}
will be.
It allows for localization and changes to the text of an error message.
It's the same reason as ever. Numbers are cheap, strings are expensive, even in today's mega/gigabyte world.
I don't think status codes constitute obfuscation; it's simply an abstraction of a state.
A great use of integer status codes is in a finite-state machine. Having the states be integers allows an efficient switch statement to jump to the correct code.
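A toy sketch of the idea in Java (the states and transitions here are made up):

public class TrafficLight {
    // Made-up integer states for illustration.
    static final int RED = 0, GREEN = 1, YELLOW = 2;

    // An integer switch like this can compile down to an efficient jump table.
    static int next(int state) {
        switch (state) {
            case RED:    return GREEN;
            case GREEN:  return YELLOW;
            case YELLOW: return RED;
            default:     throw new IllegalArgumentException("unknown state: " + state);
        }
    }
}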
Integers also allow for more efficient use of bandwidth, which is still a very important issue in mobile applications.
Yet another example of integer codes over strings is comparison. If you have similar statuses grouped together (say, statuses 10000-10999), performing range comparisons to determine the type of status is a major win. Could you imagine doing string comparisons just to know whether an error code is fatal or just a warning? Eww.
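As a sketch, with made-up ranges:

// Hypothetical convention: 10000-10999 are fatal errors, 11000-11999 warnings.
static boolean isFatal(int status) {
    return status >= 10000 && status <= 10999;
}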
Numbers can be easily compared, including by another program (e.g. was there a failure). Human readable strings cannot.
Consider some of the things you might include in a string comparison, and sometimes might not:
Encoding options
Range of supported characters (compare ASCII and Unicode)
Case Sensitivity
Accent Sensitivity
Accent encoding (decomposed or composed Unicode forms)
And that is before allowing for the majority of humans who don't speak English.
404 is universal on the web. If there were no status codes, imagine if every different piece of server software had its own error string?
"File not found"
"No file exists"
"Couldn't access file"
"Error presenting file to user"
"Rendering of file failed"
"Could not send file to client"
etc...
Even when we disregard data length, it's still better to have standard numeric representations that are universally understood, even when accompanied by actual error messages.
Computers are still binary machines and numeric operations are still cheaper and faster than string operations.
Integer representation in memory is a far more consistent thing than string representation. To begin with, just think about all those null-terminated and Pascal strings. Then you can think of ASCII and the characters from 128 to 255 that differed between code pages, and end up with Unicode characters and all of their little-endian and big-endian variants, UTF-8, etc.
As it turns out, returning an integer and having a separate resource stating how to interpret those integers is a far more universal approach.
Well, when talking to a customer over the telephone, a number is much better than a string; a string can be in many different languages, while a number can't. Try googling some error text in, say, Swedish, and then try googling it in English; guess which gets the better hit.
Because not everyone speaks English. It's easier to translate error codes to multiple languages than to litter your code base with strings.
It's also easier for machines to understand codes, since you can assign classes of errors to ranges of numbers, e.g. 1-10 are I/O issues, 11-20 are DB issues, etc.
Status codes are unique, whereas strings may not be. There is just one status code, for example 213, but there may be many renderings of, for example, "file not found": "File not found", "File not found!", "Datei nicht gefunden", "File does not exist"...
Thus, status codes keep the information as simple as possible!
How about backwards compatibility with 30+ years of software libraries? After all, some code is still written in C ...
Also ... having megabytes of memory available is no justification for using them. And that's assuming you're not programming an embedded device.
And ... it's just pointless busy work for the CPU. If a computer is blindingly fast at processing strings, imagine the speed boost from efficient coding techniques.
I work on mainframes, and there it's common for applications to prepend every message with a code (usually 3-4 letters identifying the product, 4-5 digits identifying the specific message, and then a letter indicating the severity of the message). And I wish this were standard practice on PCs too.
There are several advantages aside from translation (as mentioned by others) to this:
It's easy to find the message in the manual; usually the software is accompanied by a message manual explaining all the messages (and possible solutions, etc.).
It's possible for automation software to react to specific messages in the log in a specific way.
It's easy to find the source of the message in the source code. You can have further error codes per specific message; in that case, this is again helpful in debugging.
For all practical purposes numbers are the best representation for statuses, even today, and I imagine would be so for a while.
The most important things about status codes are probably conciseness and acceptance. Two different systems can talk to each other all they want using numbers, but if they don't agree on what the numbers mean, it's going to be a mess. Conciseness is another important aspect, as a status code must not be more verbose than the meaning it's trying to convey. The world might agree on using
The resource that you asked for in the HTTP request does not exist on this server
as a status code in lieu of 404, but that would be a plain nuisance.
So why do we still use numbers, specifically decimal digits? Because that is the most concise form of representation today, and all computers are built upon it.
There might come a day when we start using images, or even videos, or something totally unimaginable for representing status codes in a highly abstracted form, but we are not there yet. Not even for strings.
Make it easier for an end user to understand what is happening when things go wrong.
It helps to have a very basic method of giving statuses clearly and universally. Whereas strings can easily be written differently depending on dialect and can also change grammatically, numerals have no grammatical forms and do not change with dialect. There is also the storage and transfer issue: a string is larger, and thus takes longer to transfer over a network and to store (even if only by a few thousandths of a millisecond). Because of this, we can assign numbers as universal identifiers for statuses: they transfer faster, and the programs that read them can present them however they wish, including in multiple languages.
Plus, it is easier to read computationally:
switch ($status) {
    case 404:
        echo 'File not found!';
        break;
    case 500:
        echo 'Broken server!';
        break;
}
etc.

Combining resources into a single binary file

How does one combine several resources for an application (images, sounds, scripts, XML files, etc.) into one or more binary files so that they're protected from users' hands? What are the typical steps (organizing, loading, encryption, etc.)?
This is particularly common in game development, yet a lot of the game frameworks and engines out there don't provide an easy way to do this, nor describe a general approach. I've been meaning to learn how to do it, but I don't know where to begin. Could anyone point me in the right direction?
There are lots of ways to do this. m_pGladiator has some good ideas, especially with serialization. I would like to make a few other comments.
First, if you are going to pack a bunch of resources into a single file (I call these packfiles), then I think you should work to avoid loading the whole file and then deserializing out of that file into memory. The simple reason is that it takes more memory. That's not really a problem on PCs, I guess, but it's good practice, and it's essential when working on consoles. While we don't (currently) serialize objects as m_pGladiator has suggested, we are moving towards that.
There are two types of packfiles that you might have. One would be a file where you want arbitrary access to the contents of the files. A second type might be a collection of files where you need all of those files when loading a level. A basic example might be:
An audio packfile might contain all the audio for your game. You might only need to load certain kinds of audio for the menus or interface screens, and different sets of audio for the levels. This might fall into the first category above.
A type that falls into the second category might be all the models/textures/etc. for a level. You basically want to load the entire contents of this file into the game at load time, because you will (likely) need all of its contents while a player is playing that level or section.
Many of the packfiles that we build fall into the second category. We basically package up the level contents and then compress them with something like zlib. When we load one of these at game time, we read a small amount of the file, uncompress what we've read into a memory buffer, and then repeat until the full file has been read into memory. The buffer we read into is relatively small, while the final destination buffer is large enough to hold the largest set of uncompressed data that we need. This method is tricky, but again, it saves on RAM, it's an interesting exercise to get working, and you feel all nice and warm inside because you are being a good steward of system resources. Once the packfile has been completely uncompressed into its destination buffer, we run a final pass on the buffer to fix up pointer locations, etc. This method only works when you write out your packfile as structures that the game knows. In other words, our packfile-writing tools share structs (or classes) with the game code. We are basically writing out and compressing exact representations of data structures.
If you simply want to cut down on the number of files that you are shipping and installing on a user's machine, you can do that with something like the first kind of packfile that I describe. Maybe you have thousands of textures and would simply like to cut down on the sheer number of files that you have to zip up and package. You can write a small utility that reads the files you want to package together, writes a header containing the filenames and their offsets in the packfile, and then writes the contents of each file, one after the other, into your large binary file. At game time, you can simply load the header of this packfile and store the filenames and offsets in a hash. When you need to read a file, you can hash the filename and see if it exists in your packfile, and if so, read the contents directly from the packfile by seeking to the offset and then reading from that location. Again, this method is basically a way to pack data together without regard for encryption, etc. It's simply an organizational method.
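A minimal sketch of the lookup side of such a packfile in Java, assuming the header has already been parsed into a name-to-(offset, length) map (the layout is illustrative, not any standard format):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

public class PackFile {
    // One header entry: where a packed file starts and how long it is.
    public record Entry(long offset, int length) {}

    private final RandomAccessFile file;
    private final Map<String, Entry> index;

    public PackFile(RandomAccessFile file, Map<String, Entry> index) {
        this.file = file;
        this.index = index;
    }

    // Look up the entry, seek to its offset, and read its bytes directly.
    public byte[] read(String name) throws IOException {
        Entry e = index.get(name);
        if (e == null) throw new IOException("not in packfile: " + name);
        byte[] data = new byte[e.length()];
        file.seek(e.offset());
        file.readFully(data);
        return data;
    }
}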
But again, I do want to stress that if you are going a route like I or m_pGladiator suggest, I would work hard not to have to pull the whole file into RAM and then deserialize it to another location in RAM. That's a waste of resources (which you perhaps have plenty of). I would say you can do this to get it working, and then once it's working, work on a method that only reads part of the file at a time and decompresses into your destination buffer. You must use a compression scheme that can work like this, though. zlib and LZW both do (I believe). I'm not sure about an MD5 algorithm.
Hope that this helps.
Do as Java does: pack it all into a zip, and use a filesystem-like API to read directly from there.
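In Java that can be as little as the following sketch, using the JDK's ZipFile (the archive path and entry name are made up):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipResources {
    // Read one resource out of the archive without unpacking the rest.
    public static byte[] load(String archivePath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(archivePath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) throw new IOException("missing entry: " + entryName);
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes();
            }
        }
    }
}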
Personally, I have never used the tools already available for this. If you want to prevent your game from being hacked easily, then you have to develop your own resource-manipulation engine.
First of all, read about serializing objects. When you load a resource from a file (a graphic, a sound, or whatever), it is stored in some object instance in memory. A game usually uses dozens of graphical and sound objects. You have to make a tool which loads them all and stores them in collections in memory, then serializes those collections into a binary file, and you have every resource there.
Then you can encrypt this file with an encryption algorithm. (Note that MD5, sometimes suggested for this, is a hash, not a cipher; something like AES is what you would actually encrypt with.)
Also, you can use zlib or another compression library to make this big binary file a bit smaller.
In the game, you load the encrypted binary file, decrypt it, and unpack it. Then you deserialize the object collections, and you have all the resources back in memory.
Of course, you can make this more comprehensive by storing the resources for different levels in different binary files, and so on; there are plenty of variants, depending on what you want. You can also compress first and then encrypt, or make other combinations of the steps.
Short answer: yes.
In Mac OS 6, 7, and 8 there was a substantial API devoted to this exact task. Look up the "Resource Manager" if you are interested. Edit: the ROOT physics analysis package has something similar too.
Not that I know of a good tool right now. What platform(s) do you want it to work on?
Edited to add: all of the two or three tools of this sort that I am aware of share a similar structure:
The file starts with a header and index
There is a series of blocks, some of which may have their own headers and indices, and some of which are leaves.
Each leaf is a simple serialization of the data to be stored.
The whole file (or sometimes individual blocks) may be compressed.
Not terribly hard to implement your own, but I'd look for a good existing one that meets your needs first.
For future people, like me, who are wondering about this same topic, check out the two following links:
http://www.sfml-dev.org/wiki/en/tutorials/formatdat
http://archive.gamedev.net/reference/programming/features/pak/
