Why is the software world full of status codes? - history

Why did programmers ever start using status codes? I mean, I guess I could imagine this might be useful back in the days when a text string was an expensive resource. WAYYY back then. But even after we had megabytes of memory to work with, we continued to use them. What possible advantage could there be in obfuscating the meaning of an error message or status message behind a status code?

It's easy to provide different translations of a status code. Having to look up a string to find the translation in another language is a little silly.
Besides, status codes are often used directly in code, and typing:
var result = OpenFile(...);
if (result == "File not fond") {
...
}
cannot be detected as a mistake by the compiler (note the deliberate misspelling), whereas
var result = OpenFile(...);
if (result == FILE_NOT_FOND) {
...
}
will be flagged, because FILE_NOT_FOND is an undefined identifier.

It allows for localization and changes to the text of an error message.

It's the same reason as ever. Numbers are cheap, strings are expensive, even in today's mega/gigabyte world.

I don't think status codes constitute obfuscation; it's simply an abstraction of a state.
A great use of integer status codes is in a finite-state machine. Having the states be integers allows an efficient switch statement to jump to the correct code.
Integers also allow for more efficient use of bandwidth, which is still a very important issue in mobile applications.
Yet another argument for integer codes over strings is comparison. If you have similar statuses grouped together (say, statuses 10000-10999), performing range comparisons to determine the type of a status is a major win. Could you imagine doing string comparisons just to know whether an error is fatal or just a warning? Eww.
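A minimal Python sketch of that idea (the ranges and the codes here are hypothetical, just to illustrate the technique):
# Hypothetical status ranges: 10000-10999 fatal errors, 11000-11999 warnings.
FATAL = range(10000, 11000)
WARNING = range(11000, 12000)

def severity(status):
    """Classify a status code using cheap integer range checks."""
    if status in FATAL:
        return 'fatal'
    if status in WARNING:
        return 'warning'
    return 'info'

print(severity(10404))  # fatal
print(severity(11021))  # warning
No string comparison is needed anywhere: each range check is two integer comparisons.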

Numbers can be easily compared, including by another program (e.g., was there a failure?). Human-readable strings cannot.
Consider some of the things you might include in a string comparison, and sometimes might not (two of these are illustrated below):
Encoding options
Range of supported characters (compare ASCII and Unicode)
Case sensitivity
Accent sensitivity
Accent encoding (decomposed or composed Unicode forms)
And that is before allowing for the majority of humans, who don't speak English.
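A small Python illustration of the case and accent pitfalls (the strings are made up; the point is that "equal-looking" strings are not equal):
import unicodedata

# Case sensitivity already breaks naive matching:
print('File not found' == 'FILE NOT FOUND')  # False

# Accent encoding: the same visible text in composed vs decomposed form.
composed = 'café'                                    # é as a single code point, U+00E9
decomposed = unicodedata.normalize('NFD', composed)  # 'e' plus a combining accent
print(composed == decomposed)                        # False, yet they render identically
print(composed == unicodedata.normalize('NFC', decomposed))  # True after normalizing
An integer comparison has none of these failure modes.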

404 is universal on the web. If there were no status codes, imagine if every different piece of server software had its own error string?
"File not found"
"No file exists"
"Couldn't access file"
"Error presenting file to user"
"Rendering of file failed"
"Could not send file to client"
etc...
Even when we disregard data length, it's still better to have standard numeric representations that are universally understood, even when accompanied by actual error messages.

Computers are still binary machines and numeric operations are still cheaper and faster than string operations.

Integer representation in memory is a far more consistent thing than string representation. To begin with, just think of all those null-terminated and Pascal strings. Then think of ASCII and the characters from 128 to 255 that differed across codepages, and end up with Unicode characters and all their little-endian/big-endian variants, UTF-8, and so on.
As it happens, returning an integer and keeping a separate resource stating how to interpret those integers is a far more universal approach; the snippet below shows how unstable string bytes are by comparison.
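A quick Python demonstration (the character ć and the status 404 are just examples):
import struct

text = 'ć'
print(text.encode('utf-8'))      # b'\xc4\x87' - two bytes
print(text.encode('utf-16-le'))  # b'\x07\x01' - a different two bytes
print(text.encode('utf-16-be'))  # b'\x01\x07' - same encoding, byte order flipped

print(struct.pack('<i', 404))    # b'\x94\x01\x00\x00' - always 4 bytes, one convention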

Also, when talking to a customer over the telephone, a number is much better than a string, and a string can be in many different languages while a number can't. Try googling some error text in, let's say, Swedish, and then try googling it in English; guess which gets the better hit.

Because not everyone speaks English. It's easier to translate error codes to multiple languages than to litter your code base with strings.
It's also easier for machines to understand codes, as you can assign classes of errors to a range of numbers, e.g., 1-10 are I/O issues, 11-20 are database issues, etc.

Status codes are unique, whereas strings may not be. There is just one status code, for example "213", but there may be many renderings of, for example, "file not found": "File not found", "File not found!", "Datei nicht gefunden" (German for "file not found"), "File does not exist"...
Thus, status codes keep the information as simple as possible!

How about backwards compatibility with 30+ years of software libraries? After all, some code is still written in C...
Also, having megabytes of memory available is no justification for wasting them. And that's assuming you're not programming an embedded device.
And it's just pointless busywork for the CPU. Even if a computer is blindingly fast at processing strings, imagine the speed boost from efficient coding techniques.

I work on mainframes, and there it's common for applications to prepend every message with a code (usually 3-4 letters identifying the product, 4-5 digits identifying the specific message, and then a letter indicating the severity of the message). I wish this were standard practice on PCs too.
There are several advantages to this aside from translation (as mentioned by others); a parsing sketch follows the list:
It's easy to find the message in the manual; usually the software is accompanied by a message manual explaining all the messages (and possible solutions, etc.).
It's possible for automation software to react to specific messages in the log in a specific way.
It's easy to find the source of the message in the source code. You can have further error codes per specific message; in that case, this is again helpful in debugging.
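A minimal Python sketch of how automation software could parse such message IDs (the exact format, including the severity letters I/W/E/S, is hypothetical and based only on the description above):
import re

# 3-4 letter product prefix, 4-5 digit message number, 1 severity letter.
MSG_ID = re.compile(r'^([A-Z]{3,4})(\d{4,5})([IWES])$')

def parse_message_id(code):
    """Split a mainframe-style message ID into its parts, or return None."""
    m = MSG_ID.match(code)
    if not m:
        return None
    product, number, severity = m.groups()
    return {'product': product, 'number': int(number), 'severity': severity}

print(parse_message_id('ABCD1234E'))
# {'product': 'ABCD', 'number': 1234, 'severity': 'E'}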

For all practical purposes, numbers are the best representation for statuses, even today, and I imagine they will be for a while.
The most important things about status codes are probably conciseness and acceptance. Two different systems can talk to each other all they want using numbers, but if they don't agree on what the numbers mean, it's going to be a mess. Conciseness matters because a status code must not be more verbose than the meaning it's trying to convey. The world might agree on using
The resource that you asked for in the HTTP request does not exist on this server
as a status code in lieu of 404, but that would be a plain nuisance.
So why do we still use numbers, specifically decimal numerals? Because that is the most concise form of representation today, and all computers are built upon them.
There might come a day when we start using images, or even videos, or something totally unimaginable to represent status codes in a highly abstracted form, but we are not there yet. Not even for strings.

They make it easier for an end user to understand what is happening when things go wrong.

It helps to have a very basic method of giving statuses clearly and universally. Whereas strings can easily be typed differently depending on dialect and can also undergo grammatical changes, numerals have no grammatical forms and do not change with dialect. There is also the storage and transfer issue: a string is larger, and thus takes longer to transfer over a network and to store (even if only by a few thousandths of a millisecond). Because of this, we can assign numbers as universal identifiers for statuses: they transfer more quickly and are more precise, and the programs that read them can present them however they wish (being multilingual).
Plus, it is easier to read computationally:
switch ($status) {
    case 404:
        echo 'File not found!';
        break;
    case 500:
        echo 'Broken server!';
        break;
}
etc.

Related

A new idea on how to beat the 32,767 text limit in Excel

So, as many others have asked in the past: is there a way to beat the 32k-per-cell limit in Excel?
I have found ways to do it by splitting the workload into two different .txt files and then merging them, but it is a giant PITA, and more often than not I end up just using Excel within its limits, because I no longer have time to validate the data after the .txt file merges; the whole process is long and tedious, IMO.
I assume the limitation is there because it was coded in when Microsoft developed Excel, and they have yet to raise it (in the 2013 version the limit is still the same, so it would do no good to upgrade).
I also know many will say that if you need information of that length in a single cell you should use Access. Well, I have no idea how to use Access, or how to import a tab-delimited file into Access the way you would into Excel; and even if I could figure that out, I would still have to learn all the new commands and the Excel equivalents, if there even is such a thing.
So I was browsing some blog posts the other day on how to beat software limitations, and I read something about reverse engineering.
Would it be possible to load Excel into a hex editor and change every instance of 32767 to something greater?
While 32767 may seem like an arbitrary number, it's actually the upper limit of a 16-bit signed integer (called a short in C). The range of a short goes from -32768 to 32767.
A 16-bit integer can also be unsigned, in which case its range is 0 to 65535.
Since it's impossible for a cell to have a negative number of characters, it seems odd that Microsoft would limit a cell's length based on a signed rather than unsigned 16-bit integer. When they wrote the original program, they probably couldn't imagine anyone storing so much information in a single cell. Using shorts may have simplified the code. (My first computer had only 4K of memory, so it's still amazing to me that Excel can store 8 times that much information in a single cell.)
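To see those ranges concretely, here is a small illustration in Python using the standard struct module (this demonstrates 16-bit integer limits, not Excel's internals):
import struct

print(struct.pack('<h', 32767))   # signed 16-bit: fits, b'\xff\x7f'
print(struct.pack('<H', 65535))   # unsigned 16-bit: fits, b'\xff\xff'
try:
    struct.pack('<h', 32768)      # one past the signed limit
except struct.error as e:
    print(e)                      # short format requires -32768 <= number <= 32767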
Microsoft may have kept the 32767 limit to maintain backward compatibility with previous versions of Excel. However, that doesn't really make sense, because the row and column counts greatly increased in recent versions of Excel, making large spreadsheets incompatible with previous versions.
Now to your question of reverse-engineering Excel. It would be a gargantuan task, but not impossible. In the early '90s, I reverse-engineered and wrote vaccines for a few small computer viruses (several hundred bytes). In the '80s, I reverse-engineered an 8KB computer chess program.
When reverse-engineering an executable, you'll need a good disassembler or decompiler. Depending on what you use, you may get assembly-language or C code as the output. But note that this will not be commented code, and you will not see meaningful variable or function names. You'll have to read every line of code to determine what it does. And you'll quickly discover that the executable is the least of your worries. Excel's executable links in a number of DLL files, which would also need reverse-engineering.
To be successful, you will need an extensive knowledge of Windows programming in addition to C or Intel assembly code – not to mention a large amount of patience. Learning Access would be a much simpler task.
I'd be interested in why 32767 is insufficient for your needs. A database may make more sense, and it wouldn't necessarily need to duplicate the functionality of Excel. I store information in a database for output to Web pages, in which case I use HTML+JavaScript for anything that needs to be interactive.
In case anyone is still having this issue:
I had the same problem generating a pipe-separated file of longitudinal research data: the header row exceeded the 32767 limit. That's not an issue unless the end user opens the file in Excel. The workaround is to have the end user open the file in Google Sheets, perform the text-to-columns transformation, then download and open the file in Excel.
https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Length-limit-of-cell-contents-in-Excel-when-opening-exported-bibliographic-data?language=en_US
Jack Straw from Wichita (https://stackoverflow.com/users/10327211/jack-straw-from-wichita): surely you can import a pipe-separated file directly into Excel using Data > Get Data? For me it finds the pipe and treats the piped file the same way as a CSV. Even if it did not for you, the import has an option to specify the separator used in your text file.

py3k default to byte instead of string

I'm just switching from Python 2 to Python 3, and the new string system is a real pain (or rather, I'm not understanding its true benefit).
Is there any way to make it default to the old-style bytes system without having to put a b before every string? I send a lot of commands via sockets, and the code just looks ugly, i.e.:
conn.sendall(b'k\n')
I use bytes far more often than I worry about Unicode.
No, there isn't. And from what I gather, you don't really think it is a pain, and you do understand the benefit; you just think the b'' is ugly, which doesn't seem like a very good reason to me.
Separating binary and text data is a great simplification in almost all cases. The need to prefix binary data with a b is a small price to pay for that. If the prefixes really bother you, a small helper can hide them; see the sketch below.
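A minimal sketch, assuming an ASCII-based protocol and a socket-like object named conn (both are assumptions):
def send_line(conn, text):
    """Encode a text command as ASCII bytes and send it with a trailing newline."""
    conn.sendall(text.encode('ascii') + b'\n')

# usage: send_line(conn, 'k') instead of conn.sendall(b'k\n')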

How to determine codepage of a file (that had some codepage transformation applied to it)

For example, if I know that Ä‡ should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there were an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but the tools are not important; I'll take anything that works, including Python scripts).
EDIT:
Could you please be a little more verbose? Do you know for certain that some substring should be some exact text? Or do you know just the language? Or are you just guessing? And the transformation that was applied, was it correct (i.e., did it produce text valid in the other charset)? Or was it a single transformation from charset X to Y while the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened), so I can try to transform the text back to the proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer: a UTF-8 text file got opened as an ASCII text file and then exported as CSV.
It's extremely hard to do this in general. The main problem is that all the ASCII-based encodings (iso-8859-*, DOS and Windows codepages) use the same range of codepoints, so no particular codepoint or set of codepoints will tell you which codepage the text is in.
There is one encoding that is easy to tell: if it's valid UTF-8, then it's almost certainly not iso-8859-* or any Windows codepage, because while all byte values are valid in those, the chance of a valid UTF-8 multi-byte sequence appearing in such text is almost zero.
Then it depends on which further encodings may be involved. A valid sequence in Shift-JIS or Big5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.
If you can limit the number of transformations that may have happened, it shouldn't be too hard to put together a Python script that tries them out, eliminates the obvious wrongs, and uses a spell-checker to pick the most likely one; see the sketch below. I don't know of any tool that would do it.
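Something like this, assuming the damage was a single wrong decode (the candidate codepage list and the sample word are illustrative):
# Re-encode the damaged text with each candidate codepage and see whether
# the resulting bytes then decode cleanly as UTF-8.
CANDIDATES = ['cp1252', 'latin-1', 'cp1250', 'iso-8859-2']

def undo_single_misdecode(damaged):
    results = []
    for enc in CANDIDATES:
        try:
            results.append((enc, damaged.encode(enc).decode('utf-8')))
        except (UnicodeEncodeError, UnicodeDecodeError):
            pass  # this codepage cannot round-trip the damaged text
    return results

print(undo_single_misdecode('Ä‡'))  # e.g. [('cp1252', 'ć'), ('cp1250', 'ć')]
A spell-checker pass over the surviving candidates would then pick the most plausible result.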
Tools like that were quite popular a decade ago, but now it's quite rare to see damaged text.
As far as I know, this can be done effectively at least for a particular language. So, if you assume the text's language is, say, Russian, you could collect statistical information about characters or small groups of characters from a lot of sample texts; e.g., in English the combination "th" appears more often than "ht".
You could then try different encoding combinations and choose the one whose output has the most probable text statistics.

Protocol buffers handling very large String message?

I was finally able to write protocol buffers code over REST and did some comparison with XStream, which we currently use.
Everything seems great; I only stumbled over one thing.
We have very large messages in one particular attribute, something like this:
message Data {
    optional string datavalue = 1;
}
The datavalue above is extremely large text, between 512 KB and 5 MB in size.
Protocol buffers deserialize it just fine, with superb performance compared to XStream.
However, I noticed that when I send this message over the wire (via REST), it takes longer to get a response - always about twice as long as with XStream.
I am thinking this might come from serialization time.
The Google documentation says protocol buffers are not designed to handle very large messages, although they can handle very large data sets.
I was wondering if anyone has an opinion or maybe a solution for my case above?
Thanks
I was benchmarking different serialization tools a while ago and noticed that the Protobuf Java library took about 1.7x as long to serialize strings as java.io.DataOutputStream did. When I looked into it, it seemed to have to do with a weird artifact of how the JVM optimizes certain code paths. However, in my benchmarking, XStream was always slower, even with really long strings.
One quick thing to try is the format-compatible Protostuff library in place of Google's Protobuf library.
I remember reading somewhere (I'm trying to locate the article) that protobuf is very good when you have a mix of binary and textual data types. When you are working with purely textual data, you could get better performance and size by compressing it; see the sketch below.
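A minimal sketch of that compression idea in Python (the field name datavalue comes from the message above; storing the compressed payload as bytes rather than string is an assumption, since compressed output is no longer valid text):
import zlib

def compress_payload(text):
    """Compress a large text value before placing it in the message."""
    return zlib.compress(text.encode('utf-8'))

def decompress_payload(blob):
    return zlib.decompress(blob).decode('utf-8')

big = 'some very large text ' * 100000   # a few MB of repetitive text
blob = compress_payload(big)
print(len(big), '->', len(blob))         # repetitive text compresses dramatically
assert decompress_payload(blob) == big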

Protocol buffers logging

In our business, we are required to log every request/response that comes to our server.
At the moment we use XML as the standard implementation.
Log files are used when we need to debug or trace some error.
I am kind of curious: if we switch to protocol buffers, since they are binary, what would be the best way to log requests/responses to a file?
For example:
FileOutputStream output = new FileOutputStream("\\files\\log.txt");
request.build().writeTo(output);
For anyone who has used protocol buffers in an application: how do you log your requests/responses, in case you need them for debugging purposes?
TL;DR: write debugging logs in text, write long-term logs in binary.
There are at least two ways you can do this logging (and maybe, in fact, you should do both; a sketch of both forms follows):
Writing your logs in text format. This is good for debugging and quickly checking for problems with your eyes.
Writing your logs in binary format. This will make future analysis much quicker, since you can load the data using the same protocol buffers code and do all kinds of things with it.
Quite honestly, this is more or less the way logging is done at the place this technology came from.
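A minimal sketch of both forms using the official Python protobuf API (google.protobuf.text_format is the real module; msg is assumed to be any compiled protobuf message object):
from google.protobuf import text_format

def log_forms(msg):
    """Return the human-readable and the machine-friendly log form of a message."""
    text_line = text_format.MessageToString(msg, as_one_line=True)  # debugging log
    binary_blob = msg.SerializeToString()                           # long-term log
    return text_line, binary_blob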
We use the ShortDebugString() method on the C++ object to write a human-readable version of all incoming and outgoing messages to a text file. ShortDebugString() returns a one-line version of the same string returned by the toString() method in Java. I'm not sure how easy it is to accomplish the same thing in Java.
If you have competing needs for logging and performance, then I suppose you could dump your binary data to the file as-is, with perhaps each record preceded by a tag containing a timestamp and a length value so you'll know where each particular bit of data ends (see the sketch below). But I hasten to admit this is very ugly. You will need to write a utility to read and analyze this file, and will be helpless without that utility.
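A minimal sketch of that record framing in Python (the exact layout, an 8-byte timestamp followed by a 4-byte length, is my assumption, not a protobuf convention):
import struct
import time

def append_record(log_file, payload):
    """Write one record: 8-byte float timestamp, 4-byte length, then the payload."""
    log_file.write(struct.pack('<dI', time.time(), len(payload)) + payload)

def read_records(log_file):
    """Yield (timestamp, payload) pairs back out of the log."""
    while True:
        header = log_file.read(12)
        if len(header) < 12:
            return
        ts, length = struct.unpack('<dI', header)
        yield ts, log_file.read(length)

# usage: append_record(f, request.SerializeToString()) with f opened in 'ab' mode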
A more reasonable solution would be to dump your binary data in text form. I'm thinking of "lines" of text, again starting with whatever tagging information you find relevant, followed by some length information in decimal or hex, followed by as many hex bytes as needed to dump your buffer - thus you could end up with some fairly long lines. But since the file is line structured, you can use text-oriented tools (an editor in the simplest case) to work with it. Hex dumping essentially means you are using two bytes in the log to represent one byte of data (plus a bit of overhead). Heh, disk space is cheap these days.
If those binary buffers have a fairly consistent structure, you could even break out and label fields (or something like that) so your data becomes a little more human readable and, more importantly, better searchable. Of course it's up to you how much effort you want to sink into making your log records look pretty; but the time spent here may well pay off a little later in analysis.
If you have non-ASCII character strings in your messages, simply logging them with an implicit or explicit call to toString would escape the characters.
"오늘은 무슨 요일입니까?" becomes "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
If you want to retain the non-ASCII characters, use TextFormat.printer().escapingNonAscii(false).printToString(message).
See this answer for more details.
