py3k default to byte instead of string - string

I'm just switching from Python 2 to Python 3, and the new string system is a real pain (or rather, I'm not understanding its true benefit).
Is there any way to make it default to the old-style bytes system without having to put a b before every string? I send a lot of commands via sockets and the code just looks ugly, e.g.
conn.sendall(b'k\n')
I tend to use this more than I worry about Unicode.

No, there isn't. And from what I gather you don't think it is a pain, and you do understand the benefit, you just think the b'' is ugly, which doesn't seem to be a very good reason to me.
Separating binary and text data is a great simplification in almost all cases. The need to prefix binary data with a b is a small price to pay for that.
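If the b prefix is mainly a readability concern, one common way to keep it out of the calling code is to write commands as ordinary str values and encode them in a single place. A minimal sketch, assuming the protocol is plain ASCII; the helper name send_command is made up for illustration and is not part of the socket module:

```python
import socket


def send_command(conn: socket.socket, command: str, encoding: str = "ascii") -> None:
    """Encode a text command, append a newline, and send it as bytes."""
    conn.sendall((command + "\n").encode(encoding))


if __name__ == "__main__":
    # socketpair() returns two connected sockets, which is enough for a local demo.
    left, right = socket.socketpair()
    with left, right:
        send_command(left, "k")
        print(right.recv(16))
```

The encoding still happens, it is just no longer visible at every call site.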


Adding some noise to a text

I wonder if there is any known algorithm/strategy to add some noise to a text string (for instance, adding a random sequence of characters every now and then or something similar).
I don't want to completely destroy the text, just make it slightly unusable. Also, I'm not interested in reversing the changes; if needed, I can recreate the original text from the sources I used to create it in the first place.
Of course, a very basic algorithm for doing this could easily be implemented, but somebody has probably already created a somewhat more sophisticated one. If a Java implementation of something like this is available, even better.
If you are using .NET and you need some random bytes, maybe try the GetBytes method of RNGCryptoServiceProvider. Nice and random. You could also use it to help select random positions to update.
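For reference, the "very basic algorithm" from the question could look roughly like the following. This is only a sketch in Python rather than Java or .NET; the function name add_noise and its parameters (rate, max_run, seed) are invented for the example:

```python
import random
import string


def add_noise(text: str, rate: float = 0.05, max_run: int = 3, seed=None) -> str:
    """Insert short runs of random lowercase letters after roughly `rate` of the characters."""
    rng = random.Random(seed)  # a fixed seed makes the noise reproducible
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            run = rng.randint(1, max_run)
            out.append("".join(rng.choice(string.ascii_lowercase) for _ in range(run)))
    return "".join(out)


if __name__ == "__main__":
    print(add_noise("The quick brown fox jumps over the lazy dog.", rate=0.1, seed=42))
```

The original characters are kept in order, so the text stays mostly readable while no longer being a clean copy.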

How to determine codepage of a file (that had some codepage transformation applied to it)

For example, if I know that Ä‡ should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there was an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but the tools are not important; I'll take anything that works, including Python scripts).
EDIT:
Could you please be a little more verbose? Do you know for certain what some substring should be exactly? Or do you know just the language? Or are you just guessing? And was the transformation that was applied correct (i.e. valid in the other charset)? Or was it a single transformation from charset X to Y when the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer - utf-8 text file got opened as ascii text file and then exported as csv.
It's extremely hard to do this in general. The main problem is that all the ASCII-based single-byte encodings (iso-8859-*, DOS and Windows codepages) use the same range of byte values, so no particular value or set of values will tell you which codepage the text is in.
There is one encoding that is easy to tell apart. If the text is valid UTF-8, then it's almost certainly not iso-8859-* or any Windows codepage: while all byte values are valid in them, the chance of a valid UTF-8 multi-byte sequence appearing in real text in them is almost zero.
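The UTF-8 test is the easy part to automate. A small sketch (the function name looks_like_utf8 is made up); note that pure ASCII input also passes, since ASCII is a subset of UTF-8:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("ć".encode("utf-8")))       # two-byte UTF-8 sequence
    print(looks_like_utf8("ć".encode("iso-8859-2")))  # single byte 0xE6
```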
Then it depends on which further encodings may be involved. A valid sequence in Shift-JIS or Big5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.
If you can limit the number of transformations that may have happened, it shouldn't be too hard to put together a Python script that tries them out, eliminates the obviously wrong ones and uses a spell-checker to pick the most likely. I don't know of any tool that would do it.
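A rough sketch of such a script for the simplest case: a single wrong decode, and a substring whose correct form is known. It compares against that known string instead of using a spell-checker. The candidate list and the function name find_transformations are assumptions for the example; extend the list to whatever encodings are plausible for your data:

```python
from itertools import permutations

# Assumed candidate encodings; adjust to the ones plausible for your files.
CANDIDATES = ["utf-8", "cp1250", "cp1252", "iso-8859-1", "iso-8859-2"]


def find_transformations(garbled: str, expected: str, encodings=CANDIDATES):
    """Return (wrong, right) pairs where garbled.encode(wrong).decode(right) == expected."""
    hits = []
    for wrong, right in permutations(encodings, 2):
        try:
            repaired = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot have produced the text
        if repaired == expected:
            hits.append((wrong, right))
    return hits


if __name__ == "__main__":
    # Simulate UTF-8 bytes that were mistakenly read as cp1252.
    garbled = "ć".encode("utf-8").decode("cp1252")
    print(garbled, find_transformations(garbled, "ć"))
```

Several pairs can match the same sample, because similar codepages often agree on the bytes involved; a longer sample narrows the list.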
Tools like that were quite popular a decade ago. But now it's quite rare to see damaged text.
As far as I know, it can be done effectively at least for a particular language. So, if you assume the text language is Russian, you could collect statistical information about characters or small groups of characters from a lot of sample texts. E.g. in English the "th" combination appears more often than "ht".
Then you could try different encoding combinations and choose the one that gives the most probable text statistics.

Protocol buffers logging

In our business, we are required to log every request/response that comes to our server.
At present, we use XML as the standard implementation.
The log files are used when we need to debug or trace an error.
I am curious: if we switch to protocol buffers, which are binary, what would be the best way to log requests/responses to a file?
For example:
FileOutputStream output = new FileOutputStream("\\files\\log.txt");
request.build().writeTo(output);
For anyone who has used protocol buffers in your application, how do you log your request/response, just in case we need it for debugging purpose?
TL;DR: write debugging logs in text, write long-term logs in binary.
There are at least two ways you can do this logging (and maybe, in fact, you should do both):
Writing your logs in text format. This is good for debugging and quickly checking for problems with your eyes.
Writing your logs in binary format - this will make future analysis much quicker since you can load the data using same protocol buffers code and do all kinds of things on them.
Quite honestly, this is more or less the way this is done at the place this technology came from.
We use the ShortDebugString() method on the C++ object to write down a human-readable version of all incoming and outgoing messages to a text-file. ShortDebugString() returns a one-line version of the same string returned by the toString() method in Java. Not sure how easy it is to accomplish the same thing in Java.
If you have competing needs for logging and performance then I suppose you could dump your binary data to the file as-is, with perhaps each record preceded by a tag containing a timestamp and a length value so you'll know where this particular bit of data ends. But I hasten to admit this is very ugly. You will need to write a utility to read and analyze this file, and will be helpless without that utility.
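To give an idea of what that framing and the accompanying reader utility involve, here is a sketch in Python. The header layout (an 8-byte timestamp followed by a 4-byte length, both big-endian) is just one possible choice made for this example, and the payload is treated as opaque bytes, so any serialized message would do:

```python
import io
import struct
import time

# Assumed header: 8-byte float timestamp + 4-byte unsigned payload length, big-endian.
HEADER = struct.Struct(">dI")


def write_record(stream, payload: bytes, timestamp=None) -> None:
    """Append one record: header first, then the raw payload bytes."""
    ts = time.time() if timestamp is None else timestamp
    stream.write(HEADER.pack(ts, len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield (timestamp, payload) tuples until the stream is exhausted."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        ts, length = HEADER.unpack(header)
        yield ts, stream.read(length)


if __name__ == "__main__":
    log = io.BytesIO()
    write_record(log, b"\x08\x96\x01")
    log.seek(0)
    print(list(read_records(log)))
```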
A more reasonable solution would be to dump your binary data in text form. I'm thinking of "lines" of text, again starting with whatever tagging information you find relevant, followed by some length information in decimal or hex, followed by as many hex bytes as needed to dump your buffer - thus you could end up with some fairly long lines. But since the file is line structured, you can use text-oriented tools (an editor in the simplest case) to work with it. Hex dumping essentially means you are using two bytes in the log to represent one byte of data (plus a bit of overhead). Heh, disk space is cheap these days.
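A sketch of such a line format, again in Python and again with made-up names (format_log_line, parse_log_line). The field order (ISO timestamp, a tag without spaces, decimal length, hex bytes) is an assumption for the example:

```python
from datetime import datetime, timezone


def format_log_line(tag: str, payload: bytes, when=None) -> str:
    """Build one log line; `tag` must not contain spaces."""
    when = when or datetime.now(timezone.utc)
    return f"{when.isoformat()} {tag} {len(payload)} {payload.hex()}"


def parse_log_line(line: str):
    """Split a line back into (timestamp, tag, payload) and check the length field."""
    stamp, tag, length, hex_data = line.rstrip("\n").split(" ", 3)
    payload = bytes.fromhex(hex_data)
    if len(payload) != int(length):
        raise ValueError("length field does not match payload")
    return datetime.fromisoformat(stamp), tag, payload


if __name__ == "__main__":
    line = format_log_line("REQ", b"\x08\x96\x01")
    print(line)
    print(parse_log_line(line))
```

Because each record is a single line, grep and an ordinary editor are enough to find a request by tag or by a known byte pattern.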
If those binary buffers have a fairly consistent structure, you could even break out and label fields (or something like that) so your data becomes a little more human readable and, more importantly, better searchable. Of course it's up to you how much effort you want to sink into making your log records look pretty; but the time spent here may well pay off a little later in analysis.
If your messages contain non-ASCII strings, simply logging them through an implicit or explicit call to toString escapes those characters as octal byte sequences:
"오늘은 무슨 요일입니까?" becomes "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
If you want to retain the non-ASCII characters, use TextFormat.printer().escapingNonAscii(false).printToString(message).
See this answer for more details.

Why is the software world full of status codes?

Why did programmers ever start using status codes? I mean, I guess I could imagine this might be useful back in the days when a text string was an expensive resource. WAYYY back then. But even after we had megabytes of memory to work with, we continued to use them. What possible advantage could there be for obfuscating the meaning of an error message or status message behind a status code?
It's easy to provide different translations of a status code. Having to look up a string to find the translation in another language is a little silly.
Besides, status codes are often used in code, and a typo such as:
var result = OpenFile(...);
if (result == "File not fond") {
    ...
}
cannot be detected as a mistake by the compiler, whereas
var result = OpenFile(...);
if (result == FILE_NOT_FOND) {
    ...
}
will be.
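The same contrast can be sketched in Python, with the caveat that Python has no compile step: the misspelled name fails at runtime (and a static checker such as mypy would flag it), while the misspelled string just compares unequal. Status and open_file below are stand-ins invented for the example:

```python
from enum import Enum


class Status(Enum):
    OK = 0
    FILE_NOT_FOUND = 1


def open_file(path: str) -> Status:
    """Stand-in for a real call; always reports a missing file."""
    return Status.FILE_NOT_FOUND


result = open_file("missing.txt")

# Misspelled string: the comparison is simply False, no error anywhere.
print(result == "File not fond")

# Misspelled member: fails loudly with AttributeError.
try:
    result == Status.FILE_NOT_FOND
except AttributeError as exc:
    print("caught:", exc)
```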
It allows for localization and changes to the text of an error message.
It's the same reason as ever. Numbers are cheap, strings are expensive, even in today's mega/gigabyte world.
I don't think status codes constitute obfuscation; it's simply an abstraction of a state.
A great use of integer status codes is in a finite-state machine. Having the states be integers allows for an efficient switch statement to jump to the correct code.
Integers also allow for more efficient use of bandwidth, which is still a very important issue in mobile applications.
Yet another advantage of integer codes over strings is comparison. If you have similar statuses grouped together (say status 10000-10999), performing range comparisons to determine the type of status is a major win. Could you imagine doing string comparisons just to know whether an error code is fatal or just a warning? Eww.
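A small illustration of the range idea in Python. Treating 10000-10999 as the fatal block follows the example above; the warning block and the function name severity are invented for the sketch:

```python
FATAL = range(10000, 11000)    # 10000-10999, as in the example above
WARNING = range(20000, 21000)  # hypothetical second block


def severity(code: int) -> str:
    """Classify a status code by the numeric block it falls in."""
    if code in FATAL:
        return "fatal"
    if code in WARNING:
        return "warning"
    return "unknown"


if __name__ == "__main__":
    for code in (10042, 20007, 7):
        print(code, severity(code))
```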
Numbers can be easily compared, including by another program (e.g. was there a failure). Human readable strings cannot.
Consider some of the things you might include in a string comparison, and sometimes might not:
Encoding options
Range of supported characters (compare ASCII and Unicode)
Case Sensitivity
Accent Sensitivity
Accent encoding (decomposed or composed unicode forms).
And that is before allowing for the majority of humans who don't speak English.
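Two of these pitfalls are easy to demonstrate in Python: composed versus decomposed accents, and case sensitivity. The sample strings are arbitrary:

```python
import unicodedata

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

# The two strings render identically but are different code point sequences.
print(composed == decomposed)

# Normalizing both to the same form (NFC here) makes them comparable.
print(unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))

# Case differences need their own handling, e.g. casefold().
print("FILE NOT FOUND".casefold() == "File not found".casefold())
```

An integer comparison needs none of this.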
404 is universal on the web. If there were no status codes, imagine if every different piece of server software had its own error string?
"File not found"
"No file exists"
"Couldn't access file"
"Error presenting file to user"
"Rendering of file failed"
"Could not send file to client"
etc...
Even when we disregard data length, it's still better to have standard numeric representations that are universally understood, even when accompanied by actual error messages.
Computers are still binary machines and numeric operations are still cheaper and faster than string operations.
Integer representation in memory is far more consistent than string representation. To begin with, just think about null-terminated versus Pascal strings. Then think of ASCII and the characters from 128 to 255 that differ between codepages, and you end up with Unicode and all of its little-endian and big-endian variants, UTF-8, etc.
As it turns out, returning an integer and having a separate resource stating how to interpret those integers is a far more universal approach.
Well, when talking to a customer over the telephone, a number is much better than a string. A string can be in many different languages; a number can't. Try googling some error text in, let's say, Swedish and then try googling it in English, and guess where you get the best hits.
Because not everyone speaks English. It's easier to translate error codes to multiple languages than to litter your code base with strings.
It's also easier for machines to understand codes, as you can assign classes of errors to ranges of numbers, e.g. 1-10 are I/O issues, 11-20 are DB issues, etc.
Status codes are unique, whereas strings may not be. There is just one status code, for example "213", but there may be many wordings of, for example, "file not found", like "File not found", "File not found!", "Datei nicht gefunden", "File does not exist"....
Thus, status codes keep the information as simple as possible!
How about backwards compatibility with 30+ years of software libraries? After all, some code is still written in C ...
Also ... having megabytes of memory available is no justification for using them. And that's assuming you're not programming an embedded device.
And ... it's just pointless busy work for the CPU. If a computer is blindingly fast at processing strings, imagine the speed boost from efficient coding techniques.
I work on mainframes, and there it's common for applications to have every message prepended by a code (usually 3-4 letters by product, 4-5 numbers by specific message, and then a letter indicating severity of the message). And I wish this would be a standard practice on PC too.
There are several advantages aside from translation (as mentioned by others) to this:
It's easy to find the message in the manual; usually, the software is accompanied by a message manual explaining all the messages (and possible solutions, etc.).
It's possible for automation software to react to specific messages in the log in a specific way.
It's easy to find the source of the message in the source code. You can have further error codes per specific message; in that case, this is again helpful in debugging.
For all practical purposes numbers are the best representation for statuses, even today, and I imagine would be so for a while.
The most important thing about status codes is probably conciseness and acceptance. Two different systems can talk to each other all they want using numbers but if they don't agree on what the numbers mean, it's going to be a mess. Conciseness is another important aspect as the status code must not be more verbose than the meaning it's trying to convey. The world might agree on using
The resource that you asked for in the HTTP request does not exist on this server
as a status code in lieu of 404, but that's just plain nuisance.
So why do we still use numbers, specifically the English numerals? Because that is the most concise form of representation today and all computers are built upon these.
There might come a day when we start using images, or even videos, or something totally unimaginable for representing status codes in a highly abstracted form, but we are not there yet. Not even for strings.
They make it easier for an end user to understand what is happening when things go wrong.
It helps to have a very basic method of giving statuses clearly and universally. Whereas strings can easily be written differently depending on dialect and can also have grammatical variations, numerals have no grammatical forms and do not change with dialect. There is also the storage and transfer issue: a string is larger and thus takes longer to transfer over a network and to store (even if only by a few thousandths of a millisecond). Because of this, we can assign numbers as universal identifiers for statuses: they transfer more quickly and are more precise, and the programs that read them can present them however they wish (for example, in multiple languages).
Plus, it is easier to read computationally:
switch($status) {
    case '404':
        echo 'File not found!';
        break;
    case '500':
        echo 'Broken server!';
        break;
}
etc.

What is the advantage of using Base64 encoding?

What is the advantage of using Base64 encode?
I would like to understand it better. Do I really need it? Can't I simply use pure strings?
I heard that the encoding can be up to 30% larger than the original (at least for images).
Originally some protocols only allowed 7 bit, and sometimes only 6 bit, data.
Base64 allows one to encode 8 bit data into 6 bits for transmission on those types of links.
Email is an example of this.
The primary use case of base64 encoding is when you want to store or transfer data with a restricted set of characters; i.e. when you can't pass an arbitrary value in each byte.
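A quick Python illustration of both the round trip and the size cost. Every 3 input bytes become 4 output characters, so the encoded form is about a third larger; the sample data is arbitrary:

```python
import base64

raw = bytes(range(256)) * 3      # 768 bytes covering every possible byte value
encoded = base64.b64encode(raw)  # bytes made up only of A-Z, a-z, 0-9, +, / and =

# Decoding gives back exactly the original bytes.
assert base64.b64decode(encoded) == raw

print(len(raw), len(encoded), len(encoded) / len(raw))
```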
<img alt="Embedded Image"
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADIA..." />
This code will show the encoded image, but no one can link to this image from another website and use your traffic.
The advantage of Base64 encoding, as somebody said, is that it lets you transmit binary data as (most commonly) ASCII characters. Since the receiving end can very likely handle ASCII, it is a nice way to transfer binary data via a text stream.
If your situation can handle native binary data, that will most likely yield better results, in terms of speed and such, but if not, Base64 is most likely the way to go. JSON is a great example of when you would benefit from something like this, or when it needs to be stored in a text field somewhere. Give us some more details and we can provide a better tailored answer.
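For the JSON case, a minimal Python sketch: JSON has no binary type, so the bytes travel as a Base64 string. The field names ("name", "data") and the helper names are made up for the example:

```python
import base64
import json


def to_json(name: str, blob: bytes) -> str:
    """Wrap binary data in a JSON document by Base64-encoding it."""
    return json.dumps({"name": name, "data": base64.b64encode(blob).decode("ascii")})


def from_json(document: str):
    """Reverse of to_json: return (name, original bytes)."""
    obj = json.loads(document)
    return obj["name"], base64.b64decode(obj["data"])


if __name__ == "__main__":
    doc = to_json("pixel.bin", b"\x00\xff\x10")
    print(doc)
    print(from_json(doc))
```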
One application is to transfer binary data in contexts where only characters are allowed. E.g. in XML documents/transfers. XML-RPC is an example of this.
Convert BLOB data to string and back...
Whether or not to use it depends on what you're using it for.
I've used it mostly for encoding binary data to pass through a mechanism that has really been created for text files. For example - when passing a digital certificate request around or retrieving the finished digital certificate -- in those cases, it's often very convenient to pass the binary data as Base 64 via a text field on a web form.
I probably wouldn't use it if you have something that is already text and you just want to pass it somewhere.
I use it for passing around files that tend to get chewed up by email programs because they look like text files (e.g. HL7 transcripts for replay).
