What is the difference between binary safe strings and binary unsafe strings?

I was reading the Redis manifesto[1] and it seems Redis accepts only binary-safe strings as keys, but I don't know the difference between the two. Can anyone explain with an example?
[1] http://oldblog.antirez.com/post/redis-manifesto.html

According to Redis documentation, simple Redis strings have syntax "+redis_response\r\n" whereas bulk Redis strings have syntax "$str_len\r\nbinary_safe_string\r\n".
In other words, a binary-safe string in Redis can contain any data, from something as simple as "foo" to arbitrary binary data of up to 512 MB, say a JPEG image. A binary-safe string has its length encoded alongside it and does not terminate at any particular character, unlike a NUL-terminated string in C, which ends at '\0'.
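To make that concrete, here is a minimal sketch (Python, purely for illustration; not an actual Redis client) of how a bulk string frames arbitrary bytes by length instead of by a terminator:

# Frame a payload as a RESP bulk string: "$<length>\r\n<payload>\r\n".
# Because the length is sent up front, the payload may contain '\0', "\r\n",
# or any other byte value -- nothing inside the payload acts as a terminator.
def resp_bulk_string(payload: bytes) -> bytes:
    return b"$" + str(len(payload)).encode() + b"\r\n" + payload + b"\r\n"

print(resp_bulk_string(b"foo"))             # b'$3\r\nfoo\r\n'
print(resp_bulk_string(b"fo\x00o\r\nbar"))  # b'$9\r\nfo\x00o\r\nbar\r\n'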
HTH,
Swanand

I'm not familiar with the system in question, but the term "binary safe string" might be used either to describe certain string-storage types or to describe particular string instances. In a binary-safe string type, a string of length N may be used to encapsulate any sequence of N values in the range either 0-255 or 0-65535 (for 8- or 16-bit types, respectively). A binary-safe string instance might be one whose representation may be subdivided into uniformly-sized pieces, with each piece representing one character, as distinct from a string instance in which different characters require different amounts of storage space.
Some string types (which are not binary safe) use variable-length representations for certain characters, and will behave oddly if asked to act upon, e.g., a string which contains the code for "first half of a multi-part character" followed by something other than a "second half of a multi-part character". Further, some code which works with strings will assume that the Nth character is stored in either the Nth byte or the Nth pair of bytes, and will malfunction if given a string in which, e.g., the 8th character is stored in the 12th and 13th pairs of bytes.
Looking only briefly at the link provided, I would guess that it's saying that Redis does not assume anything about how characters are encoded in the strings it is given, though I'm not quite clear whether it's assuming that a string type will be able to handle any possible sequence of bytes, or whether it's assuming that any string instance which it's given may be safely regarded as a sequence of bytes. I think the fundamental concepts of interest, though, are (1) some string types use variable-length encodings and others do not; (2) even in types that use variable-length encodings, a useful subset of string instances will consist only of fixed-length characters.

Binary-safe means that a string can contain any byte value, while a binary-unsafe string cannot; a C string, for example, cannot contain '\0'. In C, '\0' marks the end of a string, so the characters before a '\0' and the characters after it will be treated as two different strings.
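A quick way to see that difference from Python (a throwaway sketch; it calls the C library's strlen via ctypes and assumes a Unix-like system where find_library("c") works):

import ctypes, ctypes.util

data = b"foo\x00bar"                   # 7 bytes with an embedded NUL in the middle

# Binary-safe length: Python's bytes object stores its length explicitly.
print(len(data))                       # 7

# Binary-unsafe length: C's strlen() stops at the first '\0'.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
print(libc.strlen(data))               # 3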

Related

Make a model to identify a string

I have a string like this
ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8
The first part of the string is a random 18 digit number in base64 format and the second is a unix timestamp in base64 too, while the last is an hmac.
I want to make a model to recognize a string like this.
How may I do it?
While I did not necessarily think deeply about it, this would be what comes to my mind first.
You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for problems like this but may even be worse, depending on a given approach.
Here, an exact solution can be achieved, simply by understanding the problem.
One way people often go about matching strings with a certain structure is with so called regular expressions or RegExp.
Regular expressions allow you to match string patterns of varying complexity.
To give a simple example in Python:
import re
your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
regexp_pattern = r"(.+)\.(.+)\.(.+)"
print(re.findall(regexp_pattern, your_string))
# [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]
Now one problem with this is how you know where your string starts and stops. Most of the time there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that prior to each string you wanted to match there is the word Token: , you could include that in your RegExp pattern r"Token: (.+)\.(.+)\.(.+)".
Other ways to avoid mismatches would be to define the pattern requirements more clearly. Right now we simply match any number of characters, with two . characters separating them into three sequences.
If you knew which implementation of base64 you were using, you could narrow the set of potential characters from . (i.e. anything) to the alphabet actually used by that implementation, say [abcdefgh1234] for the sake of illustration. With that alphabet, the pattern could be refined to r"([abcdefgh1234]+)\.([abcdefgh1234]+)\.(.+)".
The same applies to the HMAC code.
Furthermore, you could specify the allowed length of each substring.
For instance, you said you have 18 random digits. This would likely mean each digit is encoded as 1 byte, which would translate to 18*8 = 144 bits, which in base64 would translate to 24 tokens (each token encodes a sextet, i.e. 6 bits of information). The same could be done with the timestamp: assuming a 32-bit timestamp, this would likely necessitate 6 base64 tokens (representing 36 bits, 36 because 32 cannot be divided evenly into sextets).
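As a quick sanity check of that arithmetic (a throwaway Python sketch):

import math

print(math.ceil(18 * 8 / 6))   # 24 base64 characters for 18 bytes (144 bits)
print(math.ceil(32 / 6))       # 6 base64 characters for a 32-bit timestamp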
With this information, you could further refine the pattern
r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"`
In addition, the same could be applied to the HMAC code.
I leave it to you to read a bit about RegExp but I'd guess it is the easiest solution and certainly more appropriate than any kind of machine learning.
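As a slightly more concrete sketch: the real base64url alphabet is A-Z, a-z, 0-9, - and _ (the abcdefgh1234 alphabet above was only illustrative), and the sample string fits the 24/6 lengths derived earlier, so a tightened pattern could look like this (the variable names are my own assumptions about what the parts mean):

import re

B64 = r"[A-Za-z0-9_-]"   # base64url alphabet; adjust if + and / are used instead

# Assumed structure: 24-char id, '.', 6-char timestamp, '.', HMAC of any length.
pattern = re.compile(rf"^({B64}{{24}})\.({B64}{{6}})\.({B64}+)$")

token = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
match = pattern.fullmatch(token)
if match:
    encoded_id, encoded_timestamp, hmac_part = match.groups()
    print(encoded_id, encoded_timestamp, hmac_part)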

Can lowercasing a UTF-8 string cause it to grow?

I'm helping out someone who is writing code to compare UTF-8 strings in a case-insensitive way. The scheme they are using is to uppercase the strings and then compare. The input strings can all fit in a 255 byte array. The output string similarly must fit in a 255 byte array.
I'm not a UTF-8 or Unicode expert, but I think this scheme can't work for all strings. My understanding is that either lowercasing or uppercasing a UTF-8 string can result in the output string being longer (byte-array wise), and as such changing case is probably not the best way to attack this problem. I'm trying to demonstrate the difficulty by giving a few strings that will not work with this design.
For example, take a string of the character U+0587 repeated 100 times. U+0587 takes two bytes in UTF-8, so the overall length of the byte array for the string is 200 bytes (ignoring the trailing null for now). If that string is uppercased, however, it becomes U+0535 U+0552, and each of those takes two bytes, for a total of 4 bytes. The 200 byte array is now 400 bytes, and cannot be stored in the limited space available.
So here's my question: I gave an example of a lowercase character needing more space to store when uppercased. Are there any examples of an uppercase character needing more space to store when lowercased? The locale is always en_US.UTF-8 in this case.
Thanks for any help.
Yes. Examples from my environment:
U+023A Ⱥ lowercases to U+2C65 ⱥ, and U+023E Ⱦ lowercases to U+2C66 ⱦ; in each case a 2-byte UTF-8 character becomes a 3-byte one.
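A quick way to check candidates (a Python sketch; note that str.upper()/str.lower() apply Unicode's default case mappings rather than the en_US locale, but they illustrate the growth):

for ch in ("\u0587", "\u023a", "\u023e"):          # և  Ⱥ  Ⱦ
    for mapped in (ch.upper(), ch.lower()):
        before = len(ch.encode("utf-8"))
        after = len(mapped.encode("utf-8"))
        if after > before:
            print(repr(ch), before, "bytes ->", repr(mapped), after, "bytes")

# U+0587 grows from 2 to 4 bytes when uppercased;
# U+023A and U+023E grow from 2 to 3 bytes when lowercased.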
There are several related factors that could cause variation. You already pointed out one:
The locale that you specify will affect the casing of characters that that locale is concerned with.
The version of the Unicode Common Locale Data Repository that your library uses.
The version of the Unicode Character Database that your library uses.
These aren't fixed targets because we can expect that there will be future versions and that there will be users using characters from them.
Ultimately, this comes down to your environment and to the practical purpose this has.

How does punycode distinguish similar IRIs?

I've been looking into internationalised resource identifiers and there's one thing bugging me.
My understanding is that, for each label in a domain name (xyzzy.plugh.com has three labels, xyzzy, plugh and com), the following process is performed to translate it into ASCII representation so that it can be processed okay by all legacy software:
If it consists solely of ASCII characters, it's copied as is.
Otherwise:
First we output xn-- followed by all the ASCII characters (skipping non-ASCII).
Then, if the final character isn't -, we output - to separate the ASCII from non-ASCII.
Finally, we encode each of the non-ASCII characters using punycode so that they appear to be ASCII.
My question then is: how do we distinguish between the following two Unicode URIs?
http://aa☃.net/
http://☃aa.net/
It seems to me that both of these will encode to:
http://xn--aa-nfh.net/
simply because the sequencing information has been lost for the label as a whole.
Or am I missing something in the specification?
According to one punycode encoder, they are encoded differently:
aa☃.net -> xn--aa-gsx.net
☃aa.net -> xn--aa-esx.net
(note the single differing character, g vs. e, right after aa-)
The relevant RFC 3492 details why this is the case. First, it provides clues in the introduction:
Uniqueness: There is at most one basic string that represents a given extended string.
Reversibility: Any extended string mapped to a basic string can be recovered from that basic string.
That means there must be a reversible, one-to-one mapping for every basic/extended string pair.
Understanding how it differentiates the two possibilities requires an understanding of how the decoder (the thing that turns the basic string back into an extended one, with all its Unicode glory) works.
The decoder starts with just the basic string aa.net, with a pointer to the first a, then applies a series of deltas, such as gsx or esx.
The delta actually encodes two things. The first is the number of non-insertions to be done and the second is the actual insertion.
So, gsx (the delta in aa☃.net) would encode two non-insertions (to skip the aa) followed by an insertion of ☃. The esx delta (for ☃aa.net) would encode zero non-insertions followed by an insertion of ☃.
That is how position is encoded into the basic strings.
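This is easy to check with Python's built-in punycode codec (a quick sketch; the codec handles a single label and leaves off the xn-- prefix):

# The delta after the '-' records where the snowman must be re-inserted,
# so the two labels get different encodings.
print("aa\u2603".encode("punycode"))   # b'aa-gsx'
print("\u2603aa".encode("punycode"))   # b'aa-esx'

# Decoding restores the original position:
print(b"aa-gsx".decode("punycode"))    # aa☃
print(b"aa-esx".decode("punycode"))    # ☃aa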

Does a string's length equal the byte size?

Exactly that: Does a string's length equal the byte size? Does it depend on the language?
I think it is, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need to know.
Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings have more than one byte per character.
With Unicode it depends on the encoding. It could be 2 or 4 bytes per character, or even a variable number of bytes per character, as with UTF-8 (1 to 4 bytes).
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, characters in the range U+10000 to U+10FFFF need a surrogate pair, i.e. two UTF-16 code units, for a single full Unicode character. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
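A small Python sketch of that distinction between the in-language length and the encoded byte count:

s = "h\u00e9llo \u2603"                 # 7 characters: héllo, a space and a snowman

print(len(s))                           # 7  -- number of characters (code points)
print(len(s.encode("utf-8")))           # 10 -- 'é' takes 2 bytes, '☃' takes 3
print(len(s.encode("utf-16-le")))       # 14 -- 2 bytes per code unit
print(len(s.encode("utf-32-le")))       # 28 -- 4 bytes per character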
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings (in various languages) use two or more bytes per char.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char will be inefficient for large n, linear in n instead of constant time as it would be with one byte per character.
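A short Python illustration of that pitfall: indexing the encoded bytes gives you the nth byte, not the nth character.

s = "na\u00efve"                 # 'naïve'
b = s.encode("utf-8")            # b'na\xc3\xafve' -- the 'ï' occupies two bytes

print(s[2])                      # ï    -- third character
print(b[2])                      # 195  -- third byte, only half of 'ï'
print(b[2:4].decode("utf-8"))    # ï    -- the character actually spans bytes 2-3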

What defines data that can be stored in strings

A few days ago, I asked why it's not possible to store binary data, such as a jpg file, in a string variable.
Most of the answers I got said that string is used for textual information such as what I'm writing now.
What is considered textual data, though? Bytes of a certain nature represent a jpg file, and those bytes could be represented by character byte values... I think. So when we say strings are for textual information, is there some sort of range or list of characters that can't be stored?
Sorry if the question sounds silly. Just trying to 'get it'
I see three major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables - e.g. if it's a UTF-8, UTF-16 or ASCII string. New line characters may also be translated depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
I would prefer to store binary data as binary; you would only think of converting it to text when there's no other choice, since converting it to a textual representation does waste some bytes (not much, but it still counts). That's how attachments are put into email.
Base64 is a good textual representation of binary files.
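For instance, with Python's standard base64 module (a minimal sketch):

import base64

raw = bytes(range(256))                        # arbitrary binary data
text = base64.b64encode(raw).decode("ascii")   # printable, string-safe text
assert base64.b64decode(text) == raw           # round-trips exactly

# The textual form is roughly 4/3 the size of the original -- the wasted
# bytes mentioned above.
print(len(raw), len(text))                     # 256 344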
I think you are referring to the binary-to-text encoding issue (translating a jpg into a string would require that sort of pre-processing).
Indeed, in that article, some characters are mentioned as not always supported, and others can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.
Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
string myString = Encoding.ASCII.GetString(myJpegByteArray);
(Note that Encoding.ASCII maps byte values above 127 to '?', so this particular conversion is lossy; to get the original bytes back you would need a round-trippable representation such as base64.)
Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
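A small Python sketch of the first two points; both the choice of encoding and the choice of equivalent form change the underlying bytes even though the text looks the same:

import unicodedata

s = "\u00e9"                               # 'é', one user-visible character

# Different encodings: different bytes for the same text.
print(s.encode("utf-8"))                   # b'\xc3\xa9'
print(s.encode("utf-16-le"))               # b'\xe9\x00'
print(s.encode("latin-1"))                 # b'\xe9'

# Linguistically equivalent forms: NFC is one code point, NFD is two.
nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)
print(nfc == nfd)                          # False
print(nfc.encode("utf-8"), nfd.encode("utf-8"))   # b'\xc3\xa9' b'e\xcc\x81'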
Deep down everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
strings, for example, end in a byte with value 0 (or follow some other convention)
JPEGs don't
Depends on the language. For example, in Python 2 the string type (str) is really a byte array, so it can indeed be used for binary data (in Python 3 you would use bytes instead).
In C the NUL byte is used for string termination, so a string cannot be used for arbitrary binary data, since binary data could contain NUL bytes.
In C# a string is an array of chars, and since a char is basically an alias for a 16-bit integer, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not actually correspond to a legal Unicode character), and some operations like case conversions will probably fail in strange ways.
In short, it might be possible in some languages to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforeseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
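To make that concrete for Python 3 (a short sketch; the byte values are arbitrary):

import base64

data = b"\x00\xff\xd8\xff"        # arbitrary bytes, including a NUL -- fine in bytes

# Forcing them into a str means choosing an encoding, and it can simply fail:
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid text:", exc)

# If a textual form is really needed, use an explicit binary-to-text encoding:
print(base64.b64encode(data).decode("ascii"))   # AP/Y/w==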
I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, the string is an abstraction for data interpreted as text. Text was invented for communication among humans; computers and programs do not communicate very well using text. SQL is textual, but it is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human-to-human or human-to-machine interaction (say, for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries lots of risk because you are using the data type for something it was not designed to handle. This makes it much more error prone. You may be able to store binary data in strings, but just because you are able to shoot yourself in the foot doesn't mean you should.
Summary: You can do it. But you'd better not.
Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.
