What defines data that can be stored in strings?

A few days ago, I asked why it's not possible to store binary data, such as a JPEG file, in a string variable.
Most of the answers I got said that strings are used for textual information, such as what I'm writing now.
What is considered textual data, though? The bytes that make up a JPEG file could be interpreted as character values... I think. So when we say strings are for textual information, is there some range or list of characters that can't be stored?
Sorry if the question sounds silly. I'm just trying to 'get it'.

I see three major problems with storing binary data in strings:
Most systems assume a certain encoding for string variables - e.g. whether it's a UTF-8, UTF-16 or ASCII string. Newline characters may also be translated, depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
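To make the first point concrete, here is a rough sketch in Go (chosen purely as an illustration; the byte values are invented). A Go string will hold arbitrary bytes, but code that treats it as text can silently transform them:

package main

import "fmt"

func main() {
	// Raw "binary" bytes that are not valid UTF-8 text (the start of a JPEG, say).
	data := []byte{0xff, 0xd8, 0xff, 0xe0}

	s := string(data) // a Go string will happily hold these bytes unchanged...

	// ...but code that treats the string as *text* may transform it: converting
	// through runes replaces invalid UTF-8 with U+FFFD, so the original data is lost.
	roundTripped := string([]rune(s))

	fmt.Println(len(s), len(roundTripped)) // 4 12 - the bytes were silently changed
}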

I would prefer to store binary data as binary; convert it to text only when there is no other choice, since a textual representation wastes some bytes (not many, but they still count). That's how attachments are put into email.
Base64 is a good textual representation of binary files.
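As a small sketch of what that textual representation looks like (in Go, using the standard encoding/base64 package; the input bytes are arbitrary):

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	binary := []byte{0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10} // arbitrary example bytes

	// Encode to a plain-ASCII string: roughly 4 output characters per 3 input bytes.
	text := base64.StdEncoding.EncodeToString(binary)
	fmt.Println(text) // /9j/4AAQ

	// Decoding returns the exact original bytes.
	back, err := base64.StdEncoding.DecodeString(text)
	fmt.Println(back, err) // [255 216 255 224 0 16] <nil>
}

Every 3 input bytes become 4 output characters, which is the "wasted bytes" overhead mentioned above.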

I think you are referring to the binary-to-text encoding issue (translating a JPEG into a string would require that sort of pre-processing).
Indeed, in that article, some characters are mentioned as not always supported, and others can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.

Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
// using System.Text;  (needed for Encoding)
byte[] myJpegByteArray = GetBytesFromSomeImage();
// Note: Encoding.ASCII is lossy for bytes above 0x7F (they become '?').
string myString = Encoding.ASCII.GetString(myJpegByteArray);

Before internationalization, it didn't make much difference. ASCII characters all fit in a single byte, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
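A short Go sketch of the first and third points (Go is used here only for illustration; the "binary" bytes are made up):

package main

import (
	"bytes"
	"fmt"
	"unicode/utf16"
)

func main() {
	s := "héllo\n"

	// (1) The same text has different byte representations in different encodings.
	fmt.Printf("UTF-8 bytes:   % x\n", []byte(s))              // Go strings hold UTF-8
	fmt.Printf("UTF-16 units:  %x\n", utf16.Encode([]rune(s))) // 16-bit code units

	// (3) A newline translation that is harmless for text corrupts binary data,
	// where 0x0d 0x0a is just payload, not a line ending.
	binary := []byte{0xff, 0xd8, 0x0d, 0x0a, 0x42}
	mangled := bytes.ReplaceAll(binary, []byte("\r\n"), []byte("\n"))
	fmt.Println(len(binary), "->", len(mangled), "bytes after CRLF -> LF translation")
}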

Deep down, everything is just bytes.
Things like strings and pictures are defined by rules about how to order those bytes.
C strings, for example, end in a byte with value 0 (other languages store an explicit length instead);
JPEGs don't.

Depends on the language. For example, in Python 2 the string type (str) is really a byte array, so it can indeed be used for binary data (in Python 3 that role is played by bytes).
In C the NUL byte is used for string termination, so a string cannot be used for arbitrary binary data, since binary data could contain NUL bytes.
In C# a string is an array of chars, and since a char is basically an alias for a 16-bit integer, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not correspond to a legal Unicode character), and some operations like case conversion will probably fail in strange ways.
In short, it might be possible in some languages to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforeseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
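To illustrate that last point, here is a minimal sketch in Go, where the byte-array type is []byte (the sample bytes are arbitrary):

package main

import "fmt"

func main() {
	raw := []byte{0x00, 0xff, 0x10, 0x80} // arbitrary binary data, including a zero byte

	// []byte is Go's type for arbitrary binary data. A Go string can technically
	// hold the same bytes (no NUL termination, no enforced encoding), but
	// string-handling APIs will assume it is UTF-8 text.
	asString := string(raw)

	fmt.Println(len(raw), len(asString)) // both 4: the conversion is byte-exact
	fmt.Println([]byte(asString))        // [0 255 16 128]: converting back loses nothing
	fmt.Printf("%q\n", asString)         // "\x00\xff\x10\x80": rendered as "text" it is gibberish
}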

I agree with Jacobus' answer:
In the end, all data structures are made up of bytes (well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers on how to access them.
In this regard, a string is an abstraction for data interpreted as text. Text was invented for communication among humans; computers and programs do not communicate very well using text. SQL is textual, but it is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human-to-human or human-to-machine interaction (say, for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries a lot of risk, because you are using the data type for something it was not designed to handle. This makes it much more error prone. You may be able to store binary data in strings, but just because you are able to shoot yourself in the foot doesn't mean you should.
Summary: you can do it, but you'd better not.

Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.

Related

Should I ALWAYS use rune instead of string except doing I/O

In Python 3, all strings are Unicode, so you only need to decode or encode when doing I/O operations; in the main part of your code, you deal only with Unicode.
So I want to know: in Go, should I do the same? Should I convert all strings to []rune at the input and have all my functions receive only the []rune type?
Because I'm new to Go, I don't know how many third-party libraries support []rune as well as string. If I use rune all the way through my code, will the overhead of converting between rune and string be a problem when I need to interact with a third-party library?
Should I ALWAYS use rune instead of string except doing I/O
There are several very useful packages that work with strings which you would find awkward to work with if your data is in arrays (or slices) of runes.
There are many cases where I have to get the character at an index,
It isn't safe to do that in general, partly because of combining characters but also because strings (or Unicode text in general) can contain many other difficult situations - perhaps a mix of left-to-right and right-to-left text etc.
Normalizing the text to one of the several normal forms might help deal with most combining characters but there will be some combinations that don't reduce to a single rune.
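A small Go sketch of why "the character at an index" is slippery (the sample string is made up):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "noe\u0308l" // "noël" written as e + COMBINING DIAERESIS

	fmt.Println(len(s))                    // 6: length in bytes
	fmt.Println(utf8.RuneCountInString(s)) // 5: length in runes (code points)

	// s[2] is a byte, and indexing []rune(s) still splits the e and its accent,
	// which is why "the character at index i" is not well defined in general.
	fmt.Printf("%q %q\n", s[2], []rune(s)[3])
}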
I'm writing something like a parser to parse the text with emoji
Unicode emoticons are just ordinary code points, so they can be treated like most other characters.
In many cases it is probably best to use the range operator to walk through a string.
If you wanted to, for example, replace all 😀 with :-), this could perhaps be handled using strings.Replace() or by using for ... range with strings.Builder.
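For illustration, a minimal sketch of both approaches just mentioned (the message text is invented):

package main

import (
	"fmt"
	"strings"
)

func main() {
	msg := "hello 😀 world 😀"

	// strings.Replace with -1 swaps every occurrence.
	fmt.Println(strings.Replace(msg, "😀", ":-)", -1))

	// The same thing with for ... range (which yields runes, not bytes)
	// and a strings.Builder.
	var b strings.Builder
	for _, r := range msg {
		if r == '😀' {
			b.WriteString(":-)")
		} else {
			b.WriteRune(r)
		}
	}
	fmt.Println(b.String())
}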
For me the most persuasive argument is that once you venture outside of ASCII, text is weird, and Unicode is almost unfathomably weird; plumbing its depths is best left to experts who spend their lives grappling with its madness. If you want to spend your time on the business features that more clearly matter to you, your business and your customers, use the standard packages.
Useful references:
Strings, bytes, runes and characters in Go. Rob Pike. 23 October 2013
Package unicode. Go authors.
Dark corners of Unicode. Eevee. Sep 12, 2015
Unicode is kind of insane. Ben Frederickson. 26 May 2015
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Joel Spolsky. October 8, 2003

When to use Unicode Normalization Forms NFC and NFD?

The Unicode Normalization FAQ includes the following paragraph:
Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
and continues...
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.
My questions are:
What makes NFC best for "general text"? What defines "internal processing", and why is it best left to NFD? And finally, never mind what is "best": are the two forms interchangeable as long as two strings are compared using the same normalization form?
The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.
In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.
For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
There are also risks that you take by not normalizing. For example, the letter "ä" can appear as a single Unicode character, U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or as two Unicode characters, U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing "ä" using the precomposed form only, then it will not detect it. In many cases, though, you don't do such things; you simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
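A quick demonstration of that equivalence (shown in Go with the external golang.org/x/text/unicode/norm package, purely as an illustration; the question itself is language-agnostic):

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	precomposed := "\u00e4" // ä as one code point
	decomposed := "a\u0308" // a followed by COMBINING DIAERESIS

	fmt.Println(precomposed == decomposed)                                   // false: the code points differ
	fmt.Println(norm.NFC.String(precomposed) == norm.NFC.String(decomposed)) // true
	fmt.Println(norm.NFD.String(precomposed) == norm.NFD.String(decomposed)) // true
}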
It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.
NFC is the common-sense form for general use; "ä" is one code point there, which is what most people expect.
NFD is good for certain kinds of internal processing: if you want to do accent-insensitive searching or sorting, having your string in NFD makes it much easier and faster. Another use is making more robust slug titles. These are just the most obvious ones; I am sure there are plenty more uses.
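For instance, a sketch of accent-insensitive matching via NFD (again in Go with the external golang.org/x/text/unicode/norm package, as an illustration; the sample words are made up):

package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/unicode/norm"
)

// stripMarks decomposes a string with NFD and drops combining marks
// (Unicode category Mn) - a rough basis for accent-insensitive matching
// or for building slugs.
func stripMarks(s string) string {
	var out []rune
	for _, r := range norm.NFD.String(s) {
		if !unicode.Is(unicode.Mn, r) {
			out = append(out, r)
		}
	}
	return string(out)
}

func main() {
	fmt.Println(stripMarks("Café Ñandú")) // Cafe Nandu
}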
If two strings x and y are canonical equivalents, then
toNFC(x) = toNFC(y)
toNFD(x) = toNFD(y)
Is that what you meant?

Erlang binary strings by default

I am writing an Erlang module that has to deal a bit with strings - not too much; I do some TCP recv and then some parsing of the data.
While matching data and manipulating strings, I am using the binary module all the time, e.g. binary:split(Data,<<":">>), and basically using <<"StringLiteral">> everywhere.
Until now I have not encountered difficulties or missing functions compared to the alternative (using lists), and everything comes out quite naturally, except maybe for adding the <<>>. But I was wondering if this way of dealing with strings might have drawbacks I am not aware of.
Any hint?
As long as you and your team remember that your strings are binaries and not lists, there are no inherent problems with this approach. In fact, CouchDB took this approach as an optimization, which apparently paid nice dividends.
You do need to be very aware of how your string is encoded in your binaries. When you write <<"StringLiteral">> in your code, you have to be aware that this is simply a binary serialization of the list of code points. Your Erlang compiler reads your code as ISO-8859-1 characters, so as long as you only use Latin-1 characters and do this consistently, you should be fine. But this isn't very friendly to internationalization.
Most application software these days should prefer a Unicode encoding. UTF-8 is compatible with your <<"StringLiteral">> for the first 128 code points, but not for the second 128, so be careful. You might be surprised what you see in your UTF-8 encoded web applications if you use <<"StrïngLïteral">> in your code.
There was an EEP proposal for binary support in the form of <<"StrïngLïteral"/utf8>>, but I don't think this is finalized.
Also be aware that your binary:split/2 function may give unexpected results on UTF-8 data if a multi-byte character contains the ISO-8859-1 byte you are splitting on.
Some would argue that UTF-16 is a better encoding to use because it can be parsed more efficiently and can be more easily split by index, if you assume or verify that there are no characters that need surrogate pairs.
The unicode module should be used, but tread carefully when you use literals.
The only thing to be aware of is that a binary is a slice of bytes, whereas a list is a list of unicode codepoints. In other words, the latter is naturally unicode whereas the former requires you to do some sort of encoding, usually UTF-8.
To my knowledge, there are no drawbacks to your method.
Binaries are very efficient structures for storing strings. If they are longer than 64 bytes they are also stored outside the process heap, so they are not subject to the per-process garbage collector (they are still reclaimed by reference counting once the last reference is dropped). Don't forget to use iolists when concatenating them, to avoid copying when performance matters.

What is a good method for obfuscating a base 64 string?

Base64 encoding is often used to obfuscate plaintext; I am wondering if there are any quick/easy ways of obfuscating a Base64 string so that it is not easily recognizable as such. The method should obfuscate the padding characters (the ='s) so that they become some other symbol and are more dispersed.
Does anyone know of an easy (and easily reversible) way to do this?
You could use a shift cipher, but I am looking for something a little more comprehensive; for example, if my shift cipher mapped = to a, someone might notice strings that frequently end in a's.
The purpose is not to add security; it is simply to make Base64 unrecognizable as Base64. It does not need to fool a security professional, just someone who knows what Base64 is and what it looks like (e.g. the ='s at the end, etc.).
The method I describe would probably add non-Base64 characters, like ^%$##!, to help confuse the reader.
Most of the replies seem to be about WHY I would want to do this. The basic answer is that the operation would be performed numerous times (so I want something inexpensive), and done in a way where no password can be remembered (which is why I don't just XOR with a key). Also the data isn't highly sensitive; this is just a measure against the casual user who might know what a Base64 string looks like.
A couple of suggestions:
Strip any trailing = (according to Wikipedia they are not needed) and then bitwise-negate each byte. This will turn the text into mostly non-readable characters.
Loop over the data and XOR each byte with its position, modulo 256. This will defeat simple statistical analysis, since the mapping of each character depends on its position in the string.
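A minimal sketch of the second suggestion (written in Go purely for illustration; the sample Base64 string is the one from the transcript below with its padding stripped):

package main

import "fmt"

// xorWithPosition XORs each byte with its index (mod 256). Running it twice
// restores the original, so it is trivially reversible. This is NOT
// encryption - just the "make it not look like Base64" trick described above.
func xorWithPosition(data []byte) []byte {
	out := make([]byte, len(data))
	for i, b := range data {
		out[i] = b ^ byte(i)
	}
	return out
}

func main() {
	b64 := "Zm9vYmFyMQ" // a Base64 string with its trailing = padding stripped

	obfuscated := xorWithPosition([]byte(b64))
	fmt.Printf("%q\n", obfuscated) // no longer looks like Base64

	restored := xorWithPosition(obfuscated)
	fmt.Println(string(restored)) // Zm9vYmFyMQ
}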
In contrast to one of the points in Anders Abel's answer, the = signs in Base64 strings do seem to matter:
$ echo -n foobar | base64
Zm9vYmFy
$ echo -n foobar1 | base64
Zm9vYmFyMQ==
$ echo -n Zm9vYmFyMQ | base64 -D
foobar$ echo -n Zm9vYmFyMQ= | base64 -D
foobar$ echo -n Zm9vYmFyMQ== | base64 -D
foobar1$
What you are asking for is called "security by obscurity" and is generally a bad idea.
Base64 encoding was never designed or intended to obfuscate text or data. It is used to encode binary data which needs to travel through some communication channel that allows only ASCII characters - like email messages, or as part of XML, etc.
Better to use real encryption if you want to hide the data. Even then, if you need to pass the encrypted data in XML, etc., you may end up encoding it in Base64 again for transport purposes.
I suppose you could generate a small amount of random data, and then use that to encode the Base64 characters. Prepend the random data to the re-encoded Base64 data.
A very simple example: given an input string "Hello", generate a random number in the range 1-9 and use that as the offset to apply to each input character. Suppose you generate "5", then the re-encoded string would be "5Mjqqt". Or encode the offset as a letter rather than as a number (a=1, b=2, ...) Then the "=" padding will be translated to a different character each time.
Or you could just drop the padding; according to the Wikipedia article, it's not really necessary.
(But consider whether this is really a necessary and sufficient thing to be doing in the first place. It's not clear from your question why you want to obfuscate base 64 data.)
Agreed with the responses suggesting encryption if your requirement is to actually keep someone who is determined to decode the data from reversing the process.
Otherwise, the answer somewhat depends on other constraints of your system, but a few ideas come to mind. If you're just concerned about the delimiter characters, and you have control over the process that generates the Base64 in the first place, you could choose some method of padding the data prior to conversion, thus eliminating the '=' characters from the output.
Along the same lines, you could use one of the variants like 'base64url' encoding (see http://en.wikipedia.org/wiki/Base64 for lots of good info on the variants) that does not use the pad character.
After eliminating the '=' by one of these methods, you could do some sort of character swapping on the generated Base64, swapping every other character and leaving any final character in place. You could also substitute the upper- or lowercase letters with other characters to make it look less like Base64 at a quick glance.
However, whatever you choose, remember that it is not a substitute for a real encryption scheme if you need real protection of that data.
Base64 is usually used when you want your data to go through some channel that can distort non-alphanumeric symbols - for example in XML. If that is your task too, your output will look similar to Base64 no matter what you try :)
If your channel handles binary data well, then just take the source text (decode the Base64 back), get its binary representation and apply some sort of XOR - for example, XOR every byte of the source with 37. The same operation will restore your text.
But it is still easily recognizable by anyone who has basic knowledge of cryptanalysis. If that is a problem, use real encryption.

Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.
Apart from String/ByteString there is now the Text library which combines the best of both worlds—it works with Unicode while being ByteString-based internally, so you get fast, correct strings.
Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation in Haskell goes: it's not efficient. Each Char represents a Unicode code point, which means it needs at least 21 bits per Char.
Since a String is just [Char], that is, a linked list of Char, Strings have poor locality of reference, and it also means that Strings are fairly large in memory: at a minimum N * (21 bits + M bits), where N is the length of the string and M is the size of a pointer (32, 64, what have you). And unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be optimized into loops, etc. by the compiler.
And while a Char corresponds to a code point, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO - not even a default, much less a way to change it. In practice GHC provides extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.
Even with operations like prepending to the front of a string, it's unlikely that a String will beat a ByteString in practice.
The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using Strings. Don't make the mistake of assuming your program will only ever need ASCII. Unless it's just throwaway code, one day someone will need to put in a euro symbol (U+20AC) or accented characters, and your nice fast ByteString implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligible amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.

Resources