Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input/output, according to Real World Haskell, since each character is separately allocated (I assume this means that a String is basically a linked list in Haskell, but I'm not sure).
But if the default string implementation is inefficient for file I/O, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a string, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and it will also take more time to iterate over, because a pointer dereference is required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.

Apart from String/ByteString there is now the Text library, which combines the best of both worlds: it works with Unicode while storing its contents packed in an array (much as ByteString does) rather than as a list, so you get fast, correct strings.
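A minimal sketch of what that looks like in practice (the filename "input.txt" is just a placeholder for illustration):
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
-- read the file as packed Unicode text rather than a [Char], then print it upper-cased
main :: IO ()
main = do
  contents <- TIO.readFile "input.txt"
  TIO.putStrLn (T.toUpper contents)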

Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation in Haskell goes: it's not efficient. Each Char represents a Unicode code point, which means it needs at least 21 bits per Char.
Since a String is just [Char], that is, a linked list of Char, Strings have poor locality of reference, and they are fairly large in memory: at a minimum it's N * (21 bits + M bits), where N is the length of the string and M is the size of a pointer (32, 64, what have you). And unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be optimized into loops, etc. by the compiler.
And while a Char corresponds to a code point, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default, much less a way to change it. In practice GHC provides extensions to do e.g. binary IO, but you're outside the standard at that point anyway.
Even with operations like prepending to the front of the string, it's unlikely that a String will beat a ByteString in practice.
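For example, a minimal sketch of file processing with lazy ByteString (the filename "big.log" is made up for illustration):
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
-- read the file in packed chunks rather than as a linked list of boxed Chars, then count the newlines
main :: IO ()
main = do
  contents <- BL.readFile "big.log"
  print (BLC.count '\n' contents)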

The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless it's just throwaway code, one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.

The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
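As a brief sketch of those two views of the same bytes (the PNG-signature check and the function names are only illustrative):
import qualified Data.ByteString.Lazy as BL        -- Word8 view
import qualified Data.ByteString.Lazy.Char8 as BLC -- Char8 view of the same bytes
-- treat the contents as raw bytes, e.g. checking a PNG signature...
isPng :: BL.ByteString -> Bool
isPng bs = BL.take 4 bs == BL.pack [0x89, 0x50, 0x4E, 0x47]
-- ...or as 8-bit characters, e.g. grabbing the first space-delimited token
firstToken :: BL.ByteString -> BL.ByteString
firstToken = BLC.takeWhile (/= ' ')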
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligible amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.

Related

What are Pascal Strings?

Are they named after the programming language, or the mathematician?
What are the defining characteristics of Pascal strings? In Wikipedia's article on strings it seems like the defining characteristic is storing the length of the string in the first byte. In another article I get the impression that the memory layout of the strings is also important.
While perusing an unrelated SO thread somebody mentioned that Pascal strings make Excel fast. What are the advantages of Pascal strings over null-terminated strings? Or more generally, in what situations do Pascal strings excel?
Are Pascal strings implemented in any other languages?
Last, do I capitalize both words ("Pascal Strings") or only the first ("Pascal strings")? I'm a technical writer...
Pascal strings were made popular by one specific, but hugely influential, Pascal implementation: UCSD Pascal. So "UCSD strings" would be a better term. This is the same implementation that made bytecode interpreters popular.
In general it is not one specific type, but the basic principle of having the size prefixed to the character data. This makes getting the length a constant time operation (O(1)) instead of scanning the character data for a nul character.
Not all Pascals used this concept. IIRC, the original (seventies) convention was to space-pad an allocation and scan backwards for a non-space character (making it impossible for strings to have a terminating space). Moreover, since software was mostly used in isolation, all kinds of schemes were used, often based on what was advantageous for that implementation/architecture.
While the construct is not part of Standard Pascal, the most popular dialects (Borland Turbo Pascal, Delphi, and Free Pascal) generally base themselves on the UCSD dialect and thus have Pascal strings; Delphi currently has five such string types (short/ansi/wide/unicode/open).
On the other hand, this means that in a loop you need an additional index- or length-based test for the end of the string.
So instead of copying a string using
while p2^ <> #0 do begin p^ := p2^; inc(p); inc(p2); end;
which is roughly equivalent to
while (*s++ = *t++);
in C when using an optimizing compiler,
you need to do e.g.
while len > 0 do begin p^ := p2^; inc(p); inc(p2); dec(len); end;
or even
i := 1;
while i <= len do begin p[i] := p2[i]; inc(i); end;
This makes the number of instructions in a Pascal string loop slightly larger than in the equivalent zero-terminated string loop, and it adds one more live value. Additionally, UCSD was a bytecode (p-code) interpreted language, and the latter, index-based code is "safe".
On an architecture with built-in post-increment (++) addressing (like the PDP-11 that C was originally developed for), the pointer version was even cheaper, especially without optimization. Nowadays optimizing compilers can easily detect any of these constructs and convert them to whatever is best.
More importantly, since the early nineties security has become more important, and relying solely on the null-terminator property is generally frowned upon, because small errors in validation can cause potentially exploitable buffer overflows. C and its standards therefore discourage the old string routines in favour of "n" versions (strncpy etc.) that take a maximum length. This adds the same kind of extra live value as the length, much like a manually managed Pascal-string scheme where the programmer must pass the length (or the maximum buffer size, for C's "n" functions) around. Pascal strings still have the advantages of reaching the last occupied char in O(1) and of having no forbidden characters.
Length-prefixed strings are also used extensively in file formats, because, obviously, it is useful to know the number of bytes to read up front.
It's an old name dating back to the days when "C language versus Pascal language" was actually a comparison people made. Depending on who you ask, it refers either specifically to storing the length in the first byte, or to any length prefix (two bytes, four bytes). Other memory-management details are not included; they are implementation-dependent and not a fundamental difference from C strings.
Pascal strings excel in... everything. NUL-terminated strings save one to three bytes on short strings, which may have been useful in 1970 but isn't even worth mentioning today in virtually all circumstances. Aside from not being able to store a zero byte (which isn't too bad for text but rules out any kind of binary data), you can't determine string length efficiently. This affects a good portion of string algorithms negatively. One example, in the comment you link to, is string comparison: if you have the length, you can instantly return false when comparing strings of different lengths. There are also many other downsides not related to performance.
For these reasons, virtually every language implementation newer than about 1980 uses length prefixes for strings. This is another reason why the "pascal string" name is outdated.

Data.Text vs Data.Vector.Unboxed Char

Is there any difference in how Data.Text and Data.Vector.Unboxed Char work internally? Why would I choose one over the other?
I always thought it was cool that Haskell defines String as [Char]. Is there a reason that something analogous wasn't done for Text and Vector Char?
There certainly would be an advantage to making them the same.... Text-y and Vector-y tools could be written to be used in both camps. Imagine Ropes of Ints, or Regexes on strings of poker cards.
Of course, I understand that there were probably historical reasons and I understand that most current libraries use Data.Text, not Vector Char, so there are many practical reasons to favor one over the other. But I am more interested in learning about the abstract qualities, not the current state that we happen to be in.... If the whole thing were rewritten tomorrow, would it be better to unify the two?
Edit, with more info-
To put stuff into perspective-
According to this page, http://www.haskell.org/haskellwiki/GHC/Memory_Footprint, GHC uses 16 bytes for each Char in your program!
Data.Text is not O(1) indexable; it is O(n).
Ropes (binary trees wrapped around text) can also hold strings.... They have better complexity for index/insert/delete, although depending on the number of nodes and balance of the tree, index could be close to that of Text.
This is my takeaway from this-
Text and Vector Char are different internally....
Use String if you don't care about performance.
If performance is important, default to using Text.
If fast indexing of chars is necessary, and you don't mind a lot of memory overhead (up to 16x), use Vector Char (see the sketch after this list).
If you want to insert/delete a lot of data, use Ropes.
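A minimal sketch of that indexing difference (the function names are only for illustration):
import qualified Data.Text as T
import qualified Data.Vector.Unboxed as VU
-- T.index has to walk the packed, variable-width encoding, so it is O(n);
-- an unboxed Vector Char is a flat array, so (VU.!) is O(1)
nthOfText :: T.Text -> Int -> Char
nthOfText = T.index
nthOfVector :: VU.Vector Char -> Int -> Char
nthOfVector = (VU.!)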
It's a fairly bad idea to think of Text as being a list of characters. Text is designed to be thought of as an opaque, user-readable blob of Unicode text. Character boundaries might be defined based on encoding, locale, language, time of month, phase of the moon, coin flips performed by a blinded participant, and the migratory patterns of Venezuela's national bird, whatever it may be. The same story happens with sorting, upcasing, reversing, etc.
Which is a long way of saying that Text is an abstract type representing human language and goes far out of its way to not behave just the same way as its implementation, be it a ByteString, a Vector UTF16CodePoint, or something totally unique (which is the case).
To clarify this distinction take note that there's no guarantee that unpack . pack witnesses an isomorphism, that the preferred ways of converting from Text to ByteString are in Data.Text.Encoding and are partial, and that there's a whole sophisticated plug-in module text-icu littered with complex ways of handling human language strings.
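For instance, a minimal sketch of pack failing to be injective (this relies on the documented replacement of invalid scalar values):
import qualified Data.Text as T
-- a lone surrogate is not a valid Unicode scalar value, so pack replaces it
-- with U+FFFD; unpack . pack is therefore not the identity on all of [Char]
main :: IO ()
main = print (T.unpack (T.pack "\xD800"))   -- prints "\65533"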
You absolutely should use Text if you're dealing with a human language string. You should also be really careful to treat it with care since human language strings are not easily amenable to computer processing. If your string is better thought of as a machine string, you probably should use ByteString.
The pedagogical advantages of type String = [Char] are high, but the practical advantages are quite low.
To add to what J. Abrahamson said, it's also worth making the distinction between iterating over runes (roughly character by character, though they could really be ideograms too) and iterating over individual logical Unicode code points. Sometimes you need to know if you're looking at a code point that has been "decorated" by a previous code point.
In the case of the latter, you then have to make the distinction between code points that stand alone (such as letters, ideograms) and those that modify the text that follows (right-to-left code point, diacritics, etc).
Well-implemented Unicode libraries will typically abstract these details away and let you process the text in a more or less character-by-character fashion, but you have to drop certain assumptions that come from thinking in terms of ASCII.
A byte is not a character. A logical unit of text isn't necessarily a "character". Not every code point stands alone; some decorate/annotate the following code point or even the rest of the byte stream until invalidated (right-to-left).
Unicode is hard. There is no one true encoding that will eliminate the difficulty of encapsulating the variety inherent in human language. Data.Text does a respectable job of it though.
To summarize:
The methods of processing are:
byte-by-byte - totally invalid for Unicode, only applicable to latin-1/ASCII
code point by code point - works for processing Unicode, but is lower-level than people realize
logical rune-by-rune - what you actually want
The types are:
String (aka [Char]) - has a limited scope. Best used for teaching Haskell or for legacy use-cases.
Text - the preferred way to handle "human" text.
ByteString - for byte streams, raw data, binary etc.
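Crossing between the last two types is an explicit encode/decode step; a minimal sketch using Data.Text.Encoding, as mentioned above:
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
-- "human text" to "machine bytes" and back is never a free coercion
toBytes :: T.Text -> BS.ByteString
toBytes = TE.encodeUtf8
fromBytes :: BS.ByteString -> T.Text
fromBytes = TE.decodeUtf8   -- partial: throws an exception on invalid UTF-8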

Data.Text vs String

While the general opinion of the Haskell community seems to be that it's always better to use Text instead of String, the fact that the APIs of most maintained libraries are still String-oriented confuses the hell out of me. On the other hand, there are notable projects which consider String a mistake altogether and provide a Prelude with all instances of String-oriented functions replaced with their Text counterparts.
So are there any reasons for people to keep writing String-oriented APIs except backwards- and standard Prelude-compatibility and the "switch-making inertia"?
Are there possibly any other drawbacks to Text as compared to String?
Particularly, I'm interested in this because I'm designing a library and trying to decide which type to use to express error messages.
My unqualified guess is that most library writers don't want to add more dependencies than necessary. Since strings are part of literally every Haskell distribution (it's part of the language standard!), it is a lot easier to get adopted if you use strings and don't require your users to sort out Text distributions from hackage.
It's one of those "design mistakes" that you just have to live with unless you can convince most of the community to switch over night. Just look at how long it has taken to get Applicative to be a superclass of Monad – a relatively minor but much wanted change – and imagine how long it would take to replace all the String things with Text.
To answer your more specific question: I would go with String unless you get noticeable performance benefits by using Text. Error messages are usually rather small one-off things so it shouldn't be a big problem to use String.
On the other hand, if you are the kind of ideological purist that eschews pragmatism for idealism, go with Text.
* I put design mistakes in scare quotes because strings as a list-of-chars is a neat property that makes them easy to reason about and integrate with other existing list-operating functions.
If your API is targeted at processing large amounts of character oriented data and/or various encodings, then your API should use Text.
If your API is primarily for dealing with small one-off strings, then using the built-in String type should be fine.
Using String for large amounts of text will make applications using your API consume significantly more memory. Using it with foreign encodings could seriously complicate usage depending on how your API works.
String is quite expensive (at least 5N words, where N is the number of Chars in the String). A word is the same number of bits as the processor architecture (e.g. 32 bits or 64 bits):
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
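As a rough worked example of that figure (assuming a 64-bit machine, so 8-byte words): a one-million-character String costs at least 5 * 8 * 10^6 bytes, roughly 40 MB, whereas the same ASCII text held as a strict ByteString is about 1 MB plus a small constant overhead.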
There are at least three reasons to use [Char] in small projects.
[Char] does not rely on any arcane stuff, like foreign pointers, raw memory, raw arrays, etc. that may work differently on different platforms or even be unavailable altogether
[Char] is the lingua franca in Haskell. There are at least three 'efficient' ways to handle Unicode data in Haskell: UTF-8-encoded ByteString, Data.Text.Text and Data.Vector.Unboxed.Vector Char, each requiring an extra package.
by using [Char] one gains access to all the power of the [] monad, including many list-specific functions (alternative string packages do try to help with this, but still)
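For example, a trivial sketch of how ordinary list machinery applies directly to String:
import Data.Char (toUpper)
-- map, filter, list comprehensions and the [] monad all work on String as-is
shout :: String -> String
shout = map toUpper
vowelsOf :: String -> String
vowelsOf s = [c | c <- s, c `elem` "aeiou"]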
Personally, I consider UTF-16-based Data.Text one of the most questionable decisions of the Haskell community, since UTF-16 combines the flaws of both the UTF-8 and UTF-32 encodings while having none of their benefits.
I wonder whether Data.Text is always more efficient than String?
"cons", for instance, is O(1) for Strings and O(n) for Text. Append is O(n) for Strings and O(n+m) for strict Texts. Likewise,
let foo = "foo" ++ bigchunk
bar = "bar" ++ bigchunk
is more space efficient for Strings than for strict Texts.
Other issues not related to efficiency are pattern matching (perspicuous code) and laziness (predictably per-character in Strings, somewhat implementation-dependent in lazy Text).
Texts are obviously good for static character sequences and for in-place modification. For other forms of structural editing, String might have advantages.
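A small sketch of the cons difference mentioned above:
import qualified Data.Text as T
-- (:) just allocates one new list cell and shares the tail, so it is O(1);
-- T.cons has to copy the whole underlying array, so it is O(n)
consString :: String -> String
consString = ('x' :)
consText :: T.Text -> T.Text
consText = T.cons 'x'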
I do not think there is a single technical reason for String to remain.
And I can see several reasons for it to go.
Overall I would first argue that in the Text/String case there is only one best solution:
String performance is bad; everyone agrees on that
Text is not difficult to use. All functions commonly used on String are available on Text, plus some more that are useful in the context of strings (substitution, padding, encoding); a short sketch follows this answer
having two solutions creates unnecessary complexity unless all base functions are made polymorphic. Proof: there are SO questions on the subject of automatic conversions. So this is a problem.
So one solution is less complex than two, and the shortcomings of String will make it disappear eventually. The sooner the better!
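A minimal sketch of those Text conveniences, using OverloadedStrings so literals can be Text directly (the message contents are made up for illustration):
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
main :: IO ()
main = do
  let msg = T.replace "world" "Haskell" "hello, world"   -- substitution
  TIO.putStrLn (T.toUpper msg)                           -- prints HELLO, HASKELL
  TIO.putStrLn (T.justifyLeft 20 '.' "padded")           -- padding with '.'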

Most efficient method of constructing and printing very large strings

I have a program which constructs very large strings. Currently I am using lazy ByteStrings. Here are the problem parameters summarized:
The current implementation works up to about 500k characters, simply running out of memory afterwards (~600MB). I would like this number of characters to fit in under 50MB.
The string isn't accessed while being built. This probably leads to a lot of thunks and hence the memory issue. I am using Builder to make the ByteStrings, but it seems that there is no strict version of Builder (or at least I can't find it); a Builder sketch follows below.
The string cannot be put in the file while being built. The entire build operation has to happen before the string is placed in a file.
I don't need Unicode support. Even 7-bit ASCII would do. I believe that ByteString doesn't waste memory encoding Unicode characters, though.
Things I have tried:
Calling seq on the ByteStrings as they are being built. This seems to work for 50-100k characters but after that the effect is the same.
Using strict ByteStrings. I couldn't figure out how to use Builder with them, so I ended up using lists and concat.
Using UArray Int Char. This means either knowing the size of the string in advance and allocating the entire array, or having a ton of intermediate data structures.
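A hedged sketch of the Builder route mentioned above, assuming the whole result is accumulated as a Builder first and only materialized (as a lazy ByteString of packed chunks) when it is finally written; the generator and the filename "out.txt" are made up for illustration:
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL
-- accumulate cheaply as a Builder; no large intermediate list of Chars is built
buildAll :: [Int] -> B.Builder
buildAll = foldMap (\i -> B.intDec i <> B.charUtf8 '\n')
main :: IO ()
main = BL.writeFile "out.txt" (B.toLazyByteString (buildAll [1 .. 500000]))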

What defines data that can be stored in strings

A few days ago, I asked why it's not possible to store binary data, such as a jpg file, in a string variable.
Most of the answers I got said that string is used for textual information such as what I'm writing now.
What is considered textual data though? Bytes of a certain nature represent a jpg file and those bytes could be represented by character byte values...I think. So when we say strings are for textual information, is there some sort of range or list of characters that aren't stored?
Sorry if the question sounds silly. Just trying to 'get it'
I see three major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables - e.g. if it's a UTF-8, UTF-16 or ASCII string. New line characters may also be translated depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
I would prefer to store binary data as binary; you would only think of converting it to text when there's no other choice, since converting it to a textual representation does waste some bytes (not much, but it still counts). That's how attachments are put in email.
Base64 is a good textual representation of binary files.
I think you are referring to the binary-to-text encoding issue (translating a jpg into a string would require that sort of pre-processing).
Indeed, in that article, some characters are mentioned as not always supported, other can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.
Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
string myString = Encoding.ASCII.GetString(myJpegByteArray);
Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
Deep down everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
strings, for example, conventionally end in a byte with value 0 (or follow some other convention)
jpg's don't
Depends on the language. For example, in Python 2 the string type (str) is really a byte array, so it can indeed be used for binary data.
In C the NUL byte is used for string termination, so a string cannot be used for arbitrary binary data, since binary data could contain null bytes.
In C# a string is an array of chars, and since a char is basically an alias for a 16-bit int, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not actually correspond to a valid Unicode character), and some operations like case conversions will probably fail in strange ways.
In short, it might be possible in some languages to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforeseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, the string is an abstraction for data interpreted as text. Text was invented for communication among humans; computers and programs do not communicate very well using text. SQL is textual, but it is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human-to-human or human-to-machine interaction (say, for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries lots of risk, because you are using the data type for something it was not designed to handle. This makes it much more error-prone. You may be able to store binary data in strings, but just because you are able to shoot yourself in the foot doesn't mean you should.
Summary: You can do it. But you'd better not.
Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.
