In Haskell's Data.Text.Encoding, presuming one presents a pure ASCII ByteString, is decodeLatin1 very much faster than decodeUtf8? Intuitively it seems there would be at least one extra machine instruction, given the nature of UTF-8 (i.e., a test of the top bit). I know I could do my own profiling, but I presume this has already been done, which is why I ask.
Here is the underlying C code that the text library uses internally for the decoder, specifically the function _hs_text_decode_latin1:
http://hackage.haskell.org/package/text-1.0.0.1/src/cbits/cbits.c
is decodeLatin1 very much faster than decodeUtf8
The answer is simply that it shouldn't matter: choose the Latin-1 decoder if you need to work with existing Latin-1 data. Anything else is micro-optimization in almost all cases; the text library is already very heavily optimized.
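That said, if you really want numbers for your own setup, a criterion benchmark takes only a few lines. Here is a minimal sketch; the 16 KiB input size and the benchmark names are arbitrary choices of mine:

import Criterion.Main (bench, defaultMain, nf)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  -- 16 KiB of ASCII-only input, so both decoders accept it
  let ascii = B.replicate 16384 'a'
  defaultMain
    [ bench "decodeLatin1" (nf TE.decodeLatin1 ascii)
    , bench "decodeUtf8"   (nf TE.decodeUtf8   ascii)
    ]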
Do you know the fastest way to encode and decode UTF-8 given some extra information about the data? Here are the interesting cases that occur to me:
Serialization
I just want to encode an opaque buffer with no validation so I can decode again later. The fastest would be to use the underlying memory buffer and somehow unsafely coerce it from Text to ByteString without touching the contents.
Probably ASCII
I guess that 99% of the time my UTF-8 is actually ASCII, so it makes sense to do a first pass to confirm this and only do further processing if it turns out not to be true.
Probably not ASCII
Converse of the previous.
Probably short
A single key in JSON or a database that I guess will be 1 to 20 characters. It would be silly to pay some upfront cost, like a vectorized SIMD approach.
Probably long
An HTML document. It's worth paying some upfront cost for the highest throughput.
There are some more similar variants, like encoding JSON or a URL when you think there are probably no characters that need escaping.
I'm asking this question under the [Haskell] tag since Haskell's strong typing makes some techniques that would be easy in, say, C hard to implement. Also, there may be some special GHC tricks, like using SSE4 instructions on an Intel platform, that would be interesting. But this is more of a UTF-8 issue in general, and good ideas would be helpful in any language.
Update
After some research I propose to implement encode and decode for serialization purposes like so:
myEncode :: Text -> ByteString
myEncode = unsafeCoerce
myDecode :: ByteString -> Text
myDecode = unsafeCoerce
This is a great idea if you enjoy segfaults ...
This question touches on a sprawling range of issues. I'm going to interpret it as "In Haskell, how should I convert between Unicode and other character encodings?"
In Haskell, the recommended way to convert to and from Unicode is with the functions in text-icu, which provides some basic functions:
fromUnicode :: Converter -> Text -> ByteString
toUnicode :: Converter -> ByteString -> Text
text-icu is a binding to the International Components for Unicode libraries, which do the heavy lifting for, among other things, encoding and decoding to and from non-Unicode character sets. Its website gives documentation on conversion in general and some specific information on how its converter implementations operate. Note that different character sets require somewhat different converter implementations.
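As a minimal sketch of using these functions, assuming the open function from Data.Text.ICU.Convert (which looks a converter up by name; "ISO-8859-1" is just an example name, and Nothing keeps ICU's default fallback behaviour):

import qualified Data.ByteString as B
import Data.Text (Text)
import Data.Text.ICU.Convert (fromUnicode, open, toUnicode)

latin1ToText :: B.ByteString -> IO Text
latin1ToText bytes = do
  conv <- open "ISO-8859-1" Nothing   -- look the converter up by name
  return (toUnicode conv bytes)

textToLatin1 :: Text -> IO B.ByteString
textToLatin1 txt = do
  conv <- open "ISO-8859-1" Nothing
  return (fromUnicode conv txt)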
ICU can also attempt to automatically detect the character set of an input. "This is, at best, an imprecise operation using statistics and heuristics." No other implementation could "fix" that characteristic. The Haskell bindings do not expose that functionality as I write; see #8.
I don't know of any character set conversion procedures written in native Haskell. As the ICU documentation indicates, there is a lot of complexity; after all, this is a rich area of international computing history.
Performance
As the ICU FAQ laconically notes, "Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint." Although that comment is not specifically about conversions, I'd expect it to be broadly the case here as well. Is your experience otherwise?
unsafeCoerce is not appropriate here.
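If the goal is just serializing Text to bytes and back, the safe route is Data.Text.Encoding. A sketch of the round trip, reusing the names from the question; decodeUtf8' is the variant that reports bad input instead of throwing:

import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

myEncode :: Text -> ByteString
myEncode = TE.encodeUtf8

myDecode :: ByteString -> Either UnicodeException Text
myDecode = TE.decodeUtf8'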
Is there any difference in how Data.Text and Data.Vector.Unboxed Char work internally? Why would I choose one over the other?
I always thought it was cool that Haskell defines String as [Char]. Is there a reason that something analogous wasn't done for Text and Vector Char?
There certainly would be an advantage to making them the same.... Text-y and Vector-y tools could be written to be used in both camps. Imagine Ropes of Ints, or Regexes on strings of poker cards.
Of course, I understand that there were probably historical reasons and I understand that most current libraries use Data.Text, not Vector Char, so there are many practical reasons to favor one over the other. But I am more interested in learning about the abstract qualities, not the current state that we happen to be in.... If the whole thing were rewritten tomorrow, would it be better to unify the two?
Edit, with more info-
To put stuff into perspective-
According to this page, http://www.haskell.org/haskellwiki/GHC/Memory_Footprint, GHC uses 16 bytes for each Char in your program!
Data.Text is not O(1) indexable; indexing is O(n).
Ropes (binary trees wrapped around text) can also hold strings.... They have better complexity for index/insert/delete, although depending on the number of nodes and balance of the tree, index could be close to that of Text.
This is my takeaway from this-
Text and Vector Char are different internally....
Use String if you don't care about performance.
If performance is important, default to using Text.
If fast indexing of chars is necessary, and you don't mind a lot of memory overhead (up to 16x), use Vector Char (see the sketch below).
If you want to insert/delete a lot of data, use Ropes.
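To make the indexing point concrete, a small sketch (the helper names are made up): Data.Text.index walks the buffer, while converting once to an unboxed Vector Char trades an O(n) conversion and a flat array for O(1) access afterwards.

import qualified Data.Text as T
import qualified Data.Vector.Unboxed as V

textNth :: T.Text -> Int -> Char
textNth = T.index                        -- O(n): walks the encoded buffer

toVector :: T.Text -> V.Vector Char
toVector = V.fromList . T.unpack         -- pay the conversion cost once

vectorNth :: V.Vector Char -> Int -> Char
vectorNth = (V.!)                        -- O(1) afterwards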
It's a fairly bad idea to think of Text as being a list of characters. Text is designed to be thought of as an opaque, user-readable blob of Unicode text. Character boundaries might be defined based on encoding, locale, language, time of month, phase of the moon, coin flips performed by a blinded participant, and the migratory patterns of Venezuela's national bird, whatever it may be. The same story happens with sorting, up-casing, reversing, etc.
Which is a long way of saying that Text is an abstract type representing human language and goes far out of its way to not behave just the same way as its implementation, be it a ByteString, a Vector UTF16CodePoint, or something totally unique (which is the case).
To clarify this distinction, note that there's no guarantee that unpack . pack witnesses an isomorphism, that the preferred ways of converting between Text and ByteString live in Data.Text.Encoding and are partial in the decoding direction, and that there's a whole sophisticated companion package, text-icu, littered with complex ways of handling human-language strings.
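Two small examples of those surprises, using nothing beyond text and bytestring: full Unicode case mapping can change the number of code points, and the UTF-8 decoders are partial (decodeUtf8' is the variant that reports failure instead of throwing).

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Case mapping can change the number of code points:
sharpS :: Bool
sharpS = T.toUpper (T.pack "straße") == T.pack "STRASSE"   -- True: one ß becomes two S

-- Decoding arbitrary bytes as UTF-8 can fail:
badUtf8 :: Either UnicodeException T.Text
badUtf8 = TE.decodeUtf8' (B.pack [0x80])   -- Left ...: a lone continuation byte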
You absolutely should use Text if you're dealing with a human language string. You should also be really careful to treat it with care since human language strings are not easily amenable to computer processing. If your string is better thought of as a machine string, you probably should use ByteString.
The pedagogical advantages of type String = [Char] are high, but the practical advantages are quite low.
To add to what J. Abrahamson said, it's also worth making the distinction between iterating over runes (roughly character by character, though they could be ideograms too) and iterating over individual Unicode code points. Sometimes you need to know whether you're looking at a code point that has been "decorated" by a previous code point.
In the case of the latter, you then have to make the distinction between code points that stand alone (such as letters, ideograms) and those that modify the text that follows (right-to-left code point, diacritics, etc).
Well-implemented Unicode libraries will typically abstract these details away and let you process the text in a more or less character-by-character fashion, but you have to drop certain assumptions that come from thinking in terms of ASCII.
A byte is not a character. A logical unit of text isn't necessarily a "character". Not every code point stands alone; some decorate/annotate the following code point, or even the rest of the byte stream until invalidated (right-to-left).
Unicode is hard. There is no one true encoding that will eliminate the difficulty of encapsulating the variety inherent in human language. Data.Text does a respectable job of it though.
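A tiny illustration of the gap between code points and what a reader sees, using only Data.Text: the same rendered "é" can be one code point (precomposed) or two (e plus a combining acute), and Text counts code points, not runes.

import qualified Data.Text as T

precomposed, decomposed :: T.Text
precomposed = T.pack "\x00E9"       -- é as one code point, U+00E9
decomposed  = T.pack "e\x0301"      -- e followed by U+0301 COMBINING ACUTE ACCENT

lengths :: (Int, Int)
lengths = (T.length precomposed, T.length decomposed)   -- (1, 2): code points, not "characters"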
To summarize:
The methods of processing are:
byte-by-byte - totally invalid for unicode, only applicable to latin-1/ASCII
code point by code point - works for processing unicode, but is lower-level than people realize
logical rune-by-rune - what you actually want
The types are:
String (aka [Char]) - has a limited scope. Best used for teaching Haskell or for legacy use-cases.
Text - the preferred way to handle "human" text.
Bytestring - for byte streams, raw data, binary etc.
I don't know what the original encoding is, so I assume it is either IBM850 or ISO8859-1. My process is below:
IBM850 -> UTF8
If this succeeds, I conclude the original encoding is IBM850; if not, I do the next step:
ISO8859-1 -> UTF8
If this succeeds, I conclude the original encoding is ISO8859-1.
But there is a problem:
if the original encoding is ISO8859-1, it will be recognised as IBM850;
if the original encoding is IBM850, it will be recognised as ISO8859-1.
It seems there is a lot of common ground between IBM850 and ISO8859-1.
Can anyone help? Thanks.
Yes, only the most trivial kind of autodetection is possible by testing whether conversion fails or succeeds. It's not going to work for input encodings where (almost) any input is valid.
You need to know something more about the text you expect, to test whether it makes more sense after translating from IBM850 or from ISO8859-1. That's what enca and libenca do. You can probably start with some simple expectations to check:
Does your source happen to be within the ASCII subset of both encodings? Then you're happy with any conversion (but you have no way to know the original encoding at all).
Does your text use box-drawing characters? If it should not, a candidate decoding that produces them makes it easy to reject IBM850.
Does your text use control characters from ISO8859-1? If it should not, it is easy to reject ISO8859-1 as a candidate whenever bytes in the 0x80-0x9F range appear (see the sketch after this list).
Do the non-ASCII fragments of your text always represent natural-language text? Then you can use frequency tables for characters and their pairs, selecting the source encoding which makes the result closer to your natural language(s) on those criteria. (If both variants are almost equally acceptable, it's probably better to give an error message and leave the final decision to a human.)
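As a sketch of that control-character check (the function name is made up; bytes 0x80-0x9F are the C1 control range in ISO8859-1 but printable characters in IBM850):

import qualified Data.ByteString as B
import Data.Word (Word8)

-- True if the bytes could plausibly be ISO8859-1 text, i.e. no C1 controls appear
plausiblyIso8859_1 :: B.ByteString -> Bool
plausiblyIso8859_1 = not . B.any isC1Control
  where
    isC1Control :: Word8 -> Bool
    isC1Control b = b >= 0x80 && b <= 0x9F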
I am writing an Erlang module that has to deal with strings a bit; not too much, however: I do some TCP recv and then some parsing over the data.
While matching data and manipulating strings, I am using the binary module all the time, like binary:split(Data, <<":">>), and basically using <<"StringLiteral">> everywhere.
Up to now I have not encountered difficulties or missing functions compared with the alternative (using lists), and everything comes out quite naturally, except maybe for adding the <<>>; but I was wondering whether this way of dealing with strings might have drawbacks I am not aware of.
Any hint?
As long as you and your team remember that your strings are binaries and not lists, there are no inherent problems with this approach. In fact, CouchDB took this approach as an optimization, which apparently paid nice dividends.
You do need to be very aware of how your string is encoded in your binaries. When you write <<"StringLiteral">> in your code, you have to be aware that this is simply a binary serialization of the list of code points. Your Erlang compiler reads your code as ISO-8859-1 characters, so as long as you only use Latin-1 characters and do this consistently, you should be fine. But this isn't very friendly to internationalization.
Most application software these days should prefer a Unicode encoding. UTF-8 is compatible with your <<"StringLiteral">> for the first 128 code points, but not for the second 128, so be careful. You might be surprised what you see on your UTF-8 encoded web applications if you use <<"StrïngLïteral">> in your code.
There was an EEP proposal for binary support in the form of <<"StrïngLïteral"/utf8>>, but I don't think this is finalized.
Also be aware that your binary:split/2 call may have unexpected results on UTF-8 data if there is a multi-byte character that contains the ISO-8859-1 byte you are splitting on.
Some would argue that UTF-16 is a better encoding to use because it can be parsed more efficiently and can be more easily split by index, if you assume or verify that there are no characters that need more than 16 bits.
The unicode module should be used, but tread carefully when you use literals.
The only thing to be aware of is that a binary is a slice of bytes, whereas a list is a list of unicode codepoints. In other words, the latter is naturally unicode whereas the former requires you to do some sort of encoding, usually UTF-8.
To my knowledge, there are no drawbacks to your method.
Binaries are very efficient structures for storing strings. If they are longer than 64 bytes they are also stored outside the process heap, so they are not subject to per-process garbage collection (they are still reclaimed by reference counting once the last reference is dropped). Don't forget to use iolists for concatenating them, to avoid copying when performance matters.
I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.
Apart from String/ByteString there is now the Text library, which combines the best of both worlds: it works with Unicode while storing the text in a packed, unboxed array internally, so you get fast, correct strings.
Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As for the efficiency of the default string implementation in Haskell: it isn't efficient. Each Char represents a Unicode code point, which means it needs at least 21 bits per Char.
Since a String is just [Char], that is, a linked list of Char, Strings have poor locality of reference and are fairly large in memory: at a minimum N * (21 bits + M bits), where N is the length of the string and M is the size of a pointer (32, 64, what have you). And unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be optimized into loops, etc. by the compiler.
And while a Char corresponds to a code point, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default, much less a way to change it. In practice GHC provides extensions to do e.g. binary IO, but you're outside the standard at that point anyway.
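For completeness, one such GHC-specific facility beyond the report is picking a handle's encoding explicitly via System.IO. A minimal sketch, forcing the contents before the handle is closed:

import System.IO (IOMode (ReadMode), hGetContents, hSetEncoding, utf8, withFile)

readUtf8File :: FilePath -> IO String
readUtf8File path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8          -- decode as UTF-8 regardless of the locale
  contents <- hGetContents h
  length contents `seq` return contents   -- force before withFile closes the handle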
Even with operations like prepending to the front of the string, it's unlikely that a String will beat a ByteString in practice.
The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless it's just throwaway code, one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
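A concrete example of that mismatch, using only text and bytestring: the Euro sign is a single Char and a single code point, but three bytes once UTF-8 encoded.

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

euro :: T.Text
euro = T.pack "\x20AC"              -- the Euro sign, a single code point

euroSizes :: (Int, Int)
euroSizes = (T.length euro, B.length (TE.encodeUtf8 euro))   -- (1, 3)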
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
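To illustrate the prepend difference mentioned above (a sketch; the helper names are made up): cons on a strict ByteString copies the whole string, while the lazy variant just puts a new chunk in front.

import qualified Data.ByteString.Char8 as S        -- strict
import qualified Data.ByteString.Lazy.Char8 as L   -- lazy

strictPrepend :: Char -> S.ByteString -> S.ByteString
strictPrepend = S.cons        -- O(n): allocates and copies the whole string

lazyPrepend :: Char -> L.ByteString -> L.ByteString
lazyPrepend = L.cons          -- O(1): prepends a one-byte chunk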
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligible amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.