The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English-word-sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory than a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
Below I have tried to condense a particular problematic case into a small example:
import qualified Data.ByteString.Lazy.Char8 as S
import qualified Data.ByteString as Strict
import qualified Data.Map as Map
import qualified Data.Vector.Unboxed as U
import qualified Data.Serialize as Serialize
import Control.Monad.State

main = putStr
     . unlines . map show . flip evalState (0, Map.empty)
     . mapM toInt
     . S.words
     =<< S.getContents

toInt x = do
  let x' =
        U.fromList . Strict.unpack .  -- Comment this line to increase memory usage
        Serialize.encode $ x
  (i, t) <- get
  case Map.lookup x' t of
    Just j  -> return j
    Nothing -> do
      let i' = i + (1 :: Int)
      put (i', Map.insert x' i t)
      return i
When I run this on a file containing around 400,000 words of English text, the version with strict bytestring keys uses around 50 MB of memory; the one with Word8 vectors uses 6 MB.
In the absence of other answers, I'm going to go out on a limb here.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
If the small strings are meant to be human readable (e.g. an English word) then use Text. If they are meant to be read only by the computer, use ByteString. The decision to use strict or lazy variants of these depends on how you build and use these small strings.
You shouldn't need to use your own unboxed Vectors of Word8. If you are experiencing a specific situation where regular String is faster than Text or ByteString, then throw the details up on StackOverflow and we'll try to figure out why. If you perform a detailed analysis and can prove that an unboxed Vector of Word8 consistently works significantly better than Text or ByteString, then start conversations on mailing lists, irc, reddit, etc.; the standard libraries are not set in stone, and improvements are always welcome.
But I think it highly likely that you are just doing something weird, as hammar and shang suggest.
P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests.
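As a rough illustration of the P.S., here is a minimal sketch assuming the bytestring-trie package, whose Data.Trie module exposes Map-like lookup/insert for strict ByteString keys (addWord is a name of my own choosing; check the package docs before relying on the exact API):

import qualified Data.ByteString as B
import qualified Data.Trie as Trie  -- from bytestring-trie (API assumed here)

-- Intern a word: an already-seen word keeps its index, a new word gets the
-- next free one. Same idea as the Data.Map version, but keyed by a Trie.
addWord :: B.ByteString -> (Int, Trie.Trie Int) -> (Int, Trie.Trie Int)
addWord w (next, t) =
  case Trie.lookup w t of
    Just _  -> (next, t)
    Nothing -> (next + 1, Trie.insert w next t)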
A (strict) ByteString is a constructor over an unboxed ForeignPtr to a Word8 and two unboxed Ints.
A ForeignPtr is another constructor over an Addr# (a GHC prim) and a ForeignPtrContents:
data ForeignPtrContents
= PlainForeignPtr !(IORef (Finalizers, [IO ()]))
| MallocPtr (MutableByteArray# RealWorld) !(IORef (Finalizers, [IO ()]))
| PlainPtr (MutableByteArray# RealWorld)
...
For short strings, ByteStrings simply carry too much administrative overhead for their contiguous representation of the actual "string" data to pay off.
For the original question - I'd check the average word length of your corpus, but I can't see ByteString being more efficient than String aka [Char], which uses 12 bytes per Char (source: the original ByteString paper).
A general plea to Haskellers (not aimed at the poster of the original question) - please stop bashing String aka [Char]; having both String and Text (and ByteString when you really need bytes) makes sense. Or use Clean, where the contiguous String representation is better suited to short strings.
Caveat - I may have been looking at an old version of the ByteString internals with regards to what data types it uses internally.
I know this is a 6-year-old post, but I was wondering the same thing recently, and found this useful blog post: https://markkarpov.com/post/short-bs-and-text.html. It seems that yes, this is a recognized issue, and ShortText / ShortByteString are the solution.
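For example, switching the table keys to ShortByteString (from Data.ByteString.Short in the bytestring package) is a small change; a minimal sketch, with WordTable and internWord as names of my own choosing:

import qualified Data.ByteString as B
import Data.ByteString.Short (ShortByteString, toShort)
import qualified Data.Map.Strict as Map

type WordTable = Map.Map ShortByteString Int

-- Insert a word keyed by its compact ShortByteString form; an existing
-- entry keeps its old index, a new one gets the current table size.
internWord :: B.ByteString -> WordTable -> WordTable
internWord w table =
  Map.insertWith (\_new old -> old) (toShort w) (Map.size table) table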
I need to encode some data to JSON and then push it to syslog using hsyslog. The types of the two relevant functions are:
Aeson.encode :: a -> Data.ByteString.Lazy.ByteString
System.Posix.Syslog.syslog :: Maybe Facility
                           -> Priority
                           -> CStringLen
                           -> IO ()
What's the most efficient way (speed & memory) to convert a Lazy.ByteString -> CStringLen? I found Data.ByteString.Unsafe, but it works only with strict ByteString, not Lazy.ByteString.
Shall I just stick in unsafeUseAsCStringLen . Data.String.Conv.toS and call it a day? Will it do the right thing with regard to efficiency?
I guess I would use Data.ByteString.Lazy.toStrict in place of toS, to avoid the additional package dependency.
Anyway, you won't find anything more efficient than:
unsafeUseAsCStringLen (toStrict lbs) $ \cstrlen -> ...
In general, toStrict is an "expensive" operation, because a lazy ByteString will generally be made up of a bunch of "chunks" each consisting of a strict ByteString and not necessarily yet loaded into memory. The toStrict function must force all the strict ByteString chunks into memory and ensure that they are copied into a single, contiguous block as required for a strict ByteString before the no-copy unsafeUseAsCStringLen is applied.
However, toStrict handles a lazy ByteString that consists of a single chunk optimally without any copying.
In practice, aeson uses an efficient Data.ByteString.Builder to create the JSON, and if the JSON is reasonably small (less than 4k, I think), it will build a single-chunk lazy ByteString. In this case, toStrict is zero-copy, and unsafeUseAsCStringLen is zero copy, and the entire operation is basically free.
But note that, in your application, where you are passing the string to the syslogger, fretting about the efficiency of this operation is crazy. My guess would be that you'd need thousands of copy operations to even make a dent in the performance of the overall action.
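For completeness, a minimal sketch of how the pieces might fit together, using the hsyslog signature quoted in the question (logJSON is a name of my own choosing, and Info is just an example priority):

import Data.Aeson (ToJSON, encode)
import Data.ByteString.Lazy (toStrict)
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import System.Posix.Syslog (Priority (Info), syslog)

-- Encode to JSON, force the lazy ByteString into one strict chunk, and hand
-- the resulting CStringLen to syslog without any further copying.
logJSON :: ToJSON a => a -> IO ()
logJSON x = unsafeUseAsCStringLen (toStrict (encode x)) (syslog Nothing Info)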
Suppose I wish to write something like this:
-- | Decode a 'ByteString' containing Code Page 437 encoded text.
decodeCP437 :: ByteString -> Text
decodeCP437 = undefined
(I know about the encoding package, but its dependency list is a ridiculous price to pay for this single, and I believe quite trivial, function.)
My question is how to construct Text from ByteString with reasonable efficiency, in particular without using lists. It seems to me that Data.Text.Encoding should be a good source for inspiration, but at first sight it uses withForeignPtr and I guess it's too low level for my use case.
How should the problem be approached? In a nutshell, I guess I need to repeatedly take bytes (Word8) from the ByteString, translate every byte to its corresponding Char, and somehow efficiently build Text from them. The complexity of the basic Text-building functions in Data.Text unsurprisingly indicates that appending characters one by one is not the best idea, but I don't see better tools for this available.
Update: I want to create strict Text. It seems that the only option is to create builder then get lazy Text from it (O(n)) and then convert to strict Text (O(n)).
You can use the Builder API, which offers O(1) singleton :: Char -> Builder and O(1) (<>) :: Builder -> Builder -> Builder for efficient construction operations.
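A minimal sketch of that approach, assuming a lookup table that you would still have to fill in (only the ASCII half is mapped below; cp437ToChar needs the real upper-half CP437 table):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB
import Data.Char (chr)
import Data.Word (Word8)

-- Placeholder mapping: bytes 0-127 are plain ASCII; bytes 128-255 need the
-- real CP437 table instead of the replacement character used here.
cp437ToChar :: Word8 -> Char
cp437ToChar w
  | w < 128   = chr (fromIntegral w)
  | otherwise = '\xFFFD'

-- Build a lazy Text via the Builder (O(1) per appended character), then
-- flatten it into a strict Text at the end.
decodeCP437 :: B.ByteString -> T.Text
decodeCP437 =
    TL.toStrict
  . TB.toLazyText
  . B.foldr (\w acc -> TB.singleton (cp437ToChar w) <> acc) mempty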
As Nikita Volkov mentioned in his question Data.Text vs String, I also wondered why I have to deal with the different string implementations type String = [Char] and Data.Text in Haskell. In my code I use the pack and unpack functions really often.
My question: Is there a way to have an automatic conversion between both string types so that I can avoid writing pack and unpack so often?
In other programming languages, like Python or JavaScript, there is for example an automatic conversion between integers and floats when it is needed. Can I achieve something like this in Haskell as well? I know that the mentioned languages are weakly typed, but I heard that C++ has a similar feature.
Note: I already know about the language extension {-# LANGUAGE OverloadedStrings #-}. But as I understand it, this language extension just applies to strings written as "..." literals. I want to have an automatic conversion for strings which I get from other functions or have as arguments in function definitions.
Extended question: the question Haskell: Text or Bytestring also covers the difference between Data.Text and Data.ByteString. Is there a way to have an automatic conversion between the three string types String, Data.Text and Data.ByteString?
No.
Haskell doesn't have implicit coercions for technical, philosophical, and almost religious reasons.
For a start, converting between these representations isn't free, and most people don't like the idea of hidden and potentially expensive computations lurking around. Additionally, since Strings are lazy lists, coercing them to a Text value might not terminate.
We can convert literals to Text automatically with OverloadedStrings, which desugars a string literal "foo" to fromString "foo"; fromString for Text just calls pack.
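A tiny illustration of that desugaring:

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)

greeting :: Text
greeting = "hello"  -- desugars to fromString "hello", which for Text is pack "hello"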
A better question might be: why are you coercing so much? Is there some reason why you need to unpack Text values so often? If you're constantly changing them back to Strings, it defeats the purpose a bit.
Almost Yes: Data.String.Conversions
Haskell libraries make use of different types, so there are many situations in which there is no choice but to heavily use conversion, distasteful as it is - rewriting libraries doesn't count as a real choice.
I see two concrete problems, either of which is potentially a significant obstacle to Haskell adoption:
coding ends up requiring specific implementation knowledge of the libraries you want to use, which is a big issue for a high-level language;
performance on simple tasks is bad - which is a big issue for a generalist language.
Abstracting from the specific types
In my experience, the first problem is the time spent guessing the package name holding the right function for plumbing between libraries that basically operate on the same data.
For that problem there is a really handy solution: the Data.String.Conversions package, provided you are comfortable with UTF-8 as your default encoding.
This package provides a single cs conversion function between a number of different types.
String
Data.ByteString.ByteString
Data.ByteString.Lazy.ByteString
Data.Text.Text
Data.Text.Lazy.Text
So you just import Data.String.Conversions, and use cs which will infer the right version of the conversion function according to input and output types.
Example:
import Data.Aeson (decode)
import Data.Text (Text)
import Data.ByteString.Lazy (ByteString)
import Data.String.Conversions (cs)
decodeTextStoredJson' :: Text -> Maybe MyStructure
decodeTextStoredJson' x = decode (cs x) :: Maybe MyStructure
NB: In GHCi you generally do not have a context that gives the target type, so you direct the conversion by explicitly stating the type of the result, as you would for read:
let z = cs x :: ByteString
Performance and the cry for a "true" solution
I am not aware of any true solution as of yet, but we can already guess the direction:
it is legitimate to require conversion, because the data does not change;
best performance is achieved by not converting data from one type to another for administrative purposes;
coercion is evil - coercive, even.
So the direction must be to make these types not different, i.e. to reconcile them under (or over) an archetype from which they would all derive, allowing composition of functions using different derivations, without the need to convert.
Nota: I absolutely cannot evaluate the feasibility / potential drawbacks of this idea. There may be some very sound stoppers.
I need to make extensive use of:
import qualified Data.ByteString as B
slice :: Int -> Int -> B.ByteString -> B.ByteString
slice start len = B.take len . B.drop start
Two part question:
Does this already have a name? I can't find anything searching for that type on Hoogle, but it seems like it should be a really common need. I also tried searching for (Int, Int) -> ByteString -> ByteString and some flip'd versions of same. I also tried looking for [a] versions to see if there was a name in common use.
Is there a better way to write it?
I'm suspicious that I'm doing something wrong because I strongly expected to find lots of people having gone down the same road, but my google-fu isn't finding anything.
The idiomatic way is via take and drop, which have O(1) complexity on strict bytestrings.
slice is not provided, to discourage the reliance on unsafe indexing operations.
According to the documentation there is no such function. Currently, strict ByteStrings are represented as a pointer to the beginning of a block of pinned memory, an offset, and a length. So, indeed, your implementation is the right way to do slicing. However, you should be careful with slices, because a sliced bytestring takes up the same amount of space as the original bytestring. To avoid this you might want to copy a sliced bytestring, but this is not always necessary.
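To make the last point concrete, a copying variant might look like this (sliceCopy is a name of my own choosing, building on the slice from the question):

import qualified Data.ByteString as B

-- O(len): the result owns a fresh buffer, so the original bytestring can be
-- garbage collected once nothing else refers to it. The shared-buffer slice
-- from the question is O(1) but keeps the whole original alive.
sliceCopy :: Int -> Int -> B.ByteString -> B.ByteString
sliceCopy start len = B.copy . B.take len . B.drop start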
Let's say I need to do write/read of a Data.Time.UTCTime in "%Y-%m-%d %H:%M:%S" format many many times to/from a file.
It seems to me that, using Data.Time.formatTime or Data.Time.parseTime to convert UTCTime to/from String and then packing/unpacking the String to/from ByteString, would be too slow since it involves an intermediate String. But writing a ByteString builder/parser of UTCTime by hand seems like repeating a lot of work already done in formatTime and parseTime.
I guess my question is: Is there a systematic way to get functions of type t -> String or String -> t converted to t -> ByteString or ByteString -> t with increased efficiency without repeating a lot of work?
I am totally a Haskell newbie, so please forgive me if the question is just stupid.
No, there isn't a general way to convert a function of type t -> String to one of type t -> ByteString. It might help you reconcile yourself to that reality if you recall that a ByteString isn't just a faster String; it's lower-level than that. A ByteString is a sequence of bytes; it doesn't mean anything more unless you have an encoding in mind.
So your options are:
Use function composition, as in phg's answer:
import qualified Data.ByteString.Char8 as B
timeToByteStr :: UTCTime -> ByteString
timeToByteStr = B.pack . formatTime'
parseBStrTime :: ByteString -> Maybe UTCTime
parseBStrTime = parseTime' . B.unpack
where I've corrected the function names. I've also used, e.g., parseTime' instead of parseTime, to elide over the format string and TimeLocale you need to pass in.
(It's also worth noting that Data.Text is often a better choice than Data.ByteString.Char8. The latter only works properly if your Chars fall in Unicode code points 0-255.)
If performance is an issue and this conversion profiles out as a bottleneck, write a parser/builder.
Just use Strings.
The last option is an underrated choice for Haskell newbies, IMHO. String is suboptimal, performance-wise, but it's not spawn-of-the-Devil. If you're just getting started with Haskell, why not keep your life as simple as possible, until and unless your code becomes too slow?
If you're not really in the need of heavily optimized code, the usual and easiest way is to use function composition:
timeToByte = toBytes . formatTime
byteToTime = parseTime . fromBytes
or something like that, as I'm not familiar with those libraries.
If after profiling you recognize that this approach is still too slow, I guess you will have to write something by hand.
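As a starting point for the hand-written route, here is a rough sketch of a ByteString Builder for the "%Y-%m-%d %H:%M:%S" format (buildUTCTime and pad2 are names of my own choosing, and fractional seconds are simply truncated):

import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy as BL
import Data.Time (TimeOfDay (..), UTCTime (..), timeToTimeOfDay, toGregorian)

-- Render a non-negative number as two digits, zero-padded.
pad2 :: Int -> Builder
pad2 n
  | n < 10    = char7 '0' <> intDec n
  | otherwise = intDec n

-- "%Y-%m-%d %H:%M:%S", dropping any fractional seconds.
buildUTCTime :: UTCTime -> Builder
buildUTCTime (UTCTime day diffTime) =
  let (y, m, d)          = toGregorian day
      TimeOfDay hh mm ss = timeToTimeOfDay diffTime
  in  intDec (fromIntegral y) <> char7 '-' <> pad2 m <> char7 '-' <> pad2 d
   <> char7 ' '
   <> pad2 hh <> char7 ':' <> pad2 mm <> char7 ':' <> pad2 (truncate ss)

timeToByteString :: UTCTime -> BL.ByteString
timeToByteString = toLazyByteString . buildUTCTime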