Let's say I need to write/read a Data.Time.UTCTime in "%Y-%m-%d %H:%M:%S" format many, many times to/from a file.
It seems to me that using Data.Time.formatTime or Data.Time.parseTime to convert UTCTime to/from String, and then packing/unpacking the String to/from ByteString, would be too slow, since it goes through an intermediate String. But writing a ByteString builder/parser for UTCTime by hand seems like repeating a lot of the work already done in formatTime and parseTime.
I guess my question is: is there a systematic way to turn functions of type t -> String or String -> t into t -> ByteString or ByteString -> t with better efficiency, without repeating a lot of work?
I am totally a Haskell newbie, so please forgive me if the question is just stupid.
No, there isn't a general way to convert a function of type t -> String to one of type t -> ByteString. It might help you reconcile yourself to that reality if you recall that a ByteString isn't just a faster String; it's lower-level than that. A ByteString is a sequence of bytes; it doesn't mean anything more unless you have an encoding in mind.
So your options are:
Use function composition, as in phg's answer:
import qualified Data.ByteString.Char8 as B
timeToByteStr :: UTCTime -> ByteString
timeToByteStr = B.pack . formatTime'
parseBStrTime :: ByteString -> Maybe UTCTime
parseBStrTime = parseTime' . B.unpack
where I've corrected the function names. I've also used, e.g., parseTime' instead of parseTime, to elide over the format string and TimeLocale you need to pass in.
(It's also worth noting that Data.Text is often a better choice than Data.ByteString.Char8. The latter only works properly if your Chars fall in Unicode code points 0-255.)
If performance is an issue and this conversion profiles out as a bottleneck, write a parser/builder.
Just use Strings.
The last option is an underrated choice for Haskell newbies, IMHO. String is suboptimal, performance-wise, but it's not spawn-of-the-Devil. If you're just getting started with Haskell, why not keep your life as simple as possible, until and unless your code becomes too slow?
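To make the composition option concrete, here is a sketch of what the elided helpers could look like, assuming the format string from the question (the names fmt, formatTime' and parseTime' are just illustrative):

```haskell
import Data.Time (UTCTime (..), defaultTimeLocale, formatTime, parseTimeM)
import qualified Data.ByteString.Char8 as B

-- The format string from the question, fixed once.
fmt :: String
fmt = "%Y-%m-%d %H:%M:%S"

formatTime' :: UTCTime -> String
formatTime' = formatTime defaultTimeLocale fmt

-- parseTimeM is the non-deprecated replacement for parseTime;
-- the Bool asks whether to accept surrounding whitespace.
parseTime' :: String -> Maybe UTCTime
parseTime' = parseTimeM True defaultTimeLocale fmt

timeToByteStr :: UTCTime -> B.ByteString
timeToByteStr = B.pack . formatTime'

parseBStrTime :: B.ByteString -> Maybe UTCTime
parseBStrTime = parseTime' . B.unpack
```

A round trip `parseBStrTime . timeToByteStr` then gives back `Just` the original time, provided the time has whole-second precision (the `%S` field drops fractional seconds).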
If you're not really in the need of heavily optimized code, the usual and easiest way is to use function composition:
timeToByte = toBytes . formatTime
byteToTime = parseTime . fromBytes
or something like that, as I'm not familiar with those libraries.
If, after profiling, you find that this approach is still too slow, I guess you will have to write something by hand.
In my app I'm doing a lot of conversions from Text to various datatypes, often just to Text itself, but sometimes to other datatypes.
I also rarely do conversions from other string types, e.g. String and ByteString.
Interestingly, Readable.fromText does the job for me, at least for Integer and Text. However I also now need UTCTime, which Readable.fromText doesn't have an instance for (but which I could write myself).
I was thinking of Readable.fromText as the Text analogue of Text.Read.readEither for [Char]. However, I've realised that Readable.fromText is actually subtly different: readEither for strings isn't just pure; it expects the input string to be quoted. That isn't the case for reading integers, which don't expect quotes.
I understand that this is because show shows strings with quotes, so for read to be consistent it needs to require quotes.
However this is not the behaviour I want. I'm looking for a typeclass where reading strings to strings is basically the id function.
Readable seems to do this, but it's misleadingly named, as its behaviour is not entirely analogous to read on [Char]. Is there another typeclass with this behaviour? Or am I best off just extending Readable, perhaps with newtypes, or alternatively via PRs?
The what
Just use Data.Text and Data.Text.Read directly
With signed decimal or just decimal you get a simple yet expressive, minimalistic parser function. It's directly usable:
type Reader a = Text -> Either String (a, Text)
decimal :: Integral a => Reader a
signed :: Num a => Reader a -> Reader a
Or you cook up your own runReader :: Reader a -> M a combinator for some M to possibly handle non-empty leftover and deal with the Left case.
For turning a String -> Text, all you have to do is use pack
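Here is a sketch of one possible runReader along those lines, which succeeds only when the parser consumes the whole input (the name and the leftover policy are just one choice):

```haskell
import qualified Data.Text as T
import qualified Data.Text.Read as TR

-- Run a Data.Text.Read parser, rejecting unconsumed input.
runReader :: TR.Reader a -> T.Text -> Either String a
runReader r t =
  case r t of
    Right (a, rest)
      | T.null rest -> Right a
      | otherwise   -> Left ("unconsumed input: " ++ T.unpack rest)
    Left err -> Left err
```

For example, `runReader (TR.signed TR.decimal)` accepts "-42" but rejects "12x" because of the leftover "x".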
The why
Disclaimer: The matter of parsing data the right way is answered differently depending on who you ask.
I belong to the school that believes typeclasses are a poor fit for parsing mainly for two reasons.
Typeclasses limit you to one instance per type
You can easily have two different time formats in the data. Now you might tell yourself that you only have one use case, but what if you depend on another library that itself, or transitively, introduces another instance Readable UTCTime? Now you have to use newtypes for no reason other than to be able to select a particular implementation, which is not nice!
Code transparency
You cannot infer what parser behavior you get from a type name alone. And Haddock instance documentation often does not exist, because the behavior is assumed to be obvious.
Consider for example: What will instance Readable Int64 do?
Will it assume an ASCII encoded numeric representation? Or some binary representation?
If binary, which endianness is going to be assumed?
What representation of signedness is expected? In the ASCII case, perhaps a minus sign? Or maybe a space? If binary, is it ones' complement? Two's complement?
How will it handle overflow?
Code transparency on call-sites
But the lack of transparency extends to call sites as well. Consider the following example:
do fieldA <- fromText
   fieldB <- fromText
   fieldC <- fromText
   pure T{..}
What exactly does this do? Which parsers will be invoked? You have to know the types of fieldA, fieldB and fieldC to answer that. In simple code that might seem obvious, but you can easily forget when you look at the same code two weeks from now. Or you have more elaborate code, where the types involved are inferred non-locally. It becomes hard to follow which instance will end up being selected (and the instance can make a huge difference, especially if you start newtyping for different formats; say, you cannot infer anything from a field name fooTimestamp, because it might be UnixTime or perhaps UTCTime).
And much worse: if you refactor and alter one of the field types in the data declaration - say a time field from Word64 to UTCTime - this might silently and unexpectedly switch to a different parser, leading to a bug. Yuk!
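One way around both problems, sketched here with hypothetical names, is to use plain, explicitly named parser functions instead of a typeclass, so the expected format is visible at every call site and cannot silently change with a type:

```haskell
import qualified Data.Text as T
import qualified Data.Text.Read as TR

data Rec = Rec { recId :: Int, recCount :: Int } deriving (Eq, Show)

-- The function's name documents the format; no instance resolution involved.
asciiDecimal :: T.Text -> Either String Int
asciiDecimal t =
  case TR.signed TR.decimal t of
    Right (n, rest) | T.null rest -> Right n
    _                             -> Left ("not an ASCII decimal: " ++ T.unpack t)

-- Each field names its parser explicitly.
parseRec :: T.Text -> T.Text -> Either String Rec
parseRec a b = Rec <$> asciiDecimal a <*> asciiDecimal b
```

If a field later changes from Word64 to UTCTime, this code stops compiling instead of quietly picking a different instance.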
On the topic of Show/Read
By the way, the reason why show/read behave the way they do for Prelude instances and deriving-generated instances can be found in the Haskell 2010 Report.
On the topic of show it says
The result of show is a syntactically correct Haskell expression
containing only constants [...]
And equivalently for read
The result of show is readable by read if all component types are readable.
(This is true for all instances defined in the Prelude but may not be true
for user-defined instances.) [...]
So show applied to a string foo produces "foo", because that is the syntactically valid Haskell literal representing the string value of foo, and read will read that back, acting as a kind of eval.
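The asymmetry the question observed falls out directly from that rule, as this small demonstration shows:

```haskell
-- show produces a syntactically valid Haskell literal, so strings come
-- back quoted, while numbers do not.
quotedFoo :: String
quotedFoo = show "foo"         -- five characters: "foo" including the quotes

roundTripped :: String
roundTripped = read quotedFoo  -- read strips the quotes again

plainInt :: Int
plainInt = read "42"           -- numbers are read without quotes
```

So `read "foo" :: String` fails (no quotes in the input), while `read "42" :: Int` succeeds, which is exactly the behavior the question found surprising in readEither.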
I need to encode some data to JSON and then push it to the syslog using hsyslog. The types of the two relevant functions are:
Aeson.encode :: a -> Data.ByteString.Lazy.ByteString
System.Posix.Syslog.syslog :: Maybe Facility
-> Priority
-> CStringLen
-> IO ()
What's the most efficient way (speed & memory) to convert a Lazy.ByteString to a CStringLen? I found Data.ByteString.Unsafe, but it only works with strict ByteString, not Lazy.ByteString.
Shall I just stick in unsafeUseAsCStringLen . Data.String.Conv.toS and call it a day? Will it do the right thing wrt efficiency?
I guess I would use Data.ByteString.Lazy.toStrict in place of toS, to avoid the additional package dependency.
Anyway, you won't find anything more efficient than:
unsafeUseAsCStringLen (toStrict lbs) $ \cstrlen -> ...
In general, toStrict is an "expensive" operation, because a lazy ByteString will generally be made up of a bunch of "chunks" each consisting of a strict ByteString and not necessarily yet loaded into memory. The toStrict function must force all the strict ByteString chunks into memory and ensure that they are copied into a single, contiguous block as required for a strict ByteString before the no-copy unsafeUseAsCStringLen is applied.
However, toStrict handles a lazy ByteString that consists of a single chunk optimally without any copying.
In practice, aeson uses an efficient Data.ByteString.Builder to create the JSON, and if the JSON is reasonably small (less than 4k, I think), it will build a single-chunk lazy ByteString. In this case, toStrict is zero-copy, and unsafeUseAsCStringLen is zero copy, and the entire operation is basically free.
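The whole pipeline can be wrapped once in a helper (the name withLazyCStringLen is just illustrative):

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Foreign.C.String (CStringLen)

-- Force the lazy chunks into one contiguous strict ByteString, then hand
-- the underlying buffer to the callback without a further copy.
withLazyCStringLen :: BL.ByteString -> (CStringLen -> IO a) -> IO a
withLazyCStringLen lbs = unsafeUseAsCStringLen (BL.toStrict lbs)
```

The syslog call then becomes something like `withLazyCStringLen (Aeson.encode x) (syslog Nothing priority)`, with whatever Priority value you need.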
But note that, in your application, where you are passing the string to the syslogger, fretting about the efficiency of this operation is crazy. My guess would be that you'd need thousands of copy operations to even make a dent in the performance of the overall action.
Suppose I wish to write something like this:
-- | Decode a 'ByteString' containing Code Page 437 encoded text.
decodeCP437 :: ByteString -> Text
decodeCP437 = undefined
(I know about the encoding package, but its dependency list is a ridiculous price to pay for this single, and I believe quite trivial, function.)
My question is how to construct Text from ByteString with reasonable efficiency, in particular without using lists. It seems to me that Data.Text.Encoding should be a good source for inspiration, but at first sight it uses withForeignPtr and I guess it's too low level for my use case.
How should the problem be approached? In a nutshell, I guess I need to repeatedly take bytes (Word8) from the ByteString, translate every byte to its corresponding Char, and somehow efficiently build a Text from them. The complexity of the basic building functions in Data.Text unsurprisingly indicates that appending characters one by one is not the best idea, but I don't see better tools available.
Update: I want to create strict Text. It seems that the only option is to create builder then get lazy Text from it (O(n)) and then convert to strict Text (O(n)).
You can use the Builder API, which offers O(1) singleton :: Char -> Builder and O(1) (<>) :: Builder -> Builder -> Builder for efficient construction operations.
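A sketch of decodeCP437 along those lines follows. Note the byte-to-Char translation here is only a placeholder (an identity mapping, correct for the ASCII range); a real CP437 decoder needs a 256-entry lookup table for the accented and box-drawing glyphs:

```haskell
import Data.Word (Word8)
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB

-- Placeholder: identity mapping, correct only for bytes 0-127.
cp437ToChar :: Word8 -> Char
cp437ToChar = toEnum . fromIntegral

-- Fold over the bytes, accumulating a Builder, then materialize
-- a strict Text at the end (lazy Text -> strict Text is O(n)).
decodeCP437 :: BS.ByteString -> T.Text
decodeCP437 =
  TL.toStrict
    . TB.toLazyText
    . BS.foldr (\w b -> TB.singleton (cp437ToChar w) <> b) mempty
```

The Builder amortizes the appends, so this avoids the quadratic cost of repeated strict-Text concatenation, at the price of the two O(n) conversions the question's update already anticipated.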
I need to make extensive use of:
slice :: Int -> Int -> ByteString -> ByteString
slice start len = take len . drop start
Two part question:
Does this already have a name? I can't find anything searching for that type on Hoogle, but it seems like it should be a really common need. I also tried searching for (Int, Int) -> ByteString -> ByteString and some flip'd versions of same. I also tried looking for [a] versions to see if there was a name in common use.
Is there a better way to write it?
I'm suspicious that I'm doing something wrong because I strongly expected to find lots of people having gone down the same road, but my google-fu isn't finding anything.
The idiomatic way is via take and drop, which both have O(1) complexity on strict bytestrings.
slice is not provided, to discourage the reliance on unsafe indexing operations.
According to the documentation there is no such function. Currently, strict ByteStrings are represented as a pointer to the beginning of pinned memory, an offset, and a length. So, indeed, your implementation is the right way to slice. However, you should be careful with slices, because a sliced ByteString retains the same amount of space as the original ByteString. To avoid this you may want to copy the sliced ByteString, but that is not always necessary.
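Both variants side by side, with the copy (via Data.ByteString.copy) as the opt-in escape hatch for the retention issue described above:

```haskell
import qualified Data.ByteString as BS

-- O(1): shares the original buffer, so the whole original stays live.
slice :: Int -> Int -> BS.ByteString -> BS.ByteString
slice start len = BS.take len . BS.drop start

-- O(len): copies, so a small slice of a huge string lets the
-- original buffer be garbage-collected.
sliceCopy :: Int -> Int -> BS.ByteString -> BS.ByteString
sliceCopy start len = BS.copy . slice start len
```

Use sliceCopy only when you keep a small slice long-term while discarding a large original; otherwise the extra copy is wasted work.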
The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory than a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
Below I have tried to condense a particular problematic case into a small example:
import qualified Data.ByteString.Lazy.Char8 as S
import qualified Data.ByteString as Strict
import qualified Data.Map as Map
import qualified Data.Vector.Unboxed as U
import qualified Data.Serialize as Serialize
import Control.Monad.State
main = putStr
     . unlines . map show . flip evalState (0, Map.empty)
     . mapM toInt
     . S.words
   =<< S.getContents

toInt x = do
  let x' =
        U.fromList . Strict.unpack . -- Comment this line to increase memory usage
        Serialize.encode $ x
  (i, t) <- get
  case Map.lookup x' t of
    Just j  -> return j
    Nothing -> do
      let i' = i + (1 :: Int)
      put (i', Map.insert x' i t)
      return i
When I run this on a file containing around 400,000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB.
In the absence of other answers, I'm going to go out on a limb here.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
If the small strings are meant to be human readable (e.g. an English word) then use Text. If they are meant to be read only by the computer, use ByteString. The decision to use strict or lazy variants of these depends on how you build and use these small strings.
You shouldn't need to use your own unboxed Vectors of Word8. If you're experiencing a specific situation where regular String is faster than Text or ByteString, then throw the details up on StackOverflow and we'll try to figure out why. If you perform detailed analysis and can prove that an unboxed Vector of Word8 consistently works significantly better than Text or ByteString, then start conversations on mailing lists, irc, reddit, etc; the standard libraries are not set in stone, and improvements are always welcome.
But I think it highly likely that you are just doing something weird, as hammar and shang suggest.
P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests.
A (strict) ByteString is a constructor over an unboxed ForeignPtr to a Word8 and two unboxed Ints.
A ForeignPtr is another constructor over an Addr# (a GHC prim) and a ForeignPtrContents:
data ForeignPtrContents
= PlainForeignPtr !(IORef (Finalizers, [IO ()]))
| MallocPtr (MutableByteArray# RealWorld) !(IORef (Finalizers, [IO ()]))
| PlainPtr (MutableByteArray# RealWorld)
...
For short strings, ByteStrings simply carry too much administrative overhead for their contiguous representation of the actual "string" data to pay off.
For the original question - I'd check the average word length of your corpus, but I can't see ByteString being more efficient than String aka [Char], which uses 12 bytes per Char (source: the original ByteString paper).
A general plea to Haskellers (not aimed at the poster of the original question) - please stop bashing String aka [Char]; having both String and Text (and ByteString when you really need bytes) makes sense. Or use Clean, where the contiguous String representation is better suited to short strings.
Caveat - I may have been looking at an old version of the ByteString internals with regards to what data types it uses internally.
I know this is a 6-year-old post, but I was wondering the same recently, and found this useful blog post: https://markkarpov.com/post/short-bs-and-text.html. It seems that yes, this is a recognized issue, and ShortText / ShortByteString are the solution.
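For the ByteString case this needs no extra dependency: Data.ByteString.Short ships with the bytestring package. A minimal sketch of using it for small map keys (the helper names are just illustrative):

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Short as SBS

-- ShortByteString is a plain unpinned byte array: no ForeignPtr and no
-- offset/length slicing fields, so small keys carry far less overhead.
toKey :: String -> SBS.ShortByteString
toKey = SBS.toShort . B.pack

fromKey :: SBS.ShortByteString -> String
fromKey = B.unpack . SBS.fromShort
```

ShortByteString has Eq and Ord instances, so values built with toKey can be used directly as Data.Map keys; the trade-off is that toShort/fromShort each copy, so convert once at the boundary rather than per lookup.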