Idiomatic way to take a substring of a ByteString - haskell

I need to make extensive use of:
slice :: Int -> Int -> ByteString -> ByteString
slice start len = take len . drop start
Two part question:
Does this already have a name? I can't find anything searching for that type on Hoogle, but it seems like it should be a really common need. I also tried searching for (Int, Int) -> ByteString -> ByteString and some flip'd versions of same. I also tried looking for [a] versions to see if there was a name in common use.
Is there a better way to write it?
I'm suspicious that I'm doing something wrong because I strongly expected to find lots of people having gone down the same road, but my google-fu isn't finding anything.

The idiomatic way is via take and drop, which has O(1) complexity on strict bytestrings.
slice is not provided, to discourage the reliance on unsafe indexing operations.

According to the documentation, there is no such function. Currently, strict ByteStrings are represented as a pointer to the beginning of pinned memory, an offset, and a length. So your implementation is indeed a good way to write slice. However, be careful with slices: a sliced bytestring shares the original buffer, so it keeps the whole original bytestring alive in memory. To avoid this you can copy the slice, but this is not always necessary.
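The trade-off between sharing and copying can be sketched as follows. This is a minimal illustration rather than a library API: sliceCopy is a hypothetical name, and it relies only on take, drop and copy from Data.ByteString.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- | O(1): the result shares the underlying buffer with the input,
-- so the whole original buffer stays alive as long as the slice does.
slice :: Int -> Int -> ByteString -> ByteString
slice start len = B.take len . B.drop start

-- | O(len): copies the slice into a fresh buffer so that the original
-- can be garbage collected. (Hypothetical helper name.)
sliceCopy :: Int -> Int -> ByteString -> ByteString
sliceCopy start len = B.copy . slice start len
```

Copying is mainly worthwhile when a small slice of a large buffer is kept around for a long time; for short-lived slices the shared version is usually the cheaper choice.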

Related

What type to give to a function that measures a data structure? Int, Integer, Integral?

I have a data structure, such as an expression tree or graph. I want to add some "measure" functions, such as depth and size.
How best to type these functions?
I see the following three variants as of roughly equal usefulness:
depth :: Expr -> Int
depth :: Expr -> Integer
depth :: Num a => Expr -> a
I have the following considerations in mind:
I'm looking at base and fgl as examples, and they consistently use Int, but Data.List also has functions such as genericLength that are polymorphic in return type, and I am thinking that maybe the addition of these generic functions is a reflection of a modernizing trend that I probably should respect and reinforce.
A similar movement of thought is noticeable in some widely used libraries providing a comprehensive set of functions with the same functionality when there are several probable choices of a return type to be desired by the user (e.g. xml-conduit offers parsers that accept both lazy and strict kinds of either ByteString or Text).
Integer is a nicer type overall than Int, and I sometimes find that I need to convert the length of a list to an Integer, say because an algorithm that operates on Integer needs to take this length into account.
Making functions return Integral means these functions are made polymorphic, and that may carry a performance penalty. I don't know all the particulars well, but, as I understand it, there may be some run-time cost, and polymorphic things are harder to memoize.
What is the accepted best practice? Which part of it is due to legacy and compatibility considerations? (I.e. if Data.List was designed today, what type would functions such as length have?) Did I miss any pros and cons?
Short answer: As a general rule use Int, and if you need to convert it to something else, use fromIntegral. (If you find yourself doing the conversion a lot, define fi = fromIntegral to save typing or else create your own wrapper.)
The main consideration is performance. You want to write the algorithm so it uses an efficient integer type internally. Provided Int is big enough for whatever calculation you're doing (the standard guarantees a signed 30-bit integer, but even on 32-bit platforms using GHC, it's a signed 32-bit integer), you can assume it will be a high-speed integer type on the platform, particularly in comparison to Integer (which has boxing and bignum calculation overhead that can't be optimized away). Note that the performance differences can be substantial. Simple counting algorithms will often be 5-10x faster using Ints compared to Integers.
While you could give your function a different signature:
depth :: Expr -> Integer
depth :: (Num a) => Expr -> a
but actually implement it under the hood using the efficient Int type and do the conversion at the end, making the conversion implicit strikes me as poor practice. Particularly if this is a library function, making it clear that Int is being used internally by making it part of the signature strikes me as more sensible.
With respect to your listed considerations:
First, the generic* functions in Data.List aren't modern. In particular, genericLength was available in GHC 0.29, released July, 1996. At some prior point, length had been defined in terms of genericLength, as simply:
length :: [a] -> Int
length = genericLength
but in GHC 0.29, this definition was commented out with an #ifdef USE_REPORT_PRELUDE, and several hand-optimized variants of length were defined independently. The other generic* functions weren't in 0.29, but they were already around by GHC 4.02 (1998).
Most importantly, when the Prelude version of length was generalized from lists to Foldables, which is a fairly recent development (since GHC 7.10?), nobody cared enough to do anything with genericLength. I also don't think I've ever seen these functions used "in the wild" in any serious Haskell code. For the most part, you can think of them as deprecated.
Second, the use of lazy/strict and ByteString/Text variants in libraries represents a somewhat different situation. In particular, an xml-conduit user will normally be making the decision between lazy and strict variants and between ByteString and Text types based on considerations about the data being processed and the construction of the algorithms that are far-reaching and pervade the entire type system of a given program. If the only way to use xml-conduit with a lazy Text type was to convert it piecemeal to strict ByteStrings, pass it to the library, and then pull it back out and convert it back to a lazy Text type, no one would accept that complexity. In contrast, a monomorphic Int-based definition of depth works fine, because all you need is fromIntegral . depth to adapt it to any numeric context.
Third, as noted above, Integer is only a "nicer" type from the standpoint of having arbitrary precision in situations where you don't care about performance. For things like depth and count in any practical setting, performance is likely to be more important than unlimited precision.
Fourth, I don't think either the runtime cost or failure-to-memoize should be serious considerations in choosing between polymorphic or non-polymorphic versions here. In most situations, GHC will generate a specialized version of the polymorphic function in a context where memoization is no problem.
On this basis, I suspect if Data.List was designed today, it would still use Ints.
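As a rough sketch of the recommendation above, here is an Int-returning depth with the conversion done at the call site. The Expr type and averageDepth are hypothetical, invented only for illustration.

```haskell
-- Hypothetical expression type, for illustration only.
data Expr
  = Lit Int
  | Neg Expr
  | Add Expr Expr

-- | Monomorphic measure: the signature states that Int is used internally.
depth :: Expr -> Int
depth (Lit _)   = 1
depth (Neg e)   = 1 + depth e
depth (Add l r) = 1 + max (depth l) (depth r)

-- | A caller that needs another numeric type converts explicitly
-- with fromIntegral. (Yields NaN for an empty list.)
averageDepth :: [Expr] -> Double
averageDepth es =
  fromIntegral (sum (map depth es)) / fromIntegral (length es)
```

The conversion is visible where it happens, so a reader of averageDepth can see that the count is an Int before it becomes a Double.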
I agree with all the points in K. A. Buhr's great answer, but here are a couple more:
You should use a return type of Integer if you expect to support an expression tree that somehow doesn't fit in memory (which seems interesting, but unlikely). If I saw Expr -> Integer I would go looking in the code or docs to try to understand how or why the codomain might be so large.
Re. performance of Integer: normal machine-word arithmetic will be used if the number is not larger than the max width of a machine word. Simplifying, the type is basically:
data Integer = SmallInteger Int | LargeInteger ByteArray
K. A. Buhr mentions that there is an unavoidable performance penalty which is that this value cannot be unboxed (that is it will always have a heap representation, and will be read from and written to memory), and that does sound right to me.
In contrast, functions on Int (or Word) are often unboxed, so that in core you will see types that look like Int# -> Int# ->. You can think of an Int# as only existing in a machine register. This is what you want your numeric code to look like if you care about performance.
Re. polymorphic versions: designing libraries around concrete numeric inputs and polymorphic numeric outputs probably works okay, in terms of convenient type inference. We already have this to a certain degree, in that numeric literals are overloaded. There are certainly times when literals (e.g. also string literals when -XOverloadedStrings) need to be given type signatures, and so I'd expect that if base were designed to be more polymorphic that you would run into more occasions where this would be required (but fewer uses of fromIntegral).
Another option you haven't mentioned is using Word to express that the depth is non-negative. This is more useful for inputs, but even then is often not worth it: Word will still overflow, and negative literals are valid (although GHC will issue a warning); to a certain extent it's just moving where the bug occurs.
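The point about Word merely moving the bug can be seen in a small sketch. The names wrapped and fromNegative are made up for this example.

```haskell
-- Word arithmetic wraps around instead of failing:
-- subtracting below zero lands on the largest representable value.
wrapped :: Word
wrapped = 0 - 1

-- Converting a negative Int wraps in the same way, so a "negative depth"
-- silently becomes a huge positive one.
fromNegative :: Word
fromNegative = fromIntegral (-1 :: Int)
```

Both values equal maxBound :: Word, so the type rules out negative numbers without ruling out the mistake that produced them.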

How to craft a type matching "a list with a single element of type Int"?

I am currently writing my very first program in Haskell.
In the specification I am working with, [0] 5 is used to define a MAC key that could be written "\x00\x00\x00\x00\x00"::ByteString.
I somewhat fancy the idea of reusing that notation (even though it makes very little sense from a programming perspective). Eventually writing mackey so that mackey [0] 5 does the right thing was simple enough.
The only question that remains is how to define my input type so that it enforces the use of a list with a single integer element. Is that even possible?
NB: normally, I wouldn't bother too much about that. I shouldn't even use a list in such case: a simple Int would be enough and "enforce" everything I need; so I know that the correct way is to use a simple integer. But this is a very good way to explore what can be done (or not) with Haskell type system. :)
As you've observed yourself, a single Int does exactly what's needed and is probably the way to go. Don't use a list if you don't want a list!
That said, using a plain Int may not be the best thing either. Perhaps you want to make the meaning of each argument clear. For that purpose you might wrap Int in a newtype and name it accordingly:
newtype KeyWord = KeyWord Int
macKey :: KeyWord -> Int -> MAC
In this case the syntax at the call site would then be macKey (KeyWord 0) 5.
It would be possible to shorten that a bit more, but it's probably not worth it. In fact, even the newtype is probably overkill – the main benefit is that the type signature becomes more explicit, but for calling the function this is mostly boilerplate. A simple type-alias is probably enough:
type KeyWord = Int
and then you can again write macKey 0 5 while retaining the clear signature.
If you need to write out lots of those keys in a concise manner, you might consider making macKey an infix operator:
infix 7 #*
(#*) :: KeyWord -> Int -> MAC
and then write 0#*5.
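Putting the type alias and the operator together might look like the sketch below. The question never defines MAC, so representing it as a strict ByteString is an assumption, as is the body of macKey.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

type KeyWord = Int

-- Assumption: a MAC key is just a strict ByteString.
type MAC = ByteString

-- | Repeat the given byte value the given number of times.
-- fromIntegral truncates values outside 0..255 to their low 8 bits.
macKey :: KeyWord -> Int -> MAC
macKey byte count = B.replicate count (fromIntegral byte)

infix 7 #*
(#*) :: KeyWord -> Int -> MAC
(#*) = macKey
```

With this in scope, macKey 0 5 and 0 #* 5 both produce five zero bytes.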

How to approach writing of custom decoding function from `ByteString` to `Text`

Suppose I wish to write something like this:
-- | Decode a 'ByteString' containing Code Page 437 encoded text.
decodeCP437 :: ByteString -> Text
decodeCP437 = undefined
(I know about the encoding package, but its dependency list is a ridiculous price to pay for this single and, I believe, quite trivial function.)
My question is how to construct Text from ByteString with reasonable efficiency, in particular without using lists. It seems to me that Data.Text.Encoding should be a good source for inspiration, but at first sight it uses withForeignPtr and I guess it's too low level for my use case.
How should the problem be approached? In a nutshell, I guess I need to continuously take bytes (Word8) from the ByteString, translate every byte to the corresponding Char, and somehow efficiently build a Text from them. The complexity of the basic building functions in Data.Text, not surprisingly, indicates that appending characters one by one is not the best idea, but I don't see better tools for this available.
Update: I want to create strict Text. It seems that the only option is to create builder then get lazy Text from it (O(n)) and then convert to strict Text (O(n)).
You can use the Builder API, which offers O(1) singleton :: Char -> Builder and O(1) (<>) :: Builder -> Builder -> Builder for efficient construction operations.
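A minimal sketch of that approach, assuming the Builder in question is Data.Text.Lazy.Builder. The cp437ToChar function is a placeholder of my own: it maps ASCII bytes to themselves and everything else to U+FFFD, and a real decoder would replace it with the full Code Page 437 table.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)
import Data.Char (chr)
import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB
import Data.Word (Word8)

-- | Placeholder translation (NOT the real CP437 table): ASCII bytes map
-- to themselves, all other bytes become the replacement character.
cp437ToChar :: Word8 -> Char
cp437ToChar w
  | w < 0x80  = chr (fromIntegral w)
  | otherwise = '\xFFFD'

-- | Fold over the bytes, emitting one Builder singleton per byte,
-- then materialise the result as a strict Text.
decodeCP437 :: ByteString -> Text
decodeCP437 = TL.toStrict . TB.toLazyText . B.foldr step mempty
  where
    step w acc = TB.singleton (cp437ToChar w) <> acc
```

This does go through a lazy Text on the way to the strict one, as the update in the question anticipates; whether that extra pass matters is something to measure rather than assume.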

Match a lot of patterns in Haskell efficiently

I have thought of using Haskell for a game server, but when coding, I found myself looking at the part where I parse packets and thinking "wow, this will result in a lot of pattern matching", since there are many cases to match (walk there, attack that, loot that, open that, and so on).
What I do is:
Receive a packet
Parse the packet header into a hexadecimal String (say "02B5" for example)
Get rest of data from the packet
Match header in parseIO
Call the appropriate function with the packet content
It would be easy to map String -> method, but the methods take different numbers of parameters.
I thought of the simple two ways of pattern matching shown below.
#1
packetIO :: String -> IO ()
packetIO packet =
case packet of
"02B5" -> function1
"ADD5" -> function2
... and so on
#2
packetIO :: String -> IO ()
packetIO "02B5" = function1
packetIO "ADD5" = function2
... and so on
Both looking at performance and coding style, is there a way to better handle the packets received from the client?
If you have any resources or links I failed to find, please do point me in their direction!
EDIT 130521:
Seems like both alternatives, listed below, are good choices. Just waiting to see answers to my questions in the comments before choosing which was the best solution for me.
Storing (ByteString -> Function) in a Map structure. O(log n)
Converting ByteString to Word16 and pattern match. O(log n) through tree or O(1) through lookup tables
EDIT 130521:
Decided to go for pattern matching with Word16 as Philip JF said.
Both are great alternatives. My guess is that both are equally fast, and Map might even be faster since I wouldn't have to convert to Word16, but the other option gave more readable code for my use:
packetIO 0x02B5 = function1
packetIO 0xADD5 = function2
etc
Why not parse to numbers (Word16 in Data.Word?) and then do the matching with that, instead of using strings? Haskell supports hex literals...
Both of your functions are equivalent. The compiler desugars the second one to the first one. Pattern matching is syntactic sugar for case.
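case is well suited to this kind of thing. On numeric scrutinees such as Word16, GHC can compile it to a jump table or a binary search, which is O(1) or O(log n); String literal patterns, as far as I know, are instead tested one at a time with equality checks. Either way, the two solutions you listed perform identically.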
As far as coding style goes, both styles are perfectly idiomatic. I personally prefer case over pattern matching, but I know a lot of other people prefer pattern matching for top-level functions.
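The Word16 suggestion could be sketched like this. Everything beyond the two header values from the question is assumed: the big-endian byte order, the Action type, and the handler names are all hypothetical.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)
import Data.Bits (shiftL, (.|.))
import Data.Word (Word16)

-- | Split off a 16-bit header (assumed big-endian) from the payload.
-- Returns Nothing if fewer than two bytes are available.
readHeader :: ByteString -> Maybe (Word16, ByteString)
readHeader bs
  | B.length bs < 2 = Nothing
  | otherwise =
      let hi = fromIntegral (B.index bs 0) :: Word16
          lo = fromIntegral (B.index bs 1) :: Word16
      in Just ((hi `shiftL` 8) .|. lo, B.drop 2 bs)

-- Hypothetical result type standing in for the real handlers.
data Action
  = Walk ByteString
  | Attack ByteString
  | Unknown Word16
  deriving (Eq, Show)

-- | Dispatch on the numeric header using hex literal patterns.
dispatch :: Word16 -> ByteString -> Action
dispatch 0x02B5 payload = Walk payload
dispatch 0xADD5 payload = Attack payload
dispatch other  _       = Unknown other
```

Because every clause receives the remaining payload, each handler can parse its own parameters, which sidesteps the problem of handlers taking different numbers of arguments.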

Converting Data.Time.UTCTime to / from ByteString

Let's say I need to write/read a Data.Time.UTCTime in "%Y-%m-%d %H:%M:%S" format many, many times to/from a file.
It seems to me that, using Data.Time.formatTime or Data.Time.parseTime to convert UTCTime to/from String and then packing/unpacking the String to/from ByteString, would be too slow since it involves an intermediate String. But writing a ByteString builder/parser of UTCTime by hand seems like repeating a lot of work already done in formatTime and parseTime.
I guess my question is: Is there a systematic way to get functions of type t -> String or String -> t converted to t -> ByteString or ByteString -> t with increased efficiency without repeating a lot of work?
I am totally a Haskell newbie, so please forgive me if the question is just stupid.
No, there isn't a general way to convert a function of type t -> String to one of type t -> ByteString. It might help you reconcile yourself to that reality if you recall that a ByteString isn't just a faster String; it's lower-level than that. A ByteString is a sequence of bytes; it doesn't mean anything more unless you have an encoding in mind.
So your options are:
Use function composition, as in phg's answer:
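import qualified Data.ByteString.Char8 as B
import Data.ByteString.Char8 (ByteString)
import Data.Time (UTCTime)
-- formatTime' and parseTime' stand for formatTime and parseTime with the
-- format string and TimeLocale already applied.
timeToByteStr :: UTCTime -> ByteString
timeToByteStr = B.pack . formatTime'
parseBStrTime :: ByteString -> Maybe UTCTime
parseBStrTime = parseTime' . B.unpack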
where I've corrected the function names. I've also used, e.g., parseTime' instead of parseTime, to elide the format string and TimeLocale you need to pass in.
(It's also worth noting that Data.Text is often a better choice than Data.ByteString.Char8. The latter only works properly if your Chars fall in Unicode code points 0-255.)
If performance is an issue and this conversion profiles out as a bottleneck, write a parser/builder.
Just use Strings.
The last option is an underrated choice for Haskell newbies, IMHO. String is suboptimal, performance-wise, but it's not spawn-of-the-Devil. If you're just getting started with Haskell, why not keep your life as simple as possible, until and unless your code becomes too slow?
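For reference, the function-composition option can be written out with the format string from the question filled in. This is a sketch that assumes the time package at version 1.5 or later, where parseTimeM and defaultTimeLocale are exported from Data.Time; timeFormat is a name I introduced.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.ByteString.Char8 (ByteString)
import Data.Time (UTCTime, defaultTimeLocale, formatTime, parseTimeM)

-- | The format from the question, shared by both directions.
timeFormat :: String
timeFormat = "%Y-%m-%d %H:%M:%S"

-- | Format via String, then pack. Char8 packing is safe here because
-- the formatted output is plain ASCII.
timeToByteStr :: UTCTime -> ByteString
timeToByteStr = B.pack . formatTime defaultTimeLocale timeFormat

-- | Unpack to String, then parse. The True flag lets parseTimeM
-- accept surrounding whitespace.
parseBStrTime :: ByteString -> Maybe UTCTime
parseBStrTime = parseTimeM True defaultTimeLocale timeFormat . B.unpack
```

This still goes through an intermediate String, so it is the simple version to profile first, not the optimized one.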
If you're not really in need of heavily optimized code, the usual and easiest way is to use function composition:
timeToByte = toBytes . formatTime
byteToTime = parseTime . fromBytes
or something like that, as I'm not familiar with those libraries.
If, after profiling, you find that this approach is still too slow, I guess you will have to write something by hand.
