Split ByteString on a ByteString (instead of a Word8 or Char) - string

I know I already have the Haskell Data.ByteString.Lazy function to split a CSV on a single character, such as:
split :: Word8 -> ByteString -> [ByteString]
But I want to split on a multi-character ByteString (like splitting on a String instead of a Char):
split :: ByteString -> ByteString -> [ByteString]
I have multi-character separators in a csv-like text file that I need to parse, and the individual characters themselves appear in some of the fields, so choosing just one separator character and discarding the others would contaminate the data import.
I've had some ideas on how to do this, but they seem kind of hacky (e.g. take three Word8s, test if they're the separator combination, start a new field if they are, recurse further), and I imagine I would be reinventing a wheel anyway. Is there a way to do this without rebuilding the function from scratch?

The documentation for breakSubstring in Data.ByteString contains a function that does what you are asking for:
tokenise x y = h : if null t then [] else tokenise x (drop (length x) t)
    where (h,t) = breakSubstring x y
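A self-contained version of the tokenise idea above might look like the following sketch. The name splitOn and the guard for an empty separator are additions for illustration, not part of the bytestring documentation. As far as I know, breakSubstring is exported by the strict Data.ByteString module, so lazy input would first need converting (for example with toStrict), which is an assumption about the asker's setup.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical helper: split a strict ByteString on a multi-byte separator.
-- An empty separator would never consume input, so it is handled up front.
splitOn :: B.ByteString -> B.ByteString -> [B.ByteString]
splitOn sep s
  | B.null sep = [s]
  | otherwise  = go s
  where
    go str = case B.breakSubstring sep str of
      (h, t)
        | B.null t  -> [h]                              -- separator not found
        | otherwise -> h : go (B.drop (B.length sep) t) -- skip the separator

-- Example field data in which single ':' characters occur inside fields.
exampleFields :: [B.ByteString]
exampleFields = splitOn (BC.pack "::") (BC.pack "a:b::c:d")
```

Only the full two-byte separator ends a field here, so a lone ':' inside a field is left untouched.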

The bytestring package has a function for splitting on a subsequence:
breakSubstring :: ByteString -> ByteString -> (ByteString,ByteString)
There is also
the bytestring-csv package, http://hackage.haskell.org/package/bytestring-csv
and the split package, http://hackage.haskell.org/package/split, although that one works on lists (Strings) rather than ByteStrings.

Related

Efficient attoparsec parser combinating general parsers and anyChar

This is similar to my previous question Attoparsec efficient parser for multiple char, but I oversimplified the example parser I provided. I really apologize if it is considered spamming.
If I define
charToText :: Char -> Text
charToText c = pack [c]
parseEqStarMonad :: Parser Text
-- I will not define it here, but it could be any Parser Text
envParser :: Parser Text
envParser = mconcat <$> many (parseEqStarMonad <|> (charToText <$> anyChar))
it seems to me that lifting charToText is inefficient, because for each character matched charToText creates a singleton list to pack it as a Text.
Is there a more efficient way to perform this parsing?
You can use singleton c instead of pack [c]. But beyond that, I don't see any obvious improvement.
This is fine if mconcat is used sparingly. If you need to append many Text values together, you should use a Builder instead.
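The Builder suggestion can be sketched as follows, using Data.Text.Lazy.Builder from the text package. To keep the sketch free of a parser dependency, the Piece type is a hypothetical stand-in for what the parser yields at each step (a whole Text from something like parseEqStarMonad, or a single Char from anyChar); it is not part of attoparsec.

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB

-- Hypothetical stand-in for one parse step: a matched chunk or one character.
data Piece = Chunk Text | Single Char

-- Turn each piece into a Builder without allocating a singleton list.
pieceBuilder :: Piece -> TB.Builder
pieceBuilder (Chunk t)  = TB.fromText t
pieceBuilder (Single c) = TB.singleton c

-- Concatenate all pieces through the Builder monoid, then materialise once.
assemble :: [Piece] -> Text
assemble = TL.toStrict . TB.toLazyText . foldMap pieceBuilder

exampleText :: Text
exampleText = assemble [Single 'a', Chunk (T.pack "bc"), Single 'd']
```

In a real parser the same shape would apply: have the alternatives return Builder values, combine them with mconcat, and convert to Text once at the end.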

Why do Data.Binary instances of bytestring add the length of the bytestring as prefix

Looking at the put instances of the various ByteString types we find that the length of the bytestring is always prefixed in the binary file before writing it. For example here - https://hackage.haskell.org/package/binary-0.8.8.0/docs/src/Data.Binary.Class.html#put
Taking an example
instance Binary B.ByteString where
    put bs = put (B.length bs) -- Why this??
             <> putByteString bs
    get    = get >>= getByteString
Is there any particular reason for doing this? And is the only way to write Bytestring without prefixing the length - creating our own newtype wrapper and having an instance for Binary?
"Is there any particular reason for doing this?"
The idea of get and put is that you can combine several objects. For example you can write:
write_func :: ByteString -> Char -> Put
write_func some_bytestring some_char = do
    put some_bytestring
    put some_char
You then want to define a function that can read the data back, and the two functions should round-trip: if the writer writes a certain ByteString and a certain Char, the reader should return that same ByteString and Char.
The reader function should look similar to:
read_fun :: Get (ByteString, Char)
read_fun = do
    bs <- get
    c <- get
    return (bs, c)
The problem is: where does the ByteString end? The byte that encodes a character such as 'A' could just as well be part of the ByteString. You thus need to indicate somehow where the ByteString ends. This can be done by saving the length, or by writing some marker at the end. In the case of a marker, you will need to "escape" the bytestring so that it cannot contain the marker itself.
Either way, you need some mechanism that specifies where the ByteString ends.
"And is the only way to write Bytestring without prefixing the length - creating our own newtype wrapper and having an instance for Binary?"
No, in fact it is already in the instance definition. If you want to write a ByteString without length, then you can use putByteString :: ByteString -> Put:
write_func :: ByteString -> Char -> Put
write_func some_bytestring some_char = do
    putByteString some_bytestring
    put some_char
but when reading the ByteString back, you will need to figure out how many bytes you have to read.
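The two variants can be compared in a small round-trip sketch. The function names here are made up for illustration. The raw reader takes the byte count as an argument, on the assumption that the caller knows it from somewhere else (a fixed record size, a header, and so on).

```haskell
import Data.Binary (get, put)
import Data.Binary.Get (Get, getByteString, runGet)
import Data.Binary.Put (Put, putByteString, runPut)
import qualified Data.ByteString.Char8 as BC

-- Length-prefixed: the Binary instance stores the length, so get can find the end.
writePrefixed :: BC.ByteString -> Char -> Put
writePrefixed bs c = put bs >> put c

readPrefixed :: Get (BC.ByteString, Char)
readPrefixed = (,) <$> get <*> get

-- Raw: no length is written, so the reader must be told how many bytes to take.
writeRaw :: BC.ByteString -> Char -> Put
writeRaw bs c = putByteString bs >> put c

readRaw :: Int -> Get (BC.ByteString, Char)
readRaw n = (,) <$> getByteString n <*> get

-- Both round trips use a payload that itself ends in 'A'.
roundTripPrefixed, roundTripRaw :: (BC.ByteString, Char)
roundTripPrefixed = runGet readPrefixed (runPut (writePrefixed (BC.pack "xA") 'A'))
roundTripRaw      = runGet (readRaw 2) (runPut (writeRaw (BC.pack "xA") 'A'))
```

If readRaw is given the wrong count, it will silently split the bytes in the wrong place, which is the ambiguity the length prefix exists to remove.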

How to generate strings drawn from every possible character?

At the moment I'm generating strings like this:
arbStr :: Gen String
arbStr = listOf $ elements (alpha ++ digits)
  where alpha = ['a'..'z']
        digits = ['0'..'9']
But obviously this only generates strings from alpha num chars. How can I do it to generate from all possible chars?
Char is an instance of both the Enum and Bounded typeclasses, so you can make use of the arbitraryBoundedEnum :: (Bounded a, Enum a) => Gen a function:
import Test.QuickCheck(Gen, arbitraryBoundedEnum, listOf)
arbStr :: Gen String
arbStr = listOf arbitraryBoundedEnum
For example:
Prelude Test.QuickCheck> sample arbStr
""
""
"\821749"
"\433465\930384\375110\256215\894544"
"\431263\866378\313505\1069229\238290\882442"
""
"\126116\518750\861881\340014\42369\89768\1017349\590547\331782\974313\582098"
"\426281"
"\799929\592960\724287\1032975\364929\721969\560296\994687\762805\1070924\537634\492995\1079045\1079821"
"\496024\32639\969438\322614\332989\512797\447233\655608\278184\590725\102710\925060\74864\854859\312624\1087010\12444\251595"
"\682370\1089979\391815"
Or you can make use of arbitrary from the Arbitrary instance for Char:
import Test.QuickCheck(Gen, arbitrary, listOf)
arbStr :: Gen String
arbStr = listOf arbitrary
Note that the arbitrary for Char is implemented such that ASCII characters are (three times) more common than non-ASCII characters, so the "distribution" is different.
Since Char is an instance of Bounded as well as Enum (confirm this by asking GHCI for :i Char), you can simply write
[minBound..maxBound] :: [Char]
to get a list of all legal characters. Obviously this will not lead to efficient random access, though! So you could instead convert the bounds to Int with Data.Char.ord :: Char -> Int, use QuickCheck's feature to select from a range of integers, and then map back to a character with Data.Char.chr :: Int -> Char.
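The ord/choose/chr approach just described could be sketched like this. The names arbChar and sampleChars are illustrative. The sketch assumes QuickCheck's choose picks from the inclusive range it is given.

```haskell
import Data.Char (chr, ord)
import Test.QuickCheck (Gen, choose, generate, listOf, vectorOf)

-- Pick a code point in the full Char range, then map it back to a Char.
arbChar :: Gen Char
arbChar = chr <$> choose (ord minBound, ord maxBound)

-- Strings of arbitrary length drawn from every possible character.
arbStr :: Gen String
arbStr = listOf arbChar

-- Hypothetical helper: draw exactly n characters in IO, e.g. to inspect them.
sampleChars :: Int -> IO String
sampleChars n = generate (vectorOf n arbChar)
```

This avoids building the list of all 1114112 characters, at the cost of a uniform distribution in which ASCII characters are rare.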
When we evaluate
λ> length ([minBound..maxBound] :: [Char])
1114112
we get the number of all characters, which is a lot. If you think the list is too big, you can always compose drop x . take y to limit the range.
Accordingly, if you need n random characters, just shuffle the list with some shuffle :: [a] -> IO [a] and take n from the shuffled list.
Edit:
Well, of course, since shuffling could be expensive, it is best to choose a cleverer strategy. It would be ideal to randomly limit the list of all characters. So:
1. Make a limits = liftM sort . mapM randomRIO $ replicate 2 (0,1114112) :: (Ord a, Random a, Num a) => IO [a]
2. limits >>= \[min,max] -> return . drop min . take max $ ([minBound..maxBound] :: [Char])
3. Finally, take n random Chars with liftM (take n) applied to the result of Item 2.

How to get nth byte from ByteString?

How can I get nth byte of ByteString in Haskell?
I tried to find function like !! for ByteStrings, but found nothing.
ByteString.index is the function you're looking for.
Most of the "containerish" types emulate the extended list interface. Be careful, though: index will crash the program if you feed it a string that is too short (as will !! on ordinary lists). A safer implementation might be
import qualified Data.ByteString as B
import Data.Word (Word8)
nthByte :: Int -> B.ByteString -> Maybe Word8
nthByte n bs = fst <$> B.uncons (B.drop n bs)
which, reading inside out, drops the first n bytes (possibly producing an empty byte string), then attempts to split the first byte from the remainder, and if successful, ignores the rest of the string.
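One detail worth noting: drop leaves the string unchanged for a negative count, so the uncons-based version returns the first byte for a negative n rather than Nothing. A bounds-checked variant could be sketched as below; the name safeIndex is made up for this example. Recent versions of bytestring also appear to ship indexMaybe (and the operator !?) for the same purpose, so check the version you have before rolling your own.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Hypothetical helper: like B.index, but returns Nothing when n is out of range
-- (including negative n) instead of crashing.
safeIndex :: Int -> B.ByteString -> Maybe Word8
safeIndex n bs
  | n < 0 || n >= B.length bs = Nothing
  | otherwise                 = Just (B.index bs n)

-- Three bytes to index into.
exampleBytes :: B.ByteString
exampleBytes = B.pack [10, 20, 30]
```

With exampleBytes, index 1 yields Just 20, while index 3 and index -1 both yield Nothing.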

Haskell How to Create a Word8?

I want to write a simple function which splits a ByteString into [ByteString] using '\n' as the delimiter. My attempt:
import Data.ByteString
listize :: ByteString -> [ByteString]
listize xs = Data.ByteString.splitWith (=='\n') xs
This throws an error because '\n' is a Char rather than a Word8, which is what Data.ByteString.splitWith is expecting.
How do I turn this simple character into a Word8 that ByteString will play with?
You could just use the numeric literal 10, but if you want to convert the character literal you can use fromIntegral (ord '\n') (the fromIntegral is required to convert the Int that ord returns into a Word8). You'll have to import Data.Char for ord.
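Put together, the conversion described above might look like this sketch. It swaps splitWith for split, since only a single fixed byte is being matched. The name newline is introduced here for readability.

```haskell
import qualified Data.ByteString as B
import Data.Char (ord)
import Data.Word (Word8)

-- The newline character as a raw byte (numerically 10).
newline :: Word8
newline = fromIntegral (ord '\n')

-- Split a ByteString into the pieces between newline bytes.
listize :: B.ByteString -> [B.ByteString]
listize = B.split newline

-- "a\nb" as raw bytes: 97 is 'a', 10 is '\n', 98 is 'b'.
exampleLines :: [B.ByteString]
exampleLines = listize (B.pack [97, 10, 98])
```

The fromIntegral conversion is only safe for characters whose code point fits in a byte; larger code points would be silently truncated.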
You could also import Data.ByteString.Char8, which offers functions for using Char instead of Word8 on the same ByteString data type. (Indeed, it has a lines function that does exactly what you want.) However, this is generally not recommended, as ByteStrings don't store Unicode codepoints (which is what Char represents) but instead raw octets (i.e. Word8s).
If you're processing textual data, you should consider using Text instead of ByteString.
