How do I work with individual elements of a ByteString in Haskell


I need to write a function with the following type
replaceSubtrie :: SSTrie -> Data.Word.Word8 -> SSTrie -> SSTrie
replaceSubtrie trie base subtrie = ???
where depending on the value of base, the subtrie will be inserted into the trie in differing ways. SSTrie is my own data type and I know how to work with it, but I have no idea how to deal with the Word8 value.
base is a single "character" (for certain values of "character") taken from a ByteString. Specifically, it is the result of calling index on ByteString -- that's the only reason why I've declared it Word8.
I can't do pattern matching, as there's no Word8 constructor available. And I can't get guards to work because I don't know how to construct a Word8 constant to compare it against.
[edited] Jerome's suggestion worked. But more generally, are there any good articles out there showing how to work with ByteStrings (and other more low-level data)? Like, how could I have known that fact about Word8?
[edited - Question for Don Stewart]
Right now I've got it working with code like this
replaceSubtrie trie 0x41 subtrie = trie{ a=subtrie }
When I change it to this:
replaceSubtrie trie 'A' subtrie = trie{ a=subtrie }
I get an error:
Trie.hs:40:21:
Couldn't match expected type `Word8' with actual type `Char'
In the pattern: 'A'
In an equation for `replaceSubtrie':
replaceSubtrie trie 'A' subtrie = trie {a = subtrie}
I do have import qualified Data.ByteString.Char8 as C at the top of my file. What am I doing wrong?

I feel a bit silly looking up the ASCII value for 'A', but what the hell
You can simply import Data.ByteString.Char8 or Data.ByteString.Lazy.Char8, to get all the same functions, but permitting the use of character literals in patterns.
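A minimal sketch of both approaches, assuming only the bytestring package. Word8 is a numeric type, so numeric literals can be used in patterns and guards, while the Char8 interface hands back Char values instead. The names classify, classifyAt, charToWord8 and firstIsA are made up for illustration and are not part of any library.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Char (ord)
import Data.Word (Word8)

-- Word8 is an instance of Num and Eq, so numeric literals can be
-- used directly in patterns and guards.
classify :: Word8 -> String
classify 0x41 = "capital A"
classify w
  | w >= 0x30 && w <= 0x39 = "digit"
  | otherwise              = "other"

-- B.index returns a Word8; it fails if the index is out of range.
classifyAt :: B.ByteString -> Int -> String
classifyAt bs i = classify (B.index bs i)

-- Convert a Char to a Word8 instead of looking up ASCII codes by hand.
-- Only meaningful for characters in the range 0-255.
charToWord8 :: Char -> Word8
charToWord8 = fromIntegral . ord

-- With the Char8 interface, index yields a Char, so character
-- literals can be compared against directly.
firstIsA :: C.ByteString -> Bool
firstIsA bs = not (C.null bs) && C.index bs 0 == 'A'
```

Note that a character literal still cannot appear in a pattern whose type is Word8; the Char8 route works because the value being matched is a Char in the first place.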

Related

Understanding double colon "::" and type variables in Haskell?

For a uni assignment, a function "lookup" has been defined as:
lookup :: Env e -> String -> Maybe e
The "Env" keyword has been defined as:
import qualified Data.Map as M
newtype Env e = Env (M.Map String e) deriving (Functor, Foldable, Traversable, Show, Eq, Monoid, Semigroup)
I understand that the double colon "::" means "has type of". So the lookup function takes two arguments and returns something. I know the second argument is a string, however, I'm struggling to understand the first argument and the output. What is the "Env" type and what does the "e" after it represent. It feels like e is an argument to "Env" but if that is the case, the output wouldn't make sense.
Any help would be appreciated.

It feels like e is an argument to "Env" but if that is the case, the output wouldn't make sense.
That's the direction the syntax is trying to "feel like", because that's what it means. Env is a parameterised type, and e is being passed as an argument to that parameter.
I'm not 100% sure why you think the output wouldn't make sense, under that interpretation? Maybe is also a parameterised type, and the variable e is also being passed to Maybe; this is no different than using a variable twice in any piece of code. You can think of it as something like this pseudo-code:
lookupType(e) {
argument1_type = Env(e)
argument2_type = String
result_type = Maybe(e)
}
Is your confusion that you are thinking of e in the type signature of lookup not as being passed to Env, but as being defined to receive the argument of Env? That happens in the place where Env is defined, not where it is used in the type of lookup. Again, this is just like what happens at the value level with function arguments; e.g. when you're writing code to define a function like plus:
plus x y = x + y
Then the x and y variables are created to stand for whatever plus is applied to elsewhere. But in another piece of code where you're using plus, something like incr x = plus x 1, the variable x is just being passed to plus as an argument; it is not defined as the parameter of plus (it was in fact defined as the parameter of incr).
Perhaps the thing that you need more explicitly called out is this. lookup :: Env e -> String -> Maybe e is saying:
For any type e that you like, lookup takes an Env e and a String and returns a Maybe e
Thus you could pass lookup an Env Integer and a String, and it will give you back a Maybe Integer. Or you could pass it an Env (Maybe [(String, Float)]) and a String, and it will give you back a Maybe (Maybe [(String, Float)]). This should hopefully be intuitive, because it's just looking up keys in an environment; whatever type of data is stored in the environment you pass to lookup is the type that lookup will maybe-return.
The e is there because in fact lookup is parametrically polymorphic; it's almost like lookup takes a type argument called e, which it can then pass to other things in its type signature.[1] That's why I wrote my pseudo-code the way I did, above.
But this is just how variables in type signatures work, in base Haskell.[2] You simply write variables in your type signature without defining them anywhere, and they mean your definition can be used with ANY type at the position you write the variable. The only reason the variables have names like e (rather than being some sort of wildcard symbol like ?) is because you often need to say that you can work with any type, but it has to be the same type in several different places. lookup is like that; it can take any type of Env and return any sort of Maybe, but they have to be consistent. Giving the variable the name e merely allows you to link them to say that they're the same variable.
[1] This is in fact exactly how types like this work at a lower level. Haskell normally keeps this kind of type argument invisible though; you just write a type signature containing variables, without defining them, and every time you use the accompanying binding the compiler figures out how the variables should be instantiated.
[2] In more advanced Haskell, when you turn on a bunch of extensions, you can actually control exactly where type variables are introduced, rather than it always happening automatically at the beginning of every type signature you use with variables. You don't need to know that yet, and I'm not going to talk about it further.
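To make the "invisible type argument" idea concrete, here is a small sketch using GHC's TypeApplications extension. The function is called lookupEnv only to avoid clashing with the Prelude's lookup, and the environments ages and currencies are invented sample data; none of this comes from the assignment itself.

```haskell
{-# LANGUAGE TypeApplications #-}
import qualified Data.Map as M

newtype Env e = Env (M.Map String e)

-- Works for any element type e; the same e appears in the argument
-- and in the result.
lookupEnv :: Env e -> String -> Maybe e
lookupEnv (Env m) key = M.lookup key m

-- Hypothetical sample environments with different element types.
ages :: Env Integer
ages = Env (M.fromList [("alice", 30), ("bob", 25)])

currencies :: Env String
currencies = Env (M.fromList [("Sweden", "SEK")])

-- The compiler instantiates e = Integer here ...
aliceAge :: Maybe Integer
aliceAge = lookupEnv ages "alice"

-- ... and e = String here, with no change to lookupEnv itself.
swedenCurrency :: Maybe String
swedenCurrency = lookupEnv currencies "Sweden"

-- With TypeApplications, the normally invisible type argument can be
-- written out explicitly.
bobAge :: Maybe Integer
bobAge = lookupEnv @Integer ages "bob"
```

Writing lookupEnv @String ages "bob" would be rejected by the type checker, because ages is an Env Integer and the two occurrences of e must agree.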

I'll try to give concrete examples, providing and motivating intuition. With that intuition, I think the question has a very natural answer:
"e is a type variable, and the lookup function wants to work on all possible environments, regardless of which concrete type e is". The unbound e is a natural way to express that syntactically.
Step 1, the Env type
The Env type is a wrapper for the Map type from the Data.Map module in the containers package. A Map is a collection of key-value pairs: you can insert new key-value pairs and look them up. If the key you are looking up is missing, you must return an error, a null value, a default or something else, just like hashmaps or dictionaries in other programming languages.
The documentation (linked above) writes
data Map k a
A Map from keys k to values a.
and we will try that out and see what k and a can be.
I use ghci to get interactive feedback.
Prelude> import Data.Map as M
Prelude M> map1 = M.fromList [("Sweden","SEK"),("Chile","CLP")]
Prelude M> :type map1
map1 :: Map [Char] [Char]
Prelude M> map2 = M.fromList [(1::Integer,"Un"),(2,"deux")]
Prelude M> :type map2
map2 :: Map Integer [Char]
Prelude M> map3 = fromList [("Ludvig",10000::Double),("Mikael",25000)]
Prelude M> :type map3
map3 :: Map [Char] Double
You can see that we create various mappings based on lists of key-value pairs. The type Map k a from the documentation corresponds to different k and a in the ghci session above. For map2, k corresponds to Integer and a to [Char].
You can also see how I declared types manually in some places, using the double colon syntax.
To reduce the flexibility, we can create a new "wrapping" type for M.Map. With this wrapper, we make sure that the keys are always Strings.
Prelude M> newtype Env e = Env (M.Map String e) deriving (Show)
This definition says that, for every e, the type Env e is a wrapper around M.Map String e. The newtype declaration further means that we must always be explicit and wrap the map in the Env value constructor. Let's see if we can do that.
Prelude M> Env map1
Env (fromList [("Chile","CLP"),("Sweden","SEK")])
Prelude M> Env map2
<interactive>:34:6: error:
• Couldn't match type ‘Integer’ with ‘[Char]’
Expected type: Map String [Char]
Actual type: Map Integer [Char]
• In the first argument of ‘Env’, namely ‘(map2)’
In the expression: Env (map2)
In an equation for ‘it’: it = Env (map2)
Prelude M> Env map3
Env (fromList [("Ludvig",10000.0),("Mikael",25000.0)])
In the session above, we see that both map1 and map3 can be wrapped in an Env, since they have an appropriate type (they have k==String), but map2 cannot (having k==Integer).
The logical step from Map to Env is a little tricky, since it also renames a variable. What was called a when talking about maps is called e in the Env case. But variable names are always arbitrary, so that is OK.
Step 2, the lookup
We have established that Env e is a wrapping type that contains a lookup table from strings to values of some type e. How do you look things up? Let us start with the non-wrapped case, and then the wrapped case. In Data.Map there is a function called lookup. Let's try it!
Prelude M> M.lookup "Ludvig" map3
Just 10000.0
Prelude M> M.lookup "Elias" map3
Nothing
Okay, to look up a value, we supply a key, and get just the corresponding value. If the key is missing, we get nothing. What is the type of the return value?
Prelude M> :type M.lookup "Ludvig" map3
M.lookup "Ludvig" map3 :: Maybe Double
When making a lookup in a Map [Char] Double, we need a key of type [Char] and get back a value of type Maybe Double. Okay. That sounds reasonable. What about some other example?
Prelude M> :type M.lookup 1 map2
M.lookup 1 map2 :: Maybe [Char]
When making a lookup in a Map Integer [Char], we need a key of type Integer and get back a value of type Maybe [Char]. So in general, to do a lookup we need a key of type k and a map of type M.Map k a, and we get back a Maybe a.
So we expect M.lookup :: k -> M.Map k a -> Maybe a. Let's see if ghci agrees.
Prelude M> :t M.lookup
M.lookup :: Ord k => k -> Map k a -> Maybe a
It does! It also requires that k be a type with a defined ordering, such as strings or numbers. That is the meaning of the Ord k => part at the beginning. Why? It has to do with how the M.Map type is implemented; do not worry too much about it right now.
We can specialize the type signature if we set k to be String. In that special case, it would look like
M.lookup :: String -> Map String a -> Maybe a
Hmm, this inspires us to flip the order of arguments 1 and 2 of lookup, and to replace the variable a with e. It is just a variable, and variable names are arbitrary anyway. Let's call this new function myLookup:
myLookup :: Map String e -> String -> Maybe e
and since Env e is essentially the same as Map String e once we unwrap the Env type, we may define
myLookup :: Env e -> String -> Maybe e
How does one implement such a function? One way is to just unwrap the type by pattern matching, and let the library do the heavy lifting.
myLookup (Env map) key = M.lookup key map
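Collected into one file, the pieces from the session above might look like the sketch below. The salaries environment reuses the sample data from map3; insertEnv is a hypothetical extra helper, added only to show that wrapping works in the other direction too.

```haskell
import qualified Data.Map as M

-- Env fixes the key type to String and leaves the value type open.
newtype Env e = Env (M.Map String e) deriving (Show)

-- Unwrap by pattern matching, then delegate to Data.Map.
myLookup :: Env e -> String -> Maybe e
myLookup (Env m) key = M.lookup key m

-- Hypothetical helper: unwrap, insert, and wrap again.
insertEnv :: String -> e -> Env e -> Env e
insertEnv key value (Env m) = Env (M.insert key value m)

-- Same data as map3 in the ghci session.
salaries :: Env Double
salaries = Env (M.fromList [("Ludvig", 10000), ("Mikael", 25000)])
```

In ghci, myLookup salaries "Ludvig" should give Just 10000.0 and myLookup salaries "Elias" should give Nothing, matching the M.lookup results on the unwrapped map.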
Conclusion
I have tried to build up concrete examples using functions and types from the standard libraries of Haskell. I hope I have illustrated the idea of a polymorphic type (that M.Map k a is a type when supplied with a k and an a) and that Env is a wrapper, specializing to the case when k is a String.
I hope that this concrete example motivates why the type signatures look like they do and why type variables are useful.
I have not tried to give you the formal treatment with terminology. I have not explained the Maybe type, nor typeclasses and derived typeclasses. I hope you can read up on that elsewhere.
I hope it helps.

Pass no char to a function that is expecting it in Haskell

I am working with Haskell and I have defined the following type
--Build type Transition--
data Transition = Transition {
start_state :: Int,
symbol :: Char,
end_state :: Int
} deriving Show
and I would like to be able to define the following Transition
Transition 0 '' 1
which would mean "a transition given by no symbol" (I need it to compute the epsilon closure of an NFA). How can I do this?
Thank you!

Well, the idea of defining a type is that every value you pass to that field is a "member" of that type. Char contains only characters (and the empty string is not a character) and undefined (but it is advisable not to use undefined here).
Usually in case you want to make values optional, you can use a Maybe a type instead, so:
data Transaction = Transaction {
start_state :: Int,
symbol :: Maybe Char,
end_state :: Int
} deriving Show
So now we can pass two kinds of values: Nothing which thus should be interpreted as "no character", or Just x, with x a character, and this thus acts as a character, so in your case, that would be:
Transaction 0 Nothing 1
Maybe is also an instance of Functor, Applicative and Monad, which should make working with Maybe types quite convenient (yes it can sometimes introduce some extra work, but by using fmap, etc. the amount of pattern matching shifting to Maybe Char should be rather low).
Note: like @amalloy says, an NFA (and DFA) has Transitions, not Transactions.
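Since the question mentions the epsilon closure, here is a rough sketch of how the Maybe Char field could be used for it. The functions epsilonSteps and epsilonClosure are illustrative names, and representing the automaton as a plain list of transitions is an assumption made for brevity, not something the question specifies.

```haskell
data Transition = Transition
  { start_state :: Int
  , symbol      :: Maybe Char  -- Nothing marks an epsilon transition
  , end_state   :: Int
  } deriving Show

-- States reachable from s by exactly one epsilon transition.
epsilonSteps :: [Transition] -> Int -> [Int]
epsilonSteps ts s =
  [ end_state t | t <- ts, start_state t == s, symbol t == Nothing ]

-- All states reachable from s using only epsilon transitions,
-- including s itself, in order of discovery.
epsilonClosure :: [Transition] -> Int -> [Int]
epsilonClosure ts s = go [s] []
  where
    go [] seen = reverse seen
    go (x:xs) seen
      | x `elem` seen = go xs seen   -- already visited, skip (handles cycles)
      | otherwise     = go (epsilonSteps ts x ++ xs) (x : seen)
```

The list-based bookkeeping is quadratic in the number of states; for larger automata a Data.Set of visited states would be the more usual choice.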

How to make a custom Attoparsec parser combinator that returns a Vector instead of a list?

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import Control.Applicative(many)
import Data.Word
parseManyNumbers :: Parser [Int] -- I'd like many to return a Vector instead
parseManyNumbers = many (decimal <* skipSpace)
main :: IO ()
main = print $ parseOnly parseManyNumbers "131 45 68 214"
The above is just an example, but I need to parse a large amount of primitive values in Haskell and need to use arrays instead of lists. This is something that is possible in F#'s FParsec, so I went as far as looking at Attoparsec's source, but I can't figure out a way to do it. In fact, I can't figure out where many from Control.Applicative is defined in the base Haskell library. I thought it would be there, as that is where the documentation on Hackage points to, but no such luck.
Also, I am having trouble deciding what data structure to use here as I can't find something as convenient as a resizable array in Haskell, but I would rather not use inefficient tree based structures.
An option to me would be to skip Attoparsec and implement an entire parser inside the ST monad, but I would rather avoid it except as a very last resort.

There is a growable vector implementation in Haskell, which is based on the great AMT algorithm: "persistent-vector". Unfortunately, the library isn't very well known in the community so far. However, to give you a clue about the performance of the algorithm, I'll say that it is the algorithm that drives the standard vector implementations in Scala and Clojure.
I suggest you implement your parser around that data-structure under the influence of the list-specialized implementations. Here the functions are, btw:
-- | One or more.
some :: f a -> f [a]
some v = some_v
  where
    many_v = some_v <|> pure []
    some_v = (fmap (:) v) <*> many_v
-- | Zero or more.
many :: f a -> f [a]
many v = many_v
  where
    many_v = some_v <|> pure []
    some_v = (fmap (:) v) <*> many_v

Some ideas:
Data Structures
I think the most practical data structure to use for the list of Ints is something like [Vector Int]. If each component Vector is sufficiently long (i.e. has length 1k) you'll get good space economy. You'll have
to write your own "list operations" to traverse it, but you'll avoid re-copying data that you would have to perform to return the data in a single Vector Int.
Also consider using a Dequeue instead of a list.
Stateful Parsing
Unlike Parsec, Attoparsec does not provide for user state. However, you
might be able to make use of the runScanner function (link):
runScanner :: s -> (s -> Word8 -> Maybe s) -> Parser (ByteString, s)
(It also returns the parsed ByteString which in your case may be problematic since it will be very large. Perhaps you can write an alternate version which doesn't do this.)
Using unsafeFreeze and unsafeThaw you can incrementally fill in a Vector. Your s data structure might look
something like:
data MyState = MyState
{ inNumber :: Bool -- True if seen a digit
, val :: Int -- value of int being parsed
, vecs :: [ Vector Int ] -- past parsed vectors
, v :: Vector Int -- current vector we are filling
, vsize :: Int -- number of items filled in current vector
}
Maybe instead of a [Vector Int] you use a Dequeue (Vector Int).
I imagine, however, that this approach will be slow since your parsing function will get called for every single character.
Represent the list as a single token
Parsec can be used to parse a stream of tokens, so how about writing
your own tokenizer and letting Parsec create the AST.
The key idea is to represent these large sequences of Ints as a single token. This gives you a lot more latitude in how you parse them.
Defer Conversion
Instead of converting the numbers to Ints at parse time, just have parseManyNumbers return a ByteString and defer the conversion until you actually need the values. This might enable you to avoid reifying the values as an actual list.

Vectors are arrays, under the hood. The tricky thing about arrays is that they are fixed-length. You pre-allocate an array of a certain length, and the only way of extending it is to copy the elements into a larger array.
This makes linked lists simply better at representing variable-length sequences. (It's also why list implementations in imperative languages amortise the cost of copying by allocating arrays with extra space and copying only when the space runs out.) If you don't know in advance how many elements there are going to be, your best bet is to use a list (and perhaps copy the list into a Vector afterwards using fromList, if you need to). That's why many returns a list: it runs the parser as many times as it can with no prior knowledge of how many that'll be.
On the other hand, if you happen to know how many numbers you're parsing, then a Vector could be more efficient. Perhaps you know a priori that there are always n numbers, or perhaps the protocol specifies before the start of the sequence how many numbers there'll be. Then you can use replicateM to allocate and populate the vector efficiently.

Parsec returns [Char] instead of Text

I am trying to create a parser for a custom file format. In the format I am working with, some fields have a closing tag like so:
<SOL>
<DATE>0517
<YEAR>86
</SOL>
I am trying to grab the value between the </ and > and use it as part of the bigger parser.
I have come up with the code below. The trouble is, the parser returns [Char] instead of Text. I can pack each Char by doing fmap pack $ return r to get a text value out, but I was hoping type inference would save me from having to do this. Could someone give hints as to why I am getting back [Char] instead of Text, and how I can get back Text without having to manually pack the value?
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Text
import Text.Parsec
import Text.Parsec.Text
-- |A closing tag is on its own line and is a "</" followed by some uppercase characters
-- followed by some '>'
closingTag = do
  _ <- char '\n'
  r <- between (string "</") (char '>') (many upper)
  return r

string has the type
string :: Stream s m Char => String -> ParsecT s u m String
(See here for documentation)
So getting a String back is exactly what's supposed to happen.
Type inference doesn't change types, it only infers them. String is a concrete type, so there's no way to infer Text for it.
What you could do, if you need this in a couple of places, is to write a function
text :: Stream s m Char => String -> ParsecT s u m Text
text = fmap pack . string
or even
string' :: (IsString a, Stream s m Char) => String -> ParsecT s u m a
string' = fmap fromString . string
Also, it doesn't matter in this example but you'd probably want to import Text qualified, names like pack are used in a number of different modules.
As Ørjan Johansen correctly pointed out, string isn't actually the problem here, many upper is. The same principle applies though.

The reason you get [Char] here is that upper parses a Char and many turns that into a [Char]. I would write my own combinator along the lines of:
manyPacked = fmap pack . many
You could probably use type-level programming with type classes etc. to automatically choose between many and manyPacked depending on the expected return type, but I don't think that's worth it. (It would probably look a bit like Scala's CanBuildFrom.)
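For completeness, a sketch of how manyPacked could be slotted into the closingTag parser from the question, assuming the parsec and text packages. The type signatures and the parseClosingTag wrapper (with its "<input>" source name) are additions for illustration; Data.Text is imported qualified here so that pack does not clash with other modules.

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Text.Parsec
import Text.Parsec.Text (Parser)

-- Run a Char parser zero or more times and pack the result into Text.
manyPacked :: Parser Char -> Parser Text
manyPacked = fmap T.pack . many

-- A closing tag is on its own line: "</", some uppercase characters, '>'.
closingTag :: Parser Text
closingTag = do
  _ <- char '\n'
  between (string "</") (char '>') (manyPacked upper)

-- Hypothetical convenience wrapper for running the parser.
parseClosingTag :: Text -> Either ParseError Text
parseClosingTag = parse closingTag "<input>"
```

The list of Chars is still built internally by many and packed afterwards, so this changes the result type rather than avoiding the intermediate list.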

Text or Bytestring

Good day.
The one thing I now hate about Haskell is the number of packages for working with strings.
First I used native Haskell [Char] strings, but when I started using Hackage libraries I got completely lost in endless conversions. Every package seems to use a different string implementation, and some adopt their own handmade thing.
Next I rewrote my code with Data.Text strings and the OverloadedStrings extension. I chose Text because it has a wider set of functions, but it seems many projects prefer ByteString.
Could someone give a short rationale for using one or the other?
PS: By the way, how do I convert from Text to ByteString?
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Text
Expected type: IO Data.ByteString.Lazy.Internal.ByteString
Inferred type: IO Text
I tried encodeUtf8 from Data.Text.Encoding, but no luck:
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
UPD:
Thanks for the responses. That *Chunks goodness looks like the way to go, but I am somewhat shocked by the result. My original function looked like this:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . convertFuzzy Discard "CP1251" "UTF8"
And now it has become:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . fromLazyBS . convertFuzzy Discard "CP1251" "UTF8" . toLazyBS
  where
    toLazyBS t = fromChunks [encodeUtf8 t]
    fromLazyBS t = decodeUtf8 $ intercalate "" $ toChunks t
And yes, this function does not work because it is wrong: if we supply Text to it, we are already confident that the text is properly encoded and ready to use, so converting it again is pointless. But such a verbose conversion still has to take place somewhere outside htmlToItems.

ByteStrings are mainly useful for binary data, but they are also an efficient way to process text if all you need is the ASCII character set. If you need to handle unicode strings, you need to use Text. However, I must emphasize that neither is a replacement for the other, and they are generally used for different things: while Text represents pure unicode, you still need to encode to and from a binary ByteString representation whenever you e.g. transport text via a socket or a file.
Here is a good article about the basics of unicode, which does a decent job of explaining the relation of unicode code-points (Text) and the encoded binary bytes (ByteString): The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
You can use the Data.Text.Encoding module to convert between the two datatypes, or Data.Text.Lazy.Encoding if you are using the lazy variants (as you seem to be doing based on your error messages).

You definitely want to be using Data.Text for textual data.
encodeUtf8 is the way to go. This error:
Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
means that you're supplying a strict bytestring to code which expects a lazy bytestring. Conversion is easy with the fromChunks function:
Data.ByteString.Lazy.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString
so all you need to do is add the function fromChunks [myStrictByteString] wherever the lazy bytestring is expected.
Conversion the other way can be accomplished with the dual function toChunks, which takes a lazy bytestring and gives a list of strict chunks.
You may want to ask the maintainers of some packages if they'd be able to provide a text interface instead of, or in addition to, a bytestring interface.
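As a sketch of the conversions described above, assuming the text and bytestring packages; the helper names textToLazyBS and lazyBSToText are made up for this example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Strict Text -> lazy ByteString, encoded as UTF-8.
-- fromChunks wraps the single strict chunk without re-encoding it.
textToLazyBS :: T.Text -> BL.ByteString
textToLazyBS t = BL.fromChunks [TE.encodeUtf8 t]

-- Lazy ByteString -> strict Text.
-- decodeUtf8 throws an exception on invalid UTF-8, so this assumes
-- the bytes are known to be valid UTF-8.
lazyBSToText :: BL.ByteString -> T.Text
lazyBSToText = TE.decodeUtf8 . BS.concat . BL.toChunks
```

Recent versions of bytestring also export BL.fromStrict and BL.toStrict, which do the same job as the fromChunks and toChunks combinations used here.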

Use the single function cs from the Data.String.Conversions module (string-conversions package).
It will allow you to convert between String, ByteString and Text (as well as ByteString.Lazy and Text.Lazy), depending on the input and the expected types.
You still have to call it, but you no longer have to worry about the respective types.
See this answer for usage example.

For what it's worth, I found these two helper functions to be quite useful:
{-# LANGUAGE OverloadedStrings #-}  -- needed for the string literals in foo below
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
-- | Text to ByteString
tbs :: T.Text -> BS.ByteString
tbs = BS.pack . T.unpack
-- | ByteString to Text
bst :: BS.ByteString -> T.Text
bst = T.pack . BS.unpack
Example:
foo :: [BS.ByteString]
foo = ["hello", "world"]
bar :: [T.Text]
bar = bst <$> foo
baz :: [BS.ByteString]
baz = tbs <$> bar
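One caveat worth knowing about helpers built on Data.ByteString.Char8: pack truncates every Char to 8 bits, so they only round-trip text whose characters fit in one byte. The sketch below contrasts that with UTF-8 based versions from Data.Text.Encoding; the names tbsChar8, tbsUtf8 and bstUtf8 are invented for the comparison.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The Char8-based helper: every Char is truncated to its low 8 bits.
tbsChar8 :: T.Text -> BS.ByteString
tbsChar8 = BS.pack . T.unpack

-- UTF-8 based alternatives that preserve all of Unicode.
tbsUtf8 :: T.Text -> B.ByteString
tbsUtf8 = TE.encodeUtf8

-- decodeUtf8 throws on invalid UTF-8; this assumes valid input.
bstUtf8 :: B.ByteString -> T.Text
bstUtf8 = TE.decodeUtf8
```

For example, the snowman character U+2603 becomes the single byte 0x03 under tbsChar8, but the three bytes 0xE2 0x98 0x83 under tbsUtf8, which bstUtf8 can decode back to the original text.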
