Which container should I use? - haskell

I'm new to haskell, and so I'm trying to recreate the following C++ code in haskell.
int main() {
class MyClass {
public:
int a;
std::string s;
float f;
};
std::vector <MyClass> v;
LoadSerialized(&v); // don't need haskell equivalent; just reads a bunch of MyClass's and pushes them back onto v
}
Now, I've looked at the various containers in haskell that might work as the std::vector here: there's list, unboxed vector, boxed vector, and some weird usage of foreign pointers like the following:
data Table = Table { floats :: ForeignPtr CFloat
, ints :: ForeignPtr Int }
newTable :: IO Table
newTable = do
fp <- S.mallocByteString (floatSize * sizeOf (undefined :: CFloat))
ip <- S.mallocByteString (intSize * sizeOf (undefined :: Int ))
withForeignPtr fp $ \p ->
forM_ [0..floatSize-1] $ \n ->
pokeElemOff p n pi
withForeignPtr ip $ \p ->
forM_ [0..intSize-1] $ \n ->
pokeElemOff p n n
return (Table fp ip)
Now, I could implement the C++ code in the way I think is best--being a haskell newbie. Or I could ask people more experienced with the language what the best way is, because to me it looks like there's some nuance going on here that I'm missing. Simply, I want to push a structure containing many datatypes into a haskell container, and I don't care about the order. If it helps, I'm going to read serialized data into the container as you can see with LoadSerialized.
I'm not mixing in C++ code.
(Edit: is it stackoverflow policy to allow censorship of questions through editing (not minor)? It does say "always respect the original author.")

If you are writing the whole program in Haskell, Just use a list unless you have a good reason not to. (If you do have a good reason not to, please say what it is and we can help you choose a more appropriate data structure. e.g. random access to a specific list element is O(n) rather than the O(1) of a C++ vector, and updating values in a data structure is different in Haskell.)
If you are mixing Haskell and C++ in the same program, and you need help calling C++ from Haskell, please say.
Use lists by default. List operations such as map, foldr and filter can be fused together by the compiler, resulting in more efficient code than you would typically get using a C++ vector.
Use an array of some sort if you find yourself needing to lookup an element by index, or wanting to mutate an element at a specific index. See Data.Array, Data.Array.IO, and Data.Array.ST.
Use a sequence if you find yourself needing to insert new elements in the middle of the data structure, or at both ends of the structure. See Data.Sequence.

Related

Haskell: Assigning unique char to matrix values if x > 0

So my goal for the program is for it to receive an Int matrix for input, and program converts all numbers > 0 to a unique sequential char, while 0's convert into a '_' (doesn't matter, just any character not in the sequence).
eg.
main> matrixGroupings [[0,2,1],[2,2,0],[[0,0,2]]
[["_ab"],["cd_"],["__e"]]
The best I've been able to achieve is
[["_aa"],["aa_"],["__a"]]
using:
matrixGroupings xss = map (map (\x -> if x > 0 then 'a' else '_')) xss
As far as I can tell, the issue I'm having is getting the program to remember what its last value was, so that when the value check is > 0, it picks the next char in line. I can't for the life of me figure out how to do this though.
Any help would be appreciated.
Your problem is an instance of an ancient art: labelling of various structures with a stream of
labels. It dates back at least to Chris Okasaki, and my favourite treatment is by Jeremy
Gibbons.
As you can see from these two examples, there is some variety to the way a structure may be
labelled. But in this present case, I suppose the most straightforward way will do. And in Haskell
it would be really short. Let us dive in.
The recipe is this:
Define a polymorphic type for your matrices. It must be such that a matrix of numbers and a
matrix of characters are both rightful members.
Provide an instance of Traversable class. It may in many cases be derived automagically.
Pick a monad to your liking. One simple choice is State. (Actually, that is the only choice I
can think of.)
Create an action in this monad that takes a number to a character.
Traverse a matrix with this action.
Let's cook!
A type may be as simple as this:
newtype Matrix a = Matrix [[a]] deriving Show
It is entirely possible that the inner lists will be of unequal length — this type does not
protect us from making a "ragged" matrix. This is poor design. But I am going to skim over
it for now. Haskell provides an endless depth for perfection. This type is good enough for
our needs here.
We can immediately define an example of a matrix:
example :: Matrix Int
example = Matrix [[0,2,1],[2,2,0],[0,0,2]]
How hard is it to define a Traversable? 0 hard.
{-# language DeriveTraversable #-}
...
newtype Matrix a = Matrix [[a]] deriving (Show, Functor, Foldable, Traversable)
Presto.
Where do we get labels from? It is a side effect. The function reaches somewhere, takes a
stream of labels, takes the head, and puts the tail back in the extra-dimensional pocket. A
monad that can do this is State.
It works like this:
label :: Int -> State String Char
label 0 = return '_'
label x = do
ls <- get
case ls of
[ ] -> error "No more labels!"
(l: ls') -> do
put ls'
return l
I hope the code explains itself. When a function "creates" a monadic value, we call it
"effectful", or an "action" in a given monad. For instance, print is an action that,
well, prints stuff. Which is an effect. label is also an action, though in a different
monad. Compare and see for youself.
Now we are ready to cook a solution:
matrixGroupings m = evalState (traverse label m) ['a'..'z']
This is it.
λ matrixGroupings example
Matrix ["_ab","cd_","__e"]
Bon appetit!
P.S. I took all glory from you, it is unfair. To make things fun again, I challenge you for an exercise: can you define a Traversable instance that labels a matrix in another order — by columns first, then rows?

How to make a custom Attoparsec parser combinator that returns a Vector instead of a list?

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import Control.Applicative(many)
import Data.Word
parseManyNumbers :: Parser [Int] -- I'd like many to return a Vector instead
parseManyNumbers = many (decimal <* skipSpace)
main :: IO ()
main = print $ parseOnly parseManyNumbers "131 45 68 214"
The above is just an example, but I need to parse a large amount of primitive values in Haskell and need to use arrays instead of lists. This is something that possible in the F#'s Fparsec, so I've went as far as looking at Attoparsec's source, but I can't figure out a way to do it. In fact, I can't figure out where many from Control.Applicative is defined in the base Haskell library. I thought it would be there as that is where documentation on Hackage points to, but no such luck.
Also, I am having trouble deciding what data structure to use here as I can't find something as convenient as a resizable array in Haskell, but I would rather not use inefficient tree based structures.
An option to me would be to skip Attoparsec and implement an entire parser inside the ST monad, but I would rather avoid it except as a very last resort.
There is a growable vector implementation in Haskell, which is based on the great AMT algorithm: "persistent-vector". Unfortunately, the library isn't that much known in the community so far. However to give you a clue about the performance of the algorithm, I'll say that it is the algorithm that drives the standard vector implementations in Scala and Clojure.
I suggest you implement your parser around that data-structure under the influence of the list-specialized implementations. Here the functions are, btw:
-- | One or more.
some :: f a -> f [a]
some v = some_v
where
many_v = some_v <|> pure []
some_v = (fmap (:) v) <*> many_v
-- | Zero or more.
many :: f a -> f [a]
many v = many_v
where
many_v = some_v <|> pure []
some_v = (fmap (:) v) <*> many_v
Some ideas:
Data Structures
I think the most practical data structure to use for the list of Ints is something like [Vector Int]. If each component Vector is sufficiently long (i.e. has length 1k) you'll get good space economy. You'll have
to write your own "list operations" to traverse it, but you'll avoid re-copying data that you would have to perform to return the data in a single Vector Int.
Also consider using a Dequeue instead of a list.
Stateful Parsing
Unlike Parsec, Attoparsec does not provide for user state. However, you
might be able to make use of the runScanner function (link):
runScanner :: s -> (s -> Word8 -> Maybe s) -> Parser (ByteString, s)
(It also returns the parsed ByteString which in your case may be problematic since it will be very large. Perhaps you can write an alternate version which doesn't do this.)
Using unsafeFreeze and unsafeThaw you can incrementally fill in a Vector. Your s data structure might look
something like:
data MyState = MyState
{ inNumber :: Bool -- True if seen a digit
, val :: Int -- value of int being parsed
, vecs :: [ Vector Int ] -- past parsed vectors
, v :: Vector Int -- current vector we are filling
, vsize :: Int -- number of items filled in current vector
}
Maybe instead of a [Vector Int] you use a Dequeue (Vector Int).
I imagine, however, that this approach will be slow since your parsing function will get called for every single character.
Represent the list as a single token
Parsec can be used to parse a stream of tokens, so how about writing
your own tokenizer and letting Parsec create the AST.
The key idea is to represent these large sequences of Ints as a single token. This gives you a lot more latitude in how you parse them.
Defer Conversion
Instead of converting the numbers to Ints at parse time, just have parseManyNumbers return a ByteString and defer the conversion until
you actually need the values. This much enable you to avoid reifying
the values as an actual list.
Vectors are arrays, under the hood. The tricky thing about arrays is that they are fixed-length. You pre-allocate an array of a certain length, and the only way of extending it is to copy the elements into a larger array.
This makes linked lists simply better at representing variable-length sequences. (It's also why list implementations in imperative languages amortise the cost of copying by allocating arrays with extra space and copying only when the space runs out.) If you don't know in advance how many elements there are going to be, your best bet is to use a list (and perhaps copy the list into a Vector afterwards using fromList, if you need to). That's why many returns a list: it runs the parser as many times as it can with no prior knowledge of how many that'll be.
On the other hand, if you happen to know how many numbers you're parsing, then a Vector could be more efficient. Perhaps you know a priori that there are always n numbers, or perhaps the protocol specifies before the start of the sequence how many numbers there'll be. Then you can use replicateM to allocate and populate the vector efficiently.

Accessing vector element by index using lens

I'm looking for a way to reference an element of a vector using lens library...
Let me try to explain what I'm trying to achieve using a simplified example of my code.
I'm working in this monad transformer stack (where StateT is the focus, everything else is not important)
newtype MyType a = MyType (StateT MyState (ExceptT String IO) a)
MyState has a lot of fields but one of those is a vector of clients which is a data type I defined:
data MyState = MyState { ...
, _clients :: V.Vector ClientT
}
Whenever I need to access one of my clients I tend to do it like this:
import Control.Lens (use)
c <- use clients
let neededClient = c V.! someIndex
... -- calculate something, update client if needed
clients %= (V.// [(someIndex, updatedClient)])
Now, here is what I'm looking for: I would like my function to receive a "reference" to the client I'm interested in and use it (retrieve it from State, update it if needed).
In order to clear up what I mean here is a snippet (that won't compile even in pseudo code):
...
myFunction (clients.ix 0)
...
myFunction clientLens = do
c <- use clientLens -- I would like to access a client in the vector
... -- calculate stuff
clientLens .= updatedClient
Basically, I would like to pass to myFunction something from Lens library (I don't know what I'm passing here... Lens? Traversal? Getting? some other thingy?) which will allow me to point at particular element in the vector which is kept in my StateT. Is it at all possible? Currently, when using "clients.ix 0" I get an error that my ClientT is not an instance of Monoid.
It is a very dumbed down version of what I have. In order to answer the question "why I need it this way" requires a lot more explanation. I'm interested if it is possible to pass this "reference" which will point to some element in my vector which is kept in State.
clients.ix 0 is a traversal. In particular, traversals are setters, so setting and modifying should work fine:
clients.ix 0 .= updatedClient
Your problem is with use. Because a traversal doesn't necessarily contain exactly one value, when you use a traversal (or use some other getter function on it), it combines all the values assuming they are of a Monoid type.
In particular,
use (clients.ix n)
would want to return mempty if n is out of bounds.
Instead, you can use the preuse function, which discards all but the first target of a traversal (or more generally, a fold), and wraps it in a Maybe. E.g.
Just c <- preuse (clients.ix n)
Note this will give a pattern match error if n is out of bounds, since preuse returns Nothing then.

Self-reference in data structure – Checking for equality

In my initial attempt at creating a disjoint set data structure I created a Point data type with a parent pointer to another Point:
data Point a = Point
{ _value :: a
, _parent :: Point a
, _rank :: Int
}
To create a singleton set, a Point is created that has itself as its parent (I believe this is called tying the knot):
makeSet' :: a -> Point a
makeSet' x = let p = Point x p 0 in p
Now, when I wanted to write findSet (i.e. follow parent pointers until you find a Point whose parent is itself) I hit a problem: Is it possible to check if this is the case? A naïve Eq instance would of course loop infinitely – but is this check conceptually possible to write?
(I ended up using a Maybe Point for the parent field, see my other question.)
No, what you are asking for is known in the Haskell world as referential identity: the idea that for two values of a certain type, you can check whether they are the same value in memory or two separate values that happen to have exactly the same properties.
For your example, you can ask yourself whether you would consider the following two values the same or not:
pl1 :: Point Int
pl1 = Point 0 (Point 0 pl1 1) 1
pl2 :: Point Int
pl2 = Point 0 pl2 1
Haskell considers both values completely equal. I.e. Haskell does not support referential identity. One of the reasons for this is that it would violate other features that Haskell does support. It is for example the case in Haskell that we can always replace a reference to a function by that function's implementation without changing the meaning (equational reasoning). For example if we take the implementation of pl2: Point 0 pl2 1 and replace pl2 by its definition, we get Point 0 (Point 0 pl2 1) 1, making pl2's definition equivalent to pl1's. This shows that Haskell cannot allow you to observe the difference between pl1 and pl2 without violating properties implied by equational reasoning.
You could use unsafe features like unsafePerformIO (as suggested above) to work around the lack of referential identity in Haskell, but you should know that you are then breaking core principles of Haskell and you may observe weird bugs when GHC starts optimizing (e.g. inlining) your code. It is better to use a different representation of your data, e.g. the one you mentioned using a Maybe Point value.
You could try to do this using StableName (or StablePtr) and unsafePerformIO, but it seems like a worse idea than Maybe Point for this case.
In order the observed the effect you're most likely interested in—pointer equality instead of value equality—you most likely will want to write your algorithm in the ST monad. The ST monad can be thought of kind of like "locally impure IO, globally pure API" though, by the nature of Union Find, you'll likely have to dip the entire lookup process into the impure section of your code.
Fortunately, monads still contain this impurity fairly well.
There is also already an implementation of Union Find in Haskell which uses three methods to achieve the kind of identity you're looking for: using the IO monad, using IntSets, and using the ST monad.
In Haskell data is immutable (except IO a, IORef a, ST a, STRef a, ..) and data is function.
So, any data is "singleton"
When you type
data C = C {a, b :: Int}
changeCa :: C -> Int -> C
changeCa c newa = c {a = newa}
You don't change a variable, you destroy old data and create a new one.
You could try use links and pointers, but it is useless and complex
And last,
data Point a = Point {p :: Point a}
is a infinite list-like data = Point { p=Point { p=Point { p=Point { ....

Writing fusible O(1) update for vector

It is continuation of this question. Since vector library doesn't seem to have a fusible O(1) update function, I am wondering if it is possible to write a fusible O(1) update function that doesn't involve unsafeFreeze and unsafeThaw. It would use vector stream representation, I guess - I am not familiar with how to write one using stream and unstream - hence, this question. The reason is this will give us the ability to write a cache-friendly update function on vector where only a narrow region of vector is being modified, and so, we don't want to walk through entire vector just to process that narrow region (and this operation can happen billions of times in each function call - so, the motivation to keep the overhead really low). The transformation functions like map process entire vector - so they will be too slow.
I have a toy example of what I want to do, but the upd function below uses unsafeThaw and unsafeFreeze - it doesn't seem to be optimized away in the core, and also breaks the promise of not using the buffer further:
module Main where
import Data.Vector.Unboxed as U
import Data.Vector.Unboxed.Mutable as MU
import Control.Monad.ST
upd :: Vector Int -> Int -> Int -> Vector Int
upd v i x = runST $ do
v' <- U.unsafeThaw v
MU.write v' i x
U.unsafeFreeze v'
sum :: Vector Int -> Int
sum = U.sum . (\x -> upd x 0 73) . (\x -> upd x 1 61)
main = print $ Main.sum $ U.fromList [1..3]
I know how to implement imperative algorithms using STVector. In case you are wondering why this alternative approach, I want to try out this approach of using pure vectors to check how GHC transformation of a particular algorithm differs when written using fusible pure vector streams (with monadic operations under the hood of course).
When the algorithm is written using STVector, it doesn't seem to be as nicely iterative as I would like it to be (I guess it is harder for GHC optimizer to spot loops when there is lot of mutability strewn around). So, I am investigating this alternative approach to see I can get a nicer loop in there.
The upd function you have written does not look correct, let alone fusable. Fusion is a library level optimization and requires you to write your code out of certain primatives. In this case what you want is not just fusion, but recycling which can be easily achieved via the bulk update operations such as // and update. These operations will fuse, and even happen in place much of the time.
If you really want to write your own destructive update based code DO NOT use unsafeThaw--use modify
Any function is a fusible update function; you seem to be trying to escape from the programming model the vector package is trying to get you to use
module Main where
import Data.Vector.Unboxed as U
change :: Int -> Int -> Int
change 0 n = 73
change 1 n = 61
change m n = n
myfun2 = U.sum . U.imap change . U.enumFromStepN 1 1
main = print $ myfun2 30000000
-- this doesn't create any vectors much less 'update' them, as you will see if you study the core.

Resources