How can I use clSetKernelArg to set local memory size in an OpenCL Haskell program? - haskell

I'm using the System.GPU.OpenCL module by Luis Cabellos to control an OpenCL kernel.
All is working well but to speed things up I am trying to cache some global memory into a local buffer. I have just noticed that it seems to be impossible to pass a local buffer using the current definition of clSetKernelArg, but perhaps someone can enlighten me?
The definition is,
clSetKernelArg :: Storable a => CLKernel -> CLuint -> a -> IO ()
clSetKernelArg krn idx val = with val $ \pval -> do
whenSuccess (raw_clSetKernelArg krn idx (fromIntegral . sizeOf $ val) (castPtr pval))
$ return ()
where the raw function is defined as,
foreign import CALLCONV "clSetKernelArg" raw_clSetKernelArg ::
CLKernel -> CLuint -> CSize -> Ptr () -> IO CLint
Therefore the high level clSetKernelArg conveniently figures out the size of the memory and also extracts a pointer to it. This is perfect for global memory, but it seems that the way to use clSetKernelArg when local memory is requested is to specify the size of the desired memory in the CSize, and set Ptr to zero. Of course, putting nullPtr here doesn't work, so how can I circumvent this problem? I would call raw_clSetKernelArg directly, but it seems it is not exported by the module.
Thanks.

I don't think there's any way to rig up a hack so that pval ends up being a nullPtr.
This seems like a fairly simple omission from the wrapped API; I'd suggest simply reporting it rather than trying to hack around it :)

Related

Does Data.Map work in a pass-by-value or pass-by-reference way? (Better explanation inside)

I have a recursive function working within the scope of strictly defined interface, so I can't change the function signatures.
The code compiles fine, and even runs fines without error. My problem is that it's a large result set, so it's very hard to test if there is a semantic error.
My primary question is: In a sequence of function calls A to B to A to B to breaking condition, considering the same original Map is passed to all functions until breaking condition, and that some functions only return an Integer, would an insert on a Map in a function that only returns an Integer still be reflected once control is returned to the first function?
primaryFunc :: SuperType -> MyMap -> (Integer, MyMap)
primaryFunc (SubType1 a) mapInstance = do
let returnInt = func1 a mapInstance
(returnInt, mapInstance)
primaryFunc (SubType2 c) mapInstance = do
let returnInt = primaryFunc_nonprefix_SuperType c mapInstance
let returnSuperType = (Const returnInt)
let returnTable = H.insert c returnSuperType mapInstance
(returnInt, returnTable)
primaryFunc (ConstSubType d) mapInstance = do
let returnInt = d
(returnInt, mapInstance)
func1 :: SubType1 -> MyMap -> Integer
func1 oe vt = do
--do stuff with input and map data to get return int
returnInt = primaryFunc
returnInt
func2 :: SubType2 -> MyMap -> Integer
func2 pe vt = do
--do stuff with input and map data to get return int
returnInt = primaryFunc
returnInt
Your question is almost impossibly dense and ambiguous, but it should be possible to answer what you term your "primary" question from the simplest first principles of Haskell:
No Haskell function updates a value (e.g. a map). At most it can return a modified copy of its input.
Outside of the IO monad, no function can have side effects. No function can affect the value of any variable assigned before it was called; all it can do is return a value.
So if you pass a map as a parameter to a function, nothing the function does can alter your existing reference to that value. If you want an updated value, you can only get that from the output of a function to which you have passed the original value as input. New value, new reference.
Because of this, you should have absolute clarity at any depth within your web of functions about which value you are working with. Knowing this, you should be able to answer your own question. Frankly, this is such a fundamental characteristic of Haskell that I am perplexed that you even need to ask.
If a function only returns an integer, then any operations you perform on any values made available to the function can only affect the output - that is, the integer value returned. Nothing done within the function can affect anything else (short of causing the whole program to crash).
So if function A has a reference to a map and it passes this value to function B which returns an int, nothing function B does can affect A's copy of the map. If function B were allowed to secretly alter A's copy of the map, that would be a side effect. Side effects are not allowed.
You need to understand that Haskell does not have variables as you understand them. It has immutable values, references to immutable values and functions (which take inputs and return new outputs). Functions do not have variables which are in scope for other functions which might alter those variables on the fly. That cannot happen.
As an aside, not only does the code you posted show that you do not understand the basics of Haskell syntax, the question you asked shows that you haven't understood the primary characteristics of Haskell as a language. Not only are these fundamentals things which can be understood before having learned any syntax, they are things you need to know to make sense of the syntax.
If you have a deadline, meet it using a tool you do understand. Then go learn Haskell properly.
In addition, you will find that
an insert on a Map in a function that only returns an Integer
is nearly impossible to express. Yes, you can technically do it like in
insert k v map `seq` 42 -- force an insert and throw away the result
but if you think that, for example:
let done = insert k v map in 42
does anything with the map, you're probably wrong.
In no case, however, is the original map altered.

Performance of a function that changes functions

Lets suppose we have a function
type Func = Bool -> SophisticatedData
fun1 :: Func
And we'd like to change this function some input:
change :: SophisticatedData -> Func -> Func
change data func = \input -> if input == False then data else func input
Am I right that after several calls of change (endFunc = change data1 $ change data2 $ startFunc) resulting function would call all intermediate ones each time? Am I right that GC wouldn't able to delete unused data? What is the haskell way to cope with this task?
Thanks.
Well let's start by cleaning up change to be a bit more legible
change sd f input = if input then func input else sd
So when we compose these
change d1 $ change d2 $ change d3
GHC starts by storing a thunk for each of them. Remember that $ is a function to so the whole change d* thing is going to be a thunk at first. Thunks are relatively cheap and if you're not creating 10k or so of them at once you'll be just fine :) so no worries there.
Now the question is, what happens when you start evaluating, the answer is, it'll still not evaluate the complex data, so it's still quite memory efficient, and it only needs to force input to determine which branch it's taking. Because of this, you should never actually fully evaluate SophisticatedData until after choose has run and returned a one to you, then it will be evaluated as need if you use it.
Further more, at each step, GHC can garbage collect the unneeded thunks since they can't be referenced anymore.
In conclusion, you should be just fine. Trust in the laziness
You are correct: if foo is a chain of O(n) calls to change, there will be O(n) overhead on every call to foo. The way to deal with this is to memoize foo:
memoize :: Func -> Func
memoize f = \x -> if x then fTrue else fFalse where
fTrue = f True
fFalse = f False

Convert Ptr () to SourceCompletionProvider

I've solved the prev question partially.
Right now I'm able to register GObject subtype via bindings-gobject (see hpase)
I can implement SourceCompletionProvider using c'g_type_add_interface_static function (but didn't tried yet).
The only issue is to convert Ptr (), returned by c'g_object_newv, to gtk2hs data type SourceCompletionProvider. How can I do it? Any hints?
SourceCompletionProvider is defined like:
newtype SourceCompletionProvider = SourceCompletionProvider (ForeignPtr (SourceCompletionProvider))
What does this definition means? Why it is recursive? Provider is a ForeignPtr to provider -- looks strange for me.
Thanks.
Solution:
makeNewGObject mkGObject $ castPtr <$> c'g_object_newv myObType 0 nullPtr
The outer SourceCompletionProvider is required since this is a newtype, and the inner SourceCompletionProvider is just a marker to distinguish this foreign pointer from pointers to other types. If you look at the definition of Ptr a, it's data Ptr a = Ptr Addr# - a is a phantom type that doesn't appear on the right-hand side, so the definition is not actually recursive.
You can convert a Ptr () to ForeignPtr () using newForeignPtr_ and then cast it to ForeignPtr SourceCompletionProvider with castForeignPtr.
edit: After looking at this a bit more, I think that to make this work you'll need to first convert your Ptr to GObject with the method outlined above and then use unsafeCastGObject. Not tested, though.

How to get good performance when writing a list of integers from 1 to 10 million to a file?

question
I want a program that will write a sequence like,
1
...
10000000
to a file. What's the simplest code one can write, and get decent performance? My intuition is that there is some lack-of-buffering problem. My C code runs at 100 MB/s, whereas by reference the Linux command line utility dd runs at 9 GB/s 3 GB/s (sorry for the imprecision, see comments -- I'm more interested in the big picture orders-of-magnitude though).
One would think this would be a solved problem by now ... i.e. any modern compiler would make it immediate to write such programs that perform reasonably well ...
C code
#include <stdio.h>
int main(int argc, char **argv) {
int len = 10000000;
for (int a = 1; a <= len; a++) {
printf ("%d\n", a);
}
return 0;
}
I'm compiling with clang -O3. A performance skeleton which calls putchar('\n') 8 times gets comparable performance.
Haskell code
A naiive Haskell implementation runs at 13 MiB/sec, compiling with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields. (I haven't recompiled my libraries with -fllvm, perhaps I need to do that.) Code:
import Control.Monad
main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)
My best stab with Haskell runs even slower, at 17 MiB/sec. The problem is I can't find a good way to convert Vector's into ByteString's (perhaps there's a solution using iteratees?).
import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector, Unbox, (!))
writeVector :: (Unbox a, Show a) => Vector a -> IO ()
writeVector v = V.mapM_ (System.IO.putStrLn . show) v
main = writeVector (V.generate 10000000 id)
It seems that writing ByteString's is fast, as demonstrated by this code, writing an equivalent number of characters,
import Data.ByteString.Char8 as B
main = B.putStrLn (B.replicate 76000000 '\n')
This gets 1.3 GB/s, which isn't as fast as dd, but obviously much better.
Some completely unscientific benchmarking first:
All programmes have been compiled with the default optimisation level (-O3 for gcc, -O2 for GHC) and run with
time ./prog > outfile
As a baseline, the C programme took 1.07s to produce a ~76MB (78888897 bytes) file, roughly 70MB/s throughput.
The "naive" Haskell programme (forM [1 .. 10000000] $ \j -> putStrLn (show j)) took 8.64s, about 8.8MB/s.
The same with forM_ instead of forM took 5.64s, about 13.5MB/s.
The ByteString version from dflemstr's answer took 9.13s, about 8.3MB/s.
The Text version from dflemstr's answer took 5.64s, about 13.5MB/s.
The Vector version from the question took 5.54s, about 13.7MB/s.
main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000], where C is Data.ByteString.Char8, took 4.25s, about 17.9MB/s.
putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06s, about 24.8MB/s.
A manual loop,
main = putStr $ go 1
where
go :: Int -> String
go i
| i > 10000000 = ""
| otherwise = shows i . showChar '\n' $ go (i+1)
took 2.32s, about 32.75MB/s.
main = putStrLn $ replicate 78888896 'a' took 1.15s, about 66MB/s.
main = C.putStrLn $ C.replicate 78888896 'a' where C is Data.ByteString.Char8, took 0.143s, about 530MB/s, roughly the same figures for lazy ByteStrings.
What can we learn from that?
First, don't use forM or mapM unless you really want to collect the results. Performancewise, that sucks.
Then, ByteString output can be very fast (10.), but if the construction of the ByteString to output is slow (3.), you end up with slower code than the naive String output.
What's so terrible about 3.? Well, all the involved Strings are very short. So you get a list of
Chunk "1234567" Empty
and between any two such, a Chunk "\n" Empty is put, then the resulting list is concatenated, which means all these Emptys are tossed away when a ... (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That's a lot of wasteful construct-deconstruct-reconstruct going on. Speed comparable to that of the Text and the fixed "naive" String version can be achieved by packing to strict ByteStrings and using fromChunks (and Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons. If you glue the newlines to the Strings, using \k -> shows k "\n" instead of show, the concatenation has to deal with half as many slightly longer ByteStrings, which pays off.
I'm not familiar enough with the internals of either text or vector to offer more than a semi-educated guess concerning the reasons for the observed performance, so I'll leave them out. Suffice it to say that the performance gain is marginal at best compared to the fixed naive String version.
Now, 6. shows that ByteString output is faster than String output, enough that in this case the additional work of packing is more than compensated. However, don't be fooled by that to believe that is always so. If the Strings to pack are long, the packing can take more time than the String output.
But ten million invocations of putStrLn, be it the String or the ByteString version, take a lot of time. It's faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list map show [1 .. 10^7]. Unfortunately, the compiler didn't manage to eliminate that (but it eliminated [1 .. 10^7], that's already pretty good). So let's do it ourselves, leading to 8. That's not too terrible, but still takes more than twice as long as the C programme.
One can make a faster Haskell programme by going low-level and directly filling ByteStrings without going through String via show, but I don't know if the C speed is reachable. Anyway, that low-level code isn't very pretty, so I'll spare you what I have, but sometimes one has to get one's hands dirty if speed matters.
Using lazy byte strings gives you some buffering, because the string will be written instantly and more numbers will only be produced as they are needed. This code shows the basic idea (there might be some optimizations that could be made):
import qualified Data.ByteString.Lazy.Char8 as ByteString
main =
ByteString.putStrLn .
ByteString.intercalate (ByteString.singleton '\n') .
map (ByteString.pack . show) $
([1..10000000] :: [Int])
I still use Strings for the numbers here, which leads to horrible slowdowns. If we switch to the text library instead of the bytestring library, we get access to "native" show functions for ints, and can do this:
import Data.Monoid
import Data.List
import Data.Text.Lazy.IO as Text
import Data.Text.Lazy.Builder as Text
import Data.Text.Lazy.Builder.Int as Text
main :: IO ()
main =
Text.putStrLn .
Text.toLazyText .
mconcat .
intersperse (Text.singleton '\n') .
map Text.decimal $
([1..10000000] :: [Int])
I don't know how you are measuring the "speed" of these programs (with the pv tool?) but I imagine that one of these procedures will be the fastest trivial program you can get.
If you are going for maximum performance, then it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to series of system calls that write chunks of memory to a file.
Lazy bytestrings are good representation for a sequence of chunks of memory. Mapping a lazy bytestring to a series of systems calls that write chunks of memory is what L.hPut is doing (assuming an import qualified Data.ByteString.Lazy as L). Hence, we just need a means to efficiently construct the corresponding lazy bytestring. This is what lazy bytestring builders are good at. With the new bytestring builder (here is the API documentation), the following code does the job.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
import Data.ByteString.Lazy.Builder.ASCII (intDec)
import Data.Foldable (foldMap)
import Data.Monoid (mappend)
import System.IO (openFile, IOMode(..))
main :: IO ()
main = do
h <- openFile "/dev/null" WriteMode
L.hPut h $ toLazyByteString $
foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]
Note that I output to /dev/null to avoid interference by the disk driver. The effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 seconds, which is 12 times faster than the 5.4 seconds of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% speed (220 MB/s) using bounded encodings].
import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E
L.hPut h $ toLazyByteString $
E.encodeListWithB
((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
[1..10000000]
Their syntax looks a bit quirky because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes such that the bound can be computed at compile-time. This allows functions such as E.encodeListWithB to perform some additional optimizations for implementing the actual filling of the buffer. See the the documentation of Data.ByteString.Lazy.Builder.BasicEncoding in the above link to the API documentation (phew, stupid hyperlink limit for new users) for more information.
Here is the source of all my benchmarks.
The conclusion is that we can get very good performance from a declarative solution provided that we understand the cost model of our implementation and use the right datastructures. Whenever constructing a packed sequence of values (e.g., a sequence of bytes represented as a bytestring), then the right datastructure to use is a bytestring Builder.

How to store recursive datatype with Data.Binary

Data.Binary is great. There is just one question I have. Let's imagine I've got a datatype like this:
import Data.Binary
data Ref = Ref {
refName :: String,
refRefs :: [(String, Ref)]
}
instance Binary Ref where
put a = put (refName a) >> put (refRefs a)
get = liftM2 Ref get get
It's easily to see that this is a recursive datatype, which works because Haskell is lazy. Since Haskell as a language uses neither references nor pointers, but presents the data as-is, I am not sure how this is going to be saved. I have the strong indication that this naive reproach will lead to an infinite bytestring...
So how can this type be safely saved?
If your data has no cycles you'll be fine. But a cycle, like
r = Ref "a" [("b", r)]
is indeed going to generate an infinite result. The only way around this is for you to give unique labels to all nodes and use those to avoid cycles when converting to binary.

Resources