Accelerate code passes intepreter but fails under CUDA

Accelerate code passes intepreter but fails under CUDA - haskell

I have been trying to write a function in that will take a histogram of a vector using the accelerate library. I recognize that histograms aren't the idea case for GPU processing, but I'm generating a fairly large dataset from a small seed and it would be nice if it could be reduced to a few kilobyte array before transferring it back to main memory.
The code that I've come up with is below. It takes a number of output bins then then creates a new array where the values of a[x] is the number of occurrences of x in xs
hist :: A.Exp Int -> A.Acc (A.Vector Int) -> A.Acc (A.Vector Int)
hist bins xs = A.permute
(const (+1))
(A.fill (A.index1 bins) 0)
(A.index1 . (xs A.!))
xs
The code appears to run properly under the Accelerate interpreter. However, if I try to call it through accelerate-cuda, I get the following error message.
./Data/Array/Accelerate/CUDA/State.hs:85:9: (unhandled): CUDA Exception: unspecified launch failure
My question is two-fold. First, what am I doing that causes CUDA to fail? Second, is there a better way to take a histogram through Accelerate?

This was a bug in Accelerate (and/or underlying change in CUDA) which has now been fixed. Apologies for taking so long to get to it, this slipped off my radar.

Related

Convert function to exploit parallelization of the GPU

I have a function that uses values stored in one array to operate on another array. This behaves similar to the numpy.hist function. For example:
import numpy as np
from numba import jit
#jit(nopython=True)
def array_func(x, y, output_counts, output_weights):
for row in range(x.size):
col = int(x[row] * 10)
output_counts[col] += 1
output_weights[col] += y[row]
return (output_counts, output_weights)
# in the current code these arrays exists ad pytorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for arrays it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current function requires me to continually pass the data from GPU to CPU, followed by the CPU function being run in series.
Can this function be converted to run in parallel on the GPU?
##EDIT##
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors then there is no benefit to running on the GPU (it actually kills the performance). I wonder if there is a way I can break of x so each value gets sent to different threads (similar to how apply_async does within multiprocessing)?
I'm open to other options.
In it's current form the function is fast, but the GPU-to-CPU data transfer is killing me.

Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU regarding the number of items to scan, the size of the histogram and the distribution of the values.
For example, one solution consist in building local histograms in each separate kernel blocks and then perform a reduction. However, this solution is not well suited in your case since len(x) / len(y) is relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solutions only scale well if there is no atomic conflicts which is dependent of the actual input data. Indeed, if all value of x are equal, then all updates will be serialized which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such a case is frequent on small histograms but assuming the distribution is close to uniform, this can be fine.
This operation can be done with Numba using CUDA (targetting Nvidia GPUs). Here is an example of kernel solving your problem:
#cuda.jit
def array_func(x, y, output_counts, output_weights):
tx = cuda.threadIdx.x # Thread id in a 1D block
ty = cuda.blockIdx.x # Block id in a 1D grid
bw = cuda.blockDim.x # Block width, i.e. number of threads per block
pos = tx + ty * bw # Compute flattened index inside the array
if pos < x.size:
col = int(x[pos] * 10)
cuda.atomic.add(output_counts, col, 1)
cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can possibly be directly created on the GPU so to avoid transfers. x and y should be on the GPU for better performance (otherwise a CPU reduction will be certainly faster). Also note that the kernel should be pretty fast so the overhead to run/wait it and allocate/free temporary array may be significant and even possibly slower than the kernel itself (but certainly faster than doing a double transfer from/to the CPU so to compute things on the CPU assuming data was on the GPU). Note also that such atomic accesses are only fast on quite recent Nvidia GPU that benefit from specific computing units for atomic operations.

Heap profiling in Haskell. Am I having space leaks?

I am writing a small snake game in Haskell as sort of a guided tutorial for beginners. The "rendering" just takes a Board and produces a Data.ByteString.Builder which is printed in the terminal. (the html profiles are pushed to the repo, you can inspect them without compiling the programm)
The problem
The problem I have is that the heap profiling looks weird: There are many spikes, and suddenly Builder, PAP and BuildStep take as same memory as the rest of the program. Considering that rendering is happenning 10 times in a second (i.e. every second we produce 10 builders), it seems inconsistent that every once in a while the builder just takes that much memory. I don't know if this is considered an space leak, since there is no thunks in the profile, but the PAP doesn't look right (I don't know...)
Implementation
The board is represented as an inmutable array of builders indexed by coordinaates (tuples) type Board = Array (Int, Int) Builder (essentialy, what should be printed in each coordinate). The function which converts the board into a builder is the expected strict fold which handle new lines using height and width of the board.
toBuilder :: RenderState -> Builder
-- |- The Array (Int, Int) Builder
toBuilder (RenderState b binf#(BoardInfo h w) gOver s) =
-- ^^^ height and width
if gOver
then ppScore s <> fst (boardToString $ emptyGrid binf) -- Not interesting. Case of game over print build an empty grid
else ppScore s <> fst (boardToString b) -- print the current board
where
boardToString = foldl' fprint (mempty, 0) -- concatenate builders and count the number, such that when #width builders have been concatenated, add a new line.
fprint (!s, !i) cell =
if ((i + 1) `mod` w) == 0
then (s <> cell <> B.charUtf8 '\n', i + 1 )
else (s <> cell , i + 1)
Up to the .prof file this function take most of the time and space (92%, which is expected). Moreover, this is the only part of the code that produces a big builder, so the problem should be here.
The buffering mode
The above profile happens when BufferMode is set to LineBuffering (default), but interestingly if I change it to NoBuffering then the profile looks the same but a thunk appears and the builder disappear...
The questions
I have reached a point which I don't know whats going on, hence my questions are a little bit vague:
Is my code with line buffering (first profile) actually leaking? No thunk appears but the PAP eating so much memory looks like a warning
The second profile clearly(?) leaks, is there an standard way to inspect which part of the code is producing the thunk?
Am I completely missing something, and actually the profile looks fine?

In case anyone is interested, I think I've found the problem. It is the terminal speed... If I run an smaller board size or a slower rendering time (the picture is for a 50x70 board with 10 renders a second), then the memory usage is completely normal.
What I think is happening, is that the board is printed into the console using B.hPutBuilder stdout, this action takes shorter than the console to actually print it, so the haskell thread continues and creates another board which should wait to be printed because the console is busy. I guess this leads to some how, two boards living in memory for a short time.
Other guesses are welcome!

Haskell 'count occurrences' function

I implemented a count function in Haskell and I am wondering will this behave badly on large lists :
count :: Eq a => a -> [a] -> Int
count x = length . filter (==x)
I believe the length function runs in linear time, is this correct?
Edit: Refactor suggested by #Higemaru

Length runs in linear time to the size of the list, yes.
Normally, you would be worried that your code had to take two passes through the list: first one to filter and then one to count the length of the resulting list. However, I believe this does not happen here because filter is not strict on the structure of the list. Instead, the length function forces the elements of the filtered list as it goes along, doing the actual count in one pass.

I think you can make it slightly shorter
count :: Eq a => a -> [a] -> Int
count x = length . filter (x==)
(I would have written a (lowly) comment if I could)

That really depends on the list. For a normal, lazily evaluated list of Ints on my computer, I see this function running in about 2 seconds for 10^9 elements, 0.2 seconds for 10^8, and 0.3 seconds for 10^7, so it appears to run in linear time. You can check this yourself by passing the flags +RTS -s -RTS to your executable when running it from the command line.
I also tried running it with more cores, but it doesn't seem to do anything but increase the memory usage a bit.
An added bonus of the lazy computation is that you only make a single pass over the list. filter and length get turned into a single loop by the compiler (with optimizations turned on), so you save memory and efficiency.

Basic I/O performance in Haskell

Another microbenchmark: Why is this "loop" (compiled with ghc -O2 -fllvm, 7.4.1, Linux 64bit 3.2 kernel, redirected to /dev/null)
mapM_ print [1..100000000]
about 5x slower than a simple for-cycle in plain C with write(2) non-buffered syscall? I am trying to gather Haskell gotchas.
Even this slow C solution is much faster than Haskell
int i;
char buf[16];
for (i=0; i<=100000000; i++) {
sprintf(buf, "%d\n", i);
write(1, buf, strlen(buf));
}

Okay, on my box the C code, compiled per gcc -O3 takes about 21.5 seconds to run, the original Haskell code about 56 seconds. So not a factor of 5, a bit above 2.5.
The first nontrivial difference is that
mapM_ print [1..100000000]
uses Integers, that's a bit slower because it involves a check upfront, and then works with boxed Ints, while the Show instance of Int does the conversion work on unboxed Int#s.
Adding a type signature, so that the Haskell code works on Ints,
mapM_ print [1 :: Int .. 100000000]
brings the time down to 47 seconds, a bit above twice the time the C code takes.
Now, another big difference is that show produces a linked list of Char and doesn't just fill a contiguous buffer of bytes. That is slower too.
Then that linked list of Chars is used to fill a byte buffer that then is written to the stdout handle.
So, the Haskell code does more, and more complicated things than the C code, thus it's not surprising that it takes longer.
Admittedly, it would be desirable to have an easy way to output such things more directly (and hence faster). However, the proper way to handle it is to use a more suitable algorithm (that applies to C too). A simple change to
putStr . unlines $ map show [0 :: Int .. 100000000]
almost halves the time taken, and if one wants it really fast, one uses the faster ByteString I/O and builds the output efficiently as exemplified in applicative's answer.

On my (rather slow and outdated) machine the results are:
$ time haskell-test > haskell-out.txt
real 1m57.497s
user 1m47.759s
sys 0m9.369s
$ time c-test > c-out.txt
real 7m28.792s
user 1m9.072s
sys 6m13.923s
$ diff haskell-out.txt c-out.txt
$
(I have fixed the list so that both C and Haskell start with 0).
Yes you read this right. Haskell is several times faster than C. Or rather, normally buffered Haskell is faster than C with write(2) non-buffered syscall.
(When measuring output to /dev/null instead of a real disk file, C is about 1.5 times faster, but who cares about /dev/null performance?)
Technical data: Intel E2140 CPU, 2 cores, 1.6 GHz, 1M cache, Gentoo Linux, gcc4.6.1, ghc7.6.1.

The standard Haskell way to hand giant bytestrings over to the operating system is to use a builder monoid.
import Data.ByteString.Lazy.Builder -- requires bytestring-0.10.x
import Data.ByteString.Lazy.Builder.ASCII -- omit for bytestring-0.10.2.x
import Data.Monoid
import System.IO
main = hPutBuilder stdout $ build [0..100000000::Int]
build = foldr add_line mempty
where add_line n b = intDec n <> charUtf8 '\n' <> b
which gives me:
$ time ./printbuilder >> /dev/null
real 0m7.032s
user 0m6.603s
sys 0m0.398s
in contrast to Haskell approach you used
$ time ./print >> /dev/null
real 1m0.143s
user 0m58.349s
sys 0m1.032s
That is, it's child's play to do nine times better than mapM_ print, contra Daniel Fischer's suprising defeatism. Everything you need to know is here: http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html I won't compare it with your C since my results were much slower than Daniel's and n.m. so I figure something was going wrong.
Edit: Made the imports consistent with all versions of bytestring-0.10.x It occurred to me the following might be clearer -- the Builder equivalent of unlines . map show:
main = hPutBuilder stdout $ unlines_ $ map intDec [0..100000000::Int]
where unlines_ = mconcat . map (<> charUtf8 '\n')

No error message in Haskell

Just out of curiosity, I made a simple script to check speed and memory efficiency of constructing a list in Haskell:
wasteMem :: Int -> [Int]
wasteMem 0 = [199]
wasteMem x = (12432483483467856487256348746328761:wasteMem (x-1))
main = do
putStrLn("hello")
putStrLn(show (wasteMem 10000000000000000000000000000000000))
The strange thing is, when I tried this, it didn't run out of memory or stack space, it only prints [199], the same as running wasteMem 0. It doesn't even print an error message... why? Entering this large number in ghci just prints the number, so I don't think it's a rounding or reading error.

Your program is using a number greater than maxBound :: Int32. This means it will behave differently on different platforms. For GHC x86_64 Int is 64 bits (32 bits otherwise, but the Haskell report only promises 29 bits). This means your absurdly large value (1x10^34) is represented as 4003012203950112768 for me and zero for you 32-bit folks:
GHCI> 10000000000000000000000000000000000 :: Int
4003012203950112768
GHCI> 10000000000000000000000000000000000 :: Data.Int.Int32
0
This could be made platform independent by either using a fixed-size type (ex: from Data.Word or Data.Int) or using Integer.
All that said, this is a poorly conceived test to begin with. Haskell is lazy, so the amount of memory consumed by wastedMem n for any value n is minimal - it's just a thunk. Once you try to show this result it will grab elements off the list one at a time - first generating "[12432483483467856487256348746328761, and leaving the rest of the list as a thunk. The first value can be garbage collected before the second value is even considered (a constant-space program).

Adding to Thomas' answer, if you really want to waste space, you have to perform an operation on the list, which needs the whole list in memory at once. One such operation is sorting:
print . sort . wasteMem $ (2^16)
Also note that it's almost impossible to estimate the run-time memory usage of your list. If you want a more predictable memory benchmark, create an unboxed array instead of a list. This also doesn't require any complicated operation to ensure that everything stays in memory. Indexing a single element in an array already makes sure that the array is in memory at least once.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string