I'm trying to learn/evaluate Haskell and I'm struggling to get an efficient executable for a simple case.
The test I'm using is a PRNG sequence (replicating the PCG32 RNG). I've written it as an iteration of a basic state transition function (I'm looking only at the state for now).
{-# LANGUAGE BangPatterns #-}
import System.Environment (getArgs)
import Data.Bits
import Data.Word
iterate' f !x = x : iterate' f (f x)
main = print $ pcg32_rng 100000000
pcg32_random_r :: Word64 -> Word64 -> Word64
pcg32_random_r !i !state = state * (6364136223846793005 :: Word64) + (i .|. 1)
{-# INLINE pcg32_random_r #-}
pcg32_rng_s = iterate' (pcg32_random_r 1) 0
pcg32_rng n = pcg32_rng_s !! (n - 1)
I can get that code to compile and run. It still uses a lot more memory than it should and runs 10x slower than the C equivalent. The main issue seems to be that the iteration is not turned into a simple loop.
What am I missing to get GHC to produce faster / more efficient code here?
EDIT
This is the C version I compare against, which captures the essence of what I'm trying to achieve. I tried for a fair comparison, but let me know if I missed something.
#include <stdio.h>
#include <stdint.h>
int main() {
  uint64_t oldstate, state = 0;  /* initialise to match the Haskell version's starting state */
int i;
for(i=0;i<100000000;i++) {
oldstate = state;
// Advance internal state
state = oldstate * 6364136223846793005ULL + (1|1);
}
  printf("%lu\n", state);
}
I tried initially with the Prelude iterate function, but that results in lazy evaluation and a stack overflow. The iterate' above is aimed at fixing that issue.
My next step was to try to get GHC to inline pcg32_random_r, and that's where I added the strictness annotations, but that doesn't seem to be enough. When I look at the GHC core, it is not inlined.
@WillemVanOnsem I confirm that with perform_n the result is on par with C, and the pcg32_random_r function was indeed inlined. I'm reaching the limit of my grasp of Haskell and GHC at this stage. Can you elaborate on why perform_n performs better and how to decide when to use what?
Would this transformation be feasible automatically by the compiler, or is it something that requires a design decision?
The reason for the last question is that I would like to separate functionality from implementation choices (speed/space tradeoffs, ...) as much as possible to maximize reuse, and I was hoping Haskell would help me there.
It looks to me that the issue is that you produce a list and then obtain the i-th element of that list. As a result, you keep unfolding the list, constructing a new cons cell each time you move one step further.
Instead of constructing such a list (which allocates new nodes and consumes a lot of memory), you can construct a function that applies a given function n times:
perform_n :: (a -> a) -> Int -> a -> a
perform_n !f = step
where step !n !x | n <= 0 = x
| otherwise = step (n-1) (f x)
So now we can perform a function f n times. We can thus rewrite it like:
pcg32_rng n = perform_n (pcg32_random_r 1) (n-1) 0
If I compile this with ghc -O2 file.hs (GHC 8.0.2) and run the resulting executable with time, I get:
$ time ./file
2264354473547460187
0.14user 0.00system 0:00.14elapsed 99%CPU (0avgtext+0avgdata 3408maxresident)k
0inputs+0outputs (0major+161minor)pagefaults 0swaps
the original file produces the following benchmarks:
$ time ./file2
2264354473547460187
0.54user 0.00system 0:00.55elapsed 99%CPU (0avgtext+0avgdata 3912maxresident)k
0inputs+0outputs (0major+287minor)pagefaults 0swaps
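For completeness, here is the perform_n version as a single file (a sketch of my own assembly; the fragments themselves are unchanged, compile with ghc -O2):

{-# LANGUAGE BangPatterns #-}
import Data.Bits
import Data.Word

-- The PCG32 state transition from the question.
pcg32_random_r :: Word64 -> Word64 -> Word64
pcg32_random_r !i !state = state * 6364136223846793005 + (i .|. 1)
{-# INLINE pcg32_random_r #-}

-- Apply f to x, n times, in a strict loop (no intermediate list).
perform_n :: (a -> a) -> Int -> a -> a
perform_n !f = step
  where step !n !x | n <= 0    = x
                   | otherwise = step (n-1) (f x)

pcg32_rng :: Int -> Word64
pcg32_rng n = perform_n (pcg32_random_r 1) (n - 1) 0

main :: IO ()
main = print $ pcg32_rng 100000000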
EDIT:
As @WillNess says, if you do not name the list, it can be garbage collected as it is consumed at runtime: if you walk through a list without keeping a reference to its head, each cell can be collected once you have stepped past it.
If, however, we construct a file like:
{-# LANGUAGE BangPatterns #-}
import System.Environment (getArgs)
import Data.Bits
import Data.Word
iterate' f !x = x : iterate' f (f x)
main = print $ pcg32_rng 100000000
pcg32_random_r :: Word64 -> Word64 -> Word64
pcg32_random_r !i !state = state * (6364136223846793005 :: Word64) + (i .|. 1)
{-# INLINE pcg32_random_r #-}
pcg32_rng n = iterate' (pcg32_random_r 1) 0 !! (n - 1)
we obtain:
$ time ./speedtest3
2264354473547460187
0.54user 0.01system 0:00.56elapsed 99%CPU (0avgtext+0avgdata 3908maxresident)k
0inputs+0outputs (0major+291minor)pagefaults 0swaps
although the memory burden is reduced, there is little impact on time. The reason is probably that walking the list still creates a cons object per element, so we do a lot of packing and unpacking into lists, and all those constructed objects (and memory allocations) still produce overhead.
Related
I'm looking to run a moderately expensive function on a large list of inputs, using part of the output of that function as one of its inputs. The code runs as expected; unfortunately it consumes a large amount of memory in the process (just under 22GiB of total heap allocation, just over 1GiB maximum residency). Here is a simplified example of what I mean:
{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import qualified Data.Text.Lazy.Builder as TB
main :: IO ()
main = TL.putStr $ TB.toLazyText showInts
showInts :: TB.Builder
showInts = foldMap fst shownLines
where
shownLines = map (showInt maxwidth) [0..10^7]
maxwidth = foldl' (\n -> max n . snd) 0 shownLines
showInt :: Int -> Int -> (TB.Builder, Int)
showInt maxwidth n = (builder, len)
where
builder = TB.fromText "This number: "
<> TB.fromText (T.replicate (maxwidth - len) " ") <> thisText
<> TB.singleton '\n'
(thisText, len) = expensiveShow n
expensiveShow :: Int -> (TB.Builder, Int)
expensiveShow n = (TB.fromText text, T.length text)
where text = T.pack (show n)
Note that in the where clause of showInts, showInt takes maxwidth as an argument, where maxwidth itself depends on the output of running showInt maxwidth on the whole list.
If, on the other hand, I do the naïve thing and replace the definition of maxwidth with foldl' max 0 $ map (snd . expensiveShow) [0..10^7], then maximum residency falls to just 44KiB. I would hope that performance like this would be achievable without workarounds like precomputing expensiveShow and then zipping it with the list [0..10^7].
I tried consuming the list strictly (using the foldl package), but this did not improve the situation.
I'm trying to have my cake and eat it too: exploiting laziness, while also making things strict enough that we don't build up a mountain of thunks. Is this possible to do? Or is there a better technique for accomplishing this?
You can't do it like this.
The problem is that your showInts has to traverse the list twice, first to find the longest number, second to print the numbers with the necessary format. That means the list has to be held in memory between the first and second passes. This isn't a problem with unevaluated thunks; it is simply that the whole list, completely evaluated, is being traversed twice.
The only solution is to generate the same list twice. In this case it is trivial; just have two [0..10^7] values, one for the maximum length and the second to format them. I suspect in your real application you are reading them from a file or something, in which case you need to read the file twice.
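To illustrate with the simplified example, a two-pass version might look like this (a sketch; showInt and expensiveShow are unchanged from the question):

-- Two passes over freshly generated lists: maxwidth no longer
-- depends on shownLines, so no list of builders is retained.
showInts :: TB.Builder
showInts = foldMap (fst . showInt maxwidth) [0..10^7]
  where
    maxwidth = foldl' (\n -> max n . snd . expensiveShow) 0 [0..10^7]

This is essentially the "naïve" version mentioned in the question; the point is that each pass consumes its own [0..10^7] lazily, so only an Int accumulator stays live between elements.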
I have a small program that has a reasonable maximum residency but allocates linearly. At first I thought it must be cons cells or I#, but running the program with -p -hc shows the heap overwhelmed by PINNED. Does anyone understand the reason and/or can suggest an improvement?
The program
-- task27.hs
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad
import Control.Monad.ST
import Control.Exception
import System.Random
import Data.Functor
import qualified Data.Vector.Generic.Mutable as V
import qualified Data.Vector.Unboxed as U
m = 120
task27 :: [Int] -> (Int, Int)
task27 l = runST $ do
r <- V.replicate m 0 :: ST s (U.MVector s Int)
let go [] = return (1,2)
go (a:as) = do
let p = a `mod` m
cur_lead <- r `V.read` p
when (a > cur_lead) (V.write r p a)
go as
go l
randTest ::
Int -> -- Length of random testing sequence
IO ()
randTest n =
newStdGen <&>
randoms <&>
take n <&>
task27 >>=
print
main = randTest 1000000
My package.yaml:
name: task27
dependencies:
- base == 4.*
- vector
- random
executables:
task27:
main: task27.hs
ghc-options: -O2
My cabal.project.local:
profiling: True
I run cabal -v0 run task27 -- +RTS -p -hc && hp2ps -e8in -c task27.hp and get a heap profile dominated by PINNED.
I tried to add bangs here and there but that did not seem to help.
As @WillemVanOnsem says, in GHC terms, 35kB resident is minuscule. Whatever performance issue you have, it has nothing to do with this tiny bit of pinned memory. Originally I said that this was probably the Vectors, but that's wrong: Data.Text uses pinned memory, but Data.Vector doesn't. This bit of PINNED memory looks like it's actually from the runtime system itself, so you can ignore it (see below).
In GHC code, "total allocation" is a measure of processing. A GHC program is an allocation engine. If it's not allocating, it's probably not doing anything (with rare exceptions). So, if you expect your algorithm to run in O(n) time, then it will also be O(n) in total allocation, usually gigabytes worth.
With respect to the "rare exceptions", a GHC program can run in constant "total allocation" but non-constant time if aggressive optimization allows computations using fully unboxed values. So, for example:
main = print (sum [1..10000000] :: Int)
runs in constant total allocation (e.g., 50kB allocated on the heap), because the Ints can be unboxed. For comparison,
main = print (sum [1..10000000] :: Integer)
runs with O(n) total allocation (e.g., 320MB allocated on the heap). By the way, try profiling this last program (and bump the count up until it runs long enough to generate a few seconds of profile data). You'll see that it uses the same amount of PINNED memory as your program, and the amount doesn't really change with the upper limit. So, this is just runtime system overhead.
Back to your example... If you are concerned about performance, the culprit is probably System.Random. This is an EXTREMELY slow random number generator. If I run your program with n = 10000000, it takes 4 seconds. If I replace the random number generator with a simple LCG:
randoms :: Word32 -> [Word32]
randoms seed = tail $ iterate lcg seed
where lcg x = (a * x + c)
a = 1664525
c = 1013904223
it runs in 0.2 seconds, about 20 times faster.
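For instance, the test driver might be rewired like this (a sketch; the seed and the fromIntegral conversion are my additions, since task27 consumes [Int], and the System.Random import has to go to avoid a name clash with randoms):

randTest :: Int -> IO ()
randTest n = print $ task27 $ take n $ map fromIntegral (randoms 42)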
In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question), the newest challenge is the efficient streaming and consumption of bits.
As a first simple task I chose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode, for example, an Elias gamma code, i.e. codes that are not chunks of bytes or multiples thereof.
For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bitstream, some monad would probably come into play.
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.
A. Counting bytes
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
bs <- BSL.getContents
print $ BSL.length bs
B. Adding bytes
Once I start using unpack things should get worse.
main = do
bs <- BSL.getContents
print $ sum $ BSL.unpack bs
Surprisingly, Haskell and C show almost the same performance.
C. Longest sequence of identical bits
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)
main :: IO ()
main = do
bs <- BSL.getContents
print $ maximum $ length <$> (group $ bitStream bs)
The lazy bytestring is converted to a list [Word8] and then, using shifts, each Word8 is split into its bits, resulting in a list of [Bool] lists. This list of lists is then flattened with concat. Having obtained a (lazy) list of Bool, use group to split the list into sequences of identical bits and then map length over it. Finally maximum gives the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
This naive implementation is exactly one order of magnitude slower.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
From here I start poking around:
There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package, see here for details.
Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
How about just fold-ing over the ByteString? I'd have to introduce state to keep track of consumed bits. That's not quite the nice take, takeWhile, group, etc. language that is desirable.
And now I'm not quite sure where to go.
Update:
I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S
splitByte :: Word8 -> [Bool]
splitByte w = (\i-> (w `shiftR` i) .&. 1 == 1) <$> [0..7]
bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s
main :: IO ()
main = do
let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
maxLen <- S.maximum $ S.mapped S.length gs
print $ S.fst' maxLen
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b isn't doing us any good and because we're in the IO monad it can't be optimised away.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.
Update 2
Following the advice of @dfeuer, I used the streaming package at master#HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.
Update 3
Cirdec's solution produces performance on par with C. The construct that is used is called "Church encoded lists"; see this SO answer or the Haskell Wiki on rank-N types.
Source files:
All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is that if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b applied to (:) and []) and another function consumes the list with a foldr, then constructing the intermediate list can be removed.
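For intuition, the producer side looks roughly like this (a simplified sketch of the definition in GHC.Base; not to be confused with the build field of ListS below):

-- A producer abstracted over cons and nil (requires RankNTypes).
build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build g = g (:) []

-- The rewrite rule (shown here as a comment) that removes the intermediate list:
-- {-# RULES "foldr/build" forall c n g. foldr c n (build g) = g c n #-}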
For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.
Strict left folds can be created from the foldl' functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' :: (a -> b) -> ListS a -> ListS b
map' f l = ListS (\c z -> build l (\z' a -> c z' (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
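Other list functions follow the same recipe; for example, a filter for these left folds might look like this (my addition, not part of the original answer):

{-# INLINE filter' #-}
filter' :: (a -> Bool) -> ListS a -> ListS a
filter' p l = ListS (\c z -> build l (\z' a -> if p a then c z' a else z') z)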
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
And a ByteString into bits
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where
current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
where
(LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
And we're done
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun $ bitStream' bs
This is much faster, but still not quite the performance of C.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
The program allocates about 1 MB to read 1000000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
Updated github code
I found another solution that is on par with C. Data.Vector.Fusion.Stream.Monadic has a stream implementation based on this Coutts, Leshchinskiy, Stewart 2007 paper. The idea behind it is to use destroy/unfoldr stream fusion.
Recall that list's unfoldr :: (b -> Maybe (a, b)) -> b -> [a] creates a list by repeatedly applying (unfolding) a step-forward function, starting with an initial value. A Stream is just an unfoldr function with starting state. (The Data.Vector.Fusion.Stream.Monadic library uses GADTs to create constructors for Step that can be pattern-matched conveniently. It could just as well be done without GADTs, I think.)
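As a quick reminder of the shape of unfoldr, a toy example:

import Data.List (unfoldr)

-- Counts down from a seed; returning Nothing ends the list.
countdown :: Int -> [Int]
countdown = unfoldr (\n -> if n <= 0 then Nothing else Just (n, n - 1))
-- countdown 5 == [5,4,3,2,1]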
The central piece of the solution is the mkBitstream :: BSL.ByteString -> Stream Bool function that turns a ByteString into a stream of Bool. Basically, we keep track of the current ByteString, the current byte, and how much of the current byte is still unconsumed. Whenever a byte is used up, another byte is chopped off the ByteString. When the ByteString is exhausted (uncons returns Nothing), the stream is Done.
The longestRun function is taken straight from @Cirdec's solution.
Here's the etude:
{-# LANGUAGE CPP #-}
#define PHASE_FUSED [1]
#define PHASE_INNER [0]
#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER
module Main where
import Control.Monad.Identity (Identity)
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.Functor.Identity (runIdentity)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word8 (Word8)
type Stream a = S.Stream Identity a -- no need for any monad, really
data Step = Step BSL.ByteString !Word8 !Word8 -- could use tuples, but this is faster
mkBitstream :: BSL.ByteString -> Stream Bool
mkBitstream bs' = S.Stream step (Step bs' 0 0) where
{-# INLINE_INNER step #-}
step (Step bs w n) | n==0 = case (BSL.uncons bs) of
Nothing -> return S.Done
Just (w', bs') -> return $
S.Yield (w' .&. 1 == 1) (Step bs' (w' `shiftR` 1) 7)
| otherwise = return $
S.Yield (w .&. 1 == 1) (Step bs (w `shiftR` 1) (n-1))
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: Stream Bool -> Int
longestRun s = runIdentity $ do
(LongestRun _ _ longest) <- S.foldl' extendRun (LongestRun False 0 0) s
return longest
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun (mkBitstream bs)
I have a problem with Haskell module optimization.
There is a Main module.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.DeepSeq
import Formatting
import Formatting.Clock
import System.Clock
import Data.Array
size :: Int
size = 200 :: Int
stdMult :: (Ix a, Ix b, Ix c, Num d) =>
Array (a,b) d -> Array (b,c) d -> Array (a,c) d
stdMult x y = array resultBounds
[((i,j), sum [ x!(i,k) * y!(k,j) | k <- range (lj,uj)])
| i <- range (li,ui),
j <- range (lj',uj') ]
where ((li,lj),(ui,uj)) = bounds x
((li',lj'),(ui',uj')) = bounds y
resultBounds
| (lj,uj)==(li',ui') = ((li,lj'),(ui,uj'))
| otherwise = error "error"
main :: IO ()
main = do
let a = array ((1,1),(size, size)) [((i,j), 2*i-j) |
i <- range (1,size),
j <- range (1,size)]
let b = array ((1,1),(size, size)) [((i,j), 2*i+3*j) |
                 i <- range (1,size),
j <- range (1,size)]
start <- getTime ProcessCPUTime
let
c = stdMult a b
end <- c `deepseq` getTime ProcessCPUTime
fprint (timeSpecs % "\n") start end
  return ()
When stdMult is in the Main module, everything works fine. Then I move stdMult to another module.
When I don't use GHC optimization, the execution time is the same.
When I use the GHC option -O3, the execution time decreases if stdMult is in the Main module, but if stdMult is in another module the execution time is almost unchanged!
For example, with stdMult in Main I get 3 seconds, and with stdMult outside Main I get 30 seconds, for a 500x500 matrix.
It is very strange!
(You need the clock and formatting packages from Hackage to compile the code.)
I can reproduce the 10x slowdown when stdMult is in a different module. Luckily a fix is easy: in the module where stdMult is defined, add an INLINABLE pragma:
{-# INLINABLE stdMult #-}
It adds the definition to the interface file (.hi), which allows inlining in the modules that use it, which in turn allows it to be specialized to fast machine Int instead of slow abstract Ix and Num polymorphic code. (If it's in the same module, GHC can inline and specialize at will; things aren't INLINABLE by default because that can cause executable code bloat and slower compilation.)
Alternatively to INLINABLE, you can manually SPECIALIZE to the types you want optimized implementations for. This is a bit more verbose, but should be faster to compile in big projects (it will be specialized once per export, instead of once per import, at a rough guess).
{-# SPECIALIZE stdMult :: Array (Int, Int) Int -> Array (Int, Int) Int -> Array (Int, Int) Int #-}
I am trying to write a solution for one of the HackerRank problems. The challenge is to count the elements of a list; the elements vary from 0 to 99, so it is possible to count them in linear time. Here is what I got:
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O3 #-}
module Main where
import Data.STRef
import Data.Foldable
import Control.Monad
import Control.Monad.ST
main = do
line1 <- getLine
line2 <- getLine
let
!ns = map read $ words line2 :: [Int]
res = runST $ do
refs <- forM [0..99] $ \i ->
newSTRef (0 :: Int)
traverse_ (\x -> modifySTRef' (refs !! x) (+1) ) ns
mapM (\ref -> readSTRef ref) refs
putStrLn . unwords . map show $ res
This code works, but it is not fast enough to pass the last test case. Can someone recommend an improvement to it? (link to the problem)
This can be done as a one-liner using accumArray from Data.Array. Something like accumArray (+) 0 (0,99) . zip values $ repeat 1 where values is the input.
It appears to still not be fast enough, which is somewhat vexing. accumArray is more or less as efficient as possible for what it does. Testing on my system reveals the time for processing 1,000,000 input values to be about 1 second, even without compiling it, and that time is dominated by generating the random inputs. That's a far cry from the 5 seconds on the test site. I have to wonder how overloaded that system is.
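Spelled out, a full program using the one-liner might look like this (a sketch; the input parsing mirrors the original program and is my assumption about the problem's format):

module Main where

import Data.Array (Array, accumArray, elems)

main :: IO ()
main = do
  _     <- getLine  -- the element count on the first line is not needed
  line2 <- getLine
  let ns     = map read (words line2) :: [Int]
      -- Each occurrence of v contributes 1 to slot v; (+) accumulates the counts.
      counts = accumArray (+) 0 (0, 99) (zip ns (repeat 1)) :: Array Int Int
  putStrLn . unwords . map show . elems $ counts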
One problem you have is that you're looking up your STRefs in a list, which means you'll have to traverse O(n) steps for every lookup and modification. This can be alleviated by using something like Data.Map.Map, which has O(log n) lookup and modification time.
You could also use a mutable Array or Vector for O(1) lookup/modification time in the ST monad. This is probably the fastest method.
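A sketch of that approach (names are mine; it assumes, as in the original, that all values lie in [0..99]):

import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

-- One mutable slot per possible value; O(1) per input element.
countOccurrences :: [Int] -> U.Vector Int
countOccurrences ns = runST $ do
  counts <- M.replicate 100 (0 :: Int)
  mapM_ (M.modify counts (+1)) ns
  U.freeze counts

The result can then be printed with putStrLn . unwords . map show . U.toList, matching the original output format.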