Haskell threads heap overflow despite only 22 MB total memory usage?

I am trying to parallelize a ray tracer. This means I have a very long list of small computations. The vanilla program renders a specific scene in 67.98 seconds, with 13 MB total memory in use and 99.2% productivity.
In my first attempt I used the parallel strategy parBuffer with a buffer size of 50. I chose parBuffer because it walks through the list only as fast as sparks are consumed, and does not force the spine of the list the way parList does, which would use a lot of memory since the list is very long. With -N2 it ran in 100.46 seconds, with 14 MB total memory in use and 97.8% productivity. The spark information is: SPARKS: 480000 (476469 converted, 0 overflowed, 0 dud, 161 GC'd, 3370 fizzled)
The proportion of fizzled sparks suggested that the granularity of the sparks was too small, so next I tried the strategy parListChunk, which splits the list into chunks and creates a spark for each chunk. I got the best results with a chunk size of 0.25 * imageWidth. The program ran in 93.43 seconds, with 236 MB total memory in use and 97.3% productivity. The spark information is: SPARKS: 2400 (2400 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled). I believe the much greater memory use is because parListChunk forces the spine of the list.
Then I wrote my own strategy that lazily divides the list into chunks, passes the chunks to parBuffer, and concatenates the results:
concat $ withStrategy (parBuffer 40 rdeepseq) (chunksOf 100 (map colorPixel pixels))
This ran in 95.99 seconds, with 22 MB total memory in use and 98.8% productivity. This was successful in the sense that all the sparks are being converted and the memory usage is much lower; however, the speed is not improved. Here is an image of part of the eventlog profile.
As you can see, the threads are being stopped due to heap overflows. I tried adding +RTS -M1G, which raises the maximum heap size to 1 GB. The results did not change. I read that the Haskell main thread will use memory from the heap if its stack overflows, so I also tried increasing the default stack size with +RTS -M1G -K1G, but this had no impact either.
Is there anything else I can try? I can post more detailed profiling info for memory usage or the eventlog if needed; I did not include it all because it is a lot of information and I did not think all of it was necessary.
EDIT: I was reading about the Haskell RTS multicore support, which describes a HEC (Haskell Execution Context) for each core. Each HEC contains, among other things, an allocation area (which is part of a single shared heap). Whenever any HEC's allocation area is exhausted, a garbage collection must be performed. There appears to be an RTS option to control its size, -A. I tried -A32M but saw no difference.
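For reference, these RTS options can be combined into a single invocation like this (the binary name here is hypothetical):
# ./raytracer +RTS -N2 -A32M -M1G -s -RTS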
EDIT2:
Here is a link to a github repo dedicated to this question. I have included the profiling results in the profiling folder.
EDIT3: Here is the relevant bit of code:
render :: [([(Float,Float)],[(Float,Float)])] -> World -> [Color]
render grids world = cs
  where
    ps = [ (i,j) | j <- reverse [0..wImgHt world - 1], i <- [0..wImgWd world - 1] ]
    cs = map (colorPixel world) (zip ps grids)
    -- cs = withStrategy (parListChunk (round (wImgWd world)) rdeepseq) (map (colorPixel world) (zip ps grids))
    -- cs = withStrategy (parBuffer 16 rdeepseq) (map (colorPixel world) (zip ps grids))
    -- cs = concat $ withStrategy (parBuffer 40 rdeepseq) (chunksOf 100 (map (colorPixel world) (zip ps grids)))
The grids are random floats that are precomputed and used by colorPixel. The type of colorPixel is:
colorPixel :: World -> ((Float,Float),([(Float,Float)],[(Float,Float)])) -> Color
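For reference, the hand-rolled chunking strategy above can be packaged as a reusable helper. A minimal sketch (parChunkBuffer is my own name for it; chunksOf comes from the split package):

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parBuffer, rdeepseq, withStrategy)
import Data.List.Split (chunksOf)

-- Map f over the list, evaluating chunks of chunkSize elements in
-- parallel through a parBuffer of bufSize sparks, then flatten.
parChunkBuffer :: NFData b => Int -> Int -> (a -> b) -> [a] -> [b]
parChunkBuffer bufSize chunkSize f =
    concat
    . withStrategy (parBuffer bufSize rdeepseq)
    . map (map f)
    . chunksOf chunkSize

With bufSize = 40 and chunkSize = 100 this computes the same result as the expression above, since mapping before or after chunking is equivalent.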

Not a solution to your problem, but a hint at the cause:
Haskell seems to be very conservative about memory reuse, and when the runtime sees the potential to reclaim a memory block, it goes for it. Your problem description fits the minor GC behavior described at the bottom of
https://wiki.haskell.org/GHC/Memory_Management:
    New data are allocated in a 512 KB "nursery". Once it's exhausted, a "minor GC" occurs - it scans the nursery and frees unused values.
So if you chop the data into smaller chunks, you enable the engine to do the cleanup earlier - GC kicks in sooner.

Related

Making sense from GHC profiler

I'm trying to make sense of the GHC profiler. There is a rather simple app, which uses the wreq and lens-aeson libraries, and while learning about GHC profiling, I decided to play with it a bit.
Using different options (the time tool, +RTS -p -RTS, and +RTS -p -h) I acquired entirely different numbers for my memory usage. Having all those numbers, I'm now completely lost trying to understand what is going on and how much memory the app actually uses.
This situation reminds me of the phrase by Arthur Bloch: "A man with a watch knows what time it is. A man with two watches is never sure."
Can you please suggest how I should read all those numbers, and what each of them means?
Here are the numbers:
time -l reports around 19M
# /usr/bin/time -l ./simple-wreq
...
3.02 real 0.39 user 0.17 sys
19070976 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21040 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
71 messages sent
71 messages received
2991 signals received
43 voluntary context switches
6490 involuntary context switches
Using the +RTS -p -RTS flag reports around 92M. Although it says "total alloc", it seems strange to me that a simple app like this one could allocate and release 91M:
# ./simple-wreq +RTS -p -RTS
# cat simple-wreq.prof
Fri Oct 14 15:08 2016 Time and Allocation Profiling Report (Final)
simple-wreq +RTS -N -p -RTS
total time = 0.07 secs (69 ticks @ 1000 us, 1 processor)
total alloc = 91,905,888 bytes (excludes profiling overheads)
COST CENTRE                              MODULE                           %time %alloc
main.g                                   Main                              60.9   88.8
MAIN                                     MAIN                              24.6    2.5
decodeLenient/look                       Data.ByteString.Base64.Internal    5.8    2.6
decodeLenientWithTable/fill              Data.ByteString.Base64.Internal    2.9    0.1
decodeLenientWithTable.\.\.fill          Data.ByteString.Base64.Internal    1.4    0.0
decodeLenientWithTable.\.\.fill.\        Data.ByteString.Base64.Internal    1.4    0.1
decodeLenientWithTable.\.\.fill.\.\.\.\  Data.ByteString.Base64.Internal    1.4    3.3
decodeLenient                            Data.ByteString.Base64.Lazy        1.4    1.4
                                                             individual      inherited
COST CENTRE         MODULE                          no.  entries  %time %alloc  %time %alloc
MAIN                MAIN                            443  0         24.6    2.5  100.0  100.0
 main               Main                            887  0          0.0    0.0   75.4   97.4
  main.g            Main                            889  0         60.9   88.8   75.4   97.4
   object_          Data.Aeson.Parser.Internal      925  0          0.0    0.0    0.0    0.2
    jstring_        Data.Aeson.Parser.Internal      927  50         0.0    0.2    0.0    0.2
   unstream/resize  Data.Text.Internal.Fusion       923  600        0.0    0.3    0.0    0.3
  decodeLenient     Data.ByteString.Base64.Lazy     891  0          1.4    1.4   14.5    8.1
   decodeLenient    Data.ByteString.Base64          897  500        0.0    0.0   13.0    6.7
....
+RTS -p -h and hp2ps show me the following picture and two numbers: 114K in the header and something around 1.8 MB on the graph.
And, just in case, here is the app:
module Main where

import Network.Wreq
import Control.Lens
import Data.Aeson.Lens
import Control.Monad

main :: IO ()
main = replicateM_ 10 g
  where
    g = do
        r <- get "http://httpbin.org/get"
        print $ r ^. responseBody
                   . key "headers"
                   . key "User-Agent"
                   . _String
UPDATE 1: Thanks everyone for the incredibly good responses. As suggested, I am adding the +RTS -s output, so the entire picture builds up for everyone who reads it.
# ./simple-wreq +RTS -s
...
128,875,432 bytes allocated in the heap
32,414,616 bytes copied during GC
2,394,888 bytes maximum residency (16 sample(s))
355,192 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
                                      Tot time (elapsed)  Avg pause  Max pause
  Gen  0       194 colls,     0 par     0.018s    0.022s    0.0001s    0.0022s
  Gen  1        16 colls,     0 par     0.027s    0.031s    0.0019s    0.0042s
UPDATE 2: The size of the executable:
# du -h simple-wreq
63M simple-wreq
A man with a watch knows what time it is. A man with two watches is never sure.
Ah, but what do the two watches show? Are both meant to show the current time in UTC? Or is one of them supposed to show the time in UTC, and the other one the time at a certain point on Mars? As long as they are in sync, the second scenario wouldn't be a problem, right?
And that is exactly what is happening here. You are comparing different memory measurements:
the maximum residency
the total amount of allocated memory
The maximum residency is the highest amount of memory your program ever uses at a given time. That's 19 MB. However, the total amount of allocated memory is a lot more, because that's how GHC works: it "allocates" memory for objects that are garbage collected later, which is almost everything that's not unpacked.
Let us inspect a C example for this:
#include <stdlib.h>

int main() {
    int i;
    char *mem;
    for (i = 0; i < 5; ++i) {
        mem = malloc(19 * 1000 * 1000);
        free(mem);
    }
    return 0;
}
Whenever we call malloc, we allocate 19 megabytes of memory and free it immediately afterwards. The highest amount of memory we ever hold at one point is therefore 19 megabytes (plus a little more for the stack and the program itself).
However, in total we allocate 5 * 19M = 95M. Still, we could run our little program just fine with 20 megabytes of RAM. That's the difference between total allocated memory and maximum residency. Note that the residency reported by time is always at least du <executable>, since the executable has to reside in memory too.
That being said, the easiest way to generate statistics is -s, which shows the maximum residency from the Haskell program's point of view. In your case, it will be the 1.9M, the number in your heap profile (or double that amount, due to profiling). And yeah, Haskell executables tend to get extremely large, since libraries are statically linked.
time -l displays the (resident, i.e. not swapped out) size of the process as seen by the operating system. This includes twice the maximum size of the Haskell heap (due to the way GHC's GC works), plus anything else allocated by the RTS or other C libraries, plus the code of your executable itself and the libraries it depends on, etc. I'm guessing the primary contributor to the 19M here is the size of your executable.
total alloc is the total amount allocated onto the Haskell heap. It is not at all a measure of maximum heap size (which is what people usually mean by "how much memory is my program using"). Allocation is very cheap, and allocation rates of around 1 GB/s are typical for a Haskell program.
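To see the cheap-allocation point in isolation, here is a minimal sketch (not your app): compiled with ghc -O2 and run with +RTS -s, it reports hundreds of megabytes of total allocation while its maximum residency stays tiny, because every list cell dies young.

-- Allocates an Integer cons cell per step; all of it is collected
-- almost immediately, so residency stays small.
main :: IO ()
main = print (sum [1 .. 10000000 :: Integer])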
The number in the header of the hp2ps output "114,272 bytes x seconds" is something completely different again: it is the integral of the graph, and is measured in bytes * seconds, not in bytes. For example if your program holds onto a 10 MB structure for 4 seconds then that will cause this number to increase by 40 MB*s.
The number around 1.8 MB shown in the graph is the actual maximum size of the Haskell heap, which is probably the number you're most interested in.
You've omitted the most useful source of numbers about your program's execution, which is running it with +RTS -s (this doesn't even require it to have been built with profiling).

yield loss with OpenMP in Fortran

First of all, sorry if I make grammatical mistakes; I'm not a native English speaker.
I'm trying to reduce the loss of speedup that occurs as I increase the number of OpenMP threads in Fortran.
I'm using two Intel Xeon X5650s (12 physical cores) with 96 GB of RAM.
The best results I've obtained are the following:
1 thread -> 15.50 s; 2 threads -> 8.10 s; 4 threads -> 4.42 s; 8 threads -> 2.81 s; 12 threads -> 2.43 s
As you can see, the improvement shrinks the more threads I run.
Here's the code:
allocate(SUM(PUNTOST,PUNTOSP,NUM_DATA,2))
allocate(SUMATORIO(1))
allocate(SUMATORIO(1)%REGION(REGIONS))
DO i = 1, REGIONS
    allocate(SUMATORIO(1)%REGION(i)%VALOR(2,PUNTOSP,PUNTOST))
    SUMATORIO(1)%REGION(i)%VALOR = cmplx(0.0,0.0)
END DO
allocate(valor_aux(2,PUNTOSP,PUNTOST))
!...
call SYSTEM_CLOCK(counti,count_rate)
!$OMP PARALLEL NUM_THREADS(THREADS) DEFAULT(PRIVATE) FIRSTPRIVATE(REGIONS) &
!$OMP SHARED(SUMATORIO,SUM,PP,VEC_1,VEC_2,IDENT,TIPO,MUESTRA,PUNTOST,PUNTOSP)
!$OMP DO SCHEDULE(DYNAMIC,8)
DO i = 1, REGIONS
    INDICE = VEC_1(i)
    valor_aux = cmplx(0.0,0.0)
    DO j = 1, VEC_2(i)
        ii = IDENT(INDICE+1)
        INDICE = INDICE + 1
        IF (TIPO(ii) .ne. 4) THEN
            j1 = MUESTRA(ii)
            DO I1 = 1, PUNTOST
                DO I2 = 1, PUNTOSP
                    valor_aux(1,I2,I1) = valor_aux(1,I2,I1) + SUM(I1,I2,J1,1)*PP(ii)
                    valor_aux(2,I2,I1) = valor_aux(2,I2,I1) + SUM(I1,I2,J1,2)*PP(ii)
                END DO
            END DO
        END IF
    END DO
    SUMATORIO(1)%REGION(i)%VALOR = valor_aux
END DO
!$OMP END DO
!$OMP END PARALLEL
call SYSTEM_CLOCK(countf)
dt = REAL(countf-counti)/REAL(count_rate)
write(*,*) 'FASE_1: Time: ', dt, ' seconds'
Some points to know:
All data types are COMPLEX, except for the loop indices
NUM_DATA = 14000000
REGIONS = 1000000
Values contained in VEC_2 are between 10 and 20
PUNTOST = 21
PUNTOSP = 20
All the allocated memory consumes about 60 GB of RAM
I've tried changing the dimensions of the matrices to avoid excessive cache misses (SUM(2,PUNTOSP,PUNTOST,NUM_DATA), for example), but this is the layout that gives me the best performance. I don't know the reason, because most documents I've read say you should try to make memory access sequential so the CPU brings the least amount of memory into cache.
I've also changed the memory alignment to 32, 64 and 128 bytes, but it didn't improve anything.
I've also changed the SCHEDULE option to STATIC with different chunk sizes, and tried DYNAMIC with different chunk sizes, but the results are the same or worse.
Do you have some ideas that I could use to improve the performance when using 8 or more cores?
Thank you so much for your attention and help.
Dividing the CPU time by 6 when using 12 processors is rather good. In my applications I rarely get more than 4 or 5 (but sequential parts always remain, which is possibly the reason for that).
You could try the COLLAPSE option, which allows merging two loops together. But I don't know whether this is possible in your case, because there are conditions to fulfill (for instance, no instructions between the two loops).
When working on multidimensional arrays in Fortran, the leftmost index should change the fastest. You could try changing the order of the indices of valor_aux and SUM to
valor_aux(PUNTOSP, PUNTOST, 2)
SUM(PUNTOSP, PUNTOST, NUM_DATA, 2)
Additionally, you should always keep Amdahl's law in mind: there is always some overhead, which makes additional speedup impossible beyond a point.
Also: in your two innermost loops, the factor PP(ii) doesn't change. You should try to apply it after these loops (unless you are relying on FMA). And those loops are only a sum of many values, so you could try the intrinsic function SUM to remove them. Both changes could require a massive redesign of your loops.

Profiling multithreading performance in a Haskell program — no speedups using parallel strategies

After attempting to add multithreading functionality to a Haskell program, I noticed that performance didn't improve at all. Chasing it down, I got the following data from ThreadScope:
Green indicates running, and orange is garbage collection.
Here vertical green bars indicate spark creation, blue bars are parallel GC requests, and light blue bars indicate thread creation.
The labels are: spark created, requesting parallel GC, creating thread n, and stealing spark from cap 2.
On average, I'm only getting about 25% activity over 4 cores, which is no improvement at all over the single-threaded program.
Of course, the question would be void without a description of the actual program. Essentially, I create a traversable data structure (e.g. a tree), fmap a function over it, and then feed it into an image-writing routine (which explains the unambiguously single-threaded segment at the end of the program run, past 15 s). Both the construction and the fmapping of the function take a significant amount of time to run, the second slightly more so.
The above graphs were made by adding a parTraversable strategy for that data structure before it is consumed by the image writing; see the sketch below for the shape of this. I have also tried calling toList on the data structure and then using various parallel list strategies (parList, parListChunk, parBuffer), but the results were similar each time, for a wide range of parameters (even with large chunks).
I also tried fully evaluating the traversable data structure before fmapping the function over it, but the exact same problem occurred.
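For concreteness, the shape of the attempt described above looks roughly like this; a minimal sketch with a hypothetical Tree type and work function, not the asker's actual code:

{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}
import Control.Parallel.Strategies (parTraversable, rdeepseq, using)

-- Stand-in for the traversable structure described in the question.
data Tree a = Leaf a | Node (Tree a) (Tree a)
    deriving (Functor, Foldable, Traversable, Show)

work :: Int -> Int
work n = sum [1 .. n * 100000]   -- stand-in for an expensive per-element job

main :: IO ()
main = print (fmap work t `using` parTraversable rdeepseq)
  where t = Node (Leaf 1) (Node (Leaf 2) (Leaf 3))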
Here are some additional statistics (for a different run of the same program):
5,702,829,756 bytes allocated in the heap
385,998,024 bytes copied during GC
55,819,120 bytes maximum residency (8 sample(s))
1,392,044 bytes maximum slop
133 MB total memory in use (0 MB lost due to fragmentation)
                                      Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10379 colls, 10378 par     5.20s     1.40s     0.0001s    0.0327s
  Gen  1         8 colls,     8 par     1.01s     0.25s     0.0319s    0.0509s

  Parallel GC work balance: 1.24 (96361163 / 77659897, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    ( 15.92s)       0.02s    (  0.02s)
  Task  1 (worker) :    0.27s    ( 14.00s)       1.86s    (  1.94s)
  Task  2 (bound)  :   14.24s    ( 14.30s)       1.61s    (  1.64s)
  Task  3 (worker) :    0.00s    ( 15.94s)       0.00s    (  0.00s)
  Task  4 (worker) :    0.25s    ( 14.00s)       1.66s    (  1.93s)
  Task  5 (worker) :    0.27s    ( 14.09s)       1.69s    (  1.84s)
SPARKS: 595854 (595854 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 15.67s ( 14.28s elapsed)
GC time 6.22s ( 1.66s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 21.89s ( 15.94s elapsed)
Alloc rate 363,769,460 bytes per MUT second
Productivity 71.6% of total user, 98.4% of total elapsed
I'm not sure what other useful information I can give. Profiling doesn't reveal anything interesting: it's the same as the single-core statistics, except with an added IDLE taking up 75% of the time, as expected from the above.
What's happening that's preventing useful parallelisation?
Sorry that I couldn't provide code in a timely manner to assist respondents. It took me a while to untangle the exact location of the issue.
The problem was as follows: I was fmapping a function
f :: a -> S b
over the traversable data structure
structure :: T a
where S and T are two traversable functors.
Then, when using parTraversable, I was mistakenly writing
Compose (fmap f structure) `using` parTraversable rdeepseq
instead of
Compose $ fmap f structure `using` parTraversable rdeepseq
so I was wrongly using the Traversable instance for Compose T S to do the multithreading (using Data.Functor.Compose).
(This looks like it should've been easy to catch, but it took me a while to extract the above mistake from the code!)
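For concreteness, here is a minimal sketch of the difference with small stand-in types (T = [], S = Maybe; not the original program). The first variant goes through Compose's Traversable instance and applies the strategy once per inner element; the second applies it once per element of the outer structure, forcing each inner structure whole:

import Control.Parallel.Strategies (parTraversable, rdeepseq, using)
import Data.Functor.Compose (Compose (..))

structure :: [Int]       -- standing in for structure :: T a
structure = [1 .. 4]

f :: Int -> Maybe Int    -- standing in for f :: a -> S b
f = Just . (* 2)

-- The mistaken version: strategy applied via Compose T S.
fine :: Compose [] Maybe Int
fine = Compose (fmap f structure) `using` parTraversable rdeepseq

-- The intended version: strategy applied via T alone.
coarse :: Compose [] Maybe Int
coarse = Compose $ fmap f structure `using` parTraversable rdeepseq

main :: IO ()
main = print (getCompose fine, getCompose coarse)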
This now looks much better!

How do I stop Gloss requiring an increasing amount of memory?

I'm using Gloss to create an RTS game in Haskell, but I've noticed that even a very simple program will occupy more and more memory as it runs. The following program, for example, gradually increases its memory use (by roughly 0.025 MB per second):
module Main (main) where

import Graphics.Gloss
import Graphics.Gloss.Interface.IO.Game

main =
    playIO (InWindow "glossmem" (500, 500) (0, 0)) white 10 0
           (\world -> return (translate (-250) 0 (text $ show world)))
           (\event world -> return world)
           (\timePassed world -> return $ world + timePassed)
I've tried limiting the heap size at runtime, but that just causes the program to crash when it hits the limit. I'm concerned this behaviour will become a performance issue when I have a more complex world. Is there a way to use Gloss such that this won't be an issue, or am I using the wrong tool for the job?
Thanks, I fixed this in gloss-1.7.7.1. It was a typical laziness-induced space leak in the code that manages the frame timing for animations. Your example program now runs in constant space.
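The underlying pattern is worth knowing. A minimal sketch of such a laziness-induced leak (not Gloss's actual code; frameTimes is hypothetical): the lazy foldl builds a chain of a million unevaluated thunks before anything forces it, while the strict foldl' keeps the accumulator evaluated at every step and runs in constant space. Forcing accumulated state each step is the same kind of fix.

import Data.List (foldl')

-- Hypothetical per-frame time deltas.
frameTimes :: [Float]
frameTimes = replicate 1000000 0.1

leaky, strict :: Float
leaky  = foldl  (+) 0 frameTimes   -- thunks pile up until print forces them
strict = foldl' (+) 0 frameTimes   -- accumulator forced at each step

main :: IO ()
main = print (leaky, strict)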

Tools for analyzing performance of a Haskell program

While solving some Project Euler problems to learn Haskell (I'm currently a complete beginner), I came across Problem 12. I wrote this (naive) solution:
-- Get the number of divisors of n
numDivs :: Integer -> Integer
numDivs n = toInteger $ length [ x | x <- [2 .. (n `quot` 2) + 1], n `rem` x == 0 ] + 2

-- Generate a list of triangular values
triaList :: [Integer]
triaList = [ foldr (+) 0 [1..n] | n <- [1..] ]

-- The same, recursively
triaList2 = go 0 1
  where go cs n = (cs+n) : go (cs+n) (n+1)

-- Find the first triangular value with more than n divisors
sol :: Integer -> Integer
sol n = head $ filter (\x -> numDivs x > n) triaList2
This solution for n=500 (sol 500) is extremely slow (it has been running for more than 2 hours now), so I wondered how to find out why it is so slow. Are there any commands that tell me where most of the computation time is spent, so I know which part of my Haskell program is slow? Something like a simple profiler.
To be clear, I'm not asking for a faster solution but for a way to find it. How would you start if you had no Haskell knowledge?
I tried writing two triaList functions but found no way to test which one is faster, so this is where my problems start.
Thanks
how to find out why this solution is so slow. Are there any commands that tell me where most of the computation time is spent so I know which part of my Haskell program is slow?
Precisely! GHC provides many excellent tools, including:
runtime statistics
time profiling
heap profiling
thread analysis
core analysis
comparative benchmarking
GC tuning
A tutorial on using time and space profiling is part of Real World Haskell.
GC Statistics
Firstly, ensure you're compiling with ghc -O2. And make sure it's a modern GHC (e.g. GHC 6.12.x).
The first thing we can do is check that garbage collection isn't the problem.
Run your program with +RTS -s
$ time ./A +RTS -s
./A +RTS -s
749700
9,961,432,992 bytes allocated in the heap
2,463,072 bytes copied during GC
29,200 bytes maximum residency (1 sample(s))
187,336 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 19002 collections, 0 parallel, 0.11s, 0.15s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 13.15s ( 13.32s elapsed)
GC time 0.11s ( 0.15s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 13.26s ( 13.47s elapsed)
%GC time 0.8% (1.1% elapsed)
Alloc rate 757,764,753 bytes per MUT second
Productivity 99.2% of total user, 97.6% of total elapsed
./A +RTS -s 13.26s user 0.05s system 98% cpu 13.479 total
This already gives us a lot of information: you only have a 2 MB heap, and GC takes up 0.8% of the time. So no need to worry that allocation is the problem.
Time Profiles
Getting a time profile for your program is straightforward: compile with -prof -auto-all.
$ ghc -O2 --make A.hs -prof -auto-all
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
And, for N=200:
$ time ./A +RTS -p
749700
./A +RTS -p 13.23s user 0.06s system 98% cpu 13.547 total
which creates a file, A.prof, containing:
Sun Jul 18 10:08 2010 Time and Allocation Profiling Report (Final)
A +RTS -p -RTS
total time = 13.18 secs (659 ticks @ 20 ms)
total alloc = 4,904,116,696 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
numDivs Main 100.0 100.0
This indicates that all your time is spent in numDivs, and that it is also the source of all your allocations.
Heap Profiles
You can also get a breakdown of those allocations by running with +RTS -p -hy, which creates A.hp; you can view it by converting it to a PostScript file (hp2ps -c A.hp). The resulting graph tells us there's nothing wrong with your memory use: it is allocating in constant space.
So your problem is the algorithmic complexity of numDivs:
toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2
Fix that, which accounts for 100% of your running time, and everything else becomes easy.
Optimizations
This expression is a good candidate for the stream fusion optimization, so I'll rewrite it to use Data.Vector.Unboxed (imported qualified as U), like so:
import qualified Data.Vector.Unboxed as U

numDivs n = fromIntegral $
    2 + (U.length $
         U.filter (\x -> fromIntegral n `rem` x == 0) $
         (U.enumFromN 2 ((fromIntegral n `div` 2) + 1) :: U.Vector Int))
Which should fuse into a single loop with no unnecessary heap allocations. That is, it will have better complexity (by constant factors) than the list version. You can use the ghc-core tool (for advanced users) to inspect the intermediate code after optimization.
Testing this, ghc -O2 --make Z.hs
$ time ./Z
749700
./Z 3.73s user 0.01s system 99% cpu 3.753 total
So it reduced running time for N=150 by 3.5x, without changing the algorithm itself.
Conclusion
Your problem is numDivs: it is 100% of your running time, and has terrible complexity. Think about numDivs, and how, for example, for each n you are generating the list [2 .. n `div` 2 + 1] from scratch.
Try memoizing that, since the values don't change.
To measure which of your functions is faster, consider using criterion, which provides statistically robust information about sub-microsecond improvements in running time.
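For example, a minimal criterion benchmark, assuming the numDivs from the question is in scope:

import Criterion.Main

main :: IO ()
main = defaultMain
    [ bench "numDivs 10000"  $ nf numDivs 10000
    , bench "numDivs 100000" $ nf numDivs 100000
    ]

Note that a top-level list like triaList is a CAF and gets cached after its first evaluation, so benchmark functions applied to arguments (as nf does here) rather than shared values.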
Addenda
Since numDivs is 100% of your running time, touching other parts of the program won't make much difference,
however, for pedagogical purposes, we can also rewrite those using stream fusion.
We can also rewrite triaList, relying on fusion to turn it into the loop you wrote by hand in triaList2, which is a prefix-scan function (a.k.a. scanl):
triaList = U.scanl (+) 0 (U.enumFromN 1 top)
  where
    top = 10^6
Similarly for sol:
sol :: Int -> Int
sol n = U.head $ U.filter (\x -> numDivs x > n) triaList
With the same overall running time, but a bit cleaner code.
Dons' answer is great without spoiling the problem by handing over a direct solution.
Here I want to suggest a little tool that I wrote recently. It saves you the time of writing SCC annotations by hand when you want a more detailed profile than the default ghc -prof -auto-all. Besides that, it's colorful!
Here's an example with the code you gave (*); green is OK, red is slow:
All the time goes into creating the list of divisors. This suggests a few things you can do:
1. Make the filtering n `rem` x == 0 faster; but since it's a built-in function, it's probably already fast.
2. Create a shorter list. You've already done something in that direction by checking only up to n `quot` 2.
3. Throw away the list generation completely and use some math to get a faster solution. This is the usual way for Project Euler problems.
(*) I got this by putting your code in a file called eu13.hs and adding a main function, main = print $ sol 90. Then I ran visual-prof -px eu13.hs eu13, and the result is in eu13.hs.html.
Haskell-related note: triaList2 is of course faster than triaList, because the latter performs a lot of unnecessary computations. It takes quadratic time to compute the first n elements of triaList, but linear time for triaList2. There is another elegant (and efficient) way to define an infinite lazy list of triangle numbers:
triaList = 1 : zipWith (+) triaList [2..]
Math-related note: there is no need to check all divisors up to n / 2; it's enough to check up to sqrt(n).
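A minimal sketch of that square-root trick (numDivs' is my own name, not from the answers above): every divisor d <= sqrt(n) pairs with the divisor n `div` d >= sqrt(n), so each hit below the square root counts twice, except when n is a perfect square.

numDivs' :: Integer -> Integer
numDivs' n = sum [ if d * d == n then 1 else 2
                 | d <- [1 .. isqrt n], n `rem` d == 0 ]
  where isqrt = floor . sqrt . (fromIntegral :: Integer -> Double)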
You can run your program with flags to enable time profiling. Something like this:
./program +RTS -P -sprogram.stats -RTS
That should run the program and produce a file called program.stats, recording how much time was spent in each function. You can find more information about profiling with GHC in the GHC User's Guide. For benchmarking, there is the criterion library; I've found this blog post a useful introduction.
