Varnish round-robin director not picking backends
I have Varnish set up with two backend servers behind a round-robin director.
Both backends show up as healthy in varnishstat and varnishadm.
varnishadm output:
Backend name Admin Probe
boot.app1 probe Healthy 5/5
boot.app2 probe Healthy 5/5
VCL configuration:

vcl 4.0;

import directors;

probe ping {
    .interval = 5s;
    .timeout = 1s;
    .threshold = 3;
    .window = 5;
    .url = "/ping";
}

backend app1 {
    .host = "app-1.example.com";
    .port = "80";
    .probe = ping;
}

backend app2 {
    .host = "app-2.example.com";
    .port = "80";
    .probe = ping;
}

sub vcl_init {
    new application_servers = directors.round_robin();
    application_servers.add_backend(app1);
    application_servers.add_backend(app2);
}

sub vcl_recv {
    set req.backend_hint = application_servers;
}
varnishstat output:
VBE.boot.app1.happy ffffffffff VVVVVVVVVVVVVVVVVVVVVVVV
VBE.boot.app1.bereq_hdrbytes 66.17K 0.00 91.00 0.00 0.00 0.00
VBE.boot.app1.beresp_hdrbytes 76.72K 0.00 106.00 0.00 0.00 0.00
VBE.boot.app1.beresp_bodybytes 11.91M 0.00 16.50K 0.00 0.00 0.00
VBE.boot.app1.conn 251 0.00 . 251.00 251.00 251.00
VBE.boot.app1.req 251 0.00 . 0.00 0.00 0.00
VBE.boot.app2.happy ffffffffff VVVVVVVVVVVVVVVVVVVVVVVV
You can see from the varnishstat output that traffic appears to be sent only to the first server in the round-robin configuration; there are no lines for the app2 server other than .happy.
Any thoughts on what would be causing the director to pick the first server every time?
varnishstat -1 output:
MAIN.uptime 218639 1.00 Child process uptime
MAIN.sess_conn 5253150 24.03 Sessions accepted
MAIN.sess_drop 0 0.00 Sessions dropped
MAIN.sess_fail 0 0.00 Session accept failures
MAIN.client_req_400 0 0.00 Client requests received, subject to 400 errors
MAIN.client_req_417 0 0.00 Client requests received, subject to 417 errors
MAIN.client_req 1174495 5.37 Good client requests received
MAIN.cache_hit 61 0.00 Cache hits
MAIN.cache_hitpass 395 0.00 Cache hits for pass
MAIN.cache_miss 1927 0.01 Cache misses
MAIN.backend_conn 0 0.00 Backend conn. success
MAIN.backend_unhealthy 0 0.00 Backend conn. not attempted
MAIN.backend_busy 0 0.00 Backend conn. too many
MAIN.backend_fail 0 0.00 Backend conn. failures
MAIN.backend_reuse 7720 0.04 Backend conn. reuses
MAIN.backend_recycle 8926 0.04 Backend conn. recycles
MAIN.backend_retry 0 0.00 Backend conn. retry
MAIN.fetch_head 0 0.00 Fetch no body (HEAD)
MAIN.fetch_length 1350 0.01 Fetch with Length
MAIN.fetch_chunked 7572 0.03 Fetch chunked
MAIN.fetch_eof 3 0.00 Fetch EOF
MAIN.fetch_bad 0 0.00 Fetch bad T-E
MAIN.fetch_none 0 0.00 Fetch no body
MAIN.fetch_1xx 0 0.00 Fetch no body (1xx)
MAIN.fetch_204 0 0.00 Fetch no body (204)
MAIN.fetch_304 27 0.00 Fetch no body (304)
MAIN.fetch_failed 0 0.00 Fetch failed (all causes)
MAIN.fetch_no_thread 0 0.00 Fetch failed (no thread)
MAIN.pools 2 . Number of thread pools
MAIN.threads 20 . Total number of threads
MAIN.threads_limited 0 0.00 Threads hit max
MAIN.threads_created 1377 0.01 Threads created
MAIN.threads_destroyed 1357 0.01 Threads destroyed
MAIN.threads_failed 0 0.00 Thread creation failed
MAIN.thread_queue_len 0 . Length of session queue
MAIN.busy_sleep 3 0.00 Number of requests sent to sleep on busy objhdr
MAIN.busy_wakeup 3 0.00 Number of requests woken after sleep on busy objhdr
MAIN.busy_killed 0 0.00 Number of requests killed after sleep on busy objhdr
MAIN.sess_queued 1728 0.01 Sessions queued for thread
MAIN.sess_dropped 0 0.00 Sessions dropped for thread
MAIN.n_object 135 . object structs made
MAIN.n_vampireobject 0 . unresurrected objects
MAIN.n_objectcore 141 . objectcore structs made
MAIN.n_objecthead 146 . objecthead structs made
MAIN.n_waitinglist 17 . waitinglist structs made
MAIN.n_backend 6 . Number of backends
MAIN.n_expired 840 . Number of expired objects
MAIN.n_lru_nuked 0 . Number of LRU nuked objects
MAIN.n_lru_moved 52 . Number of LRU moved objects
MAIN.losthdr 0 0.00 HTTP header overflows
MAIN.s_sess 5253150 24.03 Total sessions seen
MAIN.s_req 1174495 5.37 Total requests seen
MAIN.s_pipe 0 0.00 Total pipe sessions seen
MAIN.s_pass 7025 0.03 Total pass-ed requests seen
MAIN.s_fetch 8952 0.04 Total backend fetches initiated
MAIN.s_synth 1165482 5.33 Total synthethic responses made
MAIN.s_req_hdrbytes 58007743 265.31 Request header bytes
MAIN.s_req_bodybytes 8324 0.04 Request body bytes
MAIN.s_resp_hdrbytes 250174363 1144.23 Response header bytes
MAIN.s_resp_bodybytes 658785662 3013.12 Response body bytes
MAIN.s_pipe_hdrbytes 0 0.00 Pipe request header bytes
MAIN.s_pipe_in 0 0.00 Piped bytes from client
MAIN.s_pipe_out 0 0.00 Piped bytes to client
MAIN.sess_closed 1170177 5.35 Session Closed
MAIN.sess_closed_err 5244623 23.99 Session Closed with error
MAIN.sess_readahead 0 0.00 Session Read Ahead
MAIN.sess_herd 3208 0.01 Session herd
MAIN.sc_rem_close 3518 0.02 Session OK REM_CLOSE
MAIN.sc_req_close 0 0.00 Session OK REQ_CLOSE
MAIN.sc_req_http10 1165458 5.33 Session Err REQ_HTTP10
MAIN.sc_rx_bad 0 0.00 Session Err RX_BAD
MAIN.sc_rx_body 0 0.00 Session Err RX_BODY
MAIN.sc_rx_junk 4079015 18.66 Session Err RX_JUNK
MAIN.sc_rx_overflow 0 0.00 Session Err RX_OVERFLOW
MAIN.sc_rx_timeout 276 0.00 Session Err RX_TIMEOUT
MAIN.sc_tx_pipe 0 0.00 Session OK TX_PIPE
MAIN.sc_tx_error 0 0.00 Session Err TX_ERROR
MAIN.sc_tx_eof 0 0.00 Session OK TX_EOF
MAIN.sc_resp_close 4688 0.02 Session OK RESP_CLOSE
MAIN.sc_overload 0 0.00 Session Err OVERLOAD
MAIN.sc_pipe_overflow 0 0.00 Session Err PIPE_OVERFLOW
MAIN.sc_range_short 0 0.00 Session Err RANGE_SHORT
MAIN.shm_records 92391706 422.58 SHM records
MAIN.shm_writes 24787122 113.37 SHM writes
MAIN.shm_flushes 4278 0.02 SHM flushes due to overflow
MAIN.shm_cont 72956 0.33 SHM MTX contention
MAIN.shm_cycles 30 0.00 SHM cycles through buffer
MAIN.backend_req 8952 0.04 Backend requests made
MAIN.n_vcl 3 0.00 Number of loaded VCLs in total
MAIN.n_vcl_avail 3 0.00 Number of VCLs available
MAIN.n_vcl_discard 0 0.00 Number of discarded VCLs
MAIN.bans 1 . Count of bans
MAIN.bans_completed 1 . Number of bans marked 'completed'
MAIN.bans_obj 0 . Number of bans using obj.*
MAIN.bans_req 0 . Number of bans using req.*
MAIN.bans_added 1 0.00 Bans added
MAIN.bans_deleted 0 0.00 Bans deleted
MAIN.bans_tested 0 0.00 Bans tested against objects (lookup)
MAIN.bans_obj_killed 0 0.00 Objects killed by bans (lookup)
MAIN.bans_lurker_tested 0 0.00 Bans tested against objects (lurker)
MAIN.bans_tests_tested 0 0.00 Ban tests tested against objects (lookup)
MAIN.bans_lurker_tests_tested 0 0.00 Ban tests tested against objects (lurker)
MAIN.bans_lurker_obj_killed 0 0.00 Objects killed by bans (lurker)
MAIN.bans_dups 0 0.00 Bans superseded by other bans
MAIN.bans_lurker_contention 0 0.00 Lurker gave way for lookup
MAIN.bans_persisted_bytes 16 . Bytes used by the persisted ban lists
MAIN.bans_persisted_fragmentation 0 . Extra bytes in persisted ban lists due to fragmentation
MAIN.n_purges 0 . Number of purge operations executed
MAIN.n_obj_purged 0 . Number of purged objects
MAIN.exp_mailed 2879 0.01 Number of objects mailed to expiry thread
MAIN.exp_received 2879 0.01 Number of objects received by expiry thread
MAIN.hcb_nolock 2383 0.01 HCB Lookups without lock
MAIN.hcb_lock 975 0.00 HCB Lookups with lock
MAIN.hcb_insert 975 0.00 HCB Inserts
MAIN.esi_errors 0 0.00 ESI parse errors (unlock)
MAIN.esi_warnings 0 0.00 ESI parse warnings (unlock)
MAIN.vmods 2 . Loaded VMODs
MAIN.n_gzip 0 0.00 Gzip operations
MAIN.n_gunzip 2945 0.01 Gunzip operations
MAIN.vsm_free 972480 . Free VSM space
MAIN.vsm_used 83961104 . Used VSM space
MAIN.vsm_cooling 1024 . Cooling VSM space
MAIN.vsm_overflow 0 . Overflow VSM space
MAIN.vsm_overflowed 0 0.00 Overflowed VSM space
MGT.uptime 218640 1.00 Management process uptime
MGT.child_start 1 0.00 Child process started
MGT.child_exit 0 0.00 Child process normal exit
MGT.child_stop 0 0.00 Child process unexpected exit
MGT.child_died 0 0.00 Child process died (signal)
MGT.child_dump 0 0.00 Child process core dumped
MGT.child_panic 0 0.00 Child process panic
MEMPOOL.busyobj.live 0 . In use
MEMPOOL.busyobj.pool 10 . In Pool
MEMPOOL.busyobj.sz_wanted 65536 . Size requested
MEMPOOL.busyobj.sz_actual 65504 . Size allocated
MEMPOOL.busyobj.allocs 8952 0.04 Allocations
MEMPOOL.busyobj.frees 8952 0.04 Frees
MEMPOOL.busyobj.recycle 8934 0.04 Recycled from pool
MEMPOOL.busyobj.timeout 2477 0.01 Timed out from pool
MEMPOOL.busyobj.toosmall 0 0.00 Too small to recycle
MEMPOOL.busyobj.surplus 0 0.00 Too many for pool
MEMPOOL.busyobj.randry 18 0.00 Pool ran dry
MEMPOOL.req0.live 0 . In use
MEMPOOL.req0.pool 10 . In Pool
MEMPOOL.req0.sz_wanted 65536 . Size requested
MEMPOOL.req0.sz_actual 65504 . Size allocated
MEMPOOL.req0.allocs 2622296 11.99 Allocations
MEMPOOL.req0.frees 2622296 11.99 Frees
MEMPOOL.req0.recycle 2622295 11.99 Recycled from pool
MEMPOOL.req0.timeout 1604 0.01 Timed out from pool
MEMPOOL.req0.toosmall 0 0.00 Too small to recycle
MEMPOOL.req0.surplus 0 0.00 Too many for pool
MEMPOOL.req0.randry 1 0.00 Pool ran dry
MEMPOOL.sess0.live 0 . In use
MEMPOOL.sess0.pool 10 . In Pool
MEMPOOL.sess0.sz_wanted 512 . Size requested
MEMPOOL.sess0.sz_actual 480 . Size allocated
MEMPOOL.sess0.allocs 2620824 11.99 Allocations
MEMPOOL.sess0.frees 2620824 11.99 Frees
MEMPOOL.sess0.recycle 2620823 11.99 Recycled from pool
MEMPOOL.sess0.timeout 2001 0.01 Timed out from pool
MEMPOOL.sess0.toosmall 0 0.00 Too small to recycle
MEMPOOL.sess0.surplus 0 0.00 Too many for pool
MEMPOOL.sess0.randry 1 0.00 Pool ran dry
MEMPOOL.req1.live 0 . In use
MEMPOOL.req1.pool 10 . In Pool
MEMPOOL.req1.sz_wanted 65536 . Size requested
MEMPOOL.req1.sz_actual 65504 . Size allocated
MEMPOOL.req1.allocs 2633786 12.05 Allocations
MEMPOOL.req1.frees 2633786 12.05 Frees
MEMPOOL.req1.recycle 2633785 12.05 Recycled from pool
MEMPOOL.req1.timeout 1589 0.01 Timed out from pool
MEMPOOL.req1.toosmall 0 0.00 Too small to recycle
MEMPOOL.req1.surplus 0 0.00 Too many for pool
MEMPOOL.req1.randry 1 0.00 Pool ran dry
MEMPOOL.sess1.live 0 . In use
MEMPOOL.sess1.pool 10 . In Pool
MEMPOOL.sess1.sz_wanted 512 . Size requested
MEMPOOL.sess1.sz_actual 480 . Size allocated
MEMPOOL.sess1.allocs 2632326 12.04 Allocations
MEMPOOL.sess1.frees 2632326 12.04 Frees
MEMPOOL.sess1.recycle 2632325 12.04 Recycled from pool
MEMPOOL.sess1.timeout 1908 0.01 Timed out from pool
MEMPOOL.sess1.toosmall 0 0.00 Too small to recycle
MEMPOOL.sess1.surplus 0 0.00 Too many for pool
MEMPOOL.sess1.randry 1 0.00 Pool ran dry
SMA.s0.c_req 93 0.00 Allocator requests
SMA.s0.c_fail 0 0.00 Allocator failures
SMA.s0.c_bytes 905611 4.14 Bytes allocated
SMA.s0.c_freed 849277 3.88 Bytes freed
SMA.s0.g_alloc 7 . Allocations outstanding
SMA.s0.g_bytes 56334 . Bytes outstanding
SMA.s0.g_space 6442394610 . Bytes available
SMA.Transient.c_req 2363316 10.81 Allocator requests
SMA.Transient.c_fail 0 0.00 Allocator failures
SMA.Transient.c_bytes 2083208664 9528.07 Bytes allocated
SMA.Transient.c_freed 2083145488 9527.79 Bytes freed
SMA.Transient.g_alloc 132 . Allocations outstanding
SMA.Transient.g_bytes 63176 . Bytes outstanding
SMA.Transient.g_space 0 . Bytes available
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.happy 18446744073709551615 . Happy health probes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.bereq_hdrbytes 1944414 8.89 Request header bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.bereq_bodybytes 8324 0.04 Request body bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.beresp_hdrbytes 1608040 7.35 Response header bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.beresp_bodybytes 154396823 706.17 Response body bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.pipe_hdrbytes 0 0.00 Pipe request header bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.pipe_out 0 0.00 Piped bytes to backend
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.pipe_in 0 0.00 Piped bytes from backend
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.conn 4297 . Concurrent connections to backend
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app1.req 4297 0.02 Backend requests sent
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.happy 18446744073709551615 . Happy health probes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.bereq_hdrbytes 0 0.00 Request header bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.bereq_bodybytes 0 0.00 Request body bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.beresp_hdrbytes 0 0.00 Response header bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.beresp_bodybytes 0 0.00 Response body bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.pipe_hdrbytes 0 0.00 Pipe request header bytes
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.pipe_out 0 0.00 Piped bytes to backend
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.pipe_in 0 0.00 Piped bytes from backend
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.conn 0 . Concurrent connections to backend
VBE.58b9f33d-bb8c-4540-ab9e-73da4e8c1cf9.app2.req 0 0.00 Backend requests sent
LCK.backend.creat 7 0.00 Created locks
LCK.backend.destroy 0 0.00 Destroyed locks
LCK.backend.locks 194025 0.89 Lock Operations
LCK.backend_tcp.creat 2 0.00 Created locks
LCK.backend_tcp.destroy 0 0.00 Destroyed locks
LCK.backend_tcp.locks 34549 0.16 Lock Operations
LCK.ban.creat 1 0.00 Created locks
LCK.ban.destroy 0 0.00 Destroyed locks
LCK.ban.locks 1193862 5.46 Lock Operations
LCK.busyobj.creat 8951 0.04 Created locks
LCK.busyobj.destroy 8952 0.04 Destroyed locks
LCK.busyobj.locks 227907 1.04 Lock Operations
LCK.cli.creat 1 0.00 Created locks
LCK.cli.destroy 0 0.00 Destroyed locks
LCK.cli.locks 72890 0.33 Lock Operations
LCK.exp.creat 1 0.00 Created locks
LCK.exp.destroy 0 0.00 Destroyed locks
LCK.exp.locks 17964 0.08 Lock Operations
LCK.hcb.creat 1 0.00 Created locks
LCK.hcb.destroy 0 0.00 Destroyed locks
LCK.hcb.locks 3030 0.01 Lock Operations
LCK.lru.creat 2 0.00 Created locks
LCK.lru.destroy 0 0.00 Destroyed locks
LCK.lru.locks 6675 0.03 Lock Operations
LCK.mempool.creat 5 0.00 Created locks
LCK.mempool.destroy 0 0.00 Destroyed locks
LCK.mempool.locks 22011020 100.67 Lock Operations
LCK.objhdr.creat 1094 0.01 Created locks
LCK.objhdr.destroy 947 0.00 Destroyed locks
LCK.objhdr.locks 4780984 21.87 Lock Operations
LCK.pipestat.creat 1 0.00 Created locks
LCK.pipestat.destroy 0 0.00 Destroyed locks
LCK.pipestat.locks 0 0.00 Lock Operations
LCK.sess.creat 5250338 24.01 Created locks
LCK.sess.destroy 5252842 24.03 Destroyed locks
LCK.sess.locks 11 0.00 Lock Operations
LCK.smp.creat 0 0.00 Created locks
LCK.smp.destroy 0 0.00 Destroyed locks
LCK.smp.locks 0 0.00 Lock Operations
LCK.vbe.creat 1 0.00 Created locks
LCK.vbe.destroy 0 0.00 Destroyed locks
LCK.vbe.locks 72885 0.33 Lock Operations
LCK.vcapace.creat 1 0.00 Created locks
LCK.vcapace.destroy 0 0.00 Destroyed locks
LCK.vcapace.locks 0 0.00 Lock Operations
LCK.vcl.creat 1 0.00 Created locks
LCK.vcl.destroy 0 0.00 Destroyed locks
LCK.vcl.locks 33732 0.15 Lock Operations
LCK.vxid.creat 1 0.00 Created locks
LCK.vxid.destroy 0 0.00 Destroyed locks
LCK.vxid.locks 1348 0.01 Lock Operations
LCK.waiter.creat 2 0.00 Created locks
LCK.waiter.destroy 0 0.00 Destroyed locks
LCK.waiter.locks 43236 0.20 Lock Operations
LCK.wq.creat 3 0.00 Created locks
LCK.wq.destroy 0 0.00 Destroyed locks
LCK.wq.locks 16545779 75.68 Lock Operations
LCK.wstat.creat 1 0.00 Created locks
LCK.wstat.destroy 0 0.00 Destroyed locks
LCK.wstat.locks 5362050 24.52 Lock Operations
LCK.sma.creat 2 0.00 Created locks
LCK.sma.destroy 0 0.00 Destroyed locks
LCK.sma.locks 4726679 21.62 Lock Operations
Generally you would call .backend() on the director to get a backend to pass as the backend_hint:

    set req.backend_hint = application_servers.backend();

I would have expected omitting that to be a syntax error, but it's possible that using the director instance itself simply returns the first backend.

Where in your VCL is the line set req.backend_hint = application_servers.backend();? It should be the first line in sub vcl_recv.
Also, what does the probe window look like? You listed 'happy' but if you run varnishstat -1 you should see the full probe window.
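You can also confirm which backend each fetch actually uses by watching the backend tags in the shared-memory log while you send test traffic, for example:

    varnishlog -b -i BackendOpen -i BackendReuse

Every fetch logs the backend it opened or reused, so app2 should appear in those records if the director ever selects it.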
To help debug it further, put some syslog calls in your VCL. Use either vmod std, or inline them with:

    C{
        syslog(LOG_ERR, "I am at line X in my VCL");
    }C

(If you go the inline-C route, you also need a C{ #include <syslog.h> }C block at the top of the VCL so syslog is declared in the generated C.) You need to turn on the inline-C option (the vcc_allow_inline_c parameter); in Varnish 4 it defaults to off.
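For reference, the vmod std route avoids inline C entirely. A minimal sketch using std.syslog, whose first argument is the numeric syslog priority (for instance 6 for LOG_INFO):

    import std;

    sub vcl_recv {
        # Trace progress through the VCL without inline C.
        std.syslog(6, "vcl_recv: handing request to application_servers");
    }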
I found that the issue was with the Puppet module used to deploy Varnish: the module wasn't including the relevant template in the VCL file.
I've submitted a pull request on GitHub to fix this.
Related
Finding the source of latency in an RPC system
As the name suggests, I am using a simple RPC system between a PC (Windows x64) and an embedded Linux PC running Ubuntu. The embedded Linux PC is the RPC server and the PC is the RPC client. The RPC framework is erpc.

I have noticed that the transaction rate I am getting is particularly low, on the order of 20 transactions/sec. The issue is definitely not hardware related, as I have an alternate RPC system (which I'm trying to replace with the one in question) which can easily get over 1000 transactions/sec using the exact same hardware configuration. To further prove this, I also wrote a simple Python script which acts as a simple socket client or server depending on a switch. I run it on the embedded machine as a server and as a client on the PC. The script simply has the client send some random data to the server, which in turn sends the data back. The client does this a few hundred times and determines the transaction rate from that. The amount of data transmitted is of the same order as what erpc uses. Using this setup I can get 3000+ transactions/sec.

The RPC system in question is half duplex. Only a single thread is used. The server recvs, processes the request and sends the response in a loop. Only a single socket is used for the duration of the test, i.e. no close and accepts occur during the loop. No other IO occurs; or at least, I have refactored it for the purposes of these tests so that it does no other IO.

On the Windows client side, I have a Python unit test which I have run with profiling on. The results don't seem to indicate that the problem is on the client:

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1    0.000    0.000   23.998   23.998  runner.py:105(pytest_runtest_call)
      1    0.000    0.000   23.998   23.998  python.py:1313(runtest)
      1    0.000    0.000   23.998   23.998  __init__.py:603(__call__)
      1    0.000    0.000   23.998   23.998  __init__.py:219(_hookexec)
      1    0.000    0.000   23.998   23.998  __init__.py:213(<lambda>)
      1    0.000    0.000   23.998   23.998  callers.py:151(_multicall)
      1    0.000    0.000   23.998   23.998  python.py:183(pytest_pyfunc_call)
      1    0.003    0.003   23.998   23.998  test_static_if.py:4(test_read_version)
    400    0.014    0.000   23.993    0.060  client.py:16(get_version)
    400    0.017    0.000   23.942    0.060  client.py:79(perform_request)
    400    0.006    0.000   23.828    0.060  transport.py:75(receive)
    800    0.016    0.000   23.820    0.030  transport.py:139(_base_receive)
    800   23.803    0.030   23.803    0.030  {method 'recv' of '_socket.socket' objects}
    400    0.007    0.000    0.061    0.000  transport.py:65(send)
    400    0.002    0.000    0.053    0.000  transport.py:135(_base_send)
    400    0.050    0.000    0.050    0.000  {method 'sendall' of '_socket.socket' objects}
    400    0.012    0.000    0.032    0.000  basic_codec.py:113(start_read_message)
    400    0.006    0.000    0.015    0.000  basic_codec.py:39(start_write_message)
   1600    0.007    0.000    0.015    0.000  basic_codec.py:130(_read)
    800    0.002    0.000    0.012    0.000  basic_codec.py:156(read_uint32)

The server is a C++ application. I have tried profiling it with gprof, but the results show practically no time consumed by the application at all. After reading up a bit more about how gprof works, and how gprof doesn't accumulate time spent in system calls, this indicates that the program is (obviously) IO bound and that the vast majority of time is spent in blocking system calls. I won't add the entire output here for brevity, but below is an excerpt:

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00     2407     0.00     0.00  erpc::MessageBuffer::get()
 0.00      0.00      0.00     2400     0.00     0.00  erpc::MessageBuffer::setUsed(unsigned short)
 0.00      0.00      0.00     2000     0.00     0.00  erpc::MessageBuffer::getUsed() const
 0.00      0.00      0.00     1600     0.00     0.00  erpc::MessageBuffer::Cursor::write(void const*, unsigned int)
 0.00      0.00      0.00     1201     0.00     0.00  erpc::Codec::getBuffer()
 0.00      0.00      0.00      803     0.00     0.00  erpc::MessageBuffer::Cursor::set(erpc::MessageBuffer*)
 0.00      0.00      0.00      803     0.00     0.00  erpc::MessageBuffer::getLength() const
 0.00      0.00      0.00      802     0.00     0.00  erpc::Codec::reset()
 0.00      0.00      0.00      801     0.00     0.00  erpc::TCPTransport::underlyingReceive(unsigned char*, unsigned int)
 0.00      0.00      0.00      800     0.00     0.00  erpc::TCPTransport::underlyingSend(unsigned char const*, unsigned int)
 0.00      0.00      0.00      800     0.00     0.00  erpc::BasicCodec::read(unsigned int*)
 0.00      0.00      0.00      800     0.00     0.00  erpc::BasicCodec::write(int)
 0.00      0.00      0.00      800     0.00     0.00  erpc::BasicCodec::write(unsigned int)
 0.00      0.00      0.00      800     0.00     0.00  erpc::MessageBuffer::Cursor::read(void*, unsigned int)
 0.00      0.00      0.00      800     0.00     0.00  erpc::Service::getServiceId() const
 0.00      0.00      0.00      403     0.00     0.00  erpc::Service::getNext()
 0.00      0.00      0.00      401     0.00     0.00  erpc::SimpleServer::runInternal(erpc::Codec*)
 0.00      0.00      0.00      401     0.00     0.00  erpc::TCPTransport::accept()
 0.00      0.00      0.00      401     0.00     0.00  erpc::TCPTransport::receive(erpc::MessageBuffer*)
 0.00      0.00      0.00      401     0.00     0.00  erpc::FramedTransport::receive(erpc::MessageBuffer*)
 0.00      0.00      0.00      400     0.00     0.00  write_p_version_t_struct(erpc::Codec*, p_version_t const*)
 0.00      0.00      0.00      400     0.00     0.00  StaticIF_service::handleInvocation(unsigned int, unsigned int, erpc::Codec*, erpc::MessageBufferFactory*)
 0.00      0.00      0.00      400     0.00     0.00  StaticIF_service::get_version_shim(erpc::Codec*, erpc::MessageBufferFactory*, unsigned int)
 0.00      0.00      0.00      400     0.00     0.00  erpc::BasicCodec::endReadMessage()
 0.00      0.00      0.00      400     0.00     0.00  erpc::BasicCodec::endWriteStruct()
 0.00      0.00      0.00      400     0.00     0.00  erpc::BasicCodec::endWriteMessage()
 0.00      0.00      0.00      400     0.00     0.00  erpc::BasicCodec::startReadMessage(erpc::_message_type*, unsigned int*, unsigned int*, unsigned int*)
 0.00      0.00      0.00      400     0.00     0.00  erpc::BasicCodec::startWriteStruct()
 0.00      0.00      0.00      400     0.00     0.00  erpc::BasicCodec::startWriteMessage(erpc::_message_type, unsigned int, unsigned int, unsigned int)
 0.00      0.00      0.00      400     0.00     0.00  erpc::FramedTransport::send(erpc::MessageBuffer*)
 0.00      0.00      0.00      400     0.00     0.00  erpc::MessageBufferFactory::prepareServerBufferForSend(erpc::MessageBuffer*)
 0.00      0.00      0.00      400     0.00     0.00  erpc::Server::processMessage(erpc::Codec*, erpc::_message_type&)
 0.00      0.00      0.00      400     0.00     0.00  erpc::Server::findServiceWithId(unsigned int)
 0.00      0.00      0.00      400     0.00     0.00  get_version
 0.00      0.00      0.00        5     0.00     0.00  erpc::ManuallyConstructed<erpc::SimpleServer>::get()
 0.00      0.00      0.00        4     0.00     0.00  operator new(unsigned int, void*)
 0.00      0.00      0.00        3     0.00     0.00  erpc::ManuallyConstructed<erpc::SimpleServer>::operator->()
 0.00      0.00      0.00        2     0.00     0.00  erpc::ManuallyConstructed<erpc::TCPTransport>::get()
 0.00      0.00      0.00        2     0.00     0.00  erpc::ManuallyConstructed<erpc::BasicCodecFactory>::get()
 0.00      0.00      0.00        2     0.00     0.00  erpc::Server::addService(erpc::Service*)
 0.00      0.00      0.00        2     0.00     0.00  erpc::Service::Service(unsigned int)
 0.00      0.00      0.00        2     0.00     0.00  erpc::Service::~Service()
 0.00      0.00      0.00        2     0.00     0.00  erpc_add_service_to_server
 0.00      0.00      0.00        1     0.00     0.00  _GLOBAL__sub_I__Z5usagev

Using strace, the problem becomes apparent in the first recv of every request. For context, an initial header is transmitted first, which indicates the amount of data the request proper contains. Here are a couple of excerpts from the output (the full output is 2000 lines). I used the -r, -T and -C switches, which respectively show a relative timestamp for each call, print the time spent in each call, and show a summary.

In the transaction loop:

0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059478>
0.059589 recv(4, "q\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "q\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
0.000160 recv(4, "\10\0", 2, 0) = 2 <0.059513>
0.059625 recv(4, "r\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000071>
0.000182 send(4, "r\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059059>
0.059172 recv(4, "s\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000183 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "s\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059330>
0.059441 recv(4, "t\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000166 send(4, "\20\0", 2, 0) = 2 <0.000072>
0.000182 send(4, "t\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
0.000163 recv(4, "\10\0", 2, 0) = 2 <0.059506>
0.059618 recv(4, "u\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000166 send(4, "\20\0", 2, 0) = 2 <0.000070>
0.000181 send(4, "u\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000160 recv(4, "\10\0", 2, 0) = 2 <0.059359>
0.059488 recv(4, "v\1\1\2\0\1\0\0", 8, 0) = 8 <0.000048>
0.000175 send(4, "\20\0", 2, 0) = 2 <0.000077>
0.000189 send(4, "v\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000051>
0.000165 recv(4, "\10\0", 2, 0) = 2 <0.059496>
0.059612 recv(4, "w\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000170 send(4, "\20\0", 2, 0) = 2 <0.000074>
0.000182 send(4, "w\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>

The summary:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.59    0.010000          12       801           recv
  1.41    0.000143           0       800           send
  0.00    0.000000           0        12           read
  0.00    0.000000           0         3           write
  0.00    0.000000           0        25        19 open
  0.00    0.000000           0         7           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         8           lseek
  0.00    0.000000           0         6         6 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1           readlink
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         2           setitimer
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         9           mprotect
  0.00    0.000000           0         5           writev
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0        16           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0         6           fstat64
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         1           accept
  0.00    0.000000           0         1           setsockopt
  0.00    0.000000           0         1           set_tls
------ ----------- ----------- --------- --------- ----------------
100.00    0.010143                  1731        40 total

In passing, I am not sure I completely understand the summary. It suggests that recv is very quick compared to the time indicated by each individual call to recv. It looks like the time spent in the first recv of each request is what is killing the RPC system, at nearly 60 ms per call. Am I misreading this? I am not sure of the units, but I am guessing seconds.

So, after profiling both the client and the server, it appears the vast majority of time is spent in recv. If the extra time spent in the initial recv on the server side were because the client was still processing something and hadn't sent it yet, that should have shown up when profiling the client.

Any suggestions you may have as to how to further debug this would be greatly appreciated. Thanks!
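For comparison, a minimal sketch of the kind of single-socket echo benchmark described above; the host, port, payload size and round count are illustrative assumptions, not the poster's actual script:

    import socket
    import sys
    import time

    # Assumed values; the real script's parameters are not given in the question.
    HOST = "192.168.1.100"
    PORT = 5555
    PAYLOAD = b"x" * 16      # same order of magnitude as the erpc frames
    ROUNDS = 400

    def server():
        # Accept one connection and echo every payload back.
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        while True:
            data = conn.recv(len(PAYLOAD))
            if not data:
                break
            conn.sendall(data)

    def client():
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((HOST, PORT))
        start = time.time()
        for _ in range(ROUNDS):
            sock.sendall(PAYLOAD)
            # A robust version would loop until the full payload arrives.
            sock.recv(len(PAYLOAD))
        elapsed = time.time() - start
        print("%.0f transactions/sec" % (ROUNDS / elapsed))

    if __name__ == "__main__":
        # Run with "server" on the embedded board, anything else on the PC.
        server() if sys.argv[1] == "server" else client()

If a script like this sustains thousands of transactions per second on the same link while erpc manages ~20, the network path itself is cleared as a suspect, which matches the measurements described above.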
Why is the latest version of sysstat not showing average values after Ctrl+C?
The old output from my logs, which showed Average values after Ctrl+C:

# pidstat 1 -p `pgrep bgpd`
Linux 3.16.7-gd1a374d-dellz9100on (rtr1)    Tuesday 29 May 2018    _x86_64_    (4 CPU)

05:07:01 UTC   UID       PID    %usr %system  %guest    %CPU   CPU  Command
05:07:07 UTC     0      2144    0.00    0.00    0.00    0.00     2  bgpd
05:07:08 UTC     0      2144    0.00    0.00    0.00    0.00     2  bgpd
05:07:09 UTC     0      2144    0.00    0.00    0.00    0.00     2  bgpd
05:07:10 UTC     0      2144    0.00    0.00    0.00    0.00     2  bgpd
05:07:11 UTC     0      2144    0.00    0.00    0.00    0.00     2  bgpd
05:07:12 UTC     0      2144    0.00    0.00    0.00    0.00     2  bgpd
^C
Average:         0      2144    0.09    0.00    0.00    0.09     -  bgpd

Now it is not showing Average values:

# pidstat 1 -p `pgrep bgpd`
Linux 3.16.7-gd1a374d-dellz9100on (rtr1)    06/13/18    _x86_64_    (4 CPU)

07:32:51       PID    %usr %system  %guest    %CPU   CPU  Command
07:32:56      2144    0.00    0.00    0.00    0.00     0  bgpd
07:32:57      2144    0.00    0.00    0.00    0.00     1  bgpd
07:32:58      2144    0.00    0.00    0.00    0.00     1  bgpd
^C

Version of pidstat:

root#rtr1:/home/ocnos# pidstat -V
sysstat version 10.0.5
(C) Sebastien Godard (sysstat <at> orange.fr)
The mail reply from the developer of the tool: "No, your previous version was not 9.x. This feature (displaying average stats when Ctrl+C is hit) was added to pidstat in release 10.1.4. Regards, Sebastien."
Figuring out Linux memory usage
I've got some slightly weird Linux memory usage that I'm trying to figure out. I've got two processes: nxtcapture and nxtexport. Neither process really allocates much memory, but they each mmap a 1 TB file. nxtexport has no heap allocations (apart from during startup). nxtcapture writes sequentially to the file and nxtexport reads sequentially, and since nxtexport reads from the tail of nxtcapture, I don't really have any read IO.

ing992:~# iostat -m
Linux 4.4.52-nxt (ing992)     05/25/17     _x86_64_     (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.17    1.99    0.96    0.06    0.00   67.82

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
loop0             0.02         0.00         0.00          2          0
sdf              16.47         0.06         0.85       4207      61442
sdf1              0.00         0.00         0.00          5          0
sdf2              0.01         0.00         0.00         77          0
sdf3             16.45         0.06         0.85       4115      61442
sdf4              0.00         0.00         0.00          7          0
sde              15.45         0.01         0.85       1032      61442
sde1              0.00         0.00         0.00          5          0
sde2              0.00         0.00         0.00          0          0
sde3             15.44         0.01         0.85       1017      61442
sde4              0.00         0.00         0.00          7          0
sdb              43.08         0.00        15.72         22    1136368
sda              43.07         0.00        15.72         21    1136406
sdc              43.42         0.04        15.72       2711    1136332
sdd              43.07         0.00        15.72         20    1136301
md127             0.01         0.00         0.00         77          0
md126            23.77         0.07         0.85       5132      61145

This is all great. However, looking at the memory usage, I see the following:

free -m
             total       used       free     shared    buffers     cached
Mem:         32020      31608        412        221         13       9655
-/+ buffers/cache:      21939      10081
Swap:            0          0          0

This shows that more than half of my memory is unavailable. How is this possible? I understand that mmap will keep pages cached, but shouldn't such (non-dirty) pages be counted as available? What's going on here? How can I debug this?
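One quick check, assuming a kernel with the MemAvailable counter (added in 3.14, so this 4.4 kernel has it): it is the kernel's own estimate of reclaimable memory, and it counts clean page cache such as mapped file pages:

    grep -E 'MemFree|MemAvailable|^Cached|Dirty' /proc/meminfo

If MemAvailable is large while free shows almost nothing free, the memory is clean page cache the kernel will drop on demand rather than memory that is truly unavailable.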
Haskell: Leaking memory from ST / GC not collecting?
I have a computation inside ST which allocates memory through a Data.Vector.Unboxed.Mutable. The vector is never read or written, nor is any reference to it retained outside of runST (to the best of my knowledge). The problem is that when I run my ST computation multiple times, I sometimes seem to keep the memory for the vector around.

Allocation statistics:

  5,435,386,768 bytes allocated in the heap
      5,313,968 bytes copied during GC
    134,364,780 bytes maximum residency (14 sample(s))
      3,160,340 bytes maximum slop
            518 MB total memory in use (0 MB lost due to fragmentation)

Here I call runST 20 times with different values for my computation and a 128 MB vector (again: unused, not returned or referenced outside of ST). The maximum residency looks good, basically just my vector plus a few MB of other stuff. But the total memory in use indicates that I have four copies of the vector active at the same time. This scales perfectly with the size of the vector; for 256 MB we get 1030 MB, as expected. A 1 GB vector runs out of memory (4 x 1 GB + overhead > 32-bit address space).

I don't understand why the RTS keeps seemingly unused, unreferenced memory around instead of just GC'ing it, at least at the point where an allocation would otherwise fail.

Running with +RTS -S reveals the following:

    Alloc    Copied      Live    GC    GC   TOT   TOT  Page Flts
    bytes     bytes     bytes  user  elap  user  elap
134940616     13056 134353540  0.00  0.00  0.09  0.19    0    0  (Gen:  1)
   583416      6756 134347504  0.00  0.00  0.09  0.19    0    0  (Gen:  0)
   518020     17396 134349640  0.00  0.00  0.09  0.19    0    0  (Gen:  1)
   521104     13032 134359988  0.00  0.00  0.09  0.19    0    0  (Gen:  0)
   520972      1344 134360752  0.00  0.00  0.09  0.19    0    0  (Gen:  0)
   521100       828 134360684  0.00  0.00  0.10  0.19    0    0  (Gen:  0)
   520812       592 134360528  0.00  0.00  0.10  0.19    0    0  (Gen:  0)
   520936      1344 134361324  0.00  0.00  0.10  0.19    0    0  (Gen:  0)
   520788      1480 134361476  0.00  0.00  0.10  0.20    0    0  (Gen:  0)
134438548      5964 268673908  0.00  0.00  0.19  0.38    0    0  (Gen:  0)
   586300      3084 268667168  0.00  0.00  0.19  0.38    0    0  (Gen:  0)
   517840       952 268666340  0.00  0.00  0.19  0.38    0    0  (Gen:  0)
   520920       544 268666164  0.00  0.00  0.19  0.38    0    0  (Gen:  0)
   520780       428 268666048  0.00  0.00  0.19  0.38    0    0  (Gen:  0)
   520820      2908 268668524  0.00  0.00  0.19  0.38    0    0  (Gen:  0)
   520732      1788 268668636  0.00  0.00  0.19  0.39    0    0  (Gen:  0)
   521076       564 268668492  0.00  0.00  0.19  0.39    0    0  (Gen:  0)
   520532       712 268668640  0.00  0.00  0.19  0.39    0    0  (Gen:  0)
   520764       956 268668884  0.00  0.00  0.19  0.39    0    0  (Gen:  0)
   520816       420 268668348  0.00  0.00  0.20  0.39    0    0  (Gen:  0)
   520948      1332 268669260  0.00  0.00  0.20  0.39    0    0  (Gen:  0)
   520784       616 268668544  0.00  0.00  0.20  0.39    0    0  (Gen:  0)
   521416       836 268668764  0.00  0.00  0.20  0.39    0    0  (Gen:  0)
   520488      1240 268669168  0.00  0.00  0.20  0.40    0    0  (Gen:  0)
   520824      1608 268669536  0.00  0.00  0.20  0.40    0    0  (Gen:  0)
   520688      1276 268669204  0.00  0.00  0.20  0.40    0    0  (Gen:  0)
   520252      1332 268669260  0.00  0.00  0.20  0.40    0    0  (Gen:  0)
   520672      1000 268668928  0.00  0.00  0.20  0.40    0    0  (Gen:  0)
134553500      5640 402973292  0.00  0.00  0.29  0.58    0    0  (Gen:  0)
   586776      2644 402966160  0.00  0.00  0.29  0.58    0    0  (Gen:  0)
   518064     26784 134342772  0.00  0.00  0.29  0.58    0    0  (Gen:  1)
   520828      3120 134343528  0.00  0.00  0.29  0.59    0    0  (Gen:  0)
   521108       756 134342668  0.00  0.00  0.30  0.59    0    0  (Gen:  0)

Here it seems we have live bytes exceeding ~128 MB. The +RTS -hy profile basically just says we allocate 128 MB: http://imageshack.us/a/img69/7765/45q8.png

I tried reproducing this behavior in a simpler program, but even replicating the exact setup with ST, a Reader containing the vector, the same monad/program structure etc., the simple test program doesn't show this. Simplifying my big program, the behavior also eventually stops when I remove apparently completely unrelated code.

Questions: Am I really keeping this vector around 4 times out of 20? If yes, how do I actually tell, given that +RTS -hy and the maximum residency claim I'm not, and what can I do to stop this behavior? If no, why is Haskell not GC'ing it before running out of address space/memory, and what can I do to stop this behavior?

Thanks!
I suspect this is a bug in GHC and/or the RTS.

First, I'm confident there is no actual space leak or anything like that. Reasons:

The vector is never used anywhere: not read, not written, not referenced. It should be collected once runST is done. Even when the ST computation returns a single Int which is immediately printed out to force evaluation, the memory issue still exists. There is no reference to that data.

Every profiling mode the RTS offers is in violent agreement that I never actually have more than a single vector's worth of memory allocated and referenced. Every statistic and every pretty chart says so.

Now, here's the interesting bit. If I manually force a GC by calling System.Mem.performGC after every run of my function, the problem goes away completely.

So we have a case where the runtime holds GBs worth of memory which (demonstrably!) can be reclaimed by the GC, and which even according to its own statistics is no longer held by anybody. When its memory pool runs out, the runtime does not collect but instead asks the OS for more memory. And even when that finally fails, the runtime still does not collect (which would demonstrably reclaim GBs) but instead terminates the program with an out-of-memory error.

I'm no expert on Haskell, GHC or GC, but this does look awfully broken to me. I'll report it as a bug.
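For reference, a minimal sketch of the workaround described above, forcing a major collection after each runST call. The compute function and its vector size are hypothetical stand-ins for the real computation:

    import Control.Monad.ST (runST)
    import qualified Data.Vector.Unboxed.Mutable as VUM
    import System.Mem (performGC)

    -- Hypothetical stand-in for the real computation: allocates a
    -- ~128 MB unboxed vector inside ST and returns a small result.
    compute :: Int -> Int
    compute n = runST $ do
        v <- VUM.replicate (16 * 1024 * 1024) (0 :: Double)
        VUM.write v 0 (fromIntegral n)
        x <- VUM.read v 0
        return (round x)

    main :: IO ()
    main = mapM_ step [1 .. 20]
      where
        step :: Int -> IO ()
        step i = do
            print (compute i)
            performGC  -- force a major GC so the dead vector is reclaimed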
How should I interpret the output of the ghc heap profiler?
I have a server process implemented in Haskell that acts as a simple in-memory DB. Client processes can connect, then add and retrieve data. The service uses more memory than I would expect, and I'm attempting to work out why.

The crudest metric I have is Linux top. When I start the process I see a VIRT image size of ~27 MB. After running a client to insert 60,000 data items, I see an image size of ~124 MB.

Running the process to capture GC statistics (+RTS -S), I see initially:

    Alloc    Copied     Live    GC    GC   TOT   TOT  Page Flts
    bytes     bytes    bytes  user  elap  user  elap
    28296      8388     9172  0.00  0.00  0.00  0.32    0    0  (Gen:  1)

and on adding the 60k items I see the live bytes grow smoothly to:

   532940     14964 63672180  0.00  0.00  23.50  31.95    0    0  (Gen:  0)
   532316      7704 63668672  0.00  0.00  23.50  31.95    0    0  (Gen:  0)
   530512      9648 63677028  0.00  0.00  23.50  31.95    0    0  (Gen:  0)
   531936     10796 63686488  0.00  0.00  23.51  31.96    0    0  (Gen:  0)
   423260  10047016 63680532  0.03  0.03  23.53  31.99    0    0  (Gen:  1)
   531864      6996 63693396  0.00  0.00  23.55  32.01    0    0  (Gen:  0)
   531852      9160 63703536  0.00  0.00  23.55  32.01    0    0  (Gen:  0)
   531888      9572 63711876  0.00  0.00  23.55  32.01    0    0  (Gen:  0)
   531928      9716 63720128  0.00  0.00  23.55  32.01    0    0  (Gen:  0)
   531856      9640 63728052  0.00  0.00  23.55  32.02    0    0  (Gen:  0)
   529632      9280 63735824  0.00  0.00  23.56  32.02    0    0  (Gen:  0)
   527948      8304 63742524  0.00  0.00  23.56  32.02    0    0  (Gen:  0)
   528248      7152 63749180  0.00  0.00  23.56  32.02    0    0  (Gen:  0)
   528240      6384 63756176  0.00  0.00  23.56  32.02    0    0  (Gen:  0)
   341100  10050336 63731152  0.03  0.03  23.58  32.35    0    0  (Gen:  1)
     5080  10049728 63705868  0.03  0.03  23.61  32.70    0    0  (Gen:  1)

This appears to be telling me that the heap holds ~63 MB of live data. That could well be consistent with the numbers from top, once you add on stack space, code space, GC overhead and so on.

So I attempted to use the heap profiler to work out what makes up this 63 MB. The results are confusing. Running with +RTS -h and looking at the generated hp file, the last and largest snapshot has:

containers-0.3.0.0:Data.Map.Bin                  1820400
bytestring-0.9.1.7:Data.ByteString.Internal.PS   1336160
main:KV.Store.Memory.KeyTree                      831972
main:KV.Types.KF_1                                750328
base:GHC.ForeignPtr.PlainPtr                      534464
base:Data.Maybe.Just                              494832
THUNK                                             587140

All of the other numbers in the snapshot are much smaller than these. Adding them up gives a peak memory usage of ~6 MB, as reflected in the chart output.

Why is this inconsistent with the live bytes shown in the GC statistics? It's hard to see how my data structures could require 63 MB, and the profiler says they don't. Where is the memory going?

Thanks for any tips or pointers on this.

Tim
I have a theory. My theory is that your program is using a lot of something like ByteStrings. And because the main content of a ByteString is allocated with malloc, ByteStrings are not displayed in the profile. Thus you could run out of heap without the largest part of your heap ever showing up on the profiling graph.

To make matters even worse, when you take substrings of ByteStrings, they by default retain the pointer to the originally allocated block of memory. So even if you are trying to store only a small fragment of some ByteString, you can end up retaining the whole of the originally allocated block, and this won't show up on your heap profile either.

That is my theory, anyway. I don't know enough about how GHC's heap profiler works, nor about how ByteStrings are implemented, to know for certain. Maybe someone else can chime in and confirm or dispute it.

Edit2: tibbe notes that the buffers used by ByteStrings are pinned. So if you allocate and free lots of small ByteStrings, you can fragment your heap, meaning you run out of usable heap with half of it unallocated.

Edit: JaffaCake tells me that sometimes the heap profiler will not display the memory allocated by ByteStrings.
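If the substring-retention part of this theory applies, the usual remedy is Data.ByteString.copy, which moves the fragment into a fresh buffer so the original allocation can be collected. A minimal illustration (function names are made up):

    import qualified Data.ByteString as B

    -- B.take only adjusts the offset/length fields, so the result
    -- shares (and retains) the whole original buffer.
    sliceSharing :: B.ByteString -> B.ByteString
    sliceSharing bs = B.take 16 bs

    -- B.copy moves the 16 bytes into a fresh buffer, letting the
    -- possibly huge original ByteString be garbage collected.
    sliceCopied :: B.ByteString -> B.ByteString
    sliceCopied bs = B.copy (B.take 16 bs)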
You should use, e.g., hp2ps to get a graphical view of what's going on. Looking at the raw hp file is difficult.
Not everything is included in the profile by default, for example threads and stacks. Try with +RTS -xt.
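For reference, a typical end-to-end invocation for these suggestions, assuming a reasonably modern GHC (older releases used -auto-all where newer ones use -fprof-auto) and a program called Main.hs:

    $ ghc -prof -fprof-auto -rtsopts Main.hs
    $ ./Main +RTS -h -xt -RTS        # heap profile, including thread/stack memory
    $ hp2ps -c Main.hp               # renders Main.ps as a colour graph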