Why do hGetBuf, hPutBuf, etc. allocate memory? - haskell

In the process of doing some simple benchmarking, I came across something that surprised me. Take this snippet from Network.Socket.Splice:
hSplice :: Int -> Handle -> Handle -> IO ()
hSplice len s t = do
  a <- mallocBytes len :: IO (Ptr Word8)
  finally
    (forever $! do
       bytes <- hGetBufSome s a len
       if bytes > 0
         then hPutBuf t a bytes
         else throwRecv0)
    (free a)
One would expect that hGetBufSome and hPutBuf here would not need to allocate memory, as they write into and read from a pre-allocated buffer. The docs seem to back this intuition up... But alas:
individual inherited
COST CENTRE %time %alloc %time %alloc bytes
hSplice 0.5 0.0 38.1 61.1 3792
hPutBuf 0.4 1.0 19.8 29.9 12800000
hPutBuf' 0.4 0.4 19.4 28.9 4800000
wantWritableHandle 0.1 0.1 19.0 28.5 1600000
wantWritableHandle' 0.0 0.0 18.9 28.4 0
withHandle_' 0.0 0.1 18.9 28.4 1600000
withHandle' 1.0 3.8 18.8 28.3 48800000
do_operation 1.1 3.4 17.8 24.5 44000000
withHandle_'.\ 0.3 1.1 16.7 21.0 14400000
checkWritableHandle 0.1 0.2 16.4 19.9 3200000
hPutBuf'.\ 1.1 3.3 16.3 19.7 42400000
flushWriteBuffer 0.7 1.4 12.1 6.2 17600000
flushByteWriteBuffer 11.3 4.8 11.3 4.8 61600000
bufWrite 1.7 6.9 3.0 9.9 88000000
copyToRawBuffer 0.1 0.2 1.2 2.8 3200000
withRawBuffer 0.3 0.8 1.2 2.6 10400000
copyToRawBuffer.\ 0.9 1.7 0.9 1.7 22400000
debugIO 0.1 0.2 0.1 0.2 3200000
debugIO 0.1 0.2 0.1 0.2 3200016
hGetBufSome 0.0 0.0 17.7 31.2 80
wantReadableHandle_ 0.0 0.0 17.7 31.2 32
wantReadableHandle' 0.0 0.0 17.7 31.2 0
withHandle_' 0.0 0.0 17.7 31.2 32
withHandle' 1.6 2.4 17.7 31.2 30400976
do_operation 0.4 2.4 16.1 28.8 30400880
withHandle_'.\ 0.5 1.1 15.8 26.4 14400288
checkReadableHandle 0.1 0.4 15.3 25.3 4800096
hGetBufSome.\ 8.7 14.8 15.2 24.9 190153648
bufReadNBNonEmpty 2.6 4.4 6.1 8.0 56800000
bufReadNBNonEmpty.buf' 0.0 0.4 0.0 0.4 5600000
bufReadNBNonEmpty.so_far' 0.2 0.1 0.2 0.1 1600000
bufReadNBNonEmpty.remaining 0.2 0.1 0.2 0.1 1600000
copyFromRawBuffer 0.1 0.2 2.9 2.8 3200000
withRawBuffer 1.0 0.8 2.8 2.6 10400000
copyFromRawBuffer.\ 1.8 1.7 1.8 1.7 22400000
bufReadNBNonEmpty.avail 0.2 0.1 0.2 0.1 1600000
flushCharReadBuffer 0.3 2.1 0.3 2.1 26400528
I have to assume this is on purpose... but I have no idea what that purpose might be. Even worse: I'm just barely clever enough to get this profile, but not quite clever enough to figure out exactly what's being allocated.
Any help along those lines would be appreciated.
UPDATE: I've done some more profiling with two drastically simplified testcases. The first testcase directly uses the read/write ops from System.Posix.Internals:
echo :: Ptr Word8 -> IO ()
echo buf = forever $ do
  threadWaitRead $ Fd 0
  len <- c_read 0 buf 1
  c_write 1 buf (fromIntegral len)
  yield
As you'd hope, this allocates no memory on the heap each time through the loop. The second testcase uses the read/write ops from GHC.IO.FD:
echo :: Ptr Word8 -> IO ()
echo buf = forever $ do
  len <- readRawBufferPtr "read" stdin buf 0 1
  writeRawBufferPtr "write" stdout buf 0 (fromIntegral len)
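As a cross-check, per-iteration allocation can also be measured from inside the program. Here is a minimal sketch, assuming the GHC.Stats API of newer base releases (getRTSStats, getRTSStatsEnabled and the allocated_bytes field; the GHC 7.x era exposed getGCStats instead) and a program started with +RTS -T:

import Control.Monad (replicateM_, unless)
import GHC.Stats (allocated_bytes, getRTSStats, getRTSStatsEnabled)

-- Run an IO action n times and print the average number of bytes
-- allocated per run. The RTS only gathers these statistics when the
-- program is started with "+RTS -T".
allocPerCall :: Int -> IO () -> IO ()
allocPerCall n action = do
  ok <- getRTSStatsEnabled
  unless ok $ error "run with +RTS -T to enable GHC.Stats"
  before <- allocated_bytes <$> getRTSStats
  replicateM_ n action
  after <- allocated_bytes <$> getRTSStats
  putStrLn $ "bytes per call: " ++ show ((after - before) `div` fromIntegral n)

Passing a single read/write round trip (one loop body rather than the forever loop itself) as the action gives a rough per-iteration figure to compare the two versions.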
UPDATE #2: I was advised to file this as a bug in GHC Trac... I'm still not sure it actually is a bug (as opposed to intentional behavior, a known limitation, or whatever) but here it is: https://ghc.haskell.org/trac/ghc/ticket/9696

I'll try to guess based on the code
The runtime tries to optimize small reads and writes, so it maintains an internal buffer. If your buffer is 1 byte long, it would be inefficient to use it directly, so the internal buffer is used to read a bigger chunk of data. It is probably ~32 KB long. Plus something similar for writing. Plus your own buffer.
The code has an optimization -- if you provide a buffer bigger than the internal one, and the latter is empty, your buffer will be used directly. But the internal buffer is already allocated, so that will not reduce memory usage. I don't know how to disable the internal buffer, but you can open a feature request if it is important to you.
(I realize that my guess can be totally wrong.)
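A minimal sketch, if one wants to at least experiment with the handles' buffering mode via the standard System.IO API (this only changes the BufferMode; it is not verified here that it removes the per-call allocations shown in the profile above):

import System.IO (BufferMode (NoBuffering), Handle, hSetBuffering)

-- Switch both handles to NoBuffering before entering the copy loop.
-- Note: this is only the documented way to change a Handle's buffering
-- mode; whether it avoids the allocations discussed above is not verified.
disableBuffering :: Handle -> Handle -> IO ()
disableBuffering s t = do
  hSetBuffering s NoBuffering
  hSetBuffering t NoBuffering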
ADD:
This one does seem to allocate, but I still don't know why.
What is your concern, max memory usage or number of allocated bytes?
c_read is a C function; it doesn't allocate on Haskell's heap (though it may allocate on the C heap).
readRawBufferPtr is a Haskell function, and it is usual for Haskell functions to allocate a lot of memory that quickly becomes garbage, simply because of immutability. It is common for a Haskell program to allocate e.g. 100 GB over its lifetime while memory usage stays under 1 MB.

It seems like the conclusion is: it's a bug.

Related

What are llvm_pipe threads?

I'm writing a Rust app that uses a lot of threads. I noticed the CPU usage was high so I did top and then hit H to see the threads:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
247759 root 20 0 3491496 104400 64676 R 32.2 1.0 0:02.98 my_app
247785 root 20 0 3491496 104400 64676 S 22.9 1.0 0:01.89 llvmpipe-0
247786 root 20 0 3491496 104400 64676 S 21.9 1.0 0:01.71 llvmpipe-1
247792 root 20 0 3491496 104400 64676 S 20.9 1.0 0:01.83 llvmpipe-7
247789 root 20 0 3491496 104400 64676 S 20.3 1.0 0:01.60 llvmpipe-4
247790 root 20 0 3491496 104400 64676 S 20.3 1.0 0:01.64 llvmpipe-5
247787 root 20 0 3491496 104400 64676 S 19.9 1.0 0:01.70 llvmpipe-2
247788 root 20 0 3491496 104400 64676 S 19.9 1.0 0:01.61 llvmpipe-3
What are these llvmpipe-n threads? Why does my_app launch them? Are they even from my_app for sure?
As the link HHK posted explains, the llvmpipe threads are from your OpenGL driver, which is Mesa.
You said you are running this in a VM. VMs usually don't virtualize GPU hardware, so the Mesa OpenGL driver is doing software rendering. To achieve better performance, Mesa spawns threads to do parallel computations on the CPU.

Node.js - spawn is cutting off the results

I'm creating a node program to return the output of the linux top command. It is working fine; the only issue is that the command name is cut off, so instead of the full command name like /usr/local/libexec/netdata/plugins.d/apps.plugin 1 it returns /usr/local+
My code
const topparser = require("topparser")
const spawn = require('child_process').spawn

let proc = null
let startTime = 0

exports.start = function (pid_limit, callback) {
  startTime = new Date().getTime()
  proc = spawn('top', ['-c', '-b', "-d", "3"])
  console.log("started process, pid: " + proc.pid)
  let top_data = ""
  proc.stdout.on('data', function (data) {
    console.log('stdout: ' + data);
  })
  proc.on('close', function (code) {
    console.log('child process exited with code ' + code);
  });
}//start

exports.stop = function () {
  console.log("stoped process...")
  if (proc) { proc.kill('SIGINT') } // SIGHUP - linux, SIGINT - windows
}//stop
The results
14861 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/1+
14864 root 20 0 0 0 0 S 0.0 0.0 0:00.02 [kworker/0+
15120 root 39 19 102488 3344 2656 S 0.0 0.1 0:00.09 /usr/bin/m+
16904 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/0+
19031 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/u+
21500 root 20 0 0 0 0 Z 0.0 0.0 0:00.00 [dsc] <def+
22571 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/0+
Any way to fix it?
Best regards
From a top manpage:
In Batch mode, when used without an argument top will format output using the COLUMNS= and LINES=
environment variables, if set. Otherwise, width will be fixed at the maximum 512 columns. With an
argument, output width can be decreased or increased (up to 512) but the number of rows is
considered unlimited.
Add '-w', '512' to the arguments.
Since you work with node, you can query netdata running on localhost for this.
Example:
http://london.my-netdata.io/api/v1/data?chart=apps.cpu&after=-1&options=ms
For localhost netdata:
http://localhost:19999/api/v1/data?chart=apps.cpu&after=-1&options=ms
You can also get systemd services:
http://london.my-netdata.io/api/v1/data?chart=services.cpu&after=-1&options=ms
If you are not planning to update the screen per second, you can instruct netdata to return the average of a longer duration:
http://london.my-netdata.io/api/v1/data?chart=apps.cpu&after=-5&points=1&group=average&options=ms
The above returns the average of the last 5 seconds.
Finally, you can get the latest values of all the metrics netdata monitors with this:
http://london.my-netdata.io/api/v1/allmetrics?format=json
For completeness, netdata can export all the metrics in BASH format for shell scripts. Check this: https://github.com/firehol/netdata/wiki/receiving-netdata-metrics-from-shell-scripts

Logstash - grok is not parsing double digit float values

Grok is able to parse float values with a single digit before the decimal point, like 1.2, using BASE16FLOAT,
but throws [0] "_grokparsefailure" when parsing a double-digit value like 12.5.
Example:
works for the log event
02:10:28 CPU Util %: 0.1 / 0.2 / 0.6 Disk Util %: 0.0 / 0.0 / 0.0
but not for
02:09:46 CPU Util %: 1.3 / 2.3 / 4.2 Disk Util %: 5.6 / 12.5 / 40.9
Logstash filter used
"message" => "%{TIME:time} CPU Util %: %{BASE16FLOAT:MIN_CPU} / %{BASE16FLOAT:AVG_CPU} / %{BASE16FLOAT:MAX_CPU} Disk Util %: %{BASE16FLOAT:MIN_DISK} / %{BASE16FLOAT:AVG_DISK} / %{BASE16FLOAT:MAX_DISK}"
I don't understand why it works for single-digit float values but not for double-digit values.
You can use %{NUMBER} and %{SPACE}:
"message" => "%{TIME:time}%{SPACE}CPU Util %:%{SPACE}%{NUMBER:MIN_CPU} /%{SPACE}%{NUMBER:AVG_CPU} /%{SPACE}%{NUMBER:MAX_CPU}%{SPACE}Disk Util %:%{SPACE}%{NUMBER:MIN_DISK} /%{SPACE}%{NUMBER:AVG_DISK} /%{SPACE}%{NUMBER:MAX_DISK}"

Heap overflow in Haskell

I'm getting a "Heap exhausted" message when running the following short Haskell program on a big enough dataset. For example, the program fails (with heap overflow) on a 20 MB input file with around 900k lines. The heap size was set (through -with-rtsopts) to 1 GB. It runs OK if longestCommonSubstrB is defined as something simpler, e.g. commonPrefix. I need to process files on the order of 100 MB.
I compiled the program with the following command line (GHC 7.8.3):
ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs
I would appreciate any help in making this thing run in a reasonable amount of space (on the order of the input file size), but I would especially appreciate the thought process of finding where the bottleneck is and where and how to force strictness.
My guess is that somehow forcing the longestCommonSubstrB function to evaluate strictly would solve the problem, but I don't know how to do that.
{-# LANGUAGE BangPatterns #-}
module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)

-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do
    text <- B.readFile filename
    return $ map (B.split '\t' . stripR) . B.lines $ text
  where
    stripR = B.reverse . B.dropWhile isSpace . B.reverse

transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) =
  B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a

longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $
    [f xs' ys | xs' <- B.tails xs] ++
    [f xs ys' | ys' <- tail $ B.tails ys]
  where
    f xs' ys' = scanl g B.empty $ B.zip xs' ys'
    g z (x, y) = if x == y
                   then z `B.snoc` x
                   else B.empty

main :: IO ()
main = do
  (input:output:_) <- getArgs
  lexicon <- readLexicon input
  let flattened = B.unlines . sort . map transformOne $ lexicon
  B.writeFile output flattened
This is the profile output for the test dataset (100k lines, heap size set to 1 GB, i.e. generateSample.exe 100000; the resulting file size is 2.38 MB):
Heap profile over time: (graph omitted)
Execution statistics:
3,505,737,588 bytes allocated in the heap
785,283,180 bytes copied during GC
62,390,372 bytes maximum residency (44 sample(s))
216,592 bytes maximum slop
96 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6697 colls, 0 par 1.05s 1.03s 0.0002s 0.0013s
Gen 1 44 colls, 0 par 4.14s 3.99s 0.0906s 0.1935s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.80s ( 9.17s elapsed)
GC time 3.75s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 1.44s ( 1.35s elapsed)
EXIT time 0.02s ( 0.00s elapsed)
Total time 13.02s ( 12.85s elapsed)
%GC time 28.8% (28.6% elapsed)
Alloc rate 449,633,678 bytes per MUT second
Productivity 60.1% of total user, 60.9% of total elapsed
Time and Allocation Profiling Report:
SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt
total time = 3.97 secs (3967 ticks @ 1000 us, 1 processor)
total alloc = 2,321,595,564 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
longestCommonSubstrB Main 43.3 33.1
longestCommonSubstrB.f Main 21.5 43.6
main.flattened Main 17.5 5.1
main Main 6.6 5.8
longestCommonSubstrB.g Main 5.0 5.8
readLexicon Main 2.5 2.8
transformOne Main 1.8 1.7
readLexicon.stripR Main 1.8 1.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 45 0 0.1 0.0 100.0 100.0
main Main 91 0 6.6 5.8 99.9 100.0
main.flattened Main 93 1 17.5 5.1 89.1 89.4
transformOne Main 95 100000 1.8 1.7 71.6 84.3
longestCommonSubstrB Main 100 100000 43.3 33.1 69.8 82.5
longestCommonSubstrB.f Main 101 1400000 21.5 43.6 26.5 49.5
longestCommonSubstrB.g Main 104 4200000 5.0 5.8 5.0 5.8
readLexicon Main 92 1 2.5 2.8 4.2 4.8
readLexicon.stripR Main 98 0 1.8 1.9 1.8 1.9
CAF GHC.IO.Encoding.CodePage 80 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 74 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 70 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 66 0 0.0 0.0 0.0 0.0
CAF System.Environment 65 0 0.0 0.0 0.0 0.0
CAF Data.ByteString.Lazy.Char8 54 0 0.0 0.0 0.0 0.0
CAF Main 52 0 0.0 0.0 0.0 0.0
transformOne Main 99 0 0.0 0.0 0.0 0.0
readLexicon Main 96 0 0.0 0.0 0.0 0.0
readLexicon.stripR Main 97 1 0.0 0.0 0.0 0.0
main Main 90 1 0.0 0.0 0.0 0.0
UPDATE: The following program can be used to generate sample data. It expects one argument, the number of lines in the generated dataset. The generated data will be saved to the sample.txt file. When I generate a 900k-line dataset with it (by running generateSample.exe 900000), the produced dataset makes the above program fail with heap overflow (the heap size was set to 1 GB). The resulting dataset is around 20 MB.
module Main where

import System.Environment (getArgs)
import Data.List (intercalate, permutations)

generate :: Int -> [(String,String,String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
  where
    f = cycle . permutations

main :: IO ()
main = do
    (n:_) <- getArgs
    let flattened = unlines . map f $ generate (read n :: Int)
    writeFile "sample.txt" flattened
  where
    f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]
It seems to me you've implemented a naive longest common substring, with terrible space complexity (at least O(n^2)). Strictness has nothing to do with it.
You'll want to implement a dynamic programming algo. You may find inspiration in the string-similarity package, or in the lcs function in the guts of the Diff package.
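For reference, here is a minimal sketch of that textbook dynamic-programming approach, keeping only one row of the table at a time. It is illustrative only: it works on strict ByteStrings (the question's code uses lazy ones, so inputs would need converting), and it is not the implementation found in string-similarity or Diff.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as S
import Data.List (foldl')

-- Longest common substring of xs and ys in O(|xs|*|ys|) time, keeping
-- only the previous row of common-suffix lengths plus the best match
-- seen so far (its length and where it ends in xs).
longestCommonSubstr :: S.ByteString -> S.ByteString -> S.ByteString
longestCommonSubstr xs ys
  | bestLen == 0 = S.empty
  | otherwise    = S.take bestLen (S.drop (bestEnd - bestLen + 1) xs)
  where
    m = S.length ys
    (_, bestLen, bestEnd) =
      foldl' step (replicate m 0, 0, 0) [0 .. S.length xs - 1]
    step (prev, !bl, !be) i = (row, bl', be')
      where
        x = S.index xs i
        -- row !! j = length of the common suffix of xs ending at i
        --            and of ys ending at j
        row = zipWith cell (0 : prev) [0 .. m - 1]
        cell diag j
          | x == S.index ys j = diag + 1
          | otherwise         = 0
        best = maximum (0 : row)
        (bl', be') = if best > bl then (best, i) else (bl, be)

Because only the previous row and the best (length, end position) pair are retained, space stays proportional to the shorter input rather than to the number of suffix pairs the naive version builds.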

How can I track a thread's Cpu usage in Delphi

I have a program running several threads, but some threads sometimes overload the CPU, so I need to limit those threads' CPU usage to 50% or so. Is that possible in Delphi?
edit: sorry guys, my question was not clear.
I actually want to know how I could track threads (at least make a thread list with their thread IDs) and see how much CPU each thread uses. I want to do this so I can see which thread is responsible for the CPU overload.
Sorry for the inconvenience again.
I think the answer to your question can be found in the following Stack Overflow question: How to get the cpu usage per thread on windows (win32).
However, I would advise you to endeavour to understand why your program is behaving as it does and attack the root of the problem rather than killing any threads that you take a dislike to. Of course, if the program in question is purely for your own private use then your approach may be perfectly expedient and pragmatic. But if you are writing professional software then I can't see a situation where killing busy threads sounds like a reasonable approach.
You cannot "limit CPU usage", not in Delphi nor in Windows itself, as far as I know.
You likely want something else: not to interfere with user actions or with other threads. But if there's nothing going on and the user isn't doing anything, why run slower than you could? Just use 100% of the CPU, nobody else needs it!
So, if you need those threads not to interfere with user actions, just set them to a lower priority with the Windows function SetThreadPriority. They'll only run when the user doesn't need the processor power.
Another trick to give other threads more of a chance to run is to call Sleep(0) from time to time in your thread body. Every time you call Sleep(), you ask the OS to switch to another thread, simply speaking.
I track a rolling CPU usage per thread for every thread in all my applications using some code in my framework (http://www.csinnovations.com/framework/framework.htm). A log output looks like:
15/01/2011 11:17:59.631,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,Memory Check,Verbose,Globals,"System allocated memory = 8282615808 bytes (change since last check = 4872478720 bytes)"
15/01/2011 11:17:59.632,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,Memory Check,Verbose,Globals,"Process allocated memory = 152580096 bytes (change since last check = -4579328 bytes)"
15/01/2011 11:17:59.633,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"System CPU usage = 15.6 % (average over lifetime = 3.0 %)"
15/01/2011 11:17:59.634,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Process CPU usage = 0.5 % (average over lifetime = 0.7 %)"
15/01/2011 11:17:59.634,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.0 % (average over lifetime = 0.0 %)"
15/01/2011 11:17:59.634,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.0 % (average over lifetime = 0.0 %)"
15/01/2011 11:17:59.634,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.0 % (average over lifetime = 0.0 %)"
15/01/2011 11:17:59.635,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.1 % (average over lifetime = 0.1 %)"
15/01/2011 11:17:59.635,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.0 % (average over lifetime = 0.0 %)"
15/01/2011 11:17:59.635,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.3 % (average over lifetime = 0.5 %)"
15/01/2011 11:17:59.635,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.0 % (average over lifetime = 0.0 %)"
15/01/2011 11:17:59.635,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.0 % (average over lifetime = 0.0 %)"
15/01/2011 11:17:59.636,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.0 % (average over lifetime = 0.0 %)"
15/01/2011 11:17:59.636,Misha,MISHA-DCDEL,Scores Client,V0.2.0.1,Main Thread,CPU Check,Verbose,Globals,"Thread CPU usage = 0.1 % (average over lifetime = 0.1 %)"
The time period is configurable, and I tend to use either 10 seconds, a minute, or 10 minutes. Have a look in the CsiSystemUnt.pas and AppGlobalsUnt.pas files to see how it is done.
Cheers, Misha
PS I also check memory usage as well.
