How can I see the full cost-centre stack in GHC? - haskell

I'm almost getting the hang of GHC cost centres... it is an awesome idea, and you can actually fix memory leaks with the profiling tools. But my problem is that the information I'm getting in the .hp profile is truncated:
(1319)GHC.Conc.Signal.CAF 640
(1300)GHC.Event.Thread.CAF 560
(2679)hGetReplies/connect/c... 112
(2597)insideConfig/CAF:lvl2... 32
(1311)GHC.IO.Handle.FD.CAF 656
(2566)setLoggerLevels/confi... 208
(2571)configureLoggingToCon... 120
(2727)reply/Database.Redis.... 32
How do I find out, for example, what the full cost-centre stack is for (2566) or (2559)? Is there a tool for that or a command-line option?

Pass +RTS -L100 to your program when you run it with profiling, and change 100 to however many characters of your cost-centre names you want to see.
The documentation can be found in the GHC user guide, section “RTS options for heap profiling”.
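For example, a profiled run that keeps up to 200 characters of each cost-centre name might look like this (myprog is a placeholder for your executable, and -hc is just one flavour of heap profile):
./myprog +RTS -hc -L200 -RTS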

Related

realloc function gives SIGABRT due to limited heap size

I am trying to reproduce a problem.
My C code gives SIGABRT; I traced it back to line 3174 of glibc's malloc.c:
https://elixir.bootlin.com/glibc/glibc-2.27/source/malloc/malloc.c
/* Little security check which won't hurt performance: the allocator
never wrapps around at the end of the address space. Therefore
we can exclude some size values which might appear here by
accident or by "design" from some intruder. We need to bypass
this check for dumped fake mmap chunks from the old main arena
because the new malloc may provide additional alignment. */
if ((__builtin_expect ((uintptr_t) oldp > (uintptr_t) -oldsize, 0)
|| __builtin_expect (misaligned_chunk (oldp), 0))
&& !DUMPED_MAIN_ARENA_CHUNK (oldp))
malloc_printerr ("realloc(): invalid pointer");
My understanding is that memory gets allocated when I call calloc, and when I then call realloc and try to increase the memory area, the heap is not available for some reason, giving SIGABRT.
My other question is: how can I limit the heap area to a few bytes, say 10 bytes, to replicate the problem? RLIMIT and setrlimit are mentioned on Stack Overflow, but no sample code is given. Can you provide sample code where the heap size is 10 bytes?
How can I limit the heap area to a few bytes, say 10 bytes?
Can you provide sample code where the heap size is 10 bytes?
From How to limit heap size for a c code in linux, you could do:
You could use (inside your program) setrlimit(2), probably with RLIMIT_AS (as cited by Ouah's answer).
#include <sys/resource.h>

int main(void) {
    /* Cap this process's address space at 10 bytes (soft and hard limit). */
    setrlimit(RLIMIT_AS, &(struct rlimit){10, 10});
}
Better yet, make your shell do it. With bash it is the ulimit builtin.
$ ulimit -v 10
$ ./your_program.out
to replicate the problem
Most probably, limiting the heap size will just produce a different failure related to the limit itself; it is most likely unrelated to your original problem and will not help you debug it. Instead, I would suggest researching AddressSanitizer and valgrind.
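For example (a rough sketch, assuming gcc and a source file named prog.c), you could build with AddressSanitizer, or run the unmodified binary under valgrind:
gcc -g -fsanitize=address prog.c -o prog && ./prog
valgrind ./your_program.out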

PEBS records far fewer memory-access samples than are actually present

I have been trying to log the memory accesses made by a program using perf and PEBS counters. My intention was to log all of the memory accesses made by a program (I chose programs from SpecCPU2006). Even after tweaking certain parameters, I seem to record far fewer samples than there actually are for the program. I know, as has been said previously, that it is tough to record every memory-access sample, but leaving that aside, I want to know how PEBS can record so many fewer samples than there actually are.
I followed the steps below:
First of all, I modified the /proc/sys/kernel/perf_cpu_time_max_percent value. Initially it was 25%; I changed it to 95%, because I wanted to see whether I could record the maximum number of memory-access samples. This also lets me use a much higher perf_event_max_sample_rate, which normally maxes out at 100,000, without it being lowered automatically.
I then set perf_event_max_sample_rate to 244,500 instead of the usual maximum of 100,000.
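(For reference, those two settings can be changed roughly like this, with root privileges and the values mentioned above:)
echo 95 | sudo tee /proc/sys/kernel/perf_cpu_time_max_percent
echo 244500 | sudo tee /proc/sys/kernel/perf_event_max_sample_rate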
Next I used perf stat to record the total count of memory-store events in the program. I got the data below:
./perf stat -e cpu/mem-stores/u ../../.././libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for '../../.././libquantum_base.arnab 100':
158,115,509 cpu/mem-stores/u
0.591718162 seconds time elapsed
There are roughly 158 million events as indicated by perf stat, which should be a correct figure, since it comes directly from the hardware counter values.
But now, when I run the perf record -e command and use PEBS counters to capture all of the memory-store events possible:
./perf record -e cpu/mem-stores/upp -c 1 ../../.././libquantum_base.arnab 100
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 7.827 MB perf.data (254125 samples) ]
I can see 254,125 samples being recorded. This is far less than what was returned by perf stat. I am recording all of these accesses in userspace only (I am using the u modifier in both cases).
Why does this happen? Am I recording the memory-store events the wrong way, or is there a problem with the CPU's behavior?

Self-hosted WCF service maxing cpu load

I am looking into an issue at work with a Windows service that is taking 100% CPU on a machine with 16 CPUs.
The service is hosting a self-hosted .NET WCF service.
I have received a crash dump which I have loaded up in windbg, in order to look for clues.
So what I have tried:
!threads :
ThreadCount: 646
UnstartedThread: 0
BackgroundThread: 643
PendingThread: 0
DeadThread: 2
Hosted Runtime: no
642 of the threads were thread-pool workers, as follows:
8 29 2a34 000000002068b510 3029220 Preemptive 0000000000000000:0000000000000000 0000000000563f50 0 MTA (Threadpool Worker)
~29s -> !CLRStack
000000003c66eb70 00000000770512fa [GCFrame: 000000003c66eb70]
000000003c66ec40 00000000770512fa [GCFrame: 000000003c66ec40]
000000003c66ec78 00000000770512fa [HelperMethodFrame: 000000003c66ec78] System.Threading.Monitor.Enter(System.Object)
000000003c66ed70 000007fef7af1c9c System.Threading.TimerQueueTimer.Fire()
000000003c66ede0 000007fef7a6c2f3 System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
000000003c66ee30 000007fef7a6c92a System.Threading.ThreadPoolWorkQueue.Dispatch()
000000003c66f388 000007fef8d57d33 [DebuggerU2MCatchHandlerFrame: 000000003c66f388]
~29s -> K
000000003c66e858 000007fefd7010dc ntdll!NtWaitForSingleObject+0xa
000000003c66e860 000007fef8d049bf KERNELBASE!WaitForSingleObjectEx+0x79
000000003c66e900 000007fef8d04977 clr!CLREventBase::WaitEx+0x16c
000000003c66e940 000007fef8d048f8 clr!CLREventBase::WaitEx+0x103
000000003c66e9a0 000007fef8e9c5de clr!CLREventBase::WaitEx+0x70
000000003c66ea30 000007fef8dc5a34 clr!WKS::GCHeap::WaitUntilGCComplete+0x2b
000000003c66ea60 000007fef8d0c4f4 clr!Thread::RareDisablePreemptiveGC+0x176
000000003c66eaf0 000007fef8dd1f3d clr!GCCoop::GCCoop+0x3d
000000003c66eb20 000007fef8e898cf clr!AwareLock::Contention+0x137
000000003c66ebe0 000007fef7af1c9c clr!JITutil_MonContention+0xaf
000000003c66ed70 000007fef7a6c2f3 mscorlib_ni+0x521c9c
000000003c66ede0 000007fef7a6c92a mscorlib_ni+0x49c2f3
000000003c66ee30 000007fef8d57d33 mscorlib_ni+0x49c92a
000000003c66eef0 000007fef8d556e6 clr!CallDescrWorkerInternal+0x83
000000003c66ef30 000007fef8d557af clr!CallDescrWorkerWithHandler+0x4a
000000003c66ef70 000007fef8eda2c9 clr!MethodDescCallSite::CallTargetWorker+0x2e6
000000003c66f120 000007fef8ee51b0 clr!QueueUserWorkItemManagedCallback+0x2a
000000003c66f200 000007fef8ee513e clr!DebuggerU2MCatchHandlerFrame::DebuggerU2MCatchHandlerFrame+0xa0
000000003c66f240 000007fef8ee50b5 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x38e
000000003c66f340 000007fef8ee51eb clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x2bd
000000003c66f3d0 000007fef8eda224 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x23b
000000003c66f430 000007fef8ee6baf clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0xb4
000000003c66f5c0 000007fef8ee6ab3 clr!ThreadpoolMgr::ExecuteWorkRequest+0x4c
000000003c66f5f0 000007fef8eda8a6 clr!ThreadpoolMgr::WorkerThreadStart+0xf3
000000003c66f6b0 0000000076c9652d clr!Thread::intermediateThreadProc+0x7d
000000003c66f7f0 000000007702c541 kernel32!BaseThreadInitThunk+0xd
000000003c66f820 0000000000000000 ntdll!RtlUserThreadStart+0x1d
I'm having a hard time interpreting the stack traces since they don't hit any of my application code.
Are they all just idle thread-pool workers, waiting for work?
Threads with WaitForSingleObject are not critical, since they are waiting and not consuming CPU time. But be aware that your dump is only a snapshot and you might have had bad luck when taking the snapshot.
For a performance analysis with WinDbg you'd need to take several dumps during high CPU and compare them. If they all have similar stack traces, that's fine and you can conclude something. If they are all very different, it's almost useless.
The command !runaway seems more interesting here, since it lists the CPU time consumed per thread, so you can identify the one(s) that are on high CPU. Again, having two snapshots that you can compare is helpful, because the main thread may still have consumed more CPU time in total than some short-lived 100% threads.
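For example, running !runaway 7 shows user time, kernel time, and elapsed time per thread in one listing (7 is simply the sum of WinDbg's standard flag bits 1, 2 and 4).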
If you can't use a performance profiler, SysInternals Procdump can generate a series of dumps (-n) for you on high CPU (-c). Use -s to set the time between dumps. For .NET, don't forget -ma for full memory.
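A rough sketch of such a ProcDump command line (the service executable name and the thresholds are placeholders, not taken from your dump):
procdump -ma -c 90 -s 5 -n 3 MyWcfService.exe
Here -c 90 triggers on high CPU, -n 3 asks for three dumps, and -ma captures full memory so that managed analysis works.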
Other than that, 646 threads sounds a lot to me. The OS itself could be quite busy scheduling them.
Sounds like the issue could be related to GC. Since this is a self-hosted service, it will use the Workstation GC by default, unless you enable the server GC manually:
http://msdn.microsoft.com/en-us/library/ms229357(v=vs.110).aspx
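A minimal sketch of the relevant app.config fragment (assuming the service already has a configuration file to merge this into):
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>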
Have you tried that to see if it makes any difference?
Perfview from Microsoft may be helpful. From the link:
http://blogs.msdn.com/b/dotnet/archive/2012/10/09/improving-your-app-s-performance-with-perfview.aspx
"Late last year, Vance Morrison, who is currently an architect on the .NET Framework Performance team, released PerfView, which is a new performance tool for .NET developers. PerfView helps you discover and investigate performance hotspots in .NET Framework apps, and enables you to deliver consistently high-performance apps to your customers.
Using PerfView, you can perform complex CPU performance analyses to solve hard-to-detect performance problems. PerfView's revolutionary grouping and folding features are what makes it possible to grasp and solve these difficult problems."
Use WPRUI.exe to capture a trace and analyze the CPU usage with WPA.exe.
Microsoft explained how to analyze the created trace in the following video:
Defrag Tools: #42 - WPT - CPU Analysis
http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-42-WPT-CPU-Analysis
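If you prefer the command line to WPRUI, roughly the same trace can be captured with wpr.exe and its built-in CPU profile (the output file name is a placeholder):
wpr -start CPU
wpr -stop HighCpu.etl
Then open the resulting .etl file in WPA.exe as described in the video.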
Collect an ETW trace with PerfView and follow the big % numbers.
Try running ~*e !clrstack in WinDbg to get the call stacks of all threads, and look for code that shows up repeatedly.

nodejs profiling; what can 'Unknown' be

While profiling a nodejs program, I see that 61% of the ticks are caused by 'Unknown' (see below). What can this be? What should I look for?
gr,
Coen
Statistical profiling result from node, (14907 ticks, 9132 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
9132 61.3%
[Shared libraries]:
ticks total nonlib name
1067 7.2% 0.0% C:\Windows\SYSTEM32\ntdll.dll
55 0.4% 0.0% C:\Windows\system32\kernel32.dll
[JavaScript]:
ticks total nonlib name
1381 9.3% 10.0% LazyCompile: *RowDataPacket.parse D:\MI\packet.js:9
......
Are you loading any modules that have built dependencies?
Basically by "Unknown" it means "unaccounted for" (check tickprocessor.js for more explanation). For example, the GC will print messages like "scavenge,begin,..." but that is unrecognized by logreader.js.
It would help to know what profiling library you're using to parse the v8.log file.
Update
The node-tick package hasn't been updated for over a year and is probably missing a lot of recent prof commands. Try using node-profiler instead. It's created by one of node's maintainers. And if you want the absolute best result you'll need to build it using node-gyp.
Update
I've parsed the v8.log output using the latest from node-profiler (the latest on master, not the latest tag) and posted the results at http://pastebin.com/pdHDPjzE
Allow me to point out a couple key entries which appear about half way down:
[GC]:
ticks total nonlib name
2063 26.2%
[Bottom up (heavy) profile]
6578 83.4% c:\node\node.exe
1812 27.5% LazyCompile: ~parse native json.js:55
1811 99.9% Function: ~<anonymous> C:\workspace\repositories\asyncnode_MySQL\lib\MySQL_DB.js:41
736 11.2% Function: ~Buffer.toString buffer.js:392
So 26.2% of all ticks were spent in garbage collection, which is much higher than it should be. It does, however, correlate well with how much time is spent in Buffer.toString. If that many Buffers are being created and then converted to strings, both need to be GC'd when they leave scope.
Also I'm curious why so much time is spent in LazyCompile for json.js. Or more so, why would json.js even be necessary in a node application?
To help you performance tune your application I'm including a few links below that give good instructions on what to do and look for.
Nice slide deck with the basics:
https://mkw.st/p/gdd11-berlin-v8-performance-tuning-tricks/#1
More advanced examples of optimization techniques:
http://floitsch.blogspot.com/2012/03/optimizing-for-v8-introduction.html
Better use of closures:
http://mrale.ph/blog/2012/09/23/grokking-v8-closures-for-fun.html
As for why you couldn't get the same output: if you built and used node-profiler and its provided nprof from master and it still doesn't work, then I'll assume it has something to do with being on Windows. Consider filing a bug on GitHub and see if the maintainer will help you out.
You are using a 64 bit version of Node.JS to run your application and a 32bit build of the d8 shell to process your v8.log.
Using either a 32 bit version of Node.JS with ia32 as the build target for the d8 shell or a 64 bit version of Node.JS with x64 as the d8 shell build target should solve your problem.
Try to build v8 with profiling support on:
scons prof=on d8
Make sure you run node --prof with a version corresponding to the version of v8.
Then tools/linux-tick-processor path/to/v8.log should show you the full profile info.
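Roughly, the workflow looks like this (app.js is a placeholder for your entry point; older node versions write the log as v8.log in the current directory):
node --prof app.js
tools/linux-tick-processor v8.log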

How much disk space do shared libraries really save in modern Linux distros?

In the static vs shared libraries debates, I've often heard that shared libraries eliminate duplication and reduces overall disk space. But how much disk space do shared libraries really save in modern Linux distros? How much more space would be needed if all programs were compiled using static libraries? Has anyone crunched the numbers for a typical desktop Linux distro such as Ubuntu? Are there any statistics available?
ADDENDUM:
All answers were informative and are appreciated, but they seemed to shoot down my question rather than attempt to answer it. Kaleb was on the right track, but he chose to crunch the numbers for memory space instead of disk space (my question was for disk space).
Because programs only "pay" for the portions of static libraries that they use, it seems practically impossible to quantitatively know what the disk space difference would be for all static vs all shared.
I feel like trashing my question now that I realize it's practically impossible to answer. But I'll leave it here to preserve the informative answers.
So that SO stops nagging me to choose an answer, I'm going to pick the most popular one (even if it sidesteps the question).
I'm not sure where you heard this, but reduced disk space is mostly a red herring as drive space approaches pennies per gigabyte. The real gain with shared libraries comes with security and bugfix updates for those libraries; applications using static libraries have to be individually rebuilt with the new libraries, whereas all apps using shared libraries can be updated at once by replacing only a few files.
Not only do shared libraries save disk space, they also save memory, and that's a lot more important. The prelinking step is important here... you can't share the memory pages between two instances of the same library unless they are loaded at the same address, and prelinking allows that to happen.
Shared libraries do not necessarily save disk space or memory.
When an application links to a static library, only those parts of the library that the application uses will be pulled into the application binary. The library archive (.a) contains object files (.o), and if they are well factored, the application will use less memory by only linking with the object files it uses. Shared libraries will contain the whole library on disk and in memory whether parts of it are used by applications or not.
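As a quick illustration, you can list the individual object files inside a static library archive with ar (the path is an assumption; adjust it for your distribution):
ar t /usr/lib/libc.a | head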
For desktop and server systems, this is less likely to result in a win overall, but if you are developing embedded applications, it's worth trying static linking all the applications to see if that gives you an overall saving.
I was able to figure out a partial quantitative answer without having to do an obscene amount of work. Here is my (harebrained) methodology:
1) Use the following command to generate a list of packages with their installed size and list of dependencies:
dpkg-query -Wf '${Package}\t${Installed-Size}\t${Depends}\n'
2) Parse the results and build a map of statistics for each package:
#include <map>
#include <string>

struct PkgStats
{
    PkgStats() : kbSize(0), dependentCount(0) {}
    int kbSize;
    int dependentCount;
};
typedef std::map<std::string, PkgStats> PkgMap;
Where dependentCount is the number of other packages that directly depend on that package.
Results
Here is the Top 20 list of packages with the most dependants on my system:
Package Installed KB # Deps Dup'd MB
libc6 10096 750 7385
python 624 112 68
libatk1.0-0 200 92 18
perl 18852 48 865
gconf2 248 34 8
debconf 988 23 21
libasound2 1428 19 25
defoma 564 18 9
libart-2.0-2 164 14 2
libavahi-client3 160 14 2
libbz2-1.0 128 12 1
openoffice.org-core 124908 11 1220
gcc-4.4-base 168 10 1
libbonobo2-0 916 10 8
cli-common 336 8 2
coreutils 12928 8 88
erlang-base 6708 8 46
libbluetooth3 200 8 1
dictionaries-common 1016 7 6
where Dup'd MB is the number of megabytes that would be duplicated if there was no sharing (= installed_size * (dependants_count - 1), for dependants_count > 1).
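For example, for libc6 that is 10096 KB * (750 - 1) ≈ 7,385 MB, which matches the first row of the table.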
It's not surprising to see libc6 on top. :) BTW, I have a typical Ubuntu 9.10 setup with a few programming-related packages installed, as well as some GIS tools.
Some statistics:
Total installed packages: 1717
Average # of direct dependents: 0.92
Total duplicated size with no sharing (ignoring indirect dependencies): 10.25GB
Histogram of # of direct dependents (note logarithmic Y scale):
Note that the above totally ignores indirect dependencies (i.e. everything should at least be indirectly dependent on libc6). What I really should have done is build a graph of all dependencies and use that as the basis for my statistics. Maybe I'll get around to it sometime and post a lengthy blog article with more details and rigor.
OK, perhaps not an answer, but the memory savings are what I'd consider. The savings depend on the number of times a library is loaded after the first application, so let's find out how much is saved per library on this system using a quick script:
#!/bin/sh
# For every shared library mapped by more than one process, estimate the
# memory saved as (number of extra users) * (library file size).
lastlib=""
cnt=0
lsof | grep 'lib.*\.so$' | awk '{print $9}' | sort | while read lib ; do
    if [ "$lastlib" = "$lib" ] ; then
        cnt=$((cnt + 1))
    else
        if [ -n "$lastlib" ] ; then
            size=$(ls -l "$lastlib" | awk '{print $5}')
            echo "$lastlib: $(( (cnt - 1) * size ))"
        fi
        cnt=1
    fi
    lastlib="$lib"
done
That will give us the savings per lib, like so:
...
/usr/lib64/qt4/plugins/crypto/libqca-ossl.so: 0
/usr/lib64/qt4/plugins/imageformats/libqgif.so: 540640
/usr/lib64/qt4/plugins/imageformats/libqico.so: 791200
...
Then, the total savings:
$ ./checker.sh | awk '{total = total + $2}END{print total}'
263160760
So, roughly speaking on my system I'm saving about 250 Megs of memory. Your mileage will vary.
