The documentation is quite terse. Is the purpose to benchmark the logging itself, or to benchmark my code? Do I need to put the PERF target inside another target? Or it is the other way around? Can someone give an example of how I should use it?
Never tried using it myself, but I can see there are two options in NLog for dealing with Windows Performance Counters:
PerfCounter-Target - Allows you generate Performance Counter Values, that can be picked up by an external Windows Performance Counter Monitor Application.
PerformanceCounter-Renderer - Allows you to capture the value of an existing Windows Performance Counter in the log-output.
Have updated the PerfCounter-Target-Wiki about using install to create the performance counter. Found some example code here: PerformanceCounterTarget Class
Here is someone that actually have used the PerCounter-Target: Performance Counters with NLog
Related
In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
pctr_start = sample_pctr();
the_benchmark();
pctr_stop = sample_pctr();
results[iteration] = pctr_stop - pctr_start;
}
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (In an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all).
My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().
I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.
Does anyone know a lightweight way to ask the kernel to configure the a performance counters from user-space so that I can then use RDPMC from within my benchmark harness?
Stuff I've looked into/thought about:
Perf tools for Linux. Seems to be geared up for sampling over the whole lifetime of a process, not within a process as specific points (before and after each iteration).
Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
PAPI builds on perf, so probably inherits the above problem.
Write a kernel module -- too much effort, too error prone.
Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.
Any help is most appreciated. Thanks.
EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.
It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.
Background
I've written a tool to capture CPU usage on a per/thread basis. The output of the tools is a binary file, that I can pump into my parsing utility that I wrote. And the output of the parsing utility is a CSV file that I can import into Excel to chart pretty graphs of process/thread CPU usage.
This CPU usage capture tool is running on an embedded ARM platform running a Linux kernel based on 2.6.35.3. That being said, I was concerned about making the tool light weight. I didn't want it to store directly to a CSV file, in order to minimize the processing time and the file size of the captured data.
Question
The tool works, but I'm wondering if I took the long way around the problem? Is there already a tool out there that does this (or something like it)?
You're probably wondering why I care if I already made a tool that works. Well, it's not as light weight as I'd like. It's taking up about 10% of CPU usage. As a benchmark, top only takes up about 1% (max).
Update
I've decided to continue using my tool for now. At least until a better solution becomes available. I was able to shave off a couple percentage points by using open() instead of fopen() on /proc/stat. I'm also using read() instead of fgets().
IBM has a tool called nmon which does the same(for AIX & Linux): According to IBM's documentation, it takes ~2% CPU. You may want to look at that.
Comparing nmon with your tool could give you a fair idea about your program's performance and how you may improve your csv capture.
This might be a bit of a steep learning curve, but you might want look into SystemTap: http://sourceware.org/systemtap/
I have a program in which significant amount of time is spent loading and saving data. Now I want to know how much time each function is taking in terms of percentage of the total running time. However, I want to exclude the time taken by loading and saving functions from the total time considered by the profiler. Is there any way to do so using gprof or any other popular profiler?
Similarly you can use
valgrind --tool=callgrind --collect-atstart=no --toggle-collect=<function>
Other options to look at:
--instr-atstart # to avoid runtime overhead while not profiling
To get instructionlevel stats:
--collect-jumps=yes
--dump-instr=yes
Alternatively you can 'remote control' it on the fly: callgrind_control or annotate your source code (IIRC also with branch predictions stats): callgrind_annotate.
The excellent tool kcachegrind is a marvellous visualization/navigation tool. I can hardly recommend it enough:
I would consider using something more modern than gprof, such as OProfile. When generating a report using opreport you can use the --exclude-symbols option to exclude functions you are not interested in.
See the OProfile webpage for more details; however for a quick start guide see the OProfile docs page.
Zoom from RotateRight offers a system-wide time profile for Linux. If your code spends a lot of time in i/o, then that time won't show up in a time profile of the CPUs. Alternatively, if you want to account for time spent in i/o, try the "thread time profile".
for a simple, basic solution, you might want log data to a csv file.
e.g. Format [functionKey,timeStamp\n]
... then load that up in Excel. Get the deltas, and then include or exclude based on if functions. Nothing fancy. On the upside, you could get some visualisations fairly cheaply.
I am trying to profile my software (in Linux) with oprofile. My software consists of both userspace and kernel module. First my doubt is what does the --separate=kernel option do? What will be the difference when running without that option? I did try to see it but couldn't find any difference. Could you please post an example?
Can't i profile a kernel module without the --seperate=kernel option?
Thanks,
Bala
In oprofile when used with option --seperate=kernel, it seperates the kernel and kernel modules per application.
--seperate='library' seperates the samples for the dynamically linked object per application basis.
kernel, dynamically linked object are just not specific to the application we want to profile alone. But at the same time our application spends considerable amount of time in them.
So --seperate allows one to get the samples from the point of view of the application we are interested in profiling. It can also give samples based on individual threads also.
Kernel can be profiled by providing --vmlinux option to opcontrol.
Ex:- opcontrol --vmlinux=/boot/vmlinux-2.6.27.23-0.1-preempt
--seperate is additional option that allows us to see the samples at different resolutions.
I need a small, portable framework for logging on embedded linux. Ideally it would output to a file or a socket, and having some sort of log rotation/compression would also be nice.
So far, I've found a lot of frameworks, but almost all of them have daunting build procedures or require the use of application frameworks (e.g. log4cxx requires the Apache Portable Runtime, which I'd rather not bother with...).
Just looking for something simple and robust, but everything I seem to find is complicated or requires lots of secondary junk just to run.
Suggestions? (and if the answer is roll my own, that's fine, but...it's be great to avoid that)
Use syslog(3) and syslogd from BusyBox. BusyBox can be very compact when stripped down and doesn't depend on anything other than libc. You can strip out everything you don't want so it is perfectly possible to use it only for logging.
We use BusyBox on a number of embedded systems, both Linux and uClinux, and find its logging facilities highly reliable.
I have no experience with the log4cxx-module but I am using APR on an embedded target running Linux (it is based on the Atmel AT91SAM926x processor family). It was really simple to configure and compile (more or less ./configure --host=arm-none-linux-gnueabi) so I would not be to afraid of going down the log4cxx-path.
Maybe you should consider spending some time on a good logging framework, since this is what you are going to use on your embedded Linux. ... and printf ...
I cooked something where I can enable/disable various logging levels per module in runtime.
Did you ever try debugging multithreaded apps on Linux?
Good luck!
Implementing very robust logging mechanism in C taking about 1000 code lines (from our code base). 90% of this defines of different sections. This includes different macros DBG_E DBG_W DBG_TRACE etc ... and spliting to the section, run time changing of debug level and debug modules (does not include compression just simple print abstraction that can be implemented in different ways file/socket/serial etc...) .
I will estimate that it take about few days to implement. The down side you will spend a few days the up side that you will get something that works for your needs and nothing more, i understand that you are working on embedded platform and footprint and memory usage are important, the best and optimized solution will be one you write. We invested those few days ones. and using it across different products/project and adjust/improve with the time past according to real needs. Main problem of generic solution that it usually will do sort of what you need and a lot more, this more usually just waist of resources.
I can't imagine that your platform is too small to include log4cxx and APR, neither is a large library, and even the tiniest platform is likely to have space for them.
You could just use syslog, which is provided by the C library - a syslog daemon is provided by busybox (which no doubt, you already use if you're on a really tiny platform). I don't know if busybox's syslogd can log to the network, but it has some level of flexibility. You can do log rotation using shell scripts pretty trivially.
Use klogd it reads the kernel log messages(from /proc/kmsg kernel) interface and redirect those messages to appropriate directory. you can use user configurable syslogd daemon along with klogd that will redirect kernel messages into appropriate files in /var/log/ directory.
For instance logs related to mail service will be stored in /var/log/main.log and logs related to kernel booting process will be stored in /var/log/boot.log . User can configure log parsing using syslogd configuration file.
But the use of syslogd may lead to your system performance degradation because for every log messages syslog daemon will do disk operation to store that log into appropriate file
Log sequence
Messages from kernel
---> klogd ( access messages from kernel ring buffer)-->syslogd --> /var/log/*