Unknown events in nodejs/v8 flamegraph using perf_events - node.js

I'm trying to do some Node.js profiling using Linux perf_events, as described by Brendan Gregg here.
The workflow is as follows:
1. Run node >0.11.13 with --perf-basic-prof, which creates a /tmp/perf-(PID).map file where the JavaScript symbol mappings are written.
2. Capture stacks using perf record -F 99 -p `pgrep -n node` -g -- sleep 30
3. Fold stacks using the stackcollapse-perf.pl script from this repository
4. Generate the SVG flame graph using the flamegraph.pl script
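For reference, the full command sequence I'm running looks roughly like this (app.js is a stand-in for my actual entry point, and the two Perl scripts are run from the cloned FlameGraph repository):

node --perf-basic-prof app.js &
perf record -F 99 -p `pgrep -n node` -g -- sleep 30
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flamegraph.svg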
I get the following result (which looks really nice at first):
The problem is that there are a lot of [unknown] elements, which I suppose should be my Node.js function calls. I assume the whole process fails somewhere at point 3, where the perf data should be folded using the mappings generated by node/v8 executed with --perf-basic-prof. The /tmp/perf-PID.map file is created and some mappings are written to it during node execution.
How can I solve this problem?
I am using CentOS 6.5 x64, and have already tried this with node 0.11.13 and 0.11.14 (both prebuilt and compiled from source) with no success.

First of all, what "[unknown]" means is that the sampler couldn't figure out the name of the function, because it's a system or library function.
If so, that's OK - you don't care, because you're looking for things responsible for time in your code, not system code.
Actually, I'm suggesting this is one of those XY questions.
Even if you get a direct answer to what you asked, it is likely to be of little use.
Here are the reasons why:
1. CPU Profiling is of little use in an I/O bound program
The two towers on the left in your flame graph are doing I/O, so they probably take a lot more wall-time than the big pile on the right.
If this flame graph were derived from wall-time samples, rather than CPU-time samples, it could look more like the second graph below, which tells you where time actually goes:
What was a big juicy-looking pile on the right has shrunk, so it is nowhere near as significant.
On the other hand, the I/O towers are very wide.
Any one of those wide orange stripes, if it's in your code, represents a chance to save a lot of time, if some of the I/O could be avoided.
2. Whether the program is CPU- or I/O-bound, speedup opportunities can easily hide from flame graphs
Suppose there is some function Foo that really is doing something wasteful, that if you knew about it, you could fix.
Suppose in the flame graph, it is a dark red color.
Suppose it is called from numerous places in the code, so it's not all collected in one spot in the flame graph.
Rather it appears in multiple small places shown here by black outlines:
Notice, if all those rectangles were collected, you could see that it accounts for 11% of time, meaning it is worth looking at.
If you could cut its time in half, you could save 5.5% overall.
If what it's doing could actually be avoided entirely, you could save 11% overall.
Each of those little rectangles would shrink down to nothing, and pull the rest of the graph, to its right, with it.
Now I'll show you the method I use. I take a moderate number of random stack samples and examine each one for routines that might be speeded up.
That corresponds to taking samples in the flame graph like so:
The slender vertical lines represent twenty random-time stack samples.
As you can see, three of them are marked with an X.
Those are the ones that go through Foo.
That's about the right number, because 11% times 20 is 2.2.
(Confused? OK, here's a little probability for you. If you flip a coin 20 times, and it has an 11% chance of coming up heads, how many heads would you get? Technically it's a binomial distribution. The most likely number you would get is 2, the next most likely numbers are 1 and 3. (If you only get 1 you keep going until you get 2.) Here's the distribution:)
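(Written out, that distribution is just the binomial probability mass function with n = 20 samples and hit probability p = 0.11:

P(k) = \binom{20}{k} (0.11)^k (0.89)^{20-k}, \qquad E[k] = 20 \times 0.11 = 2.2

which works out to roughly P(0) ≈ 0.10, P(1) ≈ 0.24, P(2) ≈ 0.28, P(3) ≈ 0.21.)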
(The average number of samples you have to take to see Foo twice is 2/0.11 = 18.2 samples.)
Looking at those 20 samples might seem a bit daunting, because they run between 20 and 50 levels deep.
However, you can basically ignore all the code that isn't yours.
Just examine them for your code.
You'll see precisely how you are spending time,
and you'll have a very rough measurement of how much.
Deep stacks are both bad news and good news -
they mean the code may well have lots of room for speedups, and they show you what those are.
Anything you see that you could speed up, if you see it on more than one sample, will give you a healthy speedup, guaranteed.
The reason you need to see it on more than one sample is, if you only see it on one sample, you only know its time isn't zero. If you see it on more than one sample, you still don't know how much time it takes, but you do know it's not small.
Here are the statistics.

Generally speaking it is a bad idea to disagree with a subject matter expert but (with the greatest respect) here we go!
SO urges answerers to do the following:
"Please be sure to answer the question. Provide details and share your research!"
So the question was, or at least my interpretation of it is: why are there [unknown] frames in the perf script output (and how do I turn these [unknown] frames into meaningful names)?
This question could be about "how to improve the performance of my system?" but I don't see it that way in this particular case. There is a genuine problem here about how the perf record data has been post-processed.
The answer to the question is that although the prerequisite setup is correct (the correct node version, and the correct argument (--perf-basic-prof) to generate the function names), the generated perf map file must be owned by root for perf script to produce the expected output.
That's it!
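A minimal sketch of the fix, assuming you are running perf as root and the node process is the most recently started one (adjust the pgrep lookup to your situation):

sudo chown root /tmp/perf-`pgrep -n node`.map
sudo perf script > out.perf

After that, the folding and flame graph steps work on out.perf as usual.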
Writing some new scripts today I hit upon this, which directed me to this SO question.
Here are a couple of additional references:
https://yunong.io/2015/11/23/generating-node-js-flame-graphs/
https://github.com/jrudolph/perf-map-agent/blob/d8bb58676d3d15eeaaf3ab3f201067e321c77560/bin/create-java-perf-map.sh#L22
[ non-root files can sometimes be forced ] http://www.spinics.net/lists/linux-perf-users/msg02588.html

Related

How to use xperf to examine the call stack of nodejs?

I'm trying to learn performance tuning for Node.js applications. The first thing I want is a flame graph. Since I work on the Windows platform, I'm following this manual to get the flame graph.
However, I'm stuck at this step:
xperf -i perf.etl -o perf.csv -symbols
I'm no good with xperf. Could someone tell me how to get past this problem and get a flame graph?
It's worth pointing out that xperf can record many different types of call stacks. You can get a call stack on every file I/O, disk I/O, context switch, registry access, etc., and you could create a flame graph of any one of these. I assume, however, that you want a flame graph of the CPU Sampled data.
You can find a slightly different technique for creating flame graphs from xperf sampled data on my blog, here:
https://randomascii.wordpress.com/2013/03/26/summarizing-xperf-cpu-usage-with-flame-graphs/
You don't say what your problem was -- what went wrong with that step -- so I'll give a few generic suggestions:
Try with a very short trace -- just a few seconds -- to make the process as fast as possible when experimenting.
Try loading the trace into WPA to make sure you can see the sampled data there. You may find that you don't need the flame graph, since WPA gives you ways to graphically explore the data. Loading the trace into WPA also gives you a chance to make sure the symbols load, and gives WPA a chance to convert the symbols to .symcache files, which will make the processing step run much faster.
Make sure you have _NT_SYMBOL_PATH set to point to Microsoft's symbol servers and any others you might need.
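For example (the local cache directory C:\symbols is just a suggestion, any writable path will do):

set _NT_SYMBOL_PATH=srv*C:\symbols*https://msdl.microsoft.com/download/symbols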
Consider recording the trace with wprui instead of with a batch file: https://randomascii.wordpress.com/2013/04/20/xperf-basics-recording-a-trace-the-easy-way/
You could probably improve on the flame graph generation process by not exporting all of the xperf data to text, by using the somewhat new wpaexporter, which I document here:
https://randomascii.wordpress.com/2013/11/04/exporting-arbitrary-data-from-xperf-etl-files/
However this will require reworking the scripts and may be more work than you want to put in.

Oprofile callgraph: origin of syscalls

I have been using oprofile to try to discover why my program was spending so much time in the kernel. I now have the symbols from the kernel, but apparently no links between my program and kernel that'll tell me which bits of my program are taking so long.
samples  %        image name                app name                  symbol name
-------------------------------------------------------------------------------
    201   0.8911  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  _raw_spin_lock_irq
    746   3.3073  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  rb_get_reader_page
   5000  22.1671  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  default_spin_lock_flags
  16575  73.4838  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  _raw_spin_lock
  22469  11.1862  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  __ticket_spin_lock
  22469  99.6010  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  __ticket_spin_lock [self]
     26   0.1153  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  ret_from_intr
Where do I go from here? How do I discover the places in my program that are causing __ticket_spin_lock?
Oprofile takes stack samples. What you need to do is not look at summaries of them, but actually examine the raw samples. If you are spending, say, 30% of time in the kernel, then if you can see 10 stack samples chosen at random, you can expect 3 of them, more or less, to show you the full reason of how you got into the kernel.
That way you will see things the summaries or call graph won't show you.
IN CASE IT ISN'T CLEAR: Since __ticket_spin_lock is on the stack 99.6% of the time, then on each and every stack sample you look at, the probability is 99.6% you will see how you got into that routine.
Then if you don't really need to be doing that, you have possibly a 250x speedup.
That's like from four minutes down to one second. Screw the "correct" or "automated" approach - get the results.
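To spell out that arithmetic: removing work that accounts for a fraction f of total time gives a speedup factor of 1/(1 - f), so here

\frac{1}{1 - 0.996} = 250, \qquad \frac{240\ \text{s}}{250} \approx 1\ \text{s}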
ADDED: The thing about profilers is they are popular and some have very nice UIs,
but sadly, I'm afraid, it's a case of "the emperor's new clothes".
If such a tool doesn't find much to fix, you're going to like it, because it says (probably falsely) that your code, as written, is near-optimal.
There are lots of postings recommending this or that profiler, but
I can't point to any claim of saving more than some percent of time, like 40%, using a profiler.
Maybe there are some.
I have never heard of a profiler being used first to get a speedup, and then being used again to get a second speedup, and so on.
That's how you get real speedup - multiple optimizations.
Something that was just a small performance problem at the beginning is no longer small after you've removed a larger one.
This picture shows how, by removing six problems, the speedup is nearly three orders of magnitude.
You can't necessarily do that, but isn't it worth trying?
APOLOGIES for further editing. I just wanted to show how easy it is to fool a call graph.
The red lines represent call stack samples. Here A1 spends all its time calling C2, and vice-versa. Then suppose you keep the same behavior, but you put in a "dispatch" routine B.
Now the call graph loses the information that A1 spends all its time in C2, and vice-versa.
You can easily extend this example to multiple levels.
You can say a call tree would have seen that.
Well, here's how you can fool a call tree. A spends all its time in calls to C.
Now if instead A calls B1, B2, ... Bn, and those call C, the "hot path" from A to C is broken up into pieces, so the relationship between A and C is hidden.
There are many other perfectly ordinary programming practices that will confuse these tools, especially when the samples are 10-30 levels deep and the functions are all little, but the relationships cannot hide from a programmer carefully examining a moderate number of samples.
I agree with Mike's answer: a callgraph is not the right way to inspect the source of the problem. What you really want is to look at the callchains of the hottest samples.
If you don't want to inspect "by hand" the raw samples collected by oprofile, you could rerun your application with the record command of perf using the -g option in order to collect the stacktraces. You can then display the samples annotated with their callchains using the report command of perf. Since perf is not aggregating the callchains of the individual samples in a global callgraph, you don't have some of the issues outlined in Mike's post.
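Something along these lines, where ./myprogram is a stand-in for your own binary and arguments:

perf record -g ./myprogram
perf report

perf report will then let you expand each hot symbol, such as __ticket_spin_lock, to see the call chains that led to it.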

text-based viewer for profiling results

Do you know of a text-based application for viewing results of application profiling? The profiling results basically contain a list of C++ function call backtraces and how often these backtraces were encountered; now I'm looking for a console tool to analyze the raw data (which backtrace occurred most often; which function was called most often, independent of call trace...).
So far I've created callgrind-compatible files from the raw data and then used the excellent KCachegrind tool for analysis; but now I'm also looking for a tool that works without a GUI, in a text-based terminal. Any ideas?
Take a look at callgrind_annotate.
This command reads in the profile data, and prints a sorted lists of functions, optionally with source annotation.
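For example, if your converted profile is in a file called callgrind.out.12345 (any callgrind-format file will do):

callgrind_annotate callgrind.out.12345
callgrind_annotate --auto=yes callgrind.out.12345 | less

The first form prints the sorted function list; --auto=yes adds per-line source annotation when the sources are available.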
I wrote such a viewer once. It focussed on a line of code, showing the percent of samples running through that line, and a butterfly view allowing transitions to superior or subordinate lines of code.
It made a nice demo, but did I really use it? Not for long.
(I'm assuming the stack samples have been taken during the interval that you wish to speed up, i.e. not during user-wait.)
The thing is, the program is probably doing something wasteful in that time. (If it is not, you can't speed it up.)
Whatever that wasteful thing is, it consists of some percent of time being spent for poor reasons, like 10%, 50%, 90%, or whatever. During that time, it is on the stack, so an examination of the stack samples will show it.
And, you don't have to look at very many of them. If something is taking 50% of time, 1000 samples will show it on about 500, and 10 samples will show it on about 5. The larger number of samples will estimate the percentage with an extra digit of precision. If your goal is to isolate the problem so you can fix it, you don't need that extra digit.
So, a tool that shows you, by line, the percent of stack samples going through that line is a very nice thing to have, because the wasteful code will appear on it, showing the percentage.
What it does not show you is the reason why the statement is being executed, which is how you know if it's wasteful. Looking at the statement's context on the stack does tell you that.
So even though I had the viewer, I just ended up examining the samples themselves, and only about 10 or 20 of them. The bigger the percentage is, the smaller the number of samples I need to look at before I find it. Here's an example.

Can perf display raw sample counts?

I'd like perf to output raw sample counts rather than percentages. This is useful for determining whether I've sped up a function I'm trying to optimize.
To be clear, I'd like to do something like
perf record ./a.out
perf report
and see how many times perf sampled each function in a.out.
Shark can do this on Mac, as can (I believe) Xperf. Is this possible on Linux with perf?
perf report (version 2.6.35.7) now supports the -n flag, which does what I want.
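So the whole sequence is just:

perf record ./a.out
perf report -n

and the report gains a Samples column next to the usual Overhead percentage.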
You want to see if your changes to a function made a difference.
I presume you also want whatever help you can get in finding out which function you need to change.
Those two objectives are not the same.
Many tools give you as broad a set of statistics or counters as they can dream up, as if having more statistics will help either goal.
Can you get hold of RotateRight/Zoom, or any tool that gives you stack samples on wall-clock time, preferably under user control? Such a tool will give you time and percent spent in any routine or line of code, in particular inclusive time.
The reason inclusive time is so important is that every single line of code that is executed is responsible for a certain fraction of time, such that if the line were not there, that fraction of time would not be spent, and overall time would be reduced by that fraction. During that fraction of time, whether it is spent in one big chunk or thousands of little chunks, that line of code is on the call stack, where stack samples will spot it, at a rate equal to its fraction. That is why stack sampling is so effective in finding code worth optimizing, whether it consists of leaf instructions or calls in the call tree.
This link gives the how and why of the method I personally use, which is not fancy, but is as effective as or more effective than any method or tool I've seen. Here's a discussion.

Ubiquitous computing and magnetic interference

Imagine a car radio: do the electromagnetic fields the car passes through interfere with the processing? It's easy to understand that a strong field can corrupt stored data. But what about the data being processed? Can it also be changed?
If so, how could you protect your code against this? (Without electrical protections, just code-based ones.)
For the most robust mission-critical systems you use multiple processors and compare results. This is what we did with aircraft autopilots (autolanding). We had three autopilots, one flying the aircraft and two checking that one. If any one of the three disagreed, it was shut down.
You're referring to what Wikipedia calls soft errors. The traditional, industry-accepted work-around for this is through redundancy, as Jim C and fmsf noted.
Several years ago, our repair department's analysis showed an unacceptable number of returned units with single-bit errors in the battery-backed SRAM that held the firmware. Despite our efforts at root-cause analysis, we were unable to explain the source of the problem. At that point a hardware change was out of the question, so we needed a software-only solution to treat the symptom.
We wanted a reliable fix that we could implement simply and quickly, so we generated parity checks on blocks of code in the SRAM. We chose a block size that required very little additional storage for the parity data, yet provided enough redundancy to detect and correct any of the errors we'd seen and then some. It logs the errors it detects and indicates whether it can correct them, so we still know when bit errors occur in the field. So far, so good!
Our product manager did some additional research out of curiosity and convinced himself that the culprit was cosmic radiation. We never proved it unequivocally, but he was satisfied that the number of errors seemed to agree with what would be expected based on the data he found. I'm just glad the returns have stopped.
I doubt you can.
Code that is changed won't run, so likely your program(s) will crash if you have this problem.
This is a hardware problem.
