How to use xperf to examine the call stack of nodejs?

I'm trying to learn performance tuning for Node.js applications. The first thing I want is a flame graph. Since I work on the Windows platform, I followed this manual to get the flame graph.
However, I'm stuck at this step:
xperf -i perf.etl -o perf.csv -symbols
I'm no good with xperf. Could someone tell me how to get past this problem and get a flame graph?

It's worth pointing out that xperf can record many different types of call stacks. You can get a call stack on every file I/O, disk I/O, context switch, registry access, etc., and you could create a flame graph of any one of these. I assume, however, that you want a flame graph of the CPU Sampled data.
You can find a slightly different technique for creating flame graphs from xperf sampled data on my blog, here:
https://randomascii.wordpress.com/2013/03/26/summarizing-xperf-cpu-usage-with-flame-graphs/
You don't say what your problem was -- what went wrong with that step -- so I'll give a few generic suggestions:
Try with a very short trace -- just a few seconds -- to make the process as fast as possible when experimenting.
Try loading the trace into WPA to make sure you can see the sampled data there. You may find that you don't need the flame graph, since WPA gives you ways to graphically explore the data. Loading the trace into WPA also gives you a chance to make sure the symbols load, and gives WPA a chance to convert the symbols to .symcache files, which will make the processing step run much faster.
Make sure you have _NT_SYMBOL_PATH set to point to Microsoft's symbol servers and any others you might need.
Consider recording the trace with wprui instead of with a batch file: https://randomascii.wordpress.com/2013/04/20/xperf-basics-recording-a-trace-the-easy-way/
You could probably improve the flame graph generation process by not exporting all of the xperf data to text, and instead using the somewhat new wpaexporter, which I document here:
https://randomascii.wordpress.com/2013/11/04/exporting-arbitrary-data-from-xperf-etl-files/
However this will require reworking the scripts and may be more work than you want to put in.


The best way to load an openstreetmap .osm in a docker-container

My intentions:
Actually, I intend to:
implement vehicles as containers
simulate/move these containers along the roads taken from the .osm map
My viewpoint about the problem:
I have loaded the XML-based .osm file and processed it in Python using xml.dom. But I am not satisfied with the performance of loading the .osm file, because later on I will have to add/create more vehicles as containers that will be simulated on the same roads.
Suggestions needed:
This is my first time solving a problem related to maps. In fact, I need suggestions on how to proceed, keeping performance/efficiency in mind, with this set of requirements. Suggestions in terms of implementation will be much appreciated. Thanks in advance!
Simulating lots of vehicles by running lots of docker containers in parallel might work, I suppose. Maybe you're initialising the same image with different start locations etc. passed in as ENV vars? As a practical way of doing agent simulations this sounds a bit over-engineered to me, but as an interesting docker experiment it might make sense.
Maybe you'll need a central thing for holding and sharing the state (positions of other vehicles) and serving that back to the multiple agents.
The challenge of loading an .osm file into some sort of database or internal map representation doesn't seem like the hardest part. It can be done once on initialisation, so I imagine it's not the most performance-critical part of this.
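If the xml.dom load mentioned in the question turns out to be the slow part, a streaming parse is usually enough for a one-off load at startup. Here is a minimal, hedged Python sketch; the file name map.osm and the treat-every-way-as-two-way simplification are my assumptions, while node, way, nd, ref, lat and lon are the standard OSM XML names:

import xml.etree.ElementTree as ET

def load_osm(path):
    """Stream-parse an .osm file into node coordinates and an adjacency list."""
    coords = {}     # node id -> (lat, lon)
    adjacency = {}  # node id -> set of neighbouring node ids
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "node":
            coords[elem.get("id")] = (float(elem.get("lat")), float(elem.get("lon")))
            elem.clear()  # free memory as we go; this is why iterparse scales better than xml.dom
        elif elem.tag == "way":
            refs = [nd.get("ref") for nd in elem.findall("nd")]
            for a, b in zip(refs, refs[1:]):
                adjacency.setdefault(a, set()).add(b)
                adjacency.setdefault(b, set()).add(a)  # assume every way is usable in both directions
            elem.clear()
    return coords, adjacency

if __name__ == "__main__":
    coords, adjacency = load_osm("map.osm")  # "map.osm" is a placeholder path
    print(len(coords), "nodes,", sum(len(v) for v in adjacency.values()) // 2, "edges")

The coords/adjacency pair is also a convenient starting point for the routing discussed below.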
I'm thinking you'll probably want to do "routing" through the road network (taking account of one-way streets etc.?), giving your agents a purposeful path to follow to a destination. This will get more complicated if you want to model interactions with other agents, e.g. you might want to model getting stuck in traffic because other agents are going the same way, and even decisions to re-route because of traffic, so you may want quite a flexible routing system, perhaps self-coded.
But there are lots of open-source routing systems which work with OSM data, at least to draw inspiration from. See this list: https://wiki.openstreetmap.org/wiki/Routing#Developers
Popular choices like OSRM are designed to scale up to country-size or even global OpenStreetMap data, but I imagine that's overkill for you (you're probably looking at simulating within a city road network?). Even so, it's probably easy enough to get working in a docker container.
Or you might find something lightweight like the code of the JOSM routing plugin easier to embed in your docker image and customize (although I see that's using a library called "JGraphT")
Then, working backwards from a calculated route, you can calculate interpolated steps along that path, which will allow your simulated agents to take a step on each iteration (simulated movement).
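To make that last step concrete, here is a small, hedged Python sketch of interpolating positions along a route expressed as a list of (lat, lon) points; the haversine distance and the fixed step size are simplifying assumptions rather than part of any particular routing library:

import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371000 * 2 * math.asin(math.sqrt(a))

def interpolate_route(route, step_m):
    """Yield points spaced roughly step_m metres apart along a polyline of (lat, lon) points."""
    for p, q in zip(route, route[1:]):
        n = max(1, int(haversine_m(p, q) // step_m))
        for i in range(n):
            t = i / n
            yield (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
    yield route[-1]

# One simulated vehicle advances one interpolated point per tick:
route = [(52.5200, 13.4050), (52.5205, 13.4070), (52.5212, 13.4081)]  # made-up coordinates
for tick, position in enumerate(interpolate_route(route, step_m=10)):
    print(tick, position)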

Unknown events in nodejs/v8 flamegraph using perf_events

I'm trying to do some Node.js profiling using Linux perf_events, as described by Brendan Gregg here.
The workflow is as follows:
Run node >0.11.13 with --perf-basic-prof, which creates a /tmp/perf-(PID).map file where JavaScript symbol mappings are written.
Capture stacks using perf record -F 99 -p `pgrep -n node` -g -- sleep 30
Fold stacks using stackcollapse-perf.pl script from this repository
Generate svg flame graph using flamegraph.pl script
I get the following result (which looks really nice at first):
The problem is that there are a lot of [unknown] elements, which I suppose should be my Node.js function calls. I assume the whole process fails somewhere at point 3, where the perf data should be folded using the mappings generated by node/v8 executed with --perf-basic-prof. The /tmp/perf-PID.map file is created, and some mappings are written to it during node execution.
How to solve this problem?
I am using CentOS 6.5 x64, and have already tried this with node 0.11.13 and 0.11.14 (both prebuilt and compiled from source) with no success.
First of all, what "[unknown]" means is that the sampler couldn't figure out the name of the function, because it's a system or library function.
If so, that's OK - you don't care, because you're looking for things responsible for time in your code, not system code.
Actually, I'm suggesting this is one of those XY questions.
Even if you get a direct answer to what you asked, it is likely to be of little use.
Here are the reasons why:
1. CPU Profiling is of little use in an I/O bound program
The two towers on the left in your flame graph are doing I/O, so they probably take a lot more wall-time than the big pile on the right.
If this flame graph were derived from wall-time samples, rather than CPU-time samples, it could look more like the second graph below, which tells you where time actually goes:
What was a big juicy-looking pile on the right has shrunk, so it is nowhere near as significant.
On the other hand, the I/O towers are very wide.
Any one of those wide orange stripes, if it's in your code, represents a chance to save a lot of time, if some of the I/O could be avoided.
2. Whether the program is CPU- or I/O-bound, speedup opportunities can easily hide from flame graphs
Suppose there is some function Foo that really is doing something wasteful, that if you knew about it, you could fix.
Suppose in the flame graph, it is a dark red color.
Suppose it is called from numerous places in the code, so it's not all collected in one spot in the flame graph.
Rather it appears in multiple small places shown here by black outlines:
Notice, if all those rectangles were collected, you could see that it accounts for 11% of time, meaning it is worth looking at.
If you could cut its time in half, you could save 5.5% overall.
If what it's doing could actually be avoided entirely, you could save 11% overall.
Each of those little rectangles would shrink down to nothing, and pull the rest of the graph, to its right, with it.
Now I'll show you the method I use. I take a moderate number of random stack samples and examine each one for routines that might be speeded up.
That corresponds to taking samples in the flame graph like so:
The slender vertical lines represent twenty random-time stack samples.
As you can see, three of them are marked with an X.
Those are the ones that go through Foo.
That's about the right number, because 11% times 20 is 2.2.
(Confused? OK, here's a little probability for you. If you flip a coin 20 times, and it has an 11% chance of coming up heads, how many heads would you get? Technically it's a binomial distribution. The most likely number you would get is 2, and the next most likely numbers are 1 and 3. (If you only get 1, you keep going until you get 2.) Here's the distribution:)
(The average number of samples you have to take to see Foo twice is 2/0.11 = 18.2 samples.)
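If you'd rather check those numbers than take them on faith, a few lines of Python reproduce the distribution for n = 20 samples and p = 0.11:

from math import comb  # Python 3.8+

n, p = 20, 0.11
for k in range(6):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(f"P(exactly {k} of {n} samples hit Foo) = {prob:.3f}")
# The distribution peaks at k = 2 (about 0.28), with k = 1 and k = 3 close behind,
# and on average you need 2 / 0.11 = 18.2 samples to see Foo twice.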
Looking at those 20 samples might seem a bit daunting, because they run between 20 and 50 levels deep.
However, you can basically ignore all the code that isn't yours.
Just examine them for your code.
You'll see precisely how you are spending time,
and you'll have a very rough measurement of how much.
Deep stacks are both bad news and good news -
they mean the code may well have lots of room for speedups, and they show you what those are.
Anything you see that you could speed up, if you see it on more than one sample, will give you a healthy speedup, guaranteed.
The reason you need to see it on more than one sample is, if you only see it on one sample, you only know its time isn't zero. If you see it on more than one sample, you still don't know how much time it takes, but you do know it's not small.
Here are the statistics.
Generally speaking it is a bad idea to disagree with a subject matter expert but (with the greatest respect) here we go!
SO urges the answer to do the following:
"Please be sure to answer the question. Provide details and share your research!"
So the question was, at least my interpretation of it is, why are there [unknown] frames in the perf script output (and how do I turn these [unknown] frames in to meaningful names)?
This question could be about "how to improve the performance of my system?" but I don't see it that way in this particular case. There is a genuine problem here about how the perf record data has been post processed.
The answer to the question is that, although the prerequisite setup is correct (the correct node version, and the correct argument, --perf-basic-prof, present to generate the function names), the generated perf map file must also be owned by root for perf script to produce the expected output.
That's it!
While writing some new scripts today I hit upon this, which directed me to this SO question.
Here are a couple of additional references:
https://yunong.io/2015/11/23/generating-node-js-flame-graphs/
https://github.com/jrudolph/perf-map-agent/blob/d8bb58676d3d15eeaaf3ab3f201067e321c77560/bin/create-java-perf-map.sh#L22
[ non-root files can sometimes be forced ] http://www.spinics.net/lists/linux-perf-users/msg02588.html
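Along the same lines, here is a small, hedged Python sketch for checking that precondition before running perf script; the glob pattern simply matches the /tmp/perf-PID.map files discussed above:

import glob, os, pwd

# Warn about any perf map file not owned by root, since (per the answer above)
# perf script will not use maps owned by other users.
for path in glob.glob("/tmp/perf-*.map"):
    owner = pwd.getpwuid(os.stat(path).st_uid).pw_name
    status = "ok" if owner == "root" else "likely to be ignored by perf script"
    print(f"{path}: owned by {owner} ({status})")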

text-based viewer for profiling results

Do you know of a text-based application for viewing results of application profiling? The profiling results basically contain a list of C++ function call backtraces and how often these backtraces were encountered; now I'm looking for a console tool to analyze the raw data (which backtrace occurred most often; which function was called most often, independent of call trace...).
So far I've created callgrind-compatible files from the raw data and then used the excellent KCachegrind tool for analysis; but now I'm also looking for a tool that works on a text-based terminal. Any ideas?
Take a look at callgrind_annotate.
This command reads in the profile data, and prints a sorted lists of functions, optionally with source annotation.
I wrote such a viewer once. It focussed on a line of code, showing the percent of samples running through that line, and a butterfly view allowing transitions to superior or subordinate lines of code.
It made a nice demo, but did I really use it? Not for long.
(I'm assuming the stack samples have been taken during the interval that you wish to speed up, i.e. not during user-wait.)
The thing is, the program is probably doing something wasteful in that time. (If it is not, you can't speed it up.)
Whatever that wasteful thing is, it consists of some percent of time being spent for poor reasons, like 10%, 50%, 90%, or whatever. During that time, it is on the stack, so an examination of the stack samples will show it.
And, you don't have to look at very many of them. If something is taking 50% of time, 1000 samples will show it on about 500, and 10 samples will show it on about 5. The larger number of samples will estimate the percentage with an extra digit of precision. If your goal is to isolate the problem so you can fix it, you don't need that extra digit.
So, a tool that shows you, by line, the percent of stack samples going through that line is a very nice thing to have, because the wasteful code will appear on it, showing the percentage.
What it does not show you is the reason why the statement is being executed, which is how you know if it's wasteful. Looking at the statement's context on the stack does tell you that.
So even though I had the viewer, I just ended up examining the samples themselves, and only about 10 or 20 of them. The bigger the percentage is, the smaller the number of samples I need to look at before I find it. Here's an example.
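For the raw data described in the question, you may not need a dedicated viewer at all. Here is a hedged Python sketch that answers "which backtrace occurred most often" and "which function appears in the most samples"; it assumes one backtrace per input line in the form "func1;func2;func3 count", which resembles the folded format produced by stackcollapse-perf.pl mentioned earlier, so adjust the parsing to your actual format:

import sys
from collections import Counter

# Assumed input: one line per distinct backtrace, "func1;func2;func3 count" (leaf last).
backtrace_counts = Counter()  # full backtrace -> occurrences
inclusive = Counter()         # function -> samples in which it appears anywhere on the stack
total = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    stack, _, count_str = line.rpartition(" ")
    count = int(count_str)
    total += count
    backtrace_counts[stack] += count
    for func in set(stack.split(";")):  # set() so recursion isn't double-counted
        inclusive[func] += count

print("Most frequent backtraces:")
for stack, count in backtrace_counts.most_common(5):
    print(f"  {100 * count / total:5.1f}%  {stack}")

print("Functions present in the most samples (inclusive):")
for func, count in inclusive.most_common(10):
    print(f"  {100 * count / total:5.1f}%  {func}")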

Framebuffer Documentation

Is there any documentation on how to write software that uses the framebuffer device in Linux? I've seen a couple of simple examples that basically say: "open it, mmap it, write pixels to the mapped area." But there's no comprehensive documentation on how to use the different IOCTLs for it, or anything. I've seen references to "panning" and other capabilities, but "googling it" gives way too many hits of useless information.
Edit:
Is the code the only documentation from a programming standpoint, as opposed to "user's howto on configuring your system to use the fb" documentation?
You could have a look at the source code of fbi, an image viewer which uses the Linux framebuffer. You can get it here: http://linux.bytesex.org/fbida/
-- It appears there might not be many options for programming the fb from user space on a desktop beyond what you mentioned. This might be one reason why some of the docs are so old. Look at this howto for device driver writers, which is referenced from some official Linux docs: www.linux-fbdev.org [slash] HOWTO [slash] index.html. It does not reference many interfaces, although looking at the Linux source tree does offer larger code examples.
-- opentom.org [slash] Hardware_Framebuffer is not for a desktop environment. It reinforces the main methodology, but it does seem to avoid explaining all the ingredients necessary to do the "fast" double-buffer switching it mentions. Another one, for a different device, which also leaves some key buffering details out, is wiki.gp2x.org [slash] wiki [slash] Writing_to_the_framebuffer_device, although it does at least suggest you might be able to use fb1 and fb0 to engage double buffering (on that device; for a desktop, fb1 may not be possible or it may access different hardware), that using the volatile keyword might be appropriate, and that we should pay attention to the vsync.
-- asm.sourceforge.net [slash] articles [slash] fb.html has assembly language routines that also appear (?) to just do the basics of querying, opening, setting a few basics, mmapping, drawing pixel values to storage, and copying over to the fb memory (making sure to use a short stosb loop, I suppose, rather than some longer approach).
-- Beware of 16 bpp comments when googling the Linux frame buffer: I used fbgrab and fb2png during an X session to no avail. These each rendered an image that suggested a snapshot of my desktop screen as if the picture of the desktop had been taken using a very bad camera, underwater, and then overexposed in a dark room. The image was completely broken in color and size, and missing much detail (dotted all over with pixel colors that didn't belong). It seems that /proc and /sys on the computer I used (a new kernel with at most minor modifications, from a PCLOS derivative) claim that fb0 uses 16 bpp, and most things I googled stated something along those lines, but experiments led me to a very different conclusion. Besides the results of these two failures from standard frame buffer grab utilities (for the versions held by this distro) that may have assumed 16 bits, I had a successful test result treating the frame buffer pixel data as 32 bits. I created a file from data pulled in via cat /dev/fb0. The file's size ended up being 1920000. I then wrote a small C program to try and manipulate that data (under the assumption it was pixel data in some encoding or other). I nailed it eventually, and the pixel format matched exactly what I had gotten from X when queried (TrueColor RGB 8 bits, no alpha but padded to 32 bits). Notice another clue: my screen resolution of 800x600 times 4 bytes gives 1920000 exactly. The 16-bit approaches I tried initially all produced an image as broken as fbgrab's, so it's not as though I was simply looking at the wrong data. [Let me know if you want the code I used to test the data. Basically I just read in the entire fb0 dump and then spat it back out to a file, after adding the header "P6\n800 600\n255\n" that creates a suitable ppm file, and while looping over all the pixels manipulated their order or expanded them, with the end result being to drop every 4th byte and switch the first and third in every 4-byte unit. In short, I turned the apparent BGRA fb0 dump into a ppm RGB file; ppm can be viewed with many picture viewers on Linux. A rough sketch of that conversion follows after this answer.]
-- You may want to reconsider your reasons for wanting to program using fb0. You may not achieve any worthwhile performance gains over X (this was my, if limited, experience) while giving up the benefits of using X. This might also account for why few code examples exist.
-- Note that DirectFB is not fb. DirectFB has of late gotten more love than the older fb, as it is more focused on the sexier 3d hw accel. If you want to render to a desktop screen as fast as possible without leveraging 3d hardware accel (or even 2d hw accel), then fb might be fine but won't give you anything much that X doesn't give you. X apparently uses fb, and the overhead is likely negligible compared to other costs your program will likely have (don't call X in any tight loop, but instead at the end once you have set up all the pixels for the frame). On the other hand, it can be neat to play around with fb as covered in this comment: Paint Pixels to Screen via Linux FrameBuffer
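As a footnote to the fb0-dump experiment described above, here is a hedged Python sketch of the same BGRA-to-PPM conversion; the 800x600 resolution, the 32-bit BGRA layout and the fb.raw file name are taken from (or assumed for) that answer, and your framebuffer may well differ:

# Convert a raw dump of /dev/fb0 (e.g. from `cat /dev/fb0 > fb.raw`) into a PPM image,
# assuming 800x600 pixels stored as 32-bit BGRA, as observed in the answer above.
WIDTH, HEIGHT = 800, 600

with open("fb.raw", "rb") as f:
    raw = f.read()

pixels = bytearray()
for i in range(0, WIDTH * HEIGHT * 4, 4):
    b, g, r = raw[i], raw[i + 1], raw[i + 2]  # raw[i + 3] is the unused padding/alpha byte
    pixels += bytes((r, g, b))                # PPM wants RGB order

with open("screen.ppm", "wb") as f:
    f.write(b"P6\n%d %d\n255\n" % (WIDTH, HEIGHT))
    f.write(pixels)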
Check the MPlayer sources.
Under the /libvo directory there are a lot of video output plugins used by MPlayer to display multimedia. There you can find the fbdev plugin (the vo_fbdev* sources), which uses the Linux frame buffer.
There are a lot of ioctl calls, with the following codes:
FBIOGET_VSCREENINFO
FBIOPUT_VSCREENINFO
FBIOGET_FSCREENINFO
FBIOGETCMAP
FBIOPUTCMAP
FBIOPAN_DISPLAY
It's not exactly good documentation, but it is surely a good example implementation.
Look at the source code of any of: fbxat, fbida, fbterm, fbtv, the directFB library, libxineliboutput-fbe, ppmtofb, xserver-fbdev. All of these are Debian-packaged apps; just apt-get source them from the Debian repositories. There are many others...
Hint: search for "framebuffer" in package descriptions using your favorite package manager.
OK, even if reading the code is sometimes called "guru documentation", it can be a bit much to actually do it.
The source to any splash screen (i.e. during booting) should give you a good start.

How do you visualize logfiles in realtime?

Sometimes it might be useful, but mostly it just looks cool or impressive, to visualize log files (anything from HTTP requests to bandwidth usage to cups of coffee drunk per day).
I know about Visitorville, which I think looks a bit silly, and then there's gltail.
How do you "visualize" your log files in realtime?
There is also the logstalgia tool. Visualizes Apache logs. See http://code.google.com/p/logstalgia/ for more details and a youtube video.
You may take a look at Apache Chainsaw. This nifty tool accepts log input from nearly everywhere and has live filtering and coloring. If you have an already-written log, I'm not sure whether it can read it; it's been a while since I last used it (it was very useful during the prototyping phase of our JBoss server).
Google has released the Visualization API that is probably flexible enough to help you:
The Google Visualization API lets you access multiple sources of structured data that you can display, choosing from a large selection of visualizations. The Google Visualization API also provides a platform that can be used to create, share and reuse visualizations written by the developer community at large.
It requires some JavaScript knowledge and includes Google Docs and Spreadsheets integration. Check out the Gallery for some examples.
You could take a look at this: http://www.intalisys.com. It's a 3D realtime visualization app.
We use Awk and Perl scripts to parse the log files and create summary reports and "databases" (technically databases in that each row corresponds to a unique event with many columns of data about that event, but not stored in a traditional database format; we're moving in that direction). I like Awk because you can very quickly search for specific strings in the log files using regex, keep counters and gather data from the log file entries, and do all kinds of calculations with that data. Then use your favorite plotting software. We use Excel, mainly because that's what was here before I started this job. I prefer MATLAB and its open-source cousin, Octave, which is built on gnuplot.
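The same count-and-summarize idea fits in a few lines of Python as well; in this hedged sketch the access.log path and the regexes for an Apache-style timestamp and status code are assumptions you would adjust to your own log format:

import re
from collections import Counter

# Count HTTP status codes per hour from an Apache-style access log and print a
# plain-text summary that can be fed to any plotting tool.
time_re = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):')  # capture the hour from the timestamp
status_re = re.compile(r'" (\d{3}) ')                   # status code right after the request string

per_hour = Counter()
with open("access.log") as log:  # "access.log" is a placeholder path
    for line in log:
        t = time_re.search(line)
        s = status_re.search(line)
        if t and s:
            per_hour[(t.group(1), s.group(1))] += 1

for (hour, status), count in sorted(per_hour.items()):
    print(f"{hour}:00  {status}  {count}")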
I prefer Sawmill for visualizing data. You can basically throw any log file at it, and it will not only autodetect its structure, but will also decide on how to analyze it. Even if you have a custom log file, you can still define what shall be analyzed and how it shall be visualized.
I mainly use R to visualize data, but I've heard of Orange, too.
Not sure if it fits the question, but I just released this:
numStepCsvLogVis - analyze logfile data in CSV format
It uses Python's matplotlib and is motivated by the need to visualize syslog data in the context of debugging kernel circular-buffer operation (and variables) in C; it visualizes by using the CSV file format as an intermediary to the logfile data (I cannot explain it better in brief; take a look at the README for more detail).
It has a "step" player accessed in terminal, and can handle "live" stdin input, but unfortunately, I cannot get a better response that 1 FPS when plot renders, so I wouldn't really call it "realtime" per se - but you can use it to eventually generate sonified videos of plot animations.
A simple solution is to use Logstalgia alongside the lightweight local-web-server.
First install the above. Then, from the root folder of your site visualise your logs in realtime with:
$ ws --log-format default | logstalgia -
Use SciTE, Notepad++, or another powerful text editor which has file-processing routines, so you can create a script that colorizes parts of the log or just deletes some non-important lines from it.
