Lightweight Asynchronous Sampling Profiler with JProfiler (AsyncGetCallTrace) - jvm-hotspot

I recently read a blog entry by Jeremy Manson (Google) about a more accurate and lightweight asynchronous sampling profiler. It relies on the undocumented "AsyncGetCallTrace" method in HotSpot JVMs to gather the stack trace of a thread.
http://jeremymanson.blogspot.fr/2013/07/lightweight-asynchronous-sampling.html
My question to the JProfiler community is: can JProfiler in its current 7.2.3 version use AsyncGetCallTrace? Is this feature in the works for, say, JProfiler 8.0?

The tools interface of the JVM (JVMTI) that is used by profilers has a large test harness that ensures its compatibility and stability for each release. AsyncGetCallTrace is not part of that specification. The overhead of GetStackTrace is so low that it is not advisable for a general-purpose profiler to sacrifice the benefits of a supported API for the perceived gains of an unsupported method.
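For reference, here is a minimal sketch of what the supported path looks like: a JVMTI agent sampling one thread with GetStackTrace. The agent bootstrap, capability requests and thread selection are omitted, and the recording step is only a placeholder comment.

    #include <jvmti.h>

    #define MAX_FRAMES 64

    /* Sample one thread with the supported JVMTI GetStackTrace call
     * (AsyncGetCallTrace, by contrast, is invoked from a signal handler
     * on the sampled thread itself). */
    static void sample_thread(jvmtiEnv *jvmti, jthread thread) {
        jvmtiFrameInfo frames[MAX_FRAMES];
        jint count = 0;
        jint i;

        if ((*jvmti)->GetStackTrace(jvmti, thread, 0, MAX_FRAMES,
                                    frames, &count) != JVMTI_ERROR_NONE)
            return;

        for (i = 0; i < count; i++) {
            char *name = NULL, *sig = NULL;
            if ((*jvmti)->GetMethodName(jvmti, frames[i].method,
                                        &name, &sig, NULL) == JVMTI_ERROR_NONE) {
                /* record name/sig in the profiler's call tree here */
                (*jvmti)->Deallocate(jvmti, (unsigned char *) name);
                (*jvmti)->Deallocate(jvmti, (unsigned char *) sig);
            }
        }
    }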

Related

What is "CrankShaftScript" in Node.js?

There are more and more references in the Node.js community to "CrankShaftScript" (and "CrankShaftJS") on Twitter, GitHub, and Facebook group discussions. I thought Node.js was written in C++ and JavaScript, so what is CrankShaftScript referring to in performance regression bugs like this one:
https://github.com/nodejs/CTC/issues/146#issue-237435588
CrankShaftScript is the name given by the community to JS idioms (such as certain kinds of loops) that run fastest on V8's CrankShaft engine.
CrankShaft is being replaced by an engine named TurboFan. Over the years, a lot of JS code has been written specifically to run fast on CrankShaft (i.e. written in "CrankShaftScript"), using the idioms known to be fast on that engine. That is no longer necessarily the case: the V8 engine is now different, and code that ran fastest on CrankShaft is not guaranteed to run fastest on TurboFan.
In case my answer is too verbose, here's a great comment on the NodeJS Benchmarks thread that may explain it better:
...I noticed that some parts of Node core are sort-of written in
CrankshaftScript, i.e. carefully tuned towards stuff that works
extremely well in Crankshaft.
CrankShaftScript is a community-adopted term for non-idiomatic and/or non-standards-compliant JavaScript that will only execute and/or perform well on the specific versions of the v8 JavaScript runtime that employ the CrankShaft JIT compiler. Specific examples include loops written in a difficult-to-maintain fashion to work around JIT optimization deficiencies in v8, and the use of v8-specific built-in functions/globals.
This term was originally coined to describe some root performance issues in node-chakracore and spidernode, which are Node.js distributions that employ the ChakraCore and SpiderMonkey runtimes instead of v8.
It is now being used as shorthand to explain why the Node.js 8.1 release series, which updated to a newer version of v8, has several performance regressions in micro- and macro-benchmarks due to v8's CrankShaft JIT being superseded by TurboFan (sometimes referred to as "TF"). As in these issues:
https://github.com/nodejs/node/issues/11851#issuecomment-287253082
https://github.com/nodejs/CTC/issues/146#issuecomment-310229393
https://twitter.com/matteocollina/status/870580613266501632
For these reasons, the Node.js community is actively working on excising instances of CrankShaftScript from Node.js core code, as well as from common npm packages. This should help alternative Node.js distributions like node-chakracore perform better and reduce the risk of future upgrades to the JavaScript runtime in Node.js.
CrankShaft is the compilation infrastructure of V8, Node.js's JavaScript runtime (details).
It's now being replaced by TurboFan.

Does Swift have any native concurrency and multi-threading support?

I'm writing a Swift client to communicate with a server (written in C) on an embedded system. It's not iOS/OSX related, as I'm using the recently released Ubuntu version.
Does Swift have any native support for concurrency? I'm aware that Apple discourages developers from using threads and encourages handing tasks over to dispatch queues via GCD. The issue is that GCD seems to be only on Darwin (and NSThread is a part of Cocoa).
For example, C++11 and Java have threads and concurrency as part of their standard libraries. I understand that platform-specific facilities like POSIX threads on Unix could be used under some sort of C wrapper, but for me that defeats the point of using Swift in the first place (clean, easy-to-understand code, etc.).
2021 came and...
Starting with Swift 5.5, more options are available like async/await programming models and actors.
There is still no direct manipulation of threads, and this is (as of today) a design choice.
If you’ve written concurrent code before, you might be used to working with threads. The concurrency model in Swift is built on top of threads, but you don’t interact with them directly. An asynchronous function in Swift can give up the thread that it’s running on, which lets another asynchronous function run on that thread while the first function is blocked.
Original 2015 answer
Quoting from Swift's GitHub, there's a readme for "evolutions":
Concurrency: Swift 3.0 relies entirely on platform concurrency primitives (libdispatch, Foundation, pthreads, etc.) for concurrency. Language support for concurrency is an often-requested and potentially high-value feature, but is too large to be in scope for Swift 3.0.
I guess this means no language-level "primitives" for threading are in the pipeline for the foreseeable future.

Monitoring Code/Method-level Statistics using AppDynamics

I am now working on performance testing of a Java application that runs on GlassFish Server 4.1.
After going through some statistics that I got from the AppDynamics tool, I find that there is no way for me to drill down to code/method-level issues. For example, I can see the time taken by each method or function using dotTrace or JProfiler, but AppDynamics seems to lack all these features.
I was also looking for a free solution, which is why I chose AppDynamics. Now I feel I am not on the right track. Can someone tell me whether I am missing something about this tool, or suggest another quick and easy solution?
Is there a possibility that the monitoring facilities of GlassFish Server 4.1 can do the same at no cost?
Generally, monitoring tools cannot record method-level data continuously, because they have to operate at a much lower level of overhead compared to profiling tools. They focus on "business transactions" that show you high-level performance measurements with associated semantic information, such as the processing of an order in your web shop.
Method level data only comes in when these business transactions are too slow. The monitoring tool will then start sampling the executing thread and show you a call tree or hot spots. However, you will not get this information for the entire VM for a continuous interval like you're used to from a profiler.
You mentioned JProfiler, so if you are already familiar with that tool, you might be interested in perfino as a monitoring solution. It shows you samples on the method level and has cross-over functionality into profiling with the native JVMTI interface. It allows you to do full sampling of the entire JVM for a selected amount of time and look at the results in the JProfiler GUI.
Disclaimer: My company develops JProfiler and perfino.

Fully utilizing HW accelerator

I would like to use OpenSSL for handling all our SSL communication (both client and server sides). We would like to use HW acceleration card for offloading the heavy cryptographic calculations.
We noticed that in the OpenSSL 'speed' test there are direct calls to the cryptographic functions (e.g. RSA_sign/decrypt, etc.). In order to fully utilize the HW capacity, multiple threads were needed (up to 128 threads) to keep the card loaded with requests and make sure it is never idle.
We would like to use the high-level OpenSSL API for handling SSL connections (e.g. SSL_connect/read/write/accept), but this API doesn't expose the point where the actual cryptographic operation is done. For example, when calling SSL_connect, we are not aware of the point where the RSA operations are done, we don't know in advance which calls will lead to heavy cryptographic calculations, and so we cannot offload only those to the accelerator.
Questions:
How can I use the high level API while still fully utilizing the HW accelerator? Should I use multiple threads?
Is there a 'standard' way of doing this? (implementation example)
(Answered in UPDATE) Are you familiar with Intel's asynchronous OpenSSL? It seems that they were trying to solve this exact issue, but we cannot find the actual code or usage examples.
UPDATE
From Accelerating OpenSSL* Using Intel® QuickAssist Technology you can see that Intel also mentions utilizing multiple threads/processes:
The standard release of OpenSSL is serial in nature, meaning it
handles one connection within one context. From the point of view of
cryptographic operations, the release is based on a synchronous/
blocking programming model. A major limitation is throughput can be
scaled higher only by adding more threads (i.e., processes) to take
advantage of core parallelization, but this will also increase context
management overhead.
Intel's OpenSSL branch can finally be found here.
More info can be found in the PDF contained here.
It looks like Intel changed the way the OpenSSL ENGINE works: it posts work to the driver and returns immediately, and the corresponding result is then polled for.
If you use another SSL accelerator, the corresponding OpenSSL ENGINE has to be modified in the same way.
According to Interpreting openssl speed output for rsa with multi option, -multi doesn't "parallelize" the work; it just runs multiple benchmarks in parallel.
So, your HW card's load will be essentially limited by how much work is available at the moment (note that in industry in general, 80% planned capacity load is traditionally considered optimal in case of load spikes). Of course, running multiple server threads/processes will give you the same effect as multiple benchmarks.
OpenSSL supports multiple threads provided that you give it callbacks to lock shared data. For multiple processes, it warns about reusing data state inherited from parent.
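For what that thread-safety setup looks like in practice, here is a minimal sketch using pthreads. This is the CRYPTO_* locking-callback API of OpenSSL 1.0.x (1.1.0 and later handle locking internally, so it is only needed on older versions); error handling is trimmed.

    #include <pthread.h>
    #include <openssl/crypto.h>

    static pthread_mutex_t *locks;

    /* OpenSSL calls this to lock/unlock one of its internal locks. */
    static void locking_cb(int mode, int n, const char *file, int line) {
        if (mode & CRYPTO_LOCK)
            pthread_mutex_lock(&locks[n]);
        else
            pthread_mutex_unlock(&locks[n]);
    }

    /* Thread id callback; casting pthread_self() is the usual idiom on
     * platforms where pthread_t is an integral type. */
    static unsigned long id_cb(void) {
        return (unsigned long) pthread_self();
    }

    int init_openssl_locking(void) {
        int i;
        locks = OPENSSL_malloc(CRYPTO_num_locks() * sizeof(pthread_mutex_t));
        if (locks == NULL)
            return 0;
        for (i = 0; i < CRYPTO_num_locks(); i++)
            pthread_mutex_init(&locks[i], NULL);
        CRYPTO_set_id_callback(id_cb);
        CRYPTO_set_locking_callback(locking_cb);
        return 1;
    }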
That's it for scaling vertically. For scaling horizontally:
openssl supports asynchronous I/O through asynchronous BIOs
but, its elemental crypto operations and internal ENGINE calls are synchronous, and changing this would require a logic overhaul
private efforts to make them provide asynchronous operation have met severe criticism due to major design flaws
Intel announced some "Asynchronous OpenSSL" project (08.2014) to use with its hardware, but the linked white paper gives little details about its implementation and development state. One developer published some related code (10.2015), noting that it's "stable enough to get an overview".
As jww has mentioned in the comments, you should use the ENGINE API to accomplish the task. There is an example in the above link on how to use that API. Usually, the hardware accelerator provider implements a library called an "ENGINE"; this engine provides cryptographic acceleration and can be used by OpenSSL internally. Assuming that the accelerator you want to use has an ENGINE implemented (for example "cswift"), you get the engine by calling ENGINE *e = ENGINE_by_id("cswift");, then initialize it with ENGINE_init(e); and set it as the default for the operations you want to use, for example ENGINE_set_default_RSA(e);.
After calling these functions, you can use the high-level API of OpenSSL (e.g. SSL_connect/read/write/accept).
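Put together, a minimal sketch of that sequence looks roughly like this ("cswift" is just the example engine id from above; substitute the id shipped with your accelerator's driver, and add proper error reporting):

    #include <openssl/engine.h>

    int use_hw_engine(void) {
        ENGINE *e;

        ENGINE_load_builtin_engines();

        e = ENGINE_by_id("cswift");      /* structural reference */
        if (e == NULL)
            return 0;                    /* engine not available */

        if (!ENGINE_init(e)) {           /* functional reference */
            ENGINE_free(e);
            return 0;
        }

        /* Make the engine the default for RSA; ENGINE_set_default(e,
         * ENGINE_METHOD_ALL) would register it for everything it supports.
         * The high-level SSL_connect/read/write/accept calls then use it
         * for the offloaded operations automatically. */
        if (!ENGINE_set_default_RSA(e)) {
            ENGINE_finish(e);
            ENGINE_free(e);
            return 0;
        }

        /* The defaults now hold their own references, so release ours. */
        ENGINE_finish(e);
        ENGINE_free(e);
        return 1;
    }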

Garbage Collector in Real-Time System

I'm new to C#/Java and plan to prototype a soft real-time system with them.
If I wrote a C#/Java app just like I would in C++ in terms of memory management, that is, I explicitly "delete" the objects that I no longer use, would the app still be affected by the garbage collector? If so, how would it affect my app?
Sorry if this has an obvious answer, but being new, I want to be thorough.
Take a look at IBM's Metronome, their garbage collector for hard real-time systems.
Your premise is wrong: you cannot explicitly “delete” objects in either Java or C#, so your application will always be affected by the GC.
You may try to trigger a collection by calling GC.Collect (C#) with an appropriate parameter (e.g. GC.MaxGeneration) but this still doesn’t guarantee that the GC won’t be working at other moments during execution.
If by explicitly "delete" you mean releasing the reference to the object, then you are still reliant on the garbage collector in managed C# code - see the System.GC class for ways of controlling it.
If you choose to write unmanaged C# code then you will have more control over memory, akin to C++, and will be responsible for deleting your instantiated objects, able to use pointers, etc. For more info see MSDN doc - Unsafe Code and Pointers (C# Programming Guide).
In unmanaged code you will not be at the mercy of the garbage collector and its indeterminate cleanup algorithms.
I don't know whether Java has an equivalent unmanaged mode, but this Microsoft info might help provide some direction on using the available C#/.NET features to deal with the garbage collector.
In C# or Java you can't delete objects; you can only make them eligible for deletion. Freeing the memory is done by the garbage collector. The garbage collector might not run at all during the lifetime of your application, but it is likely to. The most likely time for the runtime to run GC routines is when the system is becoming short of resources, and when resources are low the GC becomes the highest-priority thread, so your application does get affected. You can minimize the effect by calculating the correct load and the resources required over your application's lifetime, and by making sure you buy hardware that is good enough for that. But you still can't simply benchmark your performance.
Besides the GC, a managed application does incur a slight overhead compared to a traditional C++ application, due to the extra delegation layer involved, and a slight first-time performance penalty, since the runtime needs to be up and running before your application gets started.
Here are some references for developing real-time systems with the .net compact framework:
IEEE - C# and the .NET Framework: Ready for Real Time?
MSDN - Real-Time Behavior of the .NET Compact Framework
They both talk about the memory requirements of using the .net framework.
C# and Java are not intended for real-time development. Soft real-time is attainable, however, as you note.
For C#, the best you can do is implement the finalize/dispose pattern:
http://msdn.microsoft.com/en-us/library/b1yfkh5e(VS.71).aspx
You can request a collection, but typically the runtime is much better at determining how to do this.
http://msdn.microsoft.com/en-us/library/system.gc(VS.71).aspx
For Java, there are many options to optimize it:
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html
Along with third party solutions like IBM Metronome as noted above.
This is a real science within CS itself.
