Based on multiple answers to other questions, calling native C++ from Javascript is expensive.
I checked myself with the node module "benchmark" and came to the same conclusion.
A simple JS function can reach ~90 000 000 calls/sec when called directly; when calling a C++ function I get a maximum of about 25 000 000 calls/sec. That in itself is not that bad.
But when object creation is added, the JS version still manages about 70 000 000 calls/sec, while the native version suffers dramatically and drops to about 2 000 000.
I assume this has to do with the dynamic nature of how the V8 engine works, and with the fact that it compiles the JS code to bytecode.
But what keeps them from implementing the same optimizations for the C++ calls? (Or at least optimizing the call itself, or giving some insight into what would help there.)
(V8 developer here.) Without seeing the code that you ran, it's hard to be entirely sure what effect you were observing, and based on your descriptions I can't reproduce it. Microbenchmarks in particular tend to be tricky, and the relative speedups or slowdowns they appear to be measuring are often misleading, unless you've verified that what happens under the hood is exactly what you expect to be happening. For example, it could be the case that the optimizing compiler was able to eliminate the entire workload because it could statically prove that the result isn't used anywhere. Or it could be the case that no calls were happening at all, because the compiler chose to inline the callee.
Generally speaking, it is crossing the JS/C++ boundary that carries a certain cost, due to different calling conventions and some other checks and preparations that need to be done, like checking for exceptions that may have been thrown. Both one JavaScript function calling another, and one C++ function calling another, will be faster than JavaScript calling into C++ or the other way round.
This boundary crossing cost is unrelated to the level of compiler optimization on either side. It's also unrelated to byte code. ("Hot", i.e. frequently executed, JavaScript functions are compiled to machine code anyway.)
Lastly, V8 is not a C++ compiler. It's simply not built to do any optimizations for C++ code. And even if it tried to, there's no reason to assume it could do a better job than your existing C++ compiler with -O3. (V8 also doesn't even see the source code of your C++ module, so before you could experiment with recompiling that, you'd have to figure out how to provide that source.)
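To make the boundary crossing concrete, here is a minimal sketch of a native binding written with the node-addon-api wrapper (the napi.h header comes from that package; "add" is just a placeholder workload, not anything from the question). Every call to add from JavaScript pays the crossing cost described above, no matter how trivial the C++ body is:

```cpp
#include <napi.h>

// Trivial native function: each call from JS crosses the JS/C++ boundary,
// unwraps the arguments, and wraps the result back into a JS value.
Napi::Value Add(const Napi::CallbackInfo& info) {
  Napi::Env env = info.Env();
  double a = info[0].As<Napi::Number>().DoubleValue();
  double b = info[1].As<Napi::Number>().DoubleValue();
  return Napi::Number::New(env, a + b);
}

Napi::Object Init(Napi::Env env, Napi::Object exports) {
  exports.Set("add", Napi::Function::New(env, Add));
  return exports;
}

NODE_API_MODULE(addon, Init)
```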
Without delving into specific V8 versions and their intrinsic reasons, I can say that the overhead is not in the way the C++ backend works vs. the JavaScript one; it lies instead in the pathway between the languages - that is, the binary interface which implements the invocation of a native method from JavaScript land, and vice versa.
The operations involved in a cross-invocation, to my understanding, are:
Prepare the arguments.
Save the JS context.
Invoke the gate code (which implements the bridge).
The bridge translates the arguments into C++-style parameters.
The bridge also translates the calling convention to match C++.
It invokes a C++ runtime API wrapper in V8.
This wrapper calls the actual method to perform the operation.
The same steps are performed in reverse when the C++ function returns.
There may be additional steps involved here, but I guess this in itself suffices to explain why the overhead surfaces; a rough sketch of such a bridge is shown below.
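As an illustration only (Multiply is a made-up native routine, and the handle APIs shown can differ slightly between V8 versions), a hand-written bridge using the raw V8 embedder API looks roughly like this:

```cpp
#include <node.h>  // brings in the v8.h embedder API

// The "actual method" that does the work.
double Multiply(double a, double b) { return a * b; }

// The bridge: translates JS arguments into C++ values, calls the C++
// function, and translates the result back into a JS value.
void MultiplyCallback(const v8::FunctionCallbackInfo<v8::Value>& args) {
  v8::Isolate* isolate = args.GetIsolate();
  // Argument translation from JS handles to C++ parameters.
  double a = args[0].As<v8::Number>()->Value();
  double b = args[1].As<v8::Number>()->Value();
  // Call into plain C++.
  double result = Multiply(a, b);
  // Reverse translation on the way back out.
  args.GetReturnValue().Set(v8::Number::New(isolate, result));
}
```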
Now, coming to JS optimizations: the JIT compiler that comes with the V8 engine has two parts to it: the first just converts the script into machine code, and the second optimizes the code based on collected profile information. This is a purely dynamic process and a great, unique opportunity that a C++ compiler, working in the static compilation space, cannot match. For example, the knowledge that an object is created and destroyed in a block of JS code without escaping the block lets the JIT avoid the heap allocation altogether (escape analysis), whereas the object will always be allocated on the JS heap when the native version is invoked.
Thanks for bringing this up, it is an interesting conversation!
Related
The closest option I have found is pyo3, but it isn't clear to me whether it adds any extra overhead compared to traditional C extensions.
From here it seems like such C-extension behavior is possible through borrowed objects (I still have to understand this concept in detail).
Part of my question comes from the fact that the build process (section "Python with Rust" here) is entirely managed by cargo, which references both cpython and pyo3.
For an example of an approach that adds some overhead, but is not Rust-based, see this comparison.
A related question is about portability, since it seems there is a kind of overhead-portability tradeoff.
For those who prefer to know the concrete case, it is about small hash-like operations that are used millions of times in unpredictable order. So neither a pure Python nor a batched native approach is going to help here. Additionally, a first attempt using a C extension already shows gains compared to pure Python. Now, I'd like to implement it in Rust before writing the remaining functions.
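For reference, the "traditional C extension" I am comparing against is roughly of this shape (a minimal sketch; fast_hash and the FNV-style loop are just stand-ins for the real operation, not the actual code):

```cpp
#define PY_SSIZE_T_CLEAN
#include <Python.h>

// Stand-in for a small hash-like operation (FNV-1a over a byte string).
static PyObject* fast_hash(PyObject* self, PyObject* args) {
  const char* data;
  Py_ssize_t len;
  if (!PyArg_ParseTuple(args, "s#", &data, &len)) return NULL;
  unsigned long long h = 1469598103934665603ULL;
  for (Py_ssize_t i = 0; i < len; i++) {
    h ^= (unsigned char)data[i];
    h *= 1099511628211ULL;
  }
  return PyLong_FromUnsignedLongLong(h);
}

static PyMethodDef methods[] = {
  {"fast_hash", fast_hash, METH_VARARGS, "Hash a string."},
  {NULL, NULL, 0, NULL}
};

static struct PyModuleDef moduledef = {
  PyModuleDef_HEAD_INIT, "fasthash", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_fasthash(void) {
  return PyModule_Create(&moduledef);
}
```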
Is there a way to write a Node.js builtin with the CodeStubAssembler that calls a dynamically linked C++ library, so that I can call it from JavaScript? I don't want to use addons, since they introduce an extra compilation step that I want to avoid. The reason I want to use the CSA is that it is invoked at runtime, and I only need the information during Node.js runtime and want to eliminate overhead.
It is possible to write builtins using the CodeStubAssembler, yes, and these builtins can be called from JavaScript. (That's what the CSA is for.) However, the CSA is not exposed on the V8 API, so you would have to do this in V8 itself. In other words, you'd have to modify V8. I do not recommend this (because it makes updating difficult, and means you need to build and deploy your own custom Node binaries), but it is possible.
CSA builtins are quite limited in what kinds of C++ functions they can call. There are two mechanisms: regular calls via the CEntryStub to V8's own "runtime functions" and C++ "builtins", and "fast C calls" to an "external reference", which have less call overhead but don't support doing heap allocations or throwing exceptions on the C++ side. Either way, the call target has to be known at compile time. So you'd need a V8-side C++ function that's the call target, and which then calls through to whatever external library function you want. On the plus side, this intermediate function could translate parameters and results between the types that V8 uses internally and the types that your external library understands/produces.
I suppose in theory you could use CSA as a general-purpose assembler and use it to generate machine code that knows how to load a dynamically linked library and call functions in it. That'd be entirely new functionality though, so you'd have to do a bunch of work to accomplish that.
You can also use the public V8 API to create JavaScript-callable functions that are backed by arbitrary C++ implementations (such as external libraries). I assume that would be the best fit for your purposes. There would be no CSA involved, and no dynamic compilations either, and using NAPI it would even be pretty robust regarding Node version updates. I recommend that you explore this approach.
I'm struggling a bit with the code base of nodejs + v8.
The goal is to get the bytecode of a function / module (looking at the code, they are the same) and disassemble it using the BytecodeArray::Disassemble function, ideally without side effects, i.e., without executing the code.
The problem is that it's not clear how to get the bytecode in the first place.
(V8 developer here.) V8's API does not provide access to functions' bytecode. That's intentional, because bytecode is an internal implementation detail. For inspecting bytecode, the --print-bytecode flag is the way to go.
If you insist on mucking with internal details, then of course you can circumvent the public API and poke at V8's internals. From a v8::internal::JSFunction you can get to the v8::internal::SharedFunctionInfo, check whether it HasBytecodeArray(), and if so, call GetBytecodeArray() on it. Disassembling bytecode never has side effects, and never executes the bytecode. It's entirely possible that a function doesn't have bytecode at a given moment in time -- bytecode is created lazily when it's needed, and thrown away if it hasn't been used in a while. If you dig far enough, you can interfere with those mechanisms too, but...:
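For the sake of illustration, a sketch of what that poking looks like follows; the exact internal headers and method signatures differ between V8 versions, so treat it as pseudocode mirroring the steps just described rather than something guaranteed to compile:

```cpp
#include <iostream>
// Internal V8 headers would be needed here; their names vary by version,
// e.g. "src/objects/js-function-inl.h". This is unsupported territory.

namespace i = v8::internal;

void DumpBytecode(i::Isolate* isolate, i::Handle<i::JSFunction> function) {
  i::SharedFunctionInfo sfi = function->shared();
  if (sfi.HasBytecodeArray()) {
    // Disassembling only reads the BytecodeArray; it never executes it.
    sfi.GetBytecodeArray(isolate).Disassemble(std::cout);
  }
  // Otherwise the bytecode either hasn't been created yet (it is created
  // lazily) or has already been flushed.
}
```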
Needless to say, accessing internal details is totally unsupported, not recommended, and even if you get it to work in Node version x.y, it may break in x.(y+1), because that's what "internal details" means.
While this may sound like a theoretical question, suppose I decide to invest in and build a mission-critical application written in Haskell. A year later I find that I absolutely need to improve the performance of some very thin bottleneck, and this will require optimizing memory access close to raw machine capabilities.
Some assumptions:
It isn't a realtime system - occasional latency spikes are tolerable (from interrupts, thread-scheduling irregularities, occasional GC, etc.)
It isn't a numeric problem - data layout and cache-friendly access patterns are most important (avoiding pointer chasing, reducing conditional jumps etc.)
Code may be tied to specific GHC release (but no forking)
The performance goal requires in-place modification of pre-allocated off-heap arrays, taking alignment into account (C strings, bit-packed fields, etc.)
Data is statically bounded in arrays and allocations are rarely if ever needed
What mechanisms does GHC offer to perform this kind of optimization reliably? By "reliably" I mean that if a source change causes the code to no longer perform, it is correctable in the source without rewriting it in assembly.
Is it already possible using GHC-specific extensions and libraries?
Would a custom FFI help avoid C calling-convention overhead?
Could a special purpose compiler plugin do it through a restricted source DSL?
Could a source-code generator from a "high-level" assembly (LLVM?) be a solution?
It sounds like you're looking for unboxed arrays. "Unboxed" in Haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation) by looking at the Core representation (a very Haskell-like intermediate language, the first stage in compilation). So e.g. you might see Int# in the Core output, which means an integer that has no heap representation (it will live in a register).
When optimizing Haskell code we regularly look at Core and expect to be able to manipulate or correct performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function such that it can be inlined). This isn't always fun, but it will be fairly stable, especially if you pin your compiler version.
Back to unboxed arrays: GHC exposes a lot of low-level primops in GHC.Prim, in particular it sounds like you want mutable unboxed arrays (MutableByteArray). The primitive package exposes these primops behind a slightly safer, friendlier API and is what you should use (and depend on if writing your own library).
There are many other libraries that implement unboxed arrays, such as vector, and which are built on MutableByteArray, but the point is that operations on that structure generate no garbage and likely compile down to pretty predictable machine instructions.
You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.
GHC also has a very powerful FFI, and you can research how to write portions of your program in C and interoperate; Haskell supports pinned arrays, among other structures, for this purpose.
If you need more control than those give you, then Haskell is likely the wrong language. It's impossible to tell from your description whether this is the case for your problem (your requirements seem contradictory: you need to be able to write a carefully cache-tuned algorithm, but arbitrary GC pauses are okay?).
One last note: you can't rely on GHC's native code generator to perform any of the low-level strength reduction optimizations that e.g. GCC performs (GHC's NCG will probably never ever know about bit-twiddling hacks, autovectorization, etc. etc.). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.
I'm a complete newbie in Haskell. One thing that always bugs me is the ambiguity in whether Haskell is a managed (term borrowed from MS) language like Java, or compiles to native code like C.
The GHC page says this "GHC compiles Haskell code either directly to native code or using LLVM as a back-end".
In the case of "compiled to native code", how can features like garbage collection be possible without something like a JVM?
Update:
Thanks so much for your answer. Conceptually, can you please help point out which one of my following understandings of garbage collection in Haskell is correct:
GHC compiles Haskell code to native code. In the process of compiling, garbage-collection routines are added to the original program code?
OR
There is a program that runs alongside the Haskell program to perform garbage collection?
As far as I am aware the term "managed language" specifically means a language that targets .NET/the Common Language Runtime. So no, Haskell is not a managed language and neither is Java.
Regarding what Haskell is compiled to: As the documentation you quoted says, GHC compiles Haskell to native code. It can do so by either directly emitting native code or by first emitting LLVM code and then letting LLVM compile that to native code. Either way the end result of running GHC is a native executable.
Besides GHC there are also other implementations of Haskell - most notably Hugs, which is a pure interpreter that never produces an executable (native or otherwise).
how can features like garbage collection be possible without something like a JVM?
The same way that they're possible with the JVM: Every time memory is allocated, it is registered with the garbage collector. Then from time to time the garbage collector runs, following the steps of the given garbage collection algorithm. GHC-compiled code uses generational garbage collection.
In response to your edit:
GHC compiles Haskell code to native code. In the process of compiling, garbage-collection routines are added to the original program code?
Basically. Except that saying "garbage collection routines will be added to the original program code" might paint the wrong picture. The GC routines are just part of the library that every Haskell program is linked against. The compiled code simply contains calls to those routines at the appropriate places.
Basically all there is to it is to call the GC's alloc function every time you would otherwise call malloc.
Just look at any GC library for C and how it's used: all you need to do is #include the library's header, link against the library, and replace each occurrence of malloc with the GC library's alloc function (and remove all calls to free), and bam, your code is garbage collected.
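For instance, with the Boehm-Demers-Weiser collector (libgc), a tiny C/C++ program looks roughly like this (a sketch; you'd link against -lgc):

```cpp
#include <gc.h>     // Boehm GC header
#include <stdio.h>

int main(void) {
  GC_INIT();  // initialize the collector that is linked into the binary
  for (int i = 0; i < 1000000; i++) {
    // GC_MALLOC instead of malloc, and no free: unreachable blocks are
    // reclaimed automatically whenever the collector decides to run.
    int* p = (int*)GC_MALLOC(16 * sizeof(int));
    p[0] = i;
  }
  printf("done\n");
  return 0;
}
```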
There is a program that runs alongside the Haskell program to perform garbage collection?
No.
whether Haskell is a managed (term borrowed from MS) language like Java
GHC-compiled programs include a garbage collector. (As far as I know, all implementations of Haskell include garbage collection, but this is not part of the specification.)
or a compile-to-native code like C?
GHC-compiled programs are compiled to native code. Hugs interprets programs, and does not compile to native code. There are several other implementations which all, as far as I know, compile to native code, but I list these separately because I'm not as confident of this fact.
In the case of "compiled to native code", how can features like garbage collection be possible without something like a JVM?
GHC-compiled programs include a runtime system that provides some basic capabilities like M-to-N green threading, garbage collection, and an IO manager. In a sense, this is a bit like having "something like a JVM" in that it provides many of the same features, but it's very different in implementation: there is no common bytecode across all architectures (and hence no "virtual machine").
which one of my following understandings of garbage collection in Haskell is correct:
GHC compiles Haskell code to native code. In the process of compiling, garbage-collection routines are added to the original program code?
There is a program that runs alongside the Haskell program to perform garbage collection?
Case 1 is correct: the runtime system code is added to the program code during compilation.
"Managed language" is an overloaded term so here are one-word answers and then some details for the usual different meanings that come to (my) mind:
Managed as in a CLR target
No, Haskell does not compile to Microsoft CLI's IL.
Well, I read there are some solutions that can do that, but imo, don't: the CLR isn't built for FP and will seriously lack optimizations, probably yielding research-language performance. If I personally really, really wanted to target the CLR, I'd use F# -- it's not a purely functional language, but it's close.
N.B. This is the most accurate and literal meaning of the term "managed language". The next meanings are, well, wrong, but nevertheless & unfortunately common.
Managed as in automatically garbage-collected
Yes, and this is pretty much a must-have. I mean, beyond the specification: if we had to collect the garbage ourselves, it would destroy the functional theme that lets us work at the high altitudes that are our beloved home.
It would also enforce impurity and a memory model.
Managed as in compiled to bytecode which is ran by a VM
No (usually).
It depends on your backend:
Not only do we have different Haskell compilers today, some compilers have several backends -- there are even backends for JavaScript!
So if you do want to target a VM, you can use an existing backend for it, or write one. But Haskell doesn't require it. So just as you can compile to a native bare-metal binary, you can compile to anything else.
In contrast to CLR languages like C#1 and VB.NET, and in contrast to Java, you don't have to target the JVM, the CLR, Mono, etc., as Haskell doesn't require a VM at all.
GHC is a good example. When you compile with GHC, it doesn't compile your code straight to binary; it compiles to an intermediate language called Core, then optimizes from Core to Core several times before it proceeds to another language called STG, and only then proceeds to code generation (it can stop there if you tell it to).2 And these days you can also use it to compile to LLVM bitcode (which is subject to some awesome optimizations). With the LLVM backend, GHC can produce wildly faster programs. For more information about it and about GHC backends, go here.
The diagram below illustrates the GHC compilation pipeline, and here you can find more information about the various stages.
See the fork at the bottom into three different targets? Those are the backends I was referring to.
1 A future exception and a fun fact: Microsoft is currently working on native .NET! The cunningly named Microsoft .NET Native.
What, for you, is the defining feature of a "managed language"? The phrase "GHC compiles Haskell code either directly to native code or using LLVM as a back-end" that you quote is quite clear about what GHC does, so I suspect the "ambiguity" that bugs you is rather in the term "managed language" than in GHC's docs.
In the case of "compiled to native code", how can features like garbage collection be possible without something like a JVM?
How exactly do you think "something like a JVM" implements features like garbage collection? The JVM isn't magic, it's just a program like everything else. At some level you need to have native code in order for the CPU to execute it, so clearly features like garbage collection are possible in native code.
For where you currently are, it's probably best to think of (GHC) Haskell as "managed," with the caveat that the platform GHC compiles to is not targeted by anything else. There is, of course, more to it than that, but that's a sufficient explanation in lieu of more Haskell experience.