Get the BytecodeArray of a Local<Function> on Nodejs c++ land - node.js

I'm struggling a bit with the code base of nodejs + v8.
The goal is to get the bytecode of a function/module (looking at the code, they appear to be the same) and disassemble it using the BytecodeArray::Disassemble function, ideally without side effects, i.e. without executing the code.
The problem is that it's not clear how to get the bytecode in the first place.

(V8 developer here.) V8's API does not provide access to functions' bytecode. That's intentional, because bytecode is an internal implementation detail. For inspecting bytecode, the --print-bytecode flag is the way to go.
If you insist on mucking with internal details, then of course you can circumvent the public API and poke at V8's internals. From a v8::internal::JSFunction you can get to the v8::internal::SharedFunctionInfo, check whether it HasBytecodeArray(), and if so, call GetBytecodeArray() on it. Disassembling bytecode never has side effects, and never executes the bytecode. Note that it's entirely possible for a function to have no bytecode at a given moment in time: bytecode is created lazily when it's needed, and thrown away if it hasn't been used in a while. If you dig far enough, you can interfere with those mechanisms too, but...:
Needless to say, accessing internal details is totally unsupported, not recommended, and even if you get it to work in Node version x.y, it may break in x.(y+1), because that's what "internal details" means.
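For the supported route mentioned above, Node passes V8 flags through on its command line, so --print-bytecode can be combined with --print-bytecode-filter to dump only a named function's bytecode. A minimal sketch (file path and function name are illustrative):

```shell
# Write a tiny script with one function, then ask V8 to print its bytecode.
# --print-bytecode-filter limits the output to functions matching the name.
cat > /tmp/demo.js <<'EOF'
function add(a, b) { return a + b; }
add(1, 2);
EOF
node --print-bytecode --print-bytecode-filter=add /tmp/demo.js
```

Because compilation is lazy, the function has to actually run (or at least be compiled) for its bytecode to appear in the output.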

Related

When to use "cold" built-in codegen attribute in Rust?

There isn't much information on this attribute in the reference document other than
The cold attribute suggests that the attributed function is unlikely to be called.
How does it work internally, and when should a Rust developer use it?
It tells LLVM to mark the function as cold (i.e. not called often), which changes how the function is optimized: calls to this code are potentially slower, and calls to non-cold code are potentially faster.
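As an illustrative sketch (the function names here are made up): a fallible hot path whose error branch is marked cold, nudging the optimizer to favor the common case.

```rust
// Hypothetical example: the overflow branch is rarely taken, so we mark
// its handler cold. LLVM will deprioritize it when laying out and
// optimizing the code, keeping the hot path tight.
#[cold]
#[inline(never)]
fn report_overflow(x: u64) -> u64 {
    eprintln!("overflow while doubling {x}");
    u64::MAX
}

fn checked_double(x: u64) -> u64 {
    match x.checked_mul(2) {
        Some(v) => v,               // hot path
        None => report_overflow(x), // cold path
    }
}

fn main() {
    println!("{}", checked_double(21)); // prints 42
    println!("{}", checked_double(u64::MAX)); // takes the cold branch
}
```

Whether this helps in practice depends entirely on your workload, which is why the benchmarking disclaimer below matters.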
Mandatory disclaimer about performance tweaking:
You really should have some benchmarks in place before you start marking various bits of code as cold. You may have some ideas about whether something is in the hot path or not, but unless you test it, you can't know for sure.
FWIW, there's also the perma-unstable LLVM intrinsics likely and unlikely, which do a similar thing, but these have been known to actually hurt performance, even when used correctly, by preventing other optimizations from happening. Here's the RFC: https://github.com/rust-lang/rust/issues/26179
As always: benchmark, benchmark, benchmark! And then benchmark some more.

How to Decompile Bytenode "jsc" files?

I've just seen the library ByteNode; it's similar to Java bytecode, but for Node.js.
This library compiles your JavaScript code into V8 bytecode, which protects your source code. I'm wondering: is there any way to decompile ByteNode output, making it not secure enough? I ask because I would like to protect my source code using this library.
TL;DR It'll raise the bar to someone copying the code and trying to pass it off as their own. It won't prevent a dedicated person from doing so. But the primary way to protect your work isn't technical, it's legal.
This library compiles your JavaScript code into V8 bytecode, which protects your source code...
Well, we don't know it's V8 bytecode, but it's "compiled" in some sense. All we know is that it creates a "code cache" via the built-in vm.Script.prototype.createCachedData API, which is officially just a cache used to speed up recompiling the code a second time, third time, etc. In theory, you're supposed to also provide the original source code as a string to the vm.Script constructor. But if you go digging far enough into Node.js's vm.Script and V8, the cached data seems to be the actual code in some compiled form (whether actual V8 bytecode or not), and the code string you give it when running is ignored. (The ByteNode library provides a dummy string when running the code from the code cache, so clearly the actual code isn't [always?] needed.)
I'm wondering: is there any way to decompile ByteNode output, making it not secure enough?
Naturally, otherwise it would be useless because Node.js wouldn't be able to run it. I didn't find a tool to do it that already exists, but since V8 is open source, it would presumably be possible to find the necessary information to write a decompiler for it that outputs valid JavaScript source code which someone could then try to understand.
Experimenting with it, local variable names appear to be lost, although function names don't. Comments appear to get lost (this may not be as obvious as it seems, given that Function.prototype.toString is required to either return the original source text or a synthetic version [details]).
So if you run the code through a minifier (particularly one that renames functions), then run it through ByteNode (or just do it with vm.Script yourself; ByteNode is a fairly thin wrapper), it will be feasible for someone to decompile it into something resembling source code, but that source code will be very hard to understand. This is very similar to shipping Java class files, which can be decompiled (there's even a standard tool to do it in the JDK, javap), except that the Java class file format is well-documented and doesn't change from one dot release to the next (though it can change from one major release to another; new releases always support the older format), whereas the format of this data is not documented (though it's an open source project) and is subject to change from one dot release to the next.
Certain changes, such as changing the copyright message, are probably fairly easy to make to said source code. More meaningful changes will be harder.
Note that the code cache appears to have a checksum or other similar integrity mechanism, since directly editing the .jsc file to swap one letter for another in a literal string makes the code cache fail to load. So someone tampering with it (for instance, to change a copyright notice) would either need to go the decompilation/recompilation route, or dive into the V8 source to find out how to correct the integrity check.
Fundamentally, the way to protect your work is to ensure that you've put all the relevant notices in the relevant places, so that the fact that copying it is a copyright violation is clear, and then to pursue your legal recourse should you find out about someone passing it off as their own.
is there any way
You could get a hundred answers here saying "I don't know a way", but that still won't guarantee that there isn't one.
not secure enough
Secure enough for what? What's your deployment scenario? What kind of scenario/attack are you trying to defend against?
FWIW, I don't know of an existing tool that "decompiles" V8 bytecode (i.e. produces JavaScript source code with the same behavior). That said, considering that the bytecode is a fairly straightforward translation of the source code, I'm sure it wouldn't be very hard to write such a tool, if someone had a reason to spend some time on it. After all, V8's JS-to-bytecode compiler is open source, so one would only have to look at those sources and implement the reverse direction. So I would assume that shipping as bytecode provides about as much "protection" as shipping as uglified JavaScript, i.e. none that I would trust.
Before you make any decisions, please also keep in mind that bytecode is considered an internal implementation detail of V8; in particular it is not versioned and can change at any time, so it has to be created by exactly the same V8 version that consumes it. If you want to update your Node.js you'll have to recreate all the bytecode, and there is no checking or warning in place that will point out when you forgot to do that.
The Node.js source already contains code for disassembling binary bytecode.
You can get a text string from your V8 bytecode, and then you would need to analyze it.
But the text string would be very long and would miss some important information, such as the constant pool. So you need to modify the Node.js source.
Please check https://github.com/3DGISKing/pkg10.17.0
I have attached an exported XML file.
If you study V8, it should be possible to analyze it and recover source code from it.
Keeping it short and sweet: you can try the Ghidra node.js package, which is based on the Ghidra reverse engineering framework open-sourced by the NSA in 2019. Ghidra is capable of disassembling and decompiling V8 bytecode. The inner workings of the disassembly are quite complex; this answer is short but sufficient.

v8 Engine - Why is calling native code from JS so expensive?

Based on multiple answers to other questions, calling native C++ from Javascript is expensive.
I checked myself with the node module "benchmark" and came to the same conclusion.
A simple JS function reaches ~90,000,000 calls/sec when called directly, while calling a C++ function tops out at about 25,000,000 calls/sec. That in itself is not that bad.
But when the function also creates an object, the JS version still manages about 70,000,000 calls/sec, while the native version suffers dramatically and drops to about 2,000,000.
I assume this has to do with the dynamic nature of how the V8 engine works, and with the fact that it compiles the JS code to bytecode.
But what keeps them from implementing the same optimizations for the C++ code? (or at least calling / insight into what would help there)
(V8 developer here.) Without seeing the code that you ran, it's hard to be entirely sure what effect you were observing, and based on your descriptions I can't reproduce it. Microbenchmarks in particular tend to be tricky, and the relative speedups or slowdowns they appear to be measuring are often misleading, unless you've verified that what happens under the hood is exactly what you expect to be happening. For example, it could be the case that the optimizing compiler was able to eliminate the entire workload because it could statically prove that the result isn't used anywhere. Or it could be the case that no calls were happening at all, because the compiler chose to inline the callee.
Generally speaking, crossing the JS/C++ boundary is what has a certain cost, due to different calling conventions and some other checks and preparations that need to be done, like checking for exceptions that may have been thrown. Both one JavaScript function calling another, and one C++ function calling another, will be faster than JavaScript calling into C++ or the other way round.
This boundary crossing cost is unrelated to the level of compiler optimization on either side. It's also unrelated to byte code. ("Hot", i.e. frequently executed, JavaScript functions are compiled to machine code anyway.)
Lastly, V8 is not a C++ compiler. It's simply not built to do any optimizations for C++ code. And even if it tried to, there's no reason to assume it could do a better job than your existing C++ compiler with -O3. (V8 also doesn't even see the source code of your C++ module, so before you could experiment with recompiling that, you'd have to figure out how to provide that source.)
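The microbenchmark caveat above can be made concrete. A common failure mode is timing work whose result is never observed, which the optimizing compiler may then eliminate entirely. A sketch of one way to keep the measurement honest (names and iteration count are illustrative):

```javascript
// Accumulate results into a value that is actually used afterwards, so
// the optimizer cannot prove the calls are dead and remove the loop body.
function add(a, b) { return a + b; }

let sink = 0;

const start = process.hrtime.bigint();
for (let i = 0; i < 1_000_000; i++) {
  sink += add(i, 1);
}
const elapsedNs = process.hrtime.bigint() - start;

console.log(sink); // observing the result keeps the workload live
console.log(`${elapsedNs} ns for 1e6 calls`);
```

Even with a sink, inlining can still make "one million calls" into no calls at all, which is why checking what actually runs (e.g. with --print-opt-code or a profiler) matters before trusting the numbers.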
Without delving into specific V8 versions and their intrinsic reasons, I can say that the overhead is not in how the C++ backend works versus the JavaScript, but in the pathway between the languages: the binary interface which implements the invocation of a native method from JavaScript land, and vice versa.
The operations involved in a cross-invocation, in my understanding are:
Prepare the arguments.
Save the JS context.
Invoke the gate code (which implements the bridge).
The bridge translates the arguments into C++-style parameters.
The bridge also translates the calling convention to match C++.
Invokes a C++ Runtime API wrapper in V8.
This wrapper calls the actual method to perform the operation.
The same is reversely performed when the C++ function returns.
Maybe there are additional steps involved here, but I guess this in itself suffices to explain why the overhead surfaces.
Now, coming to JS optimizations: the JIT compiler which comes with the V8 engine has two parts: the first simply converts the script into machine code, and the second optimizes the code based on collected profile information. This is a purely dynamic process, and a great, unique opportunity which a C++ compiler, working in the static compilation space, cannot match. For example, if the JIT observes that an object is created and destroyed in a block of JS code without escaping its scope, it can avoid the heap allocation entirely (escape analysis with stack allocation or scalar replacement), whereas the object will always live on the JS heap when the native version is invoked.
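A sketch of the kind of non-escaping object this refers to (the function is made up; whether the JIT actually applies escape analysis here depends on the engine and version):

```javascript
// `p` never leaves this function: it is not returned, stored, or passed
// anywhere. An optimizing JIT can therefore replace it with plain local
// values ("scalar replacement") and skip the heap allocation altogether.
function distanceSquared(x1, y1, x2, y2) {
  const p = { dx: x2 - x1, dy: y2 - y1 };
  return p.dx * p.dx + p.dy * p.dy;
}

console.log(distanceSquared(0, 0, 3, 4)); // → 25
```

An object allocated through the native API, by contrast, is always a real heap object, because the engine cannot see across the language boundary to prove it never escapes.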
Thanks for bringing this up, it is an interesting conversation!

How safe is Safe Haskell?

I was thinking about Safe Haskell and I wonder how much I can trust it?
Some fictional scenarios:
I am a little hacker writing a programmable game (think Robocode) where I allow others to program their own entities to compete against each other. Most of the time users will run some untrusted programs on private machines. Untrusted code would probably be inspected before running it.
I am the programmer of an application that is used by several clients. I provide an api so they can extend the functionality and encourage my users to share their plugins. The user community is small and most of the time there is mutual trust, but occasionally someone is working on a top-secret client project and any dataleaks would prove disastrous.
I am ... Google (or Facebook, Yahoo, etc.) and want to allow my clients to script their email accounts. Scripts are uploaded and run on my servers. Any access violations would be fatal.
Given these scenarios:
would Safe Haskell be appropriate to ensure sandboxing and access restriction?
Should someone in the given situations be able to trust the promises made?
As a rule of thumb, I'd say safe Haskell tries to get roughly where the safe subset of C# is. For your scenarios:
You can use Safe Haskell, because you are inspecting the code.
You cannot really use Safe Haskell alone to avoid disastrous data leaks.
I wouldn't recommend Google or Yahoo rely on Safe Haskell alone to run untrusted code. For one thing, it doesn't manage excessive resource consumption (CPU, memory, disk) or bottoms (undefined or while true). Use an OS sandbox for that.
A note on undefined: operationally, it stops the function returning a value by throwing an exception, as does the error function. Denotationally, it's considered to be the 'bottom' value. Now, even if safe Haskell disallowed undefined and error, a function could still fail to return, just by looping endlessly. And an endless loop is bottom too. So safe Haskell guarantees type and memory safety but doesn't try to guarantee that functions terminate. Safe Haskell is, of course, Turing complete, so it's not possible in general to prove termination. Furthermore, since out-of-memory throws an exception, functions may terminate with them. Finally, pattern match errors throw exceptions. So safe Haskell cannot eliminate bottoms of any kind and may as well allow explicit undefined and error.
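To make the boundary concrete (module and function names here are made up): a module compiled with the Safe pragma cannot import unsafe facilities such as unsafePerformIO, yet it can still express bottom, so termination is not guaranteed:

```haskell
{-# LANGUAGE Safe #-}
module Spin where

-- Under Safe, this import is rejected at compile time:
-- import System.IO.Unsafe (unsafePerformIO)

-- Type- and memory-safe, but still bottom: Safe Haskell does not
-- (and cannot, in general) guarantee termination.
spin :: Int
spin = spin
```

This is exactly the trade-off described above: Safe Haskell fences off the unsafe primitives, not the halting problem.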
To my knowledge, Safe Haskell is not safe. Someone can use unsafePerformIO in a package and manually override the "unsafe" factor. If not, every package that had any dependencies on C programs or system libraries could not be marked safe. (Think about libgmp.so, which is linked into almost everyone's Haskell base packages. For the base packages to be marked safe, this must somehow be getting explicitly marked as safe even though it uses unsafePerformIO.)

Safe execution of untrusted Haskell code

I'm looking for a way to run an arbitrary Haskell code safely (or refuse to run unsafe code).
Must have:
module/function whitelist
timeout on execution
memory usage restriction
Capabilities I would like to see:
ability to kill thread
compiling the modules to native code
caching of compiled code
running several interpreters concurrently
complex datatypes for compiler errors (instead of a simple String message)
With that sort of functionality it would be possible to implement a browser plugin capable of running arbitrary Haskell code, which is the idea I have in mind.
EDIT: I've got two answers, both great. Thanks! The sad part is that there doesn't seem to be a ready-to-go library, just a similar program. It's a useful resource, though. Anyway, I think I'll wait for 7.2.1 to be released and try to use SafeHaskell in my own program.
We've been doing this for about 8 years now in lambdabot, which supports:
a controlled namespace
OS-enforced timeouts
native code modules
caching
concurrent interactive top-levels
custom error message returns.
This series of rules is documented, see:
Safely running untrusted Haskell code
mueval, an alternative implementation based on ghc-api
The approach to safety taken in lambdabot inspired the Safe Haskell language extension work.
For approaches to dynamic extension of compiled Haskell applications, in Haskell, see the two papers:
Dynamic Extension of Typed Functional Languages, and
Dynamic applications from the ground up.
GHC 7.2.1 will likely have a new facility called SafeHaskell which covers some of what you want. SafeHaskell ensures type-safety (so things like unsafePerformIO are outlawed), and establishes a trust mechanism, so that a library with a safe API but implemented using unsafe features can be trusted. It is designed exactly for running untrusted code.
For the other practical aspects (timeouts and so on), lambdabot as Don says would be a great place to look.
