What are the performance implications of interopping with other languages via system calls?

What are the performance implications of interopping with other languages via system calls? - node.js

Suppose I'm writing a program in node.js (or perhaps another typical back-end scripting language). Suppose further I have a C function f (or a python function, or what have you) that does some pure data transformation.
If I want to use f in my node program, there are two approaches:
Bind f via something like node-gyp that makes it callable from JavaScript land.
Make f into a binary (or, in the case of a language like python, a single f.py interface) that sits on the file system, and then call it from node as if were any other system command (so that one can then take the output from the system call as a string, convert it into node.js data, and then use it).
Question: What are the performance implications of choosing (2) over (1)?
This is important because if you are using a language like C to make some aspect of your application run significantly faster, then using (2) would seem pointless if it slowed things down past some threshold.

The cost of 1 is the cost of loading the native code, transfering arguments (ffi), calling the native code, and transfering arguments back. With loading being done only once.
The cost of 2 is always going to be the cost to startup the process, running the process, converting the results back from strings.
If the cost of f is high, you may never see a difference between 1 and 2. If the cost of f is low, then 2 will take longer because the process startup overhead will dominate.
However, depending on the complexity of f (it might be a very large data-processing application in C), it's almost always faster to create a native binding like 1. Avoiding process startup overhead is important, it also reduces the total amount of memory needed to run your application.
Alternatively you could do option:
Have the C code talk over a local network socket. Accepting requests and responding with answers when the computation is done.
This has the benefit of scaling out to multiple nodes if you need it.

Benchmarking both for your use case is the only way to be sure but method 1 is
likely to be faster.
The startup cost of calling a binary and starting an interpreter for python/perl/blah would likely kill any performance gain you might get using their Foreign Function Interface (FFI). Startup cost is one of the reasons why Apache has mod_python, mod_perl and why FastCGI exists.
Another thing to consider is that you're adding another language to the mix and this might kill performance of the team ie now everyone needs to know two languages and two FFI methods etc. If your app is in Node, keep it in Node and use node to call native methods.

Related

v8 memory spike (rss) when defining more than 1000 function (does not reproduce when using --jitless)

I have a simple node app with 1 function that defines 1000+ functions inside it (without running them).
When I call this function (the wrapper) around 200 times the RSS memory of the process spikes from 100MB to 1000MB and immediately goes down. (The memory spike only happens after around 200~ calls, before that all the calls do not cause a memory spike, and all the calls after do not cause a memory spike)
This issue is happening to us in our node server in production, and I was able to reproduce it in a simple node app here:
https://github.com/gileck/node-v8-memory-issue
When I use --jitless pr --no-opt the issue does not happen (no spikes). but obviously we do not want to remove all the v8 optimizations in production.
This issue must be some kind of a specific v8 optimization, I tried a few other v8 flags but non of them fix the issue (only --jitless and --no-opt fix it)
Anyone knows which v8 optimization could cause this?
Update:
We found that --no-concurrent-recompilation fix this issue (No memory spikes at all).
but still, we can't explain it.
We are not sure why it happens and which code changes might fix it (without the flag).
As one of the answers suggests, moving all the 1000+ function definitions out of the main function will solve it, but then those functions will not be able to access the context of the main function which is why they are defined inside it.
Imagine that you have a server and you want to handle a request.
Obviously, The request handler is going to run many times as the server gets a lot of requests from the client.
Would you define functions inside the request handler (so you can access the request context in those functions) or define them outside of the request handler and pass the request context as a parameter to all of them? We chose the first option... what do you think?

anyone knows which v8 optimization could cause this?
Load Elimination.
I guess it's fair to say that any optimization could cause lots of memory consumption in pathological cases (such as: a nearly 14 MB monster of a function as input, wow!), but Load Elimination is what causes it in this particular case.
You can see for yourself when your run with --turbo-stats (and optionally --turbo-filter=foo to zoom in on just that function).
You can disable Load Elimination if you feel that you must. A preferable approach would probably be to reorganize your code somewhat: defining 2,000 functions is totally fine, but the function defining all these other functions probably doesn't need to be run in a loop long enough until it gets optimized? You'll avoid not only this particular issue, but get better efficiency in general, if you define functions only once each.
There may or may not be room for improving Load Elimination in Turbofan to be more efficient for huge inputs; that's a longer investigation and I'm not sure it's worth it (compared to working on other things that likely show up more frequently in practice).
I do want to emphasize for any future readers of this that disabling optimization(s) is not generally a good rule of thumb for improving performance (or anything else), on the contrary; nor are any other "secret" flags needed to unlock "secret" performance: the default configuration is very carefully optimized to give you what's (usually) best. It's a very rare special case that a particular optimization pass interacts badly with a particular code pattern in an input function.

Running multiple independent Node programs in the same Node instance

I'm doing embedded development on a resource-limited system, and need to run a number of separate Node.js tasks (call them task1.js, task2.js and task3.js). The obvious solution would be to run them separately, e.g.:
$ node task1.js &
[1] 1968
$ node task2.js &
[2] 1969
$ node task3.js &
[3] 1970
$
This works, but I end up with three independent Node stacks, each with its own multi-megabyte heap, interpreter, etc. etc. etc., which is a waste that I'd like to avoid.
Another obvious solution would be to concatenate the source files:
$ cat task1.js task2.js task3.js | node -
This works, but it has problems. First, all three task sources would end up in the same module, so I'd risk name collisions. For example, if each task file included const crypto = require('crypto');, then when concatenated Node would complain about the multiply-defined crypto variable.
This would also require all of the primary task source files to be in the same directory, otherwise any relative path references to dependent files would be calculated based on the default working directory, and would likely break.
So, I'm looking for a way to run multiple tasks in the same Node instance, sharing Node resources as much as possible.
It would be great if some or all the following were true:
For development/debugging convenience, the same taskX.js sources could be used individually (as at the top), or run at the same time in the same node instance
No special care would have to be made in each task's code to prevent namespace collisions
Relative path references in include statements wouldn't all be resolved from the same working directory, so that I could have separate source trees for the separate tasks
Problems I don't need to be solved:
Multiprocessing or multithreading
Sharing data between the tasks
Inter-task events and communication services (if I do I'll write them myself)
Protecting each task from the others' bad behavior
Expected constraints for the task code:
No busy-waiting, so that none block the others from executing
No exclusive use of common system resources (e.g. no two will open a server socket on the same port)
Use of global Node resources will be restricted or forbidden

Is this resource limited to 1 CPU? If so, then your best bet is making each task return an async function, and processing them with something like async.parallel.
Especially if your subtasks are broken down into mostly async functions, this will allow the tasks to run as "parallel" as possible.
In a multi-cpu environment, you can boost performance using child processes (or using native node cluster module). But, as others stated, this would require the memory overhead of v8 for each process.
If your tasks are mostly cpu intensive, you will not see much gain from the async.parallel, and it could even be slower than doing all your tasks sync. But, if there is network or disk access (IO), then using parallel should be faster.

Does node.js really not optimize calls to [].slice.call(arguments)?

In the bluebird docs, they have this as an anti-pattern that stops optimization.. They call it argument leaking,
function leaksArguments2() {
var args = [].slice.call(arguments);
}
I do this all the time in Node.js. Is this really a problem. And, if so, why?
Assume only the latest version of Node.js.

Disclaimer: I am the author of the wiki page
It's a problem if the containing function is called a lot (being hot). Functions that leak arguments are not supported by the optimizing compiler (crankshaft).
Normally when a function is hot, it will be optimized. However if the function contains unsupported features like leaking arguments, being a hot function doesn't help and it will continue running slow generic code.
The performance of an optimized function compared to an unoptimized one is huge. For example consider a function that adds 3 doubles together: http://jsperf.com/213213213 21x difference.
What if it added 6 doubles together? 29x difference Generally the more code the function has, the more severe the punishment is for that function to run in unoptimized mode.
For node.js stuff like this in general is actually a huge problem due to the fact that any cpu time completely blocks the server. Just by optimizing the url parser that is included in node core (my module is 30x faster in node's own benchmarks), improves the requests per second of mysql-express from 70K rps to 100K rps in a benchmark that queries a database.
Good news is that node core is aware of this

Is this really a problem
For application code, no. For almost any module/library code, no. For a library such as bluebird that is intended to be used pervasively throughout an entire codebase, yes. If you did this in a very hot function in your application, then maybe yes.
I don't know the details but I trust the bluebird authors as credible that accessing arguments in the ways described in the docs causes v8 to refuse to optimize the function, and thus it's something that the bluebird authors consider worth using a build-time macro to get the optimized version.
Just keep in mind the latency numbers that gave rise to node in the first place. If your application does useful things like talking to a database or the filesystem, then I/O will be your bottleneck and optimizing/caching/parallelizing those will pay vastly higher dividends than v8-level in-memory micro-optimizations such as above.

What are the benefits of coroutines?

I've been learning some lua for game development. I heard about coroutines in other languages but really came up on them in lua. I just don't really understand how useful they are, I heard a lot of talk how it can be a way to do multi-threaded things but aren't they run in order? So what benefit would there be from normal functions that also run in order? I'm just not getting how different they are from functions except that they can pause and let another run for a second. Seems like the use case scenarios wouldn't be that huge to me.
Anyone care to shed some light as to why someone would benefit from them?
Especially insight from a game programming perspective would be nice^^

OK, think in terms of game development.
Let's say you're doing a cutscene or perhaps a tutorial. Either way, what you have are an ordered sequence of commands sent to some number of entities. An entity moves to a location, talks to a guy, then walks elsewhere. And so forth. Some commands cannot start until others have finished.
Now look back at how your game works. Every frame, it must process AI, collision tests, animation, rendering, and sound, among possibly other things. You can only think every frame. So how do you put this kind of code in, where you have to wait for some action to complete before doing the next one?
If you built a system in C++, what you would have is something that ran before the AI. It would have a sequence of commands to process. Some of those commands would be instantaneous, like "tell entity X to go here" or "spawn entity Y here." Others would have to wait, such as "tell entity Z to go here and don't process anymore commands until it has gone here." The command processor would have to be called every frame, and it would have to understand complex conditions like "entity is at location" and so forth.
In Lua, it would look like this:
local entityX = game:GetEntity("entityX");
entityX:GoToLocation(locX);
local entityY = game:SpawnEntity("entityY", locY);
local entityZ = game:GetEntity("entityZ");
entityZ:GoToLocation(locZ);
do
coroutine.yield();
until (entityZ:isAtLocation(locZ));
return;
On the C++ size, you would resume this script once per frame until it is done. Once it returns, you know that the cutscene is over, so you can return control to the user.
Look at how simple that Lua logic is. It does exactly what it says it does. It's clear, obvious, and therefore very difficult to get wrong.
The power of coroutines is in being able to partially accomplish some task, wait for a condition to become true, then move on to the next task.

Coroutines in a game:
Easy to use, Easy to screw up when used in many places.
Just be careful and not use it in many places.
Don't make your Entire AI code dependent on Coroutines.
Coroutines are good for making a quick fix when a state is introduced which did not exist before.
This is exactly what java does. Sleep() and Wait()
Both functions are the best ways to make it impossible to debug your game.
If I were you I would completely avoid any code which has to use a Wait() function like a Coroutine does.
OpenGL API is something you should take note of. It never uses a wait() function but instead uses a clean state machine which knows exactly what state what object is at.
If you use coroutines you end with up so many stateless pieces of code that it most surely will be overwhelming to debug.
Coroutines are good when you are making an application like Text Editor ..bank application .. server ..database etc (not a game).
Bad when you are making a game where anything can happen at any point of time, you need to have states.
So, in my view coroutines are a bad way of programming and a excuse to write small stateless code.
But that's just me.

It's more like a religion. Some people believe in coroutines, some don't. The usecase, the implementation and the environment all together will result into a benefit or not.
Don't trust benchmarks which try to proof that coroutines on a multicore cpu are faster than a loop in a single thread: it would be a shame if it were slower!
If this runs later on some hardware where all cores are always under load, it will turn out to be slower - ups...
So there is no benefit per se.
Sometimes it's convenient to use. But if you end up with tons of coroutines yielding and states that went out of scope you'll curse coroutines. But at least it isn't the coroutines framework, it's still you.

We use them on a project I am working on. The main benefit for us is that sometimes with asynchronous code, there are points where it is important that certain parts are run in order because of some dependencies. If you use coroutines, you can force one process to wait for another process to complete. They aren't the only way to do this, but they can be a lot simpler than some other methods.

I'm just not getting how different they are from functions except that
they can pause and let another run for a second.
That's a pretty important property. I worked on a game engine which used them for timing. For example, we had an engine that ran at 10 ticks a second, and you could WaitTicks(x) to wait x number of ticks, and in the user layer, you could run WaitFrames(x) to wait x frames.
Even professional native concurrency libraries use the same kind of yielding behaviour.

Lots of good examples for game developers. I'll give another in the application extension space. Consider the scenario where the application has an engine that can run a users routines in Lua while doing the core functionality in C. If the user needs to wait for the engine to get to a specific state (e.g. waiting for data to be received), you either have to:
multi-thread the C program to run Lua in a separate thread and add in locking and synchronization methods,
abend the Lua routine and retry from the beginning with a state passed to the function to skip anything, least you rerun some code that should only be run once, or
yield the Lua routine and resume it once the state has been reached in C
The third option is the easiest for me to implement, avoiding the need to handle multi-threading on multiple platforms. It also allows the user's code to run unmodified, appearing as if the function they called took a long time.

How can threads be avoided?

I've read a lot recently about how writing multi-threaded apps is a huge pain in the neck, and have learned enough about the topic to understand, at least at some level, why it is so.
I've read that using functional programming techniques can help alleviate some of this pain, but I've never seen a simple example of functional code that is concurrent. So, what are some alternatives to using threads? At least, what are some ways to abstract them away so you needn't think about things like locking and whether a particular library's objects are thread-safe.
I know Google's MapReduce is supposed to help with the problem, but I haven't seen a succinct explanation of it.
Although I'm giving a specific example below, I'm more curious of general techniques than solving this specific problem (using the example to help illustrate other techniques would be helpful though).
I came to the question when I wrote a simple web crawler as a learning exercise. It works pretty well, but it is slow. Most of the bottleneck comes from downloading pages. It is currently single threaded, and thus only downloads a single page at a time. Thus, if the pages can be downloaded concurrently, it would speed things up dramatically, even if the crawler ran on a single processor machine. I looked into using threads to solve the issue, but they scare me. Any suggestions on how to add concurrency to this type of problem without unleashing a terrible threading nightmare?

The reason functional programming helps with concurrency is not because it avoids using threads.
Instead, functional programming preaches immutability, and the absence of side effects.
This means that an operation could be scaled out to N amount of threads or processes, without having to worry about messing with shared state.

Actually, threads are pretty easy to handle until you need to synchronize them. Usually, you use threadpool to add task and wait till they are finished.
It is when threads need to communicate and access shared data structures that multi threading becomes really complicated. As soon as you have two locks, you can get deadlocks, and this is where multithreading gets really hard. Sometimes, your locking code could be wrong by just a few instructions. In that case, you could only see bugs in production, on multi-core machines (if you developed on single core, happened to me) or they could be triggered by some other hardware or software. Unit testing doesn't help much here, testing finds bugs, but you can never be as sure as in "normal" apps.

I'll add an example of how functional code can be used to safely make code concurrent.
Here is some code you might want to do in parallel, so you don't have wait for one file to finish to start downloading the next:
void DownloadHTMLFiles(List<string> urls)
{
foreach(string url in urls)
{
DownlaodOneFile(url); //download html and save it to a file with a name based on the url - perhaps used for caching.
}
}
If you have a number of files the user might spend a minute or more waiting for them all. We can re-write this code functionally like this, and it basically does the exact same thing:
urls.ForEach(DownloadOneFile);
Note that this still runs sequentially. However, not only is it shorter, we've gained an important advantage here. Since each call to the DownloadOneFile function is completely isolated from the others (for our purposes, available bandwidth isn't an issue) you could very easily swap out the ForEach function for another very similar function: one that kicks off each call to DownlaodOneFile on a separate thread from a threadpool.
It turns out .Net has just such a function availabe using Parallel Extensions. So, by using functional programming you can change one line of code and suddenly have something run in parallel that used to run sequentially. That's pretty powerful.

There are a couple of brief mentions of asynchronous models but no one has really explained it so I thought I'd chime in. The most common method I've seen used as an alternative for multi-threading is asynchronous architectures. All that really means is that instead of executing code sequentially in a single thread, you use a polling method to initiate some functions and then come back and check periodically until there's data available.
This really only works in models like your aforementioned crawler, where the real bottleneck is I/O rather than CPU. In broad strokes, the asynchronous approach would initiate the downloads on several sockets, and a polling loop periodically checks to see if they're finished downloading and when that's done, we can move on to the next step. This allows you to run several downloads that are waiting on the network, by context switching within the same thread, as it were.
The multi-threaded model would work much the same, except using a separate thread rather than a polling loop checking multiple sockets in the same thread. In an I/O bound application, asynchronous polling works almost as well as threading for many use cases, since the real problem is simply waiting for the I/O to complete and not so much the waiting for the CPU to process the data.
Another real world example is for a system that needed to execute a number of other executables and wait for results. This can be done in threads, but it's also considerably simpler and almost as effective to simply fire off several external applications as Process objects, then check back periodically until they're all finished executing. This puts the CPU-intensive parts (the running code in the external executables) in their own processes, but the data processing is all handled asynchronously.
The Python ftp server lib I work on, pyftpdlib uses the Python asyncore library to handle serving FTP clients with only a single thread, and asynchronous socket communication for file transfers and command/response.
See for further reading the Python Twisted library's page on Asynchronous Programming - while somewhat specific to using Twisted, it also introduces async programming from a beginner perspective.

Concurrency is quite a complicated subject in computer science, which demands good understanding of hardware architecture as well as operating system behavior.
Multi-threading has many implementations based on your hardware and your hosting OS, and as tough as it is already, the pitfalls are numerous. It should be noted that in order to achieve "true" concurrency, threads are the only way to go. Basically, threads are the only way for you as a programmer to share resources between different parts of your software while allowing them to run in parallel. By parallel you should consider that a standard CPU (dual/multi-cores aside) can only do one thing at a time. Concepts like context switching now come into play, and they have their own set of rules and limitations.
I think you should seek more generic background on the subject, like you are saying, before you go about implementing concurrency in your program.
I guess the best place to start is the wikipedia article on concurrency, and go on from there.

What typically makes multi-threaded programming such a nightmare is when threads share resources and/or need to communicate with each other. In the case of downloading web pages, your threads would be working independently, so you may not have much trouble.
One thing you may want to consider is spawning multiple processes rather than multiple threads. In the case you mention--downloading web pages concurrently--you could split the workload up into multiple chunks and hand each chunk off to a separate instance of a tool (like cURL) to do the work.

If your goal is to achieve concurrency it will be hard to get away from using multiple threads or processes. The trick is not to avoid it but rather to manage it in a way that is reliable and non-error prone. Deadlocks and race conditions in particular are two aspects of concurrent programming that are easy to get wrong. One general approach to manage this is to use a producer/consumer queue... threads write work items to the queue and workers pull items from it. You must make sure you properly synchronize access to the queue and you're set.
Also, depending on your problem, you may also be able to create a domain specific language which does away with concurrency issues, at least from the perspective of the person using your language... of course the engine which processes the language still needs to handle concurrency, but if this will be leveraged across many users it could be of value.

There are some good libraries out there.
java.util.concurrent.ExecutorCompletionService will take a collection of Futures (i.e. tasks which return values), process them in background threads, then bung them in a Queue for you to process further as they complete. Of course, this is Java 5 and later, so isn't available everywhere.
In other words, all your code is single threaded - but where you can identify stuff safe to run in parallel, you can farm it off to a suitable library.
Point is, if you can make the tasks independent, then thread safety isn't impossible to achieve with a little thought - though it is strongly recommended you leave the complicated bit (like implementing the ExecutorCompletionService) to an expert...

One simple way to avoid threading in your simple scenario, Is to download from different processes. The main process will invoke other processes with parameters that will download the files to local directory, And then the main process can do the real job.
I don't think that there are any simple solution to those problems. Its not a threading problem. Its the concurrency that brake the human mind.

You might watch the MSDN video on the F# language: PDC 2008: An introduction to F#
This includes the two things you are looking for. (Functional + Asynchronous)

For python, this looks like an interesting approach: http://members.verizon.net/olsongt/stackless/why_stackless.html#introduction

Use Twisted. "Twisted is an event-driven networking engine written in Python" http://twistedmatrix.com/trac/. With it, I could make 100 asynchronous http requests at a time without using threads.

Your specific example is seldom solved with multi-threading. As many have said, this class of problems is IO-bound, meaning the processor has very little work to do, and spends most of it's time waiting for some data to arrive over the wire and to process that, and similarly it has to wait for disk buffers to flush so that it can put more of the recently downloaded data on disk.
The method to performance is through the select() facility, or an equivalent system call. The basic process is to open a number of sockets (for the web crawler downloads) and file handles (for storing them to disk). Next you set all of the different sockets and fh to non-blocking mode, meaning that instead of making your program wait until data is available to read after issuing a request, it returns right away with a special code (usually EAGAIN) to indicate that no data is ready. If you looped through all of the sockets in this way you would be polling, which works well, but is still a waste of cpu resources because your reads and writes will almost always return with EAGAIN.
To get around this, all of the sockets and fp's will be collected into a 'fd_set', which is passed to the select system call, then your program will block, waiting on ANY of the sockets, and will awaken your program when there's some data on any of the streams to process.
The other common case, compute bound work, is without a doubt best addressed with some sort of true parallelism (as apposed to the asynchronous concurrency presented above) to access the resources of multiple cpu's. In the case that your cpu bound task is running on a single threaded archetecture, definately avoid any concurrency, as the overhead will actually slow your task down.

Threads are not to be avoided nor are they "difficult". Functional programming is not necessarily the answer either. The .NET framework makes threading fairly simple. With a little thought you can make reasonable multithreaded programs.
Here's a sample of your webcrawler (in VB.NET)
Imports System.Threading
Imports System.Net
Module modCrawler
Class URLtoDest
Public strURL As String
Public strDest As String
Public Sub New(ByVal _strURL As String, ByVal _strDest As String)
strURL = _strURL
strDest = _strDest
End Sub
End Class
Class URLDownloader
Public id As Integer
Public url As URLtoDest
Public Sub New(ByVal _url As URLtoDest)
url = _url
End Sub
Public Sub Download()
Using wc As New WebClient()
wc.DownloadFile(url.strURL, url.strDest)
Console.WriteLine("Thread Finished - " & id)
End Using
End Sub
End Class
Public Sub Download(ByVal ud As URLtoDest)
Dim dldr As New URLDownloader(ud)
Dim thrd As New Thread(AddressOf dldr.Download)
dldr.id = thrd.ManagedThreadId
thrd.SetApartmentState(ApartmentState.STA)
thrd.IsBackground = False
Console.WriteLine("Starting Thread - " & thrd.ManagedThreadId)
thrd.Start()
End Sub
Sub Main()
Dim lstUD As New List(Of URLtoDest)
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file0.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file1.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file2.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file3.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file4.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file5.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file6.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file7.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file8.txt"))
lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file9.txt"))
For Each ud As URLtoDest In lstUD
Download(ud)
Next
' you will see this message in the middle of the text
' pressing a key before all files are done downloading aborts the threads that aren't finished
Console.WriteLine("Press any key to exit...")
Console.ReadKey()
End Sub
End Module

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string