NodeJS, Promises and performance

NodeJS, Promises and performance - node.js

My question is about performance in my NodeJS app...
If my program run 12 iteration of 1.250.000 each = 15.000.000 iterations all together - it takes dedicated servers at Amazon the following time to process:
r3.large: 2 vCPU, 6.5 ECU, 15 GB memory --> 123 minutes
4.8xlarge: 36 vCPU, 132 ECU, 60 GB memory --> 102 minutes
I have some code similair to the code below...
start();
start(){
for(var i=0; i<12; i++){
function2(); // Iterates over a collection - which contains data split up in intervals - by date intervals. This function is actually also recursive - due to the fact - that is run through the data many time (MAX 50-100 times) - due to different intervals sizes...
}
}
function2(){
return new Promise{
for(var i=0; i<1.250.000; i++){
return new Promise{
function3(); // This function simple iterate through all possible combinations - and call function3 - with all given values/combinations
}
}
}
}
function3(){
return new Promise{ // This function simple make some calculations based on the given values/combination - and then return the result to function2 - which in the end - decides which result/combination was the best...
}}
This is equal to 0.411 millisecond / 441 microseconds pér iteration!
When i look at performance and memory usage in the taskbar... the CPU is not running at 100% - but more like 50%...the entire time?
The memory usage starts very low - but KEEPS growing in GB - every minute until the process is done - BUT the (allocated) memory is first released when i press CTRL+C in the Windows CMD... so its like the NodeJS garbage collection doesn't not work optimal - or may be its simple the design of the code again...
When i execute the app i use the memory opt like:
node --max-old-space-size="50000" server.js
PLEASE tell me every thing you thing i can do - to make my program FASTER!
Thank you all - so much!

It's not that the garbage collector doesn't work optimally but that it doesn't work at all - you don't give it any chance to.
When developing the tco module that does tail call optimization in Node i noticed a strange thing. It seemed to leak memory and I didn't know why. It turned out that it was because of few console.log()
calls in various places that I used for testing to see what's going on because seeing a result of recursive call millions levels deep took some time so I wanted to see something while it was doing it.
Your example is pretty similar to that.
Remember that Node is single-threaded. When your computations run, nothing else can - including the GC. Your code is completely synchronous and blocking - even though it's generating millions of promises in a blocking manner. It is blocking because it never reaches the event loop.
Consider this example:
var a = 0, b = 10000000;
function numbers() {
while (a < b) {
console.log("Number " + a++);
}
}
numbers();
It's pretty simple - you want to print 10 million numbers. But when you run it it behaves very strangely - for example it prints numbers up to some point, and then it stops for several seconds, then it keeps going or maybe starts trashing if you're using swap, or maybe gives you this error that I just got right after seeing the Number 8486:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
What's going on here is that the main thread is blocked in a synchronous loop where it keeps creating objects but the GC has no chance to release them.
For such long running tasks you need to divide your work and get into the event loop once in a while.
Here is how you can fix this problem:
var a = 0, b = 10000000;
function numbers() {
var i = 0;
while (a < b && i++ < 100) {
console.log("Number " + a++);
}
if (a < b) setImmediate(numbers);
}
numbers();
It does the same - it prints numbers from a to b but in bunches of 100 and then it schedules itself to continue at the end of the event loop.
Output of $(which time) -v node numbers1.js 2>&1 | egrep 'Maximum resident|FATAL'
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Maximum resident set size (kbytes): 1495968
It used 1.5GB of memory and crashed.
Output of $(which time) -v node numbers2.js 2>&1 | egrep 'Maximum resident|FATAL'
Maximum resident set size (kbytes): 56404
It used 56MB of memory and finished.
See also those answers:
How to write non-blocking async function in Express request handler
How node.js server serve next request, if current request have huge computation?
Maximum call stack size exceeded in nodejs
Node; Q Promise delay
How to avoid jimp blocking the code node.js

Related

Parallel traveling salesman problem solver finishes prematurely

Good time of the day. I'm doing an assignment which involves writing a program that solves Traveling Salesman Problem in parallel using brute-force method. I managed to write a working version but it was a lot slower than consequential version due to numerous memory allocations at a high rate which I attempted to limit as in a code bellow using buffered channel.
SO probably won't allow to fit all the code into post so, please view definition and methods of data structures on Github: https://github.com/telephrag/tsp_go
This version doesn't work at all.
At the end t will contain empty path and maximum value of uint64 for traveled distance which is set at the beginning.
As could be seen through debugger, salesman with id of 1 is getting node added to its path at sm.Path[1], but once <-t.RouteQueue occurs in the same call to travel() the program stops despite numerous goroutines are supposedly waiting to write to t.RouteQueue at the moment. It's also confirmed through debugger as well that the program never reaches if-block responsible for setting new shortest path in t.
If we create sync.Waitgroup for each for-loop the program will crash on deadlock.
sync.WaitGroup but remove everything related to channel the program would work but do so very slowly. You can get this version at the first commit to main branch of repository above.
Why does the program ends prematurely?
package tsp
func (t *Tsp) Solve() {
sm := NewSalesman(t)
sm.Visit(0) // sets bit corresponding to given node in bitmask
for nextNode := range graph[0] {
t.RouteQueue <- true // book slot in route queue (buffered channel)
go t.travel(sm.Copy(t), nextNode)
}
}
func (t *Tsp) travel(sm *Salesman, node uint64) {
sm.Distance += graph[sm.TailNode()][node] // increase traveled distance
if sm.Distance > t.MinDist { // terminate if traveled distance is bigger than current minimal
<-t.RouteQueue
return
}
sm.Visit(node)
sm.Count++ // increase amount of nodes traveled
sm.Path[sm.Count] = node // add node to path
if sm.Count == t.NodeCount-1 { // stop if t.NodeCount - 1 nodes traveled
sm.Count++
sm.Distance += graph[node][0] // return to zero-node
t.Mu.Lock()
if t.MinDist > sm.Distance { // set new min distance and path if they are shorter
t.MinPath = sm.Path
t.MinDist = sm.Distance
}
t.Mu.Unlock()
}
<-t.RouteQueue // free the slot for routes
for nextNode := range graph[node] {
if !sm.HasVisited(node) {
t.RouteQueue <- true
go t.travel(sm.Copy(t), nextNode)
}
}
}

Office JS setDataAsync function memory leak

When using Async functions in the shared API of the office js in Excel 2016 it causes a memory leak, specifically calling binding.setDataAsnyc which never releases the data written.( The leak is in the internet explorer process running the addin within excel( it is a 32-bit version one)).
Example :
//1 Miliion row by 10 columns data parsed from a csv usually in chunks
var data = [];
var i,j;
for (i=0; i<1000000; i++) {
row = [];
for(j=0; j<10; j++) {
row.push("data" + i + "" + j);
}
data.push(row);
}
var limit = 10000;
var next = function (step) {
var columnLetter = getExcelColumnName(data[0].length);
var startRow = step * limit + 1;
var endRow = start + limit;
if (data.length < startRow)
return;
var values = data.slice(startRow - 1, endRow - 1);
var range = "A" + startRow + ":" + columnLetter + "" + endRow;
Office.context.document.bindings.addFromNamedItemAsync(range,
Office.CoercionType.Matrix, { id: "binding" + step },
function (asyncResult) {
if (asyncResult.status == "failed") {
console.log('Error: ' + asyncResult.error.message);
return;
}
Office.select("bindings#binding" + step).setDataAsync(values,
{
coercionType: Office.CoercionType.Matrix
}, function (asyncResult) {
//Memory keeps Increasing on this callback
if (asyncResult.status == Office.AsyncResultStatus.Failed) {
console.log("Action failed with error: " + asyncResult.error.message);
return;
}
next(step++);
});
});
}
next(0);
I tried releasing each binding after the setDataAsync but still the memory persists. the only way to reclaim the memory is to reload the addin.
I tried the other way of assigning values to ranges:
range.values = values;
It doesn't leak but take 3 times as long as setDataAsync (approximately 210 seconds for 1M rows by 10 columns) while setDataAsync take about 70 seconds but of course leaks and consumes 1.1 GB of memory in that request.
I also tried table.rows.add(null, values); but that's got even worse performance.
I tested the same code without setdataAsync (calling next right away) and no memory leak occurs.
Did anybody else experience this?
Is there anyway around it to release that memory?
If not is there another way to fill large amount of data in Excel except these 3 methods that is also fast?

Adding a binding does increase memory consumption -- but just calling setDataAsync definitely should not (or at least, not by more than you'd expect, or by more than copy-pasting the data into the sheet manually would be).
One thing to clarify: when you say:
I also tried table.rows.add(null, values) but that's even worse performance.
I assume you mean doing the alternate Office 2016-wave-of-APIs syntax, using an Excel.run(function(context) { ... });? I can follow up regarding perf, but I'm curious if the memory leak happens when you use the new API syntax as well.
FWIW, there is an API coming in the near-ish future that will allow you to suspend calculation while setting values -- that should increase performance. But I do want to see if we can figure out the two issues you're pointing out: the memory leak on setDataAsync and the range.values call.
If you can begin by answering the question above (whether the same leak occurs even in range.values), it will help narrow down the issue, and then I'll follow up with the team.
Thanks!

I found out there is a boolean window.Excel._RedirectV1APIs that you can set to true that make setDataAsync use the new API. It should be set automatically if
window.Office.context.requirements.isSetSupported("RedirectV1Api")
return true but somehow that requirement is not set but it works if you set manually
window.Excel._RedirectV1APIs = true
yet it is still much slower than native setDataAsync but slightly faster than the range.values manual approach (160s vs 180s for 1M rows by 10 cols) and don't cause memory leaks.
Also, I found out that the memory leak happens after window.external.Execute(id, params, callback) function with id: 71 (setDataAsnyc) and on calling the callback from that function so it seems the memory leak somehow happens in/because of external code in Excel itself (although the memory leak is in the internet explorer process though). If you short circuit and just call the callback directly instead of window.external.Execute no memory leak happens, but of course, no data is set either.

why it's slowly when I parse a message of Google protocol buffer in multi-thread?

I try to parse many Google protocol buffer messages from a binary file generated by calling SerializeToString. I first load all Bytes into a heap memory by calling new function. I also have two arrays to store the Bytes begin address of a message in the heap memory and the Bytes count of the message.
Then I begin to parse message by calling ParseFromString.I want to quicken the procedure by using multi-thread.
In each thread, I pass the start index and end index of address array and Byte count array.
In parent process. the main code is:
struct ParsePara
{
char* str_buffer;
size_t* buffer_offset;
size_t* binary_string_length_array;
size_t start_idx;
size_t end_idx;
Flight_Ticket_Info* ticket_info_buffer_array;
};
//Flight_Ticket_Info is class of message
//offset_size is the count of message
ticket_array = new Flight_Ticket_Info[offset_size];
const int max_thread_count = 6;
pthread_t pthread_id_vec[max_thread_count];
CTimer thread_cost;
thread_cost.start();
vector<ParsePara*> para_vec;
const size_t each_count = ceil(float(offset_size) / max_thread_count);
for (size_t k = 0;k < max_thread_count;k++)
{
size_t start_idx = each_count * k;
size_t end_idx = each_count * (k+1);
if (start_idx >= offset_size)
break;
if (end_idx >= offset_size)
end_idx = offset_size;
ParsePara* cand_para_ptr = new ParsePara();
if (!cand_para_ptr)
{
_ERROR_EXIT(0,"[Malloc memory fail.]");
}
cand_para_ptr->str_buffer = m_valdata;//heap memory for storing Bytes of message
cand_para_ptr->buffer_offset = offset_array;//begin address of each message
cand_para_ptr->start_idx = start_idx;
cand_para_ptr->end_idx = end_idx;
cand_para_ptr->ticket_info_buffer_array = ticket_array;//array to store message
cand_para_ptr->binary_string_length_array = binary_length_array;//Bytes count of each message
para_vec.push_back(cand_para_ptr);
}
for(size_t k = 0 ;k < para_vec.size();k++)
{
int ret = pthread_create(&pthread_id_vec[k],NULL,parserFlightTicketForMultiThread,para_vec[k]);
if (0 != ret)
{
_ERROR_EXIT(0,"[Error] [create thread fail]");
}
}
for (size_t k = 0;k < para_vec.size();k++)
{
pthread_join(pthread_id_vec[k],NULL);
}
In each thread the thread function is:
void* parserFlightTicketForMultiThread(void* void_para_ptr)
{
ParsePara* para_ptr = (ParsePara*) void_para_ptr;
parserFlightTicketForMany(para_ptr->str_buffer,para_ptr->ticket_info_buffer_array,para_ptr->buffer_offset,
para_ptr->start_idx,para_ptr->end_idx,para_ptr->binary_string_length_array);
}
void parserFlightTicketForMany(const char* str_buffer,Flight_Ticket_Info* ticket_info_buffer_array,
size_t* buffer_offset,const size_t start_idx,const size_t end_idx,size_t* binary_string_length_array)
{
printf("start_idx:%d,end_idx:%d\n",start_idx,end_idx);
for (size_t k = start_idx;k < end_idx;k++)
{
if (k % 100000 == 0)
cout << k << endl;
size_t cand_offset = buffer_offset[k];
size_t binary_length = binary_string_length_array[k];
ticket_info_buffer_array[k].ParseFromString(string(&str_buffer[cand_offset],binary_length-1));
}
printf("done %ld %ld\n",start_idx,end_idx);
}
But multi-thread cost is more than one thread.
one thread cost is:40455623ms
My computer is 8 core and six thread cost is:131586865ms
Anyone can help me? thank you!

Some possible problems -- you'll have to experiment to determine which:
Protobuf parsing speed is often limited by memory bandwidth rather than CPU time, especially with a large input data set. In that case, more threads won't help, since all the cores are sharing bandwidth to main memory. Indeed, having multiple cores fighting over memory bandwidth could make the overall operation slower. Note that the biggest consumer of memory is not the input bytes but rather the parsed data objects -- that is, the output of parsing -- which are many times larger than the encoded data. To improve this problem, consider writing the parsing loop so that it fully-processes each message immediately after parsing, before moving on to the text message. That way, instead of allocating k protobuf objects, you only need to allocate one protobuf object per thread, and repeatedly reuse the same object for parsing. This way the object will (probably) stay in the core's private L1 cache and avoid consuming memory bandwidth; only the input bytes will be read over the main bus.
How are you loading data into RAM? Did you read() into a large array or did you mmap()? In the latter case the data is read from disk lazily -- it won't happen until you actually attempt to parse it. Even in the read() case, it could be that the data has been swapped out, creating similar effects. Either way, your threads are now not just fighting for memory bandwidth, but disk bandwidth, which is of course much slower. Having six threads reading separate parts of a big file will definitely be slower overall than having one thread read the whole file, because the operating system optimizes for sequential access.
Protobuf allocates memory during parsing. Many memory allocators take a lock while allocating new memory. Since all your threads are allocating tons and tons of objects in a tight loop, they will contend for this lock. Make sure you are using a thread-friendly memory allocator, such as Google's tcmalloc. Note that repeatedly reusing the same protobuf object in a parse-consume loop rather than allocating lots of different objects will also help immensely here, because the protobuf object will automatically reuse memory for sub-objects.
There may be a bug in your code and it might not be doing what you expect at all when multithreaded. For example, a bug might be causing all the threads to process the same data, rather than different data, and it could be that the data they're choosing happens to be bigger. Make sure you are testing that the results of your code are exactly the same when you run single-threaded vs. multi-threaded.
In short, if you want multiple cores to make your code faster, you have to think about not just what each core is doing, but what data is going in and out of each core, and how much the cores have to talk to each other. Ideally you want each core to operate all on its own without talking to anyone or anything; then you get maximum parallelism. That's not usually possible, of course, but the closer you can get to that, the better.
BTW, a random optimization for you:
ParseFromString(string(&str_buffer[cand_offset],binary_length-1))
Replace that with:
ParseFromArray(&str_buffer[cand_offset],binary_length-1)
Creating at std::string makes a copy of the data, which wastes time (and memory bandwidth). (This doesn't explain why threading is slow, though.)

Address certain core for threads in Perl

I have a list of 40 files, which I want to modify through my script.
Since every file processed in the same way, I want to use Threads to speed it up.
Therefore I have this construct :
my $threads_ = sub
{
while (defined(my $taskRef = $q->dequeue()))
{
my $work= shift(#{$workRef});
&{\&{$work}}(#{$workRef});
my $open= $q->open() - 1;
}
};
my #Working;
for( my $i = 1; $i < 8; $i++)
{
push #Working, threads->new($threads_);
}
And I have this code for starting a thread for every file
foreach my $File (#Filelist)
{
$q->enqueue(['mySub',$FirstVar,$SecondVar]);
}
But it still takes way to long time.
My question is, is there a certain way to assign each thread to a single Core, in order to speed it up?

I'd use Parallel::ForkManager for something like this; it works great. I'd recommend not brewing your own when an accepted standard solution exists. By "address certain core", I take it to mean your purpose is to limit the number of concurrent tasks to the number of available processors and ForkManager will do this for you -- just set the max number of processes when you initialize your ForkManager object.
The commenters above were absolutely correct to point out that I/O will eventually limit your throughput, but it's easy enough to determine when adding more processes fails to speed things up.

why does a a nodejs array shift/push loop run 1000x slower above array length 87369?

Why is the speed of nodejs array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in .05 seconds; the latter, in 80 seconds -- 1600 times slower. (observed on 32-bit debian linux with node v0.10.29)
q = [];
// preload the queue with some data
for (i=0; i<87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i=0; i<100000; i++) {
q.shift();
q.push({});
if (i%10000 === 0) process.stdout.write(".");
}
64-bit debian linux v0.10.29 crawls starting at 85984 and runs in .06 / 56 seconds. Node v0.11.13 has similar breakpoints, but at different array sizes.

Shift is a very slow operation for arrays as you need to move all the elements but V8 is able to use a trick to perform it fast when the array contents fit in a page (1mb).
Empty arrays start with 4 slots and as you keep pushing, it will resize the array using formula 1.5 * (old length + 1) + 16.
var j = 4;
while (j < 87369) {
j = (j + 1) + Math.floor(j / 2) + 16
console.log(j);
}
Prints:
23
51
93
156
251
393
606
926
1406
2126
3206
4826
7256
10901
16368
24569
36870
55322
83000
124517
So your array size ends up actually being 124517 items which makes it too large.
You can actually preallocate your array just to the right size and it should be able to fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (i=0; i<87369; i++) q[i] = {};
If you need larger than that, use the right data structure

I started digging into the v8 sources, but I still don't understand it.
I instrumented deps/v8/src/builtins.cc:MoveElemens (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only 1000 shifts per second because each one takes 1ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
of which the memmove is 0.000040 seconds, the bulk is the heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
if (!InNewSpace(address)) {
for (int i = 0; i < len; i++) {
store_buffer_.Mark(address + start + i * kPointerSize);
}
}
}
which is (store-buffer-inl.h)
void StoreBuffer::Mark(Address addr) {
ASSERT(!heap_->cell_space()->Contains(addr));
ASSERT(!heap_->code_space()->Contains(addr));
Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
*top++ = addr;
heap_->public_set_store_buffer_top(top);
if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
ASSERT(top == limit_);
Compact();
} else {
ASSERT(top < limit_);
}
}
when the code is running slow, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements isn't called until a handful of times at the end, and just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it's not falling in place for me yet.
Edit: forget that last edit about output buffering artifacts, I was filtering duplicates.

this bug had been reported to google, who closed it without studying the issue.
https://code.google.com/p/v8/issues/detail?id=3059
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
the response:
he GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
own.
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string