When using async functions from the shared Office.js API in Excel 2016, a memory leak occurs; specifically, calling binding.setDataAsync never releases the data written. (The leak is in the 32-bit Internet Explorer process that runs the add-in inside Excel.)
Example:
// 1 million rows by 10 columns of data, parsed from a CSV (usually in chunks)
var data = [];
var i, j, row;
for (i = 0; i < 1000000; i++) {
    row = [];
    for (j = 0; j < 10; j++) {
        row.push("data" + i + "" + j);
    }
    data.push(row);
}
var limit = 10000;
var next = function (step) {
    var columnLetter = getExcelColumnName(data[0].length);
    var startRow = step * limit + 1;
    var endRow = startRow + limit - 1;
    if (data.length < startRow)
        return;
    var values = data.slice(startRow - 1, endRow);
    var range = "A" + startRow + ":" + columnLetter + endRow;
    Office.context.document.bindings.addFromNamedItemAsync(range,
        Office.BindingType.Matrix, { id: "binding" + step },
        function (asyncResult) {
            if (asyncResult.status == Office.AsyncResultStatus.Failed) {
                console.log('Error: ' + asyncResult.error.message);
                return;
            }
            Office.select("bindings#binding" + step).setDataAsync(values,
                {
                    coercionType: Office.CoercionType.Matrix
                }, function (asyncResult) {
                    // Memory keeps increasing in this callback
                    if (asyncResult.status == Office.AsyncResultStatus.Failed) {
                        console.log("Action failed with error: " + asyncResult.error.message);
                        return;
                    }
                    next(step + 1); // note: step++ here would pass the old value and re-run the same chunk forever
                });
        });
};
next(0);
I tried releasing each binding after setDataAsync, but the memory persists; the only way to reclaim it is to reload the add-in.
I also tried the other approach of assigning values to ranges:
range.values = values;
It doesn't leak, but it takes about three times as long as setDataAsync (approximately 210 seconds for 1M rows by 10 columns, versus about 70 seconds for setDataAsync, which of course leaks and consumes 1.1 GB of memory for that request).
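For reference, my range.values test looked roughly like this (a sketch, not the exact code: the active-worksheet choice and chunk size are illustrative, and getExcelColumnName is the same helper as above):

function writeChunk(step) {
    var startRow = step * limit + 1;
    if (startRow > data.length) return OfficeExtension.Promise.resolve();
    return Excel.run(function (context) {
        var endRow = Math.min(startRow + limit - 1, data.length);
        var values = data.slice(startRow - 1, endRow);
        var address = "A" + startRow + ":" + getExcelColumnName(data[0].length) + endRow;
        context.workbook.worksheets.getActiveWorksheet().getRange(address).values = values;
        return context.sync();
    }).then(function () {
        return writeChunk(step + 1); // next chunk only after the previous sync completes
    });
}
writeChunk(0);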
I also tried table.rows.add(null, values); but that has even worse performance.
I tested the same code without setDataAsync (calling next right away) and no memory leak occurs.
Has anybody else experienced this?
Is there any way to release that memory?
If not, is there another way, besides these three methods, to fill a large amount of data into Excel that is also fast?
Adding a binding does increase memory consumption, but just calling setDataAsync definitely should not (or at least, not by more than you'd expect, or than copy-pasting the data into the sheet manually would).
One thing to clarify: when you say:
I also tried table.rows.add(null, values) but that's even worse performance.
I assume you mean using the alternate Office-2016-wave-of-APIs syntax, i.e., an Excel.run(function (context) { ... }) call? I can follow up regarding perf, but I'm curious whether the memory leak happens when you use the new API syntax as well.
FWIW, there is an API coming in the near-ish future that will allow you to suspend calculation while setting values; that should increase performance. But I do want to see if we can figure out the two issues you're pointing out: the memory leak on setDataAsync and the slow range.values call.
If you can begin by answering the question above (whether the same leak occurs even in range.values), it will help narrow down the issue, and then I'll follow up with the team.
Thanks!
I found out there is a boolean, window.Excel._RedirectV1APIs, that you can set to true to make setDataAsync use the new API. It should be set automatically if
window.Office.context.requirements.isSetSupported("RedirectV1Api")
returns true, but somehow that requirement is not set; however, it works if you set the flag manually:
window.Excel._RedirectV1APIs = true;
This is still much slower than the native setDataAsync, but slightly faster than the manual range.values approach (160 s vs. 180 s for 1M rows by 10 columns), and it doesn't cause memory leaks.
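Putting the two together, the workaround looks like this (an undocumented internal flag, so treat it as fragile):

if (!Office.context.requirements.isSetSupported("RedirectV1Api")) {
    // Force the V1 shared APIs (including setDataAsync) onto the new API path
    window.Excel._RedirectV1APIs = true;
}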
Also, I found out that the memory leak happens after the window.external.Execute(id, params, callback) function with id: 71 (setDataAsync), upon calling the callback from that function. So it seems the memory leak somehow happens in, or because of, external code in Excel itself (although the leak shows up in the Internet Explorer process). If you short-circuit and just call the callback directly instead of window.external.Execute, no memory leak happens, but of course no data is set either.
I tried a problem on HackerEarth: https://www.hackerearth.com/practice/data-structures/stacks/basics-of-stacks/practice-problems/algorithm/fight-for-laddus/description/
The speed seems fine; however, the memory usage exceeds the 256 MB limit by nearly 2.8 times.
In Java and Python, the memory usage is about five times lower, although the runtime is nearly twice as long.
What can be done to optimize the memory usage of the Node.js implementation?
Here is the Node.js implementation:
// Sample code to perform I/O:
process.stdin.resume();
process.stdin.setEncoding("utf-8");
var stdin_input = "";
process.stdin.on("data", function (input) {
stdin_input += input; // Reading input from STDIN
});
process.stdin.on("end", function () {
main(stdin_input);
});
function main(input) {
    let arr = input.split("\n");
    let testCases = parseInt(arr[0], 10);
    arr.splice(0, 1);
    let finalStr = "";
    while (testCases > 0) {
        let inputArray = arr[arr.length - testCases * 2 + 1].split(" ");
        let inputArrayLength = inputArray.length;
        testCases = testCases - 1;
        let frequencyObject = {};
        for (let i = 0; i < inputArrayLength; ++i) {
            if (!frequencyObject[inputArray[i]]) {
                frequencyObject[inputArray[i]] = 0;
            }
            ++frequencyObject[inputArray[i]];
        }
        let finalArray = [];
        finalArray[inputArrayLength - 1] = -1;
        let stack = [];
        stack.push(inputArrayLength - 1);
        for (let i = inputArrayLength - 2; i >= 0; i--) {
            let stackLength = stack.length;
            while (stackLength > 0 && frequencyObject[inputArray[stack[stackLength - 1]]] <= frequencyObject[inputArray[i]]) {
                stack.pop();
                stackLength--;
            }
            if (stackLength > 0) {
                finalArray[i] = inputArray[stack[stackLength - 1]];
            } else {
                finalArray[i] = -1;
            }
            stack.push(i);
        }
        finalStr = finalStr + finalArray.join(" ") + "\n";
    }
    console.log(finalStr);
}
What can be done to optimize the memory usage of the Node.js implementation?
Here are some things to consider:
Don't buffer any more input data than you need to before you process it or output it.
Try to avoid making copies of data. Use the data in place if possible. Remember that all string operations create a new string that is likely a copy of the original data. And, many array operations like .map(), .filter(), etc... create new copies of the original array.
Keep in mind that garbage collection is delayed and is typically done during idle time. So, for example, modifying strings in a loop may create a lot of temporary objects that all must exist at once, even though most or all of them will be garbage collected when the loop is done. This leads to high peak memory usage.
Buffering
The first thing I notice is that you read the entire input file into memory before you process any of it. Right away for large input files, you're going to use a lot of memory. Instead, what you want to do is read enough of a chunk to get the next testCase and then process it.
FYI, this incremental reading/processing will make the code significantly more complicated to write (I've written an implementation myself) because you have to handle partially read lines, but it will hold down memory use a bunch and that's what you asked for.
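Here is a minimal sketch of that incremental approach using Node's built-in readline module. Treat processTestCase as a placeholder for the per-test-case logic from main(); a real version would also need to handle malformed input:

const readline = require("readline");

const rl = readline.createInterface({ input: process.stdin, terminal: false });

let lineIndex = 0;
rl.on("line", (line) => {
    // Line 0 is the test-case count; after that, every second line
    // holds the space-separated values for one test case.
    if (lineIndex > 0 && lineIndex % 2 === 0) {
        processTestCase(line.split(" ")); // placeholder for the per-case logic
    }
    lineIndex++;
});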
Copies of Data
After reading the entire input file into memory, you immediately make a copy of it all with this:
let arr = input.split("\n");
So, now you've more than doubled the amount of memory the input data is taking up. Instead of just one string for all the input, you now still have all of that in memory, but you've now broken it up into hundreds of other strings (each with a little overhead of its own for a new string and of course a copy of each line).
Modifying Strings in a Loop
When you're creating your final result, which you call finalStr, you're doing this over and over again:
finalStr = finalStr + finalArray.join(" ") + "\n"
This is going to create tons and tons of incremental strings that will likely all end up in memory at once, because garbage collection probably won't run until the loop is over. As an example, if you had 100 lines of output that were each 100 characters long, so the total output (not counting line-termination characters) was 100 x 100 = 10,000 characters, then constructing it in a loop like this would create temporary strings of length 100, 200, 300, 400, ... 10,000, which would consume 5,000 (average length) * 100 (number of temporary strings) = 500,000 characters. That's 50x the total output size consumed in temporary string objects.
So, not only does this create tons of incremental strings each one larger than the previous one (since you're adding onto it), it also creates your entire output in memory before writing any of it out to stdout.
Instead, you can incrementally output each line to stdout as you construct it. This will put the worst-case memory usage at probably about 2x the output size, whereas you're at 50x or worse.
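In this code, that means writing each test case's result as soon as it is built instead of accumulating it:

// instead of: finalStr = finalStr + finalArray.join(" ") + "\n";
process.stdout.write(finalArray.join(" ") + "\n");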
My question is about performance in my NodeJS app...
If my program runs 12 iterations of 1,250,000 each = 15,000,000 iterations altogether, it takes dedicated servers at Amazon the following times to process:
r3.large: 2 vCPU, 6.5 ECU, 15 GB memory --> 123 minutes
4.8xlarge: 36 vCPU, 132 ECU, 60 GB memory --> 102 minutes
I have some code similar to the code below...
start();

function start() {
    for (var i = 0; i < 12; i++) {
        // Iterates over a collection of data split up into date intervals.
        // This function is actually also recursive, since it runs through
        // the data many times (max 50-100) for different interval sizes...
        function2();
    }
}

function function2() {
    return new Promise(function (resolve) {
        for (var i = 0; i < 1250000; i++) {
            // Iterates through all possible combinations and calls
            // function3 with each given set of values/combinations
            function3();
        }
        resolve();
    });
}

function function3() {
    return new Promise(function (resolve) {
        // Makes some calculations based on the given values/combination,
        // and then returns the result to function2, which in the end
        // decides which result/combination was the best...
        resolve();
    });
}
This is equal to 0.411 milliseconds / 411 microseconds per iteration!
When I look at performance and memory usage in the Task Manager, the CPU is not running at 100% but more like 50%, the entire time.
The memory usage starts very low but keeps growing by gigabytes every minute until the process is done. However, the allocated memory is only released when I press CTRL+C in the Windows CMD, so it's as if the Node.js garbage collection doesn't work optimally, or maybe it's simply the design of the code again...
When I execute the app, I use the memory option like this:
node --max-old-space-size="50000" server.js
PLEASE tell me everything you think I can do to make my program FASTER!
Thank you all - so much!
It's not that the garbage collector doesn't work optimally but that it doesn't work at all - you don't give it any chance to.
When developing the tco module, which does tail call optimization in Node, I noticed a strange thing: it seemed to leak memory and I didn't know why. It turned out that it was caused by a few console.log() calls in various places that I had used for testing. Since seeing the result of a recursive call millions of levels deep took some time, I wanted to see something while it was working.
Your example is pretty similar to that.
Remember that Node is single-threaded. When your computations run, nothing else can run, including the GC. Your code is completely synchronous and blocking; even though it's generating millions of promises, it is blocking because it never reaches the event loop.
Consider this example:
var a = 0, b = 10000000;
function numbers() {
    while (a < b) {
        console.log("Number " + a++);
    }
}
numbers();
It's pretty simple: you want to print 10 million numbers. But when you run it, it behaves very strangely. For example, it prints numbers up to some point, then it stops for several seconds, then it keeps going, or maybe starts thrashing if you're using swap, or maybe gives you this error that I just got right after seeing Number 8486:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
What's going on here is that the main thread is blocked in a synchronous loop where it keeps creating objects but the GC has no chance to release them.
For such long running tasks you need to divide your work and get into the event loop once in a while.
Here is how you can fix this problem:
var a = 0, b = 10000000;
function numbers() {
    var i = 0;
    while (a < b && i++ < 100) {
        console.log("Number " + a++);
    }
    if (a < b) setImmediate(numbers);
}
numbers();
It does the same thing: it prints the numbers from a to b, but in batches of 100, and then schedules itself to continue at the end of the event loop.
Output of $(which time) -v node numbers1.js 2>&1 | egrep 'Maximum resident|FATAL'
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Maximum resident set size (kbytes): 1495968
It used 1.5GB of memory and crashed.
Output of $(which time) -v node numbers2.js 2>&1 | egrep 'Maximum resident|FATAL'
Maximum resident set size (kbytes): 56404
It used 56MB of memory and finished.
See also these answers:
How to write non-blocking async function in Express request handler
How node.js server serve next request, if current request have huge computation?
Maximum call stack size exceeded in nodejs
Node; Q Promise delay
How to avoid jimp blocking the code node.js
Why is the speed of nodejs array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in 0.05 seconds; the latter, in 80 seconds: 1600 times slower. (Observed on 32-bit Debian Linux with Node v0.10.29.)
var i, q = [];
// preload the queue with some data
for (i = 0; i < 87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i = 0; i < 100000; i++) {
    q.shift();
    q.push({});
    if (i % 10000 === 0) process.stdout.write(".");
}
64-bit Debian Linux with Node v0.10.29 crawls starting at 85984 elements and runs in 0.06 / 56 seconds. Node v0.11.13 has similar breakpoints, but at different array sizes.
Shift is a very slow operation for arrays, as you need to move all the elements, but V8 is able to use a trick to perform it fast when the array contents fit in a page (1 MB).
Empty arrays start with 4 slots, and as you keep pushing, V8 resizes the array using the formula 1.5 * (old length + 1) + 16.
var j = 4;
while (j < 87369) {
    j = (j + 1) + Math.floor(j / 2) + 16;
    console.log(j);
}
Prints:
23
51
93
156
251
393
606
926
1406
2126
3206
4826
7256
10901
16368
24569
36870
55322
83000
124517
So your array's capacity actually ends up being 124517 items, which makes it too large to fit in a page.
You can preallocate your array to exactly the right size, and it will be able to use the fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (var i = 0; i < 87369; i++) q[i] = {};
If you need a larger queue than that, use the right data structure for the job.
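For example, a simple ring buffer gives O(1) push and shift at any size. This is just a sketch (fixed capacity, minimal error handling):

function Queue(capacity) {
    this.buf = new Array(capacity);
    this.head = 0; // index of the oldest element
    this.size = 0; // number of elements currently stored
}
Queue.prototype.push = function (item) {
    if (this.size === this.buf.length) throw new Error("queue full");
    this.buf[(this.head + this.size++) % this.buf.length] = item;
};
Queue.prototype.shift = function () {
    if (this.size === 0) return undefined;
    var item = this.buf[this.head];
    this.buf[this.head] = undefined; // drop the reference so it can be GCed
    this.head = (this.head + 1) % this.buf.length;
    this.size--;
    return item;
};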
I started digging into the v8 sources, but I still don't understand it.
I instrumented deps/v8/src/builtins.cc:MoveElements (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only 1000 shifts per second, because each one takes 1 ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
Of that, the memmove takes 0.000040 seconds; the bulk is heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
  if (!InNewSpace(address)) {
    for (int i = 0; i < len; i++) {
      store_buffer_.Mark(address + start + i * kPointerSize);
    }
  }
}
which calls (store-buffer-inl.h):
void StoreBuffer::Mark(Address addr) {
  ASSERT(!heap_->cell_space()->Contains(addr));
  ASSERT(!heap_->code_space()->Contains(addr));
  Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
  *top++ = addr;
  heap_->public_set_store_buffer_top(top);
  if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
    ASSERT(top == limit_);
    Compact();
  } else {
    ASSERT(top < limit_);
  }
}
When the code is running slow, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements isn't called except a handful of times at the end, with just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it hasn't fallen into place for me yet.
Edit: forget that last edit about output buffering artifacts; I was filtering duplicates.
This bug had been reported to Google, who closed it without studying the issue:
https://code.google.com/p/v8/issues/detail?id=3059
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
The response:
The GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
own.
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?
I have an application which retrieves and caches the results of a client's query and sends the results out to the client from the cache.
I have a limit on the number of items which may be cached at any one time, and keeping track of this limit has drastically reduced the application's performance when processing a large number of concurrent requests. Is there a better way to solve this problem without locking so often, which might improve performance?
Edit: I've gone with the CAS approach and it seems to work pretty well.
First, rather than using a lock, use atomic decrements and compare-and-exchange to manipulate your counter. The syntax for this varies with your compiler; in GCC you might do something like:
long remaining_cache_slots;

void release() {
    __sync_add_and_fetch(&remaining_cache_slots, 1);
}

// Returns false if we've hit our cache limit
bool acquire() {
    long prev_value, new_value;
    do {
        prev_value = remaining_cache_slots;
        if (prev_value <= 0) return false;
        new_value = prev_value - 1;
    } while (!__sync_bool_compare_and_swap(&remaining_cache_slots, prev_value, new_value));
    return true;
}
This should help reduce the window for contention. However, you'll still be bouncing that cache line all over the place, which at a high request rate can severely hurt your performance.
If you're willing to accept a certain amount of waste (i.e., allowing the number of cached results, or rather pending responses, to go slightly below the limit), you have some other options. One is to make the cache thread-local (if possible in your design). Another is to have each thread reserve a pool of 'cache tokens' to use.
What I mean by reserving a pool of cache tokens is that each thread can reserve ahead of time the right to insert N entries into the cache. When that thread removes an entry from the cache it adds it to its set of tokens; if it runs out of tokens, it tries to get them from a global pool, and if it has too many, it puts some of them back. The code might look a bit like this:
#include <algorithm>

long global_token_pool;
__thread long thread_local_token_pool = 0;

// Release 10 tokens to the global pool when we go over 20
// The maximum waste for this scheme is 20 * nthreads
#define THREAD_TOKEN_POOL_HIGHWATER 20
#define THREAD_TOKEN_POOL_RELEASECT 10

// If we run out, acquire 5 tokens from the global pool
#define THREAD_TOKEN_POOL_ACQUIRECT 5

void release() {
    thread_local_token_pool++;
    if (thread_local_token_pool > THREAD_TOKEN_POOL_HIGHWATER) {
        thread_local_token_pool -= THREAD_TOKEN_POOL_RELEASECT;
        __sync_fetch_and_add(&global_token_pool, THREAD_TOKEN_POOL_RELEASECT);
    }
}

bool acquire() {
    if (thread_local_token_pool > 0) {
        thread_local_token_pool--;
        return true;
    }
    long prev_val, new_val, acquired;
    do {
        prev_val = global_token_pool;
        acquired = std::min((long)THREAD_TOKEN_POOL_ACQUIRECT, prev_val);
        if (acquired <= 0) return false;
        new_val = prev_val - acquired;
    } while (!__sync_bool_compare_and_swap(&global_token_pool, prev_val, new_val));
    thread_local_token_pool = acquired - 1;
    return true;
}
Batching up requests like this reduces the frequency at which threads access shared data, and thus the amount of contention and cache churn. However, as mentioned before, it makes your limit a bit less precise, and so requires careful tuning to get the right balance.
In SendResults, try updating totalResultsCached only once after you process the results. This will minimize the time spent acquiring/releasing the lock.
void SendResults(int resultsToSend, Request *request)
{
    for (int i = 0; i < resultsToSend; ++i)
    {
        send(request->remove());
    }
    lock totalResultsCached
    totalResultsCached -= resultsToSend;
    unlock totalResultsCached
}
If resultsToSend is typically 1, then my suggestion will not make much of a difference.
Also, after hitting the cache limit, some extra requests may be dropped in ResultCallback, because SendResults is not updating totalResultsCached immediately after sending each request.
It seems that I have a problem with Linux IO performance. For a project, I need to clear a whole file from kernel space. I use the following code pattern:
for_each_mapping_page(mapping, index) {
    page = read_mapping_page(mapping, index);
    lock_page(page);
    { kmap // memset // kunmap }
    set_page_dirty(page);
    write_one_page(page, 1);
    page_cache_release(page);
    cond_resched();
}
Everything works fine, but with large files (~3 GB+ for me) I see that my system stalls in a strange manner: while this operation has not completed, I can't run anything. In other words, all the processes that existed before this operation run fine, but if I try to start something new during the operation, I see nothing until it has completed.
Is it a kernel IO scheduling issue, or maybe I missed something? And how can I fix this problem?
Thanks.
UPD:
Following Kristof's suggestion, I've reworked my code and now it looks like this:
headIndex = soff >> PAGE_CACHE_SHIFT;
tailIndex = eoff >> PAGE_CACHE_SHIFT;

/**
 * doing the exact #headIndex .. #tailIndex range
 */

for (index = headIndex; index < tailIndex; index += nr_pages) {
    nr_pages = min_t(int, ARRAY_SIZE(pages), tailIndex - index);

    for (i = 0; i < nr_pages; i++) {
        pages[i] = read_mapping_page(mapping, index + i, NULL);
        if (IS_ERR(pages[i])) {
            while (i--)
                page_cache_release(pages[i]);
            goto return_result;
        }
    }

    for (i = 0; i < nr_pages; i++)
        zero_page_atomic(pages[i]);

    result = filemap_write_and_wait_range(mapping, index << PAGE_CACHE_SHIFT,
                                          ((index + nr_pages) << PAGE_CACHE_SHIFT) - 1);

    for (i = 0; i < nr_pages; i++)
        page_cache_release(pages[i]);

    if (result)
        goto return_result;

    if (fatal_signal_pending(current))
        goto return_result;

    cond_resched();
}
As a result I've got better IO performance, but I still have problems with huge IO activity when the same user that triggered the operation does concurrent disk access.
Anyway, thanks for the suggestions.
In essence you're bypassing the kernel's IO scheduler completely.
If you look at the ext2 implementation you'll see it never (well, OK, once) calls write_one_page(). For large-scale data transfers it uses mpage_writepages() instead.
That function uses the Block I/O interface rather than immediately accessing the hardware, which means it passes through the IO scheduler. Large operations will not block the entire system, as the scheduler automatically ensures that other operations are interleaved with the large writes.