I'm building a Node.js app using Express 4 + Sequelize + a PostgreSQL database.
I'm using Node v8.11.3.
I wrote a script to load data into my database from a JSON file. I tested the script with a sample of ~30 entities to seed, and it works perfectly.
In reality, the complete JSON file holds around 100,000 entities. My script reads the JSON file and tries to populate the database asynchronously (i.e. all 100,000 entities at the same time).
The result, after a few minutes, is:
<--- Last few GCs --->
[10488:0000018619050A20] 134711 ms: Mark-sweep 1391.6 (1599.7) -> 1391.6 (1599.7) MB, 1082.3 / 0.0 ms allocation failure GC in old space requested
[10488:0000018619050A20] 136039 ms: Mark-sweep 1391.6 (1599.7) -> 1391.5 (1543.7) MB, 1326.9 / 0.0 ms last resort GC in old space requested
[10488:0000018619050A20] 137351 ms: Mark-sweep 1391.5 (1543.7) -> 1391.5 (1520.2) MB, 1311.5 / 0.0 ms last resort GC in old space requested
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0000034170025879 <JSObject>
1: split(this=00000165BEC5DB99 <Very long string[1636]>)
2: attachExtraTrace [D:\Code\backend-lymo\node_modules\bluebird\js\release\debuggability.js:~775] [pc=0000021115C5728E](this=0000003CA90FF711 <CapturedTrace map = 0000033AD0FE9FB1>,error=000001D3EC5EFD59 <Error map = 00000275F61BA071>)
3: _attachExtraTrace(aka longStackTracesAttachExtraTrace) [D:\Code\backend-lymo\node_module...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node_module_register
2: v8::internal::FatalProcessOutOfMemory
3: v8::internal::FatalProcessOutOfMemory
4: v8::internal::Factory::NewFixedArray
5: v8::internal::HashTable<v8::internal::SeededNumberDictionary,v8::internal::SeededNumberDictionaryShape>::IsKey
6: v8::internal::HashTable<v8::internal::SeededNumberDictionary,v8::internal::SeededNumberDictionaryShape>::IsKey
7: v8::internal::StringTable::LookupString
8: v8::internal::StringTable::LookupString
9: v8::internal::RegExpImpl::Exec
10: v8::internal::interpreter::BytecodeArrayRandomIterator::UpdateOffsetFromIndex
11: 0000021115A043C1
In the end, some entities were created, but the process clearly crashed.
I understood that this error is due to memory.
My question is: why doesn't Node take the time to manage everything without overshooting memory? Is there a "queue" to limit such explosions?
I identified some workarounds:
Segment the seed into several JSON files
Give the process more memory with the --max_old_space_size=8192 option
Proceed sequentially (using sync calls)
but none of these solutions is satisfying to me. It makes me worry about the future of my app, which is supposed to handle occasionally long operations in production.
What do you think?
Node.js just does what you tell it. If you go into some big loop and start up a lot of database operations, then that's exactly what node.js attempts to do. If you start so many operations that you consume too many resources (memory, database resources, files, whatever), then you will run into trouble. Node.js does not manage that for you. It has to be your code that manages how many operations you keep in flight at the same time.
On the other hand, node.js is particularly good at having a bunch of asynchronous operations in flight at the same time, and you will generally get better end-to-end performance if you code it to have more than one operation going at a time. How many you want in flight at once depends entirely upon the specific code and exactly what the asynchronous operation is doing. If it's a database operation, it will likely depend upon the database and how many simultaneous requests it handles best.
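For illustration (not from the original answer), here is a minimal sketch of that idea: a fixed pool of workers pulls items from a shared index, so at most `limit` operations are ever in flight. The names mapConcurrent and Entity are hypothetical.

async function mapConcurrent(items, limit, fn) {
    let index = 0;
    const results = new Array(items.length);
    // each runner repeatedly claims the next unprocessed index
    async function runner() {
        while (index < items.length) {
            const i = index++;
            results[i] = await fn(items[i], i);
        }
    }
    // start `limit` runners (or fewer, if the array is smaller)
    await Promise.all(Array.from({ length: Math.min(limit, items.length) }, runner));
    return results;
}

// usage sketch: seed 100,000 entities with at most 10 inserts in flight
// (Entity stands for one of your Sequelize models)
// mapConcurrent(entities, 10, e => Entity.create(e))
//     .then(() => console.log("seed finished"))
//     .catch(err => console.error(err));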
Here are some references that give you ideas for ways to control how many operations are going at once, including some code examples:
Make several requests to an API that can only handle 20 request a minute
Promise.all consumes all my RAM
Javascript - how to control how many promises access network in parallel
Fire off 1,000,000 requests 100 at a time
Nodejs: Async request with a list of URL
Loop through an api get request with variable URL
Choose proper async method for batch processing for max requests/sec
If you showed your code, we could advise more specifically which technique might fit best for your situation.
Use async.eachOfLimit to run at most X operations at the same time:
var async = require("async");

var myBigArray = [];
var X = 10; // at most 10 operations at a time

async.eachOfLimit(myBigArray, X, function (element, index, callback) {
    // insert one element, then signal completion through the callback
    MyCollection.insert(element, function (err) {
        return callback(err);
    });
}, function (err) {
    // all finished
    if (err) {
        // handle the error
    } else {
        // proceed with the rest of the seed
    }
});
Related
I use node-influx, and influx.query(...) uses too much heap.
In my application I have something like:
const data = await influx.query(`SELECT "f_power" as "power", "nature", "scope", "participantId"
  FROM "power"
  WHERE
    "nature" =~ /^Conso$|^Prod$|^Autoconso$/ AND
    "source" =~ /^$|^One$|^Two$/ AND
    "participantId" =~ /${participantIdFilter.join('|')}/ AND
    "time" >= '${req.query.dateFrom}' AND
    "time" <= '${req.query.dateTo}'
  GROUP BY "timestamp"
`);
const wb = dataToWb(data);
XLSX.writeFile(wb, filename);
Data is a result set of about 50 MB (I used this code).
And the heap used by this method is about 350 MB (I used process.memoryUsage().heapUsed).
I'm surprised by the difference between these two values...
So is it possible to make this query less resource intensive?
Actually I use data to make an xlsx file, and the generation of this file leads to a node process out of memory. The method XLSX.writeFile(wb, filename) uses about 100 MB, which on its own is not enough to fill my 512 MB of RAM. So I figured it is the heap used by the influx query that is never collected by the GC.
Actually I don't understand why the file generation causes this error. Why can't V8 free the memory used by one method before another method runs, later and in another context?
The node-influx (1.x client) reads the whole response, parses it into JSON, and transforms it into results (the data array), so there are a lot of extra intermediate objects on the heap while the response is being processed. You should also run the node garbage collector before and after the query to get a better estimate of how much heap it takes. For now, you can only control the result memory usage by reducing the result columns or rows (by limit, time, or aggregations). You can also join queries with smaller results to reduce the maximum heap usage caused by temporary objects. Of course, that is paid for in time and in the complexity of your code.
You can expect less memory usage with the 2.x client (https://github.com/influxdata/influxdb-client-js). It does not create intermediate JSON objects, it internally processes results in smaller chunks, and it has an observable-like API that lets the user decide how to represent a result row. It uses Flux as its query language and requires InfluxDB 2.x or InfluxDB 1.8 (with the 2.x API enabled).
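As a rough illustration of that observable-like API (a sketch assuming the @influxdata/influxdb-client package; url, token, org, and fluxQuery are placeholders), rows can be processed one at a time instead of materializing the whole result:

const { InfluxDB } = require('@influxdata/influxdb-client');

const queryApi = new InfluxDB({ url, token }).getQueryApi(org);
queryApi.queryRows(fluxQuery, {
    next(row, tableMeta) {
        const record = tableMeta.toObject(row);
        // process (or stream to the xlsx writer) one row at a time
    },
    error(err) {
        console.error('query failed', err);
    },
    complete() {
        console.log('query finished');
    },
});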
I have a wordlist of 11-character entries that I want to append to a URL. After some modification in request.js, I am able to run a 5-million-entry wordlist in the RequestList array. It starts throwing a JavaScript heap memory error when I go higher, and my wordlist has billions of entries to process. I can generate my wordlist with JS code, and 5 million entries finish in an hour, thanks to the high server capacity I possess. RequestList is a static variable, so I can't add to it again. How can I run this indefinitely for billions of combinations? If any cron script can help, I am open to that too.
It would be better to use RequestQueue for such a large number of requests. The queue is persisted to disk as an SQLite database, so memory usage is not an issue.
I suggest adding, let's say, 1000 requests into the queue and immediately starting the crawl, while pushing more requests to the queue. Enqueueing tens of millions or billions of requests might take a long time, but you don't need to wait for that.
For best performance, use apify version 1.0.0 or higher.
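A minimal sketch of that approach with the apify 1.x SDK (the target URL pattern and the generateWords generator are hypothetical, not from the question):

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    // keep pushing combinations to the disk-backed queue;
    // generateWords() stands for your wordlist generator
    const producer = (async () => {
        for (const word of generateWords()) {
            await requestQueue.addRequest({ url: `https://example.com/${word}` });
        }
    })();

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            // fetch and process request.url here
        },
    });

    // crawling starts immediately; enqueueing continues in parallel
    await Promise.all([crawler.run(), producer]);
});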
I have a MySQL database which stores two years of OHLC data for thousands of stocks. The data is read from MySQL as pandas dataframes and then submitted to Celery in large batch jobs, which eventually leads to "OOM command not allowed when used memory > 'maxmemory'".
I have added the following Celery config options. These options have allowed my script to run longer, but Redis inevitably reaches its 2 GB memory limit and Celery throws OOM errors.
result_expires = 30
ignore_result = True
worker_max_tasks_per_child = 1000
From the Redis side I have tried playing with the maxmemory policy using both allkeys-lru and volatile-lru. Neither seems to make a difference.
When Celery hits the OOM error, the Redis CLI shows max memory usage and no keys?
# Memory
used_memory:2144982784
used_memory_human:2.00G
used_memory_rss:1630146560
used_memory_rss_human:1.52G
used_memory_peak:2149023792
used_memory_peak_human:2.00G
used_memory_peak_perc:99.81%
used_memory_overhead:2144785284
used_memory_startup:987472
used_memory_dataset:197500
used_memory_dataset_perc:0.01%
allocator_allocated:2144944880
allocator_active:1630108672
allocator_resident:1630108672
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:2147483648
maxmemory_human:2.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:0.76
allocator_frag_bytes:18446744073194715408
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:37888
mem_fragmentation_ratio:0.76
mem_fragmentation_bytes:-514798320
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:2143797684
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
And there are zero keys?
127.0.0.1:6379[1]> keys *
(empty list or set)
When I run this same code in subsets of 200*5 requests (then terminate), everything runs successfully. Redis memory usage caps around 100 MB, and when the Python process terminates all the memory usage drops as expected. This leads me to believe I could probably implement a handler to do 200*5 requests at a time, however I suspect that the Python process (my script) terminating is what is actually freeing memory in Celery/Redis...
I would like to avoid subsetting this and process everything from MySQL in one shot. About 5000 pandas dataframes * 5 tasks total.
I do not understand why the memory usage in Redis continues to grow when I am forgetting all results immediately after retrieving them.
Here is an example of how this is done in my code:
def getTaskResults(self, caller, task):
    # Wait for the results and then call back
    # Attach this Results object in the callback along with the data
    ready = False
    while not ready:
        if task.ready():
            ready = True
            data = pd.read_json(task.get())
            data.sort_values(by=['Date'], inplace=True)
            task.forget()
            return caller.resultsCB(data, self)
This is probably my ignorance of Redis, but if there are no keys, how is it consuming all that memory? How can I validate what is actually consuming that memory in Redis?
Since I store the task ID of every call to Celery in an object, I have confirmed that trying to do a task.get after adding in task.forget throws an error.
I have a node process that I use to add key-values to an object. When I get to about 9.88 million keys added, the process appears to hang. I assumed an out-of-memory issue, so I turned on trace_gc and also put a check in the code that adds the keys:
const { heapTotal, heapUsed } = process.memoryUsage()
// abort before the heap is (almost) completely full
if ((heapUsed / heapTotal) > 0.99) {
    throw new Error('Too much memory')
}
That condition was never met, and the error never thrown. As for the --trace_gc output, my last scavenge log line was:
[21544:0x104000000] 2153122 ms: Scavenge 830.0 (889.8) -> 814.3 (889.8) MB, 1.0 / 0.0 ms allocation failure
Mark-sweep, however, continues logging this:
[21544:0x104000000] 3472253 ms: Mark-sweep 1261.7 (1326.9) -> 813.4 (878.8) MB, 92.3 / 0.1 ms (+ 1880.1 ms in 986 steps since start of marking, biggest step 5.6 ms, walltime since start of marking 12649 ms) finalize incremental marking via task GC in old space requested
Is this output consistent with memory issues?
I should note that having to add this many keys to the object is an edge case; normally the range is more likely in the thousands. In addition, the keys are added during a streaming process, so I don't know how many will need to be added at the outset. So, in addition to trying to figure out what the specific problem is, I'm also looking for a way to determine that the problem is likely to occur before the process hangs.
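One possible explanation, offered as an aside rather than taken from the post: heapTotal is only the heap V8 has reserved so far and it grows along with usage, so heapUsed / heapTotal can stay below 0.99 even as the process approaches the real ceiling. A sketch of a check against the fixed heap_size_limit from v8.getHeapStatistics() instead (the 0.9 threshold is arbitrary):

const v8 = require('v8');

// heap_size_limit is the hard ceiling (set by --max_old_space_size),
// unlike heapTotal, which grows as V8 reserves more memory
const { heap_size_limit } = v8.getHeapStatistics();
const { heapUsed } = process.memoryUsage();
if (heapUsed / heap_size_limit > 0.9) {
    throw new Error('Too much memory');
}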
I have about ~300 MB of data (~180k JSON objects) that gets updated once every 2-3 days.
This data is divided into three "collections" that I must keep up to date.
I decided to go the Node.js way, but any solution in a language I know (Java, Python) is welcome.
Whenever I perform a batch set using the Node.js firebase-admin client, not only does it consume an aberrant amount of RAM (about 4-6 GB!), it also tends to crash with errors that have no clear reason (I got to page 4 of a Google search without a meaningful answer).
My code is frankly simple; this is it:
var collection = db.collection("items");
var batch = db.batch();

array.forEach(item => {
    var ref = collection.doc(item.id);
    batch.set(ref, item);
});

batch.commit().then((res) => {
    console.log("YAY", res);
});
I haven't found anywhere whether there is a limit on the number of writes in a limited span of time (I understand doing 50-60k writes should be easy for a backend the size of Firebase), nor why this can go off the RAM rails and end up with 4-6 GB of RAM allocated.
I can confirm that when the errors are thrown, or the RAM usage clogs my laptop, whichever happens first, I am still at less than 1-4% of my daily usage quotas, so that is not the issue.
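For what it's worth, Firestore batched writes have historically been capped at 500 operations per batch, so a single batch over ~180k items cannot work as written. A minimal sketch of chunked commits (the chunk size and the sequential awaiting are choices of this sketch, not from the post):

const CHUNK = 500; // Firestore's documented per-batch operation limit

async function setInChunks(db, array) {
    const collection = db.collection("items");
    for (let i = 0; i < array.length; i += CHUNK) {
        const batch = db.batch();
        for (const item of array.slice(i, i + CHUNK)) {
            batch.set(collection.doc(item.id), item);
        }
        // wait for each chunk before building the next one, so only
        // one chunk's worth of writes is held in memory at a time
        await batch.commit();
    }
}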