Memory usage of node-influx when querying - node.js

I use node-influx, and influx.query(...) uses too much heap.
In my application I have something like:
const data = await influx.query(`SELECT "f_power" as "power", "nature", "scope", "participantId"
  FROM "power"
  WHERE
    "nature" =~ /^Conso$|^Prod$|^Autoconso$/ AND
    "source" =~ /^$|^One$|^Two$/ AND
    "participantId" =~ /${participantIdFilter.join('|')}/ AND
    "time" >= '${req.query.dateFrom}' AND
    "time" <= '${req.query.dateTo}'
  GROUP BY "timestamp"
`);
const wb = dataToWb(data);
XLSX.writeFile(wb, filename);
The data is a result set of about 50 MB (I measured it with this code), and the heap used by this method is about 350 MB (measured with process.memoryUsage().heapUsed). I'm surprised by the difference between these two values...
So is it possible to make this query less resource-intensive?
Actually I use data to build an xlsx file, and the generation of this file leads to a node process out-of-memory error. The method XLSX.writeFile(wb, filename) uses about 100 MB, which on its own is not enough to fill my 512 MB of RAM. So I figure it is the heap used by the influx query that is never collected by the GC.
Actually I don't understand why the file generation triggers this error. Why can't V8 free the memory used by a method that has already finished executing in another context?

node-influx (the 1.x client) reads the whole response, parses it into JSON, and transforms it into the results (data) array, so there are a lot more intermediate objects on the heap while the response is being processed. You should also run the Node garbage collector before and after the query to get a better estimate of how much heap it actually takes. For now you can control the result's memory usage only by reducing the result columns or rows (by limit, time range, or aggregations). You can also split the work into several queries with smaller results to reduce the peak heap usage caused by temporary objects. Of course, that is paid for in time and in the complexity of your code.
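For instance, if Node is started with --expose-gc, you can force a collection before and after the query for a cleaner reading. A minimal sketch (the query itself is elided, and the code is assumed to run inside an async function):
// run with: node --expose-gc app.js
global.gc();
const before = process.memoryUsage().heapUsed;
const data = await influx.query(/* ...the query above... */);
global.gc();
const after = process.memoryUsage().heapUsed;
console.log(`heap retained by the result: ${((after - before) / 1024 / 1024).toFixed(1)} MB`);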
You can expect lower memory usage with the 2.x client (https://github.com/influxdata/influxdb-client-js). It does not create intermediate JSON objects, it internally processes results in smaller chunks, and it has an observable-like API that lets the user decide how to represent each result row. It uses Flux as its query language and requires InfluxDB 2.x or InfluxDB 1.8 (with the 2.x API enabled).
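As an illustration, a minimal sketch with @influxdata/influxdb-client (url, token, org, bucket, and the Flux query are placeholders, not values from the question). Each row is delivered to the next() callback as soon as it is parsed, so you can stream rows into the worksheet instead of materializing the whole result:
const { InfluxDB } = require('@influxdata/influxdb-client');

const queryApi = new InfluxDB({ url: 'http://localhost:8086', token: 'my-token' })
  .getQueryApi('my-org');

const fluxQuery = `from(bucket: "my-bucket")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "power" and r._field == "f_power")`;

queryApi.queryRows(fluxQuery, {
  next(row, tableMeta) {
    // Convert only the current row; append it to the sheet here
    // instead of accumulating the entire result set in memory.
    const o = tableMeta.toObject(row);
  },
  error(error) {
    console.error('query failed', error);
  },
  complete() {
    console.log('query finished');
  },
});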

Related

RequestList throws heap out of memory in Apify for a wordlist of more than 10 million entries

I have a wordlist of 11-character entries which I want to append to a URL. After some modification in request.js, I am able to run a 5-million-entry wordlist in the RequestList array. It starts throwing a JavaScript heap memory error when going higher, and I have billions of wordlist entries to process. I can generate my wordlist with JS code; 5 million entries finish in about an hour, due to the high server capacity I possess. RequestList is a static variable, so I can't add to it again. How can I run this indefinitely for billions of combinations? If any cron script can help, I am open to that as well.
It would be better to use RequestQueue for such a high number of requests. The queue is persisted to disk as an SQLite database, so memory usage is not an issue.
I suggest adding, say, 1000 requests to the queue and immediately starting to crawl, while pushing more requests to the queue; a sketch follows below. Enqueueing tens of millions or billions of requests might take long, but you don't need to wait for that.
For best performance, use apify version 1.0.0 or higher.
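A rough sketch of that pattern with the Apify SDK (buildUrl() and TOTAL_COMBINATIONS are hypothetical stand-ins for your wordlist logic, and a real run would also need to keep the queue from draining before the producer loop finishes):
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    // Seed an initial batch so crawling can start immediately.
    for (let i = 0; i < 1000; i++) {
        await requestQueue.addRequest({ url: buildUrl(i) });
    }

    // Keep enqueueing the rest in the background while the crawler runs.
    const producer = (async () => {
        for (let i = 1000; i < TOTAL_COMBINATIONS; i++) {
            await requestQueue.addRequest({ url: buildUrl(i) });
        }
    })();

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            // Fetch and process request.url here.
        },
    });

    await crawler.run();
    await producer;
});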

Redis memory usage continues to climb when using task.forget()

I have a MySQL database which stores two years of OHLC data for thousands of stocks. The data is read from MySQL as pandas dataframes and then submitted to celery in large batch jobs, which eventually leads to "OOM command not allowed when used memory > 'maxmemory'".
I have added the following celery config options. These options have allowed my script to run longer, however redis inevitably reaches 2 GB of memory and celery throws OOM errors.
result_expires = 30
ignore_result = True
worker_max_tasks_per_child = 1000
From the redis side I have tried playing with the maxmemory policy, using both allkeys-lru and volatile-lru. Neither seems to make a difference.
When celery hits the OOM error, the redis cli shows max memory usage and no keys:
# Memory
used_memory:2144982784
used_memory_human:2.00G
used_memory_rss:1630146560
used_memory_rss_human:1.52G
used_memory_peak:2149023792
used_memory_peak_human:2.00G
used_memory_peak_perc:99.81%
used_memory_overhead:2144785284
used_memory_startup:987472
used_memory_dataset:197500
used_memory_dataset_perc:0.01%
allocator_allocated:2144944880
allocator_active:1630108672
allocator_resident:1630108672
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:2147483648
maxmemory_human:2.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:0.76
allocator_frag_bytes:18446744073194715408
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:37888
mem_fragmentation_ratio:0.76
mem_fragmentation_bytes:-514798320
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:2143797684
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
And there are zero keys?
127.0.0.1:6379[1]> keys *
(empty list or set)
When I run this same code in subsets of 200*5 requests (then terminate), everything runs successfully. Redis memory usage caps at around 100 MB, and when the python process terminates, all the memory usage drops as expected. This leads me to believe I could implement a handler to do 200*5 requests at a time, however I suspect that it is the python process (my script) terminating that is actually freeing memory in celery/redis...
I would like to avoid subsetting this and process everything from MySQL in one shot: about 5000 pandas dataframes * 5 tasks in total.
I do not understand why the memory usage in redis continues to grow when I am forgetting all results immediately after retrieving them.
Here is an example of how this is done in my code:
def getTaskResults(self, caller, task):
    # Wait for the results and then call back
    # Attach this Results object in the callback along with the data
    ready = False
    while not ready:
        if task.ready():
            ready = True
            data = pd.read_json(task.get())
            data.sort_values(by=['Date'], inplace=True)
            task.forget()
            return caller.resultsCB(data, self)
This is probably my ignorance with redis, but if there are no keys, how is it consuming all that memory? And how can I validate what is actually consuming that memory in redis?
Since I store the task ID of every call to celery in an object, I have confirmed that trying to do a task.get after adding task.forget throws an error.

Spark and 100,000 sequential HTTP calls: driver vs workers

I have to make 100,000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50 KB of data, and I have to wait 1 second between requests in order not to exceed API rate limits.
Where should the HTTP calls be made: from the Spark job's code (executed on the driver/master node) or from a dataset transformation (executed on a worker node)?
Workarounds
Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5000 JSON items), and save each dataset to S3 with the help of Spark. You do not need to keep a dataset after you have saved it.
Create a dataset from all 100,000 URLs (moving all further computation to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computations - they're sequential because of the 1-second delay. But:
Is it bad to make 100,000 HTTP calls from the driver/master node?
Is it more efficient to create/save one 100,000 * 5,000 dataset than to create/save 100,000 small datasets of size 5,000?
Each time I create a dataset from an HTTP response, I'll move the response to a worker and then save it to S3, right? Double shuffling then...
The second option
Actually it won't benefit from parallel processing, since you have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't too heavy) off the driver. But:
Is it worth moving the computations to the workers?
Is it a good idea to make an API call inside a transformation?
Saving a file of <32 MB (or whatever fs.s3a.block.size is) to S3 is ~2 GETs, 1 LIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files: a POST to initiate the multipart upload after that first block, one POST per 32 MB block (of 32 MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient.
Where small S3 file sizes matter is in the AWS bills and in follow-up Spark queries: anything you use in Spark, PySpark, SQL etc. is slower with many small files. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete cost.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable, then task failures & retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store from workers: first the RDD of copy src/dest operations is built up, then the operations are pushed out to the workers. The result of the worker code includes upload duration info, in case someone ever wanted to try to aggregate the stats (though there you'd probably need a timestamp for a time-series view).
Given you have to serialize the work to one request/second, 100K requests are going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally, so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how to do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be:
* a first RDD takes the list of queries and some summary info about any existing checkpointed data, and calculates the next 15 minutes of work,
* build up a list of GET calls to delegate to one or more workers, either 1 URL per row or multiple URLs in a single row,
* run that job and save the results,
* test that recovery works, with a smaller window and by killing things,
* once happy: do the full run.
Maybe also: recognise & react to any throttle events coming off the far end by
1. sleeping in the worker
2. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune the sleep window for subsequent tasks.

How to perform massive data uploads to firebase firestore

I have about ~300 MB of data (~180k JSON objects) that gets updated once every 2-3 days.
This data is divided into three "collections" that I must keep up to date.
I decided to go the Node.js way, but any solution in a language I know (Java, Python) will be welcomed.
Whenever I perform a batch set using the Node.js firebase-admin client, not only does it consume an aberrant amount of RAM (about 4-6 GB!), but it also tends to crash with errors that don't have a clear reason (up to page 4 of a Google search without a meaningful answer).
My code is frankly simple, this is it:
var collection = db.collection("items");
var batch = db.batch();

array.forEach(item => {
    var ref = collection.doc(item.id);
    batch.set(ref, item);
});

batch.commit().then((res) => {
    console.log("YAY", res);
});
I haven't found anywhere whether there is a limit on the number of writes in a limited span of time (I understand that doing 50-60k writes should be easy peasy with a backend the size of Firebase), yet this can go up the RAM train and end up with 4-6 GB of RAM allocated.
I can confirm that when the errors are thrown, or the RAM usage clogs my laptop (whichever happens first), I am still at less than 1-4% of my daily usage quotas, so that is not the issue.
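For what it's worth, a Firestore batched write is capped at 500 operations, so a single batch over ~180k items cannot succeed regardless of RAM. A minimal sketch (inside an async function, reusing db and array from the code above) that commits sequential chunks so pending writes don't pile up in memory:
const collection = db.collection("items");
const chunkSize = 500; // Firestore's hard limit per batched write

for (let i = 0; i < array.length; i += chunkSize) {
    const batch = db.batch();
    for (const item of array.slice(i, i + chunkSize)) {
        batch.set(collection.doc(item.id), item);
    }
    // Awaiting each commit bounds memory: only one chunk of
    // pending writes exists at a time.
    await batch.commit();
}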

How to make edges unique and to quantify them without out-of-memory error

I've created an edge collection with about 16 million edges. The edges are not unique, meaning there can be more than one edge from vertex a to vertex b. The edge collection holds about 2.4 GB of data and has a 1.6 GB edge index. I am using a computer with 16 GB RAM (and, additionally, 16 GB of swap space).
Now I try to compute the unique edges (between each pair of vertices a-b) with a statement like this one:
FOR wf IN DeWritesWith
  COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
  INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res } INTO DeWritesWithAggregated
  // Returning instead of inserting also leads to the out-of-memory error:
  // RETURN { "_from": from, "_to": to, "type": "writesWith", "numArticles": res }
My problem: I always run out of memory (32 GB, RAM plus swap). As the problem also occurs when I do not write the result, I assume it is not a problem of huge write-transaction logs.
Is this normal, and can I optimize the AQL somehow? I am hoping for a solution, as I think this is a fairly generic usage scenario for graphs...
Since ArangoDB 2.6, COLLECT can run in two modes:
the sorted mode, which uses a sort step before the aggregation
the hash table mode, which does not require an upfront sort step
The optimizer will choose the hash table mode automatically if it is considered cheaper than the sorted mode with its sort step.
The new COLLECT implementation in 2.6 should make the selection part of the query run much faster in 2.6 than in 2.5 and before. Note that COLLECT still produces a sorted output of its result (not of its input), even in hash table mode. This is done for compatibility with the sorted mode. This result sort step can be avoided by adding an extra SORT null instruction after the COLLECT statement; the optimizer can then optimize away the sorting of the result.
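Applied to the query from the question, that would look like the sketch below (the SORT null goes directly after the COLLECT; whether the hash table mode is actually picked still depends on the optimizer):
FOR wf IN DeWritesWith
  COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
  SORT null // lets the optimizer remove the post-COLLECT result sort
  INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res } INTO DeWritesWithAggregated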
A blog post that explains the two modes is here:
http://jsteemann.github.io/blog/2015/04/22/collecting-with-a-hash-table/
