How to execute queries during a stream using pg-promise?

I'm trying to execute an insert query for each row of a query stream using pg-promise with pg-query-stream. With the approach I have, memory usage increases with each query executed.
I've also narrowed the problem down to just executing any query during the stream, not just inserts. I currently listen for 'data' events on the stream, pause the stream, execute a query, and resume the stream. I've also tried piping the query stream into a writable stream that executes the query, but I get an error that the db connection is already closed.
let count = 0;
const startTime = new Date();
const qs = new QueryStream('SELECT 1 FROM GENERATE_SERIES(1, 1000000)');

db.stream(qs, stream => {
    stream.on('data', async () => {
        count++;
        stream.pause();
        await db.one('SELECT 1');
        if (count % 10000 === 0) {
            const duration = Math.round((new Date() - startTime) / 1000);
            const mb = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
            console.log(`row ${count}, ${mb}MB, ${duration} seconds`);
        }
        stream.resume();
    });
});
I expected the memory usage to hover around a constant value, but the output looks like the following:
row 10000, 105MB, 4 seconds
row 20000, 191MB, 6 seconds
row 30000, 278MB, 9 seconds
row 40000, 370MB, 10 seconds
row 50000, 458MB, 14 seconds
It takes over 10 minutes to reach row 60000.
UPDATE:
I edited the code above to use async/await so it waits for the inner query to finish, and I increased the series to 10,000,000. I ran the Node process with 512MB of memory; the program slows down significantly when approaching that limit but doesn't crash. The problem occurs with Node v10 but not with v11+.

This is due to invalid use of promises / asynchronous code.
The call db.one('SELECT 1') isn't chained to anything, so it spawns loose promises at a fast rate, which in turn pollutes memory.
You need to chain it, either with .then...catch or with await.
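A minimal sketch of the data handler with the inner query chained, so no promise is left dangling (same db and stream objects as above; the error handling shown is only illustrative):

stream.on('data', () => {
    stream.pause();
    db.one('SELECT 1')
        .then(() => stream.resume())
        .catch(err => {
            console.error(err);
            stream.destroy(err); // abort the source stream on failure
        });
});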

Related

Optimizing file parse and SNS publish of large record set

I have an 85MB data file with 110k text records in it. I need to parse each of these records and publish an SNS message to a topic for each record. I am doing this successfully, but the Lambda function requires a lot of time to run, as well as a large amount of memory. Consider the following:
const parse = async (key) => {
    // get the 85MB file from S3. this takes 3 seconds
    // I could probably do this via a stream to cut down on memory...
    let file = await getFile( key );

    // parse the data by new line
    const rows = file.split("\n");

    // free some memory now
    // this freed up ~300MB of memory in my tests
    file = null;

    const requests = [];
    for( let i = 0; i < rows.length; i++ ) {
        // ... parse the row and build a small JS object from it
        // publish to SNS. assume publishMsg returns a promise after a successful SNS push
        requests.push( publishMsg(data) );
    }

    // wait for all to finish
    await Promise.all(requests);
    return 1;
};
The Lambda function times out with this code at 90 seconds (the current limit I have set). I could raise this limit, as well as the memory (currently at 1024MB), and likely solve my issue. But none of the SNS publish calls take place before the function hits the timeout. Why?
Let's say 10k rows are processed before the function hits the timeout. Since I am submitting the publishes asynchronously, shouldn't several of these complete regardless of the timeout? It seems they only run if the entire function completes.
I have run a test where I cut the data down to 15k rows, and it runs without any issue, in roughly 15 seconds.
So the question: why are the async calls not firing prior to the function timeout, and any input on how I can optimize this without moving away from Lambda?
Lambda config: Node.js 10.x, 1024 MB, 90-second timeout
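One way to keep memory bounded and the publishes flowing is to await them in fixed-size chunks instead of collecting 110k promises. This is only a sketch of that idea, reusing getFile and publishMsg from the question (buildRecord is a hypothetical stand-in for the row-parsing step), not a verified fix:

const CHUNK_SIZE = 500; // assumed batch size, tune against the SNS publish rate

const parseInChunks = async (key) => {
    let file = await getFile(key);
    const rows = file.split("\n");
    file = null;

    let pending = [];
    for (let i = 0; i < rows.length; i++) {
        const data = buildRecord(rows[i]); // hypothetical row parser
        pending.push(publishMsg(data));

        // flush every CHUNK_SIZE publishes so promises don't pile up
        if (pending.length >= CHUNK_SIZE) {
            await Promise.all(pending);
            pending = [];
        }
    }
    await Promise.all(pending); // flush the remainder
    return 1;
};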

DynamoDB PutItem using all heap memory - NodeJS

I have a csv with over a million lines and I want to import all the lines into DynamoDB. I'm able to loop through the csv just fine; however, when I try to call DynamoDB PutItem on these lines, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const { once } = require('events');
const parse = require('csv-parse/lib/sync'); // csv-parse, assuming the synchronous API
const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB();

let insertIntoDynamoDB = async () => {
    const file = './file.csv';
    let index = 0;
    const readLine = createInterface({
        input: createReadStream(file),
        crlfDelay: Infinity
    });

    readLine.on('line', async (line) => {
        let record = parse(`${line}`, {
            delimiter: ',',
            skip_empty_lines: true,
            skip_lines_with_empty_values: false
        });
        await dynamodb.putItem({
            Item: {
                "Id": {
                    S: record[0][2]
                },
                "newId": {
                    S: record[0][0]
                }
            },
            TableName: "My-Table-Name"
        }).promise();
        index++;
        if (index % 1000 === 0) {
            console.log(index);
        }
    });

    // halts process until all lines have been processed
    await once(readLine, 'close');
    console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is at 500; adjusting this value has no effect.
For anyone trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets to make HTTP requests; if all sockets are consumed, events are queued until a socket becomes available. When processing millions of requests, these sockets get consumed immediately, and the queue builds up until it blows up the heap.
So, then how do you get around this?
1. Increase the heap size
2. Increase the number of sockets
3. Control how many "events" you are queueing
Options 1 and 2 are the easy way out, but they do not scale. They might work for your scenario if you are doing a one-off thing, but if you are trying to build a robust solution, then you will want to go with number 3.
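For reference, options 1 and 2 usually come down to configuration. A sketch, assuming the AWS SDK for JavaScript v2 and a hypothetical script name:

// Option 1: raise the V8 heap limit, e.g. start the script with
//   node --max-old-space-size=4096 import.js

// Option 2: give the AWS SDK more sockets than its default of 50
const https = require('https');
const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB({
    httpOptions: { agent: new https.Agent({ maxSockets: 200 }) }
});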
To do number 3, I determine the max heap size and divide it by how large I think an "event" will be in memory. For example, I assume an updateItem event for DynamoDB is about 100,000 bytes; with a 4GB heap, that gives 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of that many events, to leave room on the heap for whatever else the Node application is doing; this percentage can be lowered or raised depending on your preference. Once I have the number of events, I read a line from the csv and consume an event; when the event has completed, I release it back into the pool. If there are no events available, I pause the input stream to the csv until an event becomes available.
Now I can upload millions of entries to dynamodb without any worry of blowing up the heap.
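A minimal sketch of that pool idea, assuming the same readLine/dynamodb setup as in the question; MAX_IN_FLIGHT stands in for the event budget computed from the heap size:

const MAX_IN_FLIGHT = 20000; // assumed budget: ~50% of (heap size / estimated event size)
let inFlight = 0;
let paused = false;

readLine.on('line', (line) => {
    const record = parse(line, { delimiter: ',', skip_empty_lines: true });

    inFlight++;
    if (!paused && inFlight >= MAX_IN_FLIGHT) {
        readLine.pause(); // stop reading until enough requests drain
        paused = true;
    }

    dynamodb.putItem({
        Item: { "Id": { S: record[0][2] }, "newId": { S: record[0][0] } },
        TableName: "My-Table-Name"
    }).promise()
        .catch(err => console.error(err))
        .finally(() => {
            inFlight--; // release the "event" back into the pool
            if (paused && inFlight < MAX_IN_FLIGHT / 2) {
                readLine.resume();
                paused = false;
            }
        });
});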

Redis time() sometimes returns time in the future

In the Node.js code below, timeDiff is sometimes negative (about 1% of the time). That is, Redis time() sometimes returns a time a few hundred milliseconds in the future compared to Date.now(), even though the Redis instance is located on the same AWS virtual machine and Date.now() is called after successfully receiving the response from Redis.
Is this something expected? Could it be caused by NTP, given the frequency and the difference (sometimes close to one second)?
Unfortunately, I couldn't even find the source code for the Redis time() function, so help finding it would be useful as well.
const redis = require("redis");
const Bluebird = require("bluebird");

Bluebird.promisifyAll(redis);
const redisClient = redis.createClient(LOCAL_PORT, LOCAL_HOST);

function runNext() {
    logRedisLatency().finally(() => {
        setTimeout(runNext, 20 * 1000);
    });
}

function logRedisLatency() {
    return redisClient.timeAsync().then((data) => {
        const timeDiff = Date.now() - parseRedisTime(data);
        console.log(timeDiff);
    });
}

function parseRedisTime(data) {
    const seconds = parseInt(data[0], 10);
    const milliseconds = parseInt(data[1].substr(0, 3), 10);
    const time = (seconds * 1000) + milliseconds;
    return time;
}

runNext();
According to the Redis docs for the TIME command, it returns an array containing two values:
the unix time in seconds;
the microseconds.
As you know, the microseconds value can be a number between 0 and 999999. In your code you are calculating milliseconds from the first 3 digits of the microseconds!
That is exactly the problem.
To calculate milliseconds you should divide the microseconds by 1000.
For example, when the microseconds value is 52369, the milliseconds are 52, but your code computes 523!
You can do something like this:
const milliseconds = parseInt(data[1], 10) / 1000 | 0;
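Applied to the question's helper, a corrected parseRedisTime would look like this (same data shape as returned by TIME):

function parseRedisTime(data) {
    const seconds = parseInt(data[0], 10);
    const microseconds = parseInt(data[1], 10);
    const milliseconds = Math.floor(microseconds / 1000); // e.g. 52369 µs -> 52 ms
    return (seconds * 1000) + milliseconds;
}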

Inconsistent request behavior in Node when requesting large number of links?

I am currently using this piece of code to connect to a massive list of links (a total of 2458 links, dumped at https://pastebin.com/2wC8hwad) to get feeds from numerous sources and to deliver them to users of my program.
It basically splits one massive array into multiple batches (arrays), then forks a process to handle a batch, requesting each stored link for a 200 status code. Only when a batch is complete is the next batch sent for processing, and when it's all done the forked process is disconnected. However, I'm facing issues with apparent inconsistency in how this performs, particularly the part where it requests the links for a status code.
const req = require('./request.js')
const process = require('child_process')
const linkList = require('./links.json')

let processor
console.log(`Total length: ${linkList.length}`) // 2458 links

const batchLength = 400
const batchList = [] // Contains batches (arrays) of links
let currentBatch = []

for (var i in linkList) {
  if (currentBatch.length < batchLength) currentBatch.push(linkList[i])
  else {
    batchList.push(currentBatch)
    currentBatch = []
    currentBatch.push(linkList[i])
  }
}
if (currentBatch.length > 0) batchList.push(currentBatch)

console.log(`Batch list length by default is ${batchList.length}`)
// cutDownBatchList(1)
console.log(`New batch list length is ${batchList.length}`)

const startTime = new Date()
getBatchIsolated(0, batchList)
let failCount = 0

function getBatchIsolated (batchNumber) {
  console.log('Starting batch #' + batchNumber)
  let completedLinks = 0
  const currentBatch = batchList[batchNumber]
  if (!processor) processor = process.fork('./request.js')
  for (var u in currentBatch) { processor.send(currentBatch[u]) }
  processor.on('message', function (linkCompletion) {
    if (linkCompletion === 'failed') failCount++
    if (++completedLinks === currentBatch.length) {
      if (batchNumber !== batchList.length - 1) setTimeout(getBatchIsolated, 500, batchNumber + 1)
      else finish()
    }
  })
}

function finish () {
  console.log(`Completed, time taken: ${((new Date() - startTime) / 1000).toFixed(2)}s. (${failCount}/${linkList.length} failed)`)
  processor.disconnect()
}

function cutDownBatchList (maxBatches) {
  for (var r = batchList.length - 1; batchList.length > maxBatches && r >= 0; r--) {
    batchList.splice(r, 1)
  }
  return batchList
}
Below is request.js, using needle. (However, for some strange reason it may completely hang up on a particular site indefinitely - in that case, I just use this workaround)
const needle = require('needle')

function connect (link, callback) {
  const options = {
    timeout: 10000,
    read_timeout: 8000,
    follow_max: 5,
    rejectUnauthorized: true
  }
  const request = needle.get(link, options)
    .on('header', (statusCode, headers) => {
      if (statusCode === 200) callback(null, link)
      else request.emit('err', new Error(`Bad status code (${statusCode})`))
    })
    .on('err', err => callback(err, link))
}

process.on('message', function (linkRequest) {
  connect(linkRequest, function (err, link) {
    if (err) {
      console.log(`Couldn't connect to ${link} (${err})`)
      process.send('failed')
    } else process.send('success')
  })
})
In theory, I think this should perform perfectly fine - it spawns a separate process to handle the dirty work in sequential batches so it's not overloaded, and it's super scalable. However, when using the full list of 2458 links with a total of 7 batches, I often get massive "socket hang up" errors on random batches on almost every trial, similar to what would happen if I requested all the links at once.
If I cut the number of batches down to 1 using the function cutDownBatchList, it performs perfectly fine on almost every trial. This is all happening on a Linux Debian VPS with two 3.1GHz vCores and 4 GB RAM from OVH, on Node v6.11.2.
One thing I also noticed is that if I increase the timeout to 30000 (30 sec) in request.js for 7 batches, it works as intended - however it works perfectly fine with a much lower timeout when I cut it down to 1 batch. If I try to do all 2458 links at once, with a higher timeout, I also face no issues (which basically makes this mini algorithm useless if I can't cut down the timeout via batch handling). This all goes back to the inconsistent behavior issue.
The best TLDR I can do: trying to request a bunch of links in sequential batches in a forked child process succeeds almost every time with a lower number of batches, but fails consistently with the full number of batches, even though the behavior should be the same since it's handling them in isolated batches.
Any help would be greatly appreciated in solving this issue as I just cannot for the life of me figure it out!
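One observation about the code as posted (it does not by itself explain the socket hang ups, so treat it as a hedged aside): every call to getBatchIsolated attaches another 'message' listener to the same forked process, so in later batches several handlers run per message and failCount is incremented once per attached listener for every 'failed' reply. A sketch that registers the handler once, keeping the rest of the logic intact:

let currentBatchNumber = 0
let completedLinks = 0

function getBatchIsolated (batchNumber) {
  console.log('Starting batch #' + batchNumber)
  currentBatchNumber = batchNumber
  completedLinks = 0
  if (!processor) {
    processor = process.fork('./request.js')
    // attach the completion handler once for the lifetime of the child process
    processor.on('message', function (linkCompletion) {
      if (linkCompletion === 'failed') failCount++
      if (++completedLinks === batchList[currentBatchNumber].length) {
        if (currentBatchNumber !== batchList.length - 1) setTimeout(getBatchIsolated, 500, currentBatchNumber + 1)
        else finish()
      }
    })
  }
  for (var u in batchList[batchNumber]) { processor.send(batchList[batchNumber][u]) }
}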

Loss data in MongoDB real-time application

I am developing a realtime application (Node, socket.io and MongoDB with Mongoose) which receives data every 30 seconds. The data received is some metadata about the machine and 10 pressures.
I have a document per day that is preallocated when the day changes (avoiding moving data around in the database as documents grow while saving data), so I only have to do updates.
The data looks like:
{
    metadata: { .... },
    data: {
        "0": {
            "0": {
                "0": {
                    "pressures": { ..... }
                },
                "30": {}
            },
            "1": {},
            "59": {}
        },
        "1": {},
        "23": {}
    }
}
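For illustration only, an update into this preallocated shape presumably targets a single 30-second slot with a dot path. The helper below is hypothetical (the question's actual _saveUpdates isn't shown) and assumes the nested keys are hour, minute and a 0/30 second slot:

// hypothetical sketch of writing one reading into its preallocated slot
function saveReading (DataDay, id, date, pressures) {
    const path = 'data.' + date.getHours() + '.' + date.getMinutes() + '.' +
        (date.getSeconds() < 30 ? '0' : '30') + '.pressures';

    // a single $set only touches that slot, so the document never grows
    return DataDay.update({ _id: id }, { $set: { [path]: pressures } }).exec();
}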
Without doing any database operation in the server, I receive the data from the sensors every 30 seconds without problems and never lose the socket.io connection:
DATA 2016-09-30T16:02:00+02:00
DATA 2016-09-30T16:02:30+02:00
DATA 2016-09-30T16:03:00+02:00
DATA 2016-09-30T16:03:30+02:00
DATA 2016-09-30T16:04:00+02:00
DATA 2016-09-30T16:04:30+02:00
but when I start doing updates (calling DataDay.findById(_id).exec()...) I lose half of the data and sometimes the socket.io connection (e.g. at 16:18, 17:10, 17:49, 18:12). It's like the server stops receiving socket information at intervals:
DATA 2016-09-30T16:02:00+02:00
LOST
LOST
DATA 2016-09-30T16:03:30+02:00
DATA 2016-09-30T16:04:00+02:00
LOST
LOST
DATA 2016-09-30T16:05:30+02:00
LOST
I am using MongoDB with Mongoose (with Bluebird promises), and I am probably doing some blocking operation or something else wrong, but I can't find it.
The code handling the incoming data is:
socket.on('machine:data', function (data) {
    console.log('DATA ' + data.metadata.date);
    var startAt = Date.now(); // Only for testing
    dataYun = data;
    var _id = _createIdDataDay(dataYun._id); // Synchronous

    DataDay.findById(_id).exec()       // Asynchronous
        .then( _handleEntityNotFound ) // Synchronous
        .then( _createPreData )        // Asynchronous
        .then( _saveUpdates )          // Asynchronous
        .then( function () {
            console.log('insertData: ' + (Date.now() - startAt) + ' ms');
        })
        .catch( _handleError(data) );

    console.log('AFTER THE INSERT METHOD');
    console.log(data.data.pressures);
});
I have measured how expensive the operations are:
_createIdDataDay: 0 ms
_handleEntityNotFound: 1 ms
_createPreData: 709 ms // only executed once a day
_saveUpdates: 452 ms
insertData: 452 ms
This test has been done with only one machine sending data, but the goal is to receive data from 50 to 100 machines, all of them sending data at the same time.
So from this test the conclusion is that every 30 seconds I have to update the database, and the operation takes more or less 452 ms.
So I don't understand where the problem is.
Is 452 ms too expensive for an update?
Even so, I am not doing any other operation, and the next data arrives 30 seconds later, so it doesn't make sense to lose data.
I know promises don't work well for multiple events (though I don't think that's the case here), but I'm not sure.
Can it be a problem with socket.io?
Or am I simply doing something that blocks the event loop without seeing it?
Thanks
