Node.js loop crashed immediately when inserting bulk data into Cassandra

I am trying to insert 1,000,000 records into Cassandra with Node.js, but the loop crashes after a short time. I can never insert more than about 10,000 records. Why does the loop crash? Can anybody help me?
Thanks.
My code looks like this:
var helenus = require('helenus'),
    pool = new helenus.ConnectionPool({
        hosts : ['localhost:9160'],
        keyspace : 'twissandra',
        user : '',
        password : '',
        timeout : 3000
    });
pool.on('error', function(err){
    console.error(err.name, err.message);
});
var i=0;
pool.connect(function(err, keyspace){
    if(err){
        throw(err);
    } else {
        while (i<1000000){
            i++;
            var str="tkg" + i;
            var pass="ktr" + i;
            pool.cql("insert into users (username,password) VALUES (?,?)",[str, pass],function(err, results){
            });
        }
    }
});
console.log("end");

You're probably overloading the Cassandra queue by attempting to make a million requests all at once! Keep in mind the request is asynchronous, so it is made even if the previous one has not completed.
Try using async.eachLimit to limit it to 50-100 requests at a time. The actual maximum concurrent capacity changes based on the backend process.
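The idea of capping the number of outstanding requests can be sketched without any library. This is only an illustration using plain promises, not the async.eachLimit call itself: insertUser is a hypothetical stand-in for pool.cql(...) that resolves asynchronously, and the limit of 50 is just the lower end of the range suggested above.

```javascript
// Hypothetical stand-in for pool.cql(...): resolves asynchronously.
// It tracks how many calls are in flight so the limit can be observed.
let inFlight = 0;
let maxInFlight = 0;
function insertUser(username, password) {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  return new Promise((resolve) => {
    setImmediate(() => {
      inFlight--;
      resolve();
    });
  });
}

// Run `total` inserts with at most `limit` of them outstanding at any time.
async function insertAll(total, limit) {
  let next = 1;
  // Each worker pulls the next index and waits for its insert to finish
  // before starting another one.
  async function worker() {
    while (next <= total) {
      const i = next++;
      await insertUser("tkg" + i, "ktr" + i);
    }
  }
  const workers = [];
  for (let w = 0; w < Math.min(limit, total); w++) {
    workers.push(worker());
  }
  await Promise.all(workers);
}

// Resolves with the highest number of simultaneous inserts that occurred.
const done = insertAll(1000, 50).then(() => maxInFlight);
```

Because each worker awaits its own insert, a new request starts only when an earlier one has completed, which is the property the unbounded while loop in the question lacks.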

Actually, there was no problem. I checked the number of records twice at different times and saw that the write operation continued until the timeout value, which is set in the code. In summary, the code does not crash. Thank you, Julian H. Lam, for the reply.
But I have another question: how can I increase Cassandra's write performance? What should I change in the cassandra.yaml file, or anywhere else?
Thank you.

Related

How to manage massive calls to Postgresql in Node

I have a question regarding massive calls to PostgreSQL.
This is the scenario:
I have a simple Node.js app that makes many queries to PostgreSQL in a short period of time.
Everything is fine, but sometimes these calls get rejected because of PostgreSQL's maximum pool connections setting, which is 100.
I have in mind a queue-consumption style app: add every query to a queue and then consume one element every second, so that PostgreSQL receives one query per second.
But my problem is that I don't know where to start. This is the part that gives me trouble: at some point I have a lot of calls and I get lots of "ERROR IN QUERY EXECUTION" for the reason explained above.
const pool3 = new Pool(credentialsPostGres);
let res = [];
let sql_call = "select colum1 from table2 where x = y"; //the real query is a bit more complex, but you get the idea.
poll_query.query(sql_call,(err,results) => {
if (err) {
pool3.end();
console.log(err + " ERROR IN QUERY EXECUTION");
} else {
res.push({ data: Object.values(JSON.parse(JSON.stringify(results.rows))) });
pool3.end();
return callback(res,data);
}
})
How should I manage this part with a queue? I am a bit lost.
Help!

Automatic run function daily at specific time

I'm doing a school project: a CMS to help operate my school's website. It uses 3 databases:
MongoDB (Head database, store all information)
Redis (Store the menu of website)
Elasticsearch (Store posts)
Currently, when I insert/edit/delete data, I also insert/edit/delete it in the related databases. But my mentor wants me to write a function that lets the system automatically sync data between those 3 databases at a specific time (the user can choose when).
My server runs on Node.js. This requirement is new to me; I have never heard of it before. My new approach is:
Add 1 flag to database field
Select all rows which contain flag == true.
Sync data
But I don't know how to run the above function automatically at a specific time. I hope you can help me optimize my new flow and solve the sync problem.
Thank you all!
Edit 1: Changed the title.
Edit 2: I found a solution in this topic: Running a function everyday midnight. Is there any way to re-run this function if an error occurs while syncing data? Something like this:
function syncDataAtMidNight() {
    var now = new Date();
    var night = new Date(
        now.getFullYear(),
        now.getMonth(),
        now.getDate() + 1, // the next day, ...
        0, 0, 0 // ...at 00:00:00 hours
    );
    var msToMidnight = night.getTime() - now.getTime();
    setTimeout(function runSync() {
        // is this a correct way to loop, in case of error when sync ???
        myFunctionSync(function (err, resp) {
            if (err) {
                runSync(); // retry the sync on error
            } else {
                changeFlagToFalse();
                syncDataAtMidNight(); // schedule the next night's run
            }
        });
    }, msToMidnight);
}

You should consider using cron. It has the significant advantage of running at the system level: it can capture the exit code of your script and, above all, launch or relaunch Node if necessary.
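If the scheduling stays inside Node rather than moving to cron, the retry the question asks about can be bounded so that a persistent failure does not loop forever. The sketch below is one possible shape, not a definitive implementation: retry, scheduleNightlySync, the limit of 3 attempts and the one-minute delay are all assumptions for illustration, and syncData stands in for the real sync function (assumed here to return a promise).

```javascript
// Milliseconds from `now` until the next local midnight.
function msUntilNextMidnight(now) {
  const next = new Date(now.getFullYear(), now.getMonth(), now.getDate() + 1, 0, 0, 0);
  return next.getTime() - now.getTime();
}

// Run `task` (a function returning a promise) up to `maxAttempts` times,
// waiting `delayMs` between failed attempts. Rejects with the last error.
async function retry(task, maxAttempts, delayMs) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Hypothetical wiring: `syncData` stands in for the real sync function.
// Not invoked here; call scheduleNightlySync(yourSyncFunction) to start.
function scheduleNightlySync(syncData) {
  setTimeout(function () {
    retry(syncData, 3, 60 * 1000)
      .catch(function (err) {
        console.error("sync failed after retries:", err.message);
      })
      .then(function () {
        scheduleNightlySync(syncData); // schedule the next night's run
      });
  }, msUntilNextMidnight(new Date()));
}
```

Whether the sync succeeds or exhausts its attempts, the next night's run is scheduled, so one bad night does not stop the job permanently.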

Node calling postgres function with temp tables causing "memory leak"

I have a node.js program calling a Postgres (Amazon RDS micro instance) function, get_jobs within a transaction, 18 times a second using the node-postgres package by brianc.
The node code is just an enhanced version of brianc's basic client pooling example, roughly like...
var pg = require('pg');
var conString = "postgres://username:password@server/database";
function getJobs(cb) {
pg.connect(conString, function(err, client, done) {
if (err) return console.error('error fetching client from pool', err);
client.query("BEGIN;");
client.query('select * from get_jobs()', [], function(err, result) {
client.query("COMMIT;");
done(); //call `done()` to release the client back to the pool
if (err) console.error('error running query', err);
cb(err, result);
});
});
}
function poll() {
getJobs(function(jobs) {
// process the jobs
});
setTimeout(poll, 55);
}
poll(); // start polling
So Postgres is getting:
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: statement: BEGIN;
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: execute <unnamed>: select * from get_jobs();
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: statement: COMMIT;
... repeated every 55ms.
get_jobs is written with temp tables, something like this
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
...
) AS
$BODY$
DECLARE
_nowstamp bigint;
BEGIN
-- take the current unix server time in ms
_nowstamp := (select extract(epoch from now()) * 1000)::bigint;
-- 1. get the jobs that are due
CREATE TEMP TABLE jobs ON COMMIT DROP AS
select ...
from really_big_table_1
where job_time < _nowstamp;
-- 2. get other stuff attached to those jobs
CREATE TEMP TABLE jobs_extra ON COMMIT DROP AS
select ...
from really_big_table_2 r
inner join jobs j on r.id = j.some_id;
ALTER TABLE jobs_extra ADD PRIMARY KEY (id);
-- 3. return the final result with a join to a third big table
RETURN query (
select je.id, ...
from jobs_extra je
left join really_big_table_3 r on je.id = r.id
group by je.id
);
END
$BODY$ LANGUAGE plpgsql VOLATILE;
I've used the temp table pattern because I know that jobs will always be a small extract of rows from really_big_table_1, in hopes that this will scale better than a single query with multiple joins and multiple where conditions. (I used this to great effect with SQL Server and I don't trust any query optimiser now, but please tell me if this is the wrong approach for Postgres!)
The query runs in 8ms on small tables (as measured from node), ample time to complete one job "poll" before the next one starts.
Problem: After about 3 hours of polling at this rate, the Postgres server runs out of memory and crashes.
What I tried already...
If I re-write the function without temp tables, Postgres doesn't run out of memory, but I use the temp table pattern a lot, so this isn't a solution.
If I stop the node program (which kills the 10 connections it uses to run the queries) the memory frees up. Merely making node wait a minute between polling sessions doesn't have the same effect, so there are obviously resources that the Postgres backend associated with the pooled connection is keeping.
If I run a VACUUM while polling is going on, it has no effect on memory consumption and the server continues on its way to death.
Reducing the polling frequency only changes the amount of time before the server dies.
Adding DISCARD ALL; after each COMMIT; has no effect.
Explicitly calling DROP TABLE jobs; DROP TABLE jobs_extra; after RETURN query () instead of ON COMMIT DROPs on the CREATE TABLEs. Server still crashes.
Per CFrei's suggestion, added pg.defaults.poolSize = 0 to the node code in an attempt to disable pooling. The server still crashed, but took much longer and swap went much higher (second spike) than all the previous tests which looked like the first spike below. I found out later that pg.defaults.poolSize = 0 may not disable pooling as expected.
On the basis of this: "Temporary tables cannot be accessed by autovacuum. Therefore, appropriate vacuum and analyze operations should be performed via session SQL commands.", I tried to run a VACUUM from the node server (as some attempt to make VACUUM an "in session" command). I couldn't actually get this test working. I have many objects in my database and VACUUM, operating on all objects, was taking too long to execute each job iteration. Restricting VACUUM just to the temp tables was impossible - (a) you can't run VACUUM in a transaction and (b) outside the transaction the temp tables don't exist. :P EDIT: Later on the Postgres IRC forum, a helpful chap explained that VACUUM isn't relevant for temp tables themselves, but can be useful to clean up the rows created and deleted from pg_attribute that TEMP TABLES cause. In any case, VACUUMing "in session" wasn't the answer.
DROP TABLE ... IF EXISTS before the CREATE TABLE, instead of ON COMMIT DROP. Server still dies.
CREATE TEMP TABLE (...) and insert into ... (select...) instead of CREATE TEMP TABLE ... AS, instead of ON COMMIT DROP. Server dies.
So is ON COMMIT DROP not releasing all the associated resources? What else could be holding memory? How do I release it?

I used this to great effect with SQL Server and I don't trust any query optimiser now
Then don't use them. You can still execute queries directly, as shown below.
but please tell me if this is the wrong approach for Postgres!
It is not a completely wrong approach, it's just a very awkward one, as you are trying to create something that's been implemented by others for a much easier use. As a result, you are making many mistakes that can lead to many problems, including memory leaks.
Compare to the simplicity of the exact same example that uses pg-promise:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);
function getJobs() {
return db.tx(function (t) {
return t.func('get_jobs');
});
}
function poll() {
getJobs()
.then(function (jobs) {
// process the jobs
})
.catch(function (error) {
// error
});
setTimeout(poll, 55);
}
poll(); // start polling
Gets even simpler when using ES6 syntax:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);
function poll() {
db.tx(t=>t.func('get_jobs'))
.then(jobs=> {
// process the jobs
})
.catch(error=> {
// error
});
setTimeout(poll, 55);
}
poll(); // start polling
The only thing that I didn't quite understand in your example - the use of a transaction to execute a single SELECT. This is not what transactions are generally for, as you are not changing any data. I assume you were trying to shrink a real piece of code you had that changes some data also.
In case you don't need a transaction, your code can be further reduced to:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);
function poll() {
db.func('get_jobs')
.then(jobs=> {
// process the jobs
})
.catch(error=> {
// error
});
setTimeout(poll, 55);
}
poll(); // start polling
UPDATE
It would be a dangerous approach, however, not to control the end of the previous request, which also may create memory/connection issues.
A safe approach should be:
function poll() {
db.tx(t=>t.func('get_jobs'))
.then(jobs=> {
// process the jobs
setTimeout(poll, 55);
})
.catch(error=> {
// error
setTimeout(poll, 55);
});
}
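The same idea can be written so that the non-overlap property is easy to check. This is a minimal sketch under stated assumptions: startPolling, keepGoing and fakeGetJobs are hypothetical names, and fakeGetJobs merely stands in for db.tx(t => t.func('get_jobs')) by resolving after a short delay.

```javascript
// Poll repeatedly: the next call to `getJobs` is scheduled only after the
// previous one has settled, so requests can never overlap.
// `getJobs` must return a promise; `keepGoing` decides when to stop.
function startPolling(getJobs, delayMs, keepGoing) {
  return new Promise(function (resolve) {
    function poll() {
      if (!keepGoing()) return resolve();
      getJobs()
        .then(function (jobs) {
          // process the jobs
        })
        .catch(function (error) {
          console.error("poll failed:", error.message);
        })
        .then(function () {
          // runs after success or failure, like a finally block
          setTimeout(poll, delayMs);
        });
    }
    poll();
  });
}

// Stand-in for the database call; records whether two calls ever overlap.
let active = 0;
let overlapSeen = false;
let runs = 0;
function fakeGetJobs() {
  active++;
  if (active > 1) overlapSeen = true;
  runs++;
  return new Promise(function (resolve) {
    setTimeout(function () {
      active--;
      resolve([]);
    }, 5);
  });
}

// Poll five times, then stop.
const finished = startPolling(fakeGetJobs, 1, function () {
  return runs < 5;
});
```

Placing the setTimeout in a single trailing .then avoids repeating it in both the success and error branches.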

Use CTEs to create partial result sets instead of temp tables.
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
...
) AS
$BODY$
DECLARE
_nowstamp bigint;
BEGIN
-- take the current unix server time in ms
_nowstamp := (select extract(epoch from now()) * 1000)::bigint;
RETURN query (
-- 1. get the jobs that are due
WITH jobs AS (
select ...
from really_big_table_1
where job_time < _nowstamp
-- 2. get other stuff attached to those jobs
), jobs_extra AS (
select ...
from really_big_table_2 r
inner join jobs j on r.id = j.some_id
)
-- 3. return the final result with a join to a third big table
select je.id, ...
from jobs_extra je
left join really_big_table_3 r on je.id = r.id
group by je.id
);
END
$BODY$ LANGUAGE plpgsql VOLATILE;
The planner will evaluate each block in sequence, which is what I wanted to achieve with temp tables.
I know this doesn't directly solve the memory leak issue (I'm pretty sure there's something wrong with Postgres' implementation of temp tables, at least the way they manifest on the RDS configuration).
However, the query works, it is planned the way I intended, and memory usage has been stable after 3 days of running the job; my server doesn't crash.
I didn't change the node code at all.

nodejs + postgresql way too slow

I have this piece of code:
var pg = require('pg');
var QueryStream = require('pg-query-stream');
var constr = 'postgres://devel:1234@127.0.0.1/tcc';
var JSONStream = require('JSONStream');
var http = require('http');
pg.connect(constr, function(err, client, done) {
if (err) {
console.log('Erro ao conectar cliente.', err);
process.exit(1);
}
sql = 'SELECT \
pessoa.cod, \
pessoa.nome, \
pessoa.nasc, \
cidade.nome AS cidade \
FROM pessoa, cidade \
WHERE cidade.cod IN (1, 2, 3);';
http.createServer(function (req, resp) {
resp.writeHead(200, { 'Content-Type': 'text/html; Charset=UTF-8' });
var query = new QueryStream(sql);
var stream = client.query(query);
//stream.on('data', console.log);
stream.on('end', function() {
//done();
resp.end()
});
stream.pipe(JSONStream.stringify()).pipe(resp);
}).listen(8080, 'localhost');
});
When I run Apache Bench on it, it gets only about four requests per second.
If I run the same query in PHP/Apache or Java/Tomcat, I get results ten times faster. The database has 1000 rows. If I limit the query to about ten rows, Node is twice as fast as PHP/Java.
What am I doing wrong?
EDIT: Some time ago I opened an issue here: https://github.com/brianc/node-postgres/issues/653
I'm providing this link because I posted some other variations of the code I have tried there.
Even with the comments and hints so far, I have not been able to get a decent speed.

pg-query-stream uses cursors.
You can read the code and change batchSize to better fit your needs.
For those who don't know what cursors are, in short they are a trade-off for keeping memory footprint small and not reading a whole table in memory. But if you get 100 rows at a time when you have 1000 results, that's 1000 / 100 round-trips; so probably 10x slower than a solution not using cursors.
If you know how many rows you need, add a limit to your query, and change the number of rows returned each time to minimize number of roundtrips.
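The round-trip arithmetic above can be written down directly. This is a rough sketch: roundTrips is a hypothetical helper, and it counts only fetches that return rows (a cursor may need one more fetch to discover that it is exhausted).

```javascript
// Number of cursor fetches needed to read `rowCount` rows when each fetch
// returns at most `batchSize` rows. Counts only fetches that return rows.
function roundTrips(rowCount, batchSize) {
  return Math.ceil(rowCount / batchSize);
}

console.log(roundTrips(1000, 100));  // 10
console.log(roundTrips(1000, 1000)); // 1
```

Raising the batch size to match the expected result size brings the count down to a single fetch, at the cost of holding more rows in memory at once.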

As far as I can tell from this code, you create a single connection to PostgreSQL and everything gets queued through it.
The pg module allows for this; it's described here:
https://github.com/brianc/node-postgres/wiki/Queryqueue
If you want real performance, then for each HTTP request you should fetch a connection from the pool, use it, release it, and make 101% sure you always release it (e.g. with proper exception handling), or your server will die once the pool gets completely exhausted.
Once you are there, you can tweak the connection pool parameters and measure performance.
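The "always release" rule is easiest to enforce with try/finally. The helper below is a sketch, assuming a pool whose connect() returns a promise for a client with a release() method (the shape that newer versions of pg expose through pg.Pool). withClient and makeFakePool are hypothetical names, and the fake pool exists only to show that the client is returned even when the query fails.

```javascript
// Borrow a client from `pool`, run `work(client)`, and always release the
// client, even if `work` throws.
async function withClient(pool, work) {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release();
  }
}

// Hypothetical in-memory pool that counts checked-out clients.
function makeFakePool() {
  const pool = {
    checkedOut: 0,
    connect: async function () {
      pool.checkedOut++;
      return {
        query: async function (sql) {
          if (sql === "bad") throw new Error("query failed");
          return { rows: [] };
        },
        release: function () {
          pool.checkedOut--;
        },
      };
    },
  };
  return pool;
}

// A failing query still leaves no client checked out.
const fakePool = makeFakePool();
const outcome = withClient(fakePool, (c) => c.query("bad"))
  .catch((err) => err.message)
  .then((message) => ({ message: message, checkedOut: fakePool.checkedOut }));
```

Inside an HTTP handler, each request would call withClient once, so a thrown error or rejected query cannot leak a connection from the pool.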

Looks like you're waiting for the server to be created before the request gets relayed. Try moving http.createServer outside of the call. If you only want to use the http server in the request, you should try making the calls async.

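Maybe you should set the http.globalAgent.maxSockets value. Try this:
var http = require('http');
http.globalAgent.maxSockets = {{number}}; // replace {{number}} with the limit you want
In Node.js versions before 0.12 the default maxSockets was 5; later versions default to Infinity.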

Redis, Transactions and Throughput

Alright, it's been about 10 hours, and I still can't figure this out. Can someone please help? I am writing to both Redis and MongoDB each time my Node/Express API is called. However, when I query each database by the same key, Redis gradually starts to miss records over time. I can minimize this behavior by throttling the overall throughput (reducing # of ops I'm asking Redis to do). Here's the pseudo code:
function (req, res) {
async.parallel {
f {w:1 into MongoDB -- seems to be working fine}
f {write to Redis -- seems to be miss-firing}
And here the Redis code:
var trx = 1; // transaction is 1:pending 0:complete
async.whilst(function(){return trx;},
function(callback){
r.db.watch(key);
r.db.hgetall(key, function(err, result){
// update existing key
if (result !== null) {
update(key, result, req, function(err, result){
if (err) {callback(err);}
else if (result === null) {callback(null);}
else {trx = 0; callback(null);}
});
}
// new key
else {
newSeries(bin, req, function(err, result){
if (err) {callback(err);}
else if (result === null) {callback(null);}
else {trx = 0; callback(null);}
});
}
});
}, function(err){if(err){callback(err);} else{callback(null);}}
);
in the "update" and "newSeries" functions, I'm basically just doing a MULTI/EXEC to redis using the values from HGETALL, and returning the result (to ensure I didn't hit a race condition).
I am using Cluster with Node, so I have multiple threads executing at once to Redis.
Any thoughts would be really helpful. Thanks.

I guess I just needed a bit of sleep, and a bit more log-trolling to figure this out.
Basically, it was the async.each loop above my block of code. Because that runs in parallel, EXEC was sometimes called on a different key! So it would wipe out the WATCH on another key! So, I just needed to switch it to async.eachSeries - which ensures my single node-worker isn't "working" (WATCH'ing and EXEC'ing) multiple keys at once!
So, two critical lessons. First, any EXEC command from a connection will wipe out all WATCH commands on that connection (so be very careful with parallel or async processing).
And second, be very, very careful with async.each, and always default to async.eachSeries! For me, async.each is conceptually very tough - and it can really screw up single-threaded processes (like Redis). This has cost me a lot of time and pain over the past year... beware!
Hope this helps someone out there.
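The effect can be reproduced without Redis. The sketch below uses a hypothetical in-memory stand-in for a single connection (makeFakeConnection) whose exec() discards every WATCH, as EXEC does in Redis. updateKey, runParallel and runSeries are illustrative names, and the setImmediate delay simulates the HGETALL round-trip.

```javascript
// Hypothetical in-memory stand-in for ONE Redis connection.
// Like Redis, exec() discards every WATCH held on the connection.
function makeFakeConnection() {
  const watched = new Set();
  return {
    watch: function (key) { watched.add(key); },
    isWatched: function (key) { return watched.has(key); },
    exec: function () { watched.clear(); },
  };
}

const tick = () => new Promise((resolve) => setImmediate(resolve));

// WATCH the key, "read" asynchronously, then EXEC. Returns true if the key
// was still watched at EXEC time, i.e. the optimistic lock was intact.
async function updateKey(conn, key) {
  conn.watch(key);
  await tick(); // simulates the HGETALL round-trip
  const guarded = conn.isWatched(key);
  conn.exec();
  return guarded;
}

// Like async.each: all keys are in progress on the connection at once.
async function runParallel(keys) {
  const conn = makeFakeConnection();
  return Promise.all(keys.map((k) => updateKey(conn, k)));
}

// Like async.eachSeries: one key at a time.
async function runSeries(keys) {
  const conn = makeFakeConnection();
  const results = [];
  for (const k of keys) {
    results.push(await updateKey(conn, k));
  }
  return results;
}

const demo = Promise.all([
  runParallel(["a", "b", "c"]),
  runSeries(["a", "b", "c"]),
]);
```

In the parallel run, the first EXEC clears the watches for the other two keys, so their transactions run unguarded; in the series run, every key is still watched when its EXEC is issued.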
