Bulk insert is not working in Cassandra with node.js - node.js

I am trying to insert more than 100 records using batch() method.
client.batch(batchQuery, { prepare: true }, function (err, result) {
  if (err) {
    res.status(404).json({ msg: err });
  } else {
    res.json([result.rows][0]);
  }
});
batchQuery has more than 100 insert queries. It works if there are fewer than 7 records. If there are more than 10, I get "Batch too large".

You shouldn't use batches for bulk inserts into Cassandra (in contrast to an RDBMS). The error you get means that you're inserting data into different partitions, which puts additional load on the node that receives the query. Use batches only if you're inserting into the same partition; in that case they are applied as a single mutation.
Otherwise, sending individual insert queries via asynchronous execute calls will be much faster. You only need to avoid sending too many requests at the same time (see this answer).
You can read more about good and bad uses of batches in the documentation and in the following answer on SO: 1.
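A minimal sketch of the "individual inserts with a cap on in-flight requests" idea, assuming a client whose execute(query, params, options) returns a promise when no callback is passed, as the DataStax cassandra-driver client does. The helper name insertAll and the default concurrency of 32 are illustrative assumptions, not values from the answer.

```javascript
// Run individual prepared INSERTs while capping the number of in-flight requests.
// `client` is assumed to expose execute(query, params, options) -> Promise.
// `insertAll` and the default `concurrency` are hypothetical, for illustration only.
async function insertAll(client, query, rows, concurrency = 32) {
  let next = 0; // index of the next row that has not been sent yet

  // Each worker keeps pulling the next unsent row until none are left.
  async function worker() {
    while (next < rows.length) {
      const params = rows[next++];
      await client.execute(query, params, { prepare: true });
    }
  }

  const workerCount = Math.min(concurrency, rows.length);
  const workers = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers); // rejects as soon as any insert fails
}
```

Each row in `rows` is the parameter array for one INSERT, so the same prepared statement is reused for every request and no batch is involved.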

Related

How to manage massive calls to Postgresql in Node

I have a question regarding massive calls to PostgreSQL.
This is the scenario:
I have a simple Nodejs app that makes queries to PostgreSQL in a short period of time.
Everything is fine, but sometimes these calls get rejected due to Postgresql maximum pool connections setting, which is equal to 100.
I have in mind a queue-consumer style app: every query is added to a queue and one element is consumed every second, so PostgreSQL receives one query per second.
But my problem is that I don't know where to start. This is the part I am having trouble with: at some point I have a lot of calls and I get lots of "ERROR IN QUERY EXECUTION" for the reason explained above.
const pool3 = new Pool(credentialsPostGres);
let res = [];
let sql_call = "select colum1 from table2 where x = y"; //the real query is a bit more complex, but you get the idea.
poll_query.query(sql_call,(err,results) => {
  if (err) {
    pool3.end();
    console.log(err + " ERROR IN QUERY EXECUTION");
  } else {
    res.push({ data: Object.values(JSON.parse(JSON.stringify(results.rows))) });
    pool3.end();
    return callback(res,data);
  }
})
How should I move this part into a queue? I am a bit lost.
Help!
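A rough sketch of the queue idea described in this question, written so that it caps the number of queries running at once rather than pacing them one per second. The names createQueryQueue and runQuery are hypothetical; runQuery stands for any function that takes SQL and returns a promise, for example a wrapper around the query method of one shared pool that is created once and not ended after each call. Note also that a node-postgres Pool already queues requests once all of its clients (the max option) are busy, so a single long-lived pool may be enough on its own.

```javascript
// A small in-process queue that caps how many queries run at the same time.
// `runQuery` is a hypothetical function (sql) -> Promise supplied by the caller.
function createQueryQueue(runQuery, maxConcurrent) {
  const waiting = []; // queries that have not started yet
  let active = 0;     // queries currently running

  function startNext() {
    if (active >= maxConcurrent || waiting.length === 0) return;
    const { sql, resolve, reject } = waiting.shift();
    active++;
    Promise.resolve()
      .then(() => runQuery(sql))
      .then(resolve, reject)
      .finally(() => {
        active--;
        startNext(); // a slot is free, so start the next waiting query
      });
  }

  // Returns a promise that settles with the query's result or error.
  return function enqueue(sql) {
    return new Promise((resolve, reject) => {
      waiting.push({ sql, resolve, reject });
      startNext();
    });
  };
}
```

Callers would use `enqueue(sql_call).then(...)` in place of calling the pool directly; queries beyond the cap simply wait their turn instead of being rejected.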

Does redis.pipeline.exec() "reset" the pipeline?

I am trying to bulk insert around 500k records into Redis using a pipeline, and my code looks roughly like
const seedCache = async (
  products: Array<Item>,
  chunkSize,
) => {
  for (const chunk of _.chunk(products, chunkSize)) {
    chunk.forEach((item) => {
      pipeline.set(item.id, item.data);
    });
    await pipeline.exec();
    console.log(pipeline.length);
  }
  redis.quit();
};
Essentially, I load chunkSize items into the pipeline, then wait until pipeline.exec() returns, then continue.
I expected that "console.log(pipeline.length)" would be printing "0" every time, since it is only getting run after the pipeline has been flushed to Redis. However, I'm finding that pipeline.length is never getting reset to 0; instead, it just grows and grows until its length is equal to products.length by the end. This is causing my machine to run out of memory for large datasets.
Does anybody know why this is happening? Also, is this even the correct way to bulk insert records into Redis? I ask because running this script with 5000 products and a batch size of 100 only inserts 200 into the cache, whereas it does successfully insert an array of 1000 products with the same batch size. The documents are quite large (~5kB), so it needs to be done in batches somehow.
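One possible reading of the behaviour described here is that the same pipeline object is reused for every chunk, so commands keep accumulating on it. A sketch of the alternative, assuming an ioredis-style client whose pipeline() method returns a new pipeline with set() and exec(); the redis parameter and the returned count are additions for illustration, and the chunking is done with slice instead of lodash.

```javascript
// Write items in chunks, creating a fresh pipeline for every chunk.
// `redis` is assumed to be an ioredis-style client: redis.pipeline() returns a
// new pipeline object, and exec() resolves with one reply per queued command.
async function seedCache(redis, products, chunkSize) {
  let written = 0;
  for (let start = 0; start < products.length; start += chunkSize) {
    const pipeline = redis.pipeline(); // new pipeline, so nothing carries over
    for (const item of products.slice(start, start + chunkSize)) {
      pipeline.set(item.id, item.data);
    }
    const replies = await pipeline.exec(); // wait before queuing the next chunk
    written += replies.length;
  }
  return written;
}
```

With this shape each pipeline holds at most chunkSize commands and can be garbage collected after its exec() resolves.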

preventing race conditions with nodejs

I'm writing an application using Node.js 6.3.0 and AWS DynamoDB.
DynamoDB holds statistics that are added by 10 different functions (10 different statistical measures). The interval is set to 10 seconds, which means that every 10 seconds, 10 calls to my function are made to add all the relevant information.
the putItem function:
function putItem(tableName,itemData,callback) {
  var params = {
    TableName: tableName,
    Item: itemData
  };
  docClient.put(params, function(err, data) {
    if (err) {
      logger.error(params,"putItem failed in dynamodb");
      callback(err,null);
    } else {
      callback(null,data);
    }
  });
}
now... I created a queue.
var queue = require('./dynamoDbQueue').queue;
that implements a simple queue with fixed size that I took from http://www.bennadel.com/blog/2308-creating-a-fixed-length-queue-in-javascript-using-arrays.htm.
The idea is that if there is a network problem, let's say for a minute, I want all the events to be pushed to the queue, and when the problem is resolved, to send the queued information to DynamoDB and empty the queue.
so I modified my original function to the following code:
function putItem(tableName,itemData,callback) {
  var params = {
    TableName: tableName,
    Item: itemData
  };
  if (queue.length>0) {
    queue.push(params);
    callback(null,null);
  } else {
    docClient.put(params, function (err, data) {
      if (err) {
        queue.push(params);
        logger.error(params, "putItem failed in dynamodb");
        handleErroredQueue(); // imaginary function that i need to implement
        callback(err, null);
      } else {
        callback(null, data);
      }
    });
  }
}
But since I have 10 insert functions that run in the same second, there is a chance of race conditions, which means that:
execute1 - one function checks that the queue is empty and is about to call docClient.put().
execute2 - at the same time, another function returns from docClient.put() with an error and, as a result, adds its row to the queue as the first entry.
execute1 - by the time the first function calls docClient.put(), the problem has been resolved and it successfully inserts its data into DynamoDB, which leaves the queue holding older data that will only be released in the next iteration.
So, for example, if I insert 4 rows with ids 1,2,3,4, the order in which the rows are inserted into DynamoDB is 1,2,4,3.
Is there a way to resolve that?
Thanks!
I think you are on the right track, but instead of checking for an error and then adding to the queue, I would suggest adding every operation to the queue first and then always reading the data from the queue.
For instance, in your case you call functions 1,2,3,4 and the result is 1,2,4,3 because you only use the queue at the time of an error or interrupted operation.
Step 1: All your functions make an entry in a queue -> 1,2,3,4
Step 2: Read your queue and make an insert; on success remove the element, otherwise redo the operation. This way the inserts happen in the desired sequence.
Another advantage is that because you are using a queue, you don't have to provision very high throughput for the table.
Edit:
I guess you just need to ensure that you start your next operation only after the first one has completed, and not before.
e.g.: fn 1 -> read from queue (don't delete from the queue yet) -> operation completed? If not, perform it again -> delete from queue -> perform next operation.
You just have to make sure you read from the queue and wait until you get a response from DynamoDB.
Hope this helps.
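The two steps in this answer can be sketched roughly as follows. The names createOrderedWriter, putFn and retryDelayMs are hypothetical; putFn stands for any function that takes the put parameters and returns a promise (for example a promisified docClient.put). The queue here is unbounded, unlike the fixed-size queue mentioned in the question.

```javascript
// Every write goes through one FIFO queue. A single drain loop sends the head
// item, removes it only after success, and retries the same item on failure.
// `putFn` is a hypothetical function (params) -> Promise supplied by the caller;
// `retryDelayMs` is an illustrative setting.
function createOrderedWriter(putFn, retryDelayMs = 1000) {
  const queue = [];
  let draining = false;

  async function drain() {
    if (draining) return; // a single drain loop is what preserves the order
    draining = true;
    while (queue.length > 0) {
      try {
        await putFn(queue[0]); // peek at the head, do not remove it yet
        queue.shift();         // remove only after the write succeeded
      } catch (err) {
        // Leave the item at the head and try it again after a pause.
        await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
      }
    }
    draining = false;
  }

  return {
    putItem(tableName, itemData) {
      queue.push({ TableName: tableName, Item: itemData });
      drain();
    },
    pending: () => queue.length,
  };
}
```

Because nothing is written outside the drain loop, a later item can never overtake an earlier one that is still being retried.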

Node calling postgres function with temp tables causing "memory leak"

I have a node.js program calling a Postgres (Amazon RDS micro instance) function, get_jobs within a transaction, 18 times a second using the node-postgres package by brianc.
The node code is just an enhanced version of brianc's basic client pooling example, roughly like...
var pg = require('pg');
var conString = "postgres://username:password@server/database";
function getJobs(cb) {
  pg.connect(conString, function(err, client, done) {
    if (err) return console.error('error fetching client from pool', err);
    client.query("BEGIN;");
    client.query('select * from get_jobs()', [], function(err, result) {
      client.query("COMMIT;");
      done(); //call `done()` to release the client back to the pool
      if (err) console.error('error running query', err);
      cb(err, result);
    });
  });
}
function poll() {
  getJobs(function(err, jobs) {
    // process the jobs
  });
  setTimeout(poll, 55);
}
poll(); // start polling
So Postgres is getting:
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: statement: BEGIN;
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: execute <unnamed>: select * from get_jobs();
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: statement: COMMIT;
... repeated every 55ms.
get_jobs is written with temp tables, something like this
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
  ...
) AS
$BODY$
DECLARE
  _nowstamp bigint;
BEGIN
  -- take the current unix server time in ms
  _nowstamp := (select extract(epoch from now()) * 1000)::bigint;
  -- 1. get the jobs that are due
  CREATE TEMP TABLE jobs ON COMMIT DROP AS
  select ...
  from really_big_table_1
  where job_time < _nowstamp;
  -- 2. get other stuff attached to those jobs
  CREATE TEMP TABLE jobs_extra ON COMMIT DROP AS
  select ...
  from really_big_table_2 r
  inner join jobs j on r.id = j.some_id;
  ALTER TABLE jobs_extra ADD PRIMARY KEY (id);
  -- 3. return the final result with a join to a third big table
  RETURN query (
    select je.id, ...
    from jobs_extra je
    left join really_big_table_3 r on je.id = r.id
    group by je.id
  );
END
$BODY$ LANGUAGE plpgsql VOLATILE;
I've used the temp table pattern because I know that jobs will always be a small extract of rows from really_big_table_1, in hopes that this will scale better than a single query with multiple joins and multiple where conditions. (I used this to great effect with SQL Server and I don't trust any query optimiser now, but please tell me if this is the wrong approach for Postgres!)
The query runs in 8ms on small tables (as measured from node), ample time to complete one job "poll" before the next one starts.
Problem: After about 3 hours of polling at this rate, the Postgres server runs out of memory and crashes.
What I tried already...
If I re-write the function without temp tables, Postgres doesn't run out of memory, but I use the temp table pattern a lot, so this isn't a solution.
If I stop the node program (which kills the 10 connections it uses to run the queries) the memory frees up. Merely making node wait a minute between polling sessions doesn't have the same effect, so there are obviously resources that the Postgres backend associated with the pooled connection is keeping.
If I run a VACUUM while polling is going on, it has no effect on memory consumption and the server continues on its way to death.
Reducing the polling frequency only changes the amount of time before the server dies.
Adding DISCARD ALL; after each COMMIT; has no effect.
Explicitly calling DROP TABLE jobs; DROP TABLE jobs_extra; after RETURN query () instead of using ON COMMIT DROP on the CREATE TABLE statements. Server still crashes.
Per CFrei's suggestion, added pg.defaults.poolSize = 0 to the node code in an attempt to disable pooling. The server still crashed, but took much longer and swap went much higher (second spike) than all the previous tests which looked like the first spike below. I found out later that pg.defaults.poolSize = 0 may not disable pooling as expected.
On the basis of this: "Temporary tables cannot be accessed by autovacuum. Therefore, appropriate vacuum and analyze operations should be performed via session SQL commands.", I tried to run a VACUUM from the node server (as an attempt to make VACUUM an "in session" command). I couldn't actually get this test working. I have many objects in my database, and VACUUM, operating on all objects, was taking too long to execute on each job iteration. Restricting VACUUM just to the temp tables was impossible: (a) you can't run VACUUM in a transaction and (b) outside the transaction the temp tables don't exist. :P EDIT: Later, on the Postgres IRC channel, a helpful chap explained that VACUUM isn't relevant for the temp tables themselves, but can be useful for cleaning up the rows that temp tables cause to be created in and deleted from pg_attribute. In any case, VACUUMing "in session" wasn't the answer.
DROP TABLE IF EXISTS ... before the CREATE TABLE, instead of ON COMMIT DROP. Server still dies.
CREATE TEMP TABLE (...) followed by insert into ... (select ...) instead of CREATE TEMP TABLE ... AS, and without ON COMMIT DROP. Server dies.
So is ON COMMIT DROP not releasing all the associated resources? What else could be holding memory? How do I release it?
I used this to great effect with SQL Server and I don't trust any query optimiser now
Then don't use them. You can still execute queries directly, as shown below.
but please tell me if this is the wrong approach for Postgres!
It is not a completely wrong approach, it's just a very awkward one, as you are trying to create something that's been implemented by others for a much easier use. As a result, you are making many mistakes that can lead to many problems, including memory leaks.
Compare to the simplicity of the exact same example that uses pg-promise:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);
function getJobs() {
  return db.tx(function (t) {
    return t.func('get_jobs');
  });
}
function poll() {
  getJobs()
    .then(function (jobs) {
      // process the jobs
    })
    .catch(function (error) {
      // error
    });
  setTimeout(poll, 55);
}
poll(); // start polling
Gets even simpler when using ES6 syntax:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);
function poll() {
  db.tx(t=>t.func('get_jobs'))
    .then(jobs=> {
      // process the jobs
    })
    .catch(error=> {
      // error
    });
  setTimeout(poll, 55);
}
poll(); // start polling
The only thing that I didn't quite understand in your example - the use of a transaction to execute a single SELECT. This is not what transactions are generally for, as you are not changing any data. I assume you were trying to shrink a real piece of code you had that changes some data also.
In case you don't need a transaction, your code can be further reduced to:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);
function poll() {
  db.func('get_jobs')
    .then(jobs=> {
      // process the jobs
    })
    .catch(error=> {
      // error
    });
  setTimeout(poll, 55);
}
poll(); // start polling
UPDATE
It would be a dangerous approach, however, to schedule the next poll without waiting for the previous request to finish, as overlapping requests may also create memory/connection issues.
A safe approach should be:
function poll() {
  db.tx(t=>t.func('get_jobs'))
    .then(jobs=> {
      // process the jobs
      setTimeout(poll, 55);
    })
    .catch(error=> {
      // error
      setTimeout(poll, 55);
    });
}
Use CTEs to create partial result sets instead of temp tables.
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
  ...
) AS
$BODY$
DECLARE
  _nowstamp bigint;
BEGIN
  -- take the current unix server time in ms
  _nowstamp := (select extract(epoch from now()) * 1000)::bigint;
  RETURN query (
    -- 1. get the jobs that are due
    WITH jobs AS (
      select ...
      from really_big_table_1
      where job_time < _nowstamp
    -- 2. get other stuff attached to those jobs
    ), jobs_extra AS (
      select ...
      from really_big_table_2 r
      inner join jobs j on r.id = j.some_id
    )
    -- 3. return the final result with a join to a third big table
    select je.id, ...
    from jobs_extra je
    left join really_big_table_3 r on je.id = r.id
    group by je.id
  );
END
$BODY$ LANGUAGE plpgsql VOLATILE;
The planner will evaluate each block in sequence the way I wanted to achieve with temp tables.
I know this doesn't directly solve the memory leak issue (I'm pretty sure there's something wrong with Postgres' implementation of temp tables, at least the way they manifest on the RDS configuration).
However, the query works, it is planned the way I intended, memory usage has been stable after 3 days of running the job, and my server doesn't crash.
I didn't change the node code at all.

Request rate is large

I'm using Azure DocumentDB and accessing it through Node.js on an Express server. When I query in a loop at a low volume of a few hundred requests, there is no issue.
But when I query in a loop at a slightly larger volume, say around a thousand or more:
I get partial results (inconsistent; the result values are not the same on every run, maybe because of the asynchronous nature of Node.js)
after a few results it crashes with this error
body: '{"code":"429","message":"Message: {\"Errors\":[\"Request rate is large\"]}\r\nActivityId: 1fecee65-0bb7-4991-a984-292c0d06693d, Request URI: /apps/cce94097-e5b2-42ab-9232-6abd12f53528/services/70926718-b021-45ee-ba2f-46c4669d952e/partitions/dd46d670-ab6f-4dca-bbbb-937647b03d97/replicas/130845018837894542p"}' }
Does this mean DocumentDB fails to handle 1000+ requests per second?
Altogether this is giving me a bad impression of NoSQL techniques. Is it a shortcoming of DocumentDB?
As Gaurav suggests, you may be able to avoid the problem by bumping up the pricing tier, but even if you go to the highest tier, you should be able to handle 429 errors. When you get a 429 error, the response will include an 'x-ms-retry-after-ms' header. This will contain a number representing the number of milliseconds that you should wait before retrying the request that caused the error.
I wrote logic to handle this in my documentdb-utils node.js package. You can either try to use documentdb-utils or you can duplicate it yourself. Here is a snippet example.
createDocument = function() {
  client.createDocument(colLink, document, function(err, response, header) {
    if (err != null) {
      if (err.code === 429) {
        // Throttled: wait the number of milliseconds the service asks for, then retry.
        var retryAfterHeader = header['x-ms-retry-after-ms'] || 1;
        var retryAfter = Number(retryAfterHeader);
        return setTimeout(createDocument, retryAfter);
      } else {
        throw new Error(JSON.stringify(err));
      }
    } else {
      log('document saved successfully');
    }
  });
};
Note, in the above example document is within the scope of createDocument. This makes the retry logic a bit simpler, but if you don't like using widely scoped variables, then you can pass document in to createDocument and then pass it into a lambda function in the setTimeout call.
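A sketch of the variant described in this note, where document is passed in as an argument and handed to a function inside the setTimeout call. The function name createDocumentWithRetry and the done callback are hypothetical additions; client is assumed to expose createDocument(colLink, document, callback) with the same (err, response, header) callback shape as the snippet above. Unlike the snippet, non-429 errors are passed to done rather than thrown.

```javascript
// Retry on 429 with `document` passed in rather than captured from an outer scope.
// `client.createDocument(colLink, document, cb(err, response, header))` is assumed
// to match the call in the original snippet; `done` is a hypothetical callback.
function createDocumentWithRetry(client, colLink, document, done) {
  client.createDocument(colLink, document, function (err, response, header) {
    if (err && err.code === 429) {
      // Wait as long as the service asks; fall back to 1 ms if the header is missing.
      const retryAfter = Number((header && header['x-ms-retry-after-ms']) || 1);
      setTimeout(function () {
        createDocumentWithRetry(client, colLink, document, done);
      }, retryAfter);
      return;
    }
    done(err || null, response); // success, or an error other than 429
  });
}
```

This keeps each document tied to its own retry chain, which matters when many documents are being written in a loop.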

Resources