Massive inserts with pg-promise - node.js

I'm using pg-promise and I want to make multiple inserts to one table. I've seen some solutions like Multi-row insert with pg-promise and How do I properly insert multiple rows into PG with node-postgres?, and I could use pgp.helpers.concat in order to concatenate multiple selects.
But now I need to insert a lot of measurements into a table, more than 10,000 records, and https://github.com/vitaly-t/pg-promise/wiki/Performance-Boost says:
"How many records you can concatenate like this - depends on the size of the records, but I would never go over 10,000 records with this approach. So if you have to insert many more records, you would want to split them into such concatenated batches and then execute them one by one."
I read the whole article, but I can't figure out how to "split" my inserts into batches and then execute them one by one.
Thanks!

UPDATE
Best is to read the following article: Data Imports.
As the author of pg-promise I was compelled to finally provide the right answer to the question, as the one published earlier didn't really do it justice.
To insert a massive or unbounded number of records, base your approach on method sequence, which is available within tasks and transactions.
var cs = new pgp.helpers.ColumnSet(['col_a', 'col_b'], {table: 'tableName'});
// returns a promise with the next array of data objects,
// while there is data, or an empty array when no more data left
function getData(index) {
if (/*still have data for the index*/) {
// - resolve with the next array of data
} else {
// - resolve with an empty array, if no more data left
// - reject, if something went wrong
}
}
function source(index) {
var t = this;
return getData(index)
.then(data => {
if (data.length) {
// while there is still data, insert the next bunch:
var insert = pgp.helpers.insert(data, cs);
return t.none(insert);
}
// returning nothing/undefined ends the sequence
});
}
db.tx(t => t.sequence(source))
.then(data => {
// success
})
.catch(error => {
// error
});
This is the best approach to inserting a massive number of rows into the database, in terms of both performance and load throttling.
All you have to do is implement your function getData according to the logic of your app, i.e. where your large data is coming from, based on the index of the sequence, to return some 1,000 - 10,000 objects at a time, depending on the size of objects and data availability.
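As a minimal sketch of such a getData, assuming the whole data set is already held in memory and that the sequence index starts at 0: the names allRows and PAGE_SIZE are made up for illustration, and a real application would more likely read each page from a file, a stream, or another service.

```javascript
// Assumption: all rows are already in memory; replace this with your real data source.
const allRows = Array.from({length: 2500}, (_, i) => ({col_a: i, col_b: i * 2}));
const PAGE_SIZE = 1000; // hypothetical batch size, within the 1,000 - 10,000 range

// Resolves with the page of rows for the given sequence index,
// or with an empty array once the index is past the end of the data.
function getData(index) {
    const start = index * PAGE_SIZE;
    return Promise.resolve(allRows.slice(start, start + PAGE_SIZE));
}
```

With 2,500 rows and a page size of 1,000, indexes 0 and 1 resolve with full pages, index 2 with the remaining 500 rows, and index 3 with an empty array, which is what ends the sequence.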
See also some API examples:
spex -> sequence
Linked and Detached Sequencing
Streaming and Paging
Related question: node-postgres with massive amount of queries.
And in cases where you need to acquire generated id-s of all the inserted records, you would change the two lines as follows:
// return t.none(insert);
return t.map(insert + ' RETURNING id', [], a => +a.id);
and
// db.tx(t => t.sequence(source))
db.tx(t => t.sequence(source, {track: true}))
Just be careful: keeping too many record id-s in memory can create an overload.

I think the naive approach would work.
Try to split your data into multiple pieces of 10,000 records or less.
I would try splitting the array using the solution from this post.
Then, multi-row insert each array with pg-promise and execute them one by one in a transaction.
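The splitting step could look roughly like this. It is only a sketch: chunk is a hypothetical helper, not part of pg-promise, and the measurements array is fake data made up for the example.

```javascript
// Hypothetical helper: split `items` into consecutive batches of at most `size` elements.
function chunk(items, size) {
    if (!Number.isInteger(size) || size <= 0) {
        throw new RangeError('size must be a positive integer');
    }
    const batches = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Example: 25,000 fake measurements become three batches.
const measurements = Array.from({length: 25000}, (_, i) => ({col_a: i, col_b: 'v' + i}));
const splittedData = chunk(measurements, 10000);
console.log(splittedData.map(b => b.length)); // [ 10000, 10000, 5000 ]
```

Each element of splittedData is then small enough to be turned into one multi-row insert.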
Edit: Thanks to @vitaly-t for the wonderful library and for improving my answer.
Also, don't forget to wrap your queries in a transaction, or else it will deplete the connections.
To do this, use the batch function from pg-promise to resolve all queries asynchronously:
// split your array here to get splittedData
var cs = new pgp.helpers.ColumnSet(['col_a', 'col_b'], {table: 'tmp'})
// splittedData = [..,[{col_a: 'a1', col_b: 'b1'}, {col_a: 'a2', col_b: 'b2'}]]
let queries = []
for (var i = 0; i < splittedData.length; i++) {
    // one multi-row INSERT string per batch
    var query = pgp.helpers.insert(splittedData[i], cs)
    queries.push(query)
}
db.tx(function (t) {
    // execute every INSERT inside the transaction and return the batch promise
    return t.batch(queries.map(function (q) { return t.none(q) }))
})
.then(function (data) {
    // all records inserted successfully !
})
.catch(function (error) {
    // error;
});

Related

How to read an individual column from Dynamo-Db without using Scan in Node-js?

I have 4.5 million records in my DynamoDB table.
I want to read the id of each record in batches.
I am expecting something like offset and limit, the way we can read in MongoDB.
Are there any suggestions for doing this without the scan method in Node.js?
I have done a fair amount of research, and I can only find the scan method, which buffers the complete set of records from DynamoDB and then starts scanning them, which is not effective performance-wise.
Please give me a suggestion.
From my point of view, there's no problem doing scans because (according to the Scan doc):
DynamoDB paginates the results from Scan operations
You can use the ProjectionExpression parameter so that Scan only returns some of the attributes, rather than all of them
The default size for pages is 1MB, but you can also specify the max number of items per page with the Limit parameter.
So it's just basic pagination, the same thing MongoDB does with offset and limit.
Here is an example from the docs of how to perform Scan with the node.js SDK.
Now, if you want to get all the IDs in batches, you could wrap the whole thing in a Promise and resolve it when there's no LastEvaluatedKey.
Below is some pseudo-code showing what you could do:
const AWS = require('aws-sdk');
const performScan = () => new Promise((resolve, reject) => {
    const docClient = new AWS.DynamoDB.DocumentClient();
    const params = {
        TableName: "YOUR_TABLE_NAME",
        ProjectionExpression: "id",
        Limit: 100 // only if you want something other than the default 1MB page; 100 means 100 items
    };
    let items = [];
    const scanExecute = () => {
        docClient.scan(params, (err, result) => {
            if (err) return reject(err);
            items = items.concat(result.Items);
            if (result.LastEvaluatedKey) {
                // more pages left: continue from where the last page ended
                params.ExclusiveStartKey = result.LastEvaluatedKey;
                return scanExecute();
            }
            // no LastEvaluatedKey means the scan is complete
            resolve(items);
        });
    };
    scanExecute();
});
performScan().then(items => {
// deal with it
});
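The same pagination loop can be sketched with async/await. Here scanAll, scanPage, fakeTable and fakeScan are hypothetical names introduced for this example; fakeScan is a stub standing in for DynamoDB so the sketch runs without AWS credentials. With the AWS SDK v2, the page function could presumably be params => docClient.scan(params).promise().

```javascript
// Collects every item by following LastEvaluatedKey until no more pages are reported.
// `scanPage` is a hypothetical function: (params) => Promise<{Items, LastEvaluatedKey}>.
async function scanAll(scanPage, baseParams) {
    const items = [];
    let startKey; // undefined on the first request
    do {
        const params = startKey ? {...baseParams, ExclusiveStartKey: startKey} : baseParams;
        const page = await scanPage(params);
        items.push(...page.Items);
        startKey = page.LastEvaluatedKey;
    } while (startKey);
    return items;
}

// Stub that mimics paginated Scan responses over a tiny in-memory table.
const fakeTable = [{id: 1}, {id: 2}, {id: 3}, {id: 4}, {id: 5}];
function fakeScan(params) {
    const from = params.ExclusiveStartKey ? params.ExclusiveStartKey.offset : 0;
    const to = from + params.Limit;
    return Promise.resolve({
        Items: fakeTable.slice(from, to),
        LastEvaluatedKey: to < fakeTable.length ? {offset: to} : undefined
    });
}

scanAll(fakeScan, {TableName: 'YOUR_TABLE_NAME', ProjectionExpression: 'id', Limit: 2})
    .then(items => console.log(items.length)); // 5
```

Injecting the page function keeps the loop independent of the SDK, which also makes it easy to process each page as it arrives instead of accumulating 4.5 million ids in memory.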
The first thing to know about DynamoDB is that it is a key-value store with support for secondary indexes.
DynamoDB is a bad choice if the application often has to iterate over the entire data set without using indexes (primary or secondary), because the only way to do that is to use the Scan API.
DynamoDB table scans are (a few things I can think of):
Expensive (I mean $$$)
Slow for big data sets
Likely to use up the provisioned throughput
If you know the primary key of all the items in DynamoDB (through some external knowledge, e.g. the primary key is an auto-incremented value or is referenced in another DB), then you can use BatchGetItem or Query.
So if it is a one-off thing, then Scan is your only option; otherwise you should look into refactoring your application to remove this scenario.

Google Datastore combine (union) multiple sets of entity results to achieve OR condition

I am working with NodeJS on Google App Engine with the Datastore database.
Because Datastore does not support the OR operator, I need to run multiple queries and combine the results.
I am planning to run multiple queries and then combine the results into a single array of entity objects. I have a single query working already.
Question: What is a reasonably efficient way to combine two (or more) sets of entities returned by Datastore including de-duplication? I believe this would be a "union" operation in terms of set theory.
Here is the basic query outline that will be run multiple times with some varying filters to achieve the OR conditions required.
//Set requester username
const requester = req.user.userName;
//Create datastore query on Transfer Request kind table
const task_history = datastore.createQuery('Task');
//Set query conditions
task_history.filter('requester', requester);
//Run datastore query
datastore.runQuery(task_history, function(err, entities) {
if(err) {
console.log('Task History JSON unable to return data results. Error message: ', err);
return;
//If query works and returns any entities
} else if (entities[0]) {
//Else if query works but does not return any entities return empty JSON response
res.json(entities); //HOW TO COMBINE (UNION) MULTIPLE SETS OF ENTITIES EFFICIENTLY?
return;
}
});
Here is my original post: Google Datastore filter with OR condition
IMHO the most efficient way would be to use Keys-only queries in the 1st stage, then perform the combination of the keys obtained into a single list (including deduplication), followed by obtaining the entities simply by key lookup. From Projection queries:
Keys-only queries
A keys-only query (which is a type of projection query) returns just
the keys of the result entities instead of the entities themselves, at
lower latency and cost than retrieving entire entities.
It is often more economical to do a keys-only query first, and then
fetch a subset of entities from the results, rather than executing a
general query which may fetch more entities than you actually need.
Here's how to create a keys-only query:
const query = datastore.createQuery()
.select('__key__')
.limit(1);
This method addresses several problems you may encounter when trying to directly combine lists of entities obtained through regular, non-keys-only queries:
you can't de-duplicate properly because you can't tell the difference between different entities with identical values and the same entity appearing in multiple query results
comparing entities by property values can be tricky and is definitely slower/more computationally expensive than comparing just entity keys
if you can't process all the results in a single request you're incurring unnecessary datastore costs for reading them without actually using them
it is much simpler to split processing of entities in multiple requests (via task queues, for example) when handling just entity keys
There are some disadvantages as well:
it may be a bit slower because you're going to the datastore twice: once for the keys and once to get the actual entities
you can't take advantage of getting just the properties you need via non-keys-only projection queries
Here is the solution I created based on the advice provided in the accepted answer.
/*History JSON*/
module.exports.treqHistoryJSON = function(req, res) {
if (!req.user) {
req.user = {};
res.json();
return;
}
//Set Requester username
const loggedin_username = req.user.userName;
//Get records matching Requester OR Dataowner
//Google Datastore OR Conditions are not supported
//Workaround separate parallel queries get records matching Requester and Dataowner then combine results
async.parallel({
//Get entity keys matching Requester
requesterKeys: function(callback) {
getKeysOnly('TransferRequest', 'requester_username', loggedin_username, (treqs_by_requester) => {
//Callback pass in response as parameter
callback(null, treqs_by_requester)
});
},
//Get entity keys matching Dataowner
dataownerKeys: function(callback) {
getKeysOnly('TransferRequest', 'dataowner_username', loggedin_username, (treqs_by_dataowner) => {
callback(null, treqs_by_dataowner)
});
}
}, function(err, getEntities) {
if (err) {
console.log('Transfer Request History JSON unable to get entity keys Transfer Request. Error message: ', err);
return;
} else {
//Combine two arrays of entity keys into a single de-duplicated array of entity keys
let entity_keys_union = unionEntityKeys(getEntities.requesterKeys, getEntities.dataownerKeys);
//Get key values from entity key 'symbol' object type
let entity_keys_only = entity_keys_union.map((ent) => {
return ent[datastore.KEY];
});
//Pass in array of entity keys to get full entities
datastore.get(entity_keys_only, function(err, entities) {
if(err) {
console.log('Transfer Request History JSON unable to lookup multiple entities by key for Transfer Request. Error message: ', err);
return;
//If query works and returns any entities
} else {
processEntitiesToDisplay(res, entities);
}
});
}
});
};
/*
* Get keys-only entities by kind and property
* @kind string name of kind
* @property_type string property filtering by in query
* @filter_value string of filter value to match in query
* getEntitiesCallback callback to collect results
*/
function getKeysOnly(kind, property_type, filter_value, getEntitiesCallback) {
//Create datastore query
const keys_query = datastore.createQuery(kind);
//Set query conditions
keys_query.filter(property_type, filter_value);
//Select KEY only
keys_query.select('__key__');
datastore.runQuery(keys_query, function(err, entities) {
if(err) {
console.log('Get Keys Only query unable to return data results. Error message: ', err);
return;
} else {
getEntitiesCallback(entities);
}
});
}
/*
* Union two arrays of entity keys de-duplicate based on ID value
* @arr1 array of entity keys
* @arr2 array of entity keys
*/
function unionEntityKeys(arr1, arr2) {
//Create new array
let arr3 = [];
//For each element in array 1
for(let i in arr1) {
let shared = false;
for (let j in arr2)
//If ID in array 1 is same as array 2 then this is a duplicate
if (arr2[j][datastore.KEY]['id'] == arr1[i][datastore.KEY]['id']) {
shared = true;
break;
}
//If IDs are not the same add element to new array
if(!shared) {
arr3.push(arr1[i])
}
}
//Concat array 2 and new array 3
arr3 = arr3.concat(arr2);
return arr3;
}
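For larger result sets, the nested loops above compare every pair of keys. A Map-based variant does the same union in a single pass. This is only a sketch: unionByKeyId is a hypothetical name, KEY is a local stand-in for datastore.KEY so the example runs without the Datastore client, and it assumes every key carries an id (entities keyed by name would need a different identity).

```javascript
// Stand-in for datastore.KEY (assumption), so the sketch runs on its own.
const KEY = Symbol('KEY');

// Union of any number of entity arrays, de-duplicated by key id.
// The first occurrence of each id wins; insertion order is preserved.
function unionByKeyId(...lists) {
    const seen = new Map();
    for (const list of lists) {
        for (const entity of list) {
            const id = String(entity[KEY].id);
            if (!seen.has(id)) {
                seen.set(id, entity);
            }
        }
    }
    return Array.from(seen.values());
}

const a = [{[KEY]: {id: '1'}}, {[KEY]: {id: '2'}}];
const b = [{[KEY]: {id: '2'}}, {[KEY]: {id: '3'}}];
console.log(unionByKeyId(a, b).map(e => e[KEY].id)); // [ '1', '2', '3' ]
```

Note that the result order differs from unionEntityKeys, which appends the whole second array after the unshared elements of the first.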
I just wanted to write in for folks who stumble upon this...
There is a workaround for some cases of not having the OR operator if you can restructure your data a bit, using Array properties: https://cloud.google.com/datastore/docs/concepts/entities#array_properties
From the documentation:
Array properties can be useful, for instance, when performing queries with equality filters: an entity satisfies the query if any of its values for a property matches the value specified in the filter.
So, if you needed to query for all entities bearing one of multiple potential values, putting all of the possibilities for each entity into an Array property and then indexing it for your query should yield the results you want. (But, you'd need to maintain that additional property, or replace your existing properties with that Array implementation if it could work for all of what you need.)

Inserting multiple records with pg-promise

I have a scenario in which I need to insert multiple records. I have a table structure like id (a foreign key from another table), key (char), value (char). The input to be saved would be an array of the above data. For example:
I have some array objects like:
lst = [];
obj = {};
obj.id= 123;
obj.key = 'somekey';
obj.value = '1234';
lst.push(obj);
obj = {};
obj.id= 123;
obj.key = 'somekey1';
obj.value = '12345';
lst.push(obj);
In MS SQL, I would have created a TVP and passed it. I don't know how to achieve this in Postgres.
So what I want to do is save all the items from the list in a single query in Postgres, using the pg-promise library. I'm not able to find any documentation, or understand it from the documentation. Any help is appreciated. Thanks.
I am the author of pg-promise.
There are two ways to insert multiple records. The first, and most typical way is via a transaction, to make sure all records are inserted correctly, or none of them.
With pg-promise it is done in the following way:
db.tx(t => {
const queries = lst.map(l => {
return t.none('INSERT INTO table(id, key, value) VALUES(${id}, ${key}, ${value})', l);
});
return t.batch(queries);
})
.then(data => {
// SUCCESS
// data = array of null-s
})
.catch(error => {
// ERROR
});
You initiate a transaction with method tx, then create all INSERT query promises, and then resolve them all as a batch.
The second approach is by concatenating all insert values into a single INSERT query, which I explain in detail in Performance Boost. See also: Multi-row insert with pg-promise.
For more examples see Tasks and Transactions.
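To illustrate what the second approach produces, here is a rough, library-free sketch that builds one INSERT with a VALUES tuple per row. buildMultiRowInsert and my_table are hypothetical names for this example only; in practice pg-promise's own helpers generate such queries for you.

```javascript
// Hypothetical helper: builds one INSERT with a VALUES tuple per row, using
// numbered placeholders ($1, $2, ...) and a flat array of values.
// Table and column names are interpolated as-is, so they must come from trusted code.
function buildMultiRowInsert(table, columns, rows) {
    const values = [];
    const tuples = rows.map(row => {
        const placeholders = columns.map(col => {
            values.push(row[col]);
            return '$' + values.length;
        });
        return '(' + placeholders.join(', ') + ')';
    });
    const text = 'INSERT INTO ' + table + '(' + columns.join(', ') + ') VALUES ' + tuples.join(', ');
    return {text, values};
}

const lst = [
    {id: 123, key: 'somekey', value: '1234'},
    {id: 123, key: 'somekey1', value: '12345'}
];
const query = buildMultiRowInsert('my_table', ['id', 'key', 'value'], lst);
console.log(query.text);
// INSERT INTO my_table(id, key, value) VALUES ($1, $2, $3), ($4, $5, $6)
```

The resulting text and values could then be passed to a query method, for example something like db.none(query.text, query.values), so that all rows travel in a single round trip.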
Addition
It is worth pointing out that in most cases we do not insert a record id, rather have it generated automatically. Sometimes we want to get the new id-s back, and in other cases we don't care.
The examples above resolve with an array of null-s, because batch resolves with an array of individual results, and method none resolves with null, according to its API.
Let's assume that we want to generate the new id-s, and that we want to get them all back. To accomplish this we would change the code to the following:
db.tx(t => {
const queries = lst.map(l => {
return t.one('INSERT INTO table(key, value) VALUES(${key}, ${value}) RETURNING id',
l, a => +a.id);
});
return t.batch(queries);
})
.then(data => {
// SUCCESS
// data = array of new id-s;
})
.catch(error => {
// ERROR
});
i.e. the changes are:
we do not insert the id values
we replace method none with one, to get one row/object from each insert
we append RETURNING id to the query to get the value
we add a => +a.id to do the automatic row transformation. See also pg-promise returns integers as strings to understand what that + is for.
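A small sketch of what that transformation does, assuming the driver hands back the id as a string (as it does for 64-bit integer columns); row and toId are illustrative names:

```javascript
// Shape of a row when the id column comes back as a string.
const row = {id: '42'};

// Unary plus converts the string to a JavaScript number.
const toId = a => +a.id;

console.log(toId(row));        // 42
console.log(typeof toId(row)); // number
```

Keep in mind that values above Number.MAX_SAFE_INTEGER cannot be represented exactly as a JavaScript number, so this conversion only suits ids that stay below that limit.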
UPDATE-1
For a high-performance approach via a single INSERT query see Multi-row insert with pg-promise.
UPDATE-2
A must-read article: Data Imports.

How do I see output of SQL query in node-sqlite3?

I read all the documentation and this seemingly simple operation seems completely ignored throughout the entire README.
Currently, I am trying to run a SELECT query and console.log the results, but it is simply returning a database object. How do I view the results from my query in Node console?
exports.runDB = function() {
db.serialize(function() {
console.log(db.run('SELECT * FROM archive'));
});
db.close();
}
run does not have retrieval capabilities. You need to use all, each, or get
According to the documentation for all:
Note that it first retrieves all result rows and stores them in
memory. For queries that have potentially large result sets, use the
Database#each function to retrieve all rows or Database#prepare
followed by multiple Statement#get calls to retrieve a previously
unknown amount of rows.
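As an illustration:
db.all('SELECT url, rowid FROM archive', function(err, rows) {
    // err is set if the query failed; rows is an array of result objects
    if (err) return console.error(err);
    console.log(rows);
});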
That will return all entries in the archive table as an array of objects.

Inserting records without failing on duplicate

I'm inserting a lot of documents in bulk with the latest node.js native driver (2.0).
My collection has an index on the URL field, and I'm bound to get duplicates out of the thousands of lines I insert. Is there a way for MongoDB to not crash when it encounters a duplicate?
Right now I'm batching records 1000 at a time, and Using insertMany. I've tried various things, including adding {continueOnError=true}. I tried inserting my records one by one, but it's just too slow, I have thousands of workers in a queue and can't really afford the delay.
Collection definition :
self.prods = db.collection('products');
self.prods.ensureIndex({url:1},{unique:true}, function() {});
Insert :
MongoProcessor.prototype._batchInsert= function(coll,items){
var self = this;
if(items.length>0){
var batch = [];
var l = items.length;
for (var i = 0; i < 999; i++) {
if(i<l){
batch.push(items.shift());
}
if(i===998){
coll.insertMany(batch, {continueOnError: true},function(err,res){
if(err) console.log(err);
if(res) console.log('Inserted products: '+res.insertedCount+' / '+batch.length);
self._batchInsert(coll,items);
});
}
}
}else{
self._terminate();
}
};
I was thinking of dropping the index before the insert, then reindexing using dropDups, but it seems a bit hacky, my workers are clustered and I have no idea what would happen if they try to insert records while another process is reindexing... Does anyone have a better idea?
Edit :
I forgot to mention one thing. The items I insert have a 'processed' field which is set to 'false'. However the items already in the db may have been processed, so the field can be 'true'. Therefore I can't upsert... Or can I select a field to be untouched by upsert?
The 2.6 Bulk API is what you're looking for, which will require MongoDB 2.6+* and node driver 1.4+.
There are 2 types of bulk operations:
Ordered bulk operations. These operations execute all the operations in order and error out on the first write error.
Unordered bulk operations. These operations execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee order of execution.
So in your case Unordered is what you want. The previous link provides an example:
MongoClient.connect("mongodb://localhost:27017/test", function(err, db) {
// Get the collection
var col = db.collection('batch_write_ordered_ops');
// Initialize the Unordered Batch
var batch = col.initializeUnorderedBulkOp();
// Add some operations to be executed (order of execution is not guaranteed)
batch.insert({a:1});
batch.find({a:1}).updateOne({$set: {b:1}});
batch.find({a:2}).upsert().updateOne({$set: {b:2}});
batch.insert({a:3});
batch.find({a:3}).remove({a:3});
// Execute the operations
batch.execute(function(err, result) {
console.dir(err);
console.dir(result);
db.close();
});
});
*The docs do state that: "for older servers than 2.6 the API will downconvert the operations. However it’s not possible to downconvert 100% so there might be slight edge cases where it cannot correctly report the right numbers."
