Inserting records without failing on duplicate - node.js

I'm inserting a lot of documents in bulk with the latest node.js native driver (2.0).
My collection has an index on the URL field, and I'm bound to get duplicates out of the thousands of lines I insert. Is there a way for MongoDB to not crash when it encounters a duplicate?
Right now I'm batching records 1,000 at a time and using insertMany. I've tried various things, including adding {continueOnError: true}. I tried inserting my records one by one, but it's just too slow; I have thousands of workers in a queue and can't really afford the delay.
Collection definition:
self.prods = db.collection('products');
self.prods.ensureIndex({url:1},{unique:true}, function() {});
Insert:
MongoProcessor.prototype._batchInsert = function(coll, items){
    var self = this;
    if (items.length > 0) {
        var batch = [];
        var l = items.length;
        for (var i = 0; i < 999; i++) {
            if (i < l) {
                batch.push(items.shift());
            }
            if (i === 998) {
                coll.insertMany(batch, {continueOnError: true}, function(err, res){
                    if (err) console.log(err);
                    if (res) console.log('Inserted products: ' + res.insertedCount + ' / ' + batch.length);
                    self._batchInsert(coll, items);
                });
            }
        }
    } else {
        self._terminate();
    }
};
I was thinking of dropping the index before the insert, then reindexing using dropDups, but it seems a bit hacky, my workers are clustered and I have no idea what would happen if they try to insert records while another process is reindexing... Does anyone have a better idea?
Edit :
I forgot to mention one thing. The items I insert have a 'processed' field which is set to 'false'. However the items already in the db may have been processed, so the field can be 'true'. Therefore I can't upsert... Or can I select a field to be untouched by upsert?
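Regarding that last edit: there is a direct answer. $setOnInsert lets an upsert write a field only when the document is newly inserted, leaving the value on existing documents untouched. A minimal sketch, assuming a unique index on url (the buildUpsertSpec helper and the price field are illustrative, not from the original code):

```javascript
// Hypothetical helper: builds an upsert spec where 'processed' is only
// written when the document is newly inserted, never on an existing doc.
function buildUpsertSpec(item) {
  var filter = { url: item.url };
  var update = {
    $set: { price: item.price },           // fields you do want to overwrite
    $setOnInsert: { processed: false }     // untouched if the doc already exists
  };
  return { filter: filter, update: update };
}

// usage with the driver (assumed connected `coll`):
// var spec = buildUpsertSpec(item);
// coll.updateOne(spec.filter, spec.update, { upsert: true });
```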

The 2.6 Bulk API is what you're looking for; it requires MongoDB 2.6+* and node driver 1.4+.
There are 2 types of bulk operations:
Ordered bulk operations. These execute all the operations in order and error out on the first write error.
Unordered bulk operations. These execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee order of execution.
So in your case Unordered is what you want. The previous link provides an example:
MongoClient.connect("mongodb://localhost:27017/test", function(err, db) {
    // Get the collection
    var col = db.collection('batch_write_ordered_ops');
    // Initialize the Unordered Batch
    var batch = col.initializeUnorderedBulkOp();
    // Add some operations to be executed (in no guaranteed order)
    batch.insert({a:1});
    batch.find({a:1}).updateOne({$set: {b:1}});
    batch.find({a:2}).upsert().updateOne({$set: {b:2}});
    batch.insert({a:3});
    batch.find({a:3}).remove();
    // Execute the operations
    batch.execute(function(err, result) {
        console.dir(err);
        console.dir(result);
        db.close();
    });
});
*The docs do state that: "for older servers than 2.6 the API will downconvert the operations. However it’s not possible to downconvert 100% so there might be slight edge cases where it cannot correctly report the right numbers."
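With newer drivers you can get the same effect without the Bulk API: insertMany accepts {ordered: false}, so the server attempts every document and reports duplicate-key failures (error code 11000) instead of stopping at the first one. A sketch under that assumption (the helper name is mine; only the error-classification logic is shown runnable):

```javascript
// Returns true if the error is purely duplicate-key (E11000) noise,
// i.e. the top-level code or every write error has code 11000.
function isDuplicateKeyError(err) {
  if (!err) return false;
  if (Array.isArray(err.writeErrors)) {
    return err.writeErrors.every(function (e) { return e.code === 11000; });
  }
  return err.code === 11000;
}

// Sketch, assuming a connected collection `col`:
// col.insertMany(docs, { ordered: false })
//   .catch(function (err) {
//     if (!isDuplicateKeyError(err)) throw err; // a real failure
//     return err.result; // duplicates were skipped, the rest were inserted
//   });
```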

Related

How to read an individual column from Dynamo-Db without using Scan in Node-js?

I have 4.5 million records in my DynamoDB table.
I want to read the id of each record in batches.
I'm expecting something like offset and limit, the way we can read in MongoDB.
Are there any suggestions for doing this without the Scan method in Node.js?
I've done enough research, and I can only find the Scan method, which buffers the complete set of records from DynamoDB and then starts scanning them, which is not effective performance-wise.
Please do give me suggestions.
From my point of view, there's no problem doing scans because (according to the Scan doc):
DynamoDB paginates the results from Scan operations
You can use the ProjectionExpression parameter so that Scan only returns some of the attributes, rather than all of them
The default size for pages is 1MB, but you can also specify the max number of items per page with the Limit parameter.
So it's just basic pagination, the same thing MongoDB does with offset and limit.
Here is an example from the docs of how to perform Scan with the node.js SDK.
Now, if you want to get all the IDs in batches, you could wrap the whole thing with a Promise and resolve when there's no LastEvaluatedKey.
Below is pseudo-code of what you could do:
const performScan = () => new Promise((resolve, reject) => {
    const docClient = new AWS.DynamoDB.DocumentClient();
    const params = {
        TableName: "YOUR_TABLE_NAME",
        ProjectionExpression: "id",
        Limit: 100 // only if you want something other than the default 1MB page; 100 means 100 items
    };
    let items = [];
    const scanExecute = () => {
        docClient.scan(params, (err, result) => {
            if (err) return reject(err);
            items = items.concat(result.Items);
            if (result.LastEvaluatedKey) {
                // more pages left: continue from where this one stopped
                params.ExclusiveStartKey = result.LastEvaluatedKey;
                return scanExecute();
            }
            return resolve(items);
        });
    };
    scanExecute();
});
performScan().then(items => {
    // deal with it
});
The first thing to know about DynamoDB is that it is a key-value store with support for secondary indexes.
DynamoDB is a bad choice if the application often has to iterate over the entire data set without using indexes (primary or secondary), because the only way to do that is to use the Scan API.
DynamoDB table Scans are (a few things I can think of):
Expensive (I mean $$$)
Slow for big data sets
Liable to use up the provisioned throughput
If you know the primary key of all the items in DynamoDB (from some external knowledge, e.g. the primary key is an auto-incremented value or is referenced in another DB), then you can use BatchGetItem or Query.
So if it is a one-off thing then Scan is your only option; otherwise you should look into refactoring your application to remove this scenario.
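To make the BatchGetItem route concrete: it accepts at most 100 keys per call, so you chunk the known ids first. A sketch (the table name and the 'id' key attribute are assumptions, and retrying UnprocessedKeys is omitted):

```javascript
// Split an array of ids into groups of at most `size` keys
// (BatchGetItem allows at most 100 keys per request).
function chunkKeys(ids, size) {
  var out = [];
  for (var i = 0; i < ids.length; i += size) {
    out.push(ids.slice(i, i + size));
  }
  return out;
}

// Sketch, assuming AWS SDK v2 and a table keyed on 'id':
// var docClient = new AWS.DynamoDB.DocumentClient();
// chunkKeys(allIds, 100).forEach(function (keys) {
//   docClient.batchGet({
//     RequestItems: {
//       YOUR_TABLE_NAME: { Keys: keys.map(function (id) { return { id: id }; }) }
//     }
//   }, function (err, data) {
//     // items are in data.Responses.YOUR_TABLE_NAME
//   });
// });
```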

Massive inserts with pg-promise

I'm using pg-promise and I want to make multiple inserts to one table. I've seen some solutions like Multi-row insert with pg-promise and How do I properly insert multiple rows into PG with node-postgres?, and I could use pgp.helpers.concat in order to concatenate multiple selects.
But now, I need to insert a lot of measurements in a table, with more than 10,000 records, and in https://github.com/vitaly-t/pg-promise/wiki/Performance-Boost says:
"How many records you can concatenate like this - depends on the size of the records, but I would never go over 10,000 records with this approach. So if you have to insert many more records, you would want to split them into such concatenated batches and then execute them one by one."
I read the whole article, but I can't figure out how to "split" my inserts into batches and then execute them one by one.
Thanks!
UPDATE
Best is to read the following article: Data Imports.
As the author of pg-promise, I was compelled to finally provide the right answer to the question, as the one published earlier didn't really do it justice.
In order to insert a massive/infinite number of records, your approach should be based on the sequence method, which is available within tasks and transactions.
var cs = new pgp.helpers.ColumnSet(['col_a', 'col_b'], {table: 'tableName'});

// returns a promise with the next array of data objects,
// while there is data, or an empty array when no more data left
function getData(index) {
    if (/*still have data for the index*/) {
        // - resolve with the next array of data
    } else {
        // - resolve with an empty array, if no more data left
        // - reject, if something went wrong
    }
}

function source(index) {
    var t = this;
    return getData(index)
        .then(data => {
            if (data.length) {
                // while there is still data, insert the next bunch:
                var insert = pgp.helpers.insert(data, cs);
                return t.none(insert);
            }
            // returning nothing/undefined ends the sequence
        });
}

db.tx(t => t.sequence(source))
    .then(data => {
        // success
    })
    .catch(error => {
        // error
    });
This is the best approach to inserting massive number of rows into the database, from both performance point of view and load throttling.
All you have to do is implement your function getData according to the logic of your app, i.e. where your large data is coming from, based on the index of the sequence, to return some 1,000 - 10,000 objects at a time, depending on the size of objects and data availability.
See also some API examples:
spex -> sequence
Linked and Detached Sequencing
Streaming and Paging
Related question: node-postgres with massive amount of queries.
And in cases where you need to acquire the generated id-s of all the inserted records, you would change the two lines as follows:
// return t.none(insert);
return t.map(insert + ' RETURNING id', [], a => +a.id);
and
// db.tx(t => t.sequence(source))
db.tx(t => t.sequence(source, {track: true}))
just be careful, as keeping too many record id-s in memory can create an overload.
I think the naive approach would work.
Try to split your data into multiple pieces of 10,000 records or less.
I would try splitting the array using the solution from this post.
Then, multi-row insert each array with pg-promise and execute them one by one in a transaction.
Edit : Thanks to #vitaly-t for the wonderful library and for improving my answer.
Also don't forget to wrap your queries in a transaction, or else it
will deplete the connections.
To do this, use the batch function from pg-promise to resolve all queries asynchronously :
// split your array here to get splittedData
var cs = new pgp.helpers.ColumnSet(['col_a', 'col_b'], {table: 'tmp'})
// splittedData = [.., [{col_a: 'a1', col_b: 'b1'}, {col_a: 'a2', col_b: 'b2'}]]
let queries = []
for (var i = 0; i < splittedData.length; i++) {
    var query = pgp.helpers.insert(splittedData[i], cs)
    queries.push(query)
}
db.tx(function (t) {
    // turn each generated INSERT string into an executed query,
    // and resolve them together inside the transaction
    return t.batch(queries.map(function (q) {
        return t.none(q)
    }))
})
    .then(function (data) {
        // all records inserted successfully!
    })
    .catch(function (error) {
        // error;
    });

Enforce limit on mongodb bulk API

I'd like to delete a large number of old documents from one collection and so it makes sense to use the bulk api. Deleting them is as simple as:
var bulk = db.myCollection.initializeUnorderedBulkOp();
bulk.find({
_id: {
$lt: oldestAllowedId
}
}).remove();
bulk.execute();
The only problem is this will attempt to delete every single document matching this criteria and in this case that is millions of documents, so for performance reasons I don't want to delete them all at once. I want to enforce a limit on the operation so that I can do something like bulk.limit(10000).execute(); and space the operations out by a few seconds to prevent locking the database for longer than necessary. However I have been unable to find any options that can be passed to bulk for limiting the number it executes.
Is there a way to limit bulk operations in this manner?
Before anyone mentions it, I know that bulk will split operations into 1000 document chunks automatically, but it will still execute all of those operations sequentially as fast as it can. This results in a much larger performance impact than I can deal with right now.
You can iterate over the array of _id values of the documents that match your query, using the .forEach method. The best way to return that array is with the .distinct() method. You then use "bulk" operations to remove your documents.
var bulk = db.myCollection.initializeUnorderedBulkOp();
var count = 0;
var ids = db.myCollection.distinct('_id', { '_id': { '$lt': oldestAllowedId } });
ids.forEach(function(id) {
    bulk.find({ '_id': id }).removeOne();
    count++;
    if (count % 1000 === 0) {
        // Execute per 1000 operations and re-init
        bulk.execute();
        // Here you can sleep for a while
        bulk = db.myCollection.initializeUnorderedBulkOp();
    }
});
// clean up the remaining queued operations, if any
// (executing an empty bulk throws, hence the modulo check)
if (count % 1000 !== 0) {
    bulk.execute();
}

How do I see output of SQL query in node-sqlite3?

I read all the documentation and this seemingly simple operation seems completely ignored throughout the entire README.
Currently, I am trying to run a SELECT query and console.log the results, but it is simply returning a database object. How do I view the results from my query in Node console?
exports.runDB = function() {
db.serialize(function() {
console.log(db.run('SELECT * FROM archive'));
});
db.close();
}
run does not have retrieval capabilities. You need to use all, each, or get
According to the documentation for all:
Note that it first retrieves all result rows and stores them in
memory. For queries that have potentially large result sets, use the
Database#each function to retrieve all rows or Database#prepare
followed by multiple Statement#get calls to retrieve a previously
unknown amount of rows.
As an illustration:
db.all('SELECT url, rowid FROM archive', function(err, table) {
console.log(table);
});
That will return all entries in the archive table as an array of objects.
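Database#each, mentioned above, takes a per-row callback plus a completion callback; a small wrapper makes the shape clear (collectRows is an illustrative helper, not part of the sqlite3 API):

```javascript
// Receives rows one at a time via db.each and hands the accumulated
// array to `done` once the query completes.
function collectRows(db, sql, done) {
  var rows = [];
  db.each(sql, function (err, row) {
    if (err) return done(err);
    rows.push(row); // called once per row
  }, function (err) {
    done(err, rows); // completion callback
  });
}

// usage (assumed open `db`):
// collectRows(db, 'SELECT url, rowid FROM archive', function (err, rows) {
//   console.log(rows);
// });
```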

How to do a massive random update with MongoDB / NodeJS

I have a MongoDB collection with more than 1,000,000 documents, and I would like to update each document one by one with a dedicated piece of information (each doc has information coming from another collection).
Currently I'm using a cursor that fetches all the data from the collection, and I update each record through the async module of Node.js.
Fetch all docs:
inst.db.collection(association.collection, function(err, collection) {
    collection.find({}, {}, function(err, cursor) {
        cursor.toArray(function(err, items) {
            // ...
        });
    });
});
Update each doc:
items.forEach(function(item) {
    // *** do some stuff with item, add field etc.
    tasks.push(function(nextTask) {
        inst.db.collection(association.collection, function(err, collection) {
            if (err) return nextTask(err, null);
            collection.save(item, nextTask);
        });
    });
});
Call the "save" tasks in parallel:
async.parallel(tasks, function(err, results) {
callback(err, results);
});
How would you do this type of operation in a more efficient way? I mean, how can I avoid the initial "find" that loads a cursor into memory? Is there a way to operate doc by doc, knowing that all docs should be updated?
Thanks for your support.
Your question inspired me to create a Gist to do some performance testing of different approaches to your problem.
Here are the results running on a small EC2 instance with the MongoDB at localhost. The test scenario is to uniquely operate on every document of a 100000 element collection.
108.661 seconds -- Uses find().toArray to pull in all the items at once then replaces the documents with individual "save" calls.
99.645 seconds -- Uses find().toArray to pull in all the items at once then updates the documents with individual "update" calls.
74.553 seconds -- Iterates on the cursor (find().each) with batchSize = 10, then uses individual update calls.
58.673 seconds -- Iterates on the cursor (find().each) with batchSize = 10000, then uses individual update calls.
4.727 seconds -- Iterates on the cursor with batchSize = 10000, and does inserts into a new collection 10000 items at a time.
Though not included, I also did a test with MapReduce used as a server side filter which ran at about 19 seconds. I would have liked to have similarly used "aggregate" as a server side filter, but it doesn't yet have an option to output to a collection.
The bottom line answer is that if you can get away with it, the fastest option is to pull items from an initial collection via a cursor, update them locally and insert them into a new collection in big chunks. Then you can swap in the new collection for the old.
If you need to keep the database active, then the best option is to use a cursor with a big batchSize, and update the documents in place. The "save" call is slower than "update" because it needs to replace whole document, and probably needs to reindex it as well.
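With drivers that postdate this answer, the "cursor with a big batchSize, update in place" option maps naturally onto bulkWrite: build one updateOne operation per document and flush them in chunks. A sketch (the $set payload is illustrative; swap in your per-document fields):

```javascript
// Builds one updateOne operation per document for use with bulkWrite.
// The field being set here is illustrative only.
function buildUpdateOps(items) {
  return items.map(function (item) {
    return {
      updateOne: {
        filter: { _id: item._id },
        update: { $set: { processed: true } }
      }
    };
  });
}

// usage, assuming a connected `collection` and a `batch` of ~10,000 docs
// pulled from the cursor:
// collection.bulkWrite(buildUpdateOps(batch), { ordered: false });
```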
