Very slow update performance - node.js

I am parsing a CSV file. For each row I want to check whether a corresponding entry exists in the database; if it does I want to update it, and if it doesn't I want to insert a new entry.
It is very slow - only around 30 entries per second.
Am I doing something incorrectly?
Using node, mongodb, monk
function loadShopsCSV(ShopsName) {
    var filename = 'test.csv'
    csv
        .fromPath(filename)
        .on("data", function(data) {
            var entry = {
                PeriodEST: Date.parse(data[0]),
                TextDate: textDateM,
                ShopId: parseInt(data[1]),
                ShopName: data[2],
                State: data[3],
                AreaUS: parseInt(data[4]),
                AreaUSX: AreaUSArray[stateArray.indexOf(data[3])],
                ProductClass: data[5],
                Type: data[6],
                SumNetVolume: parseInt(data[7]),
                Weekday: weekdayNum,
                WeightedAvgPrice: parseFloat(data[8]),
            }
            db.get(ShopsDBname).update(
                {"PeriodEST" : entry.PeriodEST,
                 "ShopName": entry.ShopName,
                 "State" : entry.State,
                 "AreaUS" : entry.AreaUS,
                 "ProductClass" : entry.ProductClass,
                 "Type" : entry.Type},
                {$set : entry},
                function(err, result) {
                }
            );
        })
        .on("end", function() {
            console.log('finished loading: ' + ShopsName)
        })
        .on("error", function(err) {
            console.error(err);
        });
}

First, I would suggest localizing the problem:
Replace .on("data", function(data) {...}) with a dummy .on("data", function() { return; }) and confirm the speed of the CSV parsing on its own.
Turn on the mongo profiler with db.setProfilingLevel(1) and check the slow log for any query slower than 100 ms (see the snippet just below).
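For reference, a minimal way to read that slow log back from the mongo shell (the query shape is just an illustration) could be:
// operations slower than the 100 ms threshold end up in system.profile
db.setProfilingLevel(1)
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(5).pretty()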
If neither of those reveals a problem, the bottleneck is in one of the node.js libraries you are using to prepare and send the query.
Assuming the problem is slow mongodb queries, you can use explain on the update's query for details. It may be that it does not use any index and runs a collection scan for every update.
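For example, from the mongo shell you can explain a find with the same filter the update uses (the collection name and values below are placeholders, not taken from the question) and, if it reports a COLLSCAN, add an index covering the filter fields:
// Explain the query shape used by the update; a winningPlan stage of COLLSCAN
// means every update scans the whole collection.
db.shops.find({
    "PeriodEST": 1476943200000,
    "ShopName": "Some shop",
    "State": "NY",
    "AreaUS": 100,
    "ProductClass": "A",
    "Type": "retail"
}).explain("executionStats")

// A compound index on the filter fields lets each update find its document directly.
db.shops.createIndex({
    "PeriodEST": 1, "ShopName": 1, "State": 1,
    "AreaUS": 1, "ProductClass": 1, "Type": 1
})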
Finally, it is recommended to use bulk operations, which were designed for exactly your use case.
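A rough sketch of what that could look like here (not the original code: collection is assumed to be a native driver collection object, monk may or may not expose bulkWrite depending on its version, and the batch size of 1000 is arbitrary):
var ops = [];

function queueRow(entry) {
    ops.push({
        updateOne: {
            filter: {
                PeriodEST: entry.PeriodEST,
                ShopName: entry.ShopName,
                State: entry.State,
                AreaUS: entry.AreaUS,
                ProductClass: entry.ProductClass,
                Type: entry.Type
            },
            update: { $set: entry },
            upsert: true                   // insert when no matching row exists
        }
    });
    if (ops.length >= 1000) {              // one round trip per 1000 rows
        var batch = ops.splice(0, ops.length);
        collection.bulkWrite(batch, { ordered: false }, function(err, result) {
            if (err) console.error(err);
        });
    }
}

// Call queueRow(entry) from the "data" handler, and flush whatever is left
// in ops the same way from the "end" handler.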

Have you tried updating with no write concern? MongoDB blocks until the whole update succeeds and the database sends back its acknowledgement. Are you on a cluster or something? (You might want to write to the primary node if so.)
After your {$set : entry} argument, pass an options document (see the sketch below):
{writeConcern: {w: 0}}
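Put together with the update from the question, that would look roughly like this (just a sketch: whether the nested writeConcern form or the flat { w: 0 } form is expected depends on the driver/monk version, and with w: 0 you also lose error reporting for the writes):
db.get(ShopsDBname).update(
    { "PeriodEST": entry.PeriodEST,
      "ShopName": entry.ShopName,
      "State": entry.State,
      "AreaUS": entry.AreaUS,
      "ProductClass": entry.ProductClass,
      "Type": entry.Type },
    { $set: entry },
    { w: 0 },                              // fire-and-forget, no acknowledgement
    function(err, result) {
        // with w: 0 this callback carries no useful acknowledgement
    }
);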

Related

most efficient way to add a calculated field mongoose

Hi, I have a fairly large dataset of 55K records, and I am calculating the moving average for those. What is the most efficient way to store these results again?
Currently I am doing this, which leads to an extreme number of records being written one by one. Is there a way for me to write the whole list back in one call if I add the calculated value to the records array?
Update: I've updated the code with the bulk update. However, it fails to update the "MA16" field, even though I know for sure that there is valid data in the "list" array.
It seems to match the documents but won't update. It yields an MA16 field in the database that is always null.
Logged output.
deletedCount:0
insertedCount:0
insertedIds:Object {}
matchedCount:150
modifiedCount:0
n:0
nInserted:0
nMatched:150
nModified:0
nRemoved:0
nUpserted:0
ok:1
var bulkUpdateArray = _.map(records, (record, index) => {
return {
updateOne :{
"filter":{_id : record._id},
"update":{$set: {"MA16": list[index]}},
"upsert":true
}
}
});
mongoose.connection.db.collection(req.body.day).bulkWrite(bulkUpdateArray).then(result => {
console.log("Insert result", result);
}).catch(err=>{
//catch the error here
})
You can use bulkWrite to achieve what you want.
Try this:
var bulkUpdateArray = _.map(records, (record, index) => {
return {
updateOne :{
"filter":{_id : record._id},
"update":{$set: {"MA16": list[index]}},
"upsert":true
}
}
})
mongoose.connection.db.collection(req.body.day).bulkWrite(bulkUpdateArray).then(result => {
//check the result of bulk update here
}).catch(err=>{
//catch the error here
})
You can use the updateOne operator of bulkWrite. From the official MongoDB docs, an updateOne operation has the following form:
{ updateOne :
{
"filter" : <document>,
"update" : <document>,
"upsert" : <boolean>,
"collation": <document>,
"arrayFilters": [ <filterdocument1>, ... ]
}
}
Please read MongoDB bulkWrite documentation for more info.
I hope this helps you out.

Query live data from large MongoDB collection - Nodejs

I want to get the last 10 records plus the newest record that was just added.
I tried to use a tailable cursor, but it took too much time because it has to scan the entire collection before it reaches the end of the collection to wait for data.
{
"_id" : ObjectId("56fe349d0ef0edb520f0ca29"),
"topic" : "IoTeam/messages/",
"payload" : "20:15:04:01:12:75,-127.00,679",
"qos" : 0,
"retain" : false,
"_msgid" : "45975d0d.ba68a4",
"mac" : "20:15:04:01:12:75",
"temp" : "-127.00",
"hum" : "679",
"time" : "01/04/2016 15:43:09"
}
Thanks for your help.
It is still difficult to say what the best solution is without knowing more information, but here is one suggestion that you could try (all using the mongo shell).
Create an index on the time key.
db.your_collection_name.createIndex({time:-1})
After you have created the index, type the following to ensure it was done correctly.
db.your_collection_name.getIndexes()
This will list your indexes, and you should see that a new one was added for the time key.
Caution: Although this will reduce the amount of time it takes to query on the time key, it will increase the amount of time it takes to insert new records into your database. This is due to the fact that new inserts will need to be indexed. So take that into consideration when scaling your app, and it may mean down the road you will want to handle this in a different way.
First of all, create an index on the field time.
db.collection('nameOfYourCollection')
.createIndex(
{ "time": -1 },
null,
function(err, results){
console.log(results);
});
This will create an index on the time field of your collection. This might take some time. But once the index is created, the queries will be much faster.
After this, in your query just do this:
var cursor = db.collection('nameOfYourCollection').find().sort([ ["time", -1] ]).limit(10);
cursor.forEach(function(doc){
if(doc) console.log("Got the document as : " + JSON.stringify(doc));
}, function(err){
if(err) console.log("Error: " + JSON.stringify(err));
});
This will give you the last 10 records that were inserted in the collection.
You can also call toArray instead of forEach on the cursor. Something like this:
var cursor = db.collection('nameOfYourCollection').find().sort([ ["time", -1] ]).limit(10);
cursor.toArray(function(err, docs){
if(err) console.log("Error: " + JSON.stringify(err));
if(docs){
console.log("Got the document as : " + JSON.stringify(docs));
console.log("This is the latest record that was inserted : " + JSON.stringify(docs[0]));
}
});
Hope this helps.

MongoDB, Updates and Controlling Document Expiration

I'm working on a node.js project. I'm trying to understand how MongoDB works. I'm obtaining data hourly via a cron file. I'd like for there to be unique data, so I'm using update instead of insert. That works fine. I'd like to add the option that the data expires after three days. It's not clear to me how to do that.
In pseudo code:
Set up vars, URLs, a couple of global variables, lineNr=1, end_index=# including databaseUrl.
MongoClient.connect(databaseUrl, function(err, db) {
assert.equal(null, err, "Database Connection Troubles: " + err);
**** db.collection('XYZ_Collection').createIndex({"createdAt": 1},
{expireAfterSeconds: 120}, function() {}); **** (update)
s = fs.createReadStream(text_file_directory + 'master_index.txt')
    .pipe(es.split())
    .pipe(es.mapSync(function(line) {
            s.pause(); // pause the readstream
            lineNr += 1;
            getContentFunction(line, s);
            if (lineNr > end_index) {
                s.end();
            }
        })
        .on('error', function() {
            console.log('Error while reading file.');
        })
        .on('end', function() {
            console.log('All done!');
        })
    );
function getContentFunction(line, stream){
    // (get content, format it, store it as flat JSON CleanedUpContent)
var go = InsertContentToDB(db, CleanedUpContent, function() {
stream.resume();
});
}
function InsertContentToDB(db, data, callback) {
    // (expiration TTL code if placed here generates errors too..)
    db.collection('XYZ_collection').update({
        'ABC': data.abc,
        'DEF': data.def
    }, {
        "createdAt": new Date(),
        'ABC': data.abc,
        'DEF': data.def,
        'Content': data.blah_blah
    }, {
        upsert: true
    },
    function(err, results) {
        assert.equal(null, err, "MongoDB Troubles: " + err);
        callback();
    });
}
So the db.collection('').update() with two fields forms a compound index to ensure the data is unique. upsert = true allows for insertion or updates as appropriate. My data varies greatly: some content is unique, other content is an update of a prior submission. I think I have this unique insert-or-update function working correctly.
What I'd really like to add is an automatic expiration to the documents within the collection. I see lots of content, but I'm at a loss as to how to implement it.
If I try
db.collection('XYZ_collection')
.ensureIndex( { "createdAt": 1 },
{ expireAfterSeconds: 259200 } ); // three days
Error
/opt/rh/nodejs010/root/usr/lib/node_modules/mongodb/lib/mongodb/mongo_client.js:390
throw err
^
Error: Cannot use a writeConcern without a provided callback
at Db.ensureIndex (/opt/rh/nodejs010/root/usr/lib/node_modules/mongodb/lib/mongodb/db.js:1237:11)
at Collection.ensureIndex (/opt/rh/nodejs010/root/usr/lib/node_modules/mongodb/lib/mongodb/collection.js:1037:11)
at tempPrice (/var/lib/openshift/56d567467628e1717b000023/app-root/runtime/repo/get_options_prices.js:57:37)
at /opt/rh/nodejs010/root/usr/lib/node_modules/mongodb/lib/mongodb/mongo_client.js:387:15
at process._tickCallback (node.js:442:13)
If I try to use createIndex I get this error...
`TypeError: Cannot call method 'createIndex' of undefined`
Note the database is totally empty (via db.XYZ_collection.drop()). So yeah, I'm new to the Mongo stuff. Does anybody understand what I need to do? One note: I'm very confused by something I read regarding the rule that you can't create a TTL index if the indexed field is already in use by another index. I think I'm okay, but it's not clear to me.
There are some restrictions on choosing a TTL index:
you can't create a TTL index if the indexed field is already used in another index.
the index can't have multiple fields.
the indexed field should be a Date BSON type.
As always, many thanks for your help.
Update: I've added the createIndex code above. With an empty callback, it runs without error, but the TTL system fails to remove entries at all, sigh.
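For reference, a minimal sketch of the pattern being discussed (hypothetical wiring, with the three-day TTL of 259200 seconds, a callback supplied to createIndex, and a $set-style upsert instead of the full-document replacement above so that createdAt is always written as a BSON Date):
// Create the TTL index once, with a callback so the driver does not complain
// about a write concern without a provided callback.
db.collection('XYZ_collection').createIndex(
    { "createdAt": 1 },
    { expireAfterSeconds: 259200 },        // three days
    function(err, indexName) {
        if (err) return console.error("Index creation failed: " + err);
        console.log("TTL index ready: " + indexName);
    }
);

// The TTL monitor only removes documents whose indexed field is a BSON Date,
// so every upsert should write createdAt as a Date.
db.collection('XYZ_collection').update(
    { 'ABC': data.abc, 'DEF': data.def },
    { $set: { 'Content': data.blah_blah, 'createdAt': new Date() } },
    { upsert: true },
    function(err, results) {
        assert.equal(null, err, "MongoDB Troubles: " + err);
        callback();
    }
);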

Properly chaining RethinkDB table and object creation commands with rethinkdbdash

I am processing a stream of text data where I don't know ahead of time what the distribution of its values are, but I know each one looks like this:
{
"datetime": "1986-11-03T08:30:00-07:00",
"word": "wordA",
"value": "someValue"
}
I'm trying to bucket it into RethinkDB objects based on its value, where the objects look like the following:
{
"bucketId": "1",
"bucketValues": {
"wordA": [
{"datetime": "1986-11-03T08:30:00-07:00"},
{"datetime": "1986-11-03T08:30:00-07:00"}
],
"wordB": [
{"datetime": "1986-11-03T08:30:00-07:00"},
{"datetime": "1986-11-03T08:30:00-07:00"}
]
}
}
The purpose is to eventually count the number of occurrences for each word in each bucket.
Since I'm dealing with about a million buckets, and have no knowledge of the words ahead of time, the plan is to create these objects on the fly. I am new to RethinkDB, however, and I have tried my best to do this in such a way that I don't attempt to add a word key to a bucket that doesn't exist yet, but I am not entirely sure if I'm following best practice here chaining the commands as follows (note that I am running this on a Node.js server using rethinkdbdash):
var bucketId = "someId";
var word = "someWordValue"

r.do(r.table("buckets").get(bucketId), function(result) {
    return r.branch(
        // If the bucket doesn't exist
        result.eq(null),
        // Create it
        r.table("buckets").insert({
            "id": bucketId,
            "bucketValues" : {}
        }),
        // Else do nothing
        "Bucket already exists"
    );
})
.run()
.then(function(result) {
    console.log(result);
    r.table("buckets").get(bucketId)
        .do(function(bucket) {
            return r.branch(
                // if the word already exists
                bucket("bucketValues").keys().contains(word),
                // Just append to it (code not implemented yet)
                "Word already exists",
                // Else create the word and append it
                r.table("buckets").get(bucketId).update(
                    {"bucketValues": r.object(word, [/*Put the timestamp here*/])}
                )
            );
        })
        .run()
        .then(function(result) {
            console.log(result);
        });
});
Do I need to execute run here twice, or am I way off base on how you're supposed to properly chain things together with RethinkDB? I just want to make sure I'm not doing this the wrong/hard way before I get much deeper into this.
You don't have to execute run multiple times; it depends on what you want. Basically, run() ends the chain and sends the query to the server. So we do everything needed to build the query and end it with run() to execute it. If you use run() two times, that means the query is sent to the server two times.
So if we can do all the processing using only RethinkDB functions, we need to call run only one time. However, if we want to do some kind of post-processing of the data on the client side, then we have no choice. Usually I try to do all processing in RethinkDB: with control structures, looping, and anonymous functions we can go pretty far without having the client do the logic.
In your case, the query can be rewritten in Node.js using the official driver:
var r = require('rethinkdb')

var bucketId = "someId2";
var word = "someWordValue2";

r.connect()
    .then((conn) => {
        r.table("buckets").insert({
            "id": bucketId,
            "bucketValues" : {}
        })
        .do((result) => {
            // We don't care about result at all
            // We just want to ensure it's there
            return r.table('buckets').get(bucketId)
                .update(function(bucket) {
                    return {
                        'bucketValues': r.object(
                            word,
                            bucket('bucketValues')(word).default([])
                                .append(r.now()))
                    }
                })
        })
        .run(conn)
        .then((result) => { conn.close() })
    })

Sailjs: how do I save collections with particular fields to the mongo db via waterline

I'm using node FeedParser to loop through a particular XML feed.
I'm getting the required data I want in my console using the following:
console.log('Got article: %s', item.title);
console.log('Got url %s', item.link);
Here is what I want to do:
Save the title + links generated as a document in the collection with the same name as the controller name
I've also set up a cron job, so I want to make sure that if the title already exists in that collection, the loop continues to the next article in the RSS feed.
Here is what I tried writing below my console.log statements mentioned above, and the loop broke after executing once (Buzzfeed is the name of the model, so appropriately 'buzzfeed' is the name of the collection where I want the data to be stored):
Buzzfeed.findOrCreate()
.populate('title')
.populate('url')
.exec(function (err, title, url){
buzzfeed[1].title.add(item.title );
buzzfeed[1].url = ( item.url );
buzzfeed[1].save(function (err) {
});
});
Addition: Also tried the following and it did not work:
db.buzzfeed.save( { title: item.title, url: item.link } );
As Andi mentioned above, my code was incorrect; the following query solved it for me:
Buzzfeed.create({'title': newData.title, 'url': newData.link}, function (err, newTitles) {
});
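If the "skip it when the title already exists" requirement from the question still matters, Waterline's findOrCreate can handle both cases in one call; a rough sketch with the same attribute names:
// Look the article up by title; only create it when no match exists.
Buzzfeed.findOrCreate(
    { title: item.title },                    // search criteria
    { title: item.title, url: item.link },    // values used if a record is created
    function (err, record) {
        if (err) { return console.error(err); }
        // record is either the existing entry or the newly created one,
        // so the feed loop can simply move on to the next item.
    }
);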
