CompletableFuture runAsync in forEach loop - multithreading

I have defined a CompletableFuture.runAsync() task inside a forEach loop.
My requirement is to insert records into a NoSQL db and then mark those same inserted records as 'processed' in a SQL db (migrating data from DB2 to MongoDB).
To achieve this, I have defined the Mongo insertion in runAsync() and the function that marks the data as processed in DB2 in thenAccept() (see the code snippet).
The problem: after each record insertion in Mongo I keep the inserted record in a list and try to update the entire list in DB2 in one shot, but it is not behaving like that. For every insertion in Mongo it hits DB2 to update that single record, which is not a feasible approach when processing thousands of records. My expectation is to first have the full list of Mongo-inserted records and then update that list of records as 'processed' in DB2 in one shot. Is there any possibility for this approach? (I am aware that both the Mongo insertion and the DB2 update are defined inside the forEach loop, but I want Mongo to complete the insertion for all entries and only then update DB2 for the entire inserted list in one shot.)
Or, at least, if I could return the list of inserted records from the Mongo process.
Code snippet:
unprocessedList.forEach(entry -> {
    CompletableFuture<Void> cf = CompletableFuture
        .runAsync(() -> mongoHelper.processInMongo(entry, getObj(entry)), executor)
        .thenAccept(v -> updateInDb2(entryList));
});

If I understood you correctly, you want to finish all insertions and only then do one update.
You can change your code to use promises (I'm not too familiar with the Java ones myself), so here is a basic solution scheme for your problem.
Note that you should use an AtomicInteger for the counter variable, as ++ is not atomic and will not work 100% of the time otherwise:
AtomicInteger counter = new AtomicInteger(0); // java.util.concurrent.atomic.AtomicInteger
int expected = unprocessedList.size();

unprocessedList.forEach(entry -> {
    CompletableFuture<Void> cf = CompletableFuture
        .runAsync(() -> mongoHelper.processInMongo(entry, getObj(entry)), executor)
        .thenAccept(v -> {
            // only the future that completes last triggers the single DB2 update
            if (counter.incrementAndGet() == expected) {
                updateInDb2(entryList);
            }
        });
});
As I said, I'm not too familiar with Java promises, but a better solution would be of the form:
await a promise that inserts all the Mongo documents;
once that promise is fulfilled, update the list in DB2.
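In Java terms, that "insert everything first, then do one update" shape is exactly what CompletableFuture.allOf provides. A minimal sketch, reusing the unprocessedList, mongoHelper, getObj, executor, entryList and updateInDb2 from the question (needs java.util.concurrent.CompletableFuture, java.util.List and java.util.stream.Collectors):

// one future per entry, all running on the same executor
List<CompletableFuture<Void>> inserts = unprocessedList.stream()
        .map(entry -> CompletableFuture.runAsync(
                () -> mongoHelper.processInMongo(entry, getObj(entry)), executor))
        .collect(Collectors.toList());

// completes only after every Mongo insertion has finished,
// then DB2 is updated once for the whole batch
CompletableFuture.allOf(inserts.toArray(new CompletableFuture[0]))
        .thenRun(() -> updateInDb2(entryList))
        .join(); // optional: block until the batch update is done

With allOf there is no shared counter to synchronize; the library does the "have all of them finished yet" bookkeeping for you.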

Related

Node.js compute gets slow after querying a big list from MongoDB

I am using mongoose to query a really big list from MongoDB:
const chat_list = await chat_model.find({}).sort({uuid: 1}); // uuid is a index
const msg_list = await message_model.find({}, {content: 1, xxx}).sort({create_time: 1});// create_time is a index of message collection, time: t1
// chat_list length is around 2,000, msg_list length is around 90,000
compute(chat_list, msg_list); // time: t2
function compute(chat_list, msg_list) {
    for (let i = 0, len = chat_list.length; i < len; i++) {
        msg_list.filter(msg => msg.uuid === chat_list[i].uuid)
        // consistent handling for every message
    }
}
For the above code, t1 is about 46s and t2 is about 150s.
t2 is really too big, which is weird.
Then I cached these lists to local JSON files:
const chat_list = require('./chat-list.json');
const msg_list = require('./msg-list.json');
compute(chat_list, msg_list); // time: t2
This time, t2 is around 10s.
So here comes the question: 150 seconds vs 10 seconds, why? What happened?
I tried to use a worker to do the compute step after the Mongo query, but the time was still much bigger than 10s.
The mongodb query returns a FindCursor, which includes array-ish methods like .filter(), but the result is not an Array.
Use .toArray() on the cursor before filtering to process the mongodb result set like for like. That might not make the overall process any faster, as the result set still needs to be fetched from mongodb, but compute() will behave like it does with the cached files.
const chat_list = await chat_model
.find({})
.sort({uuid: 1})
.toArray()
const msg_list = await message_model
.find({}, {content: 1, xxx})
.sort({create_time: 1})
.toArray()
Matt typed faster than I did, so some of what was suggested aligns with part of this answer.
I think you are measuring and comparing something different than what you are expecting and implying.
Your expectation is that the compute() function takes around 10 seconds once all of the data is loaded by the application. This is (mostly) demonstrated by your second test, apart from the fact that that test includes the time it takes to load the data from the local files. But you're seeing that there is a difference of 104 seconds (150 - 46) between the completion of message_model.find() and compute() hence leading to the question.
The key thing is that successfully advancing from the find against message_model is not the same thing as retrieving all of the results. As @Matt notes, find() will return with a cursor object once the initial batch of results is ready. That is very different from retrieving all of the results. So there is more work (apparently ~94 seconds' worth) left to do from the two find() operations to further iterate the cursors and retrieve the rest of the results. This additional time is getting reported inside of t2.
As suggested by @Matt, calling .toArray() should shift that time back into t1, as you are expecting. It also sounds like it may be more correct, given the ambiguity of the cursor's .filter() function.
There are two other things that catch my attention. The first is: why are you retrieving all of this data client-side to do the filtering there? Perhaps you would like to do this uuid matching inside of the database via $lookup?
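For illustration, a $lookup of roughly that shape could look like the sketch below. It is written with the MongoDB Java driver for consistency with the Java snippets elsewhere on this page; the "chats"/"messages" collection names and the connection details are assumptions (use whatever your models map to), and the same two stages can be passed to mongoose's Model.aggregate():

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import java.util.Arrays;

MongoCollection<Document> chats = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("test")       // database name is an assumption
        .getCollection("chats");   // collection names are assumptions

// let the server attach each chat's matching messages, instead of
// filtering 90,000 messages 2,000 times in application code
for (Document chatWithMessages : chats.aggregate(Arrays.asList(
        Aggregates.sort(Sorts.ascending("uuid")),
        Aggregates.lookup("messages", "uuid", "uuid", "messages")))) {
    // chatWithMessages now carries a "messages" array field
}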
Secondly, this comment isn't clear to me:
// create_time is a index of message collection, time: t1
create_time itself is a field here, existent or not, that you are requesting an ascending sort against.
You are taking data from 2 tables, then with a for loop you are comparing IDs using the filter function. What happens is that your loop executes 2,000 times, and so does the filter function, which scans 90,000 records each time.
Take the worst-case scenario: even if none of the 2,000 uuids are present in msg_list, you still execute 2000 * 90000 iterations without getting any data back.
It won't take more than 10 to 15 seconds if you use the code below.
// This will generate an array of the uuids present in message_model
const msg_list = await message_model.find({}, {content: 1, xxx}).sort({create_time: 1}).distinct("uuid");
// The query below will match every uuid in the msg_list array against the chat_list uuid
const chat_list = await chat_model.find({uuid: {$in: msg_list}}).sort({uuid: 1});
The above does the same thing you did with the filter function and the loop, but it is a cleaner and much faster way to receive the data you require.

OrderBy and StartAt with two different fields in Firestore

In my app, I have comments that have a field value of threadCommentCount. I want to order the comments using orderBy threadCommentCount descending and then have pagination continue this using startAfter(lastThreadCommentCount). The problem is when threadCommentCount is 0, which is the case for a lot of them: the query returns the same data every time, since it starts at 0 every time. Here is the query:
popularCommentsQuery = db
.collection('comments')
.where('postId', '==', postId)
.orderBy('threadCommentCount', 'desc')
.startAfter(startAfter)
.limit(15)
.get()
This returns the same comments every time once threadCommentCount is 0. I'm unable to send the last document snapshot because I'm using cloud functions, and I don't want to send the documentSnapshot in a GET query parameter. I don't really care how the comments are ordered after threadCommentCount is 0; I just need to not get any duplicates. Any help is great!
All Firestore queries have an implicit orderBy("__name__", direction) to resolve any ties between documents that have the same values for the other named orderBy fields. This makes the final sort order stable. But it also enables you to pass another argument to startAfter to provide the document ID of the anchor document that you wish to use for the purpose of pagination.
.startAfter(lastThreadCommentCount, lastDocumentId)
Between these two values, you should be able to uniquely identify the document in the result set to start the next page.
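For illustration, the full paginated query then looks roughly like the sketch below. It is shown with the Firestore Java client for consistency with the Java snippets elsewhere on this page (the Node call chain has the same shape); db is assumed to be a com.google.cloud.firestore.Firestore instance, and lastThreadCommentCount/lastDocumentId are whatever you returned to the client from the previous page:

import com.google.api.core.ApiFuture;
import com.google.cloud.firestore.FieldPath;
import com.google.cloud.firestore.Query;
import com.google.cloud.firestore.QuerySnapshot;

ApiFuture<QuerySnapshot> page = db.collection("comments")
        .whereEqualTo("postId", postId)
        .orderBy("threadCommentCount", Query.Direction.DESCENDING)
        .orderBy(FieldPath.documentId(), Query.Direction.DESCENDING) // make the tie-breaker explicit
        .startAfter(lastThreadCommentCount, lastDocumentId)          // value cursor + document ID
        .limit(15)
        .get();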
So, I was trying orderBy and startAfter with two different fields (time, key) in Firestore to establish pagination in a FlatList.
The key point is that we can pass a document snapshot to define the query cursor [reference].
Here is how I managed to do it.
Step 1: get the document ID (which is auto-generated by Firebase) with where() [reference]:
const docRef = firestore().collection('shots').where('key', '==', 'custom_key');
const fbDocIdGeneratedByFirebase = (await docRef.get()).docs[0].id;
Step 2: get the document snapshot with the Firebase-generated document ID (which we got in step 1):
const docRef2 = firestore().collection('shots').doc(fbDocIdGeneratedByFirebase);
const snapshot = await docRef2.get();
Step 3: pass the snapshot from step 2 to startAfter() so that the cursor will point there [reference]:
let additionalQuery = firestore().collection('shots')
    .orderBy("time", "desc")
    .startAfter(snapshot)
    .limit(this.state.limit);
let documentSnapshots = await additionalQuery.get(); // you know what to do next
...
Can you improve the solution?

Is there a way to add an incrementing id in one statement in MongoDB?

So I have a small database, and it's not going to grow much more. I'm trying to get one document from the db in an API that I implemented in Python, so that with a given document id the API retrieves that document from the db. However, I find it a little hard to ask the user to type in a random number from the db. All I require is a function that modifies each document by setting an id field that auto-increments. As I said, it's not going to grow that much and performance isn't really an issue here.
So far what I've been able to do is this:
var i = 0
db.MyCollection.update({},
    {$set: {"new_field": 1}},
    {upsert: false, multi: true},
    i++
);
I managed to set an id field, but it sets the same number on every document (the count of the documents). So if the db has 10 docs, it sets the id to 10 on each of them.
A find-and-modify operation returns the updated document (either before or after the update, depending on the returnDocument setting). You can use this with $inc to implement a counter. Ruby example, where c is a collection:
irb(main):005:0> c['foo'].insert_one(counter:true,count:1)
=> #<Mongo::Operation::Insert::Result:0x8040 documents=[{"n"=>1, "opTime"=>{"ts"=>#<BSON::Timestamp:0x00005609f260b7e0 #seconds=1594961771, #increment=2>, "t"=>1}, "electionId"=>BSON::ObjectId('7fffffff0000000000000001'), "ok"=>1.0, "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x00005609f260b538 #seconds=1594961771, #increment=2>, "signature"=>{"hash"=><BSON::Binary:0x8060 type=generic data=0x0000000000000000...>, "keyId"=>0}}, "operationTime"=>#<BSON::Timestamp:0x00005609f260b290 #seconds=1594961771, #increment=2>}]>
irb(main):011:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>1}
irb(main):012:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>2}
irb(main):013:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>3}
irb(main):014:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>4}
Why not just use this logic? Instead of updating everything via one query, just launch multiple queries one by one. Mongo will do it pretty fast, even if you have >1M docs in the database (although you said you have a small database), because of the pre-built index on the _id field.
This is JavaScript code, but I guess you'll understand the logic of it:
let all_documents = db.MyCollection.find({}).toArray();
for (let i = 0; i < all_documents.length; i++) {
    db.MyCollection.update({_id: all_documents[i]._id}, {$set: {"new_field": i}}, {upsert: false});
}
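If the collection ever grows, the same numbering can also be pushed to the server in a single round trip with bulkWrite instead of one update() call per document. A sketch with the MongoDB Java driver (collection name and connection details are assumptions; the logic mirrors the loop above, numbering documents in _id order):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("test")
        .getCollection("MyCollection");

// build one UpdateOne operation per document, then send the whole batch at once
List<WriteModel<Document>> ops = new ArrayList<>();
int i = 0;
for (Document doc : coll.find().sort(new Document("_id", 1))) {
    ops.add(new UpdateOneModel<>(Filters.eq("_id", doc.get("_id")),
                                 Updates.set("new_field", i++)));
}
if (!ops.isEmpty()) {
    coll.bulkWrite(ops);
}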

How to retrieve all documents in a CouchDB database without running out of memory

I have a CouchDB database which contains about 200,000 tweets; the keys are tweet IDs. I have a query which needs to retrieve all documents to look for some information. I'm using LightCouch to work with CouchDB in a Java web app. If I query like this:
List<JsonObject> tweets = dbClient.view("_all_docs").query(JsonObject.class);
and then loop through tweets, using, for each JsonObject in tweets,
JsonObject tweetJson = dbClient.find(JsonObject.class, tweet.get("id").toString().replaceAll("\"", ""));
to retrieve each tweet one by one, it takes an extremely long time for 200,000 documents. If I load all documents in one single query using includeDocs(true):
List<JsonObject> allTweets = dbClient.view("_all_docs").includeDocs(true).query(JsonObject.class);
it causes an OutOfMemory exception since the number of documents is too large. So how can I deal with this problem? I'm thinking about using limit(5000) to retrieve 5,000 documents at a time and loop through the whole database, but I don't know how to write the loop to continue retrieving the next 5,000 after the first 5,000 docs. One possible solution is using startKey and endKey, but I'm confused about how to use them when the key is a tweet ID.
Use queryPage, but make sure to use a String as the key.
See: https://github.com/lightcouch/LightCouch/issues/26#event-122327174
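If you would rather write the limit(5000) loop you describe yourself, it could look roughly like the sketch below. This is only a sketch: PAGE_SIZE, startKey and the hasMore bookkeeping are illustrative, while view(), includeDocs(), limit(), startKey() and query() are the same LightCouch calls already used in the question. Each page fetches one extra row so that row's _id can serve as the next page's startkey:

final int PAGE_SIZE = 5000;
String startKey = null;
while (true) {
    List<JsonObject> page = (startKey == null
            ? dbClient.view("_all_docs").includeDocs(true).limit(PAGE_SIZE + 1)
            : dbClient.view("_all_docs").includeDocs(true).limit(PAGE_SIZE + 1).startKey(startKey))
            .query(JsonObject.class);

    boolean hasMore = page.size() > PAGE_SIZE;
    List<JsonObject> tweets = hasMore ? page.subList(0, PAGE_SIZE) : page;
    for (JsonObject tweet : tweets) {
        // process one tweet; only this page is held in memory
    }
    if (!hasMore) {
        break;
    }
    // the unprocessed extra row becomes the first row of the next page
    startKey = page.get(PAGE_SIZE).get("_id").getAsString();
}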
Version 0.1.6 still seems to show this behaviour.
A workaround that I found for this goes something like this:
changes = dbClient.changes()
        .since(null) // or .since(since) if you want an offset
        .includeDocs(true);

int size = 1;
getCursor("0");
while (size > 0) {
    ChangesResult resultSet = changes.limit(40000).getChanges();
    List<ChangesResult.Row> rowList = resultSet.getResults();
    for (ChangesResult.Row feed : rowList) {
        // instantiate your object via gson
        // ...
    }
    getCursor(resultSet.getLastSeq());
    size = rowList.size();
}

CouchDB - Filtered Replication - Can the speed be improved?

I have a single database (300MB & 42,924 documents) consisting of about 20 different kinds of documents from about 200 users. The documents range in size from a few bytes to many kilobytes (150KB or so).
When the server is unloaded, the following replication filter function takes about 2.5 minutes to complete.
When the server is loaded, it takes >10 minutes.
Can anyone comment on whether these times are expected and, if not, suggest how I might optimize things in order to get better performance?
function(doc, req) {
    var acceptedDate = true;
    if (doc.date) {
        var docDate = new Date();
        var dateKey = doc.date;
        docDate.setFullYear(dateKey[0], dateKey[1], dateKey[2]);
        var reqYear = req.query.year;
        var reqMonth = req.query.month;
        var reqDay = req.query.day;
        var reqDate = new Date();
        reqDate.setFullYear(reqYear, reqMonth, reqDay);
        acceptedDate = docDate.getTime() >= reqDate.getTime();
    }
    return doc.user_id && doc.user_id == req.query.userid && doc._id.indexOf("_design") != 0 && acceptedDate;
}
Filtered replication is slow because, for each fetched document, it has to run extra logic to decide whether to replicate it or not:
1. CouchDB fetches the next document.
2. Because the filter function has to be applied, the document gets converted to JSON.
3. The JSONified document is passed through stdio to the query server.
4. The query server handles the document and decodes it from JSON.
5. The query server then looks up and runs your filter function, which returns a true or false value to CouchDB.
6. If the result is true, the document goes on to be replicated.
7. Go to step 1 and loop over all documents.
For non-filtered replication, take this list, throw away steps 2-5, and let step 6 always return true. That overhead is what slows down the whole replication process.
To significantly improve filtered replication speed, you may use Erlang filters via the Erlang native query server. They run inside CouchDB, don't pass through any stdio interface, and there is no JSON decode/encode overhead.
NOTE that the Erlang query server does not run inside a sandbox like the JavaScript one does, so you need to really trust the code that you run with it.
Another option is to optimize your filter function, e.g. reduce object creation and method calls, but you actually won't win much with this.
