I need to create a new field sid on each document in a collection of about 500K documents. Each sid is unique and based on that record's existing roundedDate and stream fields.
I'm doing so with the following code:
var cursor = db.getCollection('snapshots').find();
var iterated = 0;
var updated = 0;
while (cursor.hasNext()) {
    var doc = cursor.next();
    if (doc.stream && doc.roundedDate && !doc.sid) {
        db.getCollection('snapshots').update({ "_id": doc['_id'] }, {
            $set: {
                sid: doc.stream.valueOf() + '-' + doc.roundedDate,
            }
        });
        updated++;
    }
    iterated++;
};
print('total ' + cursor.count() + ' iterated through ' + iterated + ' updated ' + updated);
It works well at first, but after a few hours and about 100K records it errors out with:
Error: getMore command failed: {
"ok" : 0,
"errmsg": "Cursor not found, cursor id: ###",
"code": 43,
}: ...
EDIT - Query performance:
As @NeilLunn pointed out in his comments, you should not be filtering the documents manually, but use .find(...) for that instead:
db.snapshots.find({
roundedDate: { $exists: true },
stream: { $exists: true },
sid: { $exists: false }
})
Also, using .bulkWrite(), available since MongoDB 3.2, will be far more performant than doing individual updates.
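As a rough illustration of the bulkWrite() idea, the sketch below builds the list of updateOne operations for a batch of documents. The helper name buildSidOps and the batch size of 1000 in the commented usage are assumptions for illustration, not something from the question. The helper itself is plain JavaScript, so it can be tried without a database.

```javascript
// Build bulkWrite operations for a batch of documents (pure function, no DB needed).
// buildSidOps is a hypothetical helper name used only for this sketch.
function buildSidOps(docs) {
  return docs
    // Same condition as the question's loop: only documents still missing a sid.
    .filter((doc) => doc.stream && doc.roundedDate && !doc.sid)
    .map((doc) => ({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { sid: `${doc.stream.valueOf()}-${doc.roundedDate}` } },
      },
    }));
}

// Assumed usage in the mongo shell (not executed here):
//   let batch = [];
//   const flush = () => {
//     const ops = buildSidOps(batch);
//     if (ops.length > 0) db.snapshots.bulkWrite(ops, { ordered: false });
//     batch = [];
//   };
//   db.snapshots.find({ sid: { $exists: false } }).forEach((doc) => {
//     batch.push(doc);
//     if (batch.length === 1000) flush();
//   });
//   flush();
```

Sending one bulkWrite() per 1000 documents replaces 1000 round trips with one, which is where most of the expected speed-up would come from.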
With those changes, you may be able to finish within the cursor's 10-minute lifetime. If it still takes longer than that, your cursor will expire and you will have the same problem anyway, which is explained below:
What is going on here:
Error: getMore command failed may be due to a cursor timeout, which is related to two cursor attributes:
Timeout limit, which is 10 minutes by default. From the docs:
By default, the server will automatically close the cursor after 10 minutes of inactivity, or if client has exhausted the cursor.
Batch size, which is 101 documents or 16 MB for the first batch, and 16 MB, regardless of the number of documents, for subsequent batches (as of MongoDB 3.4). From the docs:
find() and aggregate() operations have an initial batch size of 101 documents by default. Subsequent getMore operations issued against the resulting cursor have no default batch size, so they are limited only by the 16 megabyte message size.
Probably you are consuming those initial 101 documents and then getting a 16 MB batch, which is the maximum, with a lot more documents. As it is taking more than 10 minutes to process them, the cursor on the server times out and, by the time you are done processing the documents in the second batch and request a new one, the cursor is already closed:
As you iterate through the cursor and reach the end of the returned batch, if there are more results, cursor.next() will perform a getMore operation to retrieve the next batch.
Possible solutions:
I see 5 possible ways to solve this: 3 good ones, with their pros and cons, and 2 bad ones:
👍 Reducing the batch size to keep the cursor alive.
👍 Remove the timeout from the cursor.
👍 Retry when the cursor expires.
👎 Query the results in batches manually.
👎 Get all the documents before the cursor expires.
Note they are not numbered following any specific criteria. Read through them and decide which one works best for your particular case.
1. 👍 Reducing the batch size to keep the cursor alive
One way to solve this is to use cursor.batchSize to set the batch size on the cursor returned by your find query to the number of documents you can process within those 10 minutes:
const cursor = db.collection.find()
.batchSize(NUMBER_OF_DOCUMENTS_IN_BATCH);
However, keep in mind that setting a very conservative (small) batch size will probably work, but will also be slower, as now you need to access the server more times.
On the other hand, setting it to a value too close to the number of documents you can process in 10 minutes means that it is possible that if some iterations take a bit longer to process for any reason (other processes may be consuming more resources), the cursor will expire anyway and you will get the same error again.
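The trade-off above can be turned into a small calculation. This is only a sketch: the 100 ms per document figure is a hypothetical measurement, and the safety factor of 0.5 (use only half of the timeout window) is an arbitrary assumption you would tune yourself.

```javascript
// Estimate how many documents fit in one batch while leaving headroom before
// the cursor's idle timeout. All inputs are measurements or assumptions you supply.
function estimateBatchSize(msPerDoc, timeoutMs, safetyFactor) {
  if (msPerDoc <= 0) throw new RangeError('msPerDoc must be positive');
  const usableMs = timeoutMs * safetyFactor; // only part of the window is used
  return Math.max(1, Math.floor(usableMs / msPerDoc));
}

const TEN_MINUTES_MS = 10 * 60 * 1000;

// Hypothetical measurement: each document takes about 100 ms to process.
console.log(estimateBatchSize(100, TEN_MINUTES_MS, 0.5)); // 3000
```

The result would then be passed to .batchSize(...) on the cursor.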
2. 👍 Remove the timeout from the cursor
Another option is to use cursor.noCursorTimeout to prevent the cursor from timing out:
const cursor = db.collection.find().noCursorTimeout();
This is considered a bad practice as you would need to close the cursor manually or exhaust all its results so that it is automatically closed:
After setting the noCursorTimeout option, you must either close the cursor manually with cursor.close() or by exhausting the cursor's results.
As you want to process all the documents in the cursor, you wouldn't need to close it manually, but it is still possible that something else goes wrong in your code and an error is thrown before you are done, thus leaving the cursor opened.
If you still want to use this approach, use a try-catch to make sure you close the cursor if anything goes wrong before you consume all its documents.
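A minimal sketch of that safety net, using try/finally rather than try-catch so the cursor is closed on both success and failure. processAll and fakeCursor are hypothetical names; fakeCursor is only a stand-in so the example can run without a server, and it is not a MongoDB API.

```javascript
// Run `handle` on every document and always close the cursor afterwards.
// `cursor` is anything exposing hasNext(), next() and close(), like the shell cursor.
function processAll(cursor, handle) {
  let processed = 0;
  try {
    while (cursor.hasNext()) {
      handle(cursor.next());
      processed++;
    }
  } finally {
    cursor.close(); // runs whether the loop finished or threw
  }
  return processed;
}

// Stand-in cursor over an array, for demonstration only.
function fakeCursor(docs) {
  let i = 0;
  return {
    closed: false,
    hasNext() { return i < docs.length; },
    next() { return docs[i++]; },
    close() { this.closed = true; },
  };
}

// Assumed usage in the mongo shell (not executed here):
//   processAll(db.snapshots.find().noCursorTimeout(), (doc) => { /* update doc */ });
```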
Note I don't consider this a bad solution (therefore the 👍), as even though it is considered a bad practice...:
It is a feature supported by the driver. If it were so bad, given that there are alternative ways to get around timeout issues, as explained in the other solutions, it wouldn't be supported.
There are ways to use it safely, it's just a matter of being extra cautious with it.
I assume you are not running this kind of queries regularly, so the chances that you start leaving open cursors everywhere is low. If this is not the case, and you really need to deal with these situations all the time, then it does make sense not to use noCursorTimeout.
3. 👍 Retry when the cursor expires
Basically, you put your code in a try-catch and when you get the error, you get a new cursor skipping the documents that you have already processed:
let processed = 0;
let updated = 0;

while (true) {
    const cursor = db.snapshots.find().sort({ _id: 1 }).skip(processed);

    try {
        while (cursor.hasNext()) {
            const doc = cursor.next();

            ++processed;

            if (doc.stream && doc.roundedDate && !doc.sid) {
                db.snapshots.update({
                    _id: doc._id
                }, { $set: {
                    sid: `${ doc.stream.valueOf() }-${ doc.roundedDate }`
                }});

                ++updated;
            }
        }

        break; // Done processing all, exit outer loop
    } catch (err) {
        if (err.code !== 43) {
            // Something else than a timeout went wrong. Abort loop.
            throw err;
        }
    }
}
Note you need to sort the results for this solution to work.
With this approach, you are minimizing the number of requests to the server by using the maximum possible batch size of 16 MB, without having to guess how many documents you will be able to process in 10 minutes beforehand. Therefore, it is also more robust than the previous approach.
4. 👎 Query the results in batches manually
Basically, you use skip(), limit() and sort() to do multiple queries with a number of documents you think you can process in 10 minutes.
I consider this a bad solution because the driver already has the option to set the batch size, so there's no reason to do this manually. Just use solution 1 and don't reinvent the wheel.
Also, it is worth mentioning that it has the same drawbacks as solution 1.
5. 👎 Get all the documents before the cursor expires
Probably your code is taking some time to execute due to results processing, so you could retrieve all the documents first and then process them:
const results = db.snapshots.find().toArray();
This will retrieve all the batches one after another and close the cursor. Then, you can loop through all the documents inside results and do what you need to do.
However, if you are having timeout issues, chances are that your result set is quite large, thus pulling everything in memory may not be the most advisable thing to do.
Note about snapshot mode and duplicate documents
It is possible that some documents are returned multiple times if intervening write operations move them due to a growth in document size. To solve this, use cursor.snapshot(). From the docs:
Append the snapshot() method to a cursor to toggle the “snapshot” mode. This ensures that the query will not return a document multiple times, even if intervening write operations result in a move of the document due to the growth in document size.
However, keep in mind its limitations:
It doesn't work with sharded collections.
It doesn't work with sort() or hint(), so it will not work with solutions 3 and 4.
It doesn't guarantee isolation from insertion or deletions.
Note with solution 5 the time window to have a move of documents that may cause duplicate documents retrieval is narrower than with the other solutions, so you may not need snapshot().
In your particular case, as the collection is called snapshots, it is unlikely to change, so you probably don't need snapshot(). Moreover, you are updating documents based on their own data and, once the update is done, that same document will not be updated again even if it is retrieved multiple times, as the if condition will skip it.
Note about open cursors
To see a count of open cursors use db.serverStatus().metrics.cursor.
It's a bug in MongoDB server session management. A fix is currently in progress and should land in 4.0+.
SERVER-34810: Session cache refresh can erroneously kill cursors that are still in use
(reproduced in MongoDB 3.6.5)
Adding collection.find().batchSize(20) helped me, at the cost of slightly reduced performance.
I also ran into this problem, but for me it was caused by a bug in the MongoDB driver.
It happened in version 3.0.x of the npm package mongodb, which is used e.g. in Meteor 1.7.0.x, where I also recorded this issue. It's further described in this comment, and the thread contains a sample project which confirms the bug: https://github.com/meteor/meteor/issues/9944#issuecomment-420542042
Updating the npm package to 3.1.x fixed it for me, because I had already taken into account the good advice given here by @Danziger.
When using the Java v3 driver, noCursorTimeout should be set in the FindOptions.
DBCollectionFindOptions options =
    new DBCollectionFindOptions()
        .maxTime(90, TimeUnit.MINUTES)
        .noCursorTimeout(true)
        .batchSize(batchSize)
        .projection(projectionQuery);
cursor = collection.find(filterQuery, options);
In my case, it was a load balancing issue. I had the same problem running a Node.js service with mongos as a pod on Kubernetes.
The client was using the mongos service with default load balancing.
Changing the Kubernetes service to use sessionAffinity: ClientIP (stickiness) resolved the issue for me.
noCursorTimeout will NOT work
As of 2021, for the error
cursor id xxx not found, full error: {'ok': 0.0, 'errmsg': 'cursor id xxx not found', 'code': 43, 'codeName': 'CursorNotFound'}
the official documentation says:
Consider an application that issues a db.collection.find() with cursor.noCursorTimeout(). The server returns a cursor along with a batch of documents defined by the cursor.batchSize() of the find(). The session refreshes each time the application requests a new batch of documents from the server. However, if the application takes longer than 30 minutes to process the current batch of documents, the session is marked as expired and closed. When the server closes the session, it also kills the cursor despite the cursor being configured with noCursorTimeout(). When the application requests the next batch of documents, the server returns an error.
This means that even if you have set:
noCursorTimeout=True
a smaller batchSize
you will still get cursor id not found after the default 30 minutes.
How to fix/avoid cursor id not found?
Make sure of two points:
(explicitly) create a new session, and get the db and collection from this session
refresh the session periodically
Code:
(official) JS
var session = db.getMongo().startSession()
var sessionId = session.getSessionId().id
var cursor = session.getDatabase("examples").getCollection("data").find().noCursorTimeout()
var refreshTimestamp = new Date() // take note of time at operation start
while (cursor.hasNext()) {
    // Check if more than 5 minutes have passed since the last refresh
    if ( (new Date()-refreshTimestamp)/1000 > 300 ) {
        print("refreshing session")
        db.adminCommand({"refreshSessions" : [sessionId]})
        refreshTimestamp = new Date()
    }
    // process cursor normally
}
(mine) Python
import logging
from datetime import datetime
import pymongo

mongoClient = pymongo.MongoClient('mongodb://127.0.0.1:27017/your_db_name')

# how often to refresh the session, in seconds
# Note: must be less than 30 minutes = MongoDB session default timeout
# https://docs.mongodb.com/v5.0/reference/method/cursor.noCursorTimeout/
# RefreshSessionPerSeconds = 10 * 60
RefreshSessionPerSeconds = 8 * 60

def mergeHistorResultToNewCollection():
    mongoSession = mongoClient.start_session() # <pymongo.client_session.ClientSession object at 0x1081c5c70>
    mongoSessionId = mongoSession.session_id # {'id': Binary(b'\xbf\xd8\xd...1\xbb', 4)}
    mongoDb = mongoSession.client["your_db_name"] # Database(MongoClient(host=['127.0.0.1:27017'], document_class=dict, tz_aware=False, connect=True), 'your_db_name')
    mongoCollectionOld = mongoDb["collecion_old"]
    mongoCollectionNew = mongoDb['collecion_new']

    # historyAllResultCursor = mongoCollectionOld.find(session=mongoSession)
    historyAllResultCursor = mongoCollectionOld.find(no_cursor_timeout=True, session=mongoSession)

    lastUpdateTime = datetime.now() # datetime.datetime(2021, 8, 30, 10, 57, 14, 579328)
    for curIdx, oldHistoryResult in enumerate(historyAllResultCursor):
        curTime = datetime.now() # datetime.datetime(2021, 8, 30, 10, 57, 25, 110374)
        elapsedTime = curTime - lastUpdateTime # datetime.timedelta(seconds=10, microseconds=531046)
        elapsedTimeSeconds = elapsedTime.total_seconds() # 2.65892
        isShouldUpdateSession = elapsedTimeSeconds > RefreshSessionPerSeconds
        # if (curIdx % RefreshSessionPerNum) == 0:
        if isShouldUpdateSession:
            lastUpdateTime = curTime
            cmdResp = mongoDb.command("refreshSessions", [mongoSessionId], session=mongoSession)
            logging.info("Called refreshSessions command, resp=%s", cmdResp)

        # do what you want
        existedNewResult = mongoCollectionNew.find_one({"shortLink": "http://xxx"}, session=mongoSession)

    # mongoSession.close()
    mongoSession.end_session()
Refer doc
MongoDB
ClientSession
refreshSessions
pymongo
find
command
I am using mongoose to query a really big list from Mongodb
const chat_list = await chat_model.find({}).sort({uuid: 1}); // uuid is a index
const msg_list = await message_model.find({}, {content: 1, xxx}).sort({create_time: 1});// create_time is a index of message collection, time: t1
// chat_list length is around 2,000, msg_list length is around 90,000
compute(chat_list, msg_list); // time: t2
function compute(chat_list, msg_list) {
for (let i = 0, len = chat_list.length; i < len; i++) {
msg_list.filter(msg => msg.uuid === chat_list[i].uuid)
// consistent handling for every message
}
}
For the above code, t1 is about 46s and t2 is about 150s.
t2 is really too big, which is weird.
Then I cached these lists to local JSON files:
const chat_list = require('./chat-list.json');
const msg_list = require('./msg-list.json');
compute(chat_list, msg_list); // time: t2
This time, t2 is around 10s.
So here comes the question: 150 seconds vs 10 seconds, why? What happened?
I tried to use a worker to do the compute step after the mongo query, but the time is still much bigger than 10s.
The mongodb query returns a FindCursor that includes arrayish methods like .filter() but the result is not an Array.
Use .toArray() on the cursor before filtering to process the mongodb result set like for like. That might not make the overall process any faster, as the result set still needs to be fetched from mongodb, but compute will be similar.
const chat_list = await chat_model
.find({})
.sort({uuid: 1})
.toArray()
const msg_list = await message_model
.find({}, {content: 1, xxx})
.sort({create_time: 1})
.toArray()
Matt typed faster than I did, so some of what was suggested aligns with part of this answer.
I think you are measuring and comparing something different than what you are expecting and implying.
Your expectation is that the compute() function takes around 10 seconds once all of the data is loaded by the application. This is (mostly) demonstrated by your second test, apart from the fact that that test includes the time it takes to load the data from the local files. But you're seeing that there is a difference of 104 seconds (150 - 46) between the completion of message_model.find() and compute() hence leading to the question.
The key thing is that successfully advancing from the find against message_model is not the same thing as retrieving all of the results. As @Matt notes, the find() will return with a cursor object once the initial batch of results is ready. That is very different from retrieving all of the results. So there is more work (apparently ~94 seconds' worth) left to do from the two find() operations to further iterate the cursors and retrieve the rest of the results. This additional time is getting reported inside of t2.
As suggested by @Matt, calling .toArray() should shift that time back into t1 as you are expecting. It also sounds like it may be more correct, given the ambiguity with the .filter() functions.
There are two other things that catch my attention. The first is: why are you retrieving all of this data client-side to do the filtering there? Perhaps you would like to do this uuid matching inside of the database via $lookup?
Secondly, this comment isn't clear to me:
// create_time is a index of message collection, time: t1
create_time itself is a field here, existent or not, that you are requesting an ascending sort against.
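For the $lookup suggestion, a pipeline could look roughly like the sketch below. The helper name buildChatMessagesPipeline, the output field name messages and the collection name passed in are assumptions for illustration; the actual name of the collection behind message_model depends on how the model was defined.

```javascript
// Build an aggregation pipeline that attaches each chat's messages by uuid.
// `messagesCollection` is the name of the underlying messages collection (assumed).
function buildChatMessagesPipeline(messagesCollection) {
  return [
    { $sort: { uuid: 1 } },
    {
      $lookup: {
        from: messagesCollection, // collection to join with
        localField: 'uuid',       // field on the chat documents
        foreignField: 'uuid',     // field on the message documents
        as: 'messages',           // output array field (assumed name)
      },
    },
  ];
}

// Assumed usage with the question's model (not executed here):
//   const chats = await chat_model.aggregate(buildChatMessagesPipeline('messages'));
```

With this shape the database does the uuid matching, so the client receives each chat with its messages already attached.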
You are taking data from 2 collections and then, inside a for loop, comparing IDs using the filter function. As a result, the loop runs 2,000 times and each iteration runs filter over all 90,000 records.
In the worst case, where none of the 2,000 uuids are present in msg_list, you still perform 2,000 * 90,000 comparisons without getting any data back.
It won't take more than 10 to 15 seconds if you use the code below.
//This will generate array of uuid present in message_model
const msg_list = await message_model.find({}, {content: 1, xxx}).sort({create_time: 1}).distinct("uuid");
// Below query will match all uuid present in msg_list array with chat_list UUID
const chat_list = await chat_model.find({uuid:{$in:msg_list}}).sort({uuid: 1});
The above result is doing same as you have done in your code with filter function and loop but this is proper and fastest way to receive the data you required.
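If the matching has to stay on the client, the 2,000 * 90,000 scan described above can also be avoided by grouping the messages by uuid once and then looking each chat up directly. This is a sketch that assumes both lists are already plain arrays in memory; groupByUuid is a hypothetical helper, and compute here returns the grouped result only so the behaviour is visible.

```javascript
// Group messages by uuid in a single pass over msgList.
function groupByUuid(msgList) {
  const byUuid = new Map();
  for (const msg of msgList) {
    const bucket = byUuid.get(msg.uuid);
    if (bucket) bucket.push(msg);
    else byUuid.set(msg.uuid, [msg]);
  }
  return byUuid;
}

// One Map lookup per chat instead of one filter() over every message.
function compute(chatList, msgList) {
  const byUuid = groupByUuid(msgList);
  return chatList.map((chat) => ({
    uuid: chat.uuid,
    messages: byUuid.get(chat.uuid) || [],
  }));
}

const result = compute(
  [{ uuid: 'a' }, { uuid: 'b' }],
  [{ uuid: 'a', content: 'hi' }, { uuid: 'c', content: 'x' }, { uuid: 'a', content: 'yo' }]
);
console.log(result[0].messages.length, result[1].messages.length); // 2 0
```

Messages keep their original relative order inside each group, so a create_time sort done by the query is preserved.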
Take a windowed virtual list with the capability of loading an arbitrary range of rows at any point in the list, such as in this following example.
The virtual list provides a callback that is called anytime the user scrolls to some rows that have not been fetched from the backend yet, and provides the start and stop indexes, so that, in an offset based pagination endpoint, I can fetch the required items without fetching any unnecessary data.
const loadMoreItems = (startIndex, stopIndex) => {
fetch(`/items?offset=${startIndex}&limit=${stopIndex - startIndex}`);
}
I'd like to replace my offset based pagination with a cursor based one, but I can't figure out how to reproduce the above logic with it.
The main issue is that I feel like I will need to download all the items before startIndex in order to receive the cursor needed to fetch the items between startIndex and stopIndex.
What's the correct way to approach this?
After some investigation I found what seems to be the way MongoDB approaches the problem:
https://docs.mongodb.com/manual/reference/method/cursor.skip/#mongodb-method-cursor.skip
Obviously the same approach can be adopted by any other backend implementation.
They provide a skip method that allows skipping an arbitrary number of items after the provided cursor.
This means my sample endpoint would look like the following:
/items?cursor=${cursor}&skip=${skip}&limit=${stopIndex - startIndex}
I then need to figure out the cursor and the skip values.
The following code could work to find the closest available cursor, given I store them together with the items:
// Limit our search only to items before startIndex
const fragment = items.slice(0, startIndex);
// Find the closest cursor index
const cursorIndex = fragment.length - 1 - fragment.reverse().findIndex(item => item.cursor != null);
// Get the cursor
const cursor = items[cursorIndex].cursor;
And of course, I also have a way to know the skip value:
const skip = startIndex - 1 - cursorIndex;
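Putting the two steps together, a helper could look like the sketch below. It rests on several assumptions that are not in the original: items is a possibly sparse array indexed by row, a loaded row may carry a cursor string, that cursor means "the position right after this row", and skip is counted relative to startIndex. The name findCursorAndSkip is hypothetical.

```javascript
// Find the closest known cursor before startIndex and how many rows to skip after it.
function findCursorAndSkip(items, startIndex) {
  for (let i = Math.min(startIndex, items.length) - 1; i >= 0; i--) {
    const item = items[i];
    if (item && item.cursor != null) {
      // Rows i+1 .. startIndex-1 lie between the cursor and the requested range.
      return { cursor: item.cursor, skip: startIndex - 1 - i };
    }
  }
  // No cursor known before startIndex: fall back to skipping from the beginning.
  return { cursor: null, skip: startIndex };
}

const items = [];
items[0] = { cursor: 'c0' };
items[1] = { cursor: 'c1' };
console.log(findCursorAndSkip(items, 5)); // { cursor: 'c1', skip: 3 }
```

Scanning backwards without copying the array also avoids the slice() and reverse() allocations on every scroll event.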
I have a single database (300 MB and 42,924 documents) consisting of about 20 different kinds of documents from about 200 users. The documents range in size from a few bytes to many kilobytes (150 KB or so).
When the server is unloaded, the following replication filter function takes about 2.5 minutes to complete.
When the server is loaded, it takes >10 minutes.
Can anyone comment on whether these times are expected, and if not, suggest how I might optimize things in order to get better performance?
function(doc, req) {
    var acceptedDate = true;
    if (doc.date) {
        var docDate = new Date();
        var dateKey = doc.date;
        docDate.setFullYear(dateKey[0], dateKey[1], dateKey[2]);

        var reqYear = req.query.year;
        var reqMonth = req.query.month;
        var reqDay = req.query.day;
        var reqDate = new Date();
        reqDate.setFullYear(reqYear, reqMonth, reqDay);

        acceptedDate = docDate.getTime() >= reqDate.getTime();
    }
    return doc.user_id && doc.user_id == req.query.userid && doc._id.indexOf("_design") != 0 && acceptedDate;
}
Filtered replication is slow because, for each fetched document, CouchDB runs complex logic to decide whether to replicate it or not:
1. CouchDB fetches the next document;
2. Because the filter function has to be applied, the document gets converted to JSON;
3. The JSONified document passes through stdio to the query server;
4. The query server receives the document and decodes it from JSON;
5. The query server looks up and runs your filter function, which returns a true or false value to CouchDB;
6. If the result is true, the document gets replicated;
7. Go to step 1 and loop for all documents.
For non-filtered replication, take this list, throw away steps 2-5 and let step 6 always have a true result. This overhead slows down the whole replication process.
To significantly improve filtered replication speed, you may use Erlang filters via the Erlang native server. They run inside CouchDB, don't pass through any stdio interface, and incur no JSON decode/encode overhead.
NOTE that the Erlang query server does not run inside a sandbox like the JavaScript one, so you need to really trust the code that you run with it.
Another option is to optimize your filter function, e.g. reduce object creation and method calls, but you wouldn't gain much from this.
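As an illustration of what "reduce object creation" could mean here, the sketch below compares the [year, month, day] values directly instead of building two Date objects per document. It assumes doc.date holds values in their normal ranges (setFullYear would normalize overflowing months or days, this version does not), and the helper compareDateKeys is a hypothetical name that would have to be inlined into the filter function when stored in a design document. As noted above, the gain is likely to be small.

```javascript
// Compare two [year, month, day] triples: negative, zero or positive result.
function compareDateKeys(a, b) {
  for (var i = 0; i < 3; i++) {
    var diff = Number(a[i]) - Number(b[i]); // query values arrive as strings
    if (diff !== 0) return diff;
  }
  return 0;
}

// Same intent as the question's filter, with cheap checks first and no Date objects.
function filter(doc, req) {
  if (!doc.user_id || doc.user_id != req.query.userid) return false;
  if (doc._id.indexOf('_design') === 0) return false;
  if (!doc.date) return true;
  var q = req.query;
  return compareDateKeys(doc.date, [q.year, q.month, q.day]) >= 0;
}
```

Rejecting on user_id first means the date comparison only runs for documents that belong to the requested user.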
What would be the best way to record a live count of connections using the Meteor framework? I have the requirement of live sharing users online and have resorted to creating a collection and just replacing a record on initialize for each user, but the count seems to reset. What I have so far is below. Thanks in advance.
Counts = new Meteor.Collection "counts"
if Meteor.is_client
  if Counts.findOne()
    new_count = Counts.findOne().count + 1
    Counts.remove {}
    Counts.insert count: new_count
  Template.visitors.count = ->
    Counts.findOne().count
if Meteor.is_server
  reset_data = ->
    Counts.remove {}
    Counts.insert count: 0
  Meteor.startup ->
    reset_data() if Counts.find().count() is 0
You have a race condition when you rely on "get the count value, remove it from the collection, insert the new count into the collection". Several clients can read the same value X at the same time. That's not the way to go.
Instead, try to make each client insert "itself" into a collection. Store a unique id and the "time" it was inserted. Use a Meteor method to implement a heartbeat that refreshes this "time".
Clients whose time is too old can be deleted from the collection. Use a timer on the server to remove idle clients.
You can check some of this here:
https://github.com/francisbyrne/hangwithme/blob/master/server/game.js
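The heartbeat idea can be sketched independently of Meteor. The class below keeps everything in an in-memory Map purely for illustration; PresenceTracker and its method names are hypothetical, and in a Meteor app the Map would be a collection, heartbeat would be called from a Meteor method, and prune would run on a server-side timer.

```javascript
// Track which clients are online based on their most recent heartbeat.
class PresenceTracker {
  constructor(maxIdleMs) {
    this.maxIdleMs = maxIdleMs;
    this.lastSeen = new Map(); // clientId -> timestamp of last heartbeat
  }

  // Called whenever a client checks in; inserts or refreshes its timestamp.
  heartbeat(clientId, now = Date.now()) {
    this.lastSeen.set(clientId, now);
  }

  // Remove clients that have not checked in within maxIdleMs.
  prune(now = Date.now()) {
    for (const [id, seen] of this.lastSeen) {
      if (now - seen > this.maxIdleMs) this.lastSeen.delete(id);
    }
  }

  // Number of clients currently considered online.
  count() {
    return this.lastSeen.size;
  }
}

const tracker = new PresenceTracker(30000);
tracker.heartbeat('client-1', 0);
tracker.heartbeat('client-2', 20000);
tracker.prune(40000); // client-1 has been idle for 40 s and is dropped
console.log(tracker.count()); // 1
```

Because each client only ever writes its own entry, two clients connecting at the same moment cannot overwrite each other's update, which is the race the single shared counter suffers from.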
I run my bot in a public channel with hundreds of users. Yesterday a person came in and just abused it.
I would like to let anyone use the bot, but if they spam commands consecutively and if they aren't a bot "owner" like me when I debug then I would like to add them to an ignored list which expires in an hour or so.
One way I'm thinking would be to save all commands by all users, in a dictionary such as:
({
'meder#freenode': [{command:'.weather 20851', timestamp: 209323023 }],
'jack#efnet': [{command:'.seen john' }]
})
I would set up a cron job to flush this out every 24 hours, but I would basically determine if a person has made X number of commands in a duration of, say, 15 seconds and add them to an ignore list.
Actually, as I'm writing this I thought of a better idea: maybe instead of storing each user's commands, just store the bot's commands in a list and keep on pushing until it reaches a limit of, say, 15.
var lastCommands = [], limit = 5;
function handleCommand( timeObj, action ) {
    if ( lastCommands.length < limit ) {
        action();
    } else {
        // enumerate through lastCommands and compare the timestamps of all 5 commands
        // if the user is the same for all 5 commands, and...
        // if the timestamps are all within the vicinity of 20 seconds
        // add the user to the ignoreList
    }
}
watch_for('command', function() {
    handleCommand({timestamp: 2093293032, user: user}, function(){ message.say('hello there!') })
});
I would appreciate any advice on the matter.
Here's a simple algorithm:
Every time a user sends a command to the bot, increment a number that's tied to that user. If this is a new user, create the number for them and set it to 1.
When a user's number is incremented to a certain value (say 15), set it to 100.
Every <period> seconds, run through the list and decrement all the numbers by 1. Zero means the user's number can be freed.
Before executing a command and after incrementing the user's counter, check to see if it exceeds your magic max value (15 above). If it does, exit before executing the command.
This lets you rate limit actions and forgive excesses after a while. Divide your desired ban length by the decrement period to find the number to set when a user exceeds your threshold (100 above). You can also add to the number if a particular user keeps sending commands after they've been banned.
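A minimal sketch of the counter scheme described above. The class name CommandLimiter is hypothetical, and the small threshold and penalty in the usage lines are chosen only to keep the example short; the answer's own values were 15 and 100. tick() is meant to be called from a timer every <period> seconds.

```javascript
class CommandLimiter {
  constructor(threshold, penalty) {
    this.threshold = threshold; // e.g. 15
    this.penalty = penalty;     // e.g. 100, sets the ban length in ticks
    this.counters = new Map();  // user -> current number
  }

  // Count the command and return true if it may be executed.
  allow(user) {
    let n = (this.counters.get(user) || 0) + 1;
    if (n === this.threshold) n = this.penalty; // hitting the threshold starts a ban
    this.counters.set(user, n);
    return n <= this.threshold; // blocked once the number exceeds the threshold
  }

  // Periodic decay: decrement every number and free the ones that reach zero.
  tick() {
    for (const [user, n] of this.counters) {
      if (n <= 1) this.counters.delete(user);
      else this.counters.set(user, n - 1);
    }
  }
}

const limiter = new CommandLimiter(3, 10);
console.log(limiter.allow('meder')); // true
console.log(limiter.allow('meder')); // true
console.log(limiter.allow('meder')); // false
```

With these values the third command in a row triggers the ban, and the user has to wait for the number to decay before a command gets through again.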
Well Nathon has already offered a solution, but it's possible to reduce the code that's needed.
var user = {};
user.lastCommandTime = new Date().getTime(); // time the user sent their last command
user.commandCount = 0; // command limit counter
user.maxCommandsPerSecond = 1; // commands allowed per second

function handleCommand(obj, action) {
    var user = obj.user, now = new Date().getTime();
    var timeDifference = now - user.lastCommandTime;
    user.commandCount = Math.max(user.commandCount - (timeDifference / 1000 * user.maxCommandsPerSecond), 0) + 1;
    user.lastCommandTime = now;
    if (user.commandCount <= user.maxCommandsPerSecond) {
        console.log('command!');
    } else {
        console.log('flooding');
    }
}

var obj = {user: user};
var e = 0;
function foo() {
    handleCommand(obj, 'foo');
    e += 250;
    setTimeout(foo, 400 + e);
}
foo();
In this implementation, there's no need for a list or a global callback every X seconds. Instead, we just reduce the commandCount every time there's a new message, based on the time difference to the last command. It's also possible to allow different command rates for specific users.
All we need are 3 new properties on the user object :)
Redis
I would use the insanely fast advanced key-value store redis to write something like this, because:
It is insanely fast.
There is no need for cronjob because you can set expire on keys.
It has atomic operations to increment key
You could use redis-cli for prototyping.
I myself really like node_redis as redis client. It is a really fast redis client, which can easily be installed using npm.
Algorithm
I think my algorithm would look something like this:
For each user, create a unique key that counts the commands executed consecutively. Also set an expiry on the key equal to the time after which the user should no longer be flagged as a spammer. Let's assume the spammer has the nickname x and the expiry is 15 seconds.
Inside redis-cli
incr x
expire x 15
When you do a get x after 15 seconds, the key no longer exists.
If the value of the key is bigger than the threshold, flag the user as a spammer.
get x
These answers seem to be going the wrong way about this.
IRC Servers will disconnect your client regardless of whether you're "debugging" or not if the client or bot is flooding a channel or the server in general.
Make a blanket flood control, using the method @nmichaels has detailed, but on the bot's network connection to the server itself.
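One way to picture flood control on the bot's own connection is to pace outgoing lines. Note that the sketch below uses a simpler fixed minimum gap between lines rather than the counter scheme referred to above, and both the function name scheduleSends and the one-second gap are assumptions; the limits an IRC server actually enforces vary by network.

```javascript
// Given the times (ms) at which the bot wants to send lines, return the times at
// which they may actually go out so that consecutive lines are >= minGapMs apart.
function scheduleSends(requestedTimes, minGapMs) {
  const sendTimes = [];
  let earliest = -Infinity; // earliest moment the next line may be sent
  for (const t of requestedTimes) {
    const sendAt = Math.max(t, earliest);
    sendTimes.push(sendAt);
    earliest = sendAt + minGapMs;
  }
  return sendTimes;
}

console.log(scheduleSends([0, 100, 200, 5000], 1000)); // [ 0, 1000, 2000, 5000 ]
```

Because the limit sits on the outgoing side, it applies no matter which user triggered the replies, including the owner while debugging.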