Index a new document and get the indexed document in the same query - Node.js

Is it possible to index a new document and return it after it has been successfully indexed?
I tried taking the _id that the index call returns, but that means two queries, and because the index action takes some time the second query often doesn't find the _id yet, so it doesn't always work reliably.
This is the query that indexes the document:
const query = await elasticClient.index({
  routing: "dasdsad34_d",
  index: "milan",
  body: {
    text: "san siro",
    user: {
      user_id: "3",
      username: "maldini",
    },
    tags: ["Forza Milan", "grande milan"],
    publish_date: new Date(),
    likes: [],
    users_tags: [1, 5],
    type: {
      name: "comment",
      parent: "dasdsad34_d",
    },
  },
});

No, it's not possible with the default behavior. Elasticsearch only offers near real time search: its default refresh interval is 1 second, because an index refresh is considered a costly operation.
To overcome this, you can add refresh=true to your indexing operation. You can get further details from the links below.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
Please note that this is NOT a recommended option, as it comes with a huge overhead. Only use it if the number of inserts into the index in question is very low.
The recommended way is to use refresh=wait_for on your indexing operation. The downside is that the call waits for the natural refresh to complete, up to a second. So if you have the default refresh interval of 1 second and are okay with that as a trade-off, this is the way to go.
However, if a higher refresh interval is set, the wait time for the indexing operation will be as high as that interval, so choose your option carefully.
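For illustration, here is a minimal sketch of passing the refresh parameter with the official @elastic/elasticsearch Node.js client; the client variable, index and routing values are taken from the question, and the response shape (response.body) assumes a 7.x-style client (the 8.x client returns the body directly):

// Sketch only: refresh: "wait_for" makes the call return once the document is searchable.
const response = await elasticClient.index({
  index: "milan",
  routing: "dasdsad34_d",
  refresh: "wait_for", // or true, with the overhead caveat above
  body: {
    text: "san siro",
    publish_date: new Date(),
  },
});

// The _id is available immediately, and a follow-up search will now see the document.
const { _id } = response.body;
const found = await elasticClient.search({
  index: "milan",
  routing: "dasdsad34_d",
  body: { query: { ids: { values: [_id] } } },
});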

How can I optimize this query in MongoDB?

Here is the query:
const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          {
            $match: {
              created_at: { $gt: prevSunday.toISOString() },
            },
          },
          {
            $group: {
              _id: "$url",
            },
          },
        ],
      },
    },
    {
      $match: {
        history: { $ne: [] },
      },
    },
    {
      $group: {
        _id: "$word",
        url: {
          $addToSet: "$history._id",
        },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$url",
              initialValue: [],
              in: {
                $concatArrays: ["$$value", "$$this"],
              },
            },
          },
        },
      },
    },
    {
      $sort: {
        count: -1,
      },
    },
    {
      $limit: 50,
    },
  ])
  .toArray();
I think I need an index, but I'm not sure how or where to add one.
Perhaps the performance of this operation should be revisited after we confirm that it satisfies the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.
To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
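For reference, a sketch of creating that index with the same Node.js driver handle used in the question (the mongo variable is taken from the query above):

// Sketch only: compound index supporting the $lookup's equality match on search_id,
// the range filter on created_at, and the $group on url.
await mongo
  .collection("search_history")
  .createIndex({ search_id: 1, created_at: 1, url: 1 });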
Desired Logic
Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:
For every single document in the positive collection, add a new field called history.
This new field is a list of url values from the search_history collection where the corresponding document has a matching search_id value and was created_at after last Sunday.
The aggregation then filters to only keep documents where the new history field has at least one entry.
The next stage then groups the results together by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
The final 3 stages of the aggregation seem to be focused on calculating the number of urls and returning the top 50 results by word sorted on that size in descending order.
Is this what you want? In particular the following aspects may be worth confirming:
Is it your intention to process every document in the positive collection? This may be the case, but it's impossible to tell without any schema/use-case context.
Is the size calculation of the urls correct? It seems like you may need to use a $map when doing the $addToSet in the $group, rather than relying on $reduce in the subsequent $project (a sketch follows below).
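For illustration only, one possible reading of that suggestion is to extract the urls with $map in the $group and then de-duplicate with $setUnion instead of $concatArrays. This is a hedged sketch of the last two stages, not necessarily what was intended:

{
  $group: {
    _id: "$word",
    // $map pulls the url values out of each joined history array explicitly
    url: { $addToSet: { $map: { input: "$history", as: "h", in: "$$h._id" } } },
  },
},
{
  $project: {
    _id: 0,
    word: "$_id",
    count: {
      $size: {
        $reduce: {
          input: "$url",
          initialValue: [],
          // $setUnion flattens and de-duplicates the array of arrays
          in: { $setUnion: ["$$value", "$$this"] },
        },
      },
    },
  },
},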
The best thing to do is to limit the number of documents passed to each stage.
MongoDB uses indexes in an aggregation only in the first stage, and only if that stage is a $match, using at most one index.
So the best thing to do is to have a $match on an indexed field that is very restrictive.
Moreover, please note that $limit, $skip and $sample are not panaceas: they still scan the entire collection.
A way to efficiently limit the number of documents selected in the first stage is to use "pagination". You can make it work like this:
Once every X requests:
Count the number of docs in the collection
Divide this into chunks of Yk docs max
Find the _ids of the docs at positions Y, 2Y, 3Y, etc. with skip and limit
Cache the results in redis/memcache (or as a global variable if you really cannot do otherwise)
On every request:
Get the current chunk to scan by reading the redis keys used and nbChunks
Get the _ids cached in redis that delimit the next aggregation, at id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively
Aggregate using a first $match with _id: {$gte: ObjectId(id0), $lt: ObjectId(id1)} (a sketch follows below)
Increment used; if used > X, recompute the chunks
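A rough sketch of that per-request step, assuming an async redis client and the same mongo handle as in the question; the used and nbChunks values and the id:N key naming are the ones described above, and the helper name is illustrative:

const { ObjectId } = require("mongodb");

// Sketch only: fetch the cached chunk boundaries and run the pipeline on that
// _id range, so the first $match uses the _id index and scans at most ~Y docs.
async function aggregateCurrentChunk(mongo, redis, used, nbChunks) {
  const id0 = await redis.get(`id:${used % nbChunks}`);
  const id1 = await redis.get(`id:${(used % nbChunks) + 1}`);

  const idRange = id1
    ? { $gte: new ObjectId(id0), $lt: new ObjectId(id1) }
    : { $gte: new ObjectId(id0) }; // last chunk: no $lt, as noted below

  return mongo
    .collection("positive")
    .aggregate([
      { $match: { _id: idRange } },
      // ...the rest of the original pipeline ($lookup, $group, $project, $sort, $limit)...
    ])
    .toArray();
}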
Further optimisation
If using redis, prefix every key with ${cluster.worker.id}: to avoid hot keys.
Notes
Step 3 of the chunk setup can be a really long and intensive process, so do it only when necessary, let's say every X ≈ 1k requests.
If you are scanning the last chunk, do not put the $lt.
Once this process is implemented, your job is to find the sweet spot for X and Y that suits your needs: Y should be large enough to retrieve as many documents as possible without taking too long, and X should keep the chunks roughly equal as the collection grows.
This process is a bit long to implement, but once it is, the time complexity is ~O(Y) rather than ~O(N). Indeed, since the $match is the first stage and _id is an indexed field, this first stage is really fast and limits the scan to at most Y documents.
Hope it helps =) Make sure to ask more if needed =)

How to limit users daily post limit (MERN)

I'm currently using passport for authentication and MongoDB to store user information.
However, I'm stuck on how to enforce a daily post limit per user. I was thinking of having a field like dailyPostLimit in the User schema and deducting from that count whenever the user posts something.
const user = new mongoose.Schema({
  githubId: {
    required: true,
    type: String,
  },
  username: {
    required: true,
    type: String,
  },
  dailyPostLimit: {
    type: Number,
    default: 3,
  },
});
However, I'm not sure if there's a way to reset that count to the default (3) every day. Is a cron task suitable here, or is there a simpler way to accomplish this?
A cron task works well for resetting a value like this, and caching the remaining count is a reasonable approach to solving this problem. But keep in mind that you're caching this value, and cache invalidation is a hard problem that can often lead to bugs & additional complexity.
counting posts
Rather than caching, my first instinct would be to count the number of posts each time. Here's some pseudo code:
const count = await posts.count({ userId, createdAt: { $gte: startOfDay } });
// alternative: const count = await posts.count({ userId, _id: { $gte: startOfDayConvertedToMongoId } });
if (count >= 3) throw new Error("no more posts for you");
await posts.create(newPost);
(note: if you're worried about race conditions, any solution you choose will need to check the count in a transaction; a sketch of that follows below)
If you have an index that starts with {userId: 1, createdAt: 1}, or if you use the _id instead {userId: 1, _id: 1} (assuming that you're not allowing client _id creation), these queries will be quite cheap, and it'll be hard for them to get out of sync.
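A hedged sketch of that transactional check with the MongoDB Node.js driver; the client, posts, userId and newPost names are assumptions carried over from the pseudo code, and transactions require a replica set or sharded cluster:

// Sketch only: run the count check and the insert in one transaction, as noted above.
const session = client.startSession();
try {
  await session.withTransaction(async () => {
    const startOfDay = new Date(new Date().setUTCHours(0, 0, 0, 0));
    const count = await posts.countDocuments(
      { userId, createdAt: { $gte: startOfDay } },
      { session }
    );
    if (count >= 3) throw new Error("no more posts for you");
    await posts.insertOne({ ...newPost, userId, createdAt: new Date() }, { session });
  });
} finally {
  await session.endSession();
}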
separate cache collection
Even if you do decide to cache these creation counts, I'd recommend caching them away from your users collection to keep your collections more focused. You could create a post_count collection and then update only the cache collection for these counts: post_count.updateOne({userId, day}, {$inc: {count: 1}, $setOnInsert: {day, userId}}, {upsert: true});. One nice benefit of this approach is that you can use a TTL index on day (stored as a Date) to have mongo automatically remove the old documents in this collection after they've expired; a sketch follows below.
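A minimal sketch of that cache collection, assuming a db handle from the Node.js driver; the collection name, helper name and 48-hour expiry are illustrative:

// Run once: expire counter documents some time after their day has passed.
const postCount = db.collection("post_count");
await postCount.createIndex({ day: 1 }, { expireAfterSeconds: 60 * 60 * 48 });

async function incrementDailyCount(userId) {
  // midnight UTC of the current day, stored as a Date so the TTL index applies
  const day = new Date(new Date().setUTCHours(0, 0, 0, 0));
  await postCount.updateOne(
    { userId, day },
    { $inc: { count: 1 }, $setOnInsert: { userId, day } },
    { upsert: true }
  );
}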
Since you are using MongoDB, I would suggest:
Use agenda and create a job that runs at 00:00 UTC (if you have users from diverse time zones) or in a time zone specific to your users' country.
In this job, call updateMany on your User model to reset the dailyPostLimit field (a sketch follows below).
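A minimal sketch of such a job, assuming the agenda package and the mongoose User model from the question; the job name, connection string, model path and require style are illustrative and may differ by agenda version:

const Agenda = require("agenda"); // agenda v4+: const { Agenda } = require("agenda");
const User = require("./models/user"); // hypothetical path to the mongoose model above

const agenda = new Agenda({ db: { address: "mongodb://localhost/agenda-jobs" } });

// Reset every user's counter back to the default of 3.
agenda.define("reset daily post limits", async () => {
  await User.updateMany({}, { $set: { dailyPostLimit: 3 } });
});

(async () => {
  await agenda.start();
  await agenda.every("0 0 * * *", "reset daily post limits"); // midnight, server/UTC time
})();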

Batch updates reporting contention error using nodejs

I am trying to update a collection that has 1400+ offices. After checking and running the query I update a document in the collection and also update the subcollection with a few details, but sometimes I get this error:
10 ABORTED: Too much contention on these documents. Please try again.
I am simply using a batch for writing to the docs. Here is my code for the update in the collection:
batch.set(
  rootCollections.x
    .doc(getISO8601Date())
    .collection(subCollection.y)
    .doc(change.after.get('Id')),
  {
    officeId: change.after.get('Id'),
    office: change.after.get('office'),
    status: change.after.get('status'),
    creationTimestamp:
      change.after.get('creationTimestamp') ||
      change.after.get('createTimestamp') ||
      change.after.get('timestamp'),
    activeUsers: [...new Set(phoneNumbers)].length,
    confirmedUers: activityCheckinSubscription.docs.length,
    uniqueActivities: [...new Set(activities)].length,
    payments: 0,
    contact: [
      `${change.after.data().creator.phoneNumber},${
        change.after.data().attachment['First Contact'].value
      }`,
    ],
  },
  { merge: true },
);
batch.set(
  rootCollections.x.doc(getISO8601Date()),
  {
    Added: admin.firestore.FieldValue.increment(1),
  },
  { merge: true },
);
PromiseArray.push(batch.commit());
await Promise.all(PromiseArray);
It seems that you are facing the same issue as in this similar case here, where thousands of records in the database were being updated. As clarified there, there is a limit on how many writes you can perform on a single document in one second - more details here - and even though Firestore might sometimes keep up with faster writes, it will fail at some point.
As this limit is hard-coded and imposed by Firestore, what you can try is the solution explained in this similar case here: either switch to the Realtime Database, where the limit is not the number of writes but the size of the data, or, if you are using a counter or some other data aggregation in Firestore, use a distributed counter solution, for which you can get more details here.
To summarize, there isn't much you can do other than work around it with that solution, as this is a documented limitation of Firestore.
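For illustration, a hedged sketch of the distributed counter pattern applied to the Added counter above; the shard count, subcollection name, and the assumption that admin.initializeApp() has already run elsewhere are all illustrative:

const admin = require("firebase-admin");

const NUM_SHARDS = 10;

// Increment a random shard instead of the parent doc, spreading the write load
// across NUM_SHARDS documents (e.g. dayDocRef = rootCollections.x.doc(getISO8601Date())).
async function incrementAdded(dayDocRef) {
  const shardId = Math.floor(Math.random() * NUM_SHARDS).toString();
  await dayDocRef
    .collection("addedShards")
    .doc(shardId)
    .set({ count: admin.firestore.FieldValue.increment(1) }, { merge: true });
}

// Sum the shards whenever the total is needed.
async function getAddedTotal(dayDocRef) {
  const shards = await dayDocRef.collection("addedShards").get();
  return shards.docs.reduce((sum, doc) => sum + (doc.get("count") || 0), 0);
}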

Upsert and $inc Sub-document in Array

The following schema is intended to record total views and views for a very specific day only.
const usersSchema = new Schema({
  totalProductsViews: { type: Number, default: 0 },
  productsViewsStatistics: [{
    day: { type: String, default: new Date().toISOString().slice(0, 10), unique: true },
    count: { type: Number, default: 0 },
  }],
});
So today's views will be stored in a different subdocument from yesterday's. To implement this I tried to use upsert, so that a subdocument is created for each day a product is viewed and the counts are incremented and recorded per day. I tried the following function, but it doesn't seem to work the way I intended.
usersSchema.statics.increaseProductsViews = async function (id) {
  // Based on day only.
  const todayDate = new Date().toISOString().slice(0, 10);
  const result = await this.findByIdAndUpdate(id, {
    $inc: {
      totalProductsViews: 1,
      'productsViewsStatistics.$[sub].count': 1,
    },
  },
  {
    upsert: true,
    arrayFilters: [{ 'sub.day': todayDate }],
    new: true,
  });
  console.log(result);
  return result;
};
What do I miss to get the functionality I want? Any help will be appreciated.
What you are trying to do here actually requires you to understand some concepts you may not have grasped yet. The two primary ones are:
You cannot use any positional update as part of an upsert, since it requires the data to be present.
Adding items into arrays mixed with "upsert" is generally a problem that you cannot solve in a single statement.
It's a little unclear whether "upsert" is your actual intention anyway, or whether you just presumed that was what you had to add in order to get your statement to work. It does complicate things if that is your intent, even if it's unlikely given the findByIdAndUpdate() usage, which implies you were actually expecting the "document" to always be present.
At any rate, it's clear you actually expect to "update the array element when found, OR insert a new array element when not found". This is actually a two-write process, and three when you consider the "upsert" case as well.
For this, you actually need to invoke the statements via bulkWrite():
usersSchema.statics.increaseProductsViews = async function (_id) {
  // Based on day only.
  const todayDate = new Date().toISOString().slice(0, 10);
  await this.bulkWrite([
    // Try to match an existing element and update it ( do NOT upsert )
    {
      "updateOne": {
        "filter": { _id, "productsViewsStatistics.day": todayDate },
        "update": {
          "$inc": {
            "totalProductsViews": 1,
            "productsViewsStatistics.$.count": 1
          }
        }
      }
    },
    // Try to $push where the element is not there but the document is - ( do NOT upsert )
    {
      "updateOne": {
        "filter": { _id, "productsViewsStatistics.day": { "$ne": todayDate } },
        "update": {
          "$inc": { "totalProductsViews": 1 },
          "$push": { "productsViewsStatistics": { "day": todayDate, "count": 1 } }
        }
      }
    },
    // Finally attempt upsert where the "document" was not there at all,
    // only if you actually mean it - so optional
    {
      "updateOne": {
        "filter": { _id },
        "update": {
          "$setOnInsert": {
            "totalProductsViews": 1,
            "productsViewsStatistics": [{ "day": todayDate, "count": 1 }]
          }
        },
        "upsert": true
      }
    }
  ]);

  // return the modified document if you really must
  return this.findById(_id); // Not atomic, but the lesser of all evils
};
So there is a very good reason why the positional filtered [<identifier>] operator does not apply here. The main reason is that its intended purpose is to update multiple matching array elements, and you only ever want to update one. There is a specific operator for that, the positional $ operator, which does exactly that. Its condition, however, must be included within the query predicate ( the "filter" property in the updateOne statements ), just as demonstrated in the first two statements of the bulkWrite() above.
So the main problem with using the positional filtered [<identifier>] is that, just as the first two statements show, you cannot actually alternate between the $inc or $push depending on whether the document actually contains an array entry for the day. At best, no update will be applied when the current day is not matched by the expression in arrayFilters.
At worst, an actual "upsert" will throw an error because MongoDB cannot decipher the "path name" from the statement, and of course you simply cannot $inc something that does not exist as a "new" array element. That needs a $push.
That leaves you with the mechanic that you also cannot do both the $inc and $push within a single statement. MongoDB will error that you are attempting to "modify the same path" as an illegal operation. Much the same applies to $setOnInsert: whilst that operator only applies to "upsert" operations, it does not preclude the other operations from happening.
Thus the logical steps fall back to what the comments in the code also describe:
Attempt to match where the document contains an existing array element, then update that element. Using $inc in this case
Attempt to match where the document exists but the array element is not present and then $push a new element for the given day with the default count, updating other elements appropriately
IF you actually did intend to upsert documents ( not array elements, because that's the above steps ) then finally actually attempt an upsert creating new properties including a new array.
Finally, there is the issue of the bulkWrite(). Whilst this is a single request to the server with a single response, it is still effectively three ( or two, if that's all you need ) operations. There is no way around that, and it is better than issuing chained separate requests using findByIdAndUpdate() or even updateOne().
Of course, the main operational difference from the perspective of the code you attempted to implement is that this method does not return the modified document. There is no way to get a "document response" from any "Bulk" operation at all.
As such, the actual "bulk" process will only ever modify the document with one of the three statements submitted, based on the presented logic and, most importantly, the order of those statements. But if you actually want to "return the document" after modification, the only way to do that is with a separate request to fetch the document.
The only caveat here is the small possibility that other modifications could have occurred to the document beyond the "array upsert", since the read and the update are separated. There really is no way around that, short of "chaining" three separate requests to the server and then deciding which "response document" actually applied the update you wanted.
So with that context, it's generally considered the lesser of evils to do the read separately. It's not ideal, but it's the best option available from a bad bunch.
As a final note, I would strongly suggest storing the day property as a BSON Date instead of a string. It actually takes fewer bytes to store and is far more useful in that form. As such, the following constructor is probably the clearest and least hacky:
const todayDate = new Date(new Date().setUTCHours(0,0,0,0))

create mongodb document with subdocuments atomically?

I hope I'm having a big brainfart moment. But here's my situation in a scraping scenario:
I want to be able to scrape across multiple machines and cores. Per site, I have different front pages that I scrape (e.g. for the site stackoverflow I'd have the fronts stackoverflow.com/questions/tagged/javascript and stackoverflow.com/questions/tagged/nodejs).
An article could appear on every Front. When I discover an article I want to create an Article if the url is unknown; if it's known, I want to add a Front entry to article.discover if the Front is unknown, and otherwise insert my FrontDiscovery into the appropriate Front.
Here are my schemas:
FrontDiscovery = new Schema({
  _id  : { type: ObjectId, auto: true },
  date : { type: Date, default: Date.now },
  dims : { type: Object, default: null },
  pos  : { type: Object, default: null }
});

Front = new Schema({
  _id   : { type: ObjectId, auto: true },
  url   : { type: String }, // front
  found : [ FrontDiscovery ]
});

Article = new Schema({
  _id     : { type: ObjectId, auto: true },
  url     : { type: String, index: { unique: true } },
  site    : { type: String },
  discover: [ Front ]
});
The problem I think I will eventually run into is a race condition: two job-runners (in parallel) find the same (previously unknown) article and both create a new one. Yes, I have a unique index on the url and could handle it that way - quite inelegantly, imho.
But let's go further: when - for whatever reason - my two job-runners scrape the same front at the same time, both notice that there is no entry for the Front yet and both create a new one, adding the FrontDiscovery, I'd end up with two entries for the same Front.
What are your strategies to circumvent such a situation? findByIdAndUpdate with upsert:true for each document separately? If so, how can I push something onto the embedded document collection without overwriting everything else, while still creating the defaults if it hasn't been created yet?
Thank you for any help in directing me in the right direction! I really hope I'm having a massive brainfart..
An update with upsert=true can be used to perform an atomic "insert or update" (http://docs.mongodb.org/manual/core/update/#update-operations-with-the-upsert-flag).
For instance, if we want to make sure a document in the Front collection with a specific url is inserted exactly once, we could run something like:
db.Front.update(
  { url: 'http://example.com' },
  { $set: {
      url: 'http://example.com',
      found: true
    }
  },
  { upsert: true }
)
Operations on a single document in MongoDB are always atomic. If you make updates that span multiple documents, then no atomicity is guaranteed. In such cases, ask yourself: do I really need the operations to be atomic? If the answer is no, then you will probably find your way around working with potentially inconsistent data. If the answer is yes and you want to stick with MongoDB, check out the design pattern on Two Phase Commits.
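For the embedded-array part of the question, a hedged sketch of an atomic "push or create" on a single Front document; it assumes a model compiled from the Front schema above, and the url and discovery values are illustrative:

// Sketch only: with upsert, the equality fields from the filter are copied into the
// newly inserted document, and $push appends without overwriting other fields.
const discovery = { date: new Date(), dims: null, pos: null };

await Front.updateOne(
  { url: 'http://example.com/front' },   // creates the Front with this url if missing
  { $push: { found: discovery } },        // appends the FrontDiscovery either way
  { upsert: true }
);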
