Elasticsearch query causing NodeJS heap out of memory - node.js

What's happening now?
Recently I built an Elasticsearch query. Its main purpose is to get the data count per hour for the past 12 weeks.
When the query gets called over and over again, NodeJS memory grows from 20mb up to 1024mb. Surprisingly, the memory doesn't shoot up immediately. It stays stable under 25mb (for several minutes), then suddenly starts growing (25mb, 46mb, 125mb, 350mb... until 1024mb) and finally causes a NodeJS out-of-memory crash. Whether I call this query or not, the memory keeps growing and is never released. This scenario only happens on the remote server (running in Docker); the local Docker environment is totally fine (the environment is identical).
How am I querying?
Like below.
const query = {
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { terms: { '_id.keyword': array_id } },
        {
          "range": {
            "date_created": {
              "gte": start_timestamp - timestamp_twelve_weeks,
              "lt": start_timestamp
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "shortcode_log": {
      "date_histogram": {
        "field": "date_created",
        "interval": "3600ms"
      }
    }
  }
}
What's the return value?
Like below (total query time is around 2 seconds).
{
  "aggs_res": {
    "shortcode_log": {
      "buckets": [
        {
          "key": 1594710000,
          "doc_count": 2268
        },
        {
          "key": 1594713600,
          "doc_count": 3602
        },
        { //.....total item count 2016
      ]
    }
  }
}

If your histogram interval is really 3600ms (shouldn't it be 3600s?), that is a very short period over which to aggregate 12 weeks of data.
It means buckets of 0.06 minutes:
24,000 buckets per day
168,000 per week
2,016,000 for 12 weeks
That would explain:
why your script waits a long time before doing anything
why your memory explodes when you try to loop over the buckets
In your example you only get 2016 buckets back, so I think there is a small difference between your two tests.
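If hourly buckets are what you intended (the 2016 buckets in your sample response suggest they are), a fixed one-hour interval is a safer way to express that. A minimal sketch of the corrected aggregation, assuming hourly buckets were the goal (on recent Elasticsearch versions the parameter is named fixed_interval or calendar_interval instead of interval):
"aggs": {
  "shortcode_log": {
    "date_histogram": {
      "field": "date_created",
      "interval": "1h"   // or "3600s"; "3600ms" would mean 3.6-second buckets
    }
  }
}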

New update: the issue is solved. The project has a layer between the server and the DB, and the code in that layer was preventing the query memory from being released.

Related

How to get the total Gitlab CI/CD minutes used from the API?

I'm looking to get the total number of Gitlab CI/CD minutes used for a group using the API. It would be useful to also get the group's quota/minutes left.
I saw this documentation on how to get it from the website, but it doesn't specify how to get it from the API.
I also saw the "Can I proactively monitor my CI/CD Minutes usage?" section on this page, but the projects it links to seem to fetch all of the pipelines and then aggregate their durations. I'd prefer it if I could make a single API call to get the total minutes used.
I've managed to figure it out now by looking at the API calls the website makes on the https://gitlab.com/groups/<my group>/-/usage_quotas page.
It uses a query like this with the Gitlab GraphQL API:
{
  ciMinutesUsage(namespaceId: "gid://gitlab/Group/<group ID>") {
    nodes {
      month
      monthIso8601
      minutes
      projects {
        nodes {
          minutes
          project {
            id
            name
          }
        }
      }
    }
  }
}
The response looks like this:
{
  "data": {
    "ciMinutesUsage": {
      "nodes": [
        {
          "month": "January",
          "monthIso8601": "2023-01-01",
          "minutes": 3,
          "projects": {
            "nodes": [
              {
                "minutes": 3,
                "project": {
                  "id": "gid://gitlab/Project/<project ID>",
                  "name": "Description"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
If you only want to get the usage for the last month, you can change your query to ciMinutesUsage(namespaceId: "gid://gitlab/Group/<group ID>", first: 1).
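For reference, here is a minimal sketch of sending that query from Node.js. The https://gitlab.com/api/graphql endpoint and bearer-token authentication are standard GitLab features, but the GITLAB_TOKEN environment variable and the group ID placeholder are assumptions you would fill in yourself:
// Minimal sketch: assumes Node 18+ (global fetch, top-level await in an ES module)
// and a personal access token with the read_api scope in GITLAB_TOKEN.
const query = `
{
  ciMinutesUsage(namespaceId: "gid://gitlab/Group/<group ID>") {
    nodes { month monthIso8601 minutes }
  }
}`;

const res = await fetch("https://gitlab.com/api/graphql", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.GITLAB_TOKEN}`,
  },
  body: JSON.stringify({ query }),
});

const data = await res.json();
console.log(data.data.ciMinutesUsage.nodes);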

How to get date from object stored in mongodb?

I'm making an application with chats and posts, and in the chat, I want to show the time of the most recent post.
The date part of my mongoose schema for posts is below, as well as a picture of what my Mongodb date field looks like.
I'm trying to get the date to the front end, but am unsure how to format it. I want to format it as "yesterday at 12:45pm" or "6 days ago", etc.
Any help is very much appreciated.
Mongoose Schema field for date:
date: {
  type: Date,
  default: Date.now
}
Mongodb data field
Mongo (and, as far as I know, every other database in the world) does not store concepts like "yesterday" or "last week".
The problem with concepts like "yesterday" is that they are very semantic: if it's 00:01, is something from 2 minutes ago "yesterday"? If the answer is yes, you would have to update your database every minute; even if you compromise and only look at the time difference, you would still have to update it every day.
I'm not sure what actual business needs make you want to do this, but I recommend doing it while fetching documents; otherwise it is not scalable.
Here is a quick example of how to do this:
db.collection.aggregate([
  {
    "$addFields": {
      currDay: { "$dayOfMonth": "$$NOW" },
      dateDay: { "$dayOfMonth": "$date" },
      dayGap: {
        "$divide": [
          { "$subtract": ["$$NOW", "$date"] },
          86400000 /** milliseconds in a day */
        ]
      }
    }
  },
  {
    $addFields: {
      date: {
        "$switch": {
          "branches": [
            {
              "case": {
                $and: [
                  { $lt: ["$dayGap", 1] },
                  { $eq: ["$dateDay", "$currDay"] }
                ]
              },
              "then": "today"
            },
            {
              "case": { $lt: ["$dayGap", 2] },
              "then": "yesterday"
            },
            {
              "case": { $lt: ["$dayGap", 1] },
              "then": "today"
            }
          ],
          default: {
            "$concat": [
              { "$toString": { "$round": "$dayGap" } },
              " days ago"
            ]
          }
        }
      }
    }
  }
],
{
  allowDiskUse: true
})
MongoPlayground
As you can see, you have to manually construct the "phrase" you want for every single option. You can obviously do the same in code; I just chose to show the "Mongo" way as I feel it is more complicated.
If you do end up choosing to update your database ahead of time, you can use the same pipeline combined with $out to achieve this.
One final note: I cheated a little, as this aggregation only looks at the millisecond difference (apart from the "today" branch). Meaning that if it's 1AM and a document is 50 hours old, it will still show up as two days ago even though its calendar date is three days back.
I hope this example shows you why this kind of formatting is rarely done in the database and the difficulties it brings. Mind you, I haven't even brought up timezones; concepts like "yesterday" are even more semantic across different regions.
In my opinion the only viable "real" solution is to build a custom function that does this in code. Mind you, this is not much fun either, as you have to account for things like leap years, timezones, geographical regions and more, but it is doable.
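For completeness, here is a minimal client-side sketch of such a function. The name formatRelativeDate is made up for illustration, and it deliberately ignores timezones, locales and the other edge cases mentioned above:
// Rough sketch: formats a Date relative to "now", mirroring the pipeline above.
function formatRelativeDate(date, now = new Date()) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const dayGap = (now - date) / msPerDay;

  // same calendar day and less than 24h old -> "today"
  const sameCalendarDay = date.toDateString() === now.toDateString();
  if (dayGap < 1 && sameCalendarDay) return "today";
  if (dayGap < 2) return "yesterday";
  return `${Math.round(dayGap)} days ago`;
}

// Example: a post created six days ago
console.log(formatRelativeDate(new Date(Date.now() - 6 * 86400000))); // "6 days ago"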

Compare with time part only in mongodb query [duplicate]

I have a MongoDB datastore set up with location data stored like this:
{
  "_id" : ObjectId("51d3e161ce87bb000792dc8d"),
  "datetime_recorded" : ISODate("2013-07-03T05:35:13Z"),
  "loc" : {
    "coordinates" : [
      0.297716,
      18.050614
    ],
    "type" : "Point"
  },
  "vid" : "11111-22222-33333-44444"
}
I'd like to be able to perform a query similar to the date range example but instead on a time range. i.e. Retrieve all points recorded between 12AM and 4PM (can be done with 1200 and 1600 24 hour time as well).
e.g.
With points:
"datetime_recorded" : ISODate("2013-05-01T12:35:13Z"),
"datetime_recorded" : ISODate("2013-06-20T05:35:13Z"),
"datetime_recorded" : ISODate("2013-01-17T07:35:13Z"),
"datetime_recorded" : ISODate("2013-04-03T15:35:13Z"),
a query
db.points.find({'datetime_recorded': {
$gte: Date(1200 hours),
$lt: Date(1600 hours)}
});
would yield only the first and last point.
Is this possible? Or would I have to do it for every day?
Well, the best way to solve this is to store the minutes separately as well. But you can get around this with the aggregation framework, although that is not going to be very fast:
db.so.aggregate( [
  { $project: {
    loc: 1,
    vid: 1,
    datetime_recorded: 1,
    minutes: { $add: [
      { $multiply: [ { $hour: '$datetime_recorded' }, 60 ] },
      { $minute: '$datetime_recorded' }
    ] }
  } },
  { $match: { 'minutes' : { $gte : 12 * 60, $lt : 16 * 60 } } }
] );
In the first step, $project, we calculate the minutes as hour * 60 + minute, which we then match against in the second step, $match.
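For example, for 2013-05-01T12:35:13Z the computed value is 12 * 60 + 35 = 755 minutes, which falls inside the [720, 960) range and therefore matches.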
Adding an answer since I disagree with the other answers: even though there are great things you can do with the aggregation framework, this really is not an optimal way to perform this type of query.
If your identified application usage pattern is that you rely on querying for "hours" or other times of the day without wanting to look at the "date" part, then you are far better off storing that as a numeric value in the document. Something like "milliseconds from start of day" would be granular enough for as many purposes as a BSON Date, but of course gives better performance without the need to compute for every document.
Set Up
This does require some set-up in that you need to add the new fields to your existing documents and make sure you add these on all new documents within your code. A simple conversion process might be:
MongoDB 4.2 and upwards
This can actually be done in a single request due to aggregation operations being allowed in "update" statements now.
db.collection.updateMany(
  {},
  [{ "$set": {
    "timeOfDay": {
      "$mod": [
        { "$toLong": "$datetime_recorded" },
        1000 * 60 * 60 * 24
      ]
    }
  }}]
)
Older MongoDB
var batch = [];
db.collection.find({ "timeOfDay": { "$exists": false } }).forEach(doc => {
  batch.push({
    "updateOne": {
      "filter": { "_id": doc._id },
      "update": {
        "$set": {
          "timeOfDay": doc.datetime_recorded.valueOf() % (60 * 60 * 24 * 1000)
        }
      }
    }
  });

  // write once only per reasonable batch size
  if ( batch.length >= 1000 ) {
    db.collection.bulkWrite(batch);
    batch = [];
  }
})

if ( batch.length > 0 ) {
  db.collection.bulkWrite(batch);
  batch = [];
}
If you can afford to write to a new collection, then looping and rewriting would not be required:
db.collection.aggregate([
  { "$addFields": {
    "timeOfDay": {
      "$mod": [
        { "$subtract": [ "$datetime_recorded", new Date(0) ] },
        1000 * 60 * 60 * 24
      ]
    }
  }},
  { "$out": "newcollection" }
])
Or with MongoDB 4.0 and upwards:
db.collection.aggregate([
  { "$addFields": {
    "timeOfDay": {
      "$mod": [
        { "$toLong": "$datetime_recorded" },
        1000 * 60 * 60 * 24
      ]
    }
  }},
  { "$out": "newcollection" }
])
All using the same basic conversion of:
1000 milliseconds in a second
60 seconds in a minute
60 minutes in an hour
24 hours a day
The modulo of the numeric milliseconds since epoch (which is the value internally stored for a BSON Date) is a simple way to extract the current milliseconds in the day.
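For example, ISODate("2013-07-03T05:35:13Z") modulo 86,400,000 gives 5 * 3,600,000 + 35 * 60,000 + 13 * 1,000 = 20,113,000 milliseconds into the day.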
Query
Querying is then really simple, and as per the question example:
db.collection.find({
  "timeOfDay": {
    "$gte": 12 * 60 * 60 * 1000, "$lt": 16 * 60 * 60 * 1000
  }
})
Of course using the same time scale conversion from hours into milliseconds to match the stored format. But just like before you can make this whatever scale you actually need.
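Here 12 * 60 * 60 * 1000 = 43,200,000 (noon) and 16 * 60 * 60 * 1000 = 57,600,000 (4 PM), so this find matches the same documents as the aggregation example above.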
Most importantly, as real document properties which don't rely on computation at run-time, you can place an index on this:
db.collection.createIndex({ "timeOfDay": 1 })
So not only is this negating run-time overhead for calculating, but also with an index you can avoid collection scans as outlined on the linked page on indexing for MongoDB.
For optimal performance you never want to calculate such things at query time: at any real-world scale it simply takes an order of magnitude longer to process every document in the collection just to work out which ones you want than to reference an index and fetch only those documents.
The aggregation framework may just be able to help you rewrite the documents here, but it really should not be used as a production system method of returning such data. Store the times separately.

Time Series Insights - 'uniqueValues' aggregate not working as expected: does not return any data

I'm trying to execute some aggregate queries against data in TSI. For example:
{
  "searchSpan": {
    "from": "2018-08-25T00:00:00Z",
    "to": "2019-01-01T00:00:00Z"
  },
  "top": {
    "sort": [
      {
        "input": {
          "builtInProperty": "$ts"
        }
      }
    ]
  },
  "aggregates": [
    {
      "dimension": {
        "uniqueValues": {
          "input": {
            "builtInProperty": "$esn"
          },
          "take": 100
        }
      },
      "measures": [
        {
          "count": {}
        }
      ]
    }
  ]
}
The above query, however, does not return any records, although there are many events stored in TSI for that specific searchSpan. Here is the response:
{
  "warnings": [],
  "events": []
}
The query is based on the examples in the documentation, which can be found here, and which is actually lacking crucial information about requirements; even some of the examples do not work...
Any help would be appreciated. Thanks!
@Vladislav,
I'm sorry to hear you're having issues. In reviewing your API call, I see two fixes that should help remedy this issue:
1) It looks like you're using our /events API with a payload meant for the /aggregates API (notice the "events" in the response). Additionally, "top" will be redundant for the /aggregates API, as we don't support a top-level limit clause there.
2) We do not enforce the "count" property to be present in the limit clause ("take", "top" or "sample"), and it looks like you did not specify it, so by default the value was set to 0; that's why the call is returning 0 events.
I would recommend that you use the /aggregates API rather than /events, and that "count" is specified in the limit clause to ensure you get some data back; see the sketch below.
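For illustration, the request body for the /aggregates endpoint might then simply be the original payload with the "top" clause removed. Treat this as a rough sketch based on the advice above rather than a verified example; the exact shape accepted depends on the TSI API version:
{
  "searchSpan": {
    "from": "2018-08-25T00:00:00Z",
    "to": "2019-01-01T00:00:00Z"
  },
  "aggregates": [
    {
      "dimension": {
        "uniqueValues": {
          "input": { "builtInProperty": "$esn" },
          "take": 100
        }
      },
      "measures": [
        { "count": {} }
      ]
    }
  ]
}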
Additionally, I'll note your feedback on documentation. We are ramping up a new hire on documentation now, so we hope to improve the quality soon.
I hope this helps!
Andrew

Aggregations Size makes different results

I have a simple aggregation like
"aggs": {
"firm_aggregation": {
"terms": {
"field": "experience.company_name.slug",
"size": 10
}
}
}
and this gives me a result like
"aggregations": {
"firm_aggregation": {
"buckets": [
... (some others)
{
"key": "freelancer",
"doc_count": 33
},
but when I increase the aggregation size to 2000 I get
"aggregations": {
"firm_aggregation": {
"buckets": [
... (some others)
{
"key": "freelancer",
"doc_count": 35
},
Why is this happening? I thought that size would only increase the number of buckets which Elasticsearch returns.
This is owing to the estimation done at shard level.
For a result of size 5, only the top 5 terms are taken from each shard and these are combined to get the final result. This need not be very accurate.
There is a very good explanation about this here.
Along with size, you can pass the shard_size parameter, which controls this behavior without affecting the data that is returned.
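For illustration, a sketch of the same terms aggregation with an explicit shard_size (the value 500 is just an example; higher values trade memory and latency on each shard for better accuracy):
"aggs": {
  "firm_aggregation": {
    "terms": {
      "field": "experience.company_name.slug",
      "size": 10,
      "shard_size": 500
    }
  }
}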
