Find documents in MongoDB with non-typical limit

I have a problem and no idea how to resolve it.
I've got a PointValues collection in MongoDB.
The PointValue schema has 3 fields:
dataPoint (ref to DataPoint schema)
value (Number)
time (Date)
There is one PointValue for every hour (24 per day).
I have an API method that gets PointValues for a specified DataPoint and time range. The problem is that I need to limit the result to at most 1000 points. A plain limit(1000) isn't good enough, because I need points spread across the whole specified time range, with a time step that depends on the length of the range and the number of point values in it.
So, for example:
A request for 1 year of data covers 1 * 365 * 24 = 8760 values.
It should return 1000 values, i.e. approximately 1 value per (24 / (1000 / 365)) = ~9 hours.
I don't know which method I should use to filter the data like that in MongoDB.
Thanks for help.

Sampling exactly like that in the database would be quite hard to do and likely not very performant. An option that gives a similar result is an aggregation pipeline that uses $group to keep the $first value it finds for each combination of $year, $dayOfYear, and $hour (plus $minute and $second if you need smaller intervals). That way you can sample values by time step, but your choice of step length is limited to what there are date operators for. So "hourly" samples are easy, but "9-hourly" samples get complicated. If this query is performance-critical and frequent, consider maintaining additional collections with daily, hourly, minutely etc. data points so you don't need to run that aggregation on every request.
But your documents are quite lightweight, because the actual payload is in a different collection. So you could fetch all results in the requested time range and do the skipping in the application layer. You can also combine this with the aggregation described above to pre-reduce the dataset: first use an aggregation pipeline to get hourly results into the application, then step through the result set 9 documents at a time. Whether this makes sense depends on how many documents you expect.
Also remember to create an index on the time field.
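The application-layer skipping could look roughly like the sketch below. It assumes the query result is already loaded into an array sorted by time; the helper name downsample and the sample data are made up for illustration. It uses a fractional step so that the result always has exactly the requested number of points when the input is larger.

```javascript
// Hypothetical helper: keep at most maxPoints items, evenly spaced over the array.
function downsample(points, maxPoints) {
  if (maxPoints <= 0) return [];
  if (points.length <= maxPoints) return points.slice();
  // Fractional step, e.g. 8760 / 1000 = 8.76 for one year of hourly values.
  const step = points.length / maxPoints;
  const sampled = [];
  for (let i = 0; i < maxPoints; i++) {
    sampled.push(points[Math.floor(i * step)]);
  }
  return sampled;
}

// Illustrative data: one year of hourly values, already sorted by time.
const yearOfHourlyValues = Array.from({ length: 8760 }, (_, hour) => ({
  hour,
  value: hour % 24,
}));
const sampledYear = downsample(yearOfHourlyValues, 1000);
console.log(sampledYear.length);
```

With a fractional step the gaps alternate between 8 and 9 hours, which averages out to the ~9 hour spacing from the question.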

Related

How to aggregate data by period in a rrdtool graph

I have an rrd file with average ping times to a server (GAUGE) every minute, and when the server is offline (which is very frequent, for reasons that don't matter now) it stores a NaN/unknown.
I'd like to create a graph with the percentage the server is offline each hour which I think can be achieved by counting every NaN within 60 samples and then dividing by 60.
For now I get to the point where I define a variable that is 1 when the server is offline and 0 otherwise, but I already read the docs and don't know how to aggregate this:
DEF:avg=server.rrd:rtt:AVERAGE CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph? Or will I have to store that info in another rrd?

I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding window average that shows the percentage of the previous hour that was unknown, and graph that, using TREND.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcoffline=offline,3600,TREND
LINE:pcoffline#ff0000:Offline
This defines avg as the 1-min time series of ping data. Note we use step=60 to ensure we get the best resolution of data even in a smaller graph. Then we define offline as 100 when the sample is unknown (the server is offline) and 0 when it is known. Then pcoffline is a 1-hour sliding window average of this, which is in effect the percentage of time during the previous hour in which the server was offline.
However, there's a problem: RRDTool will silently summarise the source data before you get your hands on it if there are many data points per pixel in the graph (this won't happen if doing a fetch, of course). To get around that, you'd need to have the offline calculation done at store time, i.e. have a COMPUTE type DS that is 100 or 0 depending on whether the rtt DS is unknown. Then any averaging will preserve the data (normal averaging omits the unknowns, or the xff setting makes the whole CDP unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Offline
If you are able to modify your RRD, and do not need historical data, then use of a COMPUTE in this way will allow you to display your data in a 1-hour stepped graph as you wanted.
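For comparison, the arithmetic from the question (count the unknowns within 60 samples and divide by 60) can be sketched outside RRDTool. This is only an illustration of what the hourly average of the 100/0 series works out to; it assumes the per-minute samples are available as an array with NaN for unknown values, and the function name is made up.

```javascript
// samples: one ping time per minute; NaN marks an unknown (offline) sample.
function offlinePercentPerHour(samples, samplesPerHour = 60) {
  const percentages = [];
  for (let start = 0; start < samples.length; start += samplesPerHour) {
    const hour = samples.slice(start, start + samplesPerHour);
    const unknown = hour.filter((v) => Number.isNaN(v)).length;
    // Same result as averaging a series that is 100 when unknown and 0 otherwise.
    percentages.push((unknown / hour.length) * 100);
  }
  return percentages;
}

// Illustrative data: first hour has 15 unknown minutes, second hour has none.
const firstHour = Array.from({ length: 60 }, (_, i) => (i < 15 ? NaN : 20));
const secondHour = Array.from({ length: 60 }, () => 25);
const offlinePercentages = offlinePercentPerHour(firstHour.concat(secondHour));
console.log(offlinePercentages);
```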

Prometheus recording rule to keep the max of (rate of counter)

I am facing a dilemma.
For performance reasons, I'm creating recording rules for my Nginx request/second metrics.
Original Query
sum(rate(nginx_http_requests_total[5m]))
Recording Rule
rules:
  - expr: sum(rate(nginx_http_requests_total[5m])) by (cache_status, host, env, status)
    record: job:nginx_http_requests_total:rate:sum:5m
With the original query I can see that my max traffic is 6.6k, but with the recording rule it's 6.2k. That's a 400 TPS difference.
This is the metric for the last week.
Question:
Is there any way to take the max of the original query and save it as a recording rule? As it's TPS, I only care about the max, not the min.

I think a 6% difference in the value during a very short burst is pretty OK.
Your query gets (and records) the average TPS over the last 5 minutes. There is no "max" being computed there.
The value changes depending on the exact time the query is evaluated, which is possibly why you see a difference between the raw query and the values stored by the recording rule.
Prometheus also extrapolates data to some extent when executing functions like rate(). If your last data point is at time t but the query runs at t+30s, Prometheus will try to extrapolate the value at t+30s (this is often noticed when a counter for discrete events shows fractional values).
You may want to use the irate() function if you are really after peak values. At each evaluation it uses the two most recent points to calculate the most recent increase, as opposed to the X-minute average that rate() provides.
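The difference between a window average and a two-point rate can be illustrated with a small sketch. This is a simplification, not how Prometheus implements rate() or irate(): it ignores counter resets and the extrapolation mentioned above, and the function names and sample values are invented for the example.

```javascript
// samples: [{ t: seconds, v: counterValue }, ...] sorted by time, all inside one window.

// Average per-second increase over the whole window (roughly the idea behind rate()).
function averageRate(samples) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.v - first.v) / (last.t - first.t);
}

// Per-second increase between the two most recent samples (roughly the idea behind irate()).
function instantRate(samples) {
  const prev = samples[samples.length - 2];
  const last = samples[samples.length - 1];
  return (last.v - prev.v) / (last.t - prev.t);
}

// Illustrative 5-minute window: steady 100 req/s, then a burst in the last minute.
const windowSamples = [
  { t: 0, v: 0 },
  { t: 60, v: 6000 },
  { t: 120, v: 12000 },
  { t: 180, v: 18000 },
  { t: 240, v: 24000 },
  { t: 300, v: 63000 },
];
console.log(averageRate(windowSamples), instantRate(windowSamples));
```

The burst in the last minute is flattened by the 5-minute average but shows up in full in the two-point rate, which is why averaged values under-report short peaks.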

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries (3 hours); however, that takes more time to pull from the server, and more time and memory to plot. When looking at larger trends the small movements are noise, so I would be better off retrieving the same number of documents (720) but only every 3rd one, still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient, and I would rather have the database handle this part.
Does anyone know a method to accomplish this?

Apparently, there is no built-in solution in MongoDB for your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in collections that each hold no more than a week or a month of data. A new month/week means storing your data in a different collection. That way you won't be doing a full collection scan and won't be collecting every single document, as you mentioned in your problem. Your application code decides which collection to query.
If I were in your shoes, I would use a different tool, as MongoDB is more of a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, as in your case.
Alternate fragmentation (update):
Always write your current data to the collection "week0", and in the background run a weekly scheduler that moves the data from "week0" to the history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.

I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
    {
        $bucketAuto: {
            groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
            buckets: 5 // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
        }
    }
])
Each document in the result of the above query would look something like this:
{
    "_id": {
        "max": 3,
        "min": 1
    },
    "count": 2
}
If you had grouped by temperature, then each document would have the minimum and maximum temperature found in that sample.

You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
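A minimal sketch of that idea, assuming each document stores its sample time as an integer field (called epochSeconds here, a made-up name) and that samples land exactly on the 5-second grid. The filter uses MongoDB's $mod query operator; the find call in the comment is only indicative and is not executed.

```javascript
// Builds a filter that matches every nth sample, given the sampling interval in seconds.
// Assumption: documents have an integer `epochSeconds` field aligned to the interval.
function everyNthFilter(n, intervalSeconds) {
  // $mod: [divisor, remainder] matches when epochSeconds % divisor === remainder.
  return { epochSeconds: { $mod: [n * intervalSeconds, 0] } };
}

// Every 3rd sample of a 5-second series: timestamps divisible by 15.
const everyThirdFilter = everyNthFilter(3, 5);
console.log(JSON.stringify(everyThirdFilter));

// Possible use with the Node.js driver (not run here):
// db.collection(collection).find(everyThirdFilter).sort({ epochSeconds: -1 }).limit(720)
```

If the stored timestamps drift off the grid, the modulo will skip or cluster samples, so this only works when writes are aligned to the interval.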

Batch processing/updating MongoDB documents in Node.js

I would like to process/update every document in a MongoDB collection periodically (every 5 mins or so) and save the results back to the DB. The update function requires actual code to execute on each document (as far as I know), because it needs to perform computations such as taking the difference between timestamps and taking exponents with Math.pow, which the standard MongoDB update operators do not cover.
What is the best way to do this in NodeJS?
Full context: I am trying to implement the Hacker News ranking algorithm, which is time-dependent. The discussion I've seen around this involves using a separate thread/process to periodically update the scores on documents.

Without a lot of back-and-forth investigation, it seems you have fields that I will call points and created_date (the time of initial creation), and then the Y Combinator result of (p - 1) / (t + 2)^1.5.
The easiest way is to write a very simple mongo shell script of a few lines.
db.ycombinator.find().forEach(function(doc) {
    var diffMs = new Date() - doc.created_date; // age in milliseconds; subtracting two Dates yields a number
    var hours = diffMs / (1000 * 60 * 60); // convert milliseconds to hours
    var result = (doc.points - 1) / Math.pow(hours + 2, 1.5); // perform yc algo
    db.ycombinator.update({"_id": doc._id}, {$set: {"result": result}}); // write back into the same collection, field "result"
});
That goes into a file ycombinator_update.js, which you then run from a 5-minute crontab.
*/5 * * * * mongo ycombinator_update.js
The performance of your reads will be noticeably slower during the write operations, depending on the number of records in that collection.

You could assign scores based on the document timestamp at lookup time and keep only the raw timestamps in the database. Since the score is a function of the timestamp anyway, the scoring algorithm can apply the time-decay logic to the unmodified data. Scores can be converted back to timestamps if you need to search by score.
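Scoring at lookup time might look like the sketch below. It assumes the documents have already been fetched into an array and carry points and created_date fields (names borrowed from the other answer, not confirmed by the question), and it uses the (p - 1) / (t + 2)^1.5 formula quoted there.

```javascript
// Score a single item from its points and creation time; `now` is injectable for testing.
function hnScore(points, createdDate, now = new Date()) {
  const hours = (now - createdDate) / (1000 * 60 * 60); // age in hours
  return (points - 1) / Math.pow(hours + 2, 1.5);
}

// Attach a score to each fetched document and sort by it, highest first.
function rankByScore(docs, now = new Date()) {
  return docs
    .map((doc) => ({ ...doc, score: hnScore(doc.points, doc.created_date, now) }))
    .sort((a, b) => b.score - a.score);
}

// Illustrative data only.
const lookupTime = new Date("2024-01-01T12:00:00Z");
const ranked = rankByScore(
  [
    { title: "older story", points: 11, created_date: new Date("2024-01-01T02:00:00Z") },
    { title: "fresh story", points: 101, created_date: new Date("2024-01-01T10:00:00Z") },
  ],
  lookupTime
);
console.log(ranked.map((doc) => doc.title));
```

Nothing is written back to the database, so there is no periodic update job, at the cost of computing scores for every candidate document on each request.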

Another option that isn't represented here is the MongoDB MapReduce or Aggregation frameworks.
Both these frameworks provide a way to iterate over all elements in a collection and output some results into a different collection. The aggregation API does not directly include the primitives we need to calculate the 1.5 exponent in the HN algorithm (no $sqrt or $pow), but there is a workaround.
I'm not certain at this point which approach is the most performant for this use case (and how it compares to the MongoDB shell script suggested by Gabe Rainbow).
I believe the next step is to run the update operations in a separate process, which is either scheduled with something like cron, or it could be kicked off via the node app itself using fork with the following logic:
On request for front page:
    # when did we last update the scores for the front page?
    if last_update was within last X minutes:
        return list sorted by score right away
    else:
        fork a process to sort the front page
        last_update := Date.Now
        return list sorted by score (either right away [stale], or after the update completes [takes a while])
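The staleness check in that logic could be sketched as follows. The refresh callback is a stand-in for whatever actually recomputes the scores (for example forking a child process); it and the function name are assumptions for illustration, and the clock is injectable so the behaviour can be checked without waiting.

```javascript
// Returns a function to call on each front-page request.
// It triggers `refresh` at most once per `maxAgeMs` and reports whether it did.
function createScoreRefresher(refresh, maxAgeMs, now = Date.now) {
  let lastUpdate = -Infinity; // never updated yet
  return function onFrontPageRequest() {
    if (now() - lastUpdate < maxAgeMs) {
      return false; // scores are fresh enough; serve the sorted list right away
    }
    lastUpdate = now();
    refresh(); // hypothetical: kick off the background update here
    return true;
  };
}

// Illustrative run with a fake clock (milliseconds).
let fakeTime = 0;
let refreshCount = 0;
const onRequest = createScoreRefresher(() => { refreshCount += 1; }, 5 * 60 * 1000, () => fakeTime);
const firstCall = onRequest();   // no update yet, so it refreshes
fakeTime = 60 * 1000;
const secondCall = onRequest();  // 1 minute later: still fresh
fakeTime = 5 * 60 * 1000;
const thirdCall = onRequest();   // 5 minutes later: refreshes again
console.log(firstCall, secondCall, thirdCall, refreshCount);
```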

Mongoose limiting query to 1000 results when I want more/all (migrating from 2.6.5 to 3.1.2)

I'm migrating my app from Mongoose 2.6.5 to 3.1.2, and I'm running into some unexpected behavior. Namely I notice that query results are automatically being limited to 1000 records, while pretty much everything else works the same. In my code (below) I set a value maxIvDataPoints that limits the number of data points returned (and ultimately sent to the client browser), and that value was set elsewhere to 1500. I use a count query to determine the total number of potential results, and then a subsequent mod to limit the actual query results using the count and the value of maxIvDataPoints to determine the value of the mod. I'm running node 0.8.4 and mongo 2.0.4, writing server-side code in coffeescript.
Prior to installing mongoose 3.1.x the code was working as I had wanted, returning just under 1500 data points each time. After installing 3.1.2 I'm getting exactly 1000 data points returned each time (assuming there are more than 1000 data points in the specified range). The results are truncated, so that data points 1001 to ~1500 are the ones no longer being returned.
It seems there may be some setting somewhere that governs this behavior, but I can't find anything in the docs, on here, or in the Google group. I'm still a relative n00b so I may have missed something obvious.
DataManager::ivDataQueryStream = (testId, minTime, maxTime, callback) ->
  # If minTime and maxTime have been provided, set a flag to limit time extents of query
  unless isNaN(minTime)
    timeLimits = true
  # Load the max number of IV data points to be displayed from CONFIG
  maxIvDataPoints = CONFIG.maxIvDataPoints
  # Construct a count query to determine the number of IV data points in range
  ivCountQuery = TestDataPoint.count({})
  ivCountQuery.where "testId", testId
  if timeLimits
    ivCountQuery.gt "testTime", minTime
    ivCountQuery.lt "testTime", maxTime
  ivCountQuery.exec (err, count) ->
    ivDisplayQuery = TestDataPoint.find({})
    ivDisplayQuery.where "testId", testId
    if timeLimits
      ivDisplayQuery.gt "testTime", minTime
      ivDisplayQuery.lt "testTime", maxTime
    # If the data set is too large, use modulo to sample, keeping the total data series
    # for display below maxIvDataPoints
    if count > maxIvDataPoints
      dataMod = Math.ceil count/maxIvDataPoints
      ivDisplayQuery.mod "dataPoint", dataMod, 1
    ivDisplayQuery.sort "dataPoint" #, 1 <-- new sort syntax for Mongoose 3.x
    callback ivDisplayQuery.stream()

You're getting tripped up by a pair of related factors:
Mongoose's default query batchSize changed to 1000 in 3.1.2.
MongoDB has a known issue where a query that requires an in-memory sort puts a hard limit of the query's batch size on the number of documents returned.
So your options are to put a compound index on TestDataPoint that would allow MongoDB to use it for sorting by dataPoint in this type of query, or to increase the batch size to at least the total number of documents you're expecting.

Wow that's awful. I'll publish a fix to mongoose soon removing the batchSize default (was helpful when streaming large result sets). Thanks for the pointer.
UPDATE: 3.2.1 and 2.9.1 have been released with the fix (removed batchSize default).
