How to improve an UPDATE query in ArangoDB

I have a collection which holds more than 15 million documents. Out of those 15 million documents I update 20k records every hour, but the update query takes a long time to finish (around 30 minutes).
Document:
{ "inst" : "instance1", "dt": "2015-12-12T00:00:000Z", "count": 10}
I have an array which holds 20k instances to be updated.
My Query looks like this:
FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FOR i IN instArr
    FILTER i.inst == h.inst
    UPDATE h WITH { "inst": i.inst, "dt": i.dt, "count": i.count } IN hourly
Is there any optimized way of doing this? I have a hash index on inst and a skiplist index on dt.
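For reference, a minimal arangosh sketch of that index setup, assuming the collection is named hourly and an ArangoDB 2.x-style API:
// Hypothetical arangosh snippet reproducing the index setup described above.
var db = require("org/arangodb").db; // db is also available as a global in arangosh
db.hourly.ensureHashIndex("inst");
db.hourly.ensureSkiplist("dt");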
Update
I could not paste all 20k instances into the query manually, so the following is the execution plan for just two instances:
FOR r in hourly FILTER r.dt == DATE_ISO8601(1450116000000) FOR i IN
[{"inst":"0e649fa22bcc5200d7c40f3505da153b", "dt":"2015-12-14T18:00:00.000Z"}, {}] FILTER i.inst ==
r.inst UPDATE r with {"inst":i.inst, "dt": i.dt, "max":i.max, "min":i.min, "sum":i.sum, "avg":i.avg,
"samples":i.samples} in hourly OPTIONS { ignoreErrors: true } RETURN NEW.inst
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
5 CalculationNode 1 - LET #6 = [ { "inst" : "0e649fa22bcc5200d7c40f3505da153b", "dt" : "2015-12-14T18:00:00.000Z" }, { } ] /* json expression */ /* const assignment */
13 IndexRangeNode 103067 - FOR r IN hourly /* skiplist index scan */
6 EnumerateListNode 206134 - FOR i IN #6 /* list iteration */
7 CalculationNode 206134 - LET #8 = i.`inst` == r.`inst` /* simple expression */ /* collections used: r : hourly */
8 FilterNode 206134 - FILTER #8
9 CalculationNode 206134 - LET #10 = { "inst" : i.`inst`, "dt" : i.`dt`, "max" : i.`max`, "min" : i.`min`, "sum" : i.`sum`, "avg" : i.`avg`, "samples" : i.`samples` } /* simple expression */
10 UpdateNode 206134 - UPDATE r WITH #10 IN hourly
11 CalculationNode 206134 - LET #12 = $NEW.`inst` /* attribute expression */
12 ReturnNode 206134 - RETURN #12
Indexes used:
Id Type Collection Unique Sparse Selectivity Est. Fields Ranges
13 skiplist hourly false false n/a `dt` [ `dt` == "2015-12-14T18:00:00.000Z" ]
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 remove-data-modification-out-variables
6 use-index-range
7 remove-filter-covered-by-index
Write query options:
Option Value
ignoreErrors true
waitForSync false
nullMeansRemove false
mergeObjects true
ignoreDocumentNotFound false
readCompleteInput true

I assume the selection part (not the update part) will be the bottleneck in this query.
The query seems problematic because for each document matching the first filter (h.dt == DATE_ISO8601(...)), there will be an iteration over the 20,000 values in the instArr array. If instArr values are unique, then only one value from it will match. Additionally, no index will be used for the inner loop, as the index selection has happened in the outer loop already.
Instead of looping over all values in instArr, it will be better to turn the accompanying == comparison into an IN comparison. That would already work if instArr were an array of instance names, but it seems to be an array of instance objects (consisting of at least the attributes inst and count). In order to use the instance names in an IN comparison, it is better to have a dedicated array of instance names, plus a translation table for the count and dt values.
Following is an example for generating these with JavaScript:
var instArr = [ ], trans = { };
for (var i = 0; i < 20000; ++i) {
  var instance = "instance" + i;
  var count = Math.floor(Math.random() * 10);
  var dt = (new Date(Date.now() - Math.floor(Math.random() * 10000))).toISOString();
  instArr.push(instance);
  trans[instance] = [ count, dt ];
}
instArr would then look like this:
[ "instance0", "instance1", "instance2", ... ]
and trans:
{
  "instance0" : [ 4, "2015-12-16T21:24:45.106Z" ],
  "instance1" : [ 0, "2015-12-16T21:24:39.881Z" ],
  "instance2" : [ 2, "2015-12-16T21:25:47.915Z" ],
  ...
}
This data can then be injected into the query using bind variables (named like the variables above):
FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FILTER h.inst IN @instArr
  RETURN @trans[h.inst]
Note that ArangoDB 2.5 does not yet support the @trans[h.inst] syntax. In that version, you will need to write:
LET trans = @trans
FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FILTER h.inst IN @instArr
  RETURN trans[h.inst]
Additionally, 2.5 has a problem with longer IN lists. IN-list performance decreases quadratically with the length of the IN list. So in this version, it will make sense to limit the length of instArr to at most 2,000 values. That may require issuing multiple queries with smaller IN lists instead of just one with a big IN list.
The better alternative would be to use ArangoDB 2.6, 2.7 or 2.8, which do not have that problem, and thus do not require the workaround. Apart from that, you can get away with the slightly shorter version of the query in the newer ArangoDB versions.
Also note that in all of the above examples I used a RETURN ... instead of the UPDATE statement from the original query. This is because all my tests revealed that the selection part of the query is the major problem, at least with the data I had generated.
A final note on the original version of the UPDATE: updating each document's inst value with i.inst seems redundant, because i.inst == h.inst, so the value won't change.
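For completeness, here is a minimal arangosh sketch of how the chunked update could be run with bind variables, building on the instArr and trans structures above. The batch size of 2,000, the UPDATE form and the attribute mapping (count, dt) are assumptions based on the question's document structure, not tested code from this answer:
// Hypothetical arangosh snippet: run the update in batches of at most 2,000 instances,
// passing the instance names and the translation table as bind variables.
var db = require("org/arangodb").db;
var batchSize = 2000;
var query = "LET trans = @trans " +
            "FOR h IN hourly " +
            "  FILTER h.dt == DATE_ISO8601(1450116000000) " +
            "  FILTER h.inst IN @instArr " +
            "  UPDATE h WITH { count: trans[h.inst][0], dt: trans[h.inst][1] } IN hourly";
for (var start = 0; start < instArr.length; start += batchSize) {
  // Each batch re-sends the full trans table; a per-batch subset would also work.
  db._query(query, { instArr: instArr.slice(start, start + batchSize), trans: trans });
}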

Related

How to force a waiting time of a few seconds between execution of commands in Pine? Need a cooldown between closing and opening a position

A simple strategy script sends alerts to open and exit trades; it needs to switch between long and short when conditions are met.
Problem: Two alerts (e.g. exit short / enter long) are generated one after the other. Entering long fails, as the previous short deal didn't have time to close.
Question: How can I delay script execution by 5-10 seconds?
I have tried Utilities.sleep(10000), but it does not compile.
I am a complete beginner looking for a simple answer. Hope there is one :]
Here is the code:
'''
strategy("My Strategy", overlay=true, default_qty_type=strategy.percent_of_equity, default_qty_value=15)
////////////
// Inputs //
length = input(100)
mult = input(2.0)
message_long_entry = input("long entry message")
message_long_exit = input("long exit message")
message_short_entry = input("short entry message")
message_short_exit = input("short exit message")
atrPeriod = input(10, "ATR Length")
factor = input.float(3.0, "Factor", step = 0.01)
[_, direction] = ta.supertrend(factor, atrPeriod)
if ta.change(direction) < 0
    strategy.entry("My Long Entry Id", strategy.long, when = barstate.isconfirmed)
    alert(message_short_exit)
    /// Utilities.sleep(10000) <--- Delay needed here
    alert(message_long_entry)
if ta.change(direction) > 0
    strategy.entry("My Short Entry Id", strategy.short, when = barstate.isconfirmed)
    alert(message_long_exit)
    /// Utilities.sleep(10000) <--- Delay needed here
    alert(message_short_entry)
'''
You can use this example from the PineCoders FAQ:
//@version=5
strategy('Strat with time delay', overlay=true)

i_qtyTimeUnits = -input.int(20, 'Quantity', inline='Delay', minval=0, tooltip='Use 0 for no delay')
i_timeUnits = input.string('minutes', '', inline='Delay', options=['seconds', 'minutes', 'hours', 'days', 'months', 'years'])

// ————— Converts current chart timeframe into a float minutes value.
f_tfInMinutes() =>
    _tfInMinutes = timeframe.multiplier * (timeframe.isseconds ? 1. / 60 : timeframe.isminutes ? 1. : timeframe.isdaily ? 60. * 24 : timeframe.isweekly ? 60. * 24 * 7 : timeframe.ismonthly ? 60. * 24 * 30.4375 : na)
    _tfInMinutes

// ————— Calculates a +/- time offset in variable units from the current bar's time or from the current time.
// WARNING:
//   This function does not solve the challenge of taking into account irregular gaps between bars when calculating time offsets.
//   Optimal behavior occurs when there are no missing bars at the chart resolution between the current bar and the calculated time for the offset.
//   Holidays, no-trade periods or other irregularities causing missing bars will produce unpredictable results.
f_timeFrom(_from, _qty, _units) =>
    // _from  : starting time from where the offset is calculated: "bar" to start from the bar's starting time, "close" to start from the bar's closing time, "now" to start from the current time.
    // _qty   : the +/- qty of _units of offset required. A "series float" can be used but it will be cast to a "series int".
    // _units : string containing one of the seven allowed time units: "chart" (chart's resolution), "seconds", "minutes", "hours", "days", "months", "years".
    // Dependency: f_tfInMinutes().
    int _timeFrom = na
    // Remove any "s" letter in the _units argument, so we don't need to compare singular and plural unit names.
    _unit = str.replace_all(_units, 's', '')
    // Determine if we will calculate offset from the bar's time or from current time.
    _t = _from == 'bar' ? time : _from == 'close' ? time_close : timenow
    // Calculate time at offset.
    if _units == 'chart'
        // Offset in chart res multiples.
        _timeFrom := int(_t + f_tfInMinutes() * 60 * 1000 * _qty)
        _timeFrom
    else
        // Add the required _qty of time _units to the _from starting time.
        _year = year(_t) + (_unit == 'year' ? int(_qty) : 0)
        _month = month(_t) + (_unit == 'month' ? int(_qty) : 0)
        _day = dayofmonth(_t) + (_unit == 'day' ? int(_qty) : 0)
        _hour = hour(_t) + (_unit == 'hour' ? int(_qty) : 0)
        _minute = minute(_t) + (_unit == 'minute' ? int(_qty) : 0)
        _second = second(_t) + (_unit == 'econd' ? int(_qty) : 0) // "seconds" becomes "econd" once all "s" letters are removed.
        // Return the resulting time in ms Unix time format.
        _timeFrom := timestamp(_year, _month, _day, _hour, _minute, _second)
        _timeFrom

// Entry conditions.
ma = ta.sma(close, 100)
goLong = close > ma
goShort = close < ma

// Time delay filter.
var float lastTradeTime = na
if nz(ta.change(strategy.position_size), time) != 0
    // An order has been executed; save the bar's time.
    lastTradeTime := time
    lastTradeTime

// If user has chosen to do so, wait `i_qtyTimeUnits` `i_timeUnits` between orders.
delayElapsed = f_timeFrom('bar', i_qtyTimeUnits, i_timeUnits) >= lastTradeTime

if goLong and delayElapsed
    strategy.entry('Long', strategy.long, comment='Long')
if goShort and delayElapsed
    strategy.entry('Short', strategy.short, comment='Short')

plot(ma, 'MA', goLong ? color.lime : color.red)
plotchar(delayElapsed, 'delayElapsed', '•', location.top, size=size.tiny)
What worked for me was just adding a while loop in my webhook-side code that checks in real time whether there is an active trade. I'm using Binance, so through Postman I accessed GET /fapi/v2/positionRisk, which shows your current position information like this:
[
{
"symbol": "BTCUSDT",
"positionAmt": "0.000",
"entryPrice": "0.0",
"markPrice": "22615.15917559",
"unRealizedProfit": "0.00000000",
"liquidationPrice": "0",
"leverage": "10",
"maxNotionalValue": "20000000",
"marginType": "isolated",
"isolatedMargin": "0.00000000",
"isAutoAddMargin": "false",
"positionSide": "BOTH",
"notional": "0",
"isolatedWallet": "0",
"updateTime": 165963
} ]
So, to access your current position amount:
check = float(client.futures_position_information(symbol="BTCUSDT")[0]["positionAmt"])
Note that if you have hedge mode activated, you would have to check both your long-side and short-side positions, switching between [0] and [1] depending on which side you need.
Now you can just add the loop before any of your entries:
while check != 0:
    # keep polling the current position size until all positions are closed (it becomes 0)
    check = float(client.futures_position_information(symbol="BTCUSDT")[0]["positionAmt"])

# this is your entry, placed after the loop breaks
order = client.futures_create_order(symbol=symbol, side=side, type=order_type, quantity=quantity)
This will delay any of your entries: the loop keeps updating and checking your current position amount until all of your positions are closed (making it 0), at which point it breaks and the code continues, starting a new position in this case.

ArangoDB sharding cluster performance issue

I have a query that runs well in single-instance setup. However, when I tried to run it on a sharded cluster, the performance dropped (4x longer execution time).
The query plan shows that practically all processing is done on the Coordinator node, not on the DBServers.
How can I push the query down so it executes on the DBServers?
To give a bit of context: I have a collection of ~120k (it will grow to several million) multi-level JSON documents with nested arrays, and the query needs to unnest these arrays before getting to the proper node.
AQL Query:
for doc IN doccollection
    for arrayLevel1Elem in doc.report.container.children.container
        for arrayLevel2Elem in arrayLevel1Elem.children.container.children.num
            for arrayLevel3Elem in arrayLevel2Elem.children.code
                filter doc.report.container.concept.simpleCodedValue == 'A'
                filter arrayLevel1Elem.concept.codedValue == "B"
                filter arrayLevel2Elem.concept.simpleCodedValue == "C"
                filter arrayLevel3Elem.concept.simpleCodedValue == 'X'
                filter arrayLevel3Elem.value.simpleCodedValue == 'Y'
                collect studyUid = doc.report.study.uid, personId = doc.report.person.id, metricName = arrayLevel2Elem.concept.meaning, value = to_number(arrayLevel2Elem.value)
                return {studyUid, personId, metricName, value}
Query Plan:
Id NodeType Site Est. Comment
1 SingletonNode DBS 1 * ROOT
2 EnumerateCollectionNode DBS 121027 - FOR doc IN doccollection /* full collection scan, projections: `report`, 2 shard(s) */ FILTER (doc.`report`.`container`.`concept`.`simpleCodedValue` == "A") /* early pruning */
3 CalculationNode DBS 121027 - LET #8 = doc.`report`.`container`.`children`.`container` /* attribute expression */ /* collections used: doc : doccollection */
19 CalculationNode DBS 121027 - LET #24 = doc.`report`.`study`.`uid` /* attribute expression */ /* collections used: doc : doccollection */
20 CalculationNode DBS 121027 - LET #26 = doc.`report`.`person`.`id` /* attribute expression */ /* collections used: doc : doccollection */
29 RemoteNode COOR 121027 - REMOTE
30 GatherNode COOR 121027 - GATHER /* parallel, unsorted */
4 EnumerateListNode COOR 12102700 - FOR arrayLevel1Elem IN #8 /* list iteration */
11 CalculationNode COOR 12102700 - LET #16 = (arrayLevel1Elem.`concept`.`codedValue` == "B") /* simple expression */
12 FilterNode COOR 12102700 - FILTER #16
5 CalculationNode COOR 12102700 - LET #10 = arrayLevel1Elem.`children`.`container`.`children`.`num` /* attribute expression */
6 EnumerateListNode COOR 1210270000 - FOR arrayLevel2Elem IN #10 /* list iteration */
13 CalculationNode COOR 1210270000 - LET #18 = (arrayLevel2Elem.`concept`.`simpleCodedValue` == "C") /* simple expression */
14 FilterNode COOR 1210270000 - FILTER #18
7 CalculationNode COOR 1210270000 - LET #12 = arrayLevel2Elem.`children`.`code` /* attribute expression */
21 CalculationNode COOR 1210270000 - LET #28 = arrayLevel2Elem.`concept`.`meaning` /* attribute expression */
22 CalculationNode COOR 1210270000 - LET #30 = TO_NUMBER(arrayLevel2Elem.`value`) /* simple expression */
8 EnumerateListNode COOR 121027000000 - FOR arrayLevel3Elem IN #12 /* list iteration */
15 CalculationNode COOR 121027000000 - LET #20 = ((arrayLevel3Elem.`concept`.`simpleCodedValue` == "X") && (arrayLevel3Elem.`value`.`simpleCodedValue` == "Y")) /* simple expression */
16 FilterNode COOR 121027000000 - FILTER #20
23 CollectNode COOR 96821600000 - COLLECT studyUid = #24, personId = #26, metricName = #28, value = #30 /* hash */
26 SortNode COOR 96821600000 - SORT studyUid ASC, personId ASC, metricName ASC, value ASC /* sorting strategy: standard */
24 CalculationNode COOR 96821600000 - LET #32 = { "studyUid" : studyUid, "personId" : personId, "metricName" : metricName, "value" : value } /* simple expression */
25 ReturnNode COOR 96821600000 - RETURN #32
Thanks a lot for any hint.
Queries are not actually executed at the DB server - the coordinators handle query compilation and execution, only really asking the DB server(s) for data.
This means memory load for query execution happens on the coordinators (good!) but that the coordinator has to transport (sometimes LARGE amounts of) data across the network. This is THE BIGGEST downside to moving to a cluster - and not one that is easily solved.
I walked this same road in the beginning and found ways to optimize some of my queries, but in the end, it was easier to go with a "one-shard" cluster or an "active-failover" setup.
It's tricky to make architecture suggestions because each use case can be so different, but there are some general AQL guidelines I follow:
Grouping all the FOR statements together, away from their FILTER statements, is not recommended (see #2 below). Try this version to see if it runs any faster (and try indexing report.container.concept.simpleCodedValue; a sketch for creating that index follows the query):
FOR doc IN doccollection
    FILTER doc.report.container.concept.simpleCodedValue == 'A'
    FOR arrayLevel1Elem IN doc.report.container.children.container
        FILTER arrayLevel1Elem.concept.codedValue == 'B'
        FOR arrayLevel2Elem IN arrayLevel1Elem.children.container.children.num
            FILTER arrayLevel2Elem.concept.simpleCodedValue == 'C'
            FOR arrayLevel3Elem IN arrayLevel2Elem.children.code
                FILTER arrayLevel3Elem.concept.simpleCodedValue == 'X'
                FILTER arrayLevel3Elem.value.simpleCodedValue == 'Y'
                COLLECT
                    studyUid = doc.report.study.uid,
                    personId = doc.report.person.id,
                    metricName = arrayLevel2Elem.concept.meaning,
                    value = to_number(arrayLevel2Elem.value)
                RETURN { studyUid, personId, metricName, value }
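As a side note, the suggested index could be created from arangosh roughly as follows; the index type here is an assumption, so pick whichever type fits your ArangoDB version and access pattern:
// Hypothetical arangosh snippet: index the attribute used in the first FILTER so the
// DB servers can prune documents without scanning every document.
db.doccollection.ensureIndex({
  type: "persistent",
  fields: [ "report.container.concept.simpleCodedValue" ]
});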
The FOR doc IN doccollection pattern will recall the ENTIRE document from the DB server for each item in doccollection. Best practice is to either limit the number of documents you are retrieving (best done with an index-backed search) and/or return only a few attributes. Don't be afraid of using LET - in-memory on the coordinator can be faster than in-memory on the DB. This example does both - filters and returns a smaller set of data:
LET filteredDocs = (
    FOR doc IN doccollection
        FILTER doc.report.container.concept.simpleCodedValue == 'A'
        RETURN {
            study_id: doc.report.study.uid,
            person_id: doc.report.person.id,
            arrayLevel1: doc.report.container.children.container
        }
)
FOR doc IN filteredDocs
    FOR arrayLevel1Elem IN doc.arrayLevel1
        FILTER arrayLevel1Elem.concept.codedValue == 'B'
        ...

Get real-time distance between running cars on Azure Stream Analytics

We will get streaming data of many cars on a particular Stream Analytics. Each row will have vehicleId, latitude and longitude of vehicle. I need to raise an alarm whenever distance between ANY two cars is less than suppose x meters.
Right now we can consider radial distance to keep it simple. Hence, we need to calculate the distance of cars NOT from a fixed point but from other cars (nearby cars can keep changing over time), so we cannot hard-code a vehicle ID in the query.
We do have geospatial function support: https://learn.microsoft.com/en-us/stream-analytics-query/geospatial-functions
I am not sure if this can even be done by Stream Analytics query directly.
I created a small example of a potential solution; it is perhaps not a perfect one, but it solves the problem within the ASA job.
Essentially I have reused a JavaScript function that expects plain latitude and longitude values and returns the distance in meters. You could potentially use the built-in geospatial functions instead; I haven't tried to play with those.
So, the idea is to self-join the input for all input messages (unfortunately, yes, you get duplicated results, but it works), then apply the distance function and filter to the output only those pairs whose distance is less than a threshold value. The following example propagates a row to the output only if the distance is not zero (which means a car was compared with itself) and is less than 5 meters:
with inputData as (select * from input i1 inner join input i2 ON DATEDIFF(second,i1,i2) BETWEEN 0 AND 5),
distances as (select *, udf.getGeoDistance(i1.lat,i1.long,i2.lat,i2.long) as distance from inputData)
select *
into output
from distances
where distance <> 0 and distance < 5
UDF function:
// UDF that returns the distance in meters between two latitude/longitude points (haversine formula).
function getDistanceFromLatLonInKm(lat1, lon1, lat2, lon2) {
    'use strict';
    var R = 6371; // Radius of the earth in km
    var dLat = deg2rad(lat2 - lat1);
    var dLon = deg2rad(lon2 - lon1);
    var a =
        Math.sin(dLat / 2) * Math.sin(dLat / 2) +
        Math.cos(deg2rad(lat1)) * Math.cos(deg2rad(lat2)) *
        Math.sin(dLon / 2) * Math.sin(dLon / 2);
    var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    var d = R * c * 1000; // Distance in m
    return d;
}

function deg2rad(deg) {
    return deg * (Math.PI / 180);
}
Input:
[
{
"name" : "car1",
"lat" : 59.3293371,
"long" : 13.4877472
},
{
"name" : "car2",
"lat" : 59.293371,
"long" : 13.2619422
},
{
"name" : "car3",
"lat" : 59.3293371,
"long" : 13.4877040
}
]
And in the result, car1 and car3 are identified as being close to each other.
Following your latest comment: you use a tumbling window with a 5-second time unit to get a slice of data. To my knowledge, you still cannot calculate the distances between cars directly with SQL and the geospatial functions, let alone raise a warning.
I came up with the idea that you could use an Azure Function as the output of the ASA job. Collect the slice of data and send it to the Azure Function as a JSON parameter. Inside the function, you could write code to calculate the distances between the cars, and even send alert warnings to other destinations.
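A minimal sketch of what such a function might look like, assuming the Node.js Azure Functions HTTP model and the { name, lat, long } record shape from the sample input above (the 5-meter threshold mirrors the query example):
// Hypothetical HTTP-triggered Azure Function (Node.js model): receives a batch of
// car positions posted by the ASA output and flags pairs closer than 5 meters.
module.exports = async function (context, req) {
    const cars = req.body || []; // expected shape: [{ name, lat, long }, ...] as in the sample input
    const alerts = [];
    for (let i = 0; i < cars.length; i++) {
        for (let j = i + 1; j < cars.length; j++) {
            const d = distanceInMeters(cars[i].lat, cars[i].long, cars[j].lat, cars[j].long);
            if (d < 5) {
                alerts.push({ a: cars[i].name, b: cars[j].name, distanceMeters: d });
            }
        }
    }
    // At this point the alerts could be forwarded to another destination (queue, mail, ...).
    context.res = { status: 200, body: alerts };
};

// Same haversine calculation as the UDF above, returning meters.
function distanceInMeters(lat1, lon1, lat2, lon2) {
    const R = 6371; // Earth radius in km
    const dLat = (lat2 - lat1) * Math.PI / 180;
    const dLon = (lon2 - lon1) * Math.PI / 180;
    const a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
              Math.cos(lat1 * Math.PI / 180) * Math.cos(lat2 * Math.PI / 180) *
              Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return R * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a)) * 1000;
}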

Difference in Performance when using vertices and edges VS Joins in ArangoDB

Below are a few details.
Query 1: using graph traversal (execution plan attached as well).
Here I am using an edge collection between CollectionA and CollectionB.
Query string:
FOR u IN CollectionA
  FILTER u.FilterA == @opId AND u.FilterB >= @startTimeInLong AND u.FilterB <= @endTimeInLong
  FOR v IN 1..1 OUTBOUND u CollectionALinksCollectionB
    FILTER v.FilterC == null
    RETURN v
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
9 IndexNode 45088 - FOR u IN CollectionA /* skiplist index scan */
5 TraversalNode 1 - FOR v /* vertex */ IN 1..1 /* min..maxPathDepth */ OUTBOUND u /* startnode */ CollectionALinksCollectionB
6 CalculationNode 1 - LET #6 = (v.`ReceivedRating` == null) /* simple expression */
7 FilterNode 1 - FILTER #6
8 ReturnNode 1 - RETURN v
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
9 skiplist CollectionA false false 100.00 % [ `FilterA`, `FilterB` ] ((u.`FilterA` == "8277") && (u.`FilterB` >= 1526947200000) && (u.`FilterB` <= 1541030400000))
5 edge CollectionALinksCollectionB false false 100.00 % [ `_from` ] base OUTBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter conditions
5 1..1 CollectionALinksCollectionB uniqueVertices: none, uniqueEdges: path
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
3 remove-unnecessary-calculations-2
Query 2:
Query string:
FOR u IN CollectionA
  FILTER u.FilterA == @opId AND u.FilterB >= @startTimeInLong AND u.FilterB <= @endTimeInLong
  FOR v IN CollectionB
    FILTER v._key == u._key AND v.FilterC == null
    RETURN v
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 CalculationNode 1 - LET #6 = CollectionB /* all collection documents */ /* v8 expression */
11 IndexNode 45088 - FOR u IN CollectionA /* skiplist index scan */
10 IndexNode 45088 - FOR v IN CollectionB /* primary index scan, scan only */
12 CalculationNode 45088 - LET #4 = (CollectionB /* all collection documents */.`FilterC` == null) /* v8 expression */
7 FilterNode 45088 - FILTER #4
9 ReturnNode 45088 - RETURN #6
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
11 skiplist CollectionA false false 100.00 % [ `FilterA`, `FilterB` ] ((u.`FilterA` == "8277") && (u.`FilterB` >= 1526947200000) && (u.`FilterB` <= 1541030400000))
10 primary CollectionB true false 100.00 % [ `_key` ] (CollectionB.`_key` == u.`_key`)
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 use-indexes
3 remove-filter-covered-by-index
4 remove-unnecessary-calculations-2
How does Query 1 perform better than Query 2? The query results are almost identical for a smaller dataset, but Query 1 performs better with larger data.
Can someone explain to me in detail how graph traversal helps here?

MongoDB Filter data points within an interval

I have a database query that selects all documents having a timestamp field (tmp) falling in a certain range, like so:
{ tmp: { '$gte': 1411929000000, '$lte': 1419010200000 } }
This query returns a large number of records, say 10000.
Objective:
To fetch documents in the same range, but separated by, say, a 1-hour timestamp interval, and hence reduce the number of records fetched.
Is there a way of doing this entirely using MongoDB query system?
Due to an NDA I cannot show the code, but it basically contains stock exchange data (say at a 1-minute interval). The objective is to send a sample of this data between two time endpoints. The thing is, the client can ask for 5-minute, 10-minute, or 1-hour interval data, so from this 1-minute interval data I need to sample and send only the relevant points. Hope that makes it clearer.
Any comments would be very helpful. Thanks.
There's no way to accomplish your objective directly, but you can do something very close. Given a range of time [s, t] and a separation p, you're looking for approximately (t - s) / p documents evenly spread over the range, to give a "zoomed-out" sense of the data. Pick x, ideally small compared to p, large enough to contain documents but small enough not to contain very many, and look for documents within an interval of width x around evenly spaced points separated by p. You can do this with a single $or query or with a series of queries. For example, simplifying using integers instead of dates, if I have a field score with values in the range [0, 50] and want a resolution of p = 10, I'll look at intervals of width x = 1 around points separated by 10:
db.test.find({ "$or" : [
{ "score" : { "$gte" : 0, "$lte" : 1 } },
{ "score" : { "$gte" : 9, "$lte" : 11 } },
{ "score" : { "$gte" : 19, "$lte" : 21 } },
{ "score" : { "$gte" : 29, "$lte" : 31 } },
{ "score" : { "$gte" : 39, "$lte" : 41 } },
{ "score" : { "$gte" : 49, "$lte" : 50 } },
] })
Alternatively, you could break this into 6 separate queries ((t - s) / p + 1 of them) and limit each one to 1 result.
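A small mongo-shell sketch of the same idea, generating the $or clauses programmatically for the timestamp range from the question (the collection name, the 1-hour spacing p and the 1-minute window x are placeholders to adjust):
// Build one small interval around each evenly spaced point between s and t.
var s = 1411929000000, t = 1419010200000; // range endpoints from the question
var p = 60 * 60 * 1000;                   // spacing between sample points: 1 hour
var x = 60 * 1000;                        // half-width of each interval: 1 minute
var clauses = [];
for (var ts = s; ts <= t; ts += p) {
  clauses.push({ tmp: { "$gte": ts - x, "$lte": ts + x } });
}
// One query with all intervals; note the clause count grows with (t - s) / p.
db.collection.find({ "$or": clauses });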
There are a couple of other higher-level ways to approach your problem. I'd suggest looking at the following two schema design articles from the MongoDB Manual:
Pre-Aggregated Reports
Hierarchical Aggregation
