Order By and Group By in Google Datastore Node Query JS - node.js

I am trying to write a datastore query in NodeJS.
I want to order by timestamp but also only distinct unique ID's (no duplicates) and only retrieve the latest datastore item for each unique ID.
For example
USER_ID - TIMESTAMP
10 - 1000
10 - 500
5 - 10
5 - 1500
5 - 50
I want the query to result with
USER_ID - TIMESTAMP
10 - 1000
5 - 1500
What I've tried:
datastore.createQuery('example')
.groupBy('USER_ID')
.order('USER_ID')
.order('TIMESTAMP')
But it returns the data ordered by USER_ID, not TIMESTAMP
Here's a pastebin to help answer the question: https://pastebin.com/MQCibmiw

You'll need to do the sort by timestamp yourself. As previously mentioned, order by USER_ID takes priority, and it's needed because you are running a distinct on query for the grouping.

As Jim Morrison said, I don't think this is possible.
I filed a Feature Request on your behalf. Check if my understanding was correct and feel free to add as many details as you think is needed.
Meanwhile I think the only option you have is to order the TIMESTAMP "manually".

Related

How Mongodb queries works in node.js?

I have a very easy search query in Node.js Express.js MongoDB with Mongoose:
await Model.find({}).limit(10);
My question is how do the architects work? Is it first to get all Models Data and then limit to 10 or before getting all data will select 10 items from the database? I mean the steps:
Find all data from Model and return as List(Array) --> 2. Limit 10 first items and remove others from List(Array).
Find first 10 items and return as List(Array)
The difference in performance is high cause with first step if we got a million data it will return 1 mill items with a huge 10 20 sec and then limiting the 10 of it which we loose 10 20 seconds of time and when the user are more the server will be done but with the second way even with 100 mil items it will always take same time.
The limit function sets specifies the maximum number of elements a cursor will return. In the case of your example, the cursor will return the first 10 items matching the query only (option 2). You can find more information on how the cursor.limit() works via the links below:
https://docs.mongodb.com/manual/reference/method/cursor.limit/
http://mongodb.github.io/node-mongodb-native/3.5/api/Cursor.html#limit

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at slightly longer trend, say 3 hours, I could retrieve the last 2160 entries (3 hours) however that is more time to pull from the server, and more time and memory to plot. As when looking at the larger trends, the small movements are noise and I would be better off retrieving the same number of documents (720) but only every 3rd, still giving me 3 hours of results, with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data however this seems extremely inefficient and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparenlty, there is no inbuilt solution in mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection which will house no more than weekly or monthly data. A new month/week means storing your data in a different collection. That way you wont be doing a full table scan and wont be collecting every single document as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool as mongo is more suited for a general purpose database. Timeseries data(storing something every 5 sec) can be handled pretty well by database like cassandra which can handle frequent writes with ease, just as in your case.
Alternate fragmentation(update) :
Always write your current data in collection "week0" and in the background run a weekly scheduler that moves the data from "week0" to history collections "week1","week2" and so on. Fragmentation logic depends on your requirements.
I think the $bucket stage might help you with it.
You can do something like,
db.collection.aggregate([
{
$bucketAuto: {
groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
buckets: 5 // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
}
}
])
Each document in the result for the above query would be something like this,
"_id": {
"max": 3,
"min": 1
},
"count": 2
If you had grouped by temperature, then each document will have the minimum and maximum temperature found in that sample
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.

Find documents in MongoDB with non-typical limit

I have a problem, but don't have idea how to resolve it.
I've got PointValues collection in MongoDB.
PointValue schema has 3 parameters:
dataPoint (ref to DataPoint schema)
value (Number)
time (Date)
There is one pointValue for every hour (24 per day).
I have API method to get PointValues for specified DataPoint and time range. Problem is I need to limit it to max 1000 points. Typical limit(1000) method isn't good way, because I need point for whole, specified time range, with time step depends on specified time range and point values count.
So... for example:
Request data for 1 year = 1 * 365 * 24 = 8760
It should return 1000 values but approx 1 value per (24 / (1000 / 365)) = ~9 hours
I don't have idea what method i should use to filter that data in MongoDB.
Thanks for help.
Sampling exactly like that on the database would be quite hard to do and likely not very performant. But an option which gives you a similar result would be to use an aggregation pipeline which $group's the $first best value by $year, $dayOfYear, and $hour (and $minute and $second if you need smaller intervals). That way you can sample values by time steps, but your choices of step lengths are limited to what you have date-operators for. So "hourly" samples is easy, but "9-hourly" samples gets complicated. When this query is performance-critical and frequent, you might want to consider to create additional collections with daily, hourly, minutely etc. DataPoints so you don't need to perform that aggregation on every request.
But your documents are quite lightweight due to the actual payload being in a different collection. So you might consider to get all the results in the requested time range and then do the skipping on the application layer. You might want to consider combining this with the above described aggregation to pre-reduce the dataset. So you could first use an aggregation-pipeline to get hourly results into the application and then skip through the result set in steps of 9 documents. Whether or not this makes sense depends on how many documents you expect.
Also remember to create a sorted index on the time-field.

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 23 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete
partition of my data? According to this blog, using SELECT * from
mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp
IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you add the hour to avoid an hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have two choices:
Run 1 query at time.
Run 2 queries the first time, and then one at time to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for the hour 0, process the data and, when finished, run the query for the hour 1 and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data of both the hours 0 and 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish to process data for hour 0 you could prefetch data for hour 2 and start processing data for hour 1. And so on.... In this way you could speed up data processing. Of course, depending on your timings (data processing and query times) you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
rs.fetchMoreResults(); // this is asynchronous
// Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, eg you need both data for hour 3 and 6, then you could try to group data by "dependency" (eg query both hour 3 and 6 in parallel).
If you need all of them then should run 24 queries in parallel and then join them at application level (you already know why you should avoid the IN for multiple partitions). Remember that your data is already ordered, so your application level efforts would be very small.

How to find out active members during a specific period (month) in cassandra?

In the following real-world scenario:
Users come to a club (e.g. : gym) and purchase a membership for an indefinite amount of time and after a specified amount of time the membership is cancelled.
After the membership cancelled, the same user at a later time can purchase another membership for a one or more months.
I have an event table in which the event of starting and stoping a membership is being logged.
membership_events
member_id : int
event_type_id: int
event_time: timeuuid
PK (member_id, event_type_id, event_time)
One thing which can happen is that a member can have multiple memberships:
2015.1 - 2015.5
2016.1 - 2016.3
2016.5 - ?
How can i find out the via cassandra which is the number of active memberships within a specified month?
Sample data
User1
memberships:
2015.4 - 2015.6
2016.1 - 2016.3
User 2
memberships
2015.7 - 2015.8
2015.9 - 2016.3
User 3
memberships
2015.8 - 2015.12
2016.5 - ?
Active memberships for the month 2016.1:
User 1
User 2
The simple fact that your PK consists of
PK (member_id, event_type_id, event_time)
makes your question hard to solve, and at least inefficient because you need to query all partitions without being able to filter any record at database level (basically you must perform a SELECT without a WHERE clausole).
Just to alleviate this problem, I would transform your model in something like:
CREATE TABLE events (
dummy int,
event_start timestamp,
event_stop timestamp,
member_id int,
PRIMARY KEY (dummy_partition, event_start, event_stop)
);
This table makes use of a dummy partition (this is an HOTSPOT!!! Don't try this at home... and in production...) that would allow to specify something like WHERE dummy = 0 AND ... which you can exploit by writing something like
SELECT member_id FROM events WHERE dummy = 0 AND event_start <= '2016-01-01' AND event_stop > '2016-01-01';
to get the records for the 2016.1 period, assuming that an indefinite membership is stored with a far timestamp (2100-01-01 should be far enough).
With this, you'll extract the member_ids that are active in the 2016.1 month, and the results will eventually contain some member_id duplicates. You'll need to filter them manually at application level.
The truth is that you should rethink you model, and something like creating a new table that holds membership month by month should be the best option you have, and probably it is the best way to solve that specific problem in the C* way.
Hope that helps.

Resources