Filter on partition key before iterating over array (Cosmos DB / Azure)

I have a Cosmos DB query that works fine but is a bit slow and expensive:
SELECT c.actionType as actionType, count(1) as count
FROM c in t.processList
WHERE c.processTimestamp > #from
GROUP BY c.actionType
To optimise my query I would like to first have a WHERE clause on my parent partition key, e.g. parent.minute > x, before iterating over the processList. After this WHERE there is no need for the c.processTimestamp > #from filter. My documents look like this:
"id": "b6fd10cc-3a0b-4666-bf55-f22436a5f8d9",
"Name": "xxx",
"Age": 1,
"minute": 202302021026,
"processList": [
{
"processTimestamp": "2023-02-01T10:28:48.3004825Z",
"actionType": "Action1",
"oldValue": "2/1/2023 10:28:41 AM",
"newValue": "2/1/2023 10:28:48 AM"
},
{
"processTimestamp": "2023-02-01T10:28:48.3004825Z",
"actionType": "Action2",
"oldValue": "2/1/2023 10:28:48 AM",
"newValue": "2/1/2023 10:28:48 AM"
}],
}
I have tried subqueries and joins but I could not get it to work:
SELECT c.actionType as actionType, count(1) as count
FROM (SELECT * FROM C WHERE c.minute > 9) in t.processList
WHERE c.processTimestamp > #from
GROUP BY c.actionType
My desired result would be:
[
  {
    "actionType": "action1",
    "count": 85351
  },
  {
    "actionType": "action2",
    "count": 2354
  }
]

A few comments here.
As noted in my comment, GROUP BY with subqueries is unsupported, as documented here.
Using a date/time value as a partition key is typically an anti-pattern for Cosmos DB. This query may be slow and expensive because, at large scale, using time as a partition key means most queries hit the same partition because of data recency (newer data gets more requests than older data). This is bad for writes as well, for the same reason.
When this happens, it is typical to increase the throughput. However, this often does little to help and in some cases can even make things worse. Also, because throughput is evenly distributed across all partitions, this results in wasted, unused throughput on the partition keys for older dates.
Two things to consider. First, make your partition key a combination of two properties to increase cardinality. In an IoT scenario this would typically be deviceId_dateTime (hierarchical partition keys, currently in preview, are a better way to do this). This especially helps with writes, where data is always written with the current dateTime.
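For illustration only, such a synthetic key could be computed client-side before the document is written; the deviceId property and the partitionKey name below are assumptions, not part of the original schema:
// Combine a high-cardinality identifier with the time bucket so writes spread
// across many logical partitions instead of piling onto the current minute.
function withSyntheticPartitionKey(doc) {
  doc.partitionKey = `${doc.deviceId}_${doc.minute}`; // e.g. "device42_202302021026"
  return doc;
}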
Second, on the read path for queries, you might explore implementing a materialized view using change feed into a second container. This moves the read throughput off the container used for ingestion and can result in more efficient throughput usage. However, you should measure this yourself to be sure.
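A very rough sketch of that pattern, assuming an Azure Functions Cosmos DB trigger written in JavaScript (the binding names, container layout and aggregation shape are all assumptions, not a prescribed implementation):
// function.json (not shown) is assumed to bind "documents" to the ingestion
// container's change feed and "outputDocument" to the materialized-view container.
module.exports = async function (context, documents) {
  // Re-shape each changed document into the pre-aggregated form the read path needs.
  context.bindings.outputDocument = documents.map((d) => ({
    id: d.id,
    minute: d.minute,
    actionCounts: countByActionType(d.processList || [])
  }));
};

function countByActionType(processList) {
  const counts = {};
  for (const p of processList) {
    counts[p.actionType] = (counts[p.actionType] || 0) + 1;
  }
  return counts;
}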
If your container is small and will always stay that way (< 10K RU/s and 50 GB), then the information above will not apply. However, such a design will not scale.

As Mark said, GROUP BY is not supported on a subquery. I tried to work around it with LINQ, but GROUP BY is not supported in LINQ either, so I changed my query to use a JOIN instead of looping over the array with the IN keyword:
SELECT pl.actionType as actionType, count(1) as count
FROM c
JOIN pl IN c.processList
WHERE c.minute > #from
GROUP BY pl.actionType
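For completeness, a minimal sketch of running that query with a parameter from the JavaScript SDK (@azure/cosmos); the connection handling, database/container names and the value passed as @from are assumptions:
const { CosmosClient } = require("@azure/cosmos");

async function actionTypeCounts(from) {
  const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING);
  const container = client.database("mydb").container("mycontainer");
  const { resources } = await container.items
    .query({
      query:
        "SELECT pl.actionType AS actionType, COUNT(1) AS count " +
        "FROM c JOIN pl IN c.processList " +
        "WHERE c.minute > @from " +
        "GROUP BY pl.actionType",
      parameters: [{ name: "@from", value: from }]
    })
    .fetchAll();
  return resources; // e.g. [{ actionType: "Action1", count: 85351 }, ...]
}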

Related

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries (3 hours), but that is more time to pull from the server and more time and memory to plot. When looking at larger trends the small movements are noise, so I would be better off retrieving the same number of documents (720) but only every 3rd, still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection which will house no more than weekly or monthly data. A new month/week means storing your data in a different collection. That way you won't be doing a full table scan and won't be collecting every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more suited to being a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0" and, in the background, run a weekly scheduler that moves the data from "week0" to history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
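A small sketch of that application-side routing (the collection naming scheme and the driver call are illustrative assumptions):
// Pick the collection that holds the week a given date falls into:
// "week0" for the current week, "week1" for the previous one, and so on.
function weeklyCollection(db, date, now = new Date()) {
  const msPerWeek = 7 * 24 * 60 * 60 * 1000;
  const weeksAgo = Math.floor((now - date) / msPerWeek);
  return db.collection(`week${weeksAgo}`);
}

// Usage: weeklyCollection(db, someDate).find({ ... })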
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // here you'll put the variable you need, in your example "temperature"
      buckets: 5       // the number of documents you want to return, so if you want a sample of 500 documents, put 500 here
    }
  }
])
Each document in the result for the above query would be something like this:
{
  "_id": {
    "max": 3,
    "min": 1
  },
  "count": 2
}
If you had grouped by temperature, then each document would have the minimum and maximum temperature found in that sample.
You might have another problem. Docs state not to rely on natural ordering:
"This ordering is an internal implementation feature, and you should not rely on any particular structure within it."
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
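A rough sketch of that idea, assuming each document stores an integer epoch-seconds field (called epoch here) and that readings land on 5-second boundaries; both the field name and the boundary assumption are illustrative, and $expr requires MongoDB 3.6+:
// Keep roughly every nth reading by requiring the stored epoch seconds to be a
// multiple of n * 5 (data is written every 5 seconds).
const n = 3; // every 3rd document
db.collection(collection)
  .aggregate([
    { $match: { $expr: { $eq: [{ $mod: ["$epoch", n * 5] }, 0] } } },
    { $sort: { epoch: -1 } },
    { $limit: 720 }
  ])
  .toArray();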

Mongodb mapreduce performance

I have a mapreduce function that I use to prepare data for my web app to be used in realtime.
It works fine, but it doesn't meet my performance requirements.
My aim is (and I know that it's not meant to be used this way) to perform it when the web app user requests it (more or less in real time).
I do use Mapreduce because the transformation of the data needs a lot of if/else conditions due to functional requirements.
My subset of initial data to be transformed is about 100k rich documents ( < 1kB ).
The result is stored in a collection (in Replace mode) that will then be used by the web app.
The processing currently takes about 6-9 seconds, and the CPU and RAM usage are very low.
The acceptable waiting time for my users should be less than 5 seconds.
So, to make use of the unused CPU, I tried to divide my initial input data into subsets and perform the MapReduce on each subset in a different thread (20k documents per thread).
For that I had to change the Replace mode to Merge mode to be able to collect the results into the same collection.
But it didn't help. It consumes more CPU, but the total execution time is more or less the same.
Setting "nonAtomic" to true in my mapReduce calls didn't help neither.
I read somewhere that there are (at least) 2 issues with running it this way :
My threads are not running in parallel for the inserts as the insert locks the output collection.
My threads are not running in parallel during processing because the js engine used by mongodb is not thread safe.
Are these points correct? And do you know any other better solutions?
PS: My MapReduce doesn't group data, it only transforms it based on functional conditions (a lot of them). All emitted documents are unique (so reduce is always 0).
EDIT:
Here is an example:
My input objects are product groups, i.e.:
{
  _id : "1",
  products : [
    { code : "P1", name : "P1", price : 22.1 ...., competitors : [ { code : "c1", price : 22.2 }, { code : "c2", price : 21.9 } ] },
    { code : "P2", name : "P2", price : 22.1 ...., competitors : [ { code : "c1", price : 22.2 }, { code : "c2", price : 21.9 } ] }
  ]
}
Users should be able to dynamically define functional groups based on some criteria applied to each product, and to define a pricing strategy for each of them.
As a simple example of functional grouping, they could define 4 groups like this:
Cheap Products (whose price is less than 20)
Products that are sold by both competitors "C1" and "C2"
Products that are sold only by the competitor "C3"
Products that are sold by the competitor "C4" and is not in Promo
...
All these groups are defined based on properties of the Product object, and because one product can possibly fit more than one group, the first group encountered is the one used (if a product fits the first group, it must not appear in any other one).
Once the group criteria are defined, users can define for each group a strategy to apply to calculate a new price for each product, based on some conditions (these also use the Product's properties, but also the properties of other products in the same array of the original input object).
The result is a collection of separate products, each with its functional group, its new price and some other calculated stats and values.
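To make the first-match-wins rule concrete, here is a minimal sketch; the group names and predicates are made up for illustration and are not the actual business rules:
// Ordered list of group definitions; a product belongs to the first group whose
// predicate it satisfies and is never considered for later groups.
const groups = [
  { name: "Cheap Products", test: (p) => p.price < 20 },
  { name: "Sold by C1 and C2", test: (p) => ["c1", "c2"].every((c) => p.competitors.some((x) => x.code === c)) },
  { name: "Only sold by C3", test: (p) => p.competitors.length === 1 && p.competitors[0].code === "c3" }
];

function assignGroup(product) {
  const group = groups.find((g) => g.test(product));
  return group ? group.name : null;
}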

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using
SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23);
is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as:
SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 24 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have three choices:
Run 1 query at a time.
Run 2 queries the first time, and then one at a time, to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1, and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time, you could run 2 queries in parallel to get the data for both hour 0 and hour 1, and start processing the data for hour 0. In the meantime, the data for hour 1 arrives, so when you finish processing the data for hour 0 you can prefetch the data for hour 2 and start processing the data for hour 1. And so on... In this way you can speed up data processing. Of course, depending on your timings (data processing and query times), you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need the data for both hour 3 and hour 6, then you could try to group data by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join them at the application level (you already know why you should avoid IN for multiple partitions). Remember that your data is already ordered, so your application-level effort will be very small.
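The question uses the Java driver, but purely as an illustration of the fan-out-and-join idea, here is a sketch using the DataStax Node.js driver; the contact points, keyspace and table/column values are assumptions:
const cassandra = require("cassandra-driver");

const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "mykeyspace"
});

async function readWholeDay(id, date) {
  const query =
    "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?";
  // One query per hour bucket, all issued in parallel and joined at the application level.
  const hours = Array.from({ length: 24 }, (_, h) => h);
  const results = await Promise.all(
    hours.map((h) => client.execute(query, [id, date, h], { prepare: true }))
  );
  return results.flatMap((rs) => rs.rows);
}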

How to make edges unique and to quantify them without out-of-memory error

I've created an edge collection with about 16 million edges. The edges are not unique, meaning there can be more than one edge from vertex a to vertex b. The edge collection holds about 2.4 GB of data and has a 1.6 GB edge index. I am using a computer with 16 GB RAM (and, additionally, 16 GB of swap space).
Now I try to calculate the unique edges (between each pair of vertices a-b) with a statement like this one:
FOR wf IN DeWritesWith
  COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
  INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res } INTO DeWritesWithAggregated
  // This also leads to an out-of-memory error:
  // RETURN { "_from": from, "_to": to, "type": "writesWith", "numArticles": res }
My problem: I always run out of memory (32 GB). As the problem also occurs when I do not write the result (the RETURN variant), I assume it is not a problem of huge write transaction logs.
Is this normal, and can I optimize the AQL somehow? I am hoping for a solution, as I think this is a fairly generic usage scenario for graphs...
Since ArangoDB 2.6, the COLLECT can run in two modes:
the sorted mode that uses a sort step before aggregation
a hash table mode that does not require an upfront sort step
The optimizer will choose the hash table mode automatically if it is considered to be cheaper than the sorted mode with the sort step.
The new COLLECT implementation should make the selection part of the query run much faster in 2.6 than in 2.5 and before. Note that COLLECT still produces a sorted output of its result (not of its input), even in hash table mode. This is done for compatibility with the sorted mode. The result sort step can be avoided by adding an extra SORT null instruction after the COLLECT statement; the optimizer can then optimize away the sorting of the result.
A blog post that explains the two modes is here:
http://jsteemann.github.io/blog/2015/04/22/collecting-with-a-hash-table/
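As a sketch of that hint, here is the same aggregation with the extra SORT null added after the COLLECT, run from arangosh (collection names as in the question):
// The SORT null after the COLLECT tells the optimizer that the result order does
// not matter, so the final sort step of the hash COLLECT can be optimized away.
db._query(`
  FOR wf IN DeWritesWith
    COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
    SORT null
    INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res }
      INTO DeWritesWithAggregated
`);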

Aggregating data with CouchDB reduce function

I have a process which posts documents similar to the one below to CouchDB:
{
  "timestamp": [2010, 8, 4, 9, 25, 24],
  "type": "quote",
  "bid": 95.0,
  "offer": 96.5
}
Many such documents are posted over the course of a day, each timestamped appropriately.
I want to create a CouchDB view which returns the last quote stored every day.
I've been reading the View Cookbook for SQL Jockeys on how to create complex views, but I have trouble seeing how to combine map and reduce functions to achieve the desired result. The map function is easy; it's the reduce function I'm having trouble with.
Any pointers gratefully received.
Create a map function that returns all documents for a given time period under the same key. For example, return all documents in the 17th hour of the day with key 17.
Create a reduce function that returns only the latest bid for that hour. Your view will return 24 documents, and your client-side code will do the final merge.
There are many ways to accomplish this. You can retrieve a single latest bid by emitting a single key from your map function and then reducing by searching all bids, but I'm not sure how that will perform for extremely large sets, such as those you'd encounter with a bidding system.
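A rough sketch of the per-hour approach described above, written as a CouchDB map and reduce pair; the key shape is an assumption (keying on doc.timestamp.slice(0, 4) instead would give one group per day and hour):
// Map: key each quote by the hour it was recorded in.
// timestamp is [year, month, day, hour, minute, second], so the hour is index 3.
function (doc) {
  if (doc.type === "quote") {
    emit(doc.timestamp[3], doc);
  }
}

// Reduce: keep only the quote with the latest timestamp within each key.
function (keys, values, rereduce) {
  function later(a, b) {
    for (var i = 0; i < a.timestamp.length; i++) {
      if (a.timestamp[i] !== b.timestamp[i]) {
        return a.timestamp[i] > b.timestamp[i] ? a : b;
      }
    }
    return a;
  }
  return values.reduce(later);
}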
Update
http://wiki.apache.org/couchdb/View_Snippets#Computing_simple_summary_statistics_.28min.2Cmax.2Cmean.2Cstandard_deviation.29
