I want to do a select query in Cosmos DB that returns a maximum number of results (say 50) and then gives me the continuation token so I can continue the search where I left off.
Now let's say my query has 3 equality conditions in my where clause, e.g.
where prop1 = "a" and prop2 = "w" and prop3 = "g"
In the results that are returned, I want the records that satisfy prop1 = "a" to appear first, followed by the results that have prop2 = "w", followed by the ones with prop3 = "g".
Why do I need it? While I could just pull all the data into my application and sort it there, I obviously can't fetch every record, as that would mean pulling in too much data. So if I can't order it this way in Cosmos itself, the results I get back might only contain records that don't have prop1 = "a" at all. I could keep retrying until I get the ones with prop1 = "a" (I need this because I want to show the results with prop1 = "a" to the user as the first set of results), but since I have a huge dataset sitting in my Cosmos DB, I might have to pull a hundred times before the first such record shows up.
How can I handle this scenario in Cosmos? Thanks!
So if I am understanding your question correctly, you want to accomplish this:
SELECT * FROM c
WHERE c.prop1 = 'a'
  AND c.prop2 = 'b'
  AND c.prop3 = 'c'
ORDER BY c.prop1, c.prop2, c.prop3
OFFSET 0 LIMIT 25
Now, luckily, you can do this in Cosmos DB SQL. But there is a caveat: you have to set up a composite index on your collection to allow for this.
So, for this collection, my composite index would look something like this (a sketch of the compositeIndexes section of the container's indexing policy, with all three paths ascending to match the ORDER BY):
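"compositeIndexes": [
    [
        { "path": "/prop1", "order": "ascending" },
        { "path": "/prop2", "order": "ascending" },
        { "path": "/prop3", "order": "ascending" }
    ]
]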
Now, if I wanted to change it to this:
SELECT * FROM c
WHERE c.prop1 = 'a'
  AND c.prop2 = 'b'
  AND c.prop3 = 'c'
ORDER BY c.prop1 DESC, c.prop2, c.prop3
OFFSET 0 LIMIT 25
I could add another composite index to cover that use-case, this time with /prop1 descending. You can see in your settings that compositeIndexes is an array of arrays, so you can add as many combinations as you'd like. Sketched out, the section would then become:
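"compositeIndexes": [
    [
        { "path": "/prop1", "order": "ascending" },
        { "path": "/prop2", "order": "ascending" },
        { "path": "/prop3", "order": "ascending" }
    ],
    [
        { "path": "/prop1", "order": "descending" },
        { "path": "/prop2", "order": "ascending" },
        { "path": "/prop3", "order": "ascending" }
    ]
]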
This should get you to where you need to be if I understood your question correctly.
I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
    "_key": "....",
    "data": 26,
    "timecaptured": 1643488638.946702
}
where timecaptured, for now, is a UTC timestamp.
What I want to do is get the duration between consecutive observations, i.e. the difference in timestamps between two documents in time order. In SQL I could do this with LAG, for example, but with ArangoDB and AQL I am struggling to see how to do this at the database. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
    SORT d.timecaptured
    WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
    LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
    RETURN timediff
We simply calculate the sum of the previous and the current document's timecaptured; by subtracting the current document's timecaptured from that sum, we recover the previous document's value, and from there we can easily calculate the requested difference.
I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
However, neither approach is very straightforward or obvious. I have put it on my TODO list to add an APPEND aggregate function that could be used in WINDOW and COLLECT operations.
The WINDOW operation doesn't give you direct access to the data in the sliding window, but here is a rather clever workaround:
FOR doc IN collection
    SORT doc.timecaptured
    WINDOW { preceding: 1 }
    AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
    LET timediff = doc.timecaptured - d[0].timecaptured
    RETURN MERGE(doc, { timediff })
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous document's timecaptured value from the current document's. In the case of the first record, d[0] is actually the current document itself, so the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured, on the other hand, will give you the negated timestamp for the first record, because d[1] is null (there is no previous document) and evaluates to 0 in the subtraction.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)
I would like to do the equivalent of a PostgreSQL EXISTS query in ArangoDB.
Let's say I have a Q ArangoDB query (AQL). How can I check if Q returns any result?
Example:
Q = "For u in users FILTER 'x#example.com' = u.email"
What is the best way to do it (most performant)?
I have ideas, but couldn't find an easy way to measure the performance:
Idea 1: using Length:
RETURN LENGTH(%Q RETURN 1) > 0
Idea 2: using First:
RETURN First(%Q RETURN 1) != null
Above, %Q is a substitution for the query defined at the beginning.
I think the best way to achieve this for a generic selection query with a structure like
Q = "For u in users FILTER 'x#example.com' = u.email"
is to first add a LIMIT clause to the query, and only make it return a constant value (in contrast to the full document).
For example, the following query returns a single match if there is such document or an empty array if there is no match:
FOR u IN users FILTER 'x#example.com' == u.email LIMIT 1 RETURN 1
(please note that I also changed the operator from = to == because otherwise the query won't parse).
Please note that this query may benefit a lot from creating an index on the search attribute, i.e. email. Without the index the query will do a full collection scan and stop at the first match, whereas with the index it will just read at most a single index entry.
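For example, in arangosh (a sketch, assuming the collection is named users):
db.users.ensureIndex({ type: "persistent", fields: ["email"] });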
Finally, to answer your question, the template for the EXISTS-like query will then become
LENGTH(%Q LIMIT 1 RETURN 1)
or fleshed out via the example query:
LENGTH(FOR u IN users FILTER 'x#example.com' == u.email LIMIT 1 RETURN 1)
LENGTH(...) will return the number of matches, which in this case will be either 0 or 1. It can also be used in filter conditions as follows
FOR ....
FILTER LENGTH(...)
RETURN ...
because LENGTH(...) will be either 0 or 1, which in context of a FILTER condition will evaluate to either false or true.
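For example (a sketch with a hypothetical groups collection, where each user stores its group's key in u.group):
FOR g IN groups
    FILTER LENGTH(FOR u IN users FILTER u.group == g._key LIMIT 1 RETURN 1)
    RETURN g
This returns only the groups that have at least one user.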
Do you need an AQL solution?
If you only need the count:
var q = "FOR u IN users FILTER 'x#example.com' == u.email RETURN 1";
var res = db._createStatement({ query: q, count: true }).execute();
var ct = res.count();
That is the fastest I can think of.
I have a couple of records, rec1 and rec2.
Both have a common key/value, name1.
When name1 is equal in both records, I need to set a few values from rec2 on rec1.
I iterate over them with two nested loops, as below:
rec1.each { r1 ->
    rec2.each { r2 ->
        if (r2.name1 == r1.name1) {
            r1.name2 = r2.name2
            r1.name3 = r2.name3
        }
    }
}
Is there any better way of doing this?
Example (sorry, I am just pasting the contents):
recoRecord : [["CHANNEL":INBOUND, "STOCK_LEVEL":2410.0,
"OFFER_TARIFF_ID":FBUN-MVP-VME-VIRGIN-31-24-04, "P_BAND":P4-6,
"CONTRACT_LENGTH":24.0, "INCENTIVE_POINTS":10.0,
"HANDSET_PKEY_ID":SAM-STD-I9300-1, "CUST_TYPE":MEDIA]]
records : [["MEDIA_SUBSIDY_VALUE":0.0, "CREDIT_CLASS":C5,
"DOM_OTHER_MARGIN":0.0, "isBatchTerminator":false,
"CALL_GROUP_DESC":COMBINED, "DM":20.0, "BLACKBERRY_IND":N,
"PREFERRED_BLACKBERRY":N, "ERROR_ID":0, "CUST_TYPE":MEDIA,
"TARIFF_MRC":30.99, "MOST_USED_TAC":35961404, "FORM_FACTOR":null,
"CAMERA_IND":null, "NEW_MARGIN":22.272501, "MODEL":null,
"IS_MMS_ALLOWANCE":N, "ACTIVE_HANDSET_BANDS":,
"CUST_OUT_OF_ALLOWANCE_PLAN":JV15, "OOB_DOM_VOICE":0.0,
"OOB_DOM_SMS":0.0, "VM_CUST_FLAG":Y, "IB_DATA":0.0,
"CHANNEL_FLAG":INBOUND, "SMS_ALLOWANCE":5000.0, "ROAM_SMS_MARGIN":0.0,
"TARIFF_DESC":30.99 Virgin Media 24 month+1GB 1300mins,
"MARGIN_CHANGE_PCT":0.12691319, "OFFER_VOICE_ALLOWANCE":600,
"MAKE":null, "IS_ONNET_ALLOWANCE":Y, "OFFER_CONTRACT_TERM":24.0,
"PREFERRED_MINUTES":1300, "PREFERRED_ON_NET":Y,
"MOST_USED_IMEI":359614048625860, "DISCOUNT":3.0,
"NetPresentValue":1.15, "RecInd":1, "WIFI_IND":null, "IPHONE_IND":N,
"OFFER_TARIFF_ID":FBUN-MVP-VME-VIRGIN-31-24-04,
"IncentivePoints":-1.0]
When OFFER_TARIFF_ID is the same in both records, I would like to set a few values from the first record on the second record.
You do not need to iterate over both the maps. You just need to check whether the value of that particular key matches:
if(r2.'OFFER_TARIFF_ID' == r1.'OFFER_TARIFF_ID'){
//push the required entries from r1 to r2
}
Although in your edit I do not see a valid data structure for records, I have treated r1 and r2 as Maps.
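If rec1 and rec2 are lists of such maps, you can also avoid the nested loops entirely by indexing one list by OFFER_TARIFF_ID first (a sketch; name2 and name3 stand in for whichever values you actually copy):
// Build a lookup of rec2 entries keyed by OFFER_TARIFF_ID, then do a single pass over rec1.
def byTariff = rec2.collectEntries { [(it.OFFER_TARIFF_ID): it] }
rec1.each { r1 ->
    def r2 = byTariff[r1.OFFER_TARIFF_ID]
    if (r2) {
        r1.name2 = r2.name2
        r1.name3 = r2.name3
    }
}
This turns the nested O(n*m) iteration into two linear passes.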
I have a query object (SQL) with some records; the problem is that some of the records contain duplicate values. :( I can't use DISTINCT in my SQL query, so how do I remove the duplicates from my object?
categories[1].id = 1
categories[2].id = 1
categories[3].id = 2
categories[4].id = 3
categories[5].id = 2
Now I want to get a list with 1, 2, 3
Is that possible?
I'm not quite sure why you say you can't use DISTINCT, even given the qualification you offered. It doesn't matter where a query came from (<cfquery>, <cfldap>, <cfdirectory>, built by hand); by the time it's exposed to your CFML code, it's just "a query", so you can definitely use DISTINCT on it via a query of queries:
<cfquery name="distinctCategories" dbtype="query">
SELECT DISTINCT id
FROM categories
</cfquery>
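And if you then want the result as a simple list like 1,2,3 (a sketch, assuming the query above), ValueList() will give you exactly that:
<cfset idList = ValueList(distinctCategories.id)>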
I am trying to query the WadPerformanceCountersTable generated by Azure Diagnostics, which has a PartitionKey based on ticks, accurate to the minute. This PartitionKey is stored as a string (which I do not have any control over).
I want to be able to query against this table to get data points for every minute, every hour, every day, etc., so I don't have to pull all of the data (I just want a sampling to approximate it). I was hoping to use the modulus operator to do this, but since the PartitionKey is stored as a string and this is an Azure Table, I am having issues.
Is there any way to do this?
Non-working example:
var query =
    (from entity in ServiceContext.CreateQuery<PerformanceCountersEntity>("WADPerformanceCountersTable")
     where
         long.Parse(entity.PartitionKey) % interval == 0 && // bad for a variety of reasons
         String.Compare(entity.PartitionKey, partitionKeyEnd, StringComparison.Ordinal) < 0 &&
         String.Compare(entity.PartitionKey, partitionKeyStart, StringComparison.Ordinal) > 0
     select entity)
    .AsTableServiceQuery();
If you just want to get a single row between two different points in time (now and N time back), you can use the following query, which returns a single row:
// 10 minutes span Partition Key
DateTime now = DateTime.UtcNow;
// Current Partition Key
string partitionKeyNow = string.Format("0{0}", now.Ticks.ToString());
DateTime tenMinutesSpan = now.AddMinutes(-10);
string partitionKeyTenMinutesBack = string.Format("0{0}", tenMinutesSpan.Ticks.ToString());
//Get a single row sampled from the last 10 minutes
CloudTableQuery<PerformanceCountersEntity> cloudTableQuery =
(
from entity in ServiceContext.CreateQuery<PerformanceCountersEntity>("WADPerformanceCountersTable")
where
entity.PartitionKey.CompareTo(partitionKeyNow) < 0 &&
entity.PartitionKey.CompareTo(partitionKeyTenMinutesBack) > 0
select entity
).Take(1).AsTableServiceQuery();
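To actually run it and materialize the sample (a sketch; Execute() on CloudTableQuery<T> from the old Microsoft.WindowsAzure.StorageClient library lazily pages through the results):
var sample = cloudTableQuery.Execute().FirstOrDefault();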
The only way I can see to do this would be to create a process that keeps the Azure table in sync with another version of itself, in which the PartitionKey is stored as a number instead of a string. Once done, I could use a method similar to the one in my question to query the data.
However, this is a waste of resources, so I don't recommend it. (I'm not implementing it myself, either.)