ARRAY_CONTAINS vs JOIN in Azure Cosmos DB

The JSON documents that we plan to ingest into DocumentDB look as follows:
[
{"id":"id1","LastName": "user1", "GroupMembership":["g1","g2"]},
{"id":"id2","LastName": "user2", "GroupMembership":["g1","g4","g5"]},
{"id":"id3","LastName": "user3", "GroupMembership":["g3","g4","g2"]},
…
]
We want to answer queries such as: get the count of all users who are members of group "g1" or "g2". The number of users is very large (a few million).
What is the best way to implement this query so that it uses the index and avoids scans?
Should I be using ARRAY_CONTAINS or JOIN? Does ARRAY_CONTAINS use the index internally, or does it do a scan?
Option 1)
SELECT VALUE COUNT(1) FROM Users WHERE ARRAY_CONTAINS(Users.GroupMembership, "g1") OR ARRAY_CONTAINS(Users.GroupMembership, "g2")
Option 2)
SELECT VALUE COUNT(1) FROM Users JOIN Membership IN Users.GroupMembership WHERE Membership = "g1" OR Membership = "g2"

Both queries should use the index in the same way, but ARRAY_CONTAINS is likely to give a better execution time than JOIN. You can profile both queries using query metrics, as described in this article: https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-sql-query-metrics#query-execution-metrics

Both should give the same index utilization. However, with JOIN you can get duplicate results per document, because the join emits one row per matching array element, so a user who belongs to both "g1" and "g2" is counted twice. With ARRAY_CONTAINS you won't. For a count, that difference is significant. For more about the duplicate issue, see the replies to the Stack Overflow questions "Getting duplicate records in select query for the Azure DocumentDB" and "Cosmos db joins give duplicate results".
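To make the duplicate issue concrete, here is a minimal pure-Python sketch that mimics the two query shapes on the three sample documents from the question. It does not call Cosmos DB. The helper names count_array_contains and count_join are made up for illustration, and the model simply assumes that the JOIN produces one row per (document, array element) pair.

```python
# Sample documents from the question.
users = [
    {"id": "id1", "LastName": "user1", "GroupMembership": ["g1", "g2"]},
    {"id": "id2", "LastName": "user2", "GroupMembership": ["g1", "g4", "g5"]},
    {"id": "id3", "LastName": "user3", "GroupMembership": ["g3", "g4", "g2"]},
]


def count_array_contains(docs, groups):
    """Option 1 shape: each document is tested once, so it counts at most once."""
    return sum(1 for d in docs if any(g in d["GroupMembership"] for g in groups))


def count_join(docs, groups):
    """Option 2 shape: one row per array element, so a document that
    matches several groups contributes several rows to the count."""
    return sum(1 for d in docs for m in d["GroupMembership"] if m in groups)


print(count_array_contains(users, {"g1", "g2"}))  # 3
print(count_join(users, {"g1", "g2"}))            # 4, because id1 is counted twice
```

If the JOIN form is needed for other reasons, the count has to be de-duplicated per document before it can be read as a number of users.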

Related

How to count number of record filtered by selector

I'm writing a smart contract in Go and I want to use a rich query to get the total count of records in CouchDB filtered by some selector, like:
{\"selector\":{\"doc_type\": \"person\"}}
It is similar to the SQL query:
select count(*) from tb where ...
How can I do this with CouchDB?
If you are going to perform a rich query in chaincode, then all you can do is iterate over the results and count each one. Also note that Hyperledger Fabric caps the total number of records a query can return (it's a configuration parameter), so that is another consideration.
I would recommend reading this section, as what you are trying to do is unlikely to perform well:
https://hyperledger-fabric.readthedocs.io/en/release-1.4/couchdb_as_state_database.html#good-practices-for-queries

GET vs Query on Partition Key and Item Key in Cosmos DB

I was reading the Cosmos DB docs on best practices for query performance, and I found the following ambiguous:
With Azure Cosmos DB, typically queries perform in the following order from fastest/most efficient to slower/less efficient:
1. GET on a single partition key and item key
2. Query with a filter clause on a single partition key
3. Query without an equality or range filter clause on any property
4. Query without filters
Is there a difference in performance or RUs between a "GET on a single partition key and item key" and a "QUERY on a single partition key and item key"? It's not entirely clear to me whether this falls into case #1 or #2 or somewhere in between.
Basically, I'm asking whether we ever need to use GET at all. The docs don't seem to clarify this anywhere.
A direct GET will be faster. As documented, a 1 KB document should cost 1 RU to retrieve. A query will have a higher RU cost, because you're engaging the query engine.
One caveat: with a direct read (the GET), you will retrieve the entire document. With a query, you can choose the projection of properties. For very large documents, this could result in significant bandwidth savings for your app, when using a query.
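The two access paths could be sketched as follows. This assumes the azure-cosmos Python SDK, whose container object exposes read_item and query_items; the container variable, the projected LastName property, and both helper names are assumptions for illustration rather than anything from the question.

```python
def point_read(container, item_id, partition_key):
    """Direct GET: fetches the whole document by id and partition key."""
    return container.read_item(item=item_id, partition_key=partition_key)


def query_last_name(container, item_id, partition_key):
    """Single-partition query that projects only one property.

    This goes through the query engine, so expect a higher RU charge
    than the point read, in exchange for a smaller payload.
    """
    rows = container.query_items(
        query="SELECT c.LastName FROM c WHERE c.id = @id",
        parameters=[{"name": "@id", "value": item_id}],
        partition_key=partition_key,
    )
    return list(rows)
```

Comparing the request charge reported for each call on your own data is the most reliable way to see the difference.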

Order of results in Cassandra

I have two questions about query results in Cassandra.
When I make a "full" select of a table in Cassandra (i.e. select * from table), is it guaranteed that the results will be returned in increasing order of partition tokens?
For instance, having the following table:
create table users(id int, name text, primary key(id));
Is it guaranteed that the following query will return the results with increasing values in the token column?
select token(id), id from users;
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
If the answer to the above question is 'yes', is it still valid if we use a secondary index? For instance, if we had the following index:
create index on users(name);
and we query the table by using the index:
select token(id), id from users where name = 'xyz';
is there any guarantee regarding the order of results?
The motivation for the above questions is whether the token is the right thing to use in order to implement paging and/or resuming of broken, longer "data exports".
EDIT: There are multiple resources on the net that state that the order matches the token order (e.g. in the description of partitioner results or this Datastax page):
Without a partition key specified in the WHERE clause, the actual order of the result set then becomes dependent on the hashed values of userid.
However, the order of results is not specified in the official Cassandra documentation, e.g. for the SELECT statement.
Is it guaranteed that the following query will return the results with increasing values in the token column?
Yes, it is.
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
Data distribution is orthogonal to the ordering of the retrieved data; there is no relationship between the two.
If the answer to the above question is 'yes', is it still valid if we use a secondary index?
Yes, even if you query data using a secondary index (be it SASI or the native implementation), the returned results will always be sorted in token order. Why? The technical explanation is given in my blog post here: http://www.doanduyhai.com/blog/?p=13191#cluster_read_path
That's the main reason why SASI is not a good fit if you want the search to return data ordered by some column values. Only a real search engine integration (like Datastax Enterprise Search) can give you the correct ordering, because it bypasses the cluster read path layer.
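Given that results come back in token order, resuming an export by token could look like the sketch below. The execute argument is a hypothetical callable standing in for a driver session: it is assumed to take a CQL string plus a tuple of parameters and return (token, id) rows. The %s placeholder style and the page size are also assumptions, and the sketch ignores the rare case of two keys hashing to the same token.

```python
FIRST_PAGE_QUERY = "SELECT token(id), id FROM users LIMIT %s"
NEXT_PAGE_QUERY = "SELECT token(id), id FROM users WHERE token(id) > %s LIMIT %s"


def export_ids(execute, page_size=1000, resume_after=None):
    """Yield (token, id) rows in token order, starting after `resume_after`."""
    last_token = resume_after
    while True:
        if last_token is None:
            rows = list(execute(FIRST_PAGE_QUERY, (page_size,)))
        else:
            rows = list(execute(NEXT_PAGE_QUERY, (last_token, page_size)))
        if not rows:
            return  # an empty page means the token range is exhausted
        for token, row_id in rows:
            yield token, row_id
        # Persist this value somewhere durable to resume a broken export.
        last_token = rows[-1][0]
```

To resume after a failure, pass the last token that was safely written out as resume_after.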

Query Multiple Partition Keys At Same Time DynamoDB - Node

I have an array of job IDs.
[ '01', '02', '03', '04' ]
Currently I am looping through the array and executing a Query for each item in the array to get the job details.
Is there a way to use a single query to get all the jobs whose partition key is in the array?
You can use the BatchGetItem API to get multiple items from a DynamoDB table based on key attributes (i.e. the partition key, and the sort key if the table has one).
There are a few options, each with some pros/cons:
BatchGetItem, as @notionquest pointed out. This can fetch up to 100 items or 16 MB of data in a single call, but you need to provide all of the key values (both partition and sort key for each item, if your table uses a composite key schema).
TransactGetItems: this could retrieve up to 25 items in a single call when this answer was written (the limit has since been raised to 100) and, as the name implies, is transactional, so the entire operation will fail if there is another pending operation on any of the items being queried. This can be good or bad depending on your use case. As with BatchGetItem, you need to provide all key attributes.
Query, which you are already using. Query has high performance but only supports one partition key value per request (partition key required, sort key optional). It returns up to 1 MB of data at a time and supports paginated results.
If your jobs table has a composite key (partition + sort), Query is the best option in terms of performance, and it does not require you to specify the sort key values. If the table only has a partition key, then BatchGetItem is probably the best bet (assuming each job item is relatively small and you expect fewer than 100 total jobs to be returned). If either of those assumptions is incorrect, multiple Query calls would be the best option.
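For the partition-key-only case, the BatchGetItem route could be sketched like this. It assumes a boto3 low-level DynamoDB client (boto3.client("dynamodb")) is passed in; the table name, the string key attribute JobId, and the helper name batch_get_jobs are hypothetical. It also assumes the IDs are unique, since a batch request with duplicate keys is rejected.

```python
def batch_get_jobs(client, table_name, job_ids, key_name="JobId"):
    """Fetch items by partition key with BatchGetItem, at most 100 keys per call."""
    items = []
    for start in range(0, len(job_ids), 100):
        chunk = job_ids[start:start + 100]
        keys = [{key_name: {"S": job_id}} for job_id in chunk]
        request = {table_name: {"Keys": keys}}
        while request:
            response = client.batch_get_item(RequestItems=request)
            items.extend(response.get("Responses", {}).get(table_name, []))
            # Keys the service could not process this time are returned
            # in the same shape as RequestItems; send them again.
            request = response.get("UnprocessedKeys") or None
    return items
```

A production version would normally add a backoff delay before retrying unprocessed keys; that is left out here to keep the sketch short.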
You can use PartiQL for this use case:
SELECT *
FROM <TABLE_NAME>
WHERE "Id" IN [ARRAY]
But do note that the PartiQL statement string has length constraints: a minimum length of 1 and a maximum length of 8192.
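// <TABLE_NAME> and [ARRAY] are placeholders to fill in.
// The .promise() call follows the AWS SDK for JavaScript v2 style.
const statement = {
  Statement: 'SELECT * FROM <TABLE_NAME> WHERE "Id" IN [ARRAY]'
};
const result = await dynamoDbClient.executeStatement(statement).promise();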
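One way to fill in the IN list without concatenating values into the statement is to use ? placeholders with a parameter list. The sketch below only builds the statement and parameters; the table name Jobs, the string-typed Id values, and the helper name build_in_statement are assumptions for illustration.

```python
def build_in_statement(table_name, key_name, values):
    """Build a parameterized PartiQL SELECT ... IN statement for string keys."""
    if not values:
        raise ValueError("at least one value is required for the IN list")
    placeholders = ", ".join("?" for _ in values)
    statement = f'SELECT * FROM "{table_name}" WHERE "{key_name}" IN [{placeholders}]'
    # The statement string itself is limited to 8192 characters.
    if len(statement) > 8192:
        raise ValueError("PartiQL statement exceeds 8192 characters")
    parameters = [{"S": value} for value in values]
    return statement, parameters


statement, parameters = build_in_statement("Jobs", "Id", ["01", "02", "03", "04"])
print(statement)  # SELECT * FROM "Jobs" WHERE "Id" IN [?, ?, ?, ?]
```

With a boto3 DynamoDB client, the pair would then be passed as client.execute_statement(Statement=statement, Parameters=parameters).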

how to perform "not in" filter in cql3 query select?

I need to fetch rows that do not have specific keys.
For example:
select * from users where user_id not in ("mikko");
I have tried with "not in" and this is the response:
Bad Request: line 1:35 no viable alternative at input 'not'
"not in" is not a supported operation in CQL. Cassandra at its heart is still based on key-indexed rows, so that query is basically the same as "select * from users": you have to go through every row and check that it does not match the IN list. If you want to run that type of query, you will want to set up a MapReduce job to perform it.
When using Cassandra, what you actually want to do is denormalize your data model so that the queries your application performs end up querying a single partition (or just a few partitions) for their results.
There are also some great webinars and talks on Cassandra data modeling:
http://www.youtube.com/watch?v=T_WRC_GjRd0&feature=youtu.be
http://youtu.be/x4Q9JeLIyNo
http://www.youtube.com/watch?v=HdJlsOZVGwM&list=PLqcm6qE9lgKJzVvwHprow9h7KMpb5hcUU&index=10
