CosmosDB - does DISTINCT scan the index for the field or scan all records? - Azure

When there is a table with records of this format:
{
  "productId": "123",
  "productCategory": "fmcg"
}
There can be millions of such product records spread across a few hundred productCategory values.
If I do a SELECT DISTINCT c.productCategory FROM c on this table, does it just scan the index it has created for productCategory (is there such an index?), or does it go through the millions of records to figure out the distinct productCategory values?

At this moment Cosmos won't leverage your index to perform the DISTINCT in your query. It goes through all documents that match your filter, so the RU charge increases linearly with the number of documents it has to scan.
It is, however, something that is being worked on, according to the Azure Share Your Ideas board.

It looks like Microsoft has optimized the way the DISTINCT, GROUP BY, OFFSET LIMIT and JOIN operators work by leveraging the index better.
This feature was rolled out around June 2021.
More details:
https://feedback.azure.com/d365community/idea/8d3cad4c-0e25-ec11-b6e6-000d3a4f0858
https://devblogs.microsoft.com/cosmosdb/introducing-a-new-system-function-and-optimized-query-operators/

Related

How to find Duplicate documents in Cosmos DB

I have seen a huge amount of data written to Cosmos DB from a Stream Analytics job on a particular day.
It was not supposed to write a huge number of documents in a day. I have to check whether there is duplication of documents on that particular day.
Is there any query/any way to find out duplicate records in Cosmos DB?
It is possible if you know the properties to check for duplicates.
We had a nasty production issue causing many duplicate records as well.
Upon contacting MS Support to help us identify the duplicate documents, they gave us the following query.
Bear in mind: properties A and B together define uniqueness in our case, so if two documents have the same values for A and B, they are duplicates.
You can then use the output of this query to, for example, delete the oldest ones but keep the most recent (based on _ts):
SELECT d.A, d.B FROM
    (SELECT c.A, c.B, COUNT(c._ts) AS counts FROM c
     GROUP BY c.A, c.B) AS d
WHERE d.counts > 1
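For each (A, B) pair returned, you could then fetch the duplicates ordered by _ts to decide which documents to keep. A minimal sketch (the @a and @b parameters are placeholders fed from the outer query's output):
SELECT * FROM c
WHERE c.A = @a AND c.B = @b
ORDER BY c._ts DESC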
Is there any query/any way to find out duplicate records in cosmos DB?
Quick answer: yes. Use the DISTINCT keyword in the Cosmos DB SQL query and filter on _ts (the system-generated Unix timestamp: https://learn.microsoft.com/en-us/azure/cosmos-db/databases-containers-items#properties-of-an-item).
Something like:
SELECT DISTINCT c.X, c.Y, c.Z ... (all columns you want to check) FROM c WHERE c._ts falls on the particular day
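Since _ts is in Unix seconds, "a particular day" translates to a range check. For example (the epoch values below are placeholders for the day's boundaries):
SELECT DISTINCT c.X, c.Y, c.Z
FROM c
WHERE c._ts >= 1622505600 AND c._ts < 1622592000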
Then you could delete the duplicate data using this bulk delete library: https://github.com/Azure/azure-cosmosdb-bulkexecutor-dotnet-getting-started/tree/master/BulkDeleteSample

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this
CREATE TABLE domain(
    domain_name text,
    status int,
    last_scanned_date bigint,
    PRIMARY KEY(domain_name, last_scanned_date)
);
My requirement is to get all the domains that were not scanned in the last 24 hours. I wrote the following query, but it is not efficient, as Cassandra tries to fetch the entire dataset because of ALLOW FILTERING
SELECT * FROM domain WHERE last_scanned_date <= <last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries
1st query:
SELECT DISTINCT domain_name FROM domain;
2nd query:
Use the IN operator to query domains that were not scanned in the last 24 hours
SELECT * FROM domain WHERE
    domain_name IN ('domain1', 'domain2')
    AND last_scanned_date <= <last24hourstimeinmillis>
My second approach works, but it comes with the extra overhead of first querying for the distinct values.
Is there any better approach than this?
You should update your table definition. Currently you use the domain name as the partition key, yet a single Cassandra partition cannot hold more than 2 billion records.
I would suggest using time as part of your partition key. If you will not receive more than 2 billion records per day, use the day since epoch as the partition key. You could use a composite partition key, but it won't help this query.
When querying, you then have to scan at most two partitions, with an additional filter in the query or in your application to drop results that fall outside the range you specified.
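A minimal sketch of that layout (table and column names are illustrative):
CREATE TABLE domain_scans_by_day(
    day_since_epoch int,
    last_scanned_date bigint,
    domain_name text,
    status int,
    PRIMARY KEY(day_since_epoch, last_scanned_date, domain_name)
);
-- today's partition: domains scanned up to the cutoff
SELECT domain_name, last_scanned_date FROM domain_scans_by_day
WHERE day_since_epoch = ? AND last_scanned_date <= ?;
-- repeat for yesterday's partition to cover a full 24-hour window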
Go over the following concepts before finalizing your design:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
Cassandra can effectively perform range queries only inside one partition. The same holds for aggregations such as DISTINCT. So in your case you would need a single partition containing all the data, but that is bad design.
You may try to split this big partition into smaller ones by using TLDs as separate partition keys and fetching from every partition in parallel, but this will also lead to imbalance, as some TLDs have more sites than others.
Another issue with your schema is that last_scanned_date is a clustering column, which means that updating last_scanned_date effectively inserts a new row into the database. You need to explicitly remove the row for the previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will keep fetching old rows that you have already scanned.
Your problem could be partially solved with the current design by using Spark, which can perform an effective scan of the full table via a token range scan plus a range filter on every individual row, returning only data in the given time range. Or, if you don't want to use Spark, you can perform the token range scan in your own code, something like this.
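A rough CQL sketch of one step of such a scan (real code would iterate over the ring's token sub-ranges and filter last_scanned_date client-side):
-- scan one token sub-range of the partition key
SELECT domain_name, last_scanned_date, status
FROM domain
WHERE token(domain_name) > ? AND token(domain_name) <= ?;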

Check in one query if multiple records exist in Cassandra

I have a list of strings: "A", "B", "C".
I would like to know how I can check whether all these strings exist in a Cassandra column.
I have two approaches I previously used for relational databases, but I recently moved to Cassandra and I don't know how to achieve this there.
The problem is that I have about 100 strings to check, and I don't want to send 100 requests to my database. That wouldn't be wise.
Interesting question... I don't know the schema you're using, but if your strings are in the only PK column (or in a composite PK where the other column values are known at query time), then you could probably issue 100 queries without worries. The key cache will help avoid hitting disk, so you should get fast responses. An existence check per string is then trivial, as sketched below.
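For example (table and column names are illustrative):
-- one prepared statement, executed once per string
SELECT pk FROM my_table WHERE pk = 'A';
A non-empty result means the string exists.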
Instead, if you intend to do this for a column that is not part of any PK, you'll have a hard time figuring it out unless you perform some kind of trick, and it is all subject to performance restrictions and/or increased code complexity anyway.
As an example, you could build a "frequency" table for the purpose described above, where you store how many times you have "seen" each string "A", "B", etc., and query this table when you need to retrieve the information:
SELECT frequencies FROM freq_table WHERE pk IN ('A', 'B', 'C');
Then you still need to loop over the result set and check that each record is > 0. An alternative would be to issue a SELECT COUNT(*) before the real query, because you know in advance how many records you should get (e.g. 3 in my example); retrieving the correct number of records could be enough (e.g. a missing row means one string was never seen).
Of course you'd need to maintain this table on every insert/update/delete of your main table, raising the complexity of the solution, and of course all the usual IN clause and COUNT related warnings apply...
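A minimal sketch of such a frequency table using Cassandra counters (all names illustrative):
CREATE TABLE freq_table(
    pk text PRIMARY KEY,
    frequencies counter
);
-- maintained on every write to the main table
UPDATE freq_table SET frequencies = frequencies + 1 WHERE pk = 'A';
-- the check above then becomes a single query
SELECT pk, frequencies FROM freq_table WHERE pk IN ('A', 'B', 'C');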
I would probably stick with the 100 queries: with a well-designed table they should not be a problem, unless your cluster is undersized for the problem you're dealing with.
CQL gives you the possibility of using an IN clause, like:
SELECT first_name, last_name FROM emp WHERE empID IN (105, 107, 104);
More information here.
But this approach might not be the best, since it can trigger selects across all nodes of the cluster.
So it depends very much on how your data is structured.
From this perspective, it might be better to run 100 separate queries.

Wide rows vs Collections in Cassandra

I am trying to model many-to-many relationships in Cassandra, something like an Item-User relationship: a user can like many items, and an item can be liked by many users. Let us also assume that the order in which the "like" events occur is not a concern and that the most common query simply returns the "likes" for a given item as well as for a given user.
There are a couple of posts discussing data modeling:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
An alternative would be to store a collection of ItemIDs in the User table to denote the items liked by that user, and to do something similar in the Items table, in CQL3.
Questions
Are there any performance hits in using collections? I think they translate to composite columns, so the read pattern, caching, and other factors should be similar?
Are collections less performant for write-heavy applications? Is updating a collection frequently less performant?
There are a couple of advantages of using wide rows over collections that I can think of:
The number of elements allowed in a collection is 65535 (an unsigned short). If your collection could hold more records than that, wide rows are probably better, as their limit is much higher (2 billion cells (rows * columns) per partition).
When reading a collection column, the entire collection is read every time. Compare this to a wide row, where you can limit the number of rows being read in your query, or constrain the query's criteria on a clustering key (e.g. date > 2015-07-01).
For your particular use case I think modeling an 'items_by_user' table would be more suitable than a list<item> column on a 'users' table.
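A sketch of what those wide-row tables might look like (names are illustrative):
-- all items liked by a user: a single-partition read
CREATE TABLE items_by_user(
    user_id uuid,
    item_id uuid,
    PRIMARY KEY(user_id, item_id)
);
SELECT item_id FROM items_by_user WHERE user_id = ?;
-- a mirror table answers the per-item query
CREATE TABLE users_by_item(
    item_id uuid,
    user_id uuid,
    PRIMARY KEY(item_id, user_id)
);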

How to store and query spatial data on Azure Table Service

I am trying to create a location-based app where people can query the records created within 5 miles of their location.
When a record is created, I will store the latitude and longitude in the Azure Table Service.
Once I have this data, how do I fetch all the records within 5 miles of my current location?
Thank you.
For Azure Table Storage queries to be optimized, they need to run on the PartitionKey and the RowKey. A solution could be to store the latitude in the PartitionKey and the longitude in the RowKey. The PartitionKey and RowKey combination needs to be unique (think primary key in SQL), so if you have multiple entries for the same latitude and longitude, you could use ATS's dynamic properties or InsertOrMerge to store them in the same row. That way you could query like this:
// Note: PartitionKey and RowKey are strings, so these lexicographic
// comparisons only behave numerically if the coordinates are stored as
// fixed-width, zero-padded strings (e.g. "052.3850000").
IQueryable<Entries> query =
    (from q in _table.CreateQuery<Entries>()
     where q.PartitionKey.CompareTo(minLatitude) > 0
        && q.PartitionKey.CompareTo(maxLatitude) < 0
        && q.RowKey.CompareTo(minLongitude) > 0
        && q.RowKey.CompareTo(maxLongitude) < 0
     select q);
You could also get clever with the PartitionKey and use it to store a range of latitudes, or regions, in order to limit the number of partitions needed. SQL Azure also supports geospatial queries.
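For comparison, a radius query in SQL Azure could look roughly like this (a sketch; the Records table and its Location geography column are illustrative):
-- 5 miles is about 8046.72 meters; STDistance on geography returns meters
SELECT *
FROM Records
WHERE Location.STDistance(geography::Point(@latitude, @longitude, 4326)) <= 8046.72;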
Just doing work in this area and found that Azure Search will provide geospatial searches over Azure Table Storage. One setback is that the smallest scale beyond the developer sandbox is $200 per month - well worth the money for a commercial venture but rather high for small operations.
In order to make this work I needed to duplicate the Latitude and Longitude fields into the GeoJSON format, i.e.
{"type": "Point", "coordinates": [102.0, 0.5]}
The free developer search option allows one datasource based on a partition key. For the purposes of my testing I have a table with everything in the same partition and unique RowKeys. I indexed the RowKey and the GeoJSON value and found that it works very well for searching for all records within a radius of a given point.
While this is great, I think there are other storage solutions that will work better. DocumentDB and SQL Azure both support geospatial queries and, given the combined cost of storage and search here, the pricing of these alternatives is attractive.
