Query (with Cosmos DB) on Partition key results in multiple Partition key ranges, How is this possible? [duplicate] - azure

I'm having difficulty understanding the difference between partition keys and partition key ranges in Cosmos DB. I understand generally that a partition key in Cosmos DB is a JSON property/path within each document that is used to distribute data evenly among multiple partitions and avoid "hot partitions", and that the partition key decides the physical placement of documents.
But it's not clear to me what a partition key range is. Is it just a range of literal partition keys, from first to last, grouped by each individual partition in the collection? I know the ranges can be found by performing a GET request to the endpoint https://{databaseaccount}.documents.azure.com/dbs/{db-id}/colls/{coll-id}/pkranges but I want to be sure I understand conceptually. I'm also still not clear on how to view, at a granular level, the specific partition key that a specific document belongs to.
https://learn.microsoft.com/en-us/rest/api/cosmos-db/get-partition-key-ranges

You define a property on your documents that you want to use as the partition key.
Cosmos DB hashes the value of that property for every document in the collection and maps the resulting hashes to physical partitions. All documents with the same partition key value form one logical partition.
Over time your collection will grow, and you might end up with, for example, 100 logical partitions distributed over 5 physical partitions.
Partition key ranges are just sets of partition keys grouped by the physical partition they are mapped to. Each range is a contiguous slice of the hashed key space, so its bounds are hash values rather than literal partition key values.
So, in this example, you would get 5 pkranges, each with a min/max bound (minInclusive and maxExclusive in the REST response).
Note that pkranges can change: as your collection grows, physical partitions get split, which moves some partition keys to a new physical partition and part of the previous range to a new location.
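To make the grouping concrete, here is a minimal Python sketch of hashed keys falling into ranges and of one range splitting. It is only an illustration under stated assumptions: toy_hash (truncated MD5), the 32-bit HASH_SPACE, the "tenant-N" keys and all helper names are hypothetical, and Cosmos DB's real hash function and range encoding are different.

```python
import bisect
import hashlib

HASH_SPACE = 2 ** 32  # toy hash space (assumption, not Cosmos DB's real one)

def toy_hash(partition_key: str) -> int:
    """Stand-in hash: first 4 bytes of MD5 as an integer in [0, HASH_SPACE)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def make_ranges(n: int) -> list:
    """Split the hash space into n contiguous (min_inclusive, max_exclusive) ranges."""
    bounds = [i * HASH_SPACE // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def find_range(ranges: list, partition_key: str) -> int:
    """Return the index of the range whose bounds contain the key's hash."""
    starts = [lo for lo, _ in ranges]
    return bisect.bisect_right(starts, toy_hash(partition_key)) - 1

def split_range(ranges: list, index: int) -> list:
    """Split one range at its midpoint, mimicking a physical partition split."""
    lo, hi = ranges[index]
    mid = (lo + hi) // 2
    return ranges[:index] + [(lo, mid), (mid, hi)] + ranges[index + 1:]

# 100 logical partitions (distinct key values) spread over 5 ranges.
ranges = make_ranges(5)
keys = [f"tenant-{i}" for i in range(100)]
placement = {key: find_range(ranges, key) for key in keys}

# After a split there are 6 ranges; only keys from the split range move.
new_ranges = split_range(ranges, 2)
```

In this model a document's range is found purely from the hash of its partition key value, which is why the range bounds say nothing directly about the literal key values.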

Related

How does Cassandra Partitioning actually work?

I understand that two tables with same partition columns and values have same token generated. Does that mean that all the cells of this partition in both tables are actually in the same partition ? How does Cassandra store data internally ?
Eg:
CREATE TABLE table1 (emp_id int PRIMARY KEY, name text, role text);
CREATE TABLE table2 (emp_id int PRIMARY KEY, name text, role text);

INSERT INTO table1(emp_id, name, role) VALUES (1, 'sahil', 'MTS');
INSERT INTO table2(emp_id, name, role) VALUES (1, 'sahil', 'MTS');
SELECT token(emp_id) FROM table1 WHERE token(emp_id) = token(11596);
system.token(emp_id)
----------------------
**7447223576279188802**
SELECT token(emp_id) FROM table2 WHERE token(emp_id) = token(1);
system.token(emp_id)
----------------------
**7447223576279188802**
For your example, because both tables have the same partition key, then when identical values are inserted, they will be mapped to the same token. It is on insert that the hash function to the PK is applied to determine what replica will get the data. If you use the Murmur3 partitioner (which is used by default) then you get a consistent token value, i.e. using the same PK and PK value, the result is the same. You can reference this page for understanding:
https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeHashing.html
Rows (items of data) that have the same table and the same partition key are said to be in the same partition. The most important consequence of being in the same partition is that data in the same partition is guaranteed to be co-located: handled by the same replica nodes and, in ScyllaDB, even by the same CPU. This allows a partition to be scanned efficiently: all of the partition's data can be read from a single node, and Cassandra doesn't need to go back and forth between replicas to read the various pieces of the partition and combine them. This is also what allows a node that handles the partition's full data to keep it sorted by the clustering key: a process called compaction merges different pieces of a sorted partition (these are stored in sstables, or sorted string tables) into a bigger sorted partition.
When you have two different tables in the same keyspace, and use the same partition key in both, they are not stored physically on disk together - because each table has its own set of sstables (files on disk), so in that sense they are not "in the same partition". However, the co-location property which I mentioned earlier still holds (if the two tables are in the same keyspace): Two identically-keyed partitions in the two tables will be stored on exactly the same node. Why is this important/useful? Usually it isn't. One place where this knowledge can become useful is that it can be used in some situations to achieve atomic batch write to both tables at once, utilizing the fact that all replicas will see both writes together, whereas usually two writes to two tables go to different nodes at different times.
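The co-location point can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: toy_token uses SHA-256 as a stand-in for Murmur3, the four-node RING and its tokens are made up, and replica placement simply walks the ring clockwise (similar in spirit to SimpleStrategy, which is an assumption here).

```python
import bisect
import hashlib

def toy_token(partition_key: bytes) -> int:
    """Stand-in for Murmur3: a signed 64-bit integer taken from a SHA-256 digest."""
    digest = hashlib.sha256(partition_key).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Hypothetical ring: each node owns the range ending at its token.
RING = [
    (-6_000_000_000_000_000_000, "node-a"),
    (-2_000_000_000_000_000_000, "node-b"),
    (2_000_000_000_000_000_000, "node-c"),
    (6_000_000_000_000_000_000, "node-d"),
]
RING_TOKENS = [token for token, _ in RING]

def replicas_for(table: str, emp_id: int, rf: int = 3) -> list:
    """Return the replica nodes for a partition.

    The table name is deliberately unused: only the partition key value
    feeds the token, so the table cannot influence placement.
    """
    token = toy_token(emp_id.to_bytes(4, "big", signed=True))
    # First node whose token is >= the key's token, wrapping around the ring.
    first = bisect.bisect_left(RING_TOKENS, token) % len(RING)
    return [RING[(first + i) % len(RING)][1] for i in range(rf)]

print(replicas_for("table1", 1) == replicas_for("table2", 1))  # True
```

Because the table name never enters the hash, identically keyed partitions of two tables in the same keyspace (same replication settings) end up on the same nodes, even though each table keeps its own sstables.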

Cassandra - Composite Partition Keys and Performance

I am working on the keyspace and tables for a Cassandra environment. I understand the size limitations of Cassandra and how to choose partition keys to keep it optimized. However, I am having a disagreement with a developer regarding how to handle the keys. Is there any downside to having a key whose partitions each hold a large amount of data rather than a small amount? For example:
I have 100k records. I can create a key that partitions these into groups of 10k records; I could also create a key that partitions them into groups of 10 records (by day). So I either store 10 partitions of 10k records each, or 10,000 partitions of 10 records each.
Keep in mind that having more columns in the key requires you to specify those columns in your select statements, which sometimes isn't desired. The more partitions the better - whether by picking a better single column or having multiple columns.
Cassandra reads data via the partition key, and clustering columns can help with performance. If you have a large partition, the entire partition must be read (from memory and disk) and then merged for the output, so large partitions will definitely slow you down.
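The trade-off in the question can be checked with a quick count. The sketch below assumes a hypothetical data set shaped to match the question's numbers: 10 sources, each writing 10 records per day for 1,000 days. The field names are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (source, day, sequence number). 10 * 1000 * 10 = 100,000.
records = [
    (source, day, seq)
    for source in range(10)
    for day in range(1000)
    for seq in range(10)
]

# Option 1: partition key = source. Few, large partitions.
by_source = Counter(source for source, _, _ in records)

# Option 2: composite partition key = (source, day). Many, small partitions.
by_source_day = Counter((source, day) for source, day, _ in records)

print(len(by_source), max(by_source.values()))          # 10 10000
print(len(by_source_day), max(by_source_day.values()))  # 10000 10
```

With the composite key, every query has to supply both source and day, which is the cost mentioned above; in exchange, each read touches a partition of 10 records instead of 10,000.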

cassandra: `sstabledump` output questions

I'm inspecting the output of sstabledump to gain a better understanding of the Cassandra data model, and I have some questions.
From the output of sstabledump it seems that
a table is a list of partitions (split by partition key)
a partition is a list of rows (split according to clustering key)
a row is a map of key-value pairs, where the keys belong in a predefined list
Question 1: For each partition, as well as for each row inside a partition, there is a position key. What does this value correspond to? Physical storage details? And how exactly?
Question 2: Each row inside each partition has a type: row key-value pair. Could this type be anything else? If yes, what? If not, why have a value that is always the same?
Why is Cassandra classified as wide-column and other similar terms? It looks more like two-level row storage.
The partition token is the murmur3 hash of the partition key, which is the first part of whatever you declared as the primary key. Consistent hashing is used with that hash to determine which node in the cluster the partition, and its replicas, belongs to. Within each partition, data is sorted by clustering key, and then by cell name within the row. This structure means that redundant values, such as timestamps shared by the cells of a row inserted at once, are written only once, as a vint delta sequence relative to the partition, to save space.
On disk, the partitions are sorted in order of this hashed key. The position key in the output just refers to where in the sstable's data file the item is located (as a decompressed byte offset). The type can also be a static block, which is located at the beginning of each partition and holds any static cells, or a range tombstone marker (beginning or end). Note that sstabledump sometimes repeats values in its JSON for readability even if they are not physically written on disk (i.e. repeated timestamps).
You can have many of these rows inside a partition. A common data model for time series, for example, uses the timestamp as the clustering key, which makes very wide partitions with millions of rows. Before 3.0, the data storage was also closer to Bigtable's design. It was essentially a Map<byte[], SortedMap<byte[], Cell>> where the Comparator of the sorted map changed based on the schema. It did not differentiate rows and columns within a partition, which led to massive amounts of redundant data, so it was redesigned to fit the query language better.
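As a rough illustration of the difference, here is a simplified Python model of one partition in the pre-3.0 "sorted map of cells" shape, regrouped into rows. It is a sketch only: cell names are modelled as (clustering value, column name) tuples, and timestamps, row markers and tombstones are ignored. The function name and sample values are hypothetical.

```python
def group_into_rows(flat_partition: dict) -> dict:
    """Regroup a flat {(clustering, column): value} map into {clustering: {column: value}}."""
    rows = {}
    # Sorting the composite cell names mimics the SortedMap ordering.
    for (clustering, column), value in sorted(flat_partition.items()):
        rows.setdefault(clustering, {})[column] = value
    return rows

# Pre-3.0 style: every cell name repeats the clustering value (here a timestamp).
flat = {
    (1000, "name"): "sahil",
    (1000, "role"): "MTS",
    (2000, "name"): "sahil",
    (2000, "role"): "SMTS",
}

# 3.0 style: the clustering value appears once per row.
rows = group_into_rows(flat)
print(rows)
# {1000: {'name': 'sahil', 'role': 'MTS'}, 2000: {'name': 'sahil', 'role': 'SMTS'}}
```

In the flat form the clustering prefix is repeated for every cell, which is the kind of redundancy the row-oriented layout removes.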
Some more references:
Explanation of motivation of 3.0 change by DataStax here
Blog post by TLP has a good detailed explanation of the new disk format
CASSANDRA-8099

Is Azure table storage partition key only query is optimized?

The Table storage documentation says that tables are indexed on partition and row keys, and that there are 4 types of queries ranked by performance (Point, Row Scan, Partition Scan, Table Scan). But it is a bit unclear which category a partition-key-only query falls into. So if my query has one filter, "PartitionKey eq SomeKey", is the indexing optimized so that this will be as fast as a point query (apart from the fact that it will return many more results)? Or does the indexing not allow that, making it a partition scan or some other type?
It will be a partition scan. If the partition key is specified in the filter but the row key isn't, all the entities in the whole partition have to be scanned.
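A toy in-memory model shows why the two cases differ in cost. This is not the Azure SDK and not how the service is implemented; the nested dict, the function names and the "examined" counter are all assumptions made for illustration.

```python
# Hypothetical table: partition key -> row key -> entity.
table = {
    "SomeKey": {f"row-{i}": {"value": i} for i in range(1000)},
    "OtherKey": {f"row-{i}": {"value": i} for i in range(10)},
}

def point_query(table: dict, partition_key: str, row_key: str):
    """Both keys known: a direct lookup that examines exactly one entity."""
    entity = table[partition_key][row_key]
    return [entity], 1

def partition_query(table: dict, partition_key: str):
    """Only the partition key known: every entity in that partition is visited."""
    entities = list(table[partition_key].values())
    return entities, len(entities)

_, examined_point = point_query(table, "SomeKey", "row-42")
_, examined_partition = partition_query(table, "SomeKey")
print(examined_point, examined_partition)  # 1 1000
```

The partition-key-only query still avoids touching other partitions, so it is far cheaper than a table scan, but its cost grows with the size of the partition.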

cassandra primary key design

I am using DataStax Enterprise 4.5. Is there any performance disadvantage to defining a composite partition key rather than a single-column partition key? What if one column of the composite partition key has high cardinality but the other column has low cardinality?
A composite key is used to increase the cardinality of your partitions. For example, a key like PRIMARY KEY ((x,y)) with 5 values of x and 10 values of y will end up creating 50 different partitions. This is useful if you need to distribute your data more widely, but is unnecessary if you have a single variable with high enough cardinality.
A more realistic example might be creating a composite key of PRIMARY KEY ((Gender, ZipCode), age, userid). If you used only Gender as the partition key, you would end up with only 2 partitions to store your data! Adding ZipCode allows for up to 99999 zip codes (or zip+4 to get even more) while still allowing you to segregate your data by gender. This would be ideal for looking up demographic information by location or something like that.
Basically the rule of thumb is that you want a large number of partitions to avoid hotspots in your cluster and composite keys allow an easy way of increasing the number of partitions by combining the cardinality of your fields.
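The cardinality arithmetic for the PRIMARY KEY ((x,y)) example can be checked with a short Python sketch. The concrete values for x and y are made up; only the counts (5 and 10) come from the example.

```python
from itertools import product

x_values = range(5)    # 5 distinct values of x (hypothetical)
y_values = range(10)   # 10 distinct values of y (hypothetical)

# Single-column partition key: one partition per distinct x.
partitions_x_only = set(x_values)

# Composite partition key (x, y): one partition per distinct combination.
partitions_composite = set(product(x_values, y_values))

print(len(partitions_x_only), len(partitions_composite))  # 5 50
```

The upper bound is the product of the individual cardinalities; the actual number of partitions is the number of combinations that really occur in the data, so a low-cardinality column still helps as long as it combines with the other column.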
