I created a collection with "/countryId" as the partition key. When I read multiple documents using a SQL query, I specified FeedOptions { EnableCrossPartitionQuery = true } and a query like the following:
select * from collection c where (c.countryId=1 or c.countryId=2 or c.countryId=3)
I would like to know how this executes internally. I mean:
Since I specified countryId (the partition key) in the where condition, will it go only to the matching partitions to get the documents?
or
Will it go to every partition of the collection and check the countryId (partition key) on each document?
Thanks in advance!
The DocumentDB query will execute against only the partitions that match the filter, not all partitions:
The DocumentDB SDK/gateway retrieves the partition key metadata for the collection, so it knows that the partition key is countryId, what the physical partitions are, and which partition key hash ranges map to which physical partitions.
During query execution, the SDK/gateway parses the SQL query and detects that there are filters against the partition key. It hashes the values and finds the matching partitions based on their owning partition key ranges. For example, countries 1, 2, and 3 may all be in one physical partition, or in three different partitions.
The query will be executed in series or parallel by the SDK/gateway based on the configured degree of parallelism. If any post-processing like ORDER BY or aggregation is required, then it will be performed by the SDK/gateway.
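To make this concrete, here is a minimal sketch using the newer @azure/cosmos Node.js SDK (the question uses the older .NET SDK's FeedOptions; the connection string, database, and container names here are assumptions):

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING);
const container = client.database("mydb").container("mycollection");

// The SDK inspects the WHERE clause, hashes 1, 2, and 3, and routes the
// query only to the physical partitions owning those hash ranges.
const { resources } = await container.items
    .query("SELECT * FROM c WHERE c.countryId IN (1, 2, 3)", {
        maxDegreeOfParallelism: 4, // how many partitions to query at once
    })
    .fetchAll();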
I'm new to AWS. Can I query DynamoDB data on the same table where the partition key (docType) equals AA or BB or CC, etc.? I don't see "OR" or "IN" clauses in KeyConditionExpression. Is there any way to achieve this? I'm using node.js. Thanks.
No.
DDB Query only works for a single partition. Thus the requirement for a single partition key value.
You must provide the name of the partition key attribute and a single value for that attribute.
Use the KeyConditionExpression parameter to provide a specific value for the partition key.
Scan is the only read operation that will work across partitions (optionally scanning segments in parallel).
Using Scan routinely is a very bad practice, as it reads the entire table, beginning to end, every time (though each request is still subject to DDB's 1 MB read limit).
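For the docType = AA or BB or CC case, the usual workaround is one Query per key value, fired in parallel and merged client-side. A hedged sketch with the AWS SDK v3 for Node.js (the table name "MyTable" is an assumption):

const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, QueryCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const docTypes = ["AA", "BB", "CC"];

// One Query per partition key value, run concurrently; each Query
// touches only its own partition, unlike Scan.
const pages = await Promise.all(docTypes.map((dt) =>
    ddb.send(new QueryCommand({
        TableName: "MyTable",
        KeyConditionExpression: "docType = :dt",
        ExpressionAttributeValues: { ":dt": dt },
    }))
));
const items = pages.flatMap((p) => p.Items);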
If I store documents without providing a partition key value, will the documentId be treated as the partition key of the logical partition?
If yes: what about a billion logical partitions in that collection? My queries only look up by documentId.
Now, inside the document JSON:
I have multiple fields, and I have provided /asset as the partitionKey. Is this a composite partition key now: /asset/documentId?
Or does /asset tell it which partition to search for the documentId in?
If I store documents without providing partition key, in this case documentId will be treated as Partition Key of Logical Partition?
No. If you create a document without a partition key value, the document id will not be treated as the partition key. The Cosmos DB engine puts all documents without a partition key value in a hidden logical partition. This particular partition can be accessed by specifying the partition key as {}.
You define the partition key when you create the collection (according to the screenshot, asset is the partition key in your case). If you don't provide a partition key when you create a collection, it will be limited to 10 GB of data (because it wouldn't be able to shard it without a partition key).
Only the partition key is used to determine the partition of the document. Other fields are irrelevant when deciding which partition a document belongs to.
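As a hedged sketch with the @azure/cosmos Node.js SDK (the connection string, database, container, and document id are assumptions), reading back a document that was stored without a partition key value looks roughly like this:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING);
const container = client.database("mydb").container("mycollection");

// Documents stored without a partition key value all live in one hidden
// logical partition, addressed by the literal {} rather than by their id.
const { resource: doc } = await container.item("some-document-id", {}).read();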
I am trying to store the following structure in Cassandra:
ShopID, UserID, FirstName, LastName, etc.
Most of the queries on it are:
select * from table where ShopID = ? , UserID = ?
That's why it is useful to set (ShopID, UserID) as the primary key.
According to the docs, Cassandra's default partitioning key is the first column of the primary key, in my case ShopID. But I want to distribute the data uniformly across the Cassandra cluster; I cannot allow all the data for one ShopID to be stored in only one partition, because some shops have 10M records and some only 1k.
I can set up (ShopID, UserID) as the partitioning key and then achieve a uniform distribution of records across the Cassandra cluster. But after that I cannot retrieve all users that belong to some ShopID:
select *
from table
where ShopID = ?
It's obvious that this query demands a full scan of the whole cluster, but I have no way to do that. And it looks like a very hard constraint.
My question is how to reorganize the data to solve both problems at the same time: uniform data partitioning and the possibility to run such full queries.
In general, you need to make the user id a clustering column and add some artificial information to the table's partition key when saving. This breaks a large natural partition into multiple synthetic ones. The flip side is that on reads you need to query all the synthetic partitions and combine them back into the natural partition. So the goal is to find a reasonable trade-off between the number (and size) of synthetic partitions and the read queries needed to combine them all, as in the sketch below.
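A hedged CQL sketch of this bucketing approach (the table and column names are illustrative, and 16 buckets is an arbitrary choice):

-- "bucket" is the artificial column: computed at write time, e.g. as
-- hash(user_id) % 16, it splits one big shop across 16 synthetic partitions.
CREATE TABLE users_by_shop (
    shop_id    int,
    bucket     int,
    user_id    int,
    first_name text,
    last_name  text,
    PRIMARY KEY ((shop_id, bucket), user_id)
);

-- Point lookup: recompute the bucket from user_id and hit one partition.
SELECT * FROM users_by_shop WHERE shop_id = ? AND bucket = ? AND user_id = ?;

-- All users of a shop: query every bucket (serially or in parallel)
-- and concatenate the results client-side.
SELECT * FROM users_by_shop WHERE shop_id = ? AND bucket IN (0, 1, 2, 3); -- list all 16 in practice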
A comprehensive description of possible implementations can be found here and here (Example 2: User Groups).
Also take a look at the solution (Example 3: User Groups by Join Date) where querying/ordering/grouping is performed on a clustering column of date type. It can be useful if you also have similar queries.
Each node in Cassandra is responsible for some token ranges. Cassandra derives a token from the row's partition key using hashing and sends the record to the node whose token range includes this token. Different records can have the same token, and they are grouped into partitions. For simplicity we can assume that each Cassandra node stores the same number of partitions. We also want partitions to be roughly equal in size, for uniform distribution between nodes. If we have one huge partition, one of our nodes needs more resources to process it. But if we break it into multiple smaller ones, we increase the chance that they will be evenly distributed across all nodes.
However, the distribution of token ranges between nodes is not related to the distribution of records between partitions. When we add a new node, it just assumes responsibility for an even portion of the token ranges from the other nodes and, as a result, an even share of the partitions. If we had 2 nodes with 3 GB of data each, after adding a third node each node stores 2 GB. That's why scalability isn't affected by this partitioning scheme, and you don't need to rewrite your historical data after adding a new node.
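You can see the token derivation directly in CQL; a hedged example against the sketch table above (token() is standard CQL, the table name is the illustrative one from before):

-- Shows the token Cassandra derives from the full partition key; rows
-- whose (shop_id, bucket) hash into the same range land on the same node.
SELECT shop_id, bucket, user_id, token(shop_id, bucket)
FROM users_by_shop
WHERE shop_id = ? AND bucket = ?;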
I have an array of job IDs:
[ '01', '02', '03', '04' ]
Currently I am looping through the array and executing a Query for each item in the array to get the job details.
Is there a way to use a single query to get all the jobs whose partition key is in the array?
You can use the BatchGetItem API to get multiple items from a DynamoDB table based on their key attributes (i.e., the partition key, plus the sort key if one is defined).
There are a few options, each with some pros/cons:
BatchGetItem, as @notionquest pointed out. This can fetch up to 100 items or 16 MB of data in a single call, but you need to provide all of the key values (both partition and sort key for each item, if your table uses a composite key schema).
TransactGetItems - this can retrieve up to 25 items in a single call and, as the name implies, is transactional, so the entire operation will fail if there is another pending operation on any of the items being queried. This can be good or bad depending on your use case. Similar to BatchGetItem, you need to provide all key attributes.
Query, which you are already using. Query has high performance but only supports one key per request (partition key required, sort key optional). It will return up to 1 MB of data at a time and supports paginated results.
If your jobs table has a composite key (partition + sort), Query is the best option in terms of performance, and it doesn't force you to specify the sort key values. If the table only has a partition key, then BatchGetItem is probably the best bet (assuming each job item is relatively small and you expect fewer than 100 jobs to be returned). If either of those assumptions is incorrect, multiple Query calls would be the best option.
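A hedged sketch of the BatchGetItem route with the AWS SDK v3 for Node.js (the table name "Jobs" and key attribute "JobId" are assumptions; a production version should also retry UnprocessedKeys):

const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, BatchGetCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const jobIds = ["01", "02", "03", "04"];

// One round trip for all four jobs; up to 100 keys / 16 MB per call.
const { Responses } = await ddb.send(new BatchGetCommand({
    RequestItems: {
        Jobs: { Keys: jobIds.map((id) => ({ JobId: id })) },
    },
}));
const jobs = Responses.Jobs;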
You can use PartiQL for this use case:
SELECT *
FROM <TABLE_NAME>
WHERE "Id" IN [ARRAY]
But do note that a PartiQL statement has length constraints: a minimum length of 1 and a maximum length of 8192 characters.
let statement = {
    "Statement": "SELECT * \nFROM <TABLE_NAME> \nWHERE \"Id\" IN [ARRAY]"
};
let result = await dynamoDbClient.executeStatement(statement).promise();
My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct, but I want to expand on it so you can understand a bit more about why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced across servers; a partition stays on one server precisely so it can support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal, which is great because it sounds like that's what you will be doing a lot of. If the partition key has no logical meaning (i.e., you won't ever want all the entities in a particular partition), you're best splitting out into many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure Storage returns continuation tokens if the query hits a timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods that handle this by automatically following the tokens as you iterate through the list (see the sketch after the reading list).
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
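A hedged sketch with the @azure/data-tables Node.js SDK (the connection string, table name, and entity ids are assumptions); the for-await loop is one of those convenience methods that follows continuation tokens for you:

const { TableClient } = require("@azure/data-tables");

const client = TableClient.fromConnectionString(
    process.env.STORAGE_CONNECTION_STRING, "items");

// Point query: with a unique partition key per item, the row key can
// simply repeat the partition key (null row keys are not allowed).
const entity = await client.getEntity("item-guid-123", "item-guid-123");

// Full scan: the async iterator transparently follows continuation
// tokens (timeout, 1000 entities, or server-boundary splits).
for await (const e of client.listEntities()) {
    console.log(e.partitionKey, e.rowKey);
}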
If you don't need EntityGroupTransaction to update entities in batch, unique partition keys are a good option for you.
I think the table service's auto-scale feature may not work perfectly, though. When some data in a partition is 'hot', the table service moves it to another cluster to enhance performance. But since you have unique partition keys, probably none of your entities will ever be determined to be 'hot', whereas if you grouped them into partitions, some partitions would become 'hot' and get moved. The problem below may also exist if you use a static partition key.
Besides, the table service may return partial results for your query when:
More than 1000 entities are in the result.
A partition boundary is crossed.
From your requirements you also need the full query (return all entities). If you are using unique partition keys, this means each entity is its own partition, so your query may return just 1 entity with a continuation token, and you need to fire another query with that continuation token to retrieve the next entity. I don't think this is what you want.
So my suggestion is: select a reasonable partition key in any case, even if it looks useless for your business, because it helps the table service optimize your data.
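For reference, a hedged sketch of the EntityGroupTransaction this trades away, using the @azure/data-tables Node.js SDK (the table name and entity shapes are assumptions); all actions in one transaction must share a partition key, which is exactly what unique-per-item keys give up:

const { TableClient } = require("@azure/data-tables");

const client = TableClient.fromConnectionString(
    process.env.STORAGE_CONNECTION_STRING, "items");

// Up to 100 actions, all against the same partition, commit atomically.
await client.submitTransaction([
    ["create", { partitionKey: "shop-1", rowKey: "user-1", name: "Alice" }],
    ["update", { partitionKey: "shop-1", rowKey: "user-2", name: "Bob" }],
]);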