I'm new to AWS. Can I query DynamoDB data on the same table where the partition key (docType) equals AA or BB or CC, etc.? I don't see "OR" or "IN" clauses in KeyConditionExpression. Is there any way to achieve this? I'm using Node.js. Thanks.
No.
DDB Query only works for a single partition. Thus the requirement for a single partition key value.
You must provide the name of the partition key attribute and a single value for that attribute. Use the KeyConditionExpression parameter to provide a specific value for the partition key.
Scan is the only read operation that works across partitions (and it can optionally read them in parallel).
Using Scan routinely is a very bad practice, as it reads the entire table, beginning to end, every time (each request is still subject to DynamoDB's 1 MB read limit, so results come back paginated).
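If the set of docType values is known up front, the usual workaround is one Query per value, run in parallel, with the results merged client-side. A minimal Node.js sketch, assuming the AWS SDK v2 DocumentClient and a hypothetical table named MyTable:

    const AWS = require("aws-sdk");
    const docClient = new AWS.DynamoDB.DocumentClient();

    // One Query per docType value, run in parallel; table and attribute names are assumptions.
    async function queryByDocTypes(docTypes) {
        const queries = docTypes.map((docType) =>
            docClient.query({
                TableName: "MyTable",                // assumed table name
                KeyConditionExpression: "#dt = :dt", // a Query matches exactly one partition key value
                ExpressionAttributeNames: { "#dt": "docType" },
                ExpressionAttributeValues: { ":dt": docType },
            }).promise()
        );
        const results = await Promise.all(queries);
        // Merge items from each Query; each call returns at most 1 MB per page,
        // so follow LastEvaluatedKey if a partition holds more data than that.
        return results.flatMap((r) => r.Items);
    }

    queryByDocTypes(["AA", "BB", "CC"]).then(console.log).catch(console.error);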
I have one table in AWS DynamoDB with 1 million records. Is it possible to query an array of primary key values in one query, with an additional sort key condition? This is for my server-side logic.
Here are the params:
var params = {
    TableName: "client_logs",
    KeyConditionExpression: "#accToken = :value AND ts between :val1 and :val2",
    ExpressionAttributeNames: {
        "#accToken": "acc_token"
    },
    ExpressionAttributeValues: {
        ":value": clientAccessToken,
        ":val1": parseInt(fromDate),
        ":val2": parseInt(toDate),
        ":status": confirmStatus
    },
    FilterExpression: "apiAction = :status"
};
Here acc_token is the partition key, and I want to query an array of acc_token values in one single query.
No, it is not possible. A single query may search only one specific hash key value. (See DynamoDB – Query.)
You can, however, execute multiple queries in parallel, which will have the effect you desire.
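For instance, a sketch only, building on the params shown in the question (accessTokens, fromDate, toDate, and confirmStatus are assumed variables), fanning out one Query per token:

    const AWS = require("aws-sdk");
    const docClient = new AWS.DynamoDB.DocumentClient();

    // Fan out one Query per acc_token and merge the results (sketch, not production code).
    async function queryTokens(accessTokens, fromDate, toDate, confirmStatus) {
        const queries = accessTokens.map((token) =>
            docClient.query({
                TableName: "client_logs",
                KeyConditionExpression: "#accToken = :value AND ts between :val1 and :val2",
                ExpressionAttributeNames: { "#accToken": "acc_token" },
                ExpressionAttributeValues: {
                    ":value": token,
                    ":val1": parseInt(fromDate),
                    ":val2": parseInt(toDate),
                    ":status": confirmStatus,
                },
                FilterExpression: "apiAction = :status",
            }).promise()
        );
        const results = await Promise.all(queries);
        return results.flatMap((r) => r.Items);
    }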
Edit (2018-11-21)
Since you said there are 200+ hash keys that you are looking for, here are two possible solutions. These solutions do not require unbounded, parallel calls to DynamoDB, but they will cost you more RCU. They may be faster or slower, depending on the distribution of data in your table.
I don't know the distribution of your data, so I can't say which one is best for you. In all cases, we can't use acc_token as the sort key of the GSI because you can't use the IN operator in a KeyConditionExpression. (See DynamoDB – Condition.)
Solution 1
This strategy is based on Global Secondary Index Write Sharding for Selective Table Queries
Steps:
Add a new attribute to items that you write to your table. This new attribute can be a number or string. Let's call it index_partition.
When you write a new item to your table, give it a random value from 0 to N for index_partition. (Here, N is some arbitrary constant of your choice. 9 is probably an okay value to start with.)
Create a GSI with hash key of index_partition and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
Now, you only need to execute one query per index_partition value (N+1 queries for values 0 through N). Use a key condition expression of index_partition = :n AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token IN (:t1, :t2, ...); the IN operator needs an explicit, comma-separated list of value placeholders, one per token. A sketch follows below.
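A rough Node.js sketch of that fan-out. The GSI name (index_partition-ts-index) is an assumption, and the IN list is built as one placeholder per token; note that the IN operator accepts only a limited number of operands, so a very long token list may need to be split across several filters:

    const AWS = require("aws-sdk");
    const docClient = new AWS.DynamoDB.DocumentClient();

    // Query every index_partition shard of the (assumed) GSI in parallel,
    // filtering on apiAction and an explicit IN list of acc_token placeholders.
    async function queryShards(tokens, fromTs, toTs, status, N = 9) {
        const tokenValues = {};
        tokens.forEach((t, i) => { tokenValues[":t" + i] = t; });
        const inList = Object.keys(tokenValues).join(", "); // ":t0, :t1, ..."

        const shardQueries = [];
        for (let shard = 0; shard <= N; shard++) {
            shardQueries.push(
                docClient.query({
                    TableName: "client_logs",
                    IndexName: "index_partition-ts-index", // assumed GSI name
                    KeyConditionExpression: "index_partition = :shard AND ts BETWEEN :val1 AND :val2",
                    FilterExpression: "apiAction = :status AND acc_token IN (" + inList + ")",
                    ExpressionAttributeValues: Object.assign(
                        { ":shard": shard, ":val1": fromTs, ":val2": toTs, ":status": status },
                        tokenValues
                    ),
                }).promise()
            );
        }
        const results = await Promise.all(shardQueries);
        return results.flatMap((r) => r.Items);
    }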
Solution 2
This solution is similar to the last, but instead of using random GSI sharding, we'll use a date based partition for the GSI.
Steps:
Add a new string attribute to items that you write to your table. Let's call it ts_ymd.
When you write a new item to your table, use just the yyyy-mm-dd part of ts to set the value of ts_ymd. (You could use any granularity you like. It depends on your typical query range for ts. If :val1 and :val2 are typically only an hour apart from each other, then a suitable GSI partition key could be yyyy-mm-dd-hh.)
Create a GSI with hash key of ts_ymd and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
Assuming you went with yyyy-mm-dd for your GSI partition key, you only need to execute one query for every day that falls between :val1 and :val2. Use a key condition expression of ts_ymd = :ymd AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token IN (:t1, :t2, ...), as in Solution 1.
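A sketch of the per-day fan-out, reusing the docClient, inList, and tokenValues helpers from the Solution 1 sketch; the GSI name ts_ymd-ts-index is an assumption:

    // Build the list of yyyy-mm-dd partition values covered by [fromTs, toTs].
    function ymdPartitions(fromTs, toTs) {
        const days = [];
        const d = new Date(fromTs);
        d.setUTCHours(0, 0, 0, 0);
        while (d.getTime() <= toTs) {
            days.push(d.toISOString().slice(0, 10)); // "yyyy-mm-dd"
            d.setUTCDate(d.getUTCDate() + 1);
        }
        return days;
    }

    // Inside an async function: one query per day partition of the assumed GSI.
    const dayQueries = ymdPartitions(fromTs, toTs).map((ymd) =>
        docClient.query({
            TableName: "client_logs",
            IndexName: "ts_ymd-ts-index", // assumed GSI name
            KeyConditionExpression: "ts_ymd = :ymd AND ts BETWEEN :val1 AND :val2",
            FilterExpression: "apiAction = :status AND acc_token IN (" + inList + ")",
            ExpressionAttributeValues: Object.assign(
                { ":ymd": ymd, ":val1": fromTs, ":val2": toTs, ":status": status },
                tokenValues
            ),
        }).promise()
    );
    const items = (await Promise.all(dayQueries)).flatMap((r) => r.Items);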
Solution 3
I don't know how many different values of apiAction there are and how those values are distributed, but if there are more than a few, and they have approximately equal distribution, you could partition a GSI based on that value. The more possible values you have for apiAction, the better this solution is for you. The limiting factor here is that you need to have enough values that you won't run into the 10GB partition limit for your GSI.
Steps:
Create a GSI with hash key of apiAction and a sort key of ts. You will need to project acc_token to the GSI.
You only need to execute one query. Use a key condition expression of apiAction = :status AND ts between :val1 and :val2 and a filter expression of acc_token IN (:t1, :t2, ...).
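And a compact sketch of that single query, again reusing the inList and tokenValues helpers from the Solution 1 sketch; apiAction-ts-index is an assumed GSI name:

    // Inside an async function: a single query against the assumed apiAction/ts GSI.
    const result = await docClient.query({
        TableName: "client_logs",
        IndexName: "apiAction-ts-index", // assumed GSI name
        KeyConditionExpression: "apiAction = :status AND ts BETWEEN :val1 AND :val2",
        FilterExpression: "acc_token IN (" + inList + ")",
        ExpressionAttributeValues: Object.assign(
            { ":status": status, ":val1": fromTs, ":val2": toTs },
            tokenValues
        ),
    }).promise();
    // Follow result.LastEvaluatedKey to page past the 1 MB per-call limit.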
For all of these solutions, you should consider how evenly the GSI partition key will be distributed, and the size of the typical range for ts in your query. You must use a filter expression on acc_token, so you should try to pick a solution that minimizes the total number of items that will match your key condition expression; at the same time, you need to be aware that you can't have more than 10GB of data for one partition key (for the table or for a GSI). You also need to remember that a GSI can only be queried with eventually consistent reads.
You can efficiently query a range of partition keys and apply an additional condition on the sort key with the help of a PartiQL SELECT statement. The official DDB documentation says:
To ensure that a SELECT statement does not result in a full table scan, the WHERE clause condition must specify a partition key. Use the equality or IN operator.
The documentation doesn't specifically mention the sort key, but it says that additional filtering on a non-key attribute still does NOT cause a full scan. So I am almost sure a condition on the sort key with one of the supported operators won't cause a table scan, will execute fast, and will consume as few capacity units as possible.
So your query may look like this:
SELECT * FROM client_logs WHERE acc_token IN ['token1', 'token2', ...] AND ts BETWEEN ts_from AND ts_to
Node.js examples of PartiQL API usage can be found here.
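For instance, a sketch using the AWS SDK v2 low-level client; the token values and the ts range are illustrative placeholders, and the square-bracket list follows the DynamoDB PartiQL syntax for IN:

    const AWS = require("aws-sdk");
    const dynamodb = new AWS.DynamoDB();

    // PartiQL ExecuteStatement: the token list is inlined, the ts range is parameterized.
    const params = {
        Statement: "SELECT * FROM client_logs WHERE acc_token IN ['token-1', 'token-2'] AND ts BETWEEN ? AND ?",
        Parameters: [
            { N: "1542648000" }, // illustrative fromDate
            { N: "1542734400" }, // illustrative toDate
        ],
    };

    dynamodb.executeStatement(params).promise()
        .then((data) => console.log(data.Items))
        .catch(console.error);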
I created a collection with "/countryId" as the partition key. When I read multiple documents with a SQL query, I specified FeedOptions { EnableCrossPartitionQuery = true } and a query like the following:
select * from collection c where (c.countryId=1 or c.countryId=2 or c.countryId=3)
I would like to know how this executes internally, i.e.:
Since I specified countryId (the partition key) in the WHERE condition, will it go only to the matching partitions to fetch the documents?
or
Will it go to every partition of the collection and check the countryId (partition key) on each document?
Thanks in advance!
The DocumentDB query will execute against only the partitions that match the filter, not all partitions:
The DocumentDB SDK/gateway will retrieve the partition key metadata for the collection and know that the partition key is countryId, as well as the physical partitions, and what ranges for partition key hashes map to which physical partitions.
During query execution, the SDK/gateway will parse the SQL query and detect that there are filters against the partition key. It will hash the values and find the matching partitions based on their owning partition key ranges. For example, countries 1, 2, and 3 may all be in one physical partition, or in three different partitions.
The query will be executed in series or parallel by the SDK/gateway based on the configured degree of parallelism. If any post-processing like ORDER BY or aggregation is required, then it will be performed by the SDK/gateway.
So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters they can specify is to get the most recent row, in which case I append "LIMIT 1" to the end of the CQL statement, since the table is ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device IDs and get back the latest entry for each. So, my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when it has many parameters: under the hood it makes requests to multiple partitions anyway, which puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify LIMIT, it applies to the whole statement, so you can't pick just the first item out of each partition. The simplest option is to issue multiple queries to the cluster (every element in IN becomes its own query) and put LIMIT 1 on every one of them.
To be honest this was my solution in a lot of projects and it works pretty much fine. With IN, the coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the results for you, might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (spreading smaller requests over multiple coordinators) than to make a single coordinator do all the work.
This is all in case you can't afford more disk space for your cluster.
Usual Cassandra solution
Data in Cassandra should be modeled for the queries you run (query-first design). So basically you would have one additional table with the same partition key as you have now, but without the clustering column activity_timestamp, i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
double (()) is intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without the activity_timestamp clustering column). Then you can run the query you need with IN, and since this table contains only the latest entry per partition key, you don't need LIMIT 1. That would be the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry, they are inexpensive and CPU bound. With Cassandra it's always "bring on the writes", I guess :)
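A rough Node.js sketch of the dual write, assuming the DataStax cassandra-driver, a keyspace named ks, a data table named data, a latest_entry table with the same columns (I've kept activity_timestamp as a regular, non-clustering column there), and a payload column standing in for whatever else you store:

    const cassandra = require("cassandra-driver");

    const client = new cassandra.Client({
        contactPoints: ["127.0.0.1"],   // assumed cluster address
        localDataCenter: "datacenter1", // assumed data center name
        keyspace: "ks",                 // assumed keyspace
    });

    // Write each event to both tables; the insert into latest_entry is an upsert,
    // since its full primary key is the partition key, so it holds the most recently written row per key.
    async function writeEvent(e) {
        const cols = "(application_id, partner_id, location_id, device_id, data_schema, activity_timestamp, payload)";
        const values = [e.applicationId, e.partnerId, e.locationId, e.deviceId, e.dataSchema, e.ts, e.payload];
        await client.batch(
            [
                { query: "INSERT INTO data " + cols + " VALUES (?, ?, ?, ?, ?, ?, ?)", params: values },
                { query: "INSERT INTO latest_entry " + cols + " VALUES (?, ?, ?, ?, ?, ?, ?)", params: values },
            ],
            { prepare: true } // prepared statements; a logged batch keeps the two writes together
        );
    }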
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. It is only supported on the last column of the partition key or on the last clustering column. So you can:
swap your two last fields of the primary key
use one query for each device id
I have an array of Job ID's.
[ '01', '02', '03', '04' ]
Currently I am looping through the array and executing a Query for each item in the array to get the job details.
Is there a way to use a single query to get all the jobs whose Partition Key is in the array ?
You can use the BatchGetItem API to get multiple items from a DynamoDB table based on their key attributes (i.e. partition key, and sort key if the table has one).
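For example, a sketch with the Node.js DocumentClient; the table name Jobs and the key attribute jobId are assumptions based on the question:

    const AWS = require("aws-sdk");
    const docClient = new AWS.DynamoDB.DocumentClient();

    const jobIds = ["01", "02", "03", "04"];

    // One BatchGetItem call for up to 100 keys.
    const params = {
        RequestItems: {
            Jobs: {                                        // assumed table name
                Keys: jobIds.map((id) => ({ jobId: id })), // assumed partition key attribute
            },
        },
    };

    docClient.batchGet(params).promise()
        .then((data) => {
            console.log(data.Responses.Jobs);
            // If data.UnprocessedKeys is non-empty, retry those keys.
        })
        .catch(console.error);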
There are a few options, each with some pros/cons:
BatchGetItem as #notionquest pointed out. This can fetch up to 100 items or 16MB of data in a single call, but you need to provide all of the key values (both partition and sort for each item, if your table uses a composite key schema).
TransactGetItems - this can retrieve up to 25 items in a single call and, as the name implies, is transactional, so the entire operation will fail if there is another pending operation on any of the items being queried. This can be good or bad depending on your use case. Similar to BatchGetItem, you need to provide all key attributes.
Query, which you are already using. Query has high performance but only supports 1 key per request (partition key required, sort key optional). Will return up to 1MB of data at a time, and supports paginated results.
If your jobs table has a composite key (partition + sort), Query is the best option in terms of performance, with no constraint on specifying the sort key values. If the table only has a partition key, then BatchGetItem is probably the best bet (assuming each job item is relatively small in size, and you expect fewer than 100 total jobs to be returned). If either of those assumptions is incorrect, multiple Query calls would be the best option.
You can use partiQL for this use case:
SELECT *
FROM <TABLE_NAME>
WHERE "Id" IN [ARRAY]
But do note that the PartiQL Statement parameter has length constraints: minimum length of 1, maximum length of 8192.
let statement = {
    Statement: "SELECT * FROM <TABLE_NAME> WHERE \"Id\" IN [ARRAY]"
};
let result = await dynamoDbClient.executeStatement(statement).promise();
My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct but I want to expand on it so you can understand a bit more why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced because partitions support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal which is great because it sounds like that's what you will be doing a lot of. If partition key has no logical meaning (ie, you won't want all the entities in a particular partition) you're best splitting out to many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure storage will return continuation tokens if we hit timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods which should help you handle this by automatically following these tokens as you iterate through the list.
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
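For reference, a small sketch with the @azure/data-tables package (connection string, table name, and key values are placeholders) showing a point query plus a full listing; the client's async iterator follows continuation tokens for you:

    const { TableClient } = require("@azure/data-tables");

    const client = TableClient.fromConnectionString(
        process.env.AZURE_STORAGE_CONNECTION_STRING, // assumed environment variable
        "items"                                      // assumed table name
    );

    async function demo() {
        // Point query: the fast path when you know both keys.
        const one = await client.getEntity("some-guid", "some-guid"); // partition key repeated as row key
        console.log(one);

        // Full scan: the iterator transparently follows continuation tokens.
        for await (const entity of client.listEntities()) {
            console.log(entity.partitionKey, entity.rowKey);
        }
    }

    demo().catch(console.error);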
If you don't need EntityGroupTransaction to update entities in batches, unique partition keys are a good option for you.
The table service auto-scale feature may not work perfectly, I think. When some data in a partition is 'hot', the table service will move it to another cluster to enhance performance. But since you have unique partition keys, probably none of your entities will be determined as 'hot', whereas if you grouped them into partitions, some partitions would become 'hot' and be moved. The problem below may also exist if you use a static partition key.
Besides, the table service may return only part of your query's results when:
More than 1000 entities in result.
Partition boundary is crossed.
From your requirements you also need a full query (return all entities). If you are using unique partition keys, this means each entity is a unique partition, so your query will only return 1 entity with a continuation token, and you would need to fire another query with that continuation token to retrieve the next entity. I don't think this is what you want.
So my suggestion is: select a reasonable partition key in any case, even if it looks useless for your business, because it helps the table service optimize your data.