I have an array of job IDs.
[ '01', '02', '03', '04' ]
Currently I am looping through the array and executing a Query for each item to get the job details.
Is there a way to use a single query to get all the jobs whose partition key is in the array?
You can use the BatchGetItem API to get multiple items from a DynamoDB table based on their key attributes (i.e. partition key, and sort key if the table has one).
There are a few options, each with some pros/cons:
BatchGetItem as #notionquest pointed out. This can fetch up to 100 items or 16MB of data in a single call, but you need to provide all of the key values (both partition and sort for each item, if your table uses a composite key schema).
TransactGetItems - this can retrieve up to 25 items in a single call and, as the name implies, is transactional, so the entire operation will fail if there is another pending operation on any of the items being queried. This can be good or bad depending on your use case. Similar to BatchGetItem, you need to provide all key attributes.
Query, which you are already using. Query has high performance but only supports 1 key per request (partition key required, sort key optional). Will return up to 1MB of data at a time, and supports paginated results.
If your jobs table has a composite key (partition + sort), Query is the best option in terms of performance, and it has no constraint of requiring you to specify the sort key values. If the table only has a partition key, then BatchGetItem is probably the best bet (assuming each job item is relatively small in size, and you expect fewer than 100 total jobs to be returned). If either of those assumptions is incorrect, multiple Query calls would be the best option.
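As a minimal Node.js sketch of the BatchGetItem option (the table name `jobs` and key attribute `jobId` are assumptions, not taken from the question's schema):

```javascript
// Build a BatchGetItem request for a list of job IDs.
// Assumes a table named "jobs" with a simple partition key "jobId".
function buildBatchGetParams(jobIds) {
  return {
    RequestItems: {
      jobs: {
        Keys: jobIds.map((id) => ({ jobId: id })),
      },
    },
  };
}

const params = buildBatchGetParams(['01', '02', '03', '04']);
// With the AWS SDK v2 DocumentClient, this would then be executed as:
// const result = await documentClient.batchGet(params).promise();
```

Remember the 100-item limit: for larger arrays you would chunk the IDs and issue one BatchGetItem per chunk.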
You can use PartiQL for this use case:
SELECT *
FROM <TABLE_NAME>
WHERE "Id" IN [ARRAY]
But do note that PartiQL statements have a length constraint: minimum length of 1 character, maximum length of 8,192 characters.
let statement = {
    Statement: `SELECT * FROM <TABLE_NAME> WHERE "Id" IN [ARRAY]`
};
let result = await dynamoDbClient.executeStatement(statement).promise();
I'm new to AWS. Can I query DynamoDB data on the same table where the partition key (docType) equals AA or BB or CC, etc.? I don't see "OR" or "IN" clauses in KeyConditionExpression. Is there any way to achieve this? I'm using Node.js. Thanks.
No.
A DDB Query only works within a single partition, hence the requirement for a single partition key value.
You must provide the name of the partition key attribute and a single value for that attribute.
Use the KeyConditionExpression parameter to provide a specific value for the partition key.
Scan is the only read operation that works across partitions (optionally in parallel).
Using Scan routinely is a very bad practice, as it reads the entire table, beginning to end, every time (though each request remains subject to DDB's 1MB read limit).
I have one table in AWS DynamoDB with 1 million records. Is it possible to query an array of partition key values in one query, with an additional sort key condition, in DynamoDB? I am using Node.js for my server-side logic.
Here is the params
var params = {
    TableName: "client_logs",
    KeyConditionExpression: "#accToken = :value AND ts between :val1 and :val2",
    ExpressionAttributeNames: {
        "#accToken": "acc_token"
    },
    ExpressionAttributeValues: {
        ":value": clientAccessToken,
        ":val1": parseInt(fromDate),
        ":val2": parseInt(toDate),
        ":status": confirmStatus
    },
    FilterExpression: "apiAction = :status"
};
Here acc_token is the partition key, and I want to query an array of acc_token values in one single query.
No, it is not possible. A single query may search only one specific hash key value. (See DynamoDB – Query.)
You can, however, execute multiple queries in parallel, which will have the effect you desire.
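As a sketch of that parallel approach (the table and attribute names are taken from the question; the rest is illustrative), you can build one Query params object per token and then execute them concurrently:

```javascript
// Build one Query params object per access token. Each one can then be
// executed in parallel, e.g. via Promise.all and documentClient.query().
function buildTokenQueries(tokens, fromDate, toDate) {
  return tokens.map((token) => ({
    TableName: 'client_logs',
    KeyConditionExpression: '#accToken = :value AND ts BETWEEN :val1 AND :val2',
    ExpressionAttributeNames: { '#accToken': 'acc_token' },
    ExpressionAttributeValues: {
      ':value': token,
      ':val1': fromDate,
      ':val2': toDate,
    },
  }));
}

// Executing them in parallel would look roughly like:
// const results = await Promise.all(
//   buildTokenQueries(tokens, from, to).map((p) => documentClient.query(p).promise())
// );
```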
Edit (2018-11-21)
Since you said there are 200+ hash keys that you are looking for, here are a few possible solutions. These solutions do not require unbounded, parallel calls to DynamoDB, but they will cost you more RCUs. They may be faster or slower, depending on the distribution of data in your table.
I don't know the distribution of your data, so I can't say which one is best for you. In all cases, we can't use acc_token as the sort key of the GSI because you can't use the IN operator in a KeyConditionExpression. (See DynamoDB – Condition.)
Solution 1
This strategy is based on Global Secondary Index Write Sharding for Selective Table Queries
Steps:
Add a new attribute to items that you write to your table. This new attribute can be a number or string. Let's call it index_partition.
When you write a new item to your table, give it a random value from 0 to N for index_partition. (Here, N is some arbitrary constant of your choice. 9 is probably an okay value to start with.)
Create a GSI with hash key of index_partition and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
Now, you only need to execute N queries. Use a key condition expression of index_partition = :n AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token in :acc_token_list
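A rough Node.js sketch of Solution 1, assuming N = 10 shards and a GSI named index_partition-ts-index (both are illustrative choices, not prescribed values):

```javascript
const N = 10; // number of write shards, 0..9 (arbitrary choice)

// On write: attach a random shard number to each item.
function withIndexPartition(item) {
  return { ...item, index_partition: Math.floor(Math.random() * N) };
}

// On read: one Query per shard against the GSI. DynamoDB's IN operator
// needs one placeholder per value, so we expand the token list.
function buildShardQueries(fromTs, toTs, status, tokenList) {
  const placeholders = tokenList.map((_, i) => `:t${i}`);
  const tokenValues = Object.fromEntries(tokenList.map((t, i) => [`:t${i}`, t]));
  return Array.from({ length: N }, (_, n) => ({
    TableName: 'client_logs',
    IndexName: 'index_partition-ts-index',
    KeyConditionExpression: 'index_partition = :n AND ts BETWEEN :val1 AND :val2',
    FilterExpression: `apiAction = :status AND acc_token IN (${placeholders.join(', ')})`,
    ExpressionAttributeValues: {
      ':n': n,
      ':val1': fromTs,
      ':val2': toTs,
      ':status': status,
      ...tokenValues,
    },
  }));
}
```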
Solution 2
This solution is similar to the last, but instead of using random GSI sharding, we'll use a date based partition for the GSI.
Steps:
Add a new string attribute to items that you write to your table. Let's call it ts_ymd.
When you write a new item to your table, use just the yyyy-mm-dd part of ts to set the value of ts_ymd. (You could use any granularity you like. It depends on your typical query range for ts. If :val1 and :val2 are typically only an hour apart from each other, then a suitable GSI partition key could be yyyy-mm-dd-hh.)
Create a GSI with hash key of ts_ymd and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
Assuming you went with yyyy-mm-dd for your GSI partition key, you only need to execute one query for every day that is within :val1 and :val2. Use a key condition expression of ts_ymd = :ymd AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token in :acc_token_list
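The day-enumeration part of Solution 2 could be sketched like this (assuming ts is stored as epoch milliseconds; each resulting day string then becomes the :ymd value of one GSI query):

```javascript
// Derive the "yyyy-mm-dd" GSI partition key value from a timestamp.
function toYmd(ts) {
  return new Date(ts).toISOString().slice(0, 10);
}

// Enumerate every day touched by the [fromTs, toTs] range; each day
// gets one Query with ts_ymd = :ymd AND ts BETWEEN :val1 AND :val2.
function daysInRange(fromTs, toTs) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const days = [];
  for (let t = fromTs; toYmd(t) <= toYmd(toTs); t += DAY_MS) {
    days.push(toYmd(t));
  }
  return days;
}
```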
Solution 3
I don't know how many different values of apiAction there are and how those values are distributed, but if there are more than a few, and they have approximately equal distribution, you could partition a GSI based on that value. The more possible values you have for apiAction, the better this solution is for you. The limiting factor here is that you need to have enough values that you won't run into the 10GB partition limit for your GSI.
Steps:
Create a GSI with hash key of apiAction and a sort key of ts. You will need to project acc_token to the GSI.
You only need to execute one query. Use a key condition expression of apiAction = :status AND ts between :val1 and :val2 and a filter expression of acc_token in :acc_token_list.
For all of these solutions, you should consider how evenly the GSI partition key will be distributed, and the size of the typical range for ts in your query. You must use a filter expression on acc_token, so you should try to pick a solution that minimizes the total number of items that will match your key condition expression. At the same time, you need to be aware that you can't have more than 10GB of data for one partition key (for the table or for a GSI). You also need to remember that a GSI can only be queried as an eventually consistent read.
You can efficiently do both - query a range of partition keys and apply an additional condition on the sort key - with the help of a PartiQL SELECT query. The official DDB documentation says:
To ensure that a SELECT statement does not result in a full table
scan, the WHERE clause condition must specify a partition key. Use the
equality or IN operator.
The documentation doesn't specifically mention the sort key, but it does say that additional filtering on a non-key attribute still does NOT cause a full scan. So I am almost sure a condition on the sort key, using one of the supported operators, won't cause a table scan either; it should execute fast and consume as few capacity units as possible.
So your query may look like this:
SELECT * FROM client_logs WHERE acc_token IN (tok1, tok2, ...) AND ts BETWEEN ts1 AND ts2
Node.js examples of PartiQL API usage can be found here.
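As an illustration, a naive helper for composing such a statement (the string quoting here is for sketch purposes only; real code should prefer PartiQL's ? parameter placeholders with a Parameters list to avoid injection issues):

```javascript
// Build a PartiQL SELECT over a list of tokens plus a sort-key range.
// NOTE: naive quoting, for illustration only.
function buildPartiqlStatement(tokens, fromTs, toTs) {
  const list = tokens.map((t) => `'${t}'`).join(', ');
  return `SELECT * FROM client_logs WHERE acc_token IN (${list}) AND ts BETWEEN ${fromTs} AND ${toTs}`;
}
```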
I have two questions about query results in Cassandra.
When I make a "full" select of a table in Cassandra (i.e. select * from table), is it guaranteed that the results will be returned in increasing order of partition tokens?
For instance, having the following table:
create table users(id int, name text, primary key(id));
Is it guaranteed that the following query will return the results with increasing values in the token column?
select token(id), id from users;
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
If the answer to the above question is 'yes', is it still guaranteed if we use a secondary index? For instance, if we had the following index:
create index on users(name);
and we query the table by using the index:
select token(id), id from users where name = 'xyz';
is there any guarantee regarding the order of results?
The motivation for the above questions is to determine whether the token is the right thing to use in order to implement paging and/or resuming of broken longer "data exports".
EDIT: There are multiple resources on the net that state that the order matches the token order (eg. in description of partitioner results or this Datastax page):
Without a partition key specified in the WHERE clause, the actual order of the result set then becomes dependent on the hashed values of userid.
However the order of results is not specified in official Cassandra documentation, eg. of SELECT statement.
Is it guaranteed that the following query will return the results with increasing values in the token column?
Yes, it is.
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
The data distribution is orthogonal to the ordering of the retrieved data; there is no relationship between the two.
If the answer to the above question is 'yes', is it still valid if we use secondary index?
Yes, even if you query data using a secondary index (be it SASI or the native implementation), the returned results will always be sorted in token order. Why? The technical explanation is given in my blog post here: http://www.doanduyhai.com/blog/?p=13191#cluster_read_path
That's the main reason why SASI is not a good fit if you want the search to return data ordered by some column values. Only a real search engine integration (like Datastax Enterprise Search) can give you the correct ordering, because it bypasses the cluster read path layer.
I created one collection with the partition key "/countryId". When I read multiple documents using a SQL query, I specified FeedOptions { EnableCrossPartitionQuery = true } and a query like the following:
select * from collection c where (c.countryId=1 or c.countryId=2 or c.countryId=3)
I would like to know how this executes internally. Since I specified countryId (the partition key) in the WHERE condition:
Will it go only to the particular partitions that match, to get the documents?
or
Will it go to all partitions of the collection and check the countryId (partition key) on each document?
Thanks in advance!
The DocumentDB query will execute against only the partitions that match the filter, not all partitions:
The DocumentDB SDK/gateway will retrieve the partition key metadata for the collection and know that the partition key is countryId, as well as the physical partitions, and what ranges for partition key hashes map to which physical partitions.
During query execution, the SDK/gateway will parse the SQL query and detect that there are filters against the partition key. It will hash the values and find the matching partitions based on their owning partition key ranges. For example, countries 1, 2, and 3 may all be in one physical partition, or in three different partitions.
The query will be executed in series or parallel by the SDK/gateway based on the configured degree of parallelism. If any post-processing like ORDER BY or aggregation is required, then it will be performed by the SDK/gateway.
So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters they can specify is to get only the most recent row, in which case I append "LIMIT 1" to the end of the CQL statement, since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device IDs to get back the latest entries for. So, my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device ID? Or will I simply have to execute a separate CQL statement for each device to get its latest row?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of values in it: under the hood the coordinator is making requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify a limit, it applies to the whole statement, so you can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in IN becomes one query) and put a LIMIT 1 on every one of them.
To be honest, this was my solution in a lot of projects, and it works pretty much fine. The coordinator would go to multiple nodes under the hood anyway, but with IN it would also have to do more work to gather all the results for you, and it might run into timeouts, etc.
In short, it's far better for the cluster, and more performant, if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
This is all in case you can't afford more disk space for your cluster.
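The multiple-queries option might be sketched like this in Node.js (the execute callback stands in for a driver call such as cassandra-driver's client.execute(query, params, { prepare: true }); the table and column names mirror the question's schema):

```javascript
// One SELECT ... LIMIT 1 per device, issued in parallel; each query
// targets exactly one partition.
const LATEST_BY_DEVICE_CQL =
  'SELECT * FROM data WHERE application_id = ? AND partner_id = ? ' +
  'AND location_id = ? AND device_id = ? AND data_schema = ? LIMIT 1';

async function latestPerDevice(base, deviceIds, execute) {
  return Promise.all(
    deviceIds.map((deviceId) =>
      execute(LATEST_BY_DEVICE_CQL, [
        base.applicationId,
        base.partnerId,
        base.locationId,
        deviceId,
        base.dataSchema,
      ])
    )
  );
}
```

Since each statement hits a single partition, every coordinator does only a small amount of work, which is the behaviour recommended above.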
Usual Cassandra solution
Data in Cassandra should be modeled to be ready for the query (query-first design). So basically you would have one additional table with the same partition key you have now, but without the clustering column activity_timestamp, i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
double (()) is intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can run the query you need with IN, and since this table contains only the latest entry per partition key, you don't have to use LIMIT 1. That would be the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry: they are inexpensive and CPU bound. With Cassandra it's always "bring on the writes", I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such a use of the IN clause. Indeed, IN is supported only on the last field of the primary key or the last field of the clustering key. So you can:
swap the last two fields of the primary key
use one query for each device id