how to query array of primary key values in dynamodb - node.js

I have one table in AWS DynamoDB with 1 million records. Is it possible to query an array of primary key values in one query, with an additional sort key condition, in DynamoDB? I am using Node.js for my server-side logic.
Here are the params:
var params = {
    TableName: "client_logs",
    KeyConditionExpression: "#accToken = :value AND ts between :val1 and :val2",
    ExpressionAttributeNames: {
        "#accToken": "acc_token"
    },
    ExpressionAttributeValues: {
        ":value": clientAccessToken,
        ":val1": parseInt(fromDate),
        ":val2": parseInt(toDate),
        ":status": confirmStatus
    },
    FilterExpression: "apiAction = :status"
};
Here acc_token is the partition (hash) key, and I want to query an array of acc_token values in one single query.

No, it is not possible. A single query may search only one specific hash key value. (See DynamoDB – Query.)
You can, however, execute multiple queries in parallel, which will have the effect you desire.
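For example, here is a minimal sketch of the parallel approach with the AWS SDK v2 DocumentClient, using the table and attribute names from your params (pagination via LastEvaluatedKey is omitted for brevity):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// One Query per access token, all issued concurrently.
async function queryAllTokens(accessTokens, fromDate, toDate, confirmStatus) {
    const queries = accessTokens.map((token) =>
        docClient
            .query({
                TableName: "client_logs",
                KeyConditionExpression: "#accToken = :value AND ts between :val1 and :val2",
                ExpressionAttributeNames: { "#accToken": "acc_token" },
                ExpressionAttributeValues: {
                    ":value": token,
                    ":val1": parseInt(fromDate),
                    ":val2": parseInt(toDate),
                    ":status": confirmStatus,
                },
                FilterExpression: "apiAction = :status",
            })
            .promise()
    );
    // Merge the per-token results into one array of items.
    return (await Promise.all(queries)).flatMap((r) => r.Items);
}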
Edit (2018-11-21)
Since you said there are 200+ hash keys that you are looking for, here are two possible solutions. These solutions do not require unbounded, parallel calls to DynamoDB, but they will cost you more RCU. They may be faster or slower, depending on the distribution of data in your table.
I don't know the distribution of your data, so I can't say which one is best for you. In all cases, we can't use acc_token as the sort key of the GSI because you can't use the IN operator in a KeyConditionExpression. (See DynamoDB – Condition.)
Solution 1
This strategy is based on Global Secondary Index Write Sharding for Selective Table Queries.
Steps:
Add a new attribute to items that you write to your table. This new attribute can be a number or string. Let's call it index_partition.
When you write a new item to your table, give it a random value from 0 to N for index_partition. (Here, N is some arbitrary constant of your choice. 9 is probably an okay value to start with.)
Create a GSI with hash key of index_partition and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
Now, you only need to execute N queries. Use a key condition expression of index_partition = :n AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token IN (...), with one placeholder per token in your list (a sketch follows).
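A minimal Node.js sketch of those N queries (AWS SDK v2; the GSI name is hypothetical, N = 10 shard values, pagination omitted). Note that the IN operator in an expression accepts a limited number of values (the condition expression reference says up to 100), so a 200+ token list would have to be split across multiple filter expressions:

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

async function queryByShards(accTokenList, fromDate, toDate, confirmStatus) {
    // Build ":t0", ":t1", ... placeholders for the acc_token IN (...) filter.
    const tokenValues = {};
    const tokenPlaceholders = accTokenList.map((token, i) => {
        tokenValues[`:t${i}`] = token;
        return `:t${i}`;
    });

    const shards = [...Array(10).keys()]; // index_partition values 0..9
    const results = await Promise.all(
        shards.map((n) =>
            docClient
                .query({
                    TableName: "client_logs",
                    IndexName: "index_partition-ts-index", // hypothetical GSI name
                    KeyConditionExpression: "index_partition = :n AND ts between :val1 and :val2",
                    FilterExpression: `apiAction = :status AND acc_token IN (${tokenPlaceholders.join(", ")})`,
                    ExpressionAttributeValues: {
                        ":n": n,
                        ":val1": parseInt(fromDate),
                        ":val2": parseInt(toDate),
                        ":status": confirmStatus,
                        ...tokenValues,
                    },
                })
                .promise()
        )
    );
    return results.flatMap((r) => r.Items);
}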
Solution 2
This solution is similar to the last, but instead of using random GSI sharding, we'll use a date based partition for the GSI.
Steps:
Add a new string attribute to items that you write to your table. Let's call it ts_ymd.
When you write a new item to your table, use just the yyyy-mm-dd part of ts to set the value of ts_ymd. (You could use any granularity you like. It depends on your typical query range for ts. If :val1 and :val2 are typically only an hour apart from each other, then a suitable GSI partition key could be yyyy-mm-dd-hh.)
Create a GSI with hash key of ts_ymd and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
Assuming you went with yyyy-mm-dd for your GSI partition key, you only need to execute one query for every day that falls within :val1 and :val2. Use a key condition expression of ts_ymd = :ymd AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token IN (...), again one placeholder per token (a sketch follows).
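A compact sketch of the per-day fan-out, reusing docClient and the token placeholders from the Solution 1 sketch; the GSI name "ts_ymd-ts-index" is hypothetical, and ts is assumed to be a millisecond epoch:

// Enumerate the yyyy-mm-dd partitions covered by the query range.
function daysBetween(fromMs, toMs) {
    const days = [];
    const d = new Date(Number(fromMs));
    d.setUTCHours(0, 0, 0, 0);
    while (d.getTime() <= Number(toMs)) {
        days.push(d.toISOString().slice(0, 10)); // "yyyy-mm-dd"
        d.setUTCDate(d.getUTCDate() + 1);
    }
    return days;
}

// Inside an async function: one GSI query per day in the range.
const results = await Promise.all(
    daysBetween(fromDate, toDate).map((ymd) =>
        docClient
            .query({
                TableName: "client_logs",
                IndexName: "ts_ymd-ts-index",
                KeyConditionExpression: "ts_ymd = :ymd AND ts between :val1 and :val2",
                FilterExpression: `apiAction = :status AND acc_token IN (${tokenPlaceholders.join(", ")})`,
                ExpressionAttributeValues: {
                    ":ymd": ymd,
                    ":val1": parseInt(fromDate),
                    ":val2": parseInt(toDate),
                    ":status": confirmStatus,
                    ...tokenValues,
                },
            })
            .promise()
    )
);
const items = results.flatMap((r) => r.Items);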
Solution 3
I don't know how many different values of apiAction there are and how those values are distributed, but if there are more than a few, and they have approximately equal distribution, you could partition a GSI based on that value. The more possible values you have for apiAction, the better this solution is for you. The limiting factor here is that you need to have enough values that you won't run into the 10GB partition limit for your GSI.
Steps:
Create a GSI with hash key of apiAction and a sort key of ts. You will need to project acc_token to the GSI.
You only need to execute one query. Use a key condition expression of apiAction = :status AND ts between :val1 and :val2 and a filter expression of acc_token IN (...), again with one placeholder per token.
For all of these solutions, you should consider how evenly the GSI partition key will be distributed, and the size of the typical range for ts in your query. You must use a filter expression on acc_token, so you should try to pick the solution that minimizes the total number of items that will match your key condition expression. At the same time, be aware that you can't have more than 10GB of data for one partition key (for the table or for a GSI), and remember that a GSI can only be queried as an eventually consistent read.

You can efficiently do both, querying a range of partition keys and applying an additional condition on the sort key, with the help of a PartiQL SELECT statement. The official DDB documentation says:
To ensure that a SELECT statement does not result in a full table
scan, the WHERE clause condition must specify a partition key. Use the
equality or IN operator.
The documentation doesn't specifically mention the sort key, but it says that additional filtering on a non-key attribute still does NOT cause a full scan. So I am almost sure that a condition on the sort key with one of the supported operators won't cause a table scan either; it should execute fast and consume as few capacity units as possible.
So your query may look like this:
SELECT * FROM client_logs WHERE acc_token IN (token1, token2, ...) AND ts BETWEEN ts1 AND ts2
Node.js examples of PartiQL API usage can be found here.
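For example, a hedged sketch with the AWS SDK v2 low-level client; the quoted identifiers match the question's table, the token variables are hypothetical, and the parameter markers (?) keep the values out of the statement string:

const AWS = require("aws-sdk");
const ddb = new AWS.DynamoDB();

// Inside an async function: two tokens shown; add one "?" per token.
const result = await ddb
    .executeStatement({
        Statement: 'SELECT * FROM "client_logs" WHERE "acc_token" IN [?, ?] AND "ts" BETWEEN ? AND ?',
        Parameters: [
            { S: token1 },           // access tokens to match
            { S: token2 },
            { N: String(fromDate) }, // ts range bounds
            { N: String(toDate) },
        ],
    })
    .promise();
// result.Items holds the matching rows in DynamoDB's attribute-value format.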

Related

Cloud Spanner complex primary key and queries

I'm playing with Cloud Spanner and I created an imgur clone with the schema as follows:
CREATE TABLE Images (id STRING(36) NOT NULL, createdAt TIMESTAMP, caption STRING(1024), fileType STRING(10)) PRIMARY KEY (id, createdAt DESC)
The id is a version 4 UUID as the GCP documentation specifies so that I avoid hotspots. The createdAt is a timestamp when an image is first created. I have my PRIMARY KEY defined as (id, createdAt DESC) so that I can more easily query by latest added images.
What I don't understand is what happens if I want to get a single image using only SELECT * FROM Images WHERE id = 'some UUID'. Will Spanner still search by key in an efficient way, meaning getting the information from the server that stores the specific key in its key range, even though I only specified a part of the primary key?
In your simple example, yes. Spanner will try to come up with an efficient execution plan, which may include using an index (automatically created for PKs) even though your predicate covers just one of the two columns of the composite PK, because it is the first column. If your predicate were just on createdAt, then it would scan the table. It would be far more expensive to find matches for col2 in your composite PK of (col1, col2) than it is to just scan col2.
This assumes there's enough data to matter. For example, if you have 42 rows, it really won't matter how you execute the query or what predicates were provided; the number of I/O requests (often the most expensive part of a query) will be the same.
In general, Spanner tries to pick the index it thinks will be most efficient. The actual physical steps don't work like that but conceptually, it's a reasonable way to think about it.
Whether an index is helpful or not depends on a few things, and whether it gets picked or not also has dependencies. Does it have statistics, are the statistics correct/fresh, is it making correct estimates on row counts, etc. Composite indexes/keys are just a bit more interesting, as noted above.
Just make sure you always test with enough data (closely matching your production environment if possible).

How do I query the range of a GSI of any item in DynamoDB (i.e. not partition key dependent)?

I am using Node.js with DynamoDB. I want to fetch the items in the table that are from the past month. I have a date GSI, and the id is also linked to the date. I don't want to use scan because this table will grow. The main issue is that the query is not item-dependent, and Query needs the item partition key, which doesn't make sense in my case.
I have tried querying the GSI on its own, and querying the date range with just the partition key. I don't know how to get this right.
const params = {
    TableName: interactionsTable,
    IndexName: "interactionDate",
    KeyConditionExpression: "interactionDate between :fDay and :lDay",
    ExpressionAttributeValues: {
        ":fDay": firstDayStr,
        ":lDay": lastDatStr
    },
};
I get an error saying that the Key Condition Expression is invalid. Is there a better way to address this problem?
You need to configure your GSI another way.
Choose another attribute as the GSI partition key and make interactionDate the sort key, or keep interactionDate as a regular attribute of the GSI and apply the between condition in a filter expression.
The error occurs because you can't use a between condition on a partition key; it is only valid on a sort key or on non-key attributes in a filter.
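A hedged sketch of the sort-key variant: it assumes the GSI is recreated with a hypothetical constant attribute gsiPk (written with the same value on every item) as its partition key and interactionDate as its sort key, which makes between legal in the key condition. Beware that a single GSI partition value can become a hot key at high write rates; the write-sharding pattern described in the first answer above addresses that.

const params = {
    TableName: interactionsTable,
    IndexName: "interactionDate",
    KeyConditionExpression: "gsiPk = :pk AND interactionDate between :fDay and :lDay",
    ExpressionAttributeValues: {
        ":pk": "INTERACTION", // the constant value written to every item
        ":fDay": firstDayStr,
        ":lDay": lastDatStr,
    },
};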

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters clients can specify is to get only the most recent row; in that case I append "LIMIT 1" to the end of the CQL statement, since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device ids to get back the latest entries for. So, my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of values in it: under the hood the coordinator is making requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify LIMIT, it applies to the whole statement; basically, you can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in IN becomes one query) and put LIMIT 1 on every one of them (a sketch follows below).
To be honest, this was my solution in a lot of projects and it works pretty much fine. The coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the results for you, and might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
This is all in case you can't afford more disk space for your cluster.
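A minimal sketch of that multiple-query approach in Node.js with the cassandra-driver package (the driver choice, connection settings, and keyspace name are assumptions; the statement matches your schema):

const cassandra = require("cassandra-driver");
const client = new cassandra.Client({
    contactPoints: ["127.0.0.1"],
    localDataCenter: "datacenter1",
    keyspace: "my_keyspace", // hypothetical
});

const query =
    "SELECT * FROM data WHERE application_id = ? AND partner_id = ? AND location_id = ? " +
    "AND device_id = ? AND data_schema = ? LIMIT 1";

async function latestPerDevice(appId, partnerId, locationId, deviceIds, schema) {
    const results = await Promise.all(
        deviceIds.map((deviceId) =>
            client.execute(query, [appId, partnerId, locationId, deviceId, schema], { prepare: true })
        )
    );
    // Each result holds at most one row: the newest, thanks to the DESC clustering order.
    return results.map((r) => r.rows[0]).filter(Boolean);
}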
Usual Cassandra solution
In Cassandra, data should be modeled for the queries you run (query-first design). So basically you would have one additional table with the same partition key as you have now, but without the activity_timestamp clustering column, i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
double (()) is intentional.
Every time you write to your table, you also write the data to latest_entry (the table without activity_timestamp). Then you can run the query you need with IN, and since this table holds only one entry per partition key (always the latest), you don't need LIMIT 1. That would be the usual solution in Cassandra.
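A sketch of the dual write, reusing the client from the sketch above; the payload column name and the use of a logged batch (so both tables stay in step) are assumptions:

const insertData =
    "INSERT INTO data (application_id, partner_id, location_id, device_id, data_schema, activity_timestamp, payload) " +
    "VALUES (?, ?, ?, ?, ?, ?, ?)";
const insertLatest =
    "INSERT INTO latest_entry (application_id, partner_id, location_id, device_id, data_schema, payload) " +
    "VALUES (?, ?, ?, ?, ?, ?)";

async function writeEvent(keys, activityTimestamp, payload) {
    // keys = [appId, partnerId, locationId, deviceId, schema]
    await client.batch(
        [
            { query: insertData, params: [...keys, activityTimestamp, payload] },
            // Inserts are upserts, so latest_entry always holds the most recent write.
            { query: insertLatest, params: [...keys, payload] },
        ],
        { prepare: true }
    );
}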
If you are afraid of the additional writes, don't worry; they are inexpensive and CPU-bound. With Cassandra it's always "bring on the writes", I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. IN is supported only on the last column of the partition key or on the last clustering column. So you can:
swap your two last fields of the primary key
use one query for each device id

Query Multiple Partition Keys At Same Time DynamoDB - Node

I have an array of Job ID's.
[ '01', '02', '03', '04' ]
Currently I am looping through the array and executing a Query for each item in the array to get the job details.
Is there a way to use a single query to get all the jobs whose partition key is in the array?
You can use the BatchGetItem API to get multiple items from a DynamoDB table based on their key attributes (i.e. partition key, plus sort key if the table has one).
There are a few options, each with some pros/cons:
BatchGetItem as #notionquest pointed out. This can fetch up to 100 items or 16MB of data in a single call, but you need to provide all of the key values (both partition and sort for each item, if your table uses a composite key schema).
TransactGetItems - this can retrieve up to 25 items in a single call and, as the name implies, is transactional, so the entire operation will fail if there is another pending operation on any of the items being queried. This can be good or bad depending on your use case. Similar to BatchGetItem, you need to provide all key attributes.
Query, which you are already using. Query has high performance but only supports 1 key per request (partition key required, sort key optional). Will return up to 1MB of data at a time, and supports paginated results.
If your jobs table has a composite key (partition + sort), Query is the best option in terms of performance, and there is no constraint on specifying the sort key values. If the table only has a partition key, then BatchGetItem is probably the best bet (assuming each job item is relatively small in size, and you expect fewer than 100 total jobs to be returned). If either of those assumptions is incorrect, multiple Querys would be the best option (a sketch of the BatchGetItem route follows below).
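A sketch of the BatchGetItem route via the DocumentClient, assuming a hypothetical table "jobs" whose only key attribute is the partition key "id":

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

const jobIds = ["01", "02", "03", "04"];

// Inside an async function:
const res = await docClient
    .batchGet({
        RequestItems: {
            jobs: { Keys: jobIds.map((id) => ({ id })) },
        },
    })
    .promise();
const jobs = res.Responses.jobs;
// Any keys DynamoDB couldn't serve in this call come back in
// res.UnprocessedKeys and should be retried.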
You can use partiQL for this use case:
SELECT *
FROM <TABLE_NAME>
WHERE "Id" IN [ARRAY]
But do note that a PartiQL statement has length constraints: a minimum of 1 and a maximum of 8,192 characters.
let statement = {
    "Statement": 'SELECT * FROM <TABLE_NAME> WHERE "Id" IN [ARRAY]'
};
let result = await dynamoDbClient.executeStatement(statement).promise();

An Approach to Cassandra Data Model

Please note that I am using NoSQL for the first time, and pretty much every concept in this NoSQL world is new to me, having come from RDBMS after a long time there!
In one of my heavily used applications, I want to use NoSQL for the part of the data where transactions/the relational model don't make sense, and move it out of MySQL. What I would gain is the A and P of CAP [Availability and Partition tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (integer)
We can safely assume that this part of application is similar to Logging of the Activity!
I would like to move this to NoSQL as per my requirements and separate from Performance Oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking at the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data as values!
After reading about User Defined Types in Cassandra: can I use a UserDefinedType as the value, which essentially gives one key and multiple values? Or should I just use normal columns without a UserDefinedType? One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to it, since the key varies from application to application and, within an application, each entity is unique!
There is no application/business function that accesses this data without the key; in simple terms, there is no requirement to get data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
    id uuid,
    timestamp timeuuid,
    some_column text,
    some_column2 list<text>,
    some_column3 map<text, text>,
    some_column4 map<text, text>,
    primary key (id, timestamp ....)
);
Note the primary key. There's multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster wide query, which may be slow or most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address> (where address is a UDT). You would typically store info in columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data; that is what it seems you're looking to do.
One thing to watch out for... your primary key is unique per record. If you do another insert with the same PK, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources, so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time series log data, Cassandra is a very good choice.
