How to get the last inserted data from DynamoDB in a Lambda function - Node.js

I have a table in DynamoDB with:
timestamp as the partition key,
Status as a normal column.
I insert the timestamp and status into DynamoDB whenever new data comes in.
I want to retrieve the last added data in the table (we can refer to the timestamp here).
How can I do this? (I am using a Lambda function with Node.js.)
If you have any questions, comment below. Thanks.

You can run a query on your table with these parameters:
Limit: 1,
ScanIndexForward: false
But the query is awkward here because the timestamp is your partition key: a Query needs an equality condition on the partition key, and ScanIndexForward only orders results by the sort key within one partition.
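For illustration, a minimal sketch with the AWS SDK for JavaScript, assuming the table were remodeled so that a constant partition key groups all rows and the timestamp is the sort key (table and key names here are hypothetical):

// Minimal sketch: assumes a table where a constant partition key "pk"
// groups all rows and "timestamp" is the sort key. Names are hypothetical.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
  TableName: 'MyTable',               // hypothetical table name
  KeyConditionExpression: 'pk = :p',  // Query needs an equality on the partition key
  ExpressionAttributeValues: { ':p': 'LATEST' },
  ScanIndexForward: false,            // sort descending by the sort key (timestamp)
  Limit: 1                            // only the newest item
};

docClient.query(params, (err, data) => {
  if (err) console.error(err);
  else console.log(data.Items[0]);    // newest item, if any
});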
The other way is to enable a stream on your table so that every new entry triggers a Lambda function.

The Stream->Lambda approach suggested in the previous answer is OK; however, it will not work if you need to retrieve the latest entry more than once. Alternatively, because you always have exactly one newest item, you can simply create another table with exactly one item under a constant hash key and overwrite it every time you do an update. That second table will always contain one record, which will be your newest item. You can update the second table via a stream from your main table and a Lambda, or directly from the application that performs the original write.
[Update] As noted in the comments, using only one record will limit throughput to the limits of a single partition (1000 WCU). If you need more, use a sharding approach: on write, store the newest data randomly in one of N records; on read, scan all N records and pick the newest one.
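A minimal sketch of that second-table idea with the AWS SDK for JavaScript (the table name LatestEntry and its key are hypothetical):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// On every write to the main table (or from its stream), overwrite the
// single item in the helper table. "LatestEntry" / id = 'LATEST' are hypothetical.
function saveLatest(timestamp, status, callback) {
  docClient.put({
    TableName: 'LatestEntry',
    Item: { id: 'LATEST', timestamp, status }
  }, callback);
}

// Reading the newest entry is then a single GetItem.
function getLatest(callback) {
  docClient.get({
    TableName: 'LatestEntry',
    Key: { id: 'LATEST' }
  }, (err, data) => callback(err, data && data.Item));
}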

Related

Cassandra update query didn't change data and didn't throw an error

Hello, I am facing a problem when trying to execute a really simple update query in cqlsh:
update "Table" set "token"='111-222-333-444' where uid='123' and "key"='somekey';
It didn't throw any error, but the value of the token is still the same. However, if I try the same query on some other field, it works just fine:
update "Table" set "otherfield"='value' where uid='123' and "key"='somekey';
Any ideas why Cassandra might prevent updates on some fields?
Most probably the entry was inserted by a client with an incorrect clock, or something similar. Data in Cassandra is "versioned" by write time, which could even be in the future (depending on the use case). When reading, Cassandra compares the write times of all versions of the specified column (there could be multiple versions in the data files on disk) and selects the one with the highest write time.
You need to check the write time of that column value (use the writetime function) and compare with your current time:
select writetime("token") from "Table" where uid='123' and "key"='somekey';
The resulting value is in microseconds. You can drop the last 3 digits (converting to milliseconds) and convert that into a human-readable time.
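For example, in Node.js (dividing by 1000 drops the last 3 digits, turning microseconds into milliseconds, which Date accepts):

// writetime() returns microseconds since the Unix epoch.
const writeTimeMicros = 1501234567890123;                   // example value from the query above
const writeTimeMillis = Math.floor(writeTimeMicros / 1000); // drop the last 3 digits
console.log(new Date(writeTimeMillis).toISOString());       // human-readable time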

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, e.g.:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean the number of results returned from my partition may not match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last) is to make the partition key unique. I could then read the last value on my page and fetch the next page with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
"But I'm not certain if it's good practice to have a unique partition key for a complex table."
How you should choose your partition key depends on your requirements and data model. If your primary key is a single partition key, it has to be unique; otherwise data will be upserted (overwritten with new data). If you have a wide row (a clustering key), then making your partition key unique (a key that appears once in the table) will defeat the purpose of the wide row. In CQL, "wide rows" just means that there can be more than one row per partition; but here there would be one row per partition. It would be better if you could provide the schema.
Please follow the links below about pagination in Cassandra. "You do not need to use tokens if you are using Cassandra 2.0+. Cassandra 2.0 has auto paging. Instead of using the token function to create paging, it is now a built-in feature."
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state: you can use the pagingState object, which represents where you are in the result set when the last page was fetched.
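A minimal sketch of saving and reusing the paging state with the Node.js cassandra-driver (contact point, keyspace, and table name are hypothetical):

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'ks' // hypothetical keyspace
});

const query = 'SELECT * FROM my_table'; // hypothetical table
const options = { prepare: true, fetchSize: 100 };

// Fetch the first page; result.pageState marks where the result set stopped.
client.execute(query, [], options)
  .then(result => {
    console.log(result.rows.length, 'rows on this page');
    // Pass the saved pageState back to resume from the same position.
    return client.execute(query, [], { ...options, pageState: result.pageState });
  })
  .then(nextPage => console.log(nextPage.rows.length, 'rows on the next page'));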
EDITED:
Please check the link below:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem; maybe I can add it here quickly.
First, there is a table with two fields; just for illustration we use only a few fields.
Say we insert a million rows into it.
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI, say a hundred entries shown as 10 pages.
For this we add a column called page_no to the table and create a secondary index on that column.
Then we do a one-time update of this column with page numbers; page number 10 will mean 10 contiguous rows updated with page_no set to 10.
Since we can query on a secondary index, each page can be queried independently, as in the sketch below.
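A minimal sketch of that per-page query with the Node.js cassandra-driver, assuming the index exists (keyspace, table, and column names are hypothetical; the full POC code is linked below):

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'ks' // hypothetical keyspace
});

async function readPage(pageNo) {
  // Assumes a one-time schema change and backfill along these lines:
  //   ALTER TABLE my_table ADD page_no int;
  //   CREATE INDEX ON my_table (page_no);
  // followed by assigning page numbers to contiguous rows.
  const result = await client.execute(
    'SELECT * FROM my_table WHERE page_no = ?',
    [pageNo],
    { prepare: true }
  );
  return result.rows;
}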
The code is self-explanatory and is here: https://github.com/alexcpn/testgo
Note that cautions on how to use secondary indexes properly abound; please check them. In this use case I hope I am using it properly. I have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

How do I model a set of unique codes in Cassandra?

I have to store a list of unique sorted codes in Cassandra.
Once a code is read, it needs to be discarded; every time I read this table, a new code needs to be returned.
How do I model this?
I tried adding the code as a clustering key, but I don't quite know how to discard it so my next query will not return it.
Setting a short TTL still results in the code being returned by my select.

Primary key: query & updates

Little problem here with Cassandra. Basically my data has a status (INITIALIZED, PERFORMED, ENDED...), and I have different scheduled tasks that query this data based on the status with an IN clause. One scheduler works with the data that is INITIALIZED, one with PERFORMED, some with both, etc.
Once the data is retrieved, it is processed and the status changes accordingly (INITIALIZED -> PERFORMED -> ENDED).
The problem: in order to be able to use the IN clause, the status has to be part of my table's primary key. But when I update the status, it creates a new record in my table, since the UPSERT doesn't find any data with the given primary key.
How do I solve that?
Instead of including the status column in your primary key, you can create a secondary index on the column. However, the IN clause is not (yet) supported for secondary index columns. But as you have a very limited number of values to look up, you could use equality conditions in your WHERE clause and then merge the results client-side, as sketched below.
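A minimal sketch of that approach with the Node.js cassandra-driver, assuming a secondary index on status (keyspace, table, and schema are hypothetical):

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'ks' // hypothetical keyspace
});

// IN is not supported on the indexed column, so run one equality
// query per status and merge the results client-side.
async function findByStatuses(statuses) {
  const pages = await Promise.all(statuses.map(status =>
    client.execute(
      'SELECT * FROM tasks WHERE status = ?', // assumes CREATE INDEX ON tasks (status)
      [status],
      { prepare: true }
    )
  ));
  return pages.flatMap(page => page.rows);
}

// e.g. findByStatuses(['INITIALIZED', 'PERFORMED'])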
Beware that using secondary indexes comes at a cost; check out "when not to use an index". In your case these points may apply:
On a frequently updated or deleted column. See "Problems using an index on a frequently updated or deleted column" below.
To look for a row in a large partition unless narrowly queried. See "Problems using an index to look for a row in a large partition unless narrowly queried" below.

How to retrieve a very big Cassandra table and delete some unused data from it?

I have created a Cassandra table with 20 million records. Now I want to delete expired data, determined by one non-primary-key column, but Cassandra doesn't support deleting based on that column. So I tried to retrieve the table and delete the data line by line; unfortunately, it is too huge to retrieve. I also can't just delete the whole table, so how can I achieve my goal?
Your question is really: how to get the data from the table in chunks (also called pagination).
You can do that by selecting different slices of your primary key: for example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, is to use a fetch size (e.g. fetch_size in the Python driver); see the sketch below.
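A minimal sketch of the fetch-size approach with the Node.js cassandra-driver (keyspace, table, key, and expiry column names are hypothetical); eachRow with autoPage transparently fetches the next page after every fetchSize rows:

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'ks' // hypothetical keyspace
});

// Stream through the table in pages of 1000 rows and delete expired ones.
client.eachRow(
  'SELECT id, expires_at FROM big_table', // hypothetical table and columns
  [],
  { prepare: true, fetchSize: 1000, autoPage: true },
  (n, row) => {
    if (row.expires_at < new Date()) {
      // Delete by primary key, since we can't filter the non-key column server-side.
      // Fire-and-forget here for brevity; a real job should track these promises.
      client.execute('DELETE FROM big_table WHERE id = ?', [row.id], { prepare: true });
    }
  },
  err => {
    if (err) console.error(err);
    else console.log('Scan complete');
  }
);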
