Cassandra update query didn't change data and didn't throw an error

Hello, I am facing a problem when trying to execute a really simple update query in cqlsh:
update "Table" set "token"='111-222-333-444' where uid='123' and "key"='somekey';
It didn't throw any error, but the value of the token is still the same. However, if I try the same query for some other field it works just fine:
update "Table" set "otherfield"='value' where uid='123' and "key"='somekey';
Any ideas why Cassandra would prevent updates for some fields?

Most probably the entry was inserted by a client with incorrect clocks, or something similar. Data in Cassandra is "versioned" by write time, which can even be in the future (depending on the use case). When reading, Cassandra compares the write times of all versions of the specified column (there can be multiple versions in the data files on disk) and selects the one with the highest write time.
You need to check the write time of that column value (use the writetime function) and compare it with your current time:
select writetime("token") from "Table" where uid='123' and "key"='somekey';
The resulting value is in microseconds. You can drop the last 3 digits to get milliseconds and use an online epoch converter to turn it into a human-readable time.
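If you prefer to do the conversion in code, here is a minimal Python sketch (the writetime value shown is just a placeholder):
from datetime import datetime, timezone

# Example value returned by writetime("token") - a placeholder, not real data.
write_time_us = 1530000000000000

# writetime values are microseconds since the Unix epoch.
written_at = datetime.fromtimestamp(write_time_us / 1_000_000, tz=timezone.utc)
print("column written at:", written_at.isoformat())
print("current time:     ", datetime.now(timezone.utc).isoformat())
If the write time is ahead of your current time, any update with a "normal" timestamp will be silently ignored, which matches the behaviour you observed.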

Related

How can I update a column to a particular value in a Cassandra table?

Hi, I have a Cassandra table with around 200 records in it. Later I altered the table to add a new column named budget, which is of type boolean. I want the default value for that column to be true. What should the CQL look like?
I tried the following command but it didn't work:
cqlsh:Openmind> update mep_primecastaccount set budget = true ;
SyntaxException: line 1:46 mismatched input ';' expecting K_WHERE
appreciate any help
thank you
Any operation that would require a cluster-wide read before write is not supported (it won't work at the scale Cassandra is designed for). You must provide a partition key and clustering key for an update statement. If there are only 200 records, a quick Python script can do this for you: do a SELECT * FROM mep_primecastaccount and iterate through the ResultSet, issuing an update for each row (see the sketch below). If you have a lot more records you might want to use Spark or Hadoop, but for a small table like that a quick script can do it.
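A minimal sketch of that script with the DataStax Python driver, assuming a single primary key column named id (adjust the key columns to your actual schema):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('openmind')

# Prepare the per-row update once; the key column "id" is an assumption.
update = session.prepare("UPDATE mep_primecastaccount SET budget = true WHERE id = ?")

# Read every row's key and issue one update per row.
for row in session.execute("SELECT id FROM mep_primecastaccount"):
    session.execute(update, (row.id,))

cluster.shutdown()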
Chris's answer is correct - there is no efficient or reliable way to modify a column value for each and every row in the database. But for a 200-row table that doesn't change in parallel, it's actually very easy to do.
But there's another way that also works on a table of billions of rows:
You can handle the notion of a "default value" in your client code. Pre-existing rows will not have a value for "budget" at all: it will be neither true nor false, but rather outright missing (a.k.a. "null"). Your client code can, when it reads a row with a missing "budget" value, replace it with a default value of its choice, e.g. "true".
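For example, a small sketch of that client-side default; session is a connected Python driver session and the key column id is an assumption:
def read_budget(session, account_id):
    row = session.execute(
        "SELECT budget FROM mep_primecastaccount WHERE id = %s",
        (account_id,)).one()
    # Rows written before the ALTER TABLE have budget = null; apply the default here.
    return True if row is None or row.budget is None else row.budget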

How does Apache Spark Structured Streaming 2.3.0 let the sink know that a new row is an update of an existing row?

How does spark structured streaming let the sink know that a new row is an update of an existing row when run in an update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode:
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation, otherwise all data will simply be appended to the end of the result table. In turn, to use aggregation the data need to use one or more columns as a key. Since a key is needed, it is easy to know whether a row has been updated or not: simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupBy, the columns being grouped on are the keys.
Simple aggregations that return a single value will not require a key. However, since only a single value is returned, it will be updated whenever that value changes. An example here could be taking the sum of a column (without a groupBy).
The documentation contains a picture that gives a good understanding of this; see the "Model of the Quick Example" in the Structured Streaming documentation quoted above.
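As an illustration, here is a sketch of that quick example in PySpark: the counts are keyed by the grouped column, so in update mode each trigger writes only the rows whose count changed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("update-mode-example").getOrCreate()

# Read lines from a socket source (assumes something is listening on localhost:9999).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The grouped column "word" acts as the key of the result table.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# In update mode, only rows updated since the last trigger reach the sink.
query = (word_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()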

How to get last inserted data from DynamoDb in lambda function

I have a table in DynamoDB with:
timestamp as Partition Key,
Status as a normal column.
I insert the timestamp and status into DynamoDB whenever new data comes in.
I want to retrieve the last-added data from the table (based on the timestamp).
So how can I do this? (I am using a Lambda function with Node.js.)
If you have any queries about the question, comment below. Thanks.
You can make a query on your table with these parameters:
Limit: 1,
ScanIndexForward : false
But with the timestamp as the partition key this won't really work: a Query requires the exact partition key value, and ScanIndexForward only orders items within a single partition by the sort key.
The other way is to generate a stream for every new entry in your table that triggers a Lambda function.
The Stream->Lambda approach suggested in the previous answer is OK; however, that alone will not help if you need to retrieve the latest entry more than once. Alternatively, because you always have exactly one newest item, you can simply create another table with exactly one item under a constant hash key and overwrite it every time you write. That second table will always contain one record, which will be your newest item. You can update that second table via a stream from your main table and a Lambda (see the sketch below), or directly from the application that performs the original write.
[Update] As noted in the comments, using only one record will limit throughput to the limits of a single partition (1,000 WCU). If you need more, use a sharding approach: on write, store the newest data randomly in one of N records, and on read, scan all N records and pick the newest one.
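A rough sketch of the single-record approach, in Python/boto3 for illustration (the same logic applies in Node.js). The second table name LatestEntry, its constant key 'latest', and a stream view type that includes NEW_IMAGE are all assumptions:
import boto3

dynamodb = boto3.resource('dynamodb')
latest_table = dynamodb.Table('LatestEntry')  # assumed single-record table

def handler(event, context):
    # Triggered by the DynamoDB Stream of the main table.
    for record in event['Records']:
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        new_image = record['dynamodb']['NewImage']  # DynamoDB JSON format
        latest_table.put_item(Item={
            'id': 'latest',                           # constant hash key
            'timestamp': new_image['timestamp']['N'],  # assumed number attribute
            'Status': new_image['Status']['S'],        # assumed string attribute
        })
Reading the newest entry is then just a single GetItem on LatestEntry with the key 'latest'.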

Why does Cassandra not allow UDFs in UPDATE statements?

I am new to Cassandra. I created a table, and I have inserted some data in it, now I want to select data from it, and in the output, I want some calculated columns.
I created a user defined function concat, which concatenates 2 strings and returns the result. Then I noticed that this function shows data correctly when I use it in a SELECT statement, but it does not work when I use it in an UPDATE statement:
That is, this works;
select concat(prov,city), year,mnth,acno,amnt from demodb.budgets;
but this does not;
update demodb.budgets set extra=concat(prov,city) where prov='ON';
In addition, the UPDATE also does not work if we simply assign one column's value to another column of the same type (without any calculations), as below;
update demodb.budgets set extra=city where prov='ON';
Also, even a simple arithmetic calculation doesn't work in Update statement;
that is, this too doesn't work;
update demodb.budgets set amnt = amnt + 20 where prov='ON';
here amnt is a simple double type column.
(When I saw this, all I could do was pull my hair and say I can't work with Cassandra; I just don't want it if it cannot do simple arithmetic.)
Can someone please help how can I achieve the desired updates?
I think the basic answer to your question is that read-before-write is a huge anti-pattern in Cassandra. An UPDATE in Cassandra is a blind upsert: the coordinator writes the new value without ever reading the current row. Evaluating concat(prov, city), extra = city, or amnt = amnt + 20 on the server would require exactly that read-before-write (and an UPDATE also needs the full primary key in its WHERE clause), which is why these statements are rejected. The only exception is incrementing counter columns, which are designed for that purpose.
The issue of concurrency in a distributed environment is a key point there.
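What you can do instead is perform the read and the computation in your application and then issue plain UPDATEs. A sketch with the Python driver, assuming the primary key of demodb.budgets is (prov, year, mnth, acno); adjust to your actual schema:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demodb')

# The full primary key is required in the WHERE clause of an UPDATE.
update = session.prepare(
    "UPDATE budgets SET extra = ? WHERE prov = ? AND year = ? AND mnth = ? AND acno = ?")

rows = session.execute(
    "SELECT prov, city, year, mnth, acno FROM budgets WHERE prov = 'ON'")
for row in rows:
    # Do the "concat" on the client, then write the result back.
    session.execute(update, (row.prov + row.city, row.prov, row.year, row.mnth, row.acno))

cluster.shutdown()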

How to retrieve a very big cassandra table and delete some unuse data from it?

I have created a Cassandra table with 20 million records. Now I want to delete expired data, as determined by one non-primary-key column, but Cassandra doesn't support a delete condition on that column. So I tried to retrieve the table and go through the data row by row to delete it; unfortunately, it is too huge to retrieve at once. I also can't just drop the whole table, so how can I achieve my goal?
Your question is actually how to get the data from the table in chunks (also called pagination).
You can do that by selecting different slices of your primary key: for example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, is to use fetch_size; both the Python and Java drivers support it (see the sketch below).
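A sketch of the fetch_size approach with the Python driver; the table, its key column id, and the expiry column are placeholder names:
from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# The driver fetches 1000 rows per page and pages transparently as you iterate.
select = SimpleStatement("SELECT id, expiry FROM my_table", fetch_size=1000)
delete = session.prepare("DELETE FROM my_table WHERE id = ?")

now = datetime.utcnow()  # the driver returns naive UTC datetimes for timestamp columns
for row in session.execute(select):
    # The expiry check happens on the client; the delete uses the primary key.
    if row.expiry is not None and row.expiry < now:
        session.execute(delete, (row.id,))

cluster.shutdown()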
