I want to add a lastUpdated column on my cassandra tables, and have this column autopopulated by cassandra whenever a row is updated.
Does cassandra have triggers that could do this?
The goal is not to have to update code base, just a db migration to make this work.
Cassandra stores the write time for each column/row/partition. In fact, this is what the coordinators use to determine the latest version of the data and returns it to the clients.
Depending on what you're after, you can use this metadata for your purposes without having to store it separately. There is a built-in function WRITETIME() that returns the timestamp in microseconds for when the data was written (details here).
On the subject of database triggers, they exist in Cassandra but there hasn't been much work done for this functionality for 5 (maybe 7) years so I personally wouldn't recommend their use. I think triggers were always considered experimental and never moved on. In fact, they were deprecated in DataStax Enterprise around 5 years ago. Cheers!
Related
I have some table like this:
CREATE TABLE events (
id int,
eventdate timestamp,
PRIMARY KEY (id)
);
What I'm trying to do is conditional insert, which going to verify if eventdate is not older than 3 years and insert data if the condition is met.
In SQL something similar could be achieved by DATEADD
How to handle it in Cassandra?
select * from events and iterate (paging) through the result set. Issue an insert for everything older than 3 years. A quick python script and giving it a day or two to run will accomplish it in less time than more elaborate things. Particularly if this is a one off thing. If you need to do it regularly I would recommend writing a spark job to do it. You can get more efficient if you dont want to use spark and want to run it locally by splitting up token ranges on the select statement to be the ring boundaries.
Cassandra wont support large bulk operations that require reads before writes that must read entire data set. It wouldn't work on clusters its designed to support (think petabytes across many data centers).
I have requirement to update existing values in Cassandra 3.0 to Uppercase.
Its for all rows and for a single column of type text
I would need to do this update all the way to production and want the least intrusive/ iterative way of doing it as some of environment will be update by different team. This will be a one time task
Thanks for help,Appreciate your time
Neha
Cassandra Newbie
Cassandra does not offer native support for this kind of work. You will need to look at either writting your own application that reads and re-insert with upercase. For large scale cluster, you will want to look at using spark on cassandra. Spark is often use on top of Cassandra to perform some analytics, or data migration, like you are doing.
We are looking for a technology stack which will have the following criteria.
We will be having around 10 million customer.
Each customer will be having around 20MB+ of data.
Data of each user will be updated everyday.
We need to store the data for more than six months.
We may need to query on the data any time within the time span of six months.
Currently we are thinking to use Cassandra, but the limitation of maximum storage per node in Cassandra should be less than 3TB, we are looking for other alternatives to use with or without Cassandra.
Well, I don't know if my suggestion applies for your case. We had a similar case with one of our products. There was created a blob field to record binary data, as pdf documents, that made the database grew considerably.
The solution we made was to create a second database, as a repository for records older then one year. At the application server there's a service running which:
1) Copies the records, from specific tables, older then one year to this second database;
2) Deletes records from the main database, once we have a copy in the other side;
3) Queries that need data older then one year are directed to this second database;
Sure, we had to do some implementations on the code to adapt to this situation, but is running good so far.
You can try ScyllaDB. It's a C++ reimplementation of Cassandra at 10x the speed. Scylla supports 10TB/node and there are examples of larger amounts per node. Proper disclosure - I work there but am speaking from experience.
You can definitely consider just to store the metadata itself in the database and the blobs on a separate nodes outside but it's complex and Scylla can store it all altogether. Such a similar system is already in production and we hope that user will eventually open source it
I have a cassandra table that looks like the following:
create table position_snapshots_by_security(
securityCode text,
portfolioId int,
lastUpdated date,
units decimal,
primary key((securityCode), portfolioId)
)
And I would like to something like this:
update position_snapshots_by_security
set units = units + 12.3,
lastUpdated = '2017-03-02'
where securityCode = 'SPY'
and portfolioId = '5dfxa2561db9'
But it doesn't work.
Is it possible to do this kind of operation in Cassandra? I am using version 3.10, the latest one.
Thank you!
J
This is not possible in Cassandra (any version) because it would require a read-before-write (anti-)pattern.
You can try the counter columns if they suit your needs. You could also try to caching/counting at application level.
You need to issue a read at application level otherwise, killing your cluster performance.
Cassandra doesn't do a read before a write (except when using Lightweight Transactions) so it doesn't support operations like the one you're trying to do which rely on the existing value of a column. With that said, it's still possible to do this in your application code with Cassandra. If you'll have multiple writers possibly updating this value, you'll want to use the aforementioned LWT to make sure the value is accurate and multiple writers don't "step on" each other. Basically, the steps you'll want to follow to do that are:
Read the current value from Cassandra using a SELECT. Make sure you're doing the read with a consistency level of SERIAL or LOCAL_SERIAL if you're using LWTs.
Do the calculation to add to the current value in your application code.
Update the value in Cassandra with an UPDATE statement. If using a LWT you'll want to do UPDATE ... IF value = previous_value_you_read.
If using LWTs, the UPDATE will be rejected if the previous value that you read changed while you were doing the calculation. (And you can retry the whole series of steps again.) Keep in mind that LWTs are expensive operations, particularly if the keys you are reading/updating are heavily contended.
Hope that helps!
I am storing the time series data in cassandra on daily basis. We would like to archive/purge the data older than 2 days on daily basis. We are using Hector API to store the data. Can some one suggest me the approach to delete the cassandra data on daily basis where data is older than 2 days? Using TTL approach for cassandra row is not feasible, as the number of days to delete data is configurable. Right now there is no timestamp column in the table. we are planning to add timestamp column. But the problem is, timestamp alone cannot be used in where clause, as this new column is not part of primary key.
Please provide your suggestion.
TTL is the right answer, there is an internal timestamp attached to every mutation that is used so you don't need to add one. Manually purging almost never a good idea. You may need to work on your data model a bit, check the datastax academy examples for time series
Also thrift has been frozen for two years and is now officially deprecated (removal in 4.0). Hector and other thrift clients are not really maintained anymore (see here). Using CQL and java driver will give better results with more resources available to learn as well.
I don't see what is stopping you from using TTL approach.
TTL can be used, not only while defining schema,
but also while saving data into table using datastax cassandra driver.
So, in reality you can have separate TTL for each row, configured by your java code.
Also, as Chris already mentioned, TTL uses internal timestamp for this.
Strictly based on what you describe, I think the only solution is to add that timestamp column and add a secondary index on it.
However this is a huge indicator that your data model is far from being adapted to the situation.
Emphasising my initial comment:
Is your model adapted/designed to something else? Because this doesn't look like a timeseries data in Cassandra: a timestamp like column should be part of the clustering key.