I'm planning a few ETLs that will eventually "fill" the same row in Cassandra, i.e. if a table is defined as:
CREATE TABLE MyTable (
key text,
column1 text,
column2 text,
column3 text,
column4 text,
PRIMARY KEY (key)
)
Then a few ETLs will fill in the appropriate values in columns 1-4 at different times.
How well does Cassandra handle such operations? Should I read the row first, update it in code and then write it back, or would an UPDATE call do the trick?
I know that Cassandra is highly optimized for write throughput in that it never modifies data on disk; it only appends to existing files or creates new ones. Knowing that, and without diving deeper into the implementation, it worries me that if one ETL writes column4 and 20 minutes later a different ETL writes column2, I will lose a lot of performance compared to waiting for all the ETLs to finish and then saving all the data in bulk (which is not an easy implementation by itself).
Ideas?
All inserts/updates in Cassandra are Upserts, and Cassandra uses last-write-wins for conflict resolution. If your ETLs are updating different columns, there would be no issue. If they update the same column, the last update for a column will win. If this is an issue, you can add a timestamp column as a clustering key (allowing multiple values of the data), and during read, read the latest one. You could also add a TTL so older irrelevant values get cleared out.
If certain columns are updated, and others aren't, you'll effectively get null for those columns when querying.
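To make that concrete, here is a minimal CQL sketch of both options: plain upserts against the MyTable definition from the question, and the timestamp-as-clustering-key variant described above (the MyTableVersioned table, the updated_at column and the 30-day TTL are illustrative assumptions, not part of the original schema):
-- Two ETLs writing different columns of the same row at different times;
-- Cassandra merges the cells at read time, so no read-before-write is needed.
UPDATE MyTable SET column4 = 'value-from-etl-A' WHERE key = 'row-1';
-- ...20 minutes later, from a different ETL:
UPDATE MyTable SET column2 = 'value-from-etl-B' WHERE key = 'row-1';
-- Sketch of the timestamp-as-clustering-key variant with a TTL:
CREATE TABLE MyTableVersioned (
key text,
updated_at timestamp,
column1 text,
column2 text,
column3 text,
column4 text,
PRIMARY KEY (key, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);
-- Each write becomes its own row and expires after 30 days; reads take the latest:
INSERT INTO MyTableVersioned (key, updated_at, column2)
VALUES ('row-1', toTimestamp(now()), 'value-from-etl-B') USING TTL 2592000;
SELECT * FROM MyTableVersioned WHERE key = 'row-1' LIMIT 1;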
I couldn't really understand your last paragraph. Could you please explain your concern?
I have the following table in my Cassandra DB, and I want to find the delta difference in terms of a Cassandra query. For example, if I perform any insert, update, or delete operation on the table, I should be able to show which row/rows are impacted as my final result.
Let's say that in the first instance I perform some 10 row insertions; if I take the delta difference, the output should only show that those 10 rows were inserted. Likewise, if we modify any number of rows or delete some rows, those changes should be captured.
The next time we run the query it should ideally give 0, as we have not inserted/modified/deleted any rows.
Here is the table:
CREATE TABLE datainv (
datainv_account_id uuid,
datainv_run_id uuid,
id uuid,
datainv_summary text,
json text,
number text,
PRIMARY KEY (datainv_account_id, datainv_run_id));
I have searched for many things on the internet, but most of the solutions are based on timeuuid; in this case I have only uuid columns, so I'm not finding any solution showing that the same use case can be achieved with uuid.
It's not so easy to generate a diff between 2 table states in Cassandra, because you can't easily detect whether you have inserted new partitions or not. You can implement something based on a timeuuid or a timestamp as a clustering column - in this case you'll be able to filter the data since the latest change, as you have an ordering of values that you don't have with uuid, which is completely random. But it still requires that you perform a full scan of the whole table. Plus it won't detect deletions...
Theoretically you can implement this with Spark as follows:
read all primary key values & store this data in some other table/on disk (a possible layout for such a table is sketched after this list);
next time, read all primary key values & find the difference between the original set of primary keys & the new set - for example, do a full outer join & treat the presence of None on the left as an addition, and the presence of None on the right as a deletion;
store the new set of primary keys in a separate table/on disk; the previous version should be truncated.
But it will consume quite a lot of resources.
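If you go the table route for storing those primary-key snapshots, the extra table could look roughly like the sketch below (the table name, the snapshot_id/bucket columns and the bucketing scheme are assumptions for illustration, not something from the answer above):
-- Hypothetical snapshot table holding the primary keys seen in each run.
-- snapshot_id identifies the diff run; bucket spreads one snapshot across
-- several partitions so no single partition has to hold every key.
CREATE TABLE datainv_pk_snapshot (
snapshot_id timeuuid,
bucket int,
datainv_account_id uuid,
datainv_run_id uuid,
PRIMARY KEY ((snapshot_id, bucket), datainv_account_id, datainv_run_id)
);
-- Each Spark run writes the current keys under a new snapshot_id; a full outer
-- join between the previous and the current snapshot then yields the diff.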
I modeled my Cassandra schema in a way that I have a couple of tables with the same partition key - a UUID.
Each table has its partition key and other columns representing the data for a specific query I would like to ask.
For example - table 1 has the UUID and a column for its status (no other clustering keys in this table), and table 2 contains the same UUID (also without clustering keys) but with different columns representing the data for this UUID.
Is this the right modeling? Is it wrong to duplicate the same partition key across tables in order to have each table hold the relevant columns for a specific use case? Or is it preferable to use only 1 table, query it, and take the relevant data for the specific use case in the code?
There's nothing wrong with this modeling. Whether it is better, or worse, than the obvious alternative of having just one table with both pieces of data, depends on your workload:
For example, if you commonly need to read both status and data columns of the same uuid, then these reads will be more efficient if both things are in the same table, which only needs to be looked up once. If you always read just one but not both, then reads will be more efficient from separate tables. Also, if this workload is not read-mostly but rather write-mostly, then writing to just one table instead of two will be more efficient.
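As a rough CQL illustration of that trade-off (table and column names here are made up for the sketch, not taken from the question):
-- Option A: two tables sharing the same partition key, one per use case.
CREATE TABLE entity_status (
id uuid PRIMARY KEY,
status text
);
CREATE TABLE entity_data (
id uuid PRIMARY KEY,
payload text
);
-- Option B: a single table serving both use cases.
CREATE TABLE entity (
id uuid PRIMARY KEY,
status text,
payload text
);
-- Reading both pieces for one id costs two lookups with Option A,
-- but only one with Option B:
SELECT status FROM entity_status WHERE id = 123e4567-e89b-12d3-a456-426614174000;
SELECT payload FROM entity_data WHERE id = 123e4567-e89b-12d3-a456-426614174000;
SELECT status, payload FROM entity WHERE id = 123e4567-e89b-12d3-a456-426614174000;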
We store events in multiple tables depending on category.
Each event has an id but contains multiple subelements.
We have a lookup table to find events using the subelement_id.
Each subelement can participate in at most 7 events.
Hence each partition will hold at most 7 rows.
We will have 30-50 BILLION rows in eventlookup over a period of 5 years.
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
Problem: How do we delete old data once we reach the 5 (or some other number) year mark?
We want to purge the "tail" at some specific intervals, say every week or month.
Approaches investigated so far:
TTL of X years (performs well, but the TTL needs to be known beforehand, and it costs 8 extra bytes for each column)
NO delete - simply ignore the problem (somebody else's problem :0)
Rate limited single row delete (do complete table scan and potentially billions of delete statements)
Split the table into multiple tables -> "CREATE TABLE eventlookupYYYY". Once a year is no longer needed, simply drop it. (Problem: every read would potentially have to query all the tables.)
Are there any other approaches we can consider?
Is there a design decision we can make now (we are not in production yet) that will mitigate the future problem?
If it's worth the extra space, track your subelement_ids for ranges of recordtimes in a separate table / column family.
Then you can easily get the ids to delete for records of a specific age if you do not want to set a TTL a priori.
But keep in mind to make this tracking distribute well: just a single date as the partition key will generate hotspots in your cluster and very wide rows, so think about a partition key like (date, chunk), where in the past I have used a random number from 0-10 for chunk. Also, you might look at TimeWindowCompactionStrategy - here is a blog post about it: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
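A rough sketch of what that could look like (the table name, the 0-10 chunk range and the 30-day compaction window are illustrative assumptions):
-- Tracking table: which subelement_ids were written on a given day.
-- The chunk column spreads each day over several partitions to avoid hotspots.
CREATE TABLE eventlookup_by_day (
day date,
chunk int,
subelement_id text,
PRIMARY KEY ((day, chunk), subelement_id)
);
-- TimeWindowCompactionStrategy on the main table, grouping SSTables into
-- roughly monthly windows:
ALTER TABLE eventlookup
WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': '30'
};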
Your partition key is only subelement_id, so all the (up to 7) events for all recordtimes of a given subelement will be in one partition.
Given your table structure, you need to know the subelement_id of your data just to fetch a single row. With that assumption, your table structure can be improved a bit by sorting your data by recordtime DESC:
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
eventtype int,
parentid text,
partition bigint,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
WITH CLUSTERING ORDER BY (recordtime DESC);
Now all of your data is in descending order and this will give you a big advantage.
Suppose that you have multiple years of data (e.g. from 2000 to 2018). Assuming you need to keep only the last 5 years, you'd fetch data with something like (note that a timeuuid clustering column has to be compared through minTimeuuid/maxTimeuuid rather than a plain date string):
SELECT * FROM eventlookup WHERE subelement_id = 'mysub_id' AND recordtime >= minTimeuuid('2013-01-01');
This query is efficient because C* will retrieve your data and will stop scanning the partition exactly where you wanted it to: 5 years ago. The big plus is that if you have tombstones after that point, they won't impact your reads at all. That means you can "safely" trim after that point by issuing a range delete:
DELETE FROM eventlookup WHERE subelement_id = 'mysub_id' AND recordtime < minTimeuuid('2013-01-01');
Beware that this delete will create tombstones that will be skipped by your reads, BUT they will be read during compactions, so keep that in mind.
Alternatively, you can simply skip the delete part if you don't need to reclaim storage space; your system will keep running smoothly because you will always retrieve your data efficiently.
I am wondering how column slicing in a CQL WHERE clause affects read performance. Does Cassandra have some optimization that is able to fetch only the specific columns with that value, or does it have to retrieve all the columns of a row and check them one after another? E.g.: I have a primary key of (key1, key2), where key2 is the clustering key. I only want to find the columns that match a certain key2, say value2.
Cassandra saves the data as cells - each value for a key+column is a cell. If you save several values for the key at once, they will be placed together in the same file. Also, since Cassandra writes to SSTables, you can have several values saved for the same key+column/cell in different files, and Cassandra will read all of them and return the last written one, until compaction or repair occurs and the irrelevant values are deleted.
Good article about deletes/reads/tombstones:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
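To make the question's scenario concrete, here is a small sketch (the table and value names are assumed; only key1/key2 come from the question). Because key2 is a clustering column, rows within a key1 partition are stored sorted by key2, so a query that restricts key2 is served as a slice of that partition rather than a check of every row:
CREATE TABLE example_by_key (
key1 text,
key2 text,
value text,
PRIMARY KEY (key1, key2)
);
-- Served as a slice within the 'k1' partition, using the clustering order on key2:
SELECT * FROM example_by_key WHERE key1 = 'k1' AND key2 = 'value2';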
Say I have 2 rows of time-series data that have exactly the same timestamp etc. for the primary keys; the only difference is that the rest of the data is different.
So if set 1 has [timestamp, other keys...], value_col1, value_col2, then value_col1 and value_col2 for set 2 will have different values than set 1.
Now if I insert those sets in one batch, or very quickly using separate insert queries, the result I see in the database can be somewhat inconsistent: it can happen that value_col1 from set 1 is combined with value_col2 from set 2 in the final row.
It took me the whole evening to find out that this is actually a bug (or maybe intended behaviour...). I have my workaround now, using a slightly increased timestamp for set 2. The symptom won't be noticed in many cases, but in my case, where col1 is the partial decoding key for col2, I have a problem!
Does anyone have the same problem or know where the problem actually lies?
I'm using the cassandra-node driver on Node.js 5.0.0 with Cassandra 2.0.14.
You should define your Cassandra schema in a way that avoids race conditions.
If you provide different data with the same partition and clustering keys at the same instant, you will not be able to determine which will be preserved.
If you want both rows to be preserved, you should use the uuid or timeuuid data type.
See more information in the nodejs driver docs: http://docs.datastax.com/en/developer/nodejs-driver/2.2/nodejs-driver/reference/uuids-timeuuids.html
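A minimal sketch of that idea (table and column names are made up): with a timeuuid clustering column, each insert becomes its own row, so two sets written at the same instant can no longer overwrite or mix with each other:
-- insert_id is a timeuuid generated per write, so two writes with the same
-- event_time remain two distinct rows instead of racing on one row.
CREATE TABLE timeseries (
series_id text,
event_time timestamp,
insert_id timeuuid,
value_col1 text,
value_col2 text,
PRIMARY KEY (series_id, event_time, insert_id)
);
INSERT INTO timeseries (series_id, event_time, insert_id, value_col1, value_col2)
VALUES ('sensor-1', '2015-06-01 12:00:00+0000', now(), 'set1-a', 'set1-b');
INSERT INTO timeseries (series_id, event_time, insert_id, value_col1, value_col2)
VALUES ('sensor-1', '2015-06-01 12:00:00+0000', now(), 'set2-a', 'set2-b');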