Perfomance related question on spark + cassandra (JAVA code) - apache-spark

i am using cassandra as my dumping ground on which i have multiple jobs running to process the data and update different system. below are the job related filters
Job 1. data filter based on active_flag and update_date_time and expiry_time and process the filtered data.
Job 2. data filter based on update_date_time process the data
Job 3. data filter based on created_date_time and active flag
db columns on which where condition would run are (one or many columns in one query)
Active -> yes/no
created_date -> timestamp
expiry_time -> timestamp
updated_date -> timestamp
My question on these conditions :-
how should i form my cassandra primary key? as i dont see any way to acheive uniqueness on this (id is present but thats not required for me to process data).
do i even need the primary key if i use the filtering on spark code using table scan?
considering this for millions of records processing.

Answering to your question - you need to have a primary key, even if it consists only of the partition key :-)
More detailed answer really depends on how often these jobs are running, how much data overall, how many nodes in the cluster, what hardware is used, etc. Usually, we're trying to push as much filtering to Cassandra as possible, so it will return only relevant data, not everything. The most effective this filtering happens on the first clustering column, for example, if I want to process only newly created entries, then I can use the table with following structure:
create table test.test (
pk int,
tm timestamp,
c2 int,
v1 int,
v2 int,
primary key(pk, tm, c2));
and then I can fetch only newly created entries by using:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("test", "test").load()
val filtered = data.filter("tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
Or I can fetch entries in the given time period:
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
The effect of the filter pushdown could be checked by executing explain on the dataframe, and checking the PushedFilters section - conditions that are marked with * will be executed on Cassandra side...
But it's not always possible to design tables to match all queries, so you'll need to design primary key for jobs that are executed most often. In your case, update_date_time could be a good candidate for that, but if you put it as clustering column, then you'll need to take care when updating it - you'll need to perform change as batch, something like this:
begin batch
delete from table where pk = ... and update_date_time = old_timestamp;
insert into table (pk, update_date_time, ...) values (..., new_timestamp, ...);
apply batch;
or something like this.

Related

Data modelling to faciliate pruning/bulk update/delete in scylladb/cassandra

Lets say I have a table like below with a composite partition key.
CREATE TABLE heartrate (
pet_chip_id uuid,
date text,
time timestamp,
heart_rate int,
PRIMARY KEY ((pet_chip_id, date), time)
);
Lets say there is a batch job to prune all the data older than X. I can't do below query since its missing other partition key in the query.
DELETE FROM heartrate WHERE date < '2020-01-01';
How do you model your data such a way that this can be achieved in Scylla? I understand that internally scylla creates a partition based on partition keys but in this case its impossible to query all the list of pet_chip_id and do N queries to delete.
Just wanted to know how people do this outside RDBMS world.
The recommended way to delete old data automatically in Scylla is using the Time-to-live (TTL) feature:
When you write a row, you add "USING TTL 864000" is you want that data to be deleted automatically in 10 days. You can also specify a default TTL for a given table, so that every piece of data written to the table will get expired after (say) 10 days.
Scylla's TTL feature is separate from the data itself, so it doesn't matter which columns you used as partition keys or clustering keys - in particular the "date" column no longer needs to be a clustering key (or exist at all, for that matter) - unless you also need it for something else.
As #nadav-harel said in his answer if you can define a TTL that's always the best solution but if you can't, a possible solution is to create a materialized view to be able to list the primary keys of the main table based on the field that you need to use in the delete query. In the prune job you can first do a select from the MV and then delete from the main table using the values that you got from the MV.
Example:
CREATE TABLE my_table (
a uuid,
b text,
c text,
d int,
e timestamp
PRIMARY KEY ((a, b), c)
);
CREATE MATERIALIZED VIEW my_mv AS
SELECT a,
b,
c
FROM my_table
WHERE time IS NOT NULL
PRIMARY KEY (b, a, c);
Then in your prune job you could select from my_mv based on b and then delete from my_table based on the values returned from the select query.
Note that this solution might not be effective depending on your model and the amount of data you have, but keep in mind that deleting data is also a way of querying your data and your model should be defined based on your queries needs, i.e. before defining your model, you need to think about every way you will query it (including how you will prune your data).

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select an event with the highest start_time. It is possible? I've tried to create a secondary index, but no success.
This is a query I've created
select * from active_call order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all data and find the highest value. And this will require scanning of data on multiple nodes, and will be very long.
Materialized view also won't help much as order for data only exists inside an individual partition, so you will need to put all your data into a single partition that could be huge and data would be imbalanced.
I can only think of following workaround:
Have an additional table that will have all columns of the original table, but with a fake partition key and no clustering columns
You do inserts into that table in parallel to normal inserts, but use a fixed value for that fake partition key, and explicitly setting a timestamp for a record equal to start_time (don't forget to multiple by 1000 as timestamp uses microseconds). In this case it will guaranteed to be the value with the highest timestamp as Cassandra won't override it with other data with lower timestamp.
But this doesn't solve a problem with data skew, and all traffic will be handled by fixed number of nodes equal to RF.
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

How to find the delta difference for a table in cassandra using uuid column type

I have the following table on my Cassandra db, I want to find the delta difference in terms of cassandra query. For example, if I operate any insert,update,delete operation to the table I should be able to show which row/rows are getting impacted as my final result.
Let's say on first instance I have perform some 10 rows insertions so if I take the delta difference the output should only show that 10 rows are inserted. Same if we modify any number of rows or delete some rows then those changes should be captured.
Next time if we run the query it should idealy give 0 as we have not insert/modify/delete any row/rows
Here is the following table
CREATE TABLE datainv (
datainv_account_id uuid,
datainv_run_id uuid,
id uuid,
datainv_summary text,
json text,
number text,
PRIMARY KEY (datainv_account_id, datainv_run_id));
many things I have searched on internet but most of the solution are based on timeuuid,but in this case I have uuid columns only. So I'm not getting any solution that the same use-case can be achieved using uuid
It's not so easy to generate a diff between 2 table states in Cassandra, because you can't easily detect if you have inserted new partitions or not. You can implement something based on the timeuuid or on the timestamp as clustering column - in this case you'll able to filter out the data since latest change, as you have ordering of values that you don't have with uuid that is completely random. But it still requires that you perform the full scan of all the table. Plus it won't detect deletions...
Theoretically you can implement this with Spark as following:
read all primary key values & store this data in some other table/on disk;
next time, read all primary key values & find difference between original set of primary keys & new set - for example, do full outer join & use presence of None on left as addition, and presence of None on right as deletion;
store new set of the primary keys in a separate table/on disk, but previous version should be truncated.
but it will consume quite a lot of resources.

Spark and cassandra, range query on clustering key

I have cassandra table with following structure:
CREATE TABLE table (
key int,
time timestamp,
measure float,
primary key (key, time)
);
I need to create a Spark job which will read data from previous table, within specified start and end timestamp do some processing, and flush results back to cassandra.
So my spark-cassandra-connector will have to do a range query on clustering cassandra table column.
Are there any performance differences if I do:
sc.cassandraTable(keyspace,table).
as(caseClassObject).
filter(a => a.time.before(startTime) && a.time.after(endTime).....
so what I am doing is loading all the data into Spark and applying filtering
OR if I do this:
sc.cassandraTable(keyspace, table).
where(s"time>$startTime and time<$endTime)......
which filters all the data in Cassandra and then loads smaller subset to Spark.
The selectivity of a range query will be around 1%
It is impossible to include partition key in the query.
Which of these two solutions is preferred?
sc.cassandraTable(keyspace, table).where(s"time>$startTime and time<$endTime)
Will be MUCH faster. You are basically doing a percentage (if you only pull 5% of the data 5% of the total work) of the full grab in the first command to get the same data.
In the first case you are
Reading all of the data from Cassandra.
Serializing every object and then moving it to Spark.
Then finally filtering everything.
In the second case you are
Reading only the data you actually want from C*
Serializing only this tiny subset
There is no step 3
As an additional comment you can also put your case class type right in the call
sc.cassandraTable[CaseClassObject](keyspace, table)

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) Cassandra nosql dbms to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kind of queries that you are trying to preform. The reason that Cassandra is trying to keep you from performing queries on the date column, is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most( 99%)
of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response time. Unfortunately Cassandra isn't able to determine automatically what the best way to partition data. The end user must determine a schema which fits into the C* datamodel and allows the queries they want at a high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT timestamp FROM active_accounts where customer_name = ? and account_name = ?
This is because on the the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of CustomerName and AccountName. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999,Tom,32)]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
Allow filtering was added to the cql syntax to prevent users from accidentally designing queries that will not scale. The secondary indexes are really only for use by those do analytics jobs or those C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index necessarily will require contacting every node in the cluster. This will become less and less performant as the cluster grows and is definitely not something you want for a frequent query.

Resources