Cassandra trigger on COPY FROM [MAXBATCHSIZE]

When I run the following CQL, I found that the Cassandra trigger is not run once per record, but once per batch.
COPY XXX_Table FROM 'xxxx.csv' WITH MAXBATCHSIZE=10
For example, I have a CSV file with 2,000,000 records; after the above CQL is run, there are 2,000,000 records in Cassandra, but the trigger is run only 200,000 times.
why?

It's because some of the data in your CSV file shares the same partition key.
When importing data, the parent process reads from the input file(s) chunks with CHUNKSIZE rows and sends each chunk to a worker process. Each worker process then analyses a chunk for rows with common partition keys. If at least 2 rows with the same partition key are found, they are batched and sent to a replica that owns the partition. You can control the minimum number of rows with a new option, MINBATCHSIZE, but it is advisable to leave it set to 2. Rows that do not share a common partition key get batched with other rows whose partition keys belong to a common replica. These rows are then split into batches of size MAXBATCHSIZE, currently 20 rows. These batches are sent to the replicas where the partitions are located. Batches are of type UNLOGGED in both cases.
based on:
link
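
As a rough sanity check on the numbers in the question (assuming, as the observation suggests, that the trigger fires once per UNLOGGED batch rather than once per CSV row):

# Back-of-the-envelope arithmetic only; values mirror the question.
TOTAL_ROWS = 2_000_000
MAXBATCHSIZE = 10

batches = (TOTAL_ROWS + MAXBATCHSIZE - 1) // MAXBATCHSIZE  # ceiling division
print(f"{TOTAL_ROWS} rows -> about {batches} batches / trigger invocations")
# 2000000 rows -> about 200000 batches / trigger invocations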

Related

Optimize Partitioning for billions of distinct keys

I'm processing a file each day with PySpark containing information about device navigation through the web. At the end of each month I want to use window functions in order to get the navigation journey for each device. It's very slow processing, even with many nodes, so I'm looking for ways to speed it up.
My idea was to partition the data, but I have 2 billion distinct keys, so partitionBy does not seem appropriate. Even bucketBy might not be a good choice, because I create n buckets each day, so the files are not appended; instead, for each day, x part files are created.
Does anyone have a solution ?
So here is an example of the export for each day (inside each parquet file we find 9 partitions):
And here is the partitionBy query that we launch at the beginning of each month (compute_visit_number and compute_session_number are two UDFs that I've created on the notebook):
You want to ensure that each device's data is in the same partition to prevent exchanges when you do your window function, or at least to minimise the number of partitions the data could be in.
To do this I would create a column called partitionKey when you write the data, containing a mod of the mc_device column, where the number you mod by is the number of partitions you want. Base this number on the size of the cluster that will run the end-of-month query. (If mc_device is not an integer, then create a checksum first.)
You can create a secondary partition on the date column if still needed.
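A minimal PySpark sketch of that write (the partition count, output path, and date column name are assumptions; mc_device comes from the question):

from pyspark.sql import functions as F

NUM_PARTITIONS = 200  # size this to the cluster that runs the month-end job

# hash() acts as the checksum step for non-integer ids, then mod into buckets
df = df.withColumn(
    "partitionKey",
    F.abs(F.hash("mc_device")) % NUM_PARTITIONS,
)

(df.write
   .partitionBy("partitionKey", "date")  # date as the optional secondary partition
   .mode("append")
   .parquet("/data/navigation"))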
Your end-of-month query should then change to:
from pyspark.sql.window import Window
w = Window.partitionBy('partitionKey', 'mc_device').orderBy('event_time')
If you kept the date as a secondary partition column, then repartition the dataframe to partitionKey only:
df = df.repartition('partitionKey')
At this point each device's data will be in the same partition and no exchanges should be needed. The sort should be faster and your query will hopefully complete in a sensible time.
If it is still slow you need more partitions when writing the data.

Cassandra - Composite Partition Keys and Performance

I am working on the keyspace and tables for a Cassandra environment. I understand the size limitations of Cassandra and how to deal with partition keys to keep things optimized. However, I am having a disagreement with a developer about how to handle the keys. Is there any downside to having a key that would include a large amount of data rather than a small amount? For example:
I have 100k records. I can create a key that will partition this into partitions of 10k records each; I could also create a key that will partition it into partitions of 10 records each (by day). So I either store 10k records in each of 10 partitions, or 10 records in each of 10,000 partitions.
Keep in mind that having more columns in the partition key requires you to specify those columns in your SELECT statements, which sometimes isn't desired. The more partitions the better, whether by picking a better single column or by having multiple columns.
Cassandra reads data via the partition key, and performance can be helped further if clustering columns are used. If you have a large partition, the entire partition must be read (memory and disk) and then merged for the output, so large partitions will definitely slow you down.
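To make the trade-off concrete, here is a rough sketch with the Python driver (keyspace, table, and column names are made up for the example):

import datetime
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Composite partition key (day, bucket): many small partitions instead of a few large ones.
session.execute("""
    CREATE TABLE IF NOT EXISTS records_by_day (
        day     date,
        bucket  int,
        rec_id  timeuuid,
        payload text,
        PRIMARY KEY ((day, bucket), rec_id)
    )
""")

# The downside: every read must now supply all partition-key columns.
rows = session.execute(
    "SELECT * FROM records_by_day WHERE day = %s AND bucket = %s",
    (datetime.date(2023, 1, 15), 3),
)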

Partitioning the data into an equal number of records for each group in a Spark data frame

We have 1 month of data, and each day's data ranges from 10 to 100 GB in size. We will be writing this data set out in a partitioned manner. In our case we have a DATE column, which we use to partition the data frame (partitionBy("DATE")). We also apply repartition to this data frame to create single or multiple files. If we repartition to 1, it creates 1 file for each partition; if we set it to 5, it creates 5 files per partition, which is fine.
But what we are trying to do here is make sure each group (the partitioned data for a date) is written as equally sized files (either by number of records or by file size).
We have used the Spark data frame option "maxRecordsPerFile" and set it to 10 million records, and this works as expected. But for 10 days of data, if I do this in one go, it eats up the execution time, as it collects all 10 days of data and tries to do some distribution.
If I don't set this parameter and don't repartition to 1, then this activity completes in 5 minutes, but if I just set partitionBy("DATE") and the maxRecordsPerFile option, it takes almost an hour.
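This is roughly what our slow write looks like (the output path here is just a placeholder):

# Per-date directories, capped at 10 million records per output file.
(df.write
   .option("maxRecordsPerFile", 10_000_000)
   .partitionBy("DATE")
   .mode("overwrite")
   .parquet("/output/monthly_data"))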
Looking forward to some help on this!
~Krish

Cassandra failure during read query

I have a Cassandra table with ~500 columns and primary key ((userId, version, shredId), rowId), where shredId is used to distribute data evenly into different partitions. The table also has a default TTL of 2 days to expire data, as the data is for real-time aggregation. The compaction strategy is TimeWindowCompactionStrategy.
The workflow is:
Write data to the input table (with consistency EACH_QUORUM)
Run Spark aggregation (on rows with the same userId and version)
Write aggregated data to the output table.
But I'm getting "Cassandra failure during read query" when the size of the data gets large; more specifically, once there are more than 210 rows in one partition, read queries fail.
How can I tune my database and change properties to fix this?
After investigation and research, the issue was caused by null values being inserted for some empty columns. This creates a large number of tombstones and eventually times out the query.
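A minimal sketch of one way to avoid that with the Python driver, binding UNSET_VALUE instead of null so the column is simply skipped (table and column names are placeholders):

from cassandra.cluster import Cluster
from cassandra.query import UNSET_VALUE

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
insert = session.prepare(
    "INSERT INTO output_table (user_id, version, shred_id, row_id, col_a) "
    "VALUES (?, ?, ?, ?, ?)"
)

col_a = None  # this column happens to be empty for the current row
session.execute(
    insert,
    ("u1", 1, 0, "r1", col_a if col_a is not None else UNSET_VALUE),
)
# UNSET_VALUE leaves the column untouched, so no null (tombstone) is written.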

Cassandra - Data Modeling Time Series - Avoiding "Hot Spots"?

I'm working on a Cassandra data model to store records uploaded by users.
The potential problem is that some users may upload 50-100k rows in a 5 minute period, which can result in a "hot spot" for the partition key (user_id). (The DataStax recommendation is to rethink the data model if there are more than 10k rows per partition.)
How can I avoid having too many records on a partition key in a short amount of time?
I've tried using the time series suggestions from DataStax, but even if I had year, month, day, and hour columns, a hot spot could still occur.
CREATE TABLE uploads (
    user_id text,
    rec_id timeuuid,
    rec_key text,
    rec_value text,
    PRIMARY KEY (user_id, rec_id)
);
The use cases are:
Get all upload records by user_id
Search for upload records by date range
A few possible ideas:
Use a compound partition key instead of just user_id. The second part of the partition key could be a random number from 1 to n. For example, if n were 5, then your uploads would be spread out over five partitions per user instead of just one. The downside is that when you do reads, you have to repeat them n times to read all the partitions (see the sketch after this list).
Have a separate table to handle incoming uploads, using the rec_id as the partition key. This would spread the load of uploads equally across all the available nodes. Then, to get that data into the table with user_id as the partition key, periodically run a Spark job to extract new uploads and add them to the user_id-based table at a rate that the single partitions can handle.
Modify your front end to throttle the rate at which an individual user can upload records. If only a few users are uploading at a high enough rate to cause a problem, it may be easier to limit them rather than modify your whole architecture.
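
A rough sketch of the first idea with the Python driver (bucket count and table name are assumptions):

import random
from cassandra.cluster import Cluster

N_BUCKETS = 5  # number of sub-partitions per user
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Assumed schema: PRIMARY KEY ((user_id, bucket), rec_id)
# Writes scatter a user's uploads across N_BUCKETS partitions.
insert = session.prepare(
    "INSERT INTO uploads_bucketed (user_id, bucket, rec_id, rec_key, rec_value) "
    "VALUES (?, ?, now(), ?, ?)"
)
session.execute(insert, ("user42", random.randrange(N_BUCKETS), "key", "value"))

# Reads have to fan out over all N_BUCKETS partitions for that user.
select = session.prepare(
    "SELECT * FROM uploads_bucketed WHERE user_id = ? AND bucket = ?"
)
rows = [row
        for b in range(N_BUCKETS)
        for row in session.execute(select, ("user42", b))]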
