Why does varying blob size give different performance? - Cassandra

My Cassandra table looks like this:
CREATE TABLE cs_readwrite.cs_rw_test (
part_id bigint,
s_id bigint,
begin_ts bigint,
end_ts bigint,
blob_data blob,
PRIMARY KEY (part_id, s_id, begin_ts, end_ts)
) WITH CLUSTERING ORDER BY (s_id ASC, begin_ts DESC, end_ts DESC)
When I insert 1 million rows per client, with an 8 KB blob per row, and test insertion speed from different client hosts, the speed is almost constant at ~100 mbps. But with the same table definition, from the same client hosts, if I insert rows with 16 bytes of blob data, my speed numbers are dramatically lower, ~4 to 5 mbps. Why is there such a speed difference? I am only measuring write speeds for now. My main concern is not speed (though some inputs will help): when I add more clients, the speed stays almost constant for the bigger blob size, but for the 16-byte blob the speed increases only by 10-20% per added client before it becomes constant.
I have also looked at the bin/nodetool tablehistograms output and adjusted the number of partitions in my test data so that no partition is > 100 MB.
Any insights/ links for documentation would be helpful. Thanks!

I think you are measuring the throughput in the wrong way. The throughput should be measured in transactions per second and not in data written per second.
Even though the amount of data written can play a role in determining the write throughput of a system, it usually depends on many other factors:
Compaction strategy: STCS is write-optimized whereas LCS is read-optimized.
Connection speed and latency between the client and the cluster, and between machines in the cluster.
CPU usage of the node that is processing the data, sending it to the other replicas, and waiting for their acknowledgment.
Most writes are immediately written to memory instead of going directly to disk, which makes the impact of the amount of data being written on the final write throughput almost negligible, whereas fixed costs such as network delay and the CPU needed to coordinate the processing of data across nodes have a much bigger impact.
The way you should see it is that with an 8 KB payload you are getting X transactions per second and with 16 bytes you are getting Y transactions per second. Y will always be higher than X, but not in linear proportion to the size difference.
You can find how writes are handled in Cassandra explained in detail here.
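For illustration, a minimal sketch of how the same load test could report rows (transactions) per second alongside bytes per second, using the DataStax Python driver against the table from the question; the contact point, the partition spread and the synchronous loop are assumptions, not part of the original test:

import os
import time

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect("cs_readwrite")

insert = session.prepare(
    "INSERT INTO cs_rw_test (part_id, s_id, begin_ts, end_ts, blob_data) "
    "VALUES (?, ?, ?, ?, ?)"
)

BLOB_SIZE = 16          # run once with 16 and once with 8192 to compare
ROWS = 1_000_000
payload = os.urandom(BLOB_SIZE)

start = time.time()
for i in range(ROWS):
    # Spread rows over 1000 partitions; adjust to match your own test data.
    session.execute(insert, (i % 1000, i, i, i + 1, payload))
elapsed = time.time() - start

print(f"{ROWS / elapsed:,.0f} rows/s, "
      f"{ROWS * BLOB_SIZE / elapsed / 1e6:,.2f} MB/s of blob payload")

Comparing the rows/s figures between the two runs gives a much fairer picture of what the cluster is doing than comparing the MB/s figures.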

There's per-row/per-partition management overhead in Cassandra: the more data (in bytes) you have in each row, the less that overhead impacts throughput measured in bytes/sec. The reverse is true if you look at rows per second as your throughput metric: the larger the payloads, the worse your rows/sec throughput gets.
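Using the numbers from the question as a rough illustration (and reading "mbps" as MB/s): 100 MB/s with 8 KB rows is about 12,800 rows/s, while 5 MB/s with 16-byte rows is roughly 330,000 rows/s; the same ratio holds if the figures are megabits. In other words, the small-blob test is actually pushing far more rows, and therefore far more per-row overhead, through the cluster each second, even though its bytes-per-second figure looks much worse.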

Related

Cassandra : optimal partition size

I plan to have a simple table like this (simple key/value use case):
CREATE TABLE my_data (
id bigint,
value blob,
PRIMARY KEY (id)
)
With the following characteristics:
as you can see, one partition = one blob (value)
each value is always accessed by the corresponding key
each value is a blob of 1MB max (average also 1 MB)
with 1 MB blobs, that gives 60 million partitions
What do you think about the 1 MB blob? Is that OK for Cassandra?
Indeed, I could divide my data further and work with 1 KB blobs, but in that case it would lead to many more partitions in Cassandra (more than 600 million?), and many more partitions to retrieve for the same client-side query.
Thanks
The general recommendation is to stay close to 100 MB partition sizes or below, although this isn't a hard limit. There are some edge cases where partitions can get beyond 1 GB and still be acceptable for some workloads, as long as you're willing to accept the tradeoffs.
In your case, 1 MB blobs are right at the strongly recommended maximum, but again this is not a hard limit. You will notice a significant performance hit for larger blob sizes if you do a reasonable load test.
600 million partitions is not a problem at all. Cassandra is designed to handle billions, trillions of partitions and beyond. Cheers!
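If you did decide to split each value into smaller chunks, as the question considers, one way to avoid multiplying the partition count is to cluster all chunks of a value under the same id, so one query still retrieves the whole blob. A hypothetical sketch with the DataStax Python driver; the chunked table, keyspace, contact point and 64 KB chunk size are all assumptions, not part of the original design:

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # assumed keyspace

# Hypothetical chunked variant: one partition per id, one row per chunk,
# so the partition count stays the same as with a single 1 MB blob.
session.execute("""
    CREATE TABLE IF NOT EXISTS my_data_chunked (
        id        bigint,
        chunk_idx int,
        value     blob,
        PRIMARY KEY (id, chunk_idx)
    )
""")

CHUNK_SIZE = 64 * 1024  # assumed chunk size

insert = session.prepare(
    "INSERT INTO my_data_chunked (id, chunk_idx, value) VALUES (?, ?, ?)"
)
select = session.prepare(
    "SELECT value FROM my_data_chunked WHERE id = ?"
)

def write_value(key: int, data: bytes) -> None:
    for offset in range(0, len(data), CHUNK_SIZE):
        session.execute(insert, (key, offset // CHUNK_SIZE,
                                 data[offset:offset + CHUNK_SIZE]))

def read_value(key: int) -> bytes:
    # Chunks come back ordered by chunk_idx (the clustering key).
    return b"".join(row.value for row in session.execute(select, (key,)))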

Throughput Units and Partition Count

I have a question regarding partition count in relation to TUs. We have the configuration below and 3 TUs for the namespace; will the number of partitions for each event hub have an impact, and should we just set the partition count to 32 for better performance? FYI, we are using the standard plan and kept the partition count higher for the first event hub as it receives more messages. We also use the batch method to send messages to the event hub.
There is a potential issue with having 3 TUs: if the namespace has 3 TUs, then the maximum ingress per minute is 1 MB * 60 * 3 = 180 MB/minute, but in the table you posted the total size is larger than 180 MB (109 + 58 + 39).
As for TUs and partition count, you should take a look at "How many partitions do I need?" and "Partitions". You can follow the guidance below from those articles:
We recommend that you balance 1:1 throughput units and partitions to achieve optimal scale. A single partition has a guaranteed ingress and egress of up to one throughput unit. While you may be able to achieve higher throughput on a partition, performance is not guaranteed. This is why we strongly recommend that the number of partitions in an event hub be greater than or equal to the number of throughput units.
Plan on at most 1 MB/sec per partition. In other words, think of each partition as an individual stream that can process at most 1 MB/sec of traffic. That said, your current configuration looks alright to me. However, you can still consider increasing the partition count depending on your traffic growth trajectory.
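Since you mention using the batch method, here is a minimal sketch of batched sends with the azure-eventhub Python SDK; the connection string and event hub name are placeholders, and create_batch() enforces the per-batch size limit for you:

from azure.eventhub import EventHubProducerClient, EventData  # pip install azure-eventhub

producer = EventHubProducerClient.from_connection_string(
    conn_str="<namespace connection string>",   # placeholder
    eventhub_name="<event hub name>",            # placeholder
)

with producer:
    batch = producer.create_batch()
    events_in_batch = 0
    for msg in (b"message-1", b"message-2", b"message-3"):
        try:
            batch.add(EventData(msg))
            events_in_batch += 1
        except ValueError:
            # Batch hit its size limit: send it and start a new one.
            producer.send_batch(batch)
            batch = producer.create_batch()
            batch.add(EventData(msg))
            events_in_batch = 1
    if events_in_batch:
        producer.send_batch(batch)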

Slow bulk insert in Spanner

I'm trying to load a 1 TB dataset into Spanner and I can't get past 5 MB/s with 3 nodes.
The main issue is that the dataset I want to load is mainly composed of integers plus some null columns, so the commit size is very small, under 500 KB.
I've followed the rules for bulk insert defined at https://cloud.google.com/spanner/docs/bulk-loading (partitions, workers, etc.). If I add a text column to make the commit size bigger (2.5 MB), I can reach a throughput of 70 MB/s.
However, with a dataset composed of integers and nulls I don't know how to make the importer faster than 5 MB/s.
Table definition:
id INT64 NOT NULL,
date DATE NOT NULL,
type STRING(16) NOT NULL,
category STRING(3) NOT NULL,
quadkey STRING(18) NOT NULL,
subcategory STRING(2) NOT NULL,
txn INT64,
accouunts INT64,
acct_cnt FLOAT64,
avg_freq FLOAT64,
avg_spend_amt FLOAT64,
avg_ticket FLOAT64,
txn_amt FLOAT64,
txn_cnt FLOAT64,
) PRIMARY KEY (id)
To improve throughput, the recommendation is to commit 1 MB to 5 MB at a time: https://cloud.google.com/spanner/docs/bulk-loading#commit-size
Perhaps you could try batching more rows together in a single transaction.
Also see https://cloud.google.com/spanner/docs/bulk-loading#test-measure and https://medium.com/google-cloud/cloud-spanner-maximizing-data-load-throughput-23a0fc064b6d
Hope this helps!
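To make the "batch more rows per transaction" suggestion concrete, here is a minimal sketch with the google-cloud-spanner Python client; the instance, database and table names are placeholders, and the 3,000-rows-per-commit figure is only an example of grouping enough mostly-integer rows to land in the 1-5 MB commit range:

import datetime
import random

from google.cloud import spanner  # pip install google-cloud-spanner

client = spanner.Client()
instance = client.instance("my-instance")    # placeholder
database = instance.database("my-database")  # placeholder

ROWS_PER_COMMIT = 3000  # example: tune so each commit lands around 1-5 MB

def generate_rows(start, count):
    # Dummy rows matching (part of) the schema from the question.
    for i in range(start, start + count):
        yield (i, datetime.date(2020, 1, 1), "PURCHASE", "GRO",
               "0231010221312310", "01", random.randint(0, 1000))

columns = ("id", "date", "type", "category", "quadkey", "subcategory", "txn")

for batch_start in range(0, 30000, ROWS_PER_COMMIT):
    # Each database.batch() context commits one mutation group.
    with database.batch() as batch:
        batch.insert(
            table="my_table",  # placeholder table name
            columns=columns,
            values=list(generate_rows(batch_start, ROWS_PER_COMMIT)),
        )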
There are many factors that contribute to a maximized throughput when loading data in Spanner.
1- The first step is to ensure that there are no network capacity limits or added latencies caused by loading the datasets into Spanner from an inefficient storage solution. Having your datasets in a Cloud Storage bucket in the same region as your Spanner instance and loading from there is a safe way to do so.
2- Then Spanner needs to partition the data by primary key so that the splits are equally distributed and the nodes carry an equal workload, which ensures the best performance. Spanner organizes rows lexicographically, so here are the recommendations on how to optimize the splits. This avoids "hotspots", where some nodes have a higher workload than others. This can be assessed with the CPU utilization - high priority graph accessible via the console. Keeping that metric at or below 65% (the recommended limit) is a good way to ensure hotspots are not occurring.
3- Having many splits can sometimes seem to increase efficiency, but that benefit is reduced by the coordination overhead it creates. This is where appropriately sized, ordered batches strike the right balance to maximize throughput. This blog post covers the concept with a detailed example; I highly suggest you read it in full.
As mentioned in the documentation, it is possible to achieve bulk writes of 10-20 MB/s, but since your commits are not in the 1-5 MB range, this might not be achievable even when following the best practices. In that case, the number of operations/s, rows/s or requests/s might be a more relevant metric, and 6 MB/s could in fact be a good throughput considering the table schema. These metrics can be accessed via the console, or via Stackdriver Monitoring for more advanced insights.
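On point 2, if the id values in the schema above are assigned sequentially, every insert lands on the same split. One common way to spread sequential ids across the key space (a sketch of the general technique, not something taken from the question) is to bit-reverse the id before using it as the primary key:

def bit_reverse_63(n: int) -> int:
    # Reverse the low 63 bits of a non-negative sequential id so that
    # consecutive ids map to keys spread across the whole INT64 key space.
    result = 0
    for _ in range(63):
        result = (result << 1) | (n & 1)
        n >>= 1
    return result

# Example: three consecutive ids end up far apart in key order.
for seq_id in (1, 2, 3):
    print(seq_id, bit_reverse_63(seq_id))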

Cassandra partitioning strategy for systems with skewed traffic

Please bear with me for a slightly longer problem description.
I am a newbie to the Cassandra world and I am trying to migrate my current product from an Oracle-based data layer to Cassandra.
In order to support range queries I have created an entity like below:
create table if not exists my_system.my_system_log_dated(
id uuid,
client_request_id text,
tenant_id text,
vertical_id text,
channel text,
event text,
event_type text,
created_date date,
primary key((created_date, tenant_id, vertical_id, channel, event),
event_type, client_request_id, id)
) with clustering order by (created_date desc);
Now, I have come across several pieces of documentation/resources/blogs that mention I should keep my partition size under 100 MB for an optimally performing cluster. With the volume of traffic my system handles per day for certain combinations of the partitioning key, there is no way I can keep it under 100 MB with the above partitioning key.
To fix this I introduced a new factor called bucket_id, and was thinking of assigning it the hour-of-day value to further break partitions into smaller chunks and keep them under 100 MB (even though this means I have to do 24 reads to serve traffic details for one day, I am fine with some inefficiency in reads; see the sketch after the schema below). Here is the schema with bucket_id:
create table if not exists my_system.my_system_log_dated(
id uuid,
client_request_id text,
tenant_id text,
vertical_id text,
channel text,
event text,
bucket_id int,
event_type text,
created_date date,
primary key((created_date, tenant_id, vertical_id, channel, event,
bucket_id), event_type, client_request_id, id)
) with clustering order by (created_date desc);
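A sketch of how the hour-of-day bucket could be written and then read back with 24 queries per day, using the DataStax Python driver; everything apart from the schema above (contact point, helper names, argument types) is an assumption:

import uuid
from datetime import datetime, date

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])   # assumed contact point
session = cluster.connect("my_system")

insert = session.prepare("""
    INSERT INTO my_system_log_dated
        (created_date, tenant_id, vertical_id, channel, event,
         bucket_id, event_type, client_request_id, id)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""")

select = session.prepare("""
    SELECT * FROM my_system_log_dated
    WHERE created_date = ? AND tenant_id = ? AND vertical_id = ?
      AND channel = ? AND event = ? AND bucket_id = ?
""")

def write_event(ts: datetime, tenant, vertical, channel, event,
                event_type, request_id):
    bucket_id = ts.hour  # hour-of-day bucket, as described above
    session.execute(insert, (ts.date(), tenant, vertical, channel, event,
                             bucket_id, event_type, request_id, uuid.uuid4()))

def read_day(day: date, tenant, vertical, channel, event):
    # One query per hourly bucket: 24 reads to cover the whole day.
    rows = []
    for bucket_id in range(24):
        rows.extend(session.execute(select, (day, tenant, vertical, channel,
                                             event, bucket_id)))
    return rows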
Even with this, a couple of combinations of the partition key go beyond 100 MB while all other volumes sit comfortably within the range.
With this situation in mind I have below questions:
Is it an absolute blunder to have a few of your partitions go beyond the 100 MB limit?
Though with an even smaller bucket, say a 15-minute window, I get all combinations of the partition key under 100 MB, that too creates heavily skewed partitions: high-volume combinations of the partition key go up to 80 MB while the remaining ones are well under 15 MB. Is this something that will adversely impact the performance of my cluster?
Is there a better way to solve this problem?
Here is some more info that I thought may be useful:
Avg row size for this entity is around 200 bytes
I am also considering a load future proofing factor of 2 and estimating for double the load.
Peak load for a specific combination of the partition key is around 2.8 million records in a day,
the same combination has a peak traffic hour of about 1.4 million records,
and the same combination in a 15-minute window is around 550,000 records.
Thanks in advance for your inputs!!
Your approach with the bucket id looks good. Answering your questions:
No, it's not a hard limit, and actually it might even be too low considering the hardware improvements of the last few years. I have seen partitions of 2 GB and 5 GB (though they can give you a lot of headaches when doing repairs), but those are extreme cases; don't go anywhere near those values. Bottom line: if you don't go WAY above those 100 MB, you will be fine. If you have at least 15 GB of RAM, use G1GC and you're golden.
A uniform distribution of partition sizes is important to keep the data load balanced throughout the cluster, and it also means your queries will stay close to an average latency (because they will be reading roughly the same amount of data), but it's not something that will cause performance issues on its own.
The approach looks good, but if this is a time series, which I think it is given what you described, then I recommend using TWCS (TimeWindowCompactionStrategy) on my_system.my_system_log_dated. Check how to configure this compaction strategy, because the time window you set will be very important.
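As a sketch of what enabling TWCS could look like (the 1-hour window is only an example; the right unit and size depend on your retention and on how much data each window holds):

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect()  # assumed contact point

# Example only: choose the window so the full retention period spans a
# reasonable number of windows (a few dozen is a common rule of thumb).
session.execute("""
    ALTER TABLE my_system.my_system_log_dated
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',
        'compaction_window_size': '1'
    }
""")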
I was able to devise a bucketisation scheme that prevents any risk to cluster health from unexpected traffic spikes. The approach is described here: https://medium.com/walmartlabs/bucketisation-using-cassandra-for-time-series-data-scans-2865993f9c00

Cassandra read performance degrade as we increase data on nodes

DB used: DataStax Cassandra Community 3.0.9
Cluster: 3 x (8-core, 15 GB AWS c4.2xlarge) with 300 GB io1 volumes at 3000 IOPS
Write consistency: QUORUM, read consistency: ONE, replication factor: 3
Problem:
I loaded our servers with 50,000 users, and each user had 1000 records initially; after some time, 20 more records were added to each user. I wanted to fetch the 20 additional records that were added later (query: select * from table where userID='xyz' and timestamp > 123), where user_id and timestamp are part of the primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20 GB of dummy data, the performance for the same query, i.e. fetching the 20 additional records for the 50,000 users, dropped significantly. Read performance is degrading as data on the nodes increases. As far as I have read, this should not happen, because keys get cached and additional data should not matter.
What could be the possible cause for this? CPU and RAM utilisation is negligible and I can't find out what is causing the query time to increase.
I have tried changing the compaction strategy to "LeveledCompaction", but that didn't work either.
EDIT 2
Heap size is 8 GB. The 20 GB of data was added in a way similar to how the initial 4 GB of data (the 50k userIDs) was added, and this was done to simulate a real-world scenario. The userID and timestamp values for the 20 GB of data are different and generated randomly. The scenario is that I have 50k userIDs with 1020 rows each, where 1000 rows were added first and an additional 20 rows were added after some timestamp; I am fetching these 20 messages. It works fine when only the 50k userIDs are present, but once I have more userIDs (the additional 20 GB) and I try to fetch those same 20 messages (for the initial 50k userIDs), the performance degrades.
EDIT 3
cassandra.yaml
Read performance is getting degraded with increase in data.
This should only happen when you add a lot of records to the same partition.
From what I can understand, your table may look like:
CREATE TABLE tbl (
userID text,
timestamp timestamp,
....
PRIMARY KEY (userID, timestamp)
);
This model is good enough when the volume of data in a single partition is "bound" (e.g. you have at most 10k rows in a single partition). The reason is that the coordinator gets a lot of pressure when dealing with "unbound" queries (that's why very large partitions are a big no-no).
That "rule" can be easily overlooked, and the net result is an overall slowdown, which can be explained simply: C* needs to read more and more data (and it will all be read from one node only) to satisfy your query, keeping the coordinator busy and slowing down the entire cluster. Data growth usually means slower query responses, and after a certain threshold you hit the infamous read timeout error.
That being said, it would be interesting to see whether your disk usage is "normal" or something is wrong there. Give it a shot with dstat -lrvn to monitor your servers.
A final tip: depending on how many fields you query with SELECT * and on the amount of data retrieved, being served by an SSD may not be a big deal because you won't exploit the IOPS of your SSDs. In such cases, an ordinary HDD could lower the cost of the solution without incurring any penalty.
