Batch insert overflow - cassandra

I am using Cassandra 3.10 and am trying to follow best practice by having a table per query, so I am using the batch insert principle to insert into multiple tables as a single transaction. However, I get the following error in the Cassandra log.
Batch for [zed.payment, zed.trade_party_b_ref, zed.trade_product_type, zed.trade, zed.fx_variance_swap, zed.trade_party_a_ref, zed.trade_party_b_trade_id, zed.market_value] is of size 5.926KiB, exceeding specified threshold of 5.000KiB by 0.926KiB.

The log is saying that you are sending a batch of almost 6MB when the limit is 5MB.
You should send smaller batches of data to avoid going over that batch size limit.
You can also change the batch size limit in cassandra.yaml, but I would not recommend changing it.
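For illustration, a minimal sketch with the DataStax Java driver 3.x of keeping each logged batch small; the column names here are made up, not your actual schema:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class SmallBatchExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("zed");

        // Hypothetical prepared inserts for two of the per-query tables.
        PreparedStatement insertTrade =
                session.prepare("INSERT INTO trade (trade_id, product) VALUES (?, ?)");
        PreparedStatement insertPayment =
                session.prepare("INSERT INTO payment (trade_id, amount) VALUES (?, ?)");

        // Keep each logged batch to a handful of statements so its serialized
        // size stays under batch_size_warn_threshold_in_kb.
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(insertTrade.bind("T-1", "fx_variance_swap"));
        batch.add(insertPayment.bind("T-1", 1000000.0));
        session.execute(batch);

        cluster.close();
    }
}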

Thanks for the info. The parameter in cassandra.yaml is:
Log WARN on any multiple-partition batch size exceeding this value. 5kb per batch by default.
Caution should be taken on increasing the size of this threshold as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
which is in KB, not MB, so my batch statement is really 6KB, not 6MB.
After 30 years working with Oracle, this is my first venture into Cassandra, so I have tried to follow the guideline of having a separate table for each query: where I have a financial trade table that has to be queried in up to 8 different ways, I have 8 tables. That then implies that an insert into the tables must be done in a batch to create what would be a single transaction in Oracle. The master table of the eight has a significant number of sibling tables which must also be included in the batch, so here is my point:
If Cassandra does not support transactions but relies on the batch functionality to achieve the same effect, it must not impose a limit on the size of the batch. If this is not possible, then Cassandra is really limited to applications with VERY simple data structures.

Related

Is RethinkDB a good fit for a generic Real-time aggregation platform? [closed]

I need your help to verify if RethinkDB fits my use case.
Use case
My team is building a generic Real-time aggregation platform which needs to:
join data from a lot of Kafka topics
Joins need to be done on raw data
Topics have the same key
Data in topics is sometimes a "snapshot" (updatable) and sometimes an "event" (non-updatable)
The destination of the joined data will be some analytical OLAP DB (ClickHouse, Druid, etc., depending on the case). These systems work with "deltas" (SCDs). Because of the "snapshots", I need stateful processing.
Updates for snapshots can come up to 7 days later
Topics receive around 20k msg/s with peaks up to 200k msg/s
Data in topics is json from 100 Bytes to 5kB
Data in topics can have duplicates
Duplicates are deduplicated with “version” json field which is part of every topic. Data should be processed only if new_version > old_version. Or if old_version didn't exist.
I already have a POC with Cassandra with five stages:
Cassandra Inserter - consumes from all Kafka topics, doing insert-only writes for all topics into the same Cassandra table. Sharding is done on the column that holds the Kafka key, so all the messages with the same key end up in the same shard.
For every Cassandra insert an InsertEvent is produced to Kafka
Delta calculator - consumes InsertEvents and queries Cassandra by the sharding key. It gets all the raw data, then deduplicates and creates deltas. The state is saved in another Cassandra cluster by storing all the processed "versions". The next time an InsertEvent comes, we use the saved state "version" to get only two events, previous and current, so we can create a DeltaEvent (see the sketch after this list)
DeltaEvent is produced to Kafka
ClickHouse / Druid ingest the data
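For illustration, the dedup/delta step of the Delta calculator is roughly this (a simplified sketch; the Msg/DeltaEvent classes and field names are placeholders, not the real code):

import java.util.Optional;

// Simplified sketch of the dedup + delta step. Msg and DeltaEvent are
// placeholders for the real message and state classes.
record Msg(String key, long version, String payload) {}
record DeltaEvent(Msg previous, Msg current) {}

class DeltaCalculator {

    // Process a message only if it is newer than the last version we saw
    // for that key, or if we have never seen the key before.
    Optional<DeltaEvent> onInsertEvent(Msg incoming, Optional<Msg> lastProcessed) {
        if (lastProcessed.isPresent()
                && incoming.version() <= lastProcessed.get().version()) {
            return Optional.empty(); // duplicate or stale version: drop it
        }
        // Emit previous vs current so the OLAP side can apply a delta.
        return Optional.of(new DeltaEvent(lastProcessed.orElse(null), incoming));
    }
}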
So it's basically a 50/50 insert/read workload without updates to Cassandra.
With 14 Cassandra data nodes and 8 state nodes, it works OK up to 20k InsertEvent/s. With 25k InsertEvent/s the system begins to lag.
Nodes have 16GB RAM and disks are network storage backed by SSD (not ideal, I know, but I can't change it now). The network is 10 Gbit.
RethinkDB idea
I would like to do a new POC to try RethinkDB and use changefeeds to create deltas and to deduplicate. For this I would use a single table. Primary key / sharding key would be the Kafka key and all Kafka data from all topics with the same key would be joined/upserted in a single document.
The workload would probably be 10/90 insert/update. I would use squash: true to avoid excessive reads and reduce the number of DeltaEvents.
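Roughly, the changefeed I have in mind would look like this with the RethinkDB Java driver (just a sketch; the table name is made up and the exact result type returned by run() varies between driver versions - older versions return a Cursor):

import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;
import com.rethinkdb.net.Result;

public class ChangefeedSketch {
    public static void main(String[] args) {
        RethinkDB r = RethinkDB.r;
        Connection conn = r.connection().hostname("localhost").port(28015).connect();

        // Follow changes on the joined/upserted documents; squash batches
        // rapid successive changes to the same document into one change.
        Result<Object> changes = r.db("test").table("joined_docs")
                .changes()
                .optArg("squash", true)
                .run(conn);

        while (changes.hasNext()) {
            Object change = changes.next(); // has old_val / new_val to build a DeltaEvent from
            System.out.println(change);
        }
    }
}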
Do you think this is a good use case for RethinkDB?
Will it scale up to 200k msg/s, which would be 20k inserts/s, 180k updates/s and around 150k reads/s via changefeeds?
I will need to delete data older than 7 days; how will that affect the insert/update/query workload?
Do you have a proposal for a system which would be a better fit for this use case?
Thanks a lot,
Davor
PS: if you prefer reading a document, here it is: RethinkDB use case question.
IMHO, RethinkDB is a good fit for your use case.
From RethinkDB docs
...RethinkDB scales to perform 1.3 million individual reads per second. ...RethinkDB performs well above 100 thousand operations per second in a mixed 50:50 read/write workload - while at the full level of durability and data integrity guarantees. ...performed all benchmarks across a range of cluster sizes, scaling up from one to 16 nodes.
Folks at RethinkDB have tested a similar scenario using workloads from the YCSB benchmark suite and reported their results.
We found that in a mixed read/write workload, RethinkDB with two servers was able to perform nearly 16K queries per second (QPS) and scaled to almost 120K QPS while in a 16-node cluster. Under a read only workload and synchronous read settings, RethinkDB was able to scale from about 150K QPS on a single node up to over 550K QPS on 16 nodes. Under the same workload, in an asynchronous “outdated read” setting, RethinkDB went from 150K QPS on one server to 1.3M in a 16-node cluster.
Selecting workloads and hardware
...Out of the YCSB workload options, we chose to run workload A which comprises 50% reads and 50% update operations, and workload C which performs strictly read operations. All documents stored by the YCSB tests contain 10 fields with randomized 100 byte strings as values, with each document totaling about 1 KB in size.
Given the ease of scaling RethinkDB clusters across multiple instances, we deemed it necessary to observe performance when moving from a single RethinkDB instance to a larger cluster. We tested all of our workloads on a single instance of RethinkDB up to a 16-node cluster in varying increments of cluster size.
Additionally, I suggest reading through the limitations of RethinkDB. I've copied some here.
There is a hard limit of 64 shards.
While there is no hard limit on the size of a single document, there is a recommended limit of 16MB for memory performance reasons.
The maximum size of a JSON query is 64M.
Primary keys are limited to 127 characters.
Secondary indexes do not store objects or null values.
Primary key strings may not include the null codepoint (U+0000).
By default, arrays on the RethinkDB server have a size limit of 100,000 elements. This can be changed on a per-query basis with the arrayLimit (or array_limit) option to run.
RethinkDB does not support Unicode collations, and does not normalize for identical characters with multiple codepoints (i.e, \u0065\u0301 and \u00e9 both represent the character “é” but RethinkDB treats them, and sorts them as, distinct characters).
Since yours is a real-time system, RethinkDB's memory requirements and crash recovery are also worth a read.
Furthermore, a delete performance benchmark is missing.

Cassandra read performance degrades as we increase data on nodes

DB used: Datastax cassandra community 3.0.9
Cluster: 3 x (8core 15GB AWS c4.2xlarge) with 300GB io1 with 3000iops.
Write consistency: QUORUM, read consistency: ONE, replication factor: 3
Problem:
I loaded our servers with 50,000 users, and each user had 1000 records initially; after some time, 20 more records were added to each user. I wanted to fetch the 20 additional records that were added later (query: select * from table where userID='xyz' and timestamp > 123); here userID and timestamp are part of the primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20GB of dummy data, the performance for the same query, i.e. fetching the 20 additional records for the 50,000 users, dropped significantly. Read performance is getting degraded with increase in data. As far as I have read, this should not happen, as keys get cached and additional data should not matter.
What could be the possible cause for this? CPU and RAM utilisation is negligible, and I can't find out what is causing the query time to increase.
I have tried changing the compaction strategy to "LeveledCompaction", but that didn't help either.
EDIT 1
EDIT 2
Heap size is 8GB. The 20GB of data was added in a way similar to how the initial 4GB (the 50k userIDs) was added, to simulate a real-world scenario; "userID" and "timestamp" for the 20GB of data are different and generated randomly. The scenario is that I have 50k userIDs with 1020 rows each, where 1000 rows were added first and an additional 20 rows were added after some timestamp, and I am fetching these 20 messages. It works fine if only the 50k userIDs are present, but once I have more userIDs (the additional 20GB) and I try to fetch those same 20 messages (for the initial 50k userIDs), the performance degrades.
EDIT 3
cassandra.yaml
Read performance is getting degraded with increase in data.
This should only happen when you add a lot of records to the same partition.
From what I can understand, your table may look like:
CREATE TABLE tbl (
    userID text,
    timestamp timestamp,
    ....
    PRIMARY KEY (userID, timestamp)
);
This model is good enough when the volume of the data in a single partition is "bound" (e.g. you have at most 10k rows in a single partition). The reason is that the coordinator gets a lot of pressure when dealing with "unbound" queries (that's why very large partitions are a big no-no).
That "rule" can be easily overlooked, and the net result is an overall slowdown. It can be explained simply: C* needs to read more and more data (and it will all be read from one node only) to satisfy your query, keeping the coordinator busy and slowing down the entire cluster. Data growth usually means slower query responses, and after a certain threshold, the infamous read timeout error.
That being said, it would be interesting to see whether your disk usage is "normal" or something is wrong. Give dstat -lrvn a shot to monitor your servers.
A final tip: depending on how many fields you are querying with SELECT * and on the amount of retrieved data, being served by an SSD may not be a big deal, because you won't exploit the IOPS of your SSDs. In such cases, an ordinary HDD could lower the cost of the solution, and you wouldn't incur any penalty.
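Coming back to the "bound partition" point: one common way to keep a partition bounded is to add a time bucket to the partition key, so no single userID partition grows forever. A rough sketch (the table and column names are illustrative, not your actual schema):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class BucketedTableSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        // Partition by (userID, day) instead of userID alone, so a user's
        // partition holds at most one day of records and stays "bound".
        session.execute(
            "CREATE TABLE IF NOT EXISTS records_by_user_day ("
          + "  userID text, day date, ts timestamp, payload text,"
          + "  PRIMARY KEY ((userID, day), ts))");

        // Fetching the recent records then only touches the buckets you need.
        ResultSet rs = session.execute(
            "SELECT * FROM records_by_user_day WHERE userID = 'xyz' "
          + "AND day = '2017-01-31' AND ts > '2017-01-31 12:00:00+0000'");
        for (Row row : rs) {
            System.out.println(row);
        }

        cluster.close();
    }
}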

Cassandra batch size

I have 6 Cassandra nodes mainly used for writing (95%).
What's the best approach to inserting data - individual inserts or batches? Reason says batches are to be used, while keeping the "batch size" under 5kb to avoid node instability:
https://issues.apache.org/jira/browse/CASSANDRA-6487
Does this 5kb concern the size of the queries, as in number of chars * bytes_per_char? Are there any performance drawbacks to running only individual inserts?
A batch will increase performance if used for a single partition. You are able to get more throughput of data inserted.
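For example, with the DataStax Java driver, an UNLOGGED batch in which every statement targets the same partition key is applied as a single mutation on one set of replicas (the table and columns are made up for illustration):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.Date;

public class SinglePartitionBatch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        PreparedStatement insert = session.prepare(
            "INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)");

        // Every row shares the same partition key (sensor_id), so the whole
        // batch is applied together on the same replicas.
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (int i = 0; i < 10; i++) {
            batch.add(insert.bind("sensor-42", new Date(), (double) i));
        }
        session.execute(batch);

        cluster.close();
    }
}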

batch size of prepared statement in spring data cassandra

I'm getting this warning in the log:
WARN [Native-Transport-Requests:17058] 2014-07-29 13:58:33,776 BatchStatement.java (line 223) Batch of prepared statements for [keyspace.tablex] is of size 10924, exceeding specified threshold of 5120 by 5804.
Is there a way in spring data cassandra to specify the size?
Cassandra 2.0.9 and spring data cassandra 1.0.0-RELEASE
This is just a warning, informing you that the query size exceeds a certain limit.
The query is still being processed. The reasoning behind it is that bigger batched queries are expensive and may cause cluster imbalance, hence the warning to you (the developer) beforehand.
Look for batch_size_warn_threshold_in_kb in cassandra.yaml to adjust when this warning should be produced.
Here is the ticket where it was introduced: https://issues.apache.org/jira/browse/CASSANDRA-6487
I have done extensive performance testing and tuning on Cassandra, working closely with DataStax Support.
That is why I created the ingest() methods in SDC*, which are super fast in 1.0.4.RELEASE and higher.
This method caches the PreparedStatement for you, and then loops over the individual bind values and calls executeAsync for each insert. This sounds counterintuitive, but it is the fastest (and most balanced) way to insert into Cassandra.
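The underlying pattern looks roughly like this with the plain DataStax Java driver (a simplified sketch of the idea, not the actual SDC ingest() source; the keyspace, table and columns are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class AsyncIngestSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        // Prepare once, then reuse the statement for every row.
        PreparedStatement insert = session.prepare(
            "INSERT INTO tablex (id, payload) VALUES (?, ?)");

        List<ResultSetFuture> futures = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // One small async insert per row instead of one big batch; the
            // driver routes each write to the right replicas by token.
            // (Real code should throttle the number of in-flight requests.)
            futures.add(session.executeAsync(insert.bind("row-" + i, "payload-" + i)));
        }

        // Block until every insert has completed before shutting down.
        futures.forEach(ResultSetFuture::getUninterruptibly);
        cluster.close();
    }
}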

What is the maximum number of keyspaces in Cassandra?

What is the maximum number of keyspaces allowed in a Cassandra cluster? The wiki page on limitations doesn't mention one. Is there such a limit?
A keyspace is basically just a Map entry to Cassandra... you can have as many as you have memory for. Millions, easily.
ColumnFamilies are more expensive, since Cassandra will reserve a minimum of 1MB for each CF's memtable: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
You should have a look at: https://community.datastax.com/questions/12579/limit-on-number-of-cassandra-tables.html
We recommend a maximum of 200 tables total per cluster across all keyspaces (regardless of the number of keyspaces). Each table uses 1MB of memory to hold metadata about the tables, so in your case, where 1GB is allocated to the heap, 500-600MB is used just for table metadata with hardly any heap space left for other operations.
It is a recommendation and there is no hard-limit on the number of tables you can create in a cluster. You can create thousands if you were so inclined.
More importantly, applications take a long time to start up, since the drivers request the cluster metadata (including the schema) during the initialisation/discovery phase. Retrieving the schema for 200 tables takes significantly less time than it would to load 500, 1000 or 3000. This may not be important to you, but there are lots of use cases where short startup times are crucial, most notably for short-lived serverless functions, where execution time costs money and reducing execution where possible results in thousands of dollars in savings.
