Is it okay to have large number of partitions in cassandra?
Will heap memory bloat?
Mathematically speaking, Cassandra can support (Murmur3 partitioner) +/- 2^63 partitions. That comes out to a total of about 18.4 quintillion. So no worries there, that’s perfectly fine.
Will heap memory bloat?
No. One, there’s a pre-set size for your heap, and it won’t get any bigger than that. Two, Cassandra doesn’t keep all data or even all keys in memory.
You can configure how many keys/rows are cached on a per-table basis. Just make sure not to cache more than the heap-newgen or the heap can “thrash” (constantly running GC).
Related
I have been searching some docs online to get good understanding of how to tackle large partitions in cassandra.
I followed a document on the below link:
https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch13s10.html.
Regarding "LARGE ROWS WITH COMPACTION LIMITS", below is metioned:
"The default value for in_memory_compaction_limit_in_mb is 64. This value is set in conf/cassandra.yaml. For use cases that have fixed columns, the limit should never be exceeded. Setting this value can work as a sanity check to ensure that processes are not inadvertently writing to many columns to the same key.
Keys with many columns can also be problematic when using the row cache because it requires the entire row to be stored in memory."
In the /conf/cassandra.yaml, I did find a configuration named "in_memory_compaction_limit_in_mb".
The Definition in the cassandra.yaml goes as below:
In Cassandra 2.0:
in_memory_compaction_limit_in_mb
(Default: 64) Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
In Cassandra 3.0: (No such entries found in cassandra.yaml)
compaction_large_partition_warning_threshold_mb
(Default: 100) Cassandra logs a warning when compacting partitions larger than the set value
I have searching lot on what exactly the setting in_memory_compaction_limit_in_mb does.
It mentions some compaction is done in memory and some compaction is done on disk.
As per my understanding goes, When Compaction process runs:
SSTABLE is being read from disk---->(compared,tombstones removed,stale data removed) all happens in memory--->new sstable written to disk-->old table being removed
This operations accounts to high Disc space requirements and Disk I/O(Bandwidth).
Do help me with,if my understanding of compaction is wrong. Is there anything in compaction that happens in memory.
In my environment the
in_memory_compaction_limit_in_mb is set to 800.
I need to understand the purpose and implications.
Thanks in advance
in_memory_compaction_limit_in_mb is no longer necessary since the size doesn't need to be known before writing. There is no longer a 2 pass compaction so can be ignored. You don't have to do the entire partition at once, just a row at a time.
Now the primary cost is in deserializing the large index at the beginning of the partition that occurs in memory. You can increase the column_index_size_in_kb to reduce the size of that index (at cost of more IO during reads, but likely insignificant compared to the deserialization). Also if you use a newer version (3.11+) the index is lazy loaded after exceeding a certain size which improves things quite a bit.
I have seen this warning everywhere but cannot find any detailed explanation on this topic.
For starters
The maximum number of cells (rows x columns) in a single partition is
2 billion.
If you allow a partition to grow unbounded you will eventually hit this limitation.
Outside that theoretical limit, there are practical limitations tied to the impacts large partitions have on the JVM and read times. These practical limitations are constantly increasing from version to version. This practical limitation is not fixed but variable with data model, query patterns, heap size, and configurations which makes it hard to be give a straight answer on whats too large.
As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index which marks a row every column_index_size_in_kb. You can increase the key_cache_size_in_mb for reads to prevent unnecessary deserialization but that reduces heap space and fills old gen. You can increase the column index size but it will increase worst case IO costs on reads. Theres also many different settings for CMS and G1 to tune the impact of a huge spike in object allocations when reading these big partitions. There are active efforts on improving this so in the future it might no longer be the bottleneck.
Repairs also only go down to (in best case scenario) the partition level. So if say you are constantly appending to a partition, and a hash of that partition on 2 nodes are compared at not an exact time (distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce impact of this, but your still streaming massive amounts of data and fluctuating disk significantly which will then need to be compacted together unnecessarily.
You can probably keep adding onto this of corner cases and scenarios that have issues. Many times large partitions are possible to read, but the tuning and corner cases involved in them are not really worth it, better to just design data model to be friendly with how Cassandra expects it. I would recommend targeting 100mb but you can go far beyond that comfortably. Into the Gbs and you will need to start consider tuning for it (depending on data model, use case etc).
I ended up digging into Data Skew after noticing that our cluster runs hot on 2-3 leaves after a lot of insert operations (we have 30 leaves with 32G ram each). Basically our memory reaches almost 100% on those nodes causing a cluster blockage.Restarting those leaves did not free up the memory (in table memory reaches the maximum allocated size). What helped at that stage was allocating more memory to those 2-3 leaves (they are aws instances). However this is not a desired approach- it was a desperate workaround. Strange is that except these 2-3 leaves that are running out of memory, the other leaves are at around 20-30% memory consumption.
Checking this https://docs.memsql.com/docs/data-skew and running those queries i have noticed that all values for row_skew are < 10 % but memory_skew values are for some tables > 40% .
So i was wondering if there is anything that needs to be checked , improved , optimized?
Low row skew but high memory skew could be because you have variable-size data types (such as strings like varchar), and for some reason there is skew in that some partitions have many more bigger strings.
First, you can look for which specific fields are highly skewed - maybe something like select partition_id(), avg(length(s)) from t group by partition_id(). Then depending on what you find you may want to see if there is unexpected problems with the data, or if you need to change the shard key.
Check for orphan partition on the nodes with higher memory use (SHOW CLUSTER STATUS or EXPLAIN CLEAR ORPHAN PARTITIONS).
There is an In-Memory option introduced in the Cassandra by DataStax Enterprise 4.0:
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/inMemory.html
But with 1GB size limited for an in-memory table.
Anyone know the consideration why limited it as 1GB? And possible extend to a large size of in-memory table, such as 64GB?
To answer your question: today it's not possible to bypass this limitation.
In-Memory tables are stored within the JVM Heap, regardless the amount of memory available on single node allocating more than 8GB to JVM Heap is not recommended.
The main reason of this limitation is that Java Garbage Collector slow down when dealing with huge memory amount.
However if you consider Cassandra as a distributed system 1GB is not the real limitation.
(nodes*allocated_memory)/ReplicationFactor
allocated_memory is max 1GB -- So your table may contains many GB in memory allocated in different nodes.
I think that in future something will improve but dealing with 64GB in memory it could be a real problem when you need to flush data on disk. One more consideration that creates limitation: avoid TTL when working with In-Memory tables. TTL creates tombstones, a tombstone is not deallocated until the GCGraceSeconds period passes -- so considering a default value of 10 days each tombstone will keep the portion of memory busy and unavailable, possibly for long time.
HTH,
Carlo
What is the maximum number of keyspaces allowed in a Cassandra cluster? The wiki page on limitations doesn't mention one. Is there such a limit?
A keyspace is basically just a Map entry to Cassandra... you can have as many as you have memory for. Millions, easily.
ColumnFamilies are more expensive, since Cassandra will reserve a minimum of 1MB for each CF's memtable: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
You should have a look to : https://community.datastax.com/questions/12579/limit-on-number-of-cassandra-tables.html
We recommend a maximum of 200 tables total per cluster across all
keyspaces (regardless of the number of keyspaces). Each table uses 1MB
of memory to hold metadata about the tables so in your case where 1GB
is allocated to the heap, 500-600MB is used just for table metadata
with hardly any heap space left for other operations.
It is a recommendation and there is no hard-limit on the number of tables you can create in a cluster. You can create thousands if you were so inclined.
More importantly, applications take a long time to startup since the
drivers request the cluster metadata (including the schema) during the
initialisation/discovery phase. Retrieving the schema for 200 tables
is significantly less than it would take to load 500, 1000 or 3000.
This may not be important to you but there are lots of use cases where
short startup times are crucial, most notably for short-lived
serverless functions where execution time costs money and reducing
execution where possible results in thousands of dollars in savings.