Difference between eviction policies in Hazelcast

I'm going over the Hazelcast documentation and comparing the eviction policies, and I noticed one distinction that I didn't fully understand.
map_size_per_jvm: Max map size per JVM.
partitions_wide_map_size: Partitions (default 271) wide max map size.
I'm assuming both of these refer to the number of entries and not to size in terms of storage space. Isn't a partition going to rest on one JVM? To me these seem like the same option. Can anyone help me make sense of the difference between the two?

Firstly, yes, the max sizes map_size_per_jvm, cluster_wide_map_size and partitions_wide_map_size count entries (not size in terms of storage space).
Secondly, these max sizes are hard limits, and whilst similar they are in fact different from the eviction policies (LRU, LFU or NONE).
Here's how they work:
cluster_wide_map_size - this is the total number of map entries across all Hazelcast nodes.
map_size_per_jvm - this is essentially the number of map entries per Hazelcast node.
So if you are running 2 nodes using this policy with max size = 10 (and backupCount = 0, see below), you have a maximum of 20 map entries across all nodes. Adding another Hazelcast node increases the total max map size.
partitions_wide_map_size - this one is a little unpredictable, since it depends on the distribution of partitions across your nodes.
A cluster node reaches its maximum when it holds its proportion (owned partitions / total partitions) of the max size. Code: MaxSizePartitionsWidePolicy
Please note that all of these max sizes include backups, so backupCount = 1 effectively cuts the real max map size in half.
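As a rough illustration of the arithmetic described above (this is not Hazelcast's implementation), here is a small Python sketch. The function names are made up for this example, and the backup formula assumes each entry is stored once plus once per backup, which generalises the "half" rule stated for backupCount = 1.

```python
DEFAULT_PARTITION_COUNT = 271  # default partition count quoted in the question


def cluster_entry_limit(policy, max_size, node_count):
    """Total entries (primaries plus backups) the whole cluster can hold."""
    if policy == "map_size_per_jvm":
        return max_size * node_count  # the limit applies to each node
    if policy == "cluster_wide_map_size":
        return max_size               # the limit applies to the whole map
    raise ValueError(f"unknown policy: {policy}")


def partitions_wide_node_limit(max_size, owned_partitions,
                               total_partitions=DEFAULT_PARTITION_COUNT):
    """Entries one node may hold: its share (owned / total) of max_size."""
    return max_size * owned_partitions / total_partitions


def effective_entries(limit, backup_count):
    """Distinct entries that fit once backups are counted (assumption)."""
    return limit // (backup_count + 1)


print(cluster_entry_limit("map_size_per_jvm", 10, 2))  # 20
print(effective_entries(20, 1))                        # 10
print(partitions_wide_node_limit(271, 90))             # 90.0
```

With partitions_wide_map_size the per-node figure changes whenever partitions migrate, which is why that policy is harder to predict.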
The other max size settings, used_heap_size and used_heap_percentage, seem clear in their usage.
I hope this helps, good luck!

Related

What exactly is a partition size in cassandra?

I am new to Cassandra and have a Cassandra cluster with 6 nodes. I am trying to find the partition size.
I tried to fetch it with this basic command:
nodetool tablehistograms keyspace.tablename
Now I am wondering how it is calculated, and why the result has only 5 records other than min and max while the number of nodes is 6. Is there any relation between node size and the number of partitions for a table?
Fundamentally, what I know is that the partition key is hashed and used to distribute the data to be persisted across the various nodes.
When exactly should we go for bucketing? I am assuming that Cassandra has a partitioner that takes care of distributed persistence across nodes.
The number of entries in this column is not related to the number of nodes. It shows the distribution of the values - you have min, max, and percentiles (50/75/95/98/99).
Most nodetool commands don't show anything about other nodes; they provide information about the current node only.
P.S. This document would be useful in explaining how to interpret this information.
As the name of the command suggests, tablehistograms reports the distribution of metadata for the partitions held by a node.
To add to what Alex Ott has already stated, the percentiles (not percentages) give insight into the range of metadata values. For example:
50% of the partitions for the given table have a size of 74KB or less
95% are 263KB or less
98% are 455KB or less
These metadata don't have any correlation with the number of partitions or the number of nodes in your cluster.
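To make the percentile reading concrete, here is a minimal Python sketch using the nearest-rank method on a made-up list of partition sizes. The data and the function name are hypothetical, and nodetool derives its figures from its own internal histograms rather than from a raw list like this, so treat it purely as an illustration of what a percentile row means.

```python
import math


def nearest_rank_percentile(values, pct):
    """Smallest value v such that at least pct% of the values are <= v."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical partition sizes (KB) for one table on one node.
partition_sizes_kb = [12, 30, 41, 55, 74, 80, 120, 200, 263, 455]

# One output row per percentile, regardless of how many nodes exist.
for pct in (50, 75, 95, 98, 99):
    size = nearest_rank_percentile(partition_sizes_kb, pct)
    print(f"{pct}% of partitions are {size} KB or less")
```

The number of rows printed is fixed by the list of percentiles, which is why the output length has nothing to do with the node count.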
You are correct in that the partition key gets hashed and the resulting value determines where the partition (and its associated rows) gets stored (distributed among nodes in the cluster). If you're interested, I've explained in a bit more detail with some examples in this post -- https://community.datastax.com/questions/5944/.
As far as bucketing is concerned, you would typically do that to reduce the number of rows in a partition and therefore its size. The general recommendation is to keep your partition sizes under 100MB for optimal performance, but it's not a hard rule -- you can have larger partitions as long as you are aware of the tradeoffs.
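As a sketch of what bucketing can look like for time-stamped data, the Python snippet below derives a composite partition key from a source identifier plus a time bucket, so each partition only holds one bucket's worth of rows. The names (`sensor_id`, `bucketed_partition_key`) and the bucket granularities are assumptions for illustration, not something taken from the question.

```python
from datetime import datetime, timezone

# Hypothetical bucket granularities and the timestamp format used for each.
BUCKET_FORMATS = {
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
}


def bucketed_partition_key(sensor_id, ts, bucket="day"):
    """Return (sensor_id, bucket_label): rows sharing it share a partition."""
    return (sensor_id, ts.strftime(BUCKET_FORMATS[bucket]))


reading_time = datetime(2021, 3, 14, 15, 9, tzinfo=timezone.utc)
print(bucketed_partition_key("sensor-1", reading_time))          # ('sensor-1', '2021-03-14')
print(bucketed_partition_key("sensor-1", reading_time, "hour"))  # ('sensor-1', '2021-03-14T15')
```

A finer bucket gives smaller partitions, at the cost of touching more partitions when a query spans a long time range.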
In your case, the largest partition is only 455KB so size is not a concern. Cheers!

Throughput Units and Partition Count

I have a question regarding partition count in relation to TUs. We have the configuration below and 3 TUs for the namespace. Will the number of partitions for each event hub have an impact, and should we just set the partition count to 32 for better performance? FYI, we are using the Standard plan and kept the partition count higher for the first event hub because it receives more messages. We also use the batch method to send messages to the event hubs.
There is a potential issue with having 3 TUs. If the namespace has 3 TUs, then the maximum ingress in one minute is 1 MB * 60 * 3 = 180 MB, but in the table you posted the total size (109 + 58 + 39) is larger than 180 MB.
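The arithmetic in this answer can be checked with a few lines of Python. This is only a sketch: it assumes the three figures are MB per minute for each event hub, as the answer implies, and the helper names are invented for the example.

```python
import math

MB_PER_SECOND_PER_TU = 1  # ingress allowance per throughput unit used in the answer


def max_ingress_mb_per_minute(throughput_units):
    """Namespace-wide ingress ceiling for one minute."""
    return MB_PER_SECOND_PER_TU * 60 * throughput_units


def required_throughput_units(total_mb_per_minute):
    """Smallest whole number of TUs whose ceiling covers the given load."""
    return math.ceil(total_mb_per_minute / (MB_PER_SECOND_PER_TU * 60))


observed_mb_per_minute = [109, 58, 39]  # per event hub, as quoted in the answer
total = sum(observed_mb_per_minute)
limit = max_ingress_mb_per_minute(3)

print(limit)                             # 180
print(total)                             # 206
print(total > limit)                     # True
print(required_throughput_units(total))  # 4
```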
For TUs and partition count, take a look at "How many partitions do I need?" and "Partitions". You can follow the guidance below, taken from those articles:
We recommend that you balance 1:1 throughput units and partitions to achieve optimal scale. A single partition has a guaranteed ingress and egress of up to one throughput unit. While you may be able to achieve higher throughput on a partition, performance is not guaranteed. This is why we strongly recommend that the number of partitions in an event hub be greater than or equal to the number of throughput units.
Plan for a maximum of 1 MB/sec per partition. In other words, think of each partition as an individual stream that can process at most 1 MB/sec of traffic. That said, your current configuration looks alright to me. However, you can still consider increasing the partition count depending on your traffic growth trajectory.

Where does the idea of a 10MB partition size come from?

I'm doing some data modelling for time series data in Cassandra, and I've decided to implement buckets to regulate my partition sizes and maintain reasonable distribution on my cluster.
I decided to bucketise such that my partitions would not exceed a size of 10MB, as I've seen numerous sources that state this as an ideal partition size, but I can't find any information on why 10MB was chosen. On top of this I can't find anything from DataStax or Apache that mentions this soft 10MB limit at all.
Our data can be requested for large periods of time, meaning lots of partitions will be required to service 1 request if the partition sizes remain at 10MB. I'd rather increase the size of the partitions, and have fewer partitions required to service these requests.
Where does this idea of a 10MB partition size come from? Is it still relevant? What would be so bad if my partitions were 20MB in size? Or even 50MB?
With 10MB referenced in so many places, I feel like there must be something to it. Any information would be appreciated. Cheers.
I think much of this advice comes from older times, when support for wide partitions wasn't very good: reading data put a lot of pressure on the heap, etc. Since Cassandra 3.0 the situation has improved considerably, but it's still recommended to keep the size on disk under 100 MB.
For example, the DataStax planning guide says in the section "Estimating partition size":
a good rule of thumb is to keep the maximum number of rows below 100,000 items and the disk size under 100 MB
In recent versions of Cassandra we can go beyond this recommendation, but it is still not advised, although it depends heavily on the access patterns. You can find more information in the following blog post, and this video.
I have seen users with 60+ GB partitions. The system still works, but the data distribution is not ideal, so nodes become "hot" and performance may suffer.
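The quoted rule of thumb (fewer than 100,000 rows and under 100 MB on disk) can be turned into a quick back-of-the-envelope check for a candidate bucket size. The Python sketch below is an assumption-laden estimate: the write rate and average row size are hypothetical inputs, 100 MB is taken as 100 * 1024 * 1024 bytes, and storage overhead and compression are ignored.

```python
MAX_ROWS = 100_000               # row count from the quoted rule of thumb
MAX_BYTES = 100 * 1024 * 1024    # 100 MB, interpreted here as MiB (assumption)


def partition_estimate(rows_per_hour, avg_row_bytes, bucket_hours):
    """Return (row_count, size_in_bytes) for one time bucket."""
    rows = rows_per_hour * bucket_hours
    return rows, rows * avg_row_bytes


def within_rule_of_thumb(rows, size_bytes):
    """True if the partition stays under both soft limits."""
    return rows < MAX_ROWS and size_bytes < MAX_BYTES


# Hypothetical workload: one 200-byte row per second.
daily = partition_estimate(3600, 200, 24)
weekly = partition_estimate(3600, 200, 24 * 7)

print(daily, within_rule_of_thumb(*daily))    # (86400, 17280000) True
print(weekly, within_rule_of_thumb(*weekly))  # (604800, 120960000) False
```

In this made-up workload a daily bucket stays within both limits, while a weekly bucket exceeds the row count long before the size matters.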

Cassandra Compacting wide rows large partitions

I have been searching online docs to get a good understanding of how to tackle large partitions in Cassandra.
I followed a document at the link below:
https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch13s10.html.
Regarding "LARGE ROWS WITH COMPACTION LIMITS", the following is mentioned:
"The default value for in_memory_compaction_limit_in_mb is 64. This value is set in conf/cassandra.yaml. For use cases that have fixed columns, the limit should never be exceeded. Setting this value can work as a sanity check to ensure that processes are not inadvertently writing to many columns to the same key.
Keys with many columns can also be problematic when using the row cache because it requires the entire row to be stored in memory."
In the /conf/cassandra.yaml, I did find a configuration named "in_memory_compaction_limit_in_mb".
The definition in cassandra.yaml reads as follows:
In Cassandra 2.0:
in_memory_compaction_limit_in_mb
(Default: 64) Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
In Cassandra 3.0: (No such entries found in cassandra.yaml)
compaction_large_partition_warning_threshold_mb
(Default: 100) Cassandra logs a warning when compacting partitions larger than the set value
I have been searching a lot for what exactly the in_memory_compaction_limit_in_mb setting does.
It mentions that some compaction is done in memory and some is done on disk.
As far as I understand, when the compaction process runs:
SSTables are read from disk -> compared, with tombstones and stale data removed (all in memory) -> new SSTable written to disk -> old SSTables removed
These operations account for high disk space requirements and disk I/O (bandwidth).
Please correct me if my understanding of compaction is wrong. Is there anything in compaction that happens in memory?
In my environment, in_memory_compaction_limit_in_mb is set to 800.
I need to understand the purpose and implications.
Thanks in advance
in_memory_compaction_limit_in_mb is no longer necessary since the size doesn't need to be known before writing. There is no longer a two-pass compaction, so it can be ignored. Compaction doesn't have to process the entire partition at once, just a row at a time.
Now the primary cost is in deserializing the large index at the beginning of the partition, which happens in memory. You can increase column_index_size_in_kb to reduce the size of that index (at the cost of more IO during reads, but that is likely insignificant compared to the deserialization). Also, if you use a newer version (3.11+), the index is lazily loaded once it exceeds a certain size, which improves things quite a bit.
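To illustrate the "a row at a time" idea, here is a toy Python model of a streaming merge. It is a conceptual sketch only, not Cassandra's code: each SSTable is modelled as a key-sorted list of (key, write_time, value) tuples, `TOMBSTONE` is a made-up deletion marker, and deleted rows are dropped immediately, whereas real Cassandra only purges tombstones after gc_grace_seconds has passed.

```python
import heapq
from itertools import groupby

TOMBSTONE = None  # hypothetical marker for a deleted row


def compact(*sstables):
    """Merge key-sorted runs lazily, keeping only the newest live version."""
    merged = heapq.merge(*sstables, key=lambda row: row[0])
    for _key, versions in groupby(merged, key=lambda row: row[0]):
        newest = max(versions, key=lambda row: row[1])
        if newest[2] is not TOMBSTONE:
            yield newest  # only one key's versions are held at a time


older = [("a", 1, "x"), ("b", 1, "y"), ("c", 1, "z")]
newer = [("a", 2, "x2"), ("b", 3, TOMBSTONE)]

print(list(compact(older, newer)))  # [('a', 2, 'x2'), ('c', 1, 'z')]
```

Because the merge is lazy, memory use depends on the versions of the key currently being merged rather than on the total size of the inputs.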

Cassandra configs

Recently I began to study Cassandra. Please help me understand what effect these settings have (I have read the cassandra.yaml file, but I need your interpretation):
memtable_flush_writers
memtable_flush_queue_size
thrift_framed_transport_size_in_mb
in_memory_compaction_limit_in_mb
sliced_buffer_size_in_kb
thrift_max_message_length_in_mb
binary_memtable_throughput_in_mb
column_index_size_in_kb
I know it's very late to answer, but I am answering because it might help someone else.
Most of the parameters you have mentioned above are related to the Cassandra write operation.
memtable_flush_writers :
Sets the number of memtable flush writer threads. These threads are blocked by disk I/O, and each one holds a memtable in memory while blocked. If your data directories are backed by SSD, increase this setting to the number of cores.
memtable_flush_queue_size :
The number of full memtables to allow pending flush (memtables waiting for a write thread). At a minimum, set to the maximum number of indexes created on a single table.
in_memory_compaction_limit_in_mb : Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
thrift_framed_transport_size_in_mb : Frame size (maximum field length) for Thrift. The frame is the row or part of the row that the application is inserting.
thrift_max_message_length_in_mb: The maximum length of a Thrift message in megabytes, including all fields and internal Thrift overhead (1 byte of overhead for each frame). Message length is usually used in conjunction with batches. A frame length greater than or equal to 24 accommodates a batch with four inserts, each of which is 24 bytes. The required message length is greater than or equal to 24+24+24+24+4 (number of frames).
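The message-length example in the thrift_max_message_length_in_mb description works out as follows. This Python sketch just reproduces that arithmetic with one unit of overhead per frame; the function name is invented here and the units are whatever the quoted documentation example intends.

```python
def required_message_length(frame_lengths, overhead_per_frame=1):
    """Sum of the frame lengths plus the per-frame overhead."""
    return sum(frame_lengths) + overhead_per_frame * len(frame_lengths)


# A batch of four inserts of 24 each: 24 + 24 + 24 + 24 + 4 frames of overhead.
print(required_message_length([24, 24, 24, 24]))  # 100
```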
You can find more details in the DataStax Cassandra documentation.

Resources