row skew and memory skew values in memsql - singlestore

I ended up digging into Data Skew after noticing that our cluster runs hot on 2-3 leaves after a lot of insert operations (we have 30 leaves with 32G ram each). Basically our memory reaches almost 100% on those nodes causing a cluster blockage.Restarting those leaves did not free up the memory (in table memory reaches the maximum allocated size). What helped at that stage was allocating more memory to those 2-3 leaves (they are aws instances). However this is not a desired approach- it was a desperate workaround. Strange is that except these 2-3 leaves that are running out of memory, the other leaves are at around 20-30% memory consumption.
Checking this https://docs.memsql.com/docs/data-skew and running those queries i have noticed that all values for row_skew are < 10 % but memory_skew values are for some tables > 40% .
So i was wondering if there is anything that needs to be checked , improved , optimized?

Low row skew but high memory skew could be because you have variable-size data types (such as strings like varchar), and for some reason there is skew in that some partitions have many more bigger strings.
First, you can look for which specific fields are highly skewed - maybe something like select partition_id(), avg(length(s)) from t group by partition_id(). Then depending on what you find you may want to see if there is unexpected problems with the data, or if you need to change the shard key.

Check for orphan partition on the nodes with higher memory use (SHOW CLUSTER STATUS or EXPLAIN CLEAR ORPHAN PARTITIONS).

Related

Cassandra : how to prevent and debug node going Out Of Memory?

I have Cassandra nodes that go regularly out of memory, and it is difficult to find out why.
Questions
could you list the things I have to check to avoid a node going out of memory ?
how to debug when a node go out of memory ?
Thank you
It is not possible to tell exact root cause without heap dump or error logs please set up heap dump
follow link then only we can get actual reason .
Some possible reason
Your rows are probably growing too big to fit in RAM when it comes time to compact them. A compaction requires the entire row to fit in RAM.
There's also a hard limit of 2 billion columns per row but in reality you shouldn't ever let rows grow that wide. Bucket them by adding a day or server name or some other value common across your dataset to your row keys.
For a "write-often read-almost-never" workload you can have very wide rows but you shouldn't come close to the 2 billion column mark. Keep it in millions with bucketing.
For a write/read mixed workload where you're reading entire rows frequently even hundreds of columns may be too much.

Cassandra CPU imbalance in Azure

We have a 30+ node Cassandra cluster (3.11.2) in 4 data centers. One of the centers consists of 8 nodes in Azure running on Standard DS12 v2 (4cpu, 28gb) nodes with a 500GB premium SSD drive. All in the same data center (central US).
We are seeing a dramatic CPU imbalance in the node activity when pushed to the max. We have a keyspace with about 200 million records, and we're running a process to check and refresh the records if necessary from another data stream.
What's happening, is we have 4 nodes that are running at 70-90% CPU compared to 15-25% of the other 4. The measurement of the CPU is being done in the nodes themselves, because Azure's own metrics is broken and never represents what is actually happening.
Digging into a pair of nodes (one low CPU and one high) the difference is the iowait% of the two. The data in the keyspace is balanced (within reason - they are all within 5% of another in record count and size). It looks like the number of reads is balanced, and even the read latency as reported by Cassandra is similar.
When I do an iostat compare of the nodes, the high CPU node is reporting a much higher (by 50 to 100%) rKB/s numbers... which is likely leading to the difference in iowait% time.
These nodes are 100% configured the same, running the same version of everything (OS, libraries, everything) that I can think to look. I cannot figure out why some nodes are deciding to do more disk reads that the others, resulting in the cluster as a whole slowing down.
Anybody have any suggestions on where I can look for differences?
The only thing that is a pattern, is the nodes that are slower are the 4 nodes that were added later in our expansion. We started with 4 nodes for a while and added 4 more when we needed space. All the appropriate repairs and other tasks required with node additions were done - the fact that the records and physical size of the data files on disk being equal should attest to that.
When we shut down our refresh process, the all the nodes settle down to a even 5% or less CPU across the board. No compaction or any other maintenance is happening that would indicate something different.
plz help... :)
Our final solution for this - to fix ONLY the unbalanced problem was to cleanup, full repair and compact. At that point the nodes are relatively equally used. We suspect expanding the cluster (adding nodes) may have left elements of data on the older nodes that were not compacted out based on regular compaction events.
We are still working to try to solve the load issue; but now at least all the nodes are feeling the same CPU crunch.

Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.
For starters
The maximum number of cells (rows x columns) in a single partition is
2 billion.
If you allow a partition to grow unbounded you will eventually hit this limitation.
Outside that theoretical limit, there are practical limitations tied to the impacts large partitions have on the JVM and read times. These practical limitations are constantly increasing from version to version. This practical limitation is not fixed but variable with data model, query patterns, heap size, and configurations which makes it hard to be give a straight answer on whats too large.
As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index which marks a row every column_index_size_in_kb. You can increase the key_cache_size_in_mb for reads to prevent unnecessary deserialization but that reduces heap space and fills old gen. You can increase the column index size but it will increase worst case IO costs on reads. Theres also many different settings for CMS and G1 to tune the impact of a huge spike in object allocations when reading these big partitions. There are active efforts on improving this so in the future it might no longer be the bottleneck.
Repairs also only go down to (in best case scenario) the partition level. So if say you are constantly appending to a partition, and a hash of that partition on 2 nodes are compared at not an exact time (distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce impact of this, but your still streaming massive amounts of data and fluctuating disk significantly which will then need to be compacted together unnecessarily.
You can probably keep adding onto this of corner cases and scenarios that have issues. Many times large partitions are possible to read, but the tuning and corner cases involved in them are not really worth it, better to just design data model to be friendly with how Cassandra expects it. I would recommend targeting 100mb but you can go far beyond that comfortably. Into the Gbs and you will need to start consider tuning for it (depending on data model, use case etc).

transaction_buffer Maximum value

What is the maximum value that one can set for transaction_buffer inside memsql cnf? I assume there is a correlation with RAM allocated on the server. My leaves have 32G each and at the moment we have transaction_buffer set to 0. We are passed designing phase on our cluster and we would like to do some performance tuning and one parameter that needs to be set up accordingly is this one.
The transaction_buffer size is an amount of memory reserved per database partition - i.e. each leaf node will need transaction_buffer size * partitions per leaf * number of databases memory. The default is 128 MB and this should be sufficient generally.
Basically, it's a balancing act - data in transaction_buffer will exist in memory before being written to disk. A transaction_buffer of 0 may save you some memory, but it's not taking full advantage of the speed of being in memory. If you have a lot of databases that are updated infrequently a low transaction_buffer may be the right balance as it is a per database cost (keeping in mind that each partition is a database itself).
Transaction_buffer may also be valuable for you as a "get out of jail free" card - since if your workload becomes more and more memory intensive it's possible to get into a situation where your OS is killing MemSQL too frequently to reduce memory consumption. Once you get stuck in a vicious cycle like that, restarting with a reduced transaction buffer can reduce memory overhead enough to keep the system from being OOM-killed long enough to troubleshoot and correct the issue on your end.
Eventually, it might become adaptive, and you'll be left without that easy way to get some wiggle-room. Which is why it is essential to make sure the maximum_memory is low enough that your system doesn't begin to OOM kill processes. https://docs.memsql.com/docs/memory-management

Does having 1000's of CF's will lead to OOM in Cassandra

I am having a cluster with multiple CF's (around 1000 maybe more). And I get OOM errors time to time from different nodes. We have three Cassandra nodes? Is it an expected behavior in cassandra?
Each table (columnfamily) requires a minimum of 1MB of heap memory, so it's quite possible this is causing some pressure for you.
The best solution is to redesign your application to use less tables; most of the time I've seen this it's because someone designed it to have "one table per X" where X is a customer or a data source or even a time period. Instead, combine tables with a common schema and add a column to the primary key with the distinguishing element.
In the short term, you probably need to increase your heap size.

Resources