Cassandra: how to prevent and debug a node going Out Of Memory?

I have Cassandra nodes that go regularly out of memory, and it is difficult to find out why.
Questions
Could you list the things I should check to avoid a node going out of memory?
How do I debug when a node goes out of memory?
Thank you

It is not possible to tell the exact root cause without a heap dump or the error logs, so please set up heap dumps first; only then can we find the actual reason.
Some possible reasons:
Your rows are probably growing too big to fit in RAM when it comes time to compact them. A compaction requires the entire row to fit in RAM.
There's also a hard limit of 2 billion columns per row but in reality you shouldn't ever let rows grow that wide. Bucket them by adding a day or server name or some other value common across your dataset to your row keys.
For a "write-often read-almost-never" workload you can have very wide rows but you shouldn't come close to the 2 billion column mark. Keep it in millions with bucketing.
For a write/read mixed workload where you're reading entire rows frequently, even hundreds of columns may be too much.
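To make the bucketing idea concrete, here is a rough pycassa sketch. The keyspace, the 'metrics' column family (assumed to have a UTF8 comparator), and the server-plus-day key layout are all illustrative, not taken from the question.

    # Illustrative only: bucket wide rows by server and day so that no single
    # row grows toward the 2-billion-column limit or becomes too big to compact.
    from datetime import datetime

    import pycassa

    pool = pycassa.ConnectionPool('my_keyspace', server_list=['127.0.0.1:9160'])
    metrics = pycassa.ColumnFamily(pool, 'metrics')  # assumes a UTF8 comparator

    def bucketed_key(server_name, when):
        # One row per server per day keeps each row in the thousands or millions
        # of columns instead of letting a single key grow forever.
        return '%s:%s' % (server_name, when.strftime('%Y%m%d'))

    now = datetime.utcnow()
    metrics.insert(bucketed_key('web01', now), {now.isoformat(): 'some value'})

    # Reading a day's worth of data is then a bounded single-row get:
    row = metrics.get(bucketed_key('web01', now))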

Related

Memsql columnstore data not deleted from disk after TRUNCATE or DROP TABLE

I created a columnstore table in memsql and populated it with around 10 million records, after which I started running several update scenarios. I noticed that the size of the data in /var/lib/memsql/leaf-3307/data/columns keeps increasing constantly and nothing there seems to be deleted. Initially the size of that folder is a couple of hundred MB, but it quickly jumps to a couple of GB after some full-table updates. The "Columnstore Disk Usage" reported by memsql-ops also increases, but at a very slow pace (far from what I see on disk).
This makes me think that data is never actually deleted from disk. The documentation states that running the OPTIMIZE commands should compact the row segment groups and that deleted rows would be removed:
Delete - Deleting a row in a columnstore index causes the row to be marked as deleted in the segment meta data leaving the data in place within the row segment. Segments which only contain deleted rows are removed, and the optimization process covered below will compact segments that require optimization.
Running the OPTIMIZE command didn't help. I also tried truncating the table and even dropping it but nothing helped. The data in the columns folder is still there. The only way I could find of cleaning that up is to DROP the entire database.
This doesn't seem like the desired behavior and I can't find any documentation justifying it. Can anybody explain why this is happening, if it should happen or point me to some relevant documentation?
Thanks in advance
MemSQL will keep around columnstore_window_size bytes of deleted columnstore data on disk per partition database. This is part of the implementation of columnstore replication (it keeps some old files around in case slaves are behind). If you lower the value of that system variable you'll see the disk usage drop. If you're not using redundancy 2, there is no harm in lowering it.
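As a rough sketch of inspecting that variable: MemSQL speaks the MySQL wire protocol, so any MySQL client works. The host, port, and credentials below are placeholders, and how you actually lower the value (SET GLOBAL versus your ops tooling) depends on your MemSQL version, so check the docs for your release.

    # Inspect columnstore_window_size over the MySQL protocol. Connection details
    # are placeholders; consult the MemSQL docs for your release before changing it.
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='')
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW VARIABLES LIKE 'columnstore_window_size'")
            name, value = cur.fetchone()
            print('%s = %s bytes of deleted data kept per partition' % (name, value))
    finally:
        conn.close()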

row skew and memory skew values in memsql

I ended up digging into data skew after noticing that our cluster runs hot on 2-3 leaves after a lot of insert operations (we have 30 leaves with 32 GB RAM each). Basically, memory reaches almost 100% on those nodes, causing a cluster blockage. Restarting those leaves did not free up the memory (in-table memory reaches the maximum allocated size). What helped at that stage was allocating more memory to those 2-3 leaves (they are AWS instances), but that was a desperate workaround rather than a desired approach. The strange thing is that, apart from these 2-3 leaves that are running out of memory, the other leaves sit at around 20-30% memory consumption.
Checking https://docs.memsql.com/docs/data-skew and running those queries, I noticed that all values for row_skew are < 10%, but memory_skew values for some tables are > 40%.
So I was wondering if there is anything that needs to be checked, improved, or optimized?
Low row skew but high memory skew could be because you have variable-size data types (such as strings like varchar), and for some reason some partitions hold many more large strings.
First, look for which specific fields are highly skewed, with something like select partition_id(), avg(length(s)) from t group by partition_id(). Then, depending on what you find, you may want to check whether there are unexpected problems with the data, or whether you need to change the shard key.
Check for orphan partitions on the nodes with higher memory use (SHOW CLUSTER STATUS or EXPLAIN CLEAR ORPHAN PARTITIONS).
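A small sketch of running the per-partition length check suggested above; the table 't' and varchar column 's' are placeholders for whichever field you suspect, and any MySQL-protocol client will do.

    # Placeholder table/column names; sorts partitions by average string length
    # so the ones dragging memory skew up show first.
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                           password='', database='my_db')
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT partition_id(), COUNT(*), AVG(LENGTH(s)) "
                "FROM t GROUP BY partition_id() ORDER BY 3 DESC"
            )
            for partition, row_count, avg_len in cur.fetchall():
                print('partition %s: %s rows, avg length %.1f'
                      % (partition, row_count, avg_len))
    finally:
        conn.close()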

Does having 1000s of CFs lead to OOM in Cassandra?

I have a cluster with many CFs (around 1000, maybe more), and I get OOM errors from time to time on different nodes. We have three Cassandra nodes. Is this expected behavior in Cassandra?
Each table (columnfamily) requires a minimum of 1MB of heap memory, so it's quite possible this is causing some pressure for you.
The best solution is to redesign your application to use fewer tables; most of the time I've seen this, it's because someone designed it to have "one table per X", where X is a customer, a data source, or even a time period. Instead, combine tables with a common schema and add a column to the primary key with the distinguishing element.
In the short term, you probably need to increase your heap size.
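As a sketch of that redesign (the column family and customer names are made up): instead of one column family per customer, keep a single column family and fold the customer into the row key, so the per-table heap overhead is paid once rather than a thousand times.

    # Illustrative: replace "one table per customer" with one table whose row key
    # carries the customer. Names and values are placeholders.
    import pycassa

    pool = pycassa.ConnectionPool('my_keyspace', server_list=['127.0.0.1:9160'])
    events = pycassa.ColumnFamily(pool, 'events')

    def event_key(customer, event_id):
        return '%s:%s' % (customer, event_id)

    # Before: pycassa.ColumnFamily(pool, 'events_acme').insert(event_id, cols)
    # After: one column family, one extra key component -- roughly 1 MB of heap
    # overhead total instead of 1 MB per customer table.
    events.insert(event_key('acme', '42'), {'type': 'login', 'source': 'web'})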

Cassandra datastore size

I am using Cassandra to store my parsed site logs. I have two column families with multiple secondary indices. The log data by itself is around 30 GB in size. However, the size of the Cassandra data dir is ~91 GB. Is there any way I can reduce the size of this store? Also, will having multiple secondary indices have a big impact on the datastore size?
Potentially, the secondary indices could have a big impact, but obviously it depends what you put in them! If most of your data entries appear in one or more indexes, then the indexes could form a significant proportion of your storage.
You can see how much space each column family is using via JConsole and/or 'nodetool cfstats'.
You can also look at the sizes of the disk data files to get some idea of usage.
It's also possible that data isn't being flushed to disk often enough - this can result in lots of commitlog files being left on disk for a long time, occupying extra space. This can happen if some of your column families are only lightly loaded. See http://wiki.apache.org/cassandra/MemtableThresholds for parameters to tune this.
If you have very large numbers of small columns, then the column names may use a significant proportion of the storage, so it may be worth shortening them where this makes sense (not if they are timestamps or other meaningful data!).
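For the disk-file angle, here is a small sketch that sums on-disk usage per keyspace by walking the data directory. The path is the usual default and is an assumption; adjust it to your data_file_directories setting, and use 'nodetool cfstats' for the per-column-family breakdown, including the index column families.

    # Sums on-disk bytes per keyspace under the data directory. The path is an
    # assumption (the common default); change it to match your configuration.
    import os
    from collections import defaultdict

    DATA_DIR = '/var/lib/cassandra/data'

    sizes = defaultdict(int)
    for keyspace in os.listdir(DATA_DIR):
        ks_path = os.path.join(DATA_DIR, keyspace)
        if not os.path.isdir(ks_path):
            continue
        for root, _, files in os.walk(ks_path):
            for name in files:
                sizes[keyspace] += os.path.getsize(os.path.join(root, name))

    for keyspace, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print('%-30s %8.2f GB' % (keyspace, size / float(1024 ** 3)))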

Cassandra multiget performance

I've got a cassandra cluster with a fairly small number of rows (2 million or so, which I would hope is "small" for cassandra). Each row is keyed on a unique UUID, and each row has about 200 columns (give or take a few). All in all these are pretty small rows, no binary data or large amounts of text. Just short strings.
I've just finished the initial import into the cassandra cluster from our old database. I've tuned the hell out of cassandra on each machine. There were hundreds of millions of writes, but no reads. Now that it's time to USE this thing, I'm finding that read speeds are absolutely dismal. I'm doing a multiget using pycassa on anywhere from 500 to 10000 rows at a time. Even at 500 rows, the performance is awful, sometimes taking 30+ seconds.
What would cause this type of behavior? What sort of things would you recommend after a large import like this? Thanks.
Sounds like you are I/O-bottlenecked. Cassandra does about 4000 reads/s per core, IF your data fits in RAM. Otherwise you will be seek-bound just like anything else.
I note that normally "tuning the hell" out of a system is reserved for AFTER you start putting load on it. :)
See:
http://spyced.blogspot.com/2010/01/linux-performance-basics.html
http://www.datastax.com/docs/0.7/operations/cache_tuning
Is it an option to split up the multiget into smaller chunks? By doing this you would spread the get across multiple nodes, potentially increasing your performance both by balancing the load and by having smaller packets to deserialize.
That brings me to the next question: what is your read consistency set to? In addition to an I/O bottleneck, as #jbellis mentioned, you could also have a network traffic issue if you are requiring a particularly high level of consistency.
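A minimal pycassa sketch of both suggestions, splitting the multiget into smaller chunks and reading at a lower consistency level; the chunk size, node list, and column family name are all illustrative.

    # Fetch large key sets in bounded chunks and read at consistency level ONE.
    # All names and sizes below are placeholders for your own setup.
    import pycassa
    from pycassa import ConsistencyLevel

    pool = pycassa.ConnectionPool(
        'my_keyspace',
        server_list=['node1:9160', 'node2:9160', 'node3:9160'])
    cf = pycassa.ColumnFamily(pool, 'my_cf',
                              read_consistency_level=ConsistencyLevel.ONE)

    def chunked_multiget(column_family, keys, chunk_size=100):
        # Yields (key, columns) pairs, requesting at most chunk_size rows per
        # call so no single request has to deserialize thousands of rows.
        for i in range(0, len(keys), chunk_size):
            chunk = keys[i:i + chunk_size]
            for key, columns in column_family.multiget(chunk).items():
                yield key, columns

    keys_to_fetch = ['uuid-1', 'uuid-2']  # your 500-10000 row keys
    rows = dict(chunked_multiget(cf, keys_to_fetch, chunk_size=100))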
