Cassandra datastore size

I am using Cassandra to store my parsed site logs. I have two column families with multiple secondary indices. The log data by itself is around 30 GB in size. However, the size of the Cassandra data dir is ~91 GB. Is there any way I can reduce the size of this store? Also, will having multiple secondary indices have a big impact on the datastore size?

Potentially, the secondary indices could have a big impact, but obviously it depends what you put in them! If most of your data entries appear in one or more indexes, then the indexes could form a significant proportion of your storage.
You can see how much space each column family is using with JConsole and/or 'nodetool cfstats'.
You can also look at the sizes of the disk data files to get some idea of usage.
It's also possible that data isn't being flushed to disk often enough. This can result in lots of commitlog files being left on disk for a long time, occupying extra space. This can happen if some of your column families are only lightly loaded. Cassandra has configuration parameters to tune this.
If you have very large numbers of small columns, then the column names may use a significant proportion of the storage, so it may be worth shortening them where this makes sense (not if they are timestamps or other meaningful data!).


Cassandra : how to prevent and debug node going Out Of Memory?

I have Cassandra nodes that regularly run out of memory, and it is difficult to find out why.
Could you list the things I should check to prevent a node from running out of memory?
How do I debug a node when it runs out of memory?
Thank you
It is not possible to tell the exact root cause without a heap dump or error logs. Please set up heap dumps; only then can we find the actual reason.
One possible reason:
Your rows are probably growing too big to fit in RAM when it comes time to compact them. A compaction requires the entire row to fit in RAM.
There's also a hard limit of 2 billion columns per row but in reality you shouldn't ever let rows grow that wide. Bucket them by adding a day or server name or some other value common across your dataset to your row keys.
For a "write-often read-almost-never" workload you can have very wide rows but you shouldn't come close to the 2 billion column mark. Keep it in millions with bucketing.
For a write/read mixed workload where you're reading entire rows frequently, even hundreds of columns may be too much.
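The bucketing idea above can be sketched as follows. This is a minimal illustration, not a prescribed scheme: the function name `bucketed_row_key`, the `source:day` key layout, and the choice of a UTC day as the bucket are all assumptions made for the example.

```python
from datetime import datetime, timezone


def bucketed_row_key(source: str, event_time: datetime) -> str:
    """Build a row key of the form '<source>:<YYYY-MM-DD>'.

    Hypothetical layout: one row per source per UTC day, so a single
    row never accumulates more than one day's worth of columns.
    """
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{source}:{day}"


# Two events a couple of minutes apart land in different rows
# because they fall on different UTC days.
before_midnight = datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)
after_midnight = datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc)
print(bucketed_row_key("web-01", before_midnight))
print(bucketed_row_key("web-01", after_midnight))
```

If a day's worth of data is still too wide for one row, a finer bucket (for example an hour) would follow the same pattern.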

Spark count dataframe to estimate output partitions, then write, efficiently without caching?

As my Spark program runs on more data, I think I am crashing because I'm picking up the default number of output partitions for aggregation, namely 200. I've learned how to control this, but it seems that ideally I would set the number of output partitions based on the amount of data I'm writing. Herein lies the conundrum: I need to first call count() on the dataframe, and then write it. That means I may re-read it from S3 twice. I could cache and then count, but I've seen Spark crash when I cache this data. Caching seems to use the most resources, whereas if I just write it, Spark can do something more optimal.
So my questions are: do you think this is a decent approach, doing a count first (the count is a proxy for the size on disk), or should you just hard-code some numbers and change them when you need to? And if I am going to count first, is there some clever way to optimize things so that the count and write share work, other than caching the whole dataframe?
Yes, the count approach is actually the correct way to go. Ideally you want your RDD partitions to be of some considerable size, like 50 MB, before writing. Otherwise you will end up with the "small file problem".
Now, if you have large data, caching in memory could be hard. You could try MEMORY_AND_DISK, but then the data will spill to disk and cause a slowdown.
I have faced this predicament multiple times, and every time I have chosen a "magic number" for the number of partitions. The number is parameterized, so when it needs to change I don't change the code; I just pass a different parameter.
If you know your data size is generally in a particular range, you could hard-code the partition number. It is not ideal but gets the job done.
You could also publish metrics such as the size of the data in S3 and raise an alarm if that breaches some threshold, so that someone can change the partition number manually.
In general, if you keep the partition number moderately high, like 5000 for approximately 500 GB of data, that works for a large range, i.e. from 300 GB to 1.2 TB of data. This means you probably don't need to change the partition number too often if you have a moderate inflow of data.
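The arithmetic behind such a "magic number" can be sketched as below. This is only an illustration under assumptions: the helper name `estimate_partitions` is made up, the 100 MB default target is simply what 5000 partitions for 500 GB works out to, and the input size is assumed to come from somewhere cheap (such as a listing of the input files) rather than from a count().

```python
def estimate_partitions(data_size_bytes: int,
                        target_partition_bytes: int = 100 * 10**6,
                        minimum: int = 1) -> int:
    """Return how many partitions give roughly target-sized pieces.

    Uses integer ceiling division so a small remainder still gets
    its own partition, and never returns fewer than `minimum`.
    """
    if data_size_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and target must be positive")
    needed = (data_size_bytes + target_partition_bytes - 1) // target_partition_bytes
    return max(minimum, needed)


# Roughly the figures from the answer: ~500 GB at ~100 MB per partition.
num_partitions = estimate_partitions(500 * 10**9)
print(num_partitions)
```

The resulting number would then be passed as the parameterized partition count, for example to the DataFrame's repartition call before the write.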

Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.
For starters:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
If you allow a partition to grow unbounded you will eventually hit this limitation.
Outside that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and read times. These practical limitations are constantly increasing from version to version. They are not fixed but vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what's too large.
As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase the key_cache_size_in_mb for reads to prevent unnecessary deserialization, but that reduces heap space and fills old gen. You can increase the column index size, but it will increase worst-case IO costs on reads. There are also many different settings for CMS and G1 to tune the impact of a huge spike in object allocations when reading these big partitions. There are active efforts to improve this, so in the future it might no longer be the bottleneck.
Repairs also only go down to (in the best case) the partition level. So if, say, you are constantly appending to a partition, and hashes of that partition on 2 nodes are compared at not exactly the same time (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you're still streaming massive amounts of data and fluctuating disk usage significantly, and that data will then need to be compacted together unnecessarily.
You can probably keep adding corner cases and problem scenarios to this list. Many times large partitions are possible to read, but the tuning and corner cases involved are not really worth it; it is better to just design the data model to be friendly to how Cassandra expects it. I would recommend targeting 100 MB, but you can go far beyond that comfortably. Once you get into the GBs, you will need to start considering tuning for it (depending on data model, use case, etc.).

One bigger partition or few smaller but more distributed partitions for Range Queries in Cassandra?

We have a table that stores our data partitioned by files. One file is 200 MB to 8 GB in JSON, but there's a lot of overhead obviously. Compacting the raw data will lower this drastically. I ingested about 35 GB of JSON data and only one node got slightly more than 800 MB of data. This is possibly due to "write hotspots" -- but we write only once and then only read. We do not update data. Currently, we have one partition per file.
By using secondary indexes, we search for partitions in the database that contain a specific geolocation (= first query) and then take the result of this query to range query a time range of the found partitions (= second query). This might even be the whole file if needed but in 95% of the queries only chunks of a partition are queried.
We have a replication factor of 2 on a 6 node cluster. Data is fairly evenly distributed; every node owns 31.9% to 35.7% (effective) of the data according to nodetool status *tablename*.
Good read performance is key for us.
My questions:
How big is too big for a partition in terms of volume or row size? Is there a rule of thumb for this?
For Range Query performance: Is it better to split up our "big" partitions to have more smaller partitions? We built our schema with "big" partitions because we thought that when we do range queries on a partition, it would be good to have it all on one node so data can be fetched easily. Note that the data is also available on one replica due to RF 2.
C* supports very large partitions, but that doesn't mean it is a good idea to go to that level. The right limit depends on the specific use case, but a good ballpark value could be between 10k and 50k rows per partition. Of course, everything is a compromise, so if you have "huge" (in terms of bytes) rows then heavily limit the number of rows in each partition. If you have "small" (in terms of bytes) rows then you can relax that limit a bit. This is because a partition lives on one node only (plus its replica, given your RF of 2), so all your queries for a specific partition will hit only those nodes.
Range queries should ideally go to one partition only. A range query means a sequential scan of your partition on the node receiving the query. However, you will limit yourself to the throughput of that node. If you split your range queries between more nodes (that is, you change the way you partition your data by adding something like a bucket), you get data from different nodes by performing parallel queries, directly increasing the total throughput. Of course, you'd lose the order of your records across different buckets, so if the order within your partition matters, this may not be feasible.
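As a rough sketch of the bucketing approach, assume the partition key becomes (file id, time bucket), where the bucket is the timestamp divided by a fixed width. The helper below is hypothetical and only works out which buckets a time range touches; the client would then issue one range query per bucket, possibly in parallel.

```python
from typing import List


def buckets_for_range(start_ts: int, end_ts: int,
                      bucket_seconds: int = 3600) -> List[int]:
    """Return the bucket ids that overlap [start_ts, end_ts].

    Timestamps are in seconds. A bucket id is the timestamp
    integer-divided by the bucket width (one hour by default,
    which is an assumption for this example).
    """
    if end_ts < start_ts:
        raise ValueError("end_ts must not be earlier than start_ts")
    first = start_ts // bucket_seconds
    last = end_ts // bucket_seconds
    return list(range(first, last + 1))


# A range from 00:58:20 to 02:01:40 touches three hourly buckets.
print(buckets_for_range(3500, 7300))
```

The bucket width is the trade-off knob here: narrower buckets spread a range over more partitions and nodes, while wider buckets keep more queries within a single partition.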

Will having 1000s of CFs lead to OOM in Cassandra?

I have a cluster with multiple CFs (around 1000, maybe more), and I get OOM errors from time to time on different nodes. We have three Cassandra nodes. Is this expected behavior in Cassandra?
Each table (columnfamily) requires a minimum of 1MB of heap memory, so it's quite possible this is causing some pressure for you.
The best solution is to redesign your application to use fewer tables; most of the time I've seen this, it's because someone designed it to have "one table per X", where X is a customer or a data source or even a time period. Instead, combine tables with a common schema and add a column to the primary key with the distinguishing element.
In the short term, you probably need to increase your heap size.
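To put the 1 MB per table figure in perspective, here is a back-of-the-envelope calculation. The function name is made up for illustration, and the result is only a lower bound based on the minimum quoted above, not a measurement of actual heap usage.

```python
def min_table_heap_mb(table_count: int, per_table_mb: float = 1.0) -> float:
    """Lower bound on heap used just by table overhead, in MB."""
    return table_count * per_table_mb


# Around 1000 column families at a minimum of 1 MB each.
print(min_table_heap_mb(1000))
```

With roughly 1000 column families, that is on the order of 1 GB of heap before any data is read or written.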
