Do compaction processes in C* influence Spark jobs? - apache-spark

I'm using Cassandra 2.1.5 (.469) with Spark 1.2.1.
I performed a migration job with Spark on a big C* table (2,034,065,959 rows), migrating it to another schema table (new_table), using:
some_mapped_rdd.saveToCassandra("keyspace", "new_table", writeConf=WriteConf(parallelismLevel = 50))
I can see in OpsCenter/Activities that C* is running compaction tasks on new_table, and this has been going on for a few days.
In addition, I'm trying to run another job while those compaction tasks are still running, using:
//join with cassandra
val rdd = some_array.map(x => SomeClass(x._1,x._2)).joinWithCassandraTable(keyspace, some_table)
//get only the jsons and create rdd temp table
val jsons = rdd.map(_._2.getString("this"))
val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
jsonSchemaRDD.registerTempTable("this_json")
and it takes much longer than usual to finish (normally I don't run huge migration tasks).
So do compaction processes in C* influence Spark jobs?
EDIT:
My table is configured with the default SizeTieredCompactionStrategy compaction strategy, and I have ~2882 SSTable files of ~20MB (and smaller, on 1 node out of 3), so I guess I should raise the compaction_throughput_mb_per_sec parameter and switch to the DateTieredCompactionStrategy compaction strategy, since my data is time-series data.
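If I go that route, I believe the strategy switch itself is just an ALTER TABLE; my_keyspace below is a placeholder for my real keyspace name:
$ cqlsh -e "ALTER TABLE my_keyspace.new_table WITH compaction = {'class': 'DateTieredCompactionStrategy'};"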

Since compaction can potentially use a lot of system resources, it can influence your Spark jobs from a performance standpoint. You can control how much throughput compactions are allowed to consume via compaction_throughput_mb_per_sec.
On the other hand, reducing compaction throughput will make your compactions take longer to complete.
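For example, you can check and adjust the throttle on a live node with nodetool; the 16 MB/s below is purely illustrative, and the change is not persisted across restarts (compaction_throughput_mb_per_sec in cassandra.yaml is the persistent setting):
$ nodetool getcompactionthroughput
$ nodetool setcompactionthroughput 16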
Additionally, the fact that compaction is happening could mean that the way your data is distributed among SSTables is not optimal. So it could be that compaction is a symptom of the issue, but not the actual issue. In fact it could be the solution to your problem (over time, as it makes more progress).
I'd recommend taking a look at the cfhistograms output of the tables you are querying to see how many SSTables are being hit per read. That can be a good indicator that something is suboptimal - such as needing to change your configuration (i.e. memtable flush rates) or to tune or change your compaction strategy.
This answer provides a good explanation on how to read cfhistograms output.
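For example, something along these lines (keyspace/table names are placeholders) prints the per-read SSTable histogram for a single table:
$ nodetool cfhistograms my_keyspace my_table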

Related

Why is forcing major compaction on a table not ideal?

Consider a scenario where a table has partitions with thousands of deleted rows. When reading from the table, Cassandra has to scan over those thousands of deleted rows before it gets to the live rows.
A common workaround is to manually run a compaction on a node to forcibly get rid of tombstones.
What are the downsides of forcing major compaction on a table (with nodetool compact) and what is the best practice recommendation?
Background
When forcing a major compaction on a table configured with the SizeTieredCompactionStrategy (STCS), all the SSTables on the node get compacted together into a single large SSTable. Due to its size, the resulting SSTable will likely never get compacted out since similar-sized SSTables are not available as compaction candidates. This creates additional issues for the nodes since tombstones do not get evicted and keep accumulating, affecting the cluster's performance.
Caveats
We understand that cluster administrators use major compaction as a way of evicting tombstones which have accumulated as a result of high-delete workloads, which in most cases are due to an incorrect data model.
The recommendation in this post does NOT constitute a solution to the underlying issue users face. It should not be considered a long-term fix to the data model problem.
Recommendation
In Apache Cassandra 2.2, CASSANDRA-7272 introduced a huge improvement for tables using STCS: it splits the output of nodetool compact into multiple files which are 50%, then 25%, then 12.5% of the original table size, until the smallest chunk is 50MB.
When using major compaction as a last resort for evicting tombstones, use the --split-output (or shorthand -s) to take advantage of this new feature:
$ nodetool compact --split-output -- <keyspace> <table>
NOTE - This feature is only available in Cassandra 2.2 and newer.
Also see How to split large SSTables on another server. Cheers!

Is it more efficient to cache a DataFrame in one partition or more partitions

I'm persisting a DataFrame, and in the Spark UI I can see that this DataFrame is partitioned across my 7 nodes.
My Spark job has transformations with wide dependencies.
Could it be more performant to force the cache into only 1 partition?
To avoid shuffle?
Thanks
There is a balance to strike in the number of partitions, and therefore in concurrency. Dare I say it, you are a little off-beam here. Meaning:
Too much partitioning makes no sense --> too much overhead.
Just one partition would mean a coalesce or repartition, and would lose the parallel processing that Spark offers to get the job done quicker; e.g. many workers loading supermarket shelves in parallel is faster than just you and me doing it on our own.
The truth is somewhere in between in terms of the number of partitions, which at scale needs to be estimated and trialled; and shuffling can rarely be avoided unless you base the partitioning on what you read in from an HDFS/Hadoop source (e.g. Kudu), S3, or JDBC.

Does nodetool compact move everything into one SSTable

The Cassandra compaction process reduces the number of SSTables (data files on disk) used to store data. Minor compactions occur automatically. You can tell Cassandra to perform a major compaction using the nodetool compact command.
Does running nodetool compact merely perform one round of compaction, reducing the number of SSTables, but perhaps still resulting in there being several SSTables? Or does it always compact all the SSTables (of a column family) into one SSTable?
It would depend on the compaction strategy you set for the table.
For DateTieredCompactionStrategy and LeveledCompactionStrategy, by definition I don't think even a major compaction would combine all the SSTables since that would go against the structure of SSTables they aim to create.
For the default SizeTieredCompactionStrategy, anecdotally it appears a major compaction will combine the SSTables into a single SSTable. I ran a cassandra-stress write workload and watched the SSTables for a while. I could see the minor compactions combining SSTables of similar sizes, but not collapsing dissimilar sizes into one.
Then when I ran nodetool compact on the table, it would combine SSTables of dissimilar sizes into a single SSTable. I'm not sure if that would be true in all cases.
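If you want to repeat that experiment, a rough sketch (keyspace/table names are placeholders):
$ nodetool cfstats my_keyspace.my_table | grep "SSTable count"
$ nodetool compact my_keyspace my_table
$ nodetool cfstats my_keyspace.my_table | grep "SSTable count"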
Taking a quick look at the source, in CompactionManager.java it calls cfStore.getCompactionStrategy().getMaximalTask(gcBefore), which returns a list of tasks that it executes, so that kind of implies it will compact everything, but I didn't drill down any deeper than that.

Do we need to run manual compaction with Leveled compaction strategy and SizeTiered compaction strategy

We have a couple of tables with Leveled compaction strategy and SizeTiered compaction strategy. How often do we need to run compaction? Thanks in advance
TL;DR
Compaction runs on its own (as long as you did not disable autocompaction, e.g. with nodetool disableautocompaction).
Compaction - what is it?
Per the Cassandra write path, we flush memtables to disk periodically into SSTables (sorted string tables), which are immutable. When you update an existing cell, it eventually gets written to an SSTable, possibly a different one than the original record. When we read, sometimes C* has to scan across various SSTables (with some optimizations, see bloom filters) to find the latest version of a cell. In Cassandra, last write wins.
Compaction takes SSTables and compacts them together, removing duplicate data, to optimize reads. This is an automatic operation, though you can tune compactions to run more or less often.
Some useful details on Compaction
Size-tiered compaction is the default compaction strategy in Cassandra; it looks for SSTables that are roughly the same size and compacts them together when it finds enough (4 by default). Size-tiered is less IO-intensive than leveled and will generally work better when you have smaller boxes and rotational drives.
Leveled compaction is optimized for reads; when you have read-heavy workloads or tight read SLAs with lots of updates, leveled may make sense. Leveled compaction is more IO- and CPU-intensive because you are spending more cycles optimizing for reads, but the reads themselves should be faster and hit fewer SSTables. Keep an eye on IO wait and on pending compactions in nodetool compactionstats when you first enable it or if your workload grows.
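One quick way to keep an eye on that (a minimal sketch; iostat comes from the sysstat package):
$ nodetool compactionstats   # a steadily growing pending count means compaction is falling behind
$ iostat -x 5                # watch %iowait and disk utilisation while compactions run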
Compaction Tunables / Levers
Multithreaded compaction - turn it off; the overhead is bigger than the benefit, to the point where it has been removed in C* 2.1.
Concurrent compactors - now defaults to 2; it used to default to the number of cores, which is a bad default. If you're on the 2.0 branch and not running the latest DSE, check this default and consider decreasing it to 2. This is the number of simultaneous compaction tasks you can run (on different column families).
Compaction throttling - a way of limiting the amount of resources that compactions take up. You can tune this on the fly with nodetool getcompactionthroughput and nodetool setcompactionthroughput. You want to tune this to a point where you are not accumulating pending tasks; 0 means unthrottled. Unthrottled is, unintuitively, not usually the fastest setting, as the system may get bogged down.

Leveled Compaction Strategy with low disk space

We have Cassandra 1.1.1 servers with Leveled Compaction Strategy.
The system works so that there are read and delete operations. Every half a year we delete approximately half of the data while new data comes in. Sometimes disk usage goes up to 75%, while we know that real data takes about 40-50%; the rest of the space is occupied by tombstones. To avoid disk overflow we force compaction of our tables by dropping all SSTables to level 0: we remove the .json manifest file and restart the Cassandra node. (The gc_grace option does not help since compaction starts only after a level is filled.)
Starting from Cassandra 2.0 the manifest file was moved into the SSTable files themselves: https://issues.apache.org/jira/browse/CASSANDRA-4872
We are considering migrating to Cassandra 2.x, but we are afraid we will no longer have the ability to force leveled compaction.
My question is: how could we enforce a disk space limit for our table, e.g. 150GB, so that compaction is triggered automatically when the limit is exceeded? The question is mostly about Cassandra 2.x, though any alternative solutions for Cassandra 1.1.1 are also welcome.
It seems like I've found the answers myself.
There is a tool, sstablelevelreset, available starting from version 2.x, which does a similar level reset to deleting the manifest file. The tool is located in the tools directory of the Cassandra distribution, e.g. apache-cassandra-2.1.2/tools/bin/sstablelevelreset.
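For example, against a stopped node (keyspace/table names are placeholders; as far as I can tell the tool asks for a --really-reset confirmation flag):
$ apache-cassandra-2.1.2/tools/bin/sstablelevelreset --really-reset my_keyspace my_table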
Starting from Cassandra 1.2 (https://issues.apache.org/jira/browse/CASSANDRA-4234) there is tombstone-removal support for Leveled Compaction Strategy via the tombstone_threshold option. It makes it possible to set the maximum ratio of tombstones in an SSTable before a tombstone compaction of that SSTable is triggered.
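For example, the threshold is just a compaction subproperty on the table; the 0.1 value below is purely illustrative (lower than the 0.2 default, so tombstone compactions trigger sooner), and the keyspace/table names are placeholders:
$ cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compaction = {'class': 'LeveledCompactionStrategy', 'tombstone_threshold': '0.1'};"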
