I have 4 Cassandra nodes in a cluster and one column family with 10 columns, where rows cannot grow very wide (maybe 1,000 columns at most).
I have "peak" writes where I insert up to 500,000 records in a 5-10 minute window.
I use the node.js driver node-cassandra-cql.
Three nodes are working fine, but one node crashes every time under heavy writes.
All nodes currently hold around 1.5 GB of data, while the problematic node holds 1.9 GB.
All nodes have the max heap set to 1 GB (the machines have 4 GB of RAM, so the default cassandra-env.sh calculation arrived at this heap size).
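For reference, the heap-sizing logic in the stock cassandra-env.sh works out roughly as follows for a 4 GB machine (a sketch based on the max(min(1/2 RAM, 1 GB), min(1/4 RAM, 8 GB)) rule in that script):

# Reconstruction of the stock cassandra-env.sh default heap calculation.
system_memory_mb = 4096                             # 4 GB of RAM

half_capped = min(system_memory_mb // 2, 1024)      # min(1/2 RAM, 1 GB)  -> 1024
quarter_capped = min(system_memory_mb // 4, 8192)   # min(1/4 RAM, 8 GB)  -> 1024

max_heap_mb = max(half_capped, quarter_capped)      # -> 1024 MB, i.e. the 1 GB heap
print(max_heap_mb)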
I use the default Cassandra configuration, except that I increased the write/read timeouts.
Question: Does anyone know what could be the reason for this?
Is a 1 GB heap really too small?
What, and how, should I configure the Cassandra cluster for this use case (heavy writes in a short time window, with the cluster otherwise idle or handling only small writes)?
I haven't tried increasing the heap size manually yet; first I would like to know whether there is something else to configure instead of just increasing it.
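To make the write pattern concrete, here is a minimal sketch of a throttled bulk insert. It uses the DataStax Python driver purely as a stand-in for node-cassandra-cql, and the contact point, keyspace, table, and column names are all made up:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["10.0.0.1"])           # assumed contact point
session = cluster.connect("myks")         # assumed keyspace

insert = session.prepare(
    "INSERT INTO events (id, col1, col2) VALUES (?, ?, ?)"   # made-up table and columns
)

rows = [(i, "a", "b") for i in range(500_000)]

# concurrency caps the number of in-flight requests so the burst of
# ~500,000 inserts does not pile up on any single node at once.
execute_concurrent_with_args(session, insert, rows, concurrency=64)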
Related question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked a CSV with: size on disk - 12.3 GB, size in memory deserialized - 3.6 GB, size in memory serialized - 1964.9 MB. I got the in-memory sizes by caching the data both deserialized and serialized; 12.3 GB is the size of the file on disk.
To check whether Spark can handle partitions larger than the executor size, I created a cluster with just one executor and spark.executor.memory equal to 500 MB. I also set the executor cores (spark.executor.cores) to 2 and increased spark.sql.files.maxPartitionBytes to 13 GB, and I switched off dynamic allocation and adaptive query execution for good measure. The entire session configuration is:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", False) \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.memory", "500m") \
    .config("spark.sql.adaptive.enabled", False) \
    .config("spark.sql.files.maxPartitionBytes", "13g") \
    .getOrCreate()
I read the CSV and checked the number of partitions it is read into with df.rdd.getNumPartitions(). Output = 2. This is confirmed later on by the number of tasks as well.
Then I ran df.persist(StorageLevel.DISK_ONLY); df.count()
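For completeness, the read-and-persist step looks roughly like this (the CSV path and the header option are assumptions, not taken from my actual job):

from pyspark import StorageLevel

# Hypothetical path; the real file is the 12.3 GB CSV described above.
df = spark.read.option("header", True).csv("/data/big_file.csv")

print(df.rdd.getNumPartitions())      # 2 in my run, matching the later task count

df.persist(StorageLevel.DISK_ONLY)
df.count()                            # triggers the full read and the DISK_ONLY persist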
Following are the observations I made:
No caching happens until the data for one batch of tasks (equal to the number of CPU cores, if you have set 1 CPU core per task) has been read in completely. I conclude this because no entry shows up in the Storage tab of the web UI.
Each partition here ends up being around 6 GB on disk, and should therefore, at a minimum, be around 1964.9 MB / 2 (= size in memory serialized / 2) in memory, which is roughly 982 MB. There is no spill. Below is the relevant snapshot of the web UI from when around 11 GB of the data had been read in. You can see that Input is almost 11 GB, and at that point there was nothing in the Storage tab.
Questions:
Since the memory per executor is 300 MB (execution + storage) + 200 MB (user memory), how is Spark able to manage ~982 MB partitions, and two of them in parallel (one per core)? (A rough breakdown of the executor memory budget is sketched after these questions.)
The data read in does not show up in Storage, is not (and cannot be) held in the executor's memory, and there is no spill either. Where exactly is that read-in data?
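For context, a rough sketch of how a 500 MB executor budget breaks down under Spark's unified memory manager (the 300 MB reserved memory and the default spark.memory.fraction of 0.6 are documented Spark defaults, not numbers I measured):

executor_memory_mb = 500        # spark.executor.memory
reserved_mb = 300               # fixed reservation in the unified memory manager
memory_fraction = 0.6           # default spark.memory.fraction

usable_mb = executor_memory_mb - reserved_mb       # 200 MB left after the reservation
unified_mb = usable_mb * memory_fraction           # 120 MB shared by execution + storage
user_mb = usable_mb - unified_mb                   # 80 MB of user memory

print(unified_mb, user_mb)                         # 120.0 80.0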
Attaching a screenshot of the web UI after job completion, in case it might be useful.
Attaching a screenshot of the Executors tab, in case it might be useful.
Does anybody have any tips when moving Spark execution from a few large nodes to many, smaller nodes?
I am running a system with 4 executors; each executor has 24 GB of RAM and 12 cores. If I try to scale that out to 12 executors with 4 cores and 8 GB of RAM each (same total RAM, same total cores, just distributed differently), I run into out-of-memory errors:
Container killed by YARN for exceeding memory limits. 8.8 GB of 8.8 GB physical memory used.
I have increased the number of partitions by a factor of 3 to create more (but smaller) partitions, but this didn't help.
Does anybody have any tips & tricks when trying to scale spark horizontally?
This is a pretty broad question. Executor sizing in Spark is a complicated kind of black magic, and rules of thumb that were correct in 2015, for example, are obsolete now, just as whatever I say will be obsolete in six months with the next release of Spark. A lot comes down to exactly what you are doing and to avoiding key skew in your data.
This is a good place to start to learn and develop your own understanding:
https://spark.apache.org/docs/latest/tuning.html
There is also a multitude of presentations on Slideshare about tuning Spark; try to read or watch the most recent ones. Be sceptical of anything older than 18 months, and just ignore anything older than 2 years.
I will make the assumption that you are using at least Spark 2.x.
The error you're encountering is indeed because of poor executor sizing. What is happening is that your executors are attempting to do too much at once, and running themselves into the ground as they run out of memory.
All other things being equal, these are the current rules of thumb as I apply them:
The short version
3 - 4 virtual (hyperthreaded) cores and 29GB of RAM is a reasonable default executor size (I will explain why later). If you know nothing else, partition your data well and use that.
You should normally aim for a data partition size (in memory) on the order of ~100MB to ~3GB
The formulae I apply
Executor memory = number of executor cores * partition size * 1.3 (safety factor)
Partition size = (size of data on disk / number of partitions) * deser ratio
The deserialisation ratio is the ratio between the size of the data on disk and the size of data in memory. The Java memory representation of the same data tends to be a decent bit larger than on disk.
You also need to account for whether your data is compressed; many common formats like Parquet and ORC use compression such as gzip or snappy.
For snappy compressed text data (very easily compressed), I use ~10X - 100X.
For snappy compressed data with a mix of text, floats, dates, etc., I typically see between 3X and 15X.
number of executor cores = 3 to 4
How many executor cores to use depends entirely on how compute- versus memory-intensive your calculation is. Experiment and see what is best for your use case. I have never seen anyone well informed on Spark advocate more than 6 cores.
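As an illustrative worked example of these formulae (the input numbers below are made up, not taken from the question):

# Made-up example: a 100 GB on-disk dataset split into 400 partitions,
# with an assumed mid-range deserialisation ratio of 5x.
size_on_disk_gb = 100.0
num_partitions = 400
deser_ratio = 5

partition_size_gb = size_on_disk_gb / num_partitions * deser_ratio       # 1.25 GB per partition

executor_cores = 4
safety_factor = 1.3
executor_memory_gb = executor_cores * partition_size_gb * safety_factor  # 6.5 GB per executor

print(partition_size_gb, executor_memory_gb)                             # 1.25 6.5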
Spark is smart enough to take advantage of data locality, so the larger your executor, the better chance that your data is PROCESS_LOCAL
More data locality is good, up to a point.
When a JVM gets too large (> 50 GB), it begins to operate outside what it was originally designed for, and depending on your garbage collection algorithm, you may begin to see degraded performance and long GC pauses.
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
There is also a performance trick in Java: if your JVM heap is smaller than 32 GB, it can use 32-bit compressed pointers rather than 64-bit pointers, which saves space and reduces cache pressure.
https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
It also happens that YARN adds 7% or 384 MB of RAM (whichever is larger) to your executor size as an overhead / safety factor, which is where the 29 GB rule of thumb comes from: 29 GB + 7% ≈ 31 GB, comfortably under the 32 GB compressed-pointer limit.
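A quick arithmetic check of that rule of thumb (using the 7% / 384 MB overhead figures quoted above; real deployments may configure a different memoryOverhead):

executor_memory_gb = 29
overhead_gb = max(0.07 * executor_memory_gb, 384 / 1024)   # 7% or 384 MB, whichever is larger
container_gb = executor_memory_gb + overhead_gb

print(round(container_gb, 2))   # 31.03 -> still under the 32 GB compressed-pointer threshold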
You mentioned that you are using 12-core, 24 GB RAM executors. This sends up a red flag for me.
Why?
Because every "core" in an executor is assigned one "task" at a time. A task is equivalent to the work required to compute the transformation of one partition from "stage" A to "stage" B.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-taskscheduler-tasks.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-DAGScheduler-Stage.html
If your executor has 12 cores, then it is going to try to do 12 tasks simultaneously with a 24 GB memory budget. 24 GB / 12 cores = 2 GB per core. If your partitions are larger than 2 GB, you will get an out-of-memory error. If a particular transformation doubles the size of the input (even as an intermediate), then you need to account for that as well.
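One practical response, shown as a rough sketch (the dataset size, target partition size, and DataFrame are all illustrative, not taken from the question): repartition so that each partition fits comfortably inside the per-core share of executor memory.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative only: aim for ~1 GB partitions so that a 4-core / 8 GB executor
# (roughly 2 GB per core before overhead) has headroom for intermediate results.
estimated_in_memory_size_gb = 240.0     # assumed deserialised size of the dataset
target_partition_gb = 1.0
num_partitions = int(estimated_in_memory_size_gb / target_partition_gb)   # 240

df = spark.range(1_000_000)             # stand-in for the real DataFrame
df = df.repartition(num_partitions)
print(df.rdd.getNumPartitions())        # 240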
We have a 32-node Cassandra cluster with around 100 GB per node, using the Murmur3 partitioner. It holds time-series data, and we have built secondary indexes on two columns to perform range queries. Currently the cluster is stable, with all the data bulk-loaded and all the secondary indexes rebuilt. The issue occurs when we perform range queries using the cql client or Hector: even a query for the count of rows takes a huge amount of time, and in most cases it causes nodes to fail due to memory issues. The nodes have 8 GB of memory, and the Cassandra max heap is set to 4 GB. Has anyone else faced such an issue? Is there a better way to do count queries?
I've had similar issues, and most often this can be solved by redesigning the schema with the queries you plan to run against the data in mind. For time-series data it is better to have wide rows, with a granularity that matches your queries. If your queries need data at a granularity of 1 hour, then it is best to store all the timestamped data points for each hour in a single wide row, so you can get all the required data for that hour by reading just one row.
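As a hedged sketch of what such a schema could look like (keyspace, table, and column names are made up, and the DataStax Python driver is used purely for illustration):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])         # assumed contact point
session = cluster.connect("metrics")     # assumed keyspace

# One partition (wide row) per sensor per hour; events are clustered by timestamp,
# so reading an hour of data touches a single row.
session.execute("""
    CREATE TABLE readings_by_hour (
        sensor_id   text,
        hour_bucket timestamp,   -- event time truncated to the hour
        event_time  timestamp,
        value       double,
        PRIMARY KEY ((sensor_id, hour_bucket), event_time)
    )
""")

A range query for one hour then reads a single partition instead of walking a secondary index across the cluster.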
Since you say the data is bulk-loaded, I am assuming you may have put all the data into a single table, which is why the get_count query is taking an enormous amount of time. We have a cluster with 8 GB of RAM per node but have set the heap size to 3 GB, because at 4 GB the RAM utilization is almost always at the full 8 GB.
We are experimenting a bit with Cassandra by running some long-running test cases (stress tests), and we are experiencing memory issues on one node of the cluster at any given time (it could be any machine in the cluster!).
We are running DataStax Community with Cassandra 1.1.6 on machines with Windows Server 2008 and 8 GB of RAM. We have also configured the heap size to be 2 GB, as against the default value of 1 GB.
A snippet from the logs:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid2440.hprof ...
Heap dump file created [1117876234 bytes in 11.713 secs]
ERROR 22:16:56,756 Exception in thread Thread[CompactionExecutor:399,1,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.FastByteArrayOutputStream.expand(FastByteArrayOutputStream.java:104)
at org.apache.cassandra.io.util.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:220)
at java.io.DataOutputStream.write(Unknown Source)
Any pointers/help to investigate and fix this?
You're doing the right thing by running long load tests, but in a production use case you wouldn't be writing data like this.
Your rows are probably growing too big to fit in RAM when it comes time to compact them. A compaction requires the entire row to fit in RAM.
There's also a hard limit of 2 billion columns per row, but in reality you shouldn't ever let rows grow that wide. Bucket them by adding a day, a server name, or some other value that is common across your dataset to your row keys.
For a "write-often, read-almost-never" workload you can have very wide rows, but you shouldn't come close to the 2-billion-column mark. Keep it in the millions with bucketing.
For a mixed write/read workload where you're reading entire rows frequently, even hundreds of columns may be too much.
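A minimal sketch of that bucketing idea (the key layout and names are made up for illustration): put a bucket component such as the day into the row key, so no single row accumulates the whole history.

from datetime import datetime, timezone

def bucketed_row_key(entity_id: str, event_time: datetime) -> str:
    # One row per entity per day keeps column counts in the millions at most,
    # far below the 2-billion-column hard limit.
    day_bucket = event_time.strftime("%Y-%m-%d")
    return f"{entity_id}:{day_bucket}"

print(bucketed_row_key("server42", datetime(2013, 1, 15, 22, 16, tzinfo=timezone.utc)))
# server42:2013-01-15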
If you treat Cassandra right you'll easily handle thousands of reads and writes per second per node. I'm seeing about 2.5k reads and writes concurrently per node on my main cluster.
I’m using Cassandra 1.2.1 and the COPY command to insert millions of rows. Each row is 100 bytes long. The issue is that insertion happens rather slowly, at a rate of 1,500 rows per second. We have a 3-node cluster with 50 GB of disk space and 4 GB of RAM each. The Cassandra process is running with a max heap size of 1 GB. We are storing commit logs and data files on the same disk. What could be the cause of this behaviour? Any help would be appreciated.
Apparently as of now, they are not planning to improve the speed of COPY.
See https://issues.apache.org/jira/browse/CASSANDRA-4588