I want to run a simple query over approximately 10 million rows.
I have 32 GB of RAM (20 GB free), but Cassandra uses so much memory that the available RAM is exhausted and the process is killed.
How can I optimize Cassandra? I have read about "Tuning Java resources" and changing the Java heap sizing, but I still have no solution.
Cassandra will use as much memory as is available to it on the system. It's a greedy process and will use any available memory for caching, similar to the way the kernel page cache works. Don't worry if Cassandra is using all of your host's memory; it will just be held in cache and released to other processes if necessary.
If your query suffers from timeouts, the cause is probably that it reads too much data from a single partition, so the query cannot return within read_request_timeout_in_ms. If that is the case, you should look at making your partitions smaller.
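A quick back-of-the-envelope check can tell you whether splitting a partition key (e.g. adding a time bucket) would keep partitions reasonable. The row count and average row size below are illustrative assumptions, not measured values:

```python
# Rough partition-size estimate to decide whether to bucket a partition key.
# Numbers below are illustrative assumptions.

def partition_size_mb(rows_per_partition: int, avg_row_bytes: int) -> float:
    """Approximate size of one partition in MB."""
    return rows_per_partition * avg_row_bytes / (1024 * 1024)

# 10 million rows in a single partition at ~200 bytes/row:
single = partition_size_mb(10_000_000, 200)

# The same rows split into daily buckets over a year (~365 partitions):
bucketed = partition_size_mb(10_000_000 // 365, 200)

print(f"one partition: {single:.0f} MB")   # well above the ~100 MB guideline
print(f"daily buckets: {bucketed:.1f} MB") # comfortably small
```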
I am not sure if this is the right place to post this, but let me know if not. I have a Cassandra ring with 16 nodes which contains ~1.1 billion records, and I would like to know how I can evaluate this cluster. What kinds of metrics are important to collect, and how should I collect them (e.g. memory consumption during inserts)? For example, write/read speed? Compression ratio? Compacted partitions? Should I also use JConsole somehow?
Feel free to post any documentation or links.
Important metrics to watch for on an Apache Cassandra cluster:
Java heap usage
reads/sec
writes/sec
current compactions
disk space
disk latency
disk IOPS
nodes down
CPU (although I wouldn't alert on it)
That should be a good enough list to help you get started. Check out the Monitoring page from the official docs for more information and additional metrics.
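For the "nodes down" metric, one simple approach is to parse the output of `nodetool status`, where the first column is `UN` for up/normal and starts with `D` for down nodes. A minimal sketch (the sample output below is illustrative; real output varies by version):

```python
# Minimal sketch: count down nodes from `nodetool status` output.
# The sample text below is illustrative; real output varies by version.

SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1    1.1 TB   256     25%   aaaa     rack1
DN  10.0.0.2    1.0 TB   256     25%   bbbb     rack1
UN  10.0.0.3    1.2 TB   256     25%   cccc     rack2
UN  10.0.0.4    0.9 TB   256     25%   dddd     rack2
"""

def down_nodes(status_output: str) -> list:
    """Return addresses of nodes whose status column starts with 'D' (Down)."""
    downs = []
    for line in status_output.splitlines():
        parts = line.split()
        # DN/DL/DJ/DM = down while normal/leaving/joining/moving
        if parts and parts[0] in ("DN", "DL", "DJ", "DM"):
            downs.append(parts[1])
    return downs

print(down_nodes(SAMPLE))  # ['10.0.0.2']
```

In practice you would feed this the output of `subprocess.run(["nodetool", "status"], ...)` and alert when the list is non-empty.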
I am new to the Spark framework and I would like to know: what are driver memory and executor memory? What is the most effective way to get maximum performance from both of them?
Spark needs a driver to coordinate the executors, so the best way to understand them is:
Driver
The driver is responsible for the main logic of your code: it requests resources from YARN, handles their allocation, and processes small amounts of data for certain kinds of logic. Driver memory is entirely about how much data you pull back to the driver. If you retrieve too much data with an rdd.collect(), your driver will run out of memory. The driver's memory is usually small; 2 GB to 4 GB is more than enough if you don't send too much data to it.
Worker
Here is where the magic happens: the workers are the ones responsible for executing your job. The amount of memory they need depends on what you are going to do. If you only run a map function that transforms the data with no aggregation, you usually don't need much memory. But if you run big aggregations with many steps, you will usually need a good amount of memory, and that is also related to the size of the files you read.
Prescribing a proper amount of memory for every case is impossible; it all depends on how your job works. You need to understand the impact of each function and monitor your jobs to tune memory usage for each one. Maybe 2 GB per worker is all you need, but sometimes 8 GB per worker is what is required.
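Both settings are passed at submission time. A sketch of a spark-submit invocation along the lines above (the flags are standard Spark options; the script name and the sizes are illustrative, not recommendations):

```shell
# Driver stays small: it only holds collect()-ed data and job bookkeeping.
# Executor memory is sized for the heaviest aggregation stage.
spark-submit \
  --master yarn \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  --executor-cores 4 \
  my_job.py
```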
I am trying to join two large spark dataframes and keep running into this error:
Container killed by YARN for exceeding memory limits. 24 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This seems like a common issue among Spark users, but I can't find any solid descriptions of what spark.yarn.executor.memoryOverhead is. In some cases it sounds like a kind of memory buffer before YARN kills the container (e.g. 10 GB was requested, but YARN won't kill the container until it uses 10.2 GB). In other cases it sounds like it is used for some kind of data-accounting tasks that are completely separate from the analysis I want to perform. My questions are:
What is spark.yarn.executor.memoryOverhead used for?
What is the benefit of increasing this kind of memory instead of executor memory (or the number of executors)?
In general, are there steps I can take to reduce my spark.yarn.executor.memoryOverhead usage (e.g. particular data structures, limiting the width of the dataframes, using fewer executors with more memory, etc.)?
Overhead options are nicely explained in the configuration document:
This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
This also includes user objects if you use one of the non-JVM guest languages (Python, R, etc...).
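Concretely, Spark on YARN requests a container sized as executor memory plus the overhead, which defaults to 10% of executor memory with a 384 MB floor. A small sketch of that arithmetic, which also reproduces the 22 GB limit in the error above:

```python
# How YARN sizes the container: executor memory plus the overhead,
# which defaults to max(384 MB, 10% of executor memory) in Spark on YARN.

MIN_OVERHEAD_MB = 384
OVERHEAD_FRACTION = 0.10

def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Total memory YARN reserves for one executor container."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB,
                          int(executor_memory_mb * OVERHEAD_FRACTION))
    return executor_memory_mb + overhead_mb

# 20 GB executors -> 2 GB default overhead -> 22 GB container,
# matching the "24 GB of 22 GB physical memory used" error above.
print(container_request_mb(20 * 1024))  # 22528
```

Increasing spark.yarn.executor.memoryOverhead raises only the off-heap slack in that sum, whereas increasing executor memory grows both the heap and (by the 10% rule) the overhead.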
I have a few questions with regards to the in-memory feature in Cassandra.
1.) I have a 4-node datacenter, and in OpsCenter, under memory usage, it shows there is 100 GB of in-memory capacity available. Does it mean that each of the 4 nodes has 100 GB available, or is the 100 GB the total in-memory capacity for my datacenter?
2.) If 100 GB really is available for in-memory use across the datacenter, is it advisable to use the full capacity? Do I need to factor in replication as well? Say I have 15 GB of data which I want to store in-memory; if the replication factor is 2, will we effectively have 30 GB of data in-memory in the datacenter?
3.) In the dse.yaml file there is a property, max_memory_to_lock_fraction, whose value is a fraction of system memory; by default it is 20%. As per the DataStax guidelines for Cassandra, we need to ensure that in-memory usage does not exceed 45% of the total available system memory on each node. Is max_memory_to_lock_fraction the parameter that needs to be set to 45%?
4.) The DataStax documentation says compression needs to be removed for in-memory tables. If compression is set anyway, will it affect read/write performance?
5.) The output of dsetool inmemorystatus has a parameter called "Current Total memory not able to lock". Does the value of this parameter denote the available memory? For example, if the value is 1024 MB, does it mean that 1 GB of in-memory capacity is still available for use?
I am using DSE 4.8.11 version. Please help me as I am trying to understand this feature so as to leverage it best.
Thanks in advance.
1) It depends on how you configure it: it can be per cluster (all of the available memory), or you can view graphs of individual nodes.
2) Yes, the replication factor multiplies the total data by that factor. You will have to account for that at the cluster level. A very nice tool to help you get started: https://www.ecyrd.com/cassandracalculator/
3) Yes max_memory_to_lock_fraction is what you are looking for
4) It will increase processing time; since writes in Cassandra are actually CPU-bound, this might not be the best idea performance-wise.
5) Yes, this means there is still memory (of the specified amount), but due to the settings Cassandra is unable to lock it.
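The replication arithmetic from question 2 can be sketched directly; the 15 GB and RF 2 figures come from the question, and the capacity check is a simplification that ignores per-node balance:

```python
# Sketch: in-memory footprint of a table across the datacenter once
# replication is factored in (numbers taken from the question above).

def inmemory_footprint_gb(data_gb, replication_factor):
    """Total in-memory space the data occupies cluster-wide."""
    return data_gb * replication_factor

def fits(data_gb, replication_factor, capacity_gb):
    """Naive check against total datacenter in-memory capacity."""
    return inmemory_footprint_gb(data_gb, replication_factor) <= capacity_gb

print(inmemory_footprint_gb(15, 2))  # 30  (GB, as in the question)
print(fits(15, 2, 100))              # True: 30 GB fits in 100 GB
```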
Currently my DSE Cassandra uses up all of the memory, so after some time, as the amount of data increases, the whole system crashes. But Spark, OpsCenter, the agent, etc. also need several GB of memory. I am now trying to allocate only half of the memory to Cassandra, but I'm not sure whether that will work.
This is my error message:
kernel: Out of memory: Kill process 31290 (java) score 293 or sacrifice child
By default DSE sets the executor memory to (total RAM) * 0.7 - (RAM used by C*). This should be OK for most systems. With this setup, Spark shouldn't be able to OOM C* or vice versa. If you want to change that multiplier (0.7), it's set in the dse.yaml file as
initial_spark_worker_resources: 0.7
If I were going for the minimum memory for such a system it would be 16 GB, but I would recommend at least 32 GB if you are serious. This should be increased even more if you are doing a lot of in-memory caching.
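The default formula described above can be sketched as a few lines of arithmetic; the 32 GB host and 8 GB Cassandra heap below are illustrative assumptions, not a recommendation:

```python
# The DSE default described above: Spark workers get
# initial_spark_worker_resources * total RAM, minus what Cassandra holds.
# The host size and Cassandra heap below are illustrative assumptions.

def spark_worker_memory_gb(total_ram_gb, cassandra_ram_gb, fraction=0.7):
    """Memory left for Spark workers under DSE's default formula."""
    return total_ram_gb * fraction - cassandra_ram_gb

# 32 GB host with Cassandra holding 8 GB:
print(round(spark_worker_memory_gb(32, 8), 1))  # 14.4
```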