When I run the Spark driver, the machine's memory grows so much that the application cannot run. Has anyone encountered this problem?
I used the MAT tool to try to see what the problem is, but I still have no idea.
I am new to the Spark framework and I would like to know: what are driver memory and executor memory? What is an effective way to get the maximum performance out of both of them?
Spark needs a driver to coordinate the executors. So the best way to understand it is:
Driver
The driver is responsible for running the main logic of your code, requesting resources from YARN, handling allocation, and processing a small amount of data for certain kinds of logic. Driver memory is all about how much data you pull back to the driver to process. If you retrieve too much data with an rdd.collect(), your driver will run out of memory. The driver's memory is usually small; 2 GB to 4 GB is more than enough if you don't send too much data to it.
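As a minimal sketch of that point (the parallelized range is just a stand-in dataset), only collect() ships the whole dataset to the driver; take() and count() keep the heavy lifting on the executors:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(1_000_000))  # stand-in dataset

all_rows = rdd.collect()  # ships every record to the driver -> can exhaust driver memory
sample = rdd.take(100)    # ships only the first 100 records to the driver
n = rdd.count()           # aggregation runs on the executors; only a number comes back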
Worker
Here is where the magic happens: the workers are the ones that actually execute your job. The amount of memory they need depends on what you are going to do. If you are only going to run a map function that transforms the data, with no aggregation, you usually don't need much memory. But if you are going to run big aggregations, many steps, and so on, you will usually need a good amount of memory. It is also related to the size of the files you will read.
Recommending an exact amount of memory for every case is impossible; it all depends on how your job works. You need to understand the impact of each function and monitor your jobs to tune the memory usage of each one. Maybe 2 GB per worker is what you need, but sometimes 8 GB per worker is what you need.
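As a rough illustration only (the sizes below are placeholders, not recommendations), these settings could be applied like this. Note that in client mode spark.driver.memory has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory, rather than from inside the script:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-sizing-sketch")
    .config("spark.driver.memory", "2g")     # only honored if set before the driver JVM starts
    .config("spark.executor.memory", "4g")   # heap available to each executor/worker
    .getOrCreate()
)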
I have been wanting to find a good way to profile a Spark application's executors when it is run from a Jupyter notebook interface. I basically want to see details like heap memory usage, young and perm gen memory usage, etc. over time for a particular executor (at least the ones that fail).
I see many solutions out there, but nothing that seems mature and easy to install/use.
Are there any good tools that let me do this easily?
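One low-tech starting point, not a full profiler, is to have each executor JVM write GC logs via spark.executor.extraJavaOptions and then pull those logs from the worker nodes afterwards. A sketch, assuming a Java 8 JVM (Java 9+ would use -Xlog:gc* instead):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions",
            "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .getOrCreate()
)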
So basically I have a Python Spark job that reads some simple JSON files and then tries to write them as ORC files partitioned by one field. The partitions are not very balanced, as some keys are really big and others really small.
I had memory issues when doing something like this:
events.write.mode('append').partitionBy("type").save("s3n://mybucket/tofolder", format="orc")
Adding memory to the executors didn't seem to have any effect, but I solved it by increasing the driver memory. Does this mean that all the data is being sent to the driver for it to write? Can't each executor write its own partition? I'm using Spark 2.0.1.
Even if you partition the dataset and then write it to storage, the records are not sent to the driver. You should look at the logs of the memory issues (whether they occur on the driver or on the executors) to figure out the exact reason for the failure.
Your driver probably has too little memory left to handle this write because of previous computations. Try decreasing spark.ui.retainedJobs and spark.ui.retainedStages to save the memory spent on metadata for old jobs and stages. If this doesn't help, connect to the driver with jvisualvm to find the job/stage that consumes large heap fragments and try to optimize it.
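A minimal sketch of the two settings mentioned above; both default to 1000, and the values here are only illustrative:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.ui.retainedJobs", "100")    # keep metadata for fewer finished jobs
    .config("spark.ui.retainedStages", "100")  # keep metadata for fewer finished stages
    .getOrCreate()
)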
I was watching a video on Apache Spark in which the speaker, Paco Nathan, says the following:
"If you have 128 GB of RAM, you are not going to throw them all at once at the JVM. That will just cause a lot of garbage collection. And so one of the things with Spark is, use more sophisticated ways to leverage the memory space, do more off-heap."
I am not able to understand what he means with regard to how Spark handles this scenario efficiently.
More specifically, I completely do not understand the statement:
"If you have 128 GB of RAM you are not going to throw them all at once at the JVM. That will just cause a lot of garbage collection."
Can someone explain what the reasoning behind these statements actually is?
"If you have 128 GB of RAM you are not going to throw them all at once
at the jvm.That will just cause of lot of garbage collection"
This means that you will not assign all the memory to the JVM only when there is memory requirement for other stuff like garbage collection, off-heap operations, etc.
Spark does this by assigning fractions of the memory(that you have assigned to Spark executors) for such operations as shown in image below(for Spark 1.5.0):
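For a concrete, simplified sketch of "not giving everything to the JVM": with the unified memory manager introduced after 1.5.0, you can cap the executor heap and place part of the data off-heap, outside the garbage collector's reach. The sizes here are placeholders:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "16g")          # JVM heap per executor (GC-managed)
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "32g")      # managed by Spark outside the GC heap
    .getOrCreate()
)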
I want to make a simple query over approximately 10 million rows.
I have 32 GB of RAM (20 GB free), and Cassandra is using so much memory that the available RAM is exhausted and the process is killed.
How can I optimize Cassandra? I have read about "Tuning Java resources" and changing the Java heap sizing, but I still have no solution.
Cassandra will use as much memory as is available to it on the system. It is a greedy process and will use any available memory for caching, similar to the way the kernel page cache works. Don't worry if Cassandra is using all of your host's memory; it will just be held in cache and will be released to other processes if necessary.
If your query is suffering from timeouts, the cause is probably reading too much data from a single partition, so that the query doesn't return within read_request_timeout_in_ms. If this is the case, you should look at making your partition sizes smaller.