Ideal Spark configuration - apache-spark

I am using Apache Spark on HDFS with MapR in our project. We are facing issues running Spark jobs, as they fail after a small increase in the data. We read data from a CSV file, do some transformations and aggregations, and then store the result in HBase.
Current data size: 3 TB
Available resources:
Total nodes: 14
Memory available: 1 TB
Total vcores: 450
Total disk: 150 TB
Spark conf:
executorCores: 2
executorInstance: 50
executorMemory: 40 GB
minPartitions: 600
Please suggest whether the above configuration looks fine, because the error I am getting looks like an OutOfMemory error.

Can you say a bit about how the jobs are failing? Without more information it will be very hard to tell. It would also help to know which version of Spark you are on and whether you are running under YARN, on a standalone Spark cluster, or on Kubernetes.
Even without that information, however, it seems likely that there is a configuration issue here. What may be happening is that Spark is being told contradictory things about how much memory is available, so that when it tries to use memory it thinks it is allowed to use, the system says no.
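As a rough illustration of that kind of mismatch, the numbers from the question can be checked against the cluster totals. This is only a back-of-the-envelope sketch in Python; the 10% overhead is the usual YARN default (max(384 MB, 10% of executor memory)), and your actual settings may differ.

# Figures taken from the question above.
executor_instances = 50
executor_memory_mb = 40 * 1024
# Default YARN memory overhead: max(384 MB, 10% of executor memory).
overhead_mb = max(384, executor_memory_mb // 10)

requested_gb = executor_instances * (executor_memory_mb + overhead_mb) / 1024
available_gb = 1024  # "Memory available: 1 TB"

print(f"requested ~{requested_gb:.0f} GB vs {available_gb} GB available")
# requested ~2200 GB vs 1024 GB available -- more than double the cluster's memory,
# so the executors Spark is configured to expect cannot all actually be granted.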

Related

Spark jobs seem to only be using a small amount of resources

Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
The cluster is 1 + 16 nodes, with 8 cores / 40 GB memory / 1 TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI shows that only 34/128 vcores are in use, and they do not appear to be evenly distributed (the jobs were executed simultaneously, but the distribution is 2/7/7/11/7). There is only one core allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6, which don't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?
I have managed to solve the issue - I had no cap on the memory usage so it looked as though all memory was allocated to just 2 cores per node.
I added the property spark.executor.memory=4G and re-ran the job; it instantly allocated 92 cores.
Hope this helps someone else!
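For anyone who wants to set this from code rather than from submit flags, a minimal sketch of the equivalent session configuration follows. It mirrors the --num-executors 6 / --executor-cores 4 flags from the question plus the spark.executor.memory cap that made the difference; note that with Dataproc's default dynamic allocation, spark.executor.instances only sets the initial number of executors.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dataproc-job")
    .config("spark.executor.instances", "6")   # initial executors when dynamic allocation is on
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")     # the cap that freed up the remaining cores here
    .getOrCreate()
)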
The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
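As a sketch, both properties can be set when the session is created; the value 128 below is only illustrative (it matches the 128 vcores mentioned in the question), not a recommendation:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-example")
    # Default number of partitions for RDD transformations that don't specify one.
    .config("spark.default.parallelism", "128")
    # Number of shuffle partitions produced by DataFrame/SQL aggregations and joins.
    .config("spark.sql.shuffle.partitions", "128")
    .getOrCreate()
)

# Per-operation alternatives:
#   rdd.reduceByKey(func, numPartitions=128)
#   df.repartition(128)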

Spark Memory Usage Concentrated on Driver / Master

I'm currently developing a Spark (v 2.2.0) Streaming application and am running into issues with the way Spark seems to be allocating work across the cluster. This application is submitted to AWS EMR using client mode, so there is a driver node and a couple of worker nodes. Here is a screenshot of Ganglia that shows memory usage in the last hour:
The left-most node is the "master" or "driver" node, and the other two are worker nodes. There are spikes in the memory usage for all three nodes that correspond to workloads coming in through the stream, but the spikes are not equal (even when scaled to % memory usage). When a large workload comes in, the driver node appears to be overworked, and the job will crash with an error regarding memory:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000053e980000, 674234368, 0) failed; error='Cannot allocate memory' (errno=12)
I've also run into this:
Exception in thread "streaming-job-executor-10" java.lang.OutOfMemoryError: Java heap space
This happens when the master runs out of memory, which is equally confusing, as my understanding is that "client" mode would not use the driver / master node as an executor.
Pertinent details:
As mentioned earlier, this application is submitted in client mode: spark-submit --deploy-mode client --master yarn ....
Nowhere in the program am I running collect or coalesce
Any work that I suspect of being run on a single node (jdbc reads mainly) is repartition'd after completion.
There are a couple of very, very small datasets persisted into memory.
1 x Driver specs: 4 cores, 16GB RAM (m4.xlarge instance)
2 x Worker specs: 4 cores, 30.5GB RAM (r3.xlarge instance)
I have tried both allowing Spark to choose executor size / cores and specifying them manually. Both cases behave the same. (I manually specified 6 executors, 1 core, 9GB RAM)
I'm certainly at a loss here. I'm not sure what is going on in the code to be triggering the driver to hog the workload like this.
The only suspect I can think of is a code snippet similar to the following:
val scoringAlgorithm = HelperFunctions.scoring(_: Row, batchTime)
val rawScored = dataToScore.map(scoringAlgorithm)
Here, a function is being loaded from a static object, and used to map over the Dataset. It is my understanding that Spark will serialize this function across the cluster (re: http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#passing-functions-to-spark). However perhaps I am mistaken and it is simply running this transformation on the driver.
If anyone has any insight to this issue, I would love to hear it!
I ended up solving this issue. Here's how I addressed it:
I made an incorrect assertion in stating the problem: there was a collect statement at the beginning of the Spark program.
I had a transaction that required collect() to run as it was designed. My assumption was that calling repartition(n) on the resulting data would split the data back amongst the executors in the cluster. From what I can tell, this strategy does not work. Once I re-wrote this line, Spark started behaving as I expected and farming jobs out to worker nodes.
My advice to any lost soul who stumbles across this issue: don't collect unless it's the end of your Spark program. You cannot recover from it. Find another way to perform your task. (I ended up switching a SQL transaction from where col in (,,,) syntax to a join on the database.)
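The code in question was Scala and the final fix was done in SQL on the database side, but the same idea expressed in PySpark looks roughly like the sketch below; df and keys_df are hypothetical stand-ins for the real tables.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-instead-of-collect").getOrCreate()

# Hypothetical stand-ins for the real tables in the original job.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["key", "value"])
keys_df = spark.createDataFrame([(1,), (3,)], ["key"])

# Anti-pattern: collect() pulls every key back to the driver, and the resulting
# IN-list has to be shipped out again as part of the filter.
keys = [row["key"] for row in keys_df.select("key").collect()]
filtered_in_list = df.filter(df["key"].isin(keys))

# Preferred: express the same restriction as a join, so nothing returns to the driver.
filtered_join = df.join(keys_df, on="key", how="inner")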

Spark + Elastic search write performance issue

I'm seeing a low number of writes to Elasticsearch using Spark (Java).
Here are the configurations:
ES cluster: 4 x 13.xlarge instances, each with 4 processors.
Refresh interval set to -1, replicas set to 0, and the other basic configurations required for better write performance.
Spark:
2-node EMR cluster with
2 core instances
- 8 vCPU, 16 GiB memory, EBS-only storage
- EBS storage: 1000 GiB
1 master node
- 1 vCPU, 3.8 GiB memory, 410 GB SSD storage
The ES index has 16 shards defined in the mapping.
I'm running the job with the following config:
executor-memory - 8g
spark.executor.instances=2
spark.executor.cores=4
and using
es.batch.size.bytes - 6MB
es.batch.size.entries - 10000
es.batch.write.refresh - false
With this configuration, I try to load 1 million documents (each about 1300 bytes in size), and the load runs at about 500 docs per second per ES node.
In the Spark log, for each task I'm seeing:
-1116 bytes result sent to driver
Spark Code
JavaRDD<String> javaRDD = jsc.textFile("<S3 Path>");
JavaEsSpark.saveJsonToEs(javaRDD,"<Index name>");
Also, when I look at the in-network graph for the ES cluster, traffic is very low, and I can see that EMR is not sending much data over the network. Is there a way to tell Spark to send the right amount of data to make the writes faster?
OR
Is there any other config that I am missing or should tweak? 500 docs per second per ES instance seems low to me. Can someone please point out what I am missing in these settings to improve my ES write performance?
Thanks in advance.
You may have an issue here.
spark.executor.instances=2
You are limited to two executors, where you could have four based on your cluster configuration. I would change this to 4 or greater. I might also try executor-memory = 1500M, cores = 1, instances = 16. I like to leave a little overhead in my memory, which is why I dropped from 2G to 1.5G (but you can't write 1.5G, so it has to be 1500M). If you are connecting to ES from your executors, this will improve performance.
I would need some code to debug further. I wonder if you are connecting to Elasticsearch only in your driver and not in your worker nodes, meaning you are only getting one connection instead of one per executor.
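To make that concrete, here is a rough sketch of the proposed executor sizing combined with the es.* settings from the question, expressed as a PySpark configuration (the Java job in the question would pass the same properties to its SparkConf); treat the numbers as starting points to test rather than as a definitive answer.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("es-bulk-load")
    # More, smaller executors so each one opens its own connection to ES.
    .set("spark.executor.instances", "16")
    .set("spark.executor.cores", "1")
    .set("spark.executor.memory", "1500m")
    # Bulk-write settings from the question (elasticsearch-hadoop connector).
    .set("es.batch.size.bytes", "6mb")
    .set("es.batch.size.entries", "10000")
    .set("es.batch.write.refresh", "false")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()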

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm getting a ~10.5 GB memory limit on a ~75 GB cluster (15 GB per m3.xlarge instance)...
Here is my EMR config:
[
  {
    "classification": "spark-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34"
        },
        "configurations": []
      }
    ]
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  }
]
From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster, i.e. I should have ~75 GB of memory available... So why am I getting a ~10.5 GB memory limit error?
Here is the code I'm running:
def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))

    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))

    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(
                     pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))

    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws(
        "_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
    return sessions


def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated


spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)

raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)

# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)

raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))

sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)
Basically, nothing more than windowing and a groupBy to aggregate the data.
The job starts with a few of these errors, and the number of them increases until the whole job halts.
I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
I feel your pain..
We had similar issues of running out of memory with Spark on YARN. We have five 64GB, 16 core VMs and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we would give them. And this was a relatively straightforward Spark application that was causing it to happen.
We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
Doing more research, I found the answer to why this happens here: http://web.archive.org/web/20190806000138/https://mapr.com/blog/best-practices-yarn-resource-management/
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory due to OS behavior, you should disable virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
That page had a link to a very useful page from IBM: https://web.archive.org/web/20170703001345/https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.
Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
I recommend reading through both pages -- the information is very handy.
If you're not using spark-submit, and you're looking for another way to specify the yarn.nodemanager.vmem-check-enabled parameter mentioned by Duff, here are 2 other ways:
Method 2
If you're using a JSON Configuration file (that you pass to the AWS CLI or to your boto3 script), you'll have to add the following configuration:
[{
"Classification": "yarn-site",
"Properties": {
"yarn.nodemanager.vmem-check-enabled": "false"
}
}]
Method 3
If you use the EMR console, add the following configuration:
classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]
See, I had the same problem in a huge cluster that I'm working on now. The problem will not be solved by adding memory to the workers. Sometimes during aggregation Spark will use more memory than it has, and the Spark jobs will start to use off-heap memory.
One simple example:
If you have a dataset that you need to reduceByKey, it will sometimes aggregate more data on one worker than on another, and if that data exceeds the memory of that worker, you get this error message.
Setting the option spark.yarn.executor.memoryOverhead to 50% of the memory used by the worker will help (just as a test to see if it works; with more testing you can then lower it).
But you need to understand how Spark handles memory allocation in the cluster:
In the most common setup, Spark uses 75% of the machine's memory. The rest goes to the OS.
Spark has two types of memory during execution: one region for execution and the other for storage. Execution memory is used for shuffles, joins, aggregations, etc. Storage memory is used for caching and propagating data across the cluster.
One good thing about this memory allocation: if you are not caching anything during execution, you can configure Spark to use that storage space for execution, which partly avoids the OOM error. As the Spark documentation puts it:
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
But how can we use that?
You can change some configurations: add the memoryOverhead configuration to your job call, but also consider this: change spark.memory.fraction to 0.8 or 0.85 and reduce spark.memory.storageFraction to 0.35 or 0.2.
Other configurations can help, but you need to check what applies in your case. See all of these configurations here.
Now, what helped in my case:
I have a cluster with 2.5K workers and 2.5 TB of RAM, and we were facing an OOM error like yours. We just increased spark.yarn.executor.memoryOverhead to 2048 and enabled dynamic allocation. When we call the job, we don't set the memory for the workers; we leave that for Spark to decide. We just set the overhead.
But for some tests on my small cluster, I also changed the sizes of the execution and storage memory. That solved the problem.
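As a concrete sketch, the settings described above would look something like this; the values are the ones mentioned in this answer and should be treated as starting points to test against your own workload:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("oom-tuning-sketch")
    # Extra off-heap room per executor container, in MB.
    .set("spark.yarn.executor.memoryOverhead", "2048")
    # Give the unified memory region a larger share of the heap...
    .set("spark.memory.fraction", "0.8")
    # ...and let execution borrow most of it when little is cached.
    .set("spark.memory.storageFraction", "0.2")
    # Let Spark decide how many executors to run.
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")  # required for dynamic allocation on YARN
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()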
Try repartitioning. It worked in my case.
The dataframe was not that big at the very beginning, when it was loaded with read.csv(). The data file amounted to 10 MB or so, which might require, say, several hundred MB of memory in total for each processing task in an executor.
I checked the number of partitions to be 2 at the time.
Then it grew like a snowball during the following operations, joining with other tables and adding new columns. And then I ran into the memory-exceeding-limits issue at a certain step.
I checked the number of partitions; it was still 2, derived from the original dataframe, I guess.
So I tried to repartition it at the very beginning, and there was no problem anymore.
I have not read much material about Spark and YARN yet. What I do know is that there are executors in nodes, and an executor can handle many tasks depending on the resources. My guess is that one partition is atomically mapped to one task, and its volume determines the resource usage. Spark cannot slice it up if one partition grows too big.
A reasonable strategy is to determine the nodes and container memory first, either 10 GB or 5 GB. Ideally, either could serve any data-processing job; it's just a matter of time. Given the 5 GB memory setting, suppose after testing you find that a reasonable number of rows for one partition is 1000 (no step fails during processing); then we could do it as in the following pseudo code:
RWS_PER_PARTITION = 1000

input_df = spark.read.csv("file_uri", *other_args)
total_rows = input_df.count()
original_num_partitions = input_df.rdd.getNumPartitions()

# Never go below the partition count the file was read with.
numPartitions = max(total_rows // RWS_PER_PARTITION, original_num_partitions)
input_df = input_df.repartition(numPartitions)
Hope it helps!
I had the same issue on a small cluster running a relatively small job on Spark 2.3.1.
The job reads a parquet file, removes duplicates using groupBy/agg/first, then sorts and writes a new parquet file. It processed 51 GB of parquet files on 4 nodes (4 vcores, 32 GB RAM each).
The job was constantly failing on the aggregation stage. I wrote a bash script to watch executor memory usage and found that in the middle of the stage one random executor starts taking double memory for a few seconds. When I correlated the time of this moment with the GC logs, it matched a full GC that freed a large amount of memory.
At last I understood that the problem is somehow related to GC. ParallelGC and G1 cause this issue constantly, but ConcMarkSweepGC improves the situation. The issue appears only with a small number of partitions. I ran the job on EMR where OpenJDK 64-Bit (build 25.171-b10) was installed. I don't know the root cause of the issue; it could be related to the JVM or the operating system. But it is definitely not related to heap or off-heap usage in my case.
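For reference, switching the collector does not require changing the cluster image; it can be passed per job as an extra executor JVM option. A minimal sketch, assuming a JDK 8 runtime where CMS is still available:

from pyspark import SparkConf

conf = (
    SparkConf()
    .setAppName("gc-experiment")
    # Use CMS instead of the default collector on the executor JVMs, and print GC
    # details so they can be correlated with the memory spikes described above.
    .set("spark.executor.extraJavaOptions",
         "-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails")
)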
UPDATE1
I tried Oracle HotSpot; the issue is reproduced.

Tuning Spark (YARN) cluster for reading 200GB of CSV files (pyspark) via HDFS

I'm working with an 11-node cluster running on AWS right now (1 master, 10 workers - c3.4xlarge) and I'm attempting to read in ~200GB of .csv files (only about 10 actual .csv files) from HDFS.
This process is going really slowly. I'm watching Spark at the command line and it looks like this:
[Stage 0:> (30 + 2) / 2044]
Progress increases by 2 units (meaning 30+2 goes to 32+2, then 34+2, etc.) maybe every 20 seconds. So this is in serious need of improvement, or else we'll be here for about 11 hours before the files are even done being read.
This is the code up to this point:
# AMAZON AWS EMR
from pyspark import SparkConf, SparkContext

def sparkconfig():
    conf = SparkConf()
    conf.setMaster("yarn-client")  # client mode so output reaches the terminal
    conf.set("spark.default.parallelism", 340)
    conf.setAppName("my app")
    conf.set("spark.executor.memory", "20g")
    return conf

sc = SparkContext(conf=sparkconfig(),
                  pyFiles=['/home/hadoop/temp_files/redis.zip'])

path = 'hdfs:///tmp/files/'
all_tx = sc.textFile(path).coalesce(1024)

# ... more code for processing
Now clearly 1024 for partitions may not be correct, that was just from googling and trying different things. I'm really at a loss when it comes to tuning this job.
The worker nodes on AWS are c3.4xlarge instances (I have 10 in the cluster), each with 30 GB of RAM and 16 vCPUs. The HDFS partition is made up of the local storage of each node in the cluster, which is 2 x 160 GB SSDs, so I believe we're looking at (2 * 160 GB * 10 nodes / 3x replication) = ~1 TB of HDFS.
The .csv files themselves range from 5GB to 90GB in size.
To clarify, in case it is relevant: the Hadoop cluster is the same as the Spark cluster as far as the nodes go. I'm allocating 20 GB of the 30 GB per node to the Spark executor, leaving 10 GB per node for the OS + Hadoop/YARN, etc. The name node / Spark master node is an m3.xlarge, which has 4 vCPUs and 16 GB of RAM.
Does anyone have suggestions on tuning options (or anything, really) that I might try to speed up this file-read process?
Shameless plug (I'm the author): try Sparklens https://github.com/qubole/sparklens
Most of the time the real question is not whether the application is slow, but whether it will scale. And for most applications, the answer is: up to a limit.
The structure of a Spark application puts important constraints on its scalability. The number of tasks in a stage, dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.

Resources