AWS Glue: Data Skewed or not Skewed? - apache-spark

I have a job in AWS Glue that fails with:
An error occurred while calling o567.pyWriteDynamicFrame. Job aborted due to stage failure: Task 168 in stage 31.0 failed 4 times, most recent failure: Lost task 168.3 in stage 31.0 (TID 39474, ip-10-0-32-245.ec2.internal, executor 102): ExecutorLostFailure (executor 102 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
The main message is Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used.
I have used broadcast joins for the small dataframes and the salting technique for the bigger tables.
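Roughly, the salting looks like this (a sketch with placeholder names such as big_df, other_df and join_key, not the exact job code):
from pyspark.sql import functions as F

NUM_SALTS = 32

# Add a random salt to the skewed side so a hot key is spread over NUM_SALTS partitions.
big_salted = big_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate each row of the other side once per salt value.
other_salted = other_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

joined = big_salted.join(other_salted, on=["join_key", "salt"]).drop("salt")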
The input consists of 75GB of JSON files to process.
I have used a grouping of 32MB for the input files:
additional_options={
    'groupFiles': 'inPartition',
    'groupSize': 1024*1024*32,
},
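For context, the read looks roughly like this (a sketch assuming a Data Catalog source; the database and table names are placeholders):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

input_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # placeholder
    table_name="my_json_table",    # placeholder
    additional_options={
        'groupFiles': 'inPartition',
        'groupSize': 1024*1024*32,  # read the input files in ~32 MB groups
    },
)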
The output file is written with 256 partitions:
output_df = output_df.coalesce(256)
In AWS Glue I launch the job with 60 G.2X workers = 60 x (8 vCPU, 32 GB of memory, 128 GB disk).
Below is the plot representing the metrics for this job. From that, the data don't look skewed... Am I wrong?
Any advice to successfully run this is welcome!

Try to use repartition instead of coalesce. coalesce only merges existing partitions and avoids a shuffle, so Spark can push the 256-partition limit up into the preceding stage and run the whole computation with only 256 tasks. When those 256 tasks cannot handle the input data volume, you get the memory error. repartition performs a full shuffle, so the upstream work keeps its original parallelism and only the output is written with the number of partitions you asked for.
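For example (a sketch; output_df is the DataFrame from the question):
# repartition() performs a full shuffle and spreads the data evenly over 256
# partitions, instead of collapsing the upstream stage to 256 tasks the way
# coalesce(256) can.
output_df = output_df.repartition(256)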

Related

How to extend the memory limit of PySpark running locally on Windows 10 / JVM 64bit

I am trying to run PySpark operations in a Jupyter Notebook, and it seems there is a (rather low) working-memory threshold above which it halts with an error message. The laptop has 16GB of RAM (of which 50% is free when running the script), so physical memory shouldn't be the problem. Spark runs on a 64-bit JVM, 1.8.0_301. The Jupyter Notebook runs on Python 3.9.5.
The dataframe consists of only 360K rows and two 'long' type columns (i.e. ca. 3.8MB only). The script works properly if I reduce the dataframe to about 1.5MB of memory usage (49,200 rows). Above that, the script collapses on the df.toPandas() call with the following error message (extract):
Py4JJavaError: An error occurred while calling o234.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 50.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 50.0 (TID 577) (BEXXXXXX.subdomain.domain.com executor driver):
TaskResultLost (result lost from block manager)
This is a well known error message when PySpark runs into memory limits, so I tried to adjust the settings as follows:
In the %SPARK_HOME%/conf/spark-defaults.conf file:
spark.driver.memory 4g
In Jupyter notebook itself:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.driver.memory", "4G")\
    .config("spark.driver.maxResultSize", "4G")\
    .appName("MyApp")\
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
spark.sparkContext.setSystemProperty('spark.executor.memory', '4G')
I tried to play with the values of spark.driver.memory, spark.executor.memory etc., but the threshold seems to remain the same.
The Spark panel (on http://localhost:4040) says in the Executors menu that Storage memory is 603 KiB / 2 GiB, Input is 4.1 GiB, Shuffle read is 60.6 MiB, and Shuffle write is 111.3 MiB. These figures are essentially the same when I reduce the dataframe size below 1.5MB and the script runs properly.
Have you got any ideas how to raise this 1.5MB limit, and where it is coming from?
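For completeness, the values the running session actually picked up can be inspected like this (a sketch; note that properties set after getOrCreate(), e.g. via setSystemProperty, do not resize a driver JVM that is already running):
# Inspect the effective configuration of the live session.
print(spark.sparkContext.getConf().get("spark.driver.memory", "not set"))
print(spark.sparkContext.getConf().get("spark.driver.maxResultSize", "not set"))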

Spark application terminating due to error "Signal Term"

I am running a Spark application on 350 GB of data and getting a "Signal Term" error in the YARN logs. Here is some of the Spark configuration I have:
Executor memory : 50 GB
Driver Memory : 50 GB
Memory Overhead : 6 GB
Number of cores per Executor: 5
I am not able to find the root cause or a solution. Please help.
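Expressed as session configuration, that setup corresponds roughly to (a sketch; the application name is a placeholder, and the overhead is given in MiB):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("MyJob")                                      # placeholder
         .config("spark.executor.memory", "50g")
         .config("spark.driver.memory", "50g")
         .config("spark.yarn.executor.memoryOverhead", "6144")  # 6 GB, in MiB
         .config("spark.executor.cores", "5")
         .getOrCreate())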

Can reduced parallelism lead to no shuffle spill?

Consider an example:
I have a cluster with 5 nodes and each node has 64 cores with 244 GB memory.
I decide to run 3 executors on each node, with executor-cores set to 21 and executor memory of 80GB, so that each executor can execute 21 tasks in parallel. Now consider 315 (63 * 5) partitions of data, of which 314 partitions are 3GB in size but one is 30GB (due to data skew).
Each of the executors that received only 3GB partitions has 63GB occupied (21 * 3, since each executor can run 21 tasks in parallel and each task takes 3GB of memory).
But the one executor that received the 30GB partition will need 90GB (20 * 3 + 30) of memory. So will this executor first execute the 20 tasks of 3GB and then load the 30GB task, or will it try to load all 21 tasks and find that for one task it has to spill to disk? If I set executor-cores to just 15, then the executor that receives the 30GB partition will only need 14 * 3 + 30 = 72 GB and hence won't spill to disk.
So in this case will reduced parallelism lead to no shuffle spill?
@Venkat Dabri,
Could you please format the question with appropriate carriage returns/spaces?
Here are a few pointers:
Spark (shuffle) map stage ==> the size of each partition depends on the filesystem's block size. E.g. if data is read from HDFS, each partition will try to hold close to 128MB of data, so for the input data, number of partitions = floor(number of files * block size / 128MB) (actually 122.07MB, since mebibytes are used).
Now the scenario you are describing is for shuffled data on the reducer side (the result stage).
Here the blocks processed by the reducer tasks are called shuffle blocks, and by default Spark SQL launches 200 reducer tasks (spark.sql.shuffle.partitions).
An important thing to remember: Spark can hold at most 2GB in a single shuffle block, so if you have too few partitions and one of them requires a remote fetch of a shuffle block > 2GB, you will see an error like Size exceeds Integer.MAX_VALUE.
To mitigate that, within the default limits Spark employs many optimizations (compression, Tungsten sort shuffle, etc.), but as developers we can repartition skewed data intelligently and tune the default parallelism.
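For instance (a sketch; spark, df and key_column are placeholders for the session, the dataframe, and the skewed key):
# Raise the number of reducer tasks so each shuffle block stays well under the
# 2GB limit, and repartition by the skewed key with more partitions.
spark.conf.set("spark.sql.shuffle.partitions", "1000")
df = df.repartition(1000, "key_column")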

How does Hive on Spark determine reducer number?

I enabled Hive on Spark according to Cloudera documentation 1 and 2. I now find that the reducer number behaves unexpectedly. I wish someone could provide detailed documentation or an explanation regarding that.
As far as I know, Hive on MR calculates the reducer number based on data volume and hive.exec.reducers.bytes.per.reducer, i.e. the number of bytes each reducer processes, so job parallelism can be adjusted automatically. But Hive on Spark seems to treat this parameter differently. Though setting it to a very low number (<1K) does increase the reducer number, no common rule can be applied across different jobs.
Below is segment from Cloudera tuning documentation for parallelism.
Adjust hive.exec.reducers.bytes.per.reducer to control how much data each reducer processes, and Hive determines an optimal number of partitions, based on the available executors, executor memory settings, the value you set for the property, and other factors. Experiments show that Spark is less sensitive than MapReduce to the value you specify for hive.exec.reducers.bytes.per.reducer, as long as enough tasks are generated to keep all available executors busy
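As a rough illustration of the MR-style heuristic (the numbers below are made up):
import math

total_input_bytes = 100 * 1024**3     # e.g. 100 GB feeding the reducers
bytes_per_reducer = 256 * 1024**2     # hive.exec.reducers.bytes.per.reducer = 256 MB
num_reducers = math.ceil(total_input_bytes / bytes_per_reducer)
print(num_reducers)                   # 400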
Also, I understand that an RDD in Spark spills data to disk when memory is not sufficient. If that is the case, the following error message from Hive on Spark jobs really confuses me.
Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 146, fuxi-luoge-105, executor 34): ExecutorLostFailure (executor 34 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.2 GB of 6.0 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Need For Large Executor Memory If Block size is 128 MB

I have a question regarding Spark. I am using Spark 2.2 and, as per my knowledge, each executor spins up tasks and executes them. Each task corresponds to a partition. The default number of partitions is based on the default parallelism and the file size / default block size. So consider a file size of 1 GB and a cluster of 4 executors, each of which can spin up 2 tasks (2 cores). As per that calculation, the executor memory should be about 256 MB (2 tasks, each operating on a 128 MB block) + 384 MB overhead. However, if I run the code with this as the executor memory, the performance is slow. If I give executor memory of 1.5 GB (considering some calculations on the RDD), the performance is still slow. Only when I increase the executor memory to 3GB is the performance good.
Can someone explain:
1. Why do we need so much executor memory when we work on only 128 MB of data at a time?
2. How do we calculate the optimum executor memory needed for the job?
Thanks for your help
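A back-of-the-envelope version of the calculation in the question (the 4x in-memory expansion factor is an assumption; deserialized objects are typically several times larger than their on-disk size):
cores_per_executor = 2
block_size_mb = 128
expansion_factor = 4        # assumed on-disk -> deserialized in-memory blow-up
overhead_mb = 384           # default minimum spark.yarn.executor.memoryOverhead

working_set_mb = cores_per_executor * block_size_mb * expansion_factor   # 1024
# With the default spark.memory.fraction, only ~60% of (executor memory - 300 MB)
# is available for execution and storage.
needed_executor_memory_mb = working_set_mb / 0.6 + 300                   # ~2007
print(needed_executor_memory_mb + overhead_mb)                           # ~2391 MB container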
