I have a four-node Hadoop cluster (MapR) with 40 GB of memory each. My Spark startup parameters are as follows:
MASTER="yarn-client" /opt/mapr/spark/spark-1.6.1/bin/pyspark --num-executors 8 --executor-memory 10g --executor-cores 5 --driver-memory 20g --driver-cores 10 --conf spark.driver.maxResultSize="0" --conf spark.default.parallelism="100"
Now when I run my Spark job with 100K records and call results.count() or result.saveTable(), it runs on all 8 executors. But if I run the job with 1M records, the job is split into 3 stages and the final stage runs on only ONE executor. Does it have something to do with partitioning?
I resolved this issue by converting my DataFrame into an RDD and repartitioning it to a large number of partitions (greater than 500), instead of using df.withColumn().
pseudo code:
df_rdd = df.rdd
df_rdd_partitioned = df_rdd.repartition(1000)
df_rdd_partitioned.cache().count()
result = df_rdd_partitioned.map(lambda r: (r, transform(r)), preservesPartitioning=True).toDF()
result.cache()
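For what it's worth, a simpler variant (a minimal sketch, assuming the Spark 1.6+ DataFrame API) is to repartition the DataFrame directly instead of going through df.rdd first:
# Minimal sketch (assumption: Spark 1.6+ DataFrame API): DataFrame.repartition()
# triggers the same shuffle without the manual df.rdd round-trip.
df_repartitioned = df.repartition(1000)          # spread the data over 1000 partitions
df_repartitioned.cache().count()                 # materialize the new partitioning
print(df_repartitioned.rdd.getNumPartitions())   # should report 1000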
I have a 10-node cluster: 8 DNs (256 GB, 48 cores each) and 2 NNs. I have a Spark SQL job being submitted to the YARN cluster. Below are the parameters I have used for spark-submit.
--num-executors 8 \
--executor-cores 50 \
--driver-memory 20G \
--executor-memory 60G \
As can be seen above, executor-memory is 60 GB, but when I check the Spark UI it shows 31 GB.
1) Can anyone explain why it is showing 31 GB instead of 60 GB?
2) Also, please help with setting optimal values for the parameters mentioned above.
I think the allocated memory gets divided into two parts:
1. Storage (caching DataFrames/tables)
2. Processing (the one you can see)
The 31 GB is the memory available for processing.
Play around with the spark.memory.fraction property to increase/decrease the memory available for processing.
I would suggest reducing the executor cores to about 8-10.
My configuration:
spark-shell --executor-memory 40g --executor-cores 8 --num-executors 100 --conf spark.memory.fraction=0.2
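As a rough back-of-the-envelope sketch of where the 31 GB comes from (assumptions: the unified memory model of Spark 1.6+, where the UI reports roughly the usable heap minus ~300 MB reserved, multiplied by spark.memory.fraction, and the JVM exposing somewhat less heap than -Xmx):
# Back-of-the-envelope estimate only; the exact figure depends on the JVM and Spark version.
executor_memory_gb = 60                       # --executor-memory 60G
usable_heap_gb = executor_memory_gb * 0.9     # assumption: JVM reports ~90% of -Xmx as usable
reserved_gb = 0.3                             # memory Spark reserves for internal objects
memory_fraction = 0.6                         # spark.memory.fraction (default 0.6 in Spark 2.x, 0.75 in 1.6)
ui_memory_gb = (usable_heap_gb - reserved_gb) * memory_fraction
print("Memory shown in the Spark UI: ~%.0f GB" % ui_memory_gb)   # ~32 GB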
My Spark program works fine and processes all the records properly when the input file size is small (~2 GB). When the same program runs with an 8 GB input, it does not consider all the input records and processes only 90% of them.
I have tried changing the spark-submit parameters, but it is not working. Please suggest.
The Spark UI also shows a lower number of records in the "Input Size / Records:" field.
spark-submit --deploy-mode client --master yarn --executor-memory 6G --executor-cores 5 --num-executors 25 --class com.test.spark.etc
I read streaming data from Kafka and process it using Spark Streaming. The amount of data is around 1 MB every 30 minutes.
I set the batch interval to 5 minutes. When I launch the Spark Streaming job with the smallest offset, it should process around 500 MB. For some reason it takes a very long time (around 5 hours), even though the processing operations are not complex (some filtering of data based on fields, and grouping).
I wonder if it has something to do with the parameters of the spark-submit command and the Kafka parameters in my code. For example, I was reading here about the need to set well-balanced values for fetch.min.bytes and fetch.max.wait.ms of the Kafka consumer in Scala. Should I maybe limit the batch size, so that these 500 MB are split into batches of 1 MB and processed separately? Or should I set fetch.message.max.bytes to e.g. 1000000 bytes (1 MB)? Or maybe it makes sense to add Thread.sleep(3000) right after ssc.start() and ssc.awaitTermination() in order to give the forgetting of old RDDs some time to complete.
My spark-submit command looks as follows:
spark-submit --master yarn --deploy-mode cluster \
--driver-memory 10g --executor-memory 10g \
--num-executors 2 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+AlwaysPreTouch" \
--class org.test.TestRunner \
--queue "myqueue" \
testprocess.jar \
"20" "5"
I have been working on a Spark project for the last 3-4 months, and recently
I have been doing some calculations with a huge history file (800 GB) and a small incremental file (3 GB).
The calculation itself runs very fast in Spark using hqlContext and DataFrames, but when I try to write the calculated result as a Hive table in ORC format, which will contain almost 20 billion records with a data size of almost 800 GB, it takes too much time (more than 2 hours) and finally fails.
My cluster details are: 19 nodes, 1.41 TB of total memory, and 361 total VCores.
For tuning I am using
--num-executors 67
--executor-cores 6
--executor-memory 60g
--driver-memory 50g
--driver-cores 6
--master yarn-cluster
--total-executor-cores 100
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"
at run time.
If I take a count of the result, it completes within 15 minutes, but if I try to write that result to HDFS as a Hive table
[ UPDATED_RECORDS.write.format("orc").saveAsTable("HIST_ORC_TARGET") ]
then I face the issue described above.
Please provide a suggestion or any pointers regarding this, as I have been stuck on it for the last couple of days.
Code format:
val BASE_RDD_HIST = hqlContext.sql("select * from hist_orc")
val BASE_RDD_INCR = hqlContext.sql("select * from incr_orc")
// ... some Spark calculation using DataFrames, Hive queries & UDFs ...
Finally:
result.write.format("orc").saveAsTable("HIST_ORC_TARGET_TABLE")
Hello friends, I found the answer to my own question a few days back, so I am writing it here.
Whenever we execute a Spark program without specifying the queue parameter, the job goes to the default queue, which sometimes has limitations that do not allow you to run as many executors or tasks as you want. This can cause slow processing and, later on, job failure due to memory issues, because you end up running fewer executors/tasks than intended. So don't forget to mention a queue name in your execution command:
spark-submit --class com.xx.yy.FactTable_Merging.ScalaHiveHql
--num-executors 25
--executor-cores 5
--executor-memory 20g
--driver-memory 10g
--driver-cores 5
--master yarn-cluster
--name "FactTable HIST & INCR Re Write After Null Merging Seperately"
--queue "your_queue_name"
/tmp/ScalaHiveProgram.jar
/user/poc_user/FactTable_INCR_MERGED_10_PARTITION
/user/poc_user/FactTable_HIST_MERGED_50_PARTITION
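As a side note, the same queue can also be pinned from inside the application via the spark.yarn.queue property, which is the setting behind the --queue flag (a minimal PySpark sketch; the queue name is a placeholder, as above):
# Minimal sketch: spark.yarn.queue is the property behind --queue,
# so the queue can also be set programmatically.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("FactTable HIST & INCR Re Write")
        .set("spark.yarn.queue", "your_queue_name"))   # placeholder queue name
sc = SparkContext(conf=conf)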
I have a lab environment of CDH5 with 6 nodes (node[1-6]) and node7 as the NameNode.
node[1-5]: 8 GB RAM, 2 cores
node[6]: 32 GB RAM, 8 cores
I am new to Spark and I am trying to simply count the number of lines in our data. I have uploaded the data to HDFS (5.3 GB).
When I submit my Spark job, it runs only 2 executors, and I can see it splitting the work into 161 tasks (there are 161 files in the directory).
In the code, I am reading all the files and doing the count on them.
data_raw = sc.textFile(path)
print data_raw.count()
On CLI: spark-submit --master yarn-client file_name.py --num-executors 6 --executor-cores 1
It should run 6 executors with 1 task on each, but I only see 2 executors running. I am not able to figure out the cause.
Any help would be greatly appreciated.
The correct way to submit the job is:
spark-submit --num-executors 6 --executor-cores 1 --master yarn-client file_name.py
In the original command the options were placed after file_name.py, so spark-submit treated them as arguments to the application rather than as its own options. With the options before the application file, it now shows all the other executors.
I suspect only 2 nodes are running Spark. Go to Cloudera Manager -> Clusters -> Spark -> Instances to confirm.