spark web UI notations - apache-spark

I am running a sample job on my end, and the Spark job UI says the total uptime is 26 sec, but when I add up the Duration column for the jobs it is only around 17-18 sec. Which one should I rely on in order to determine the total time to run the execution logic of my job? I am not concerned about the time it takes to start and stop the cluster. Is the 26 sec including that time? If that is the case, how do I ignore the cluster start/stop time and get the final execution time for my logic?
Spark job UI
Also my spark configuration looks like this :
val conf = new SparkConf()
  .setAppName("Metrics")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "5g")
  .set("spark.cores.max", "4")
  .set("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
I have a machine with 2 physical cores and 2 virtual cores, i.e. 4 logical cores. I am trying to use all the cores by setting 4 cores in the configuration, but for some reason only 1 executor is used to run the job. Can somebody explain why only 1 executor is spawned, and what the relation is between a core and an executor in the Spark world? I am new to Spark, so any help would be great.
Executor for the job here
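One way to time only the execution logic, independent of cluster start-up and shutdown, is to wrap the actions themselves in a wall-clock measurement. A minimal sketch, reusing the sc defined above; the input path and the job body are placeholders:

val start = System.nanoTime()

val result = sc.textFile("file:///c:/tmp/input.txt")  // placeholder input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .count()                                            // action that actually triggers execution

val elapsedSec = (System.nanoTime() - start) / 1e9
println(f"execution logic took $elapsedSec%.2f s (result: $result)")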

A single executor can use multiple threads like in your case. You have one executor with 4 cores.
Each executor thread can process a single partition at a time, so your cluster can process four partitions concurrently.
In a small setup like this there is no reason to start multiple executor JVMs, but if you want to, you can use spark.executor.cores to configure how many cores a single executor can use, as in the sketch below.
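A minimal sketch, assuming the same standalone master as in the question; with spark.cores.max at 4 and 2 cores per executor, the master can start two executors instead of one (the memory values are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Metrics")
  .setMaster("spark://master:7077")
  .set("spark.cores.max", "4")        // total cores the application may take
  .set("spark.executor.cores", "2")   // cores per executor
  .set("spark.executor.memory", "2g") // memory per executor, sized so two fit on the machine

val sc = new SparkContext(conf)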

Related

What is the difference between Fair Scheduler and Fair Scheduler Pool in Spark

In local mode, I am submitting 10 concurrent jobs with a ThreadPoolExecutor.
If I only set SparkConf sparkConf = new SparkConf().setAppName("Hello Spark - WordCount").setMaster("local[*]").set("spark.scheduler.mode","FAIR"); then the 10 jobs execute in parallel, but they do not get the same number of cores.
But if I add them to a pool and set the scheduling mode of the pool to FAIR, they get almost the same number of cores. May I know what could be the reason for this?
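For reference, a sketch in Scala of the pool-based setup described in the question; the fairscheduler.xml path and the pool name are placeholders:

import java.util.concurrent.Executors
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Hello Spark - WordCount")
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")
  // Pool definitions (weight, minShare, schedulingMode) live in this XML file.
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

val pool = Executors.newFixedThreadPool(10)
(1 to 10).foreach { _ =>
  pool.submit(new Runnable {
    override def run(): Unit = {
      // Jobs submitted from this thread are scheduled in the "fair_pool" pool.
      sc.setLocalProperty("spark.scheduler.pool", "fair_pool")
      sc.parallelize(1 to 1000000).map(_ * 2).count()
    }
  })
}
pool.shutdown()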

Spark Sql Job optimization

I have a job which consists of around 9 SQL statements that pull data from Hive and write back to the Hive DB. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches 11 stages in total.
I did some analysis using the Spark UI and found the grey areas below which can be improved:
Stage 8 in Job 5 has a shuffle output of 1.5 TB.
The time gap between Job 4 and Job 5 is 20 minutes. I read about this gap and found that Spark performs some I/O outside of the Spark jobs, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the configuration below to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching an image of the UI as well.
Now my questions are:
Where can I find driver log for this job?
In the image, I see a long list of "Executor added" entries which sum to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one stage has around 35,000 tasks, while the rest of the stages have only 200 tasks. Should I increase the number of executors, or should I go for Spark's dynamic allocation facility?
Below are some thoughts that may guide you to some extent:
Is it necessary to have one core per executor? You can have more cores in one executor; it is a trade-off between creating slim vs. fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions (see the sketch after this list).
Ensure that while reading data from Hive you are using SparkSession (basically HiveContext). This pulls the data into Spark memory from HDFS and the schema information from the Hive metastore.
Yes, dynamic allocation of resources is a feature that helps in allocating the right set of resources, and it is better than a fixed allocation.
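A sketch of these suggestions put together, with assumed values; the partition count, executor sizing, table names, and query are placeholders to adapt to your queue limits:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveSqlJob")
  .enableHiveSupport()                             // schema comes from the Hive metastore
  .config("spark.sql.shuffle.partitions", "2000")  // default is 200; raise it for a 1.5 TB shuffle
  .config("spark.executor.cores", "4")             // fatter executors instead of 1 core each
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // external shuffle service is needed for dynamic allocation
  .getOrCreate()

val df = spark.sql("SELECT * FROM source_db.source_table")    // stands in for one of the ~9 statements
df.write.mode("overwrite").saveAsTable("target_db.target_table")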

why does a single core of a Spark worker complete each task faster than the rest of the cores in the other workers?

I have three nodes in a cluster, each with a single active core. I mean, I have 3 cores in the cluster.
Assuming that all partitions have almost the same number of records, why does a single core of a worker complete each task faster than the rest of the cores in the other workers?
Please observe this screenshot. The timeline shows that the latency of the worker core (x.x.x.230) is notably shorter than the latencies of the other two worker cores (x.x.x.210 and x.x.x.220).
This means that the workers x.x.x.210 and x.x.x.220 are doing the same job in a longer time compared to the worker x.x.x.230. This also happens when all the available cores in the cluster are used, but then the delay is not as critical.
I submitted this application again. Look at this new screenshot. Now the fastest worker is x.x.x.210. Observe that tasks 0, 1 and 2 process partitions with almost the same number of records. This execution-time discrepancy is not good, is it?
I don't understand!!!
What I'm really doing is creating a DataFrame and doing a mapping operation to get a new DataFrame, saving the result in a Parquet file.
val input: DataFrame = spark.read.parquet(...)
val result: DataFrame = input.map(row => /* ...operations... */)
result.write.parquet(...)
Any idea why this happens? Is that how Spark operates normally?
Thanks in advance.
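One way to verify the assumption that the partitions really hold almost the same number of records is to count them per partition; a hypothetical sketch, with the input path as a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionCheck").getOrCreate()
val input = spark.read.parquet("/path/to/input")  // placeholder path

input.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n records") }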

Spark Streaming Job Keeps growing memory

I am running Spark v1.6.1 on a single machine in standalone mode, with 64 GB RAM and 16 cores.
I have created five worker instances to get five executors, since in standalone mode there cannot be more than one executor per worker node.
Configuration:
SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORES 1
SPARK_MASTER_OPTS "-Dspark.deploy.defaultCores=5"
All other configurations are default in spark-env.sh.
I am running a Spark Streaming direct Kafka job at an interval of 1 min, which takes data from Kafka and, after some aggregation, writes the data to Mongo.
Problems:
When I start the master and slave, it starts one master process and five worker processes, each consuming only about 212 MB of RAM. When I submit the job, it also creates 5 executor processes and 1 job process, and the total memory usage grows to 8 GB and keeps growing (slowly) over time, even when there is no data to process.
We are unpersisting the cached RDD at the end and have also set spark.cleaner.ttl to 600, but the memory is still growing.
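For reference, a hypothetical sketch of the cache, aggregate, write, unpersist pattern described above; the socket source stands in for the Kafka direct stream and the empty foreachPartition for the Mongo writer:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingAggregation")
val ssc = new StreamingContext(conf, Seconds(60))

val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd =>
  val cached = rdd.cache()
  val aggregated = cached.map(line => (line, 1L)).reduceByKey(_ + _)
  aggregated.foreachPartition { records =>
    records.foreach(_ => ())  // write the records to Mongo here
  }
  cached.unpersist()          // release the cached blocks once the batch has been written
}

ssc.start()
ssc.awaitTermination()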
One more thing: I have seen that SPARK-1706 has been merged, so why am I unable to create multiple executors within a worker? Also, in the spark-env.sh file, every configuration related to executors is documented as applying to YARN mode only.
Any help would be greatly appreciated,
Thanks

How to Dynamically Increase Active Tasks in Spark running on Yarn

I am running a Spark Streaming process where I get a batch of 6000 events. But when I look at the executors, only one active task is running. I tried dynamic allocation as well as setting the number of executors, etc. Even when I have 15 executors, only one active task is running at a time. Can anyone please guide me on what I am doing wrong here.
It looks like you have only one partition in your DStream. You should try to explicitly repartition your input stream:
val input: DStream[...] = ...
val partitionedInput = input.repartition(numPartitions = 16)
This way you will have 16 partitions in your input DStream, and each of those partitions can be processed in a separate task (and each of those tasks can be executed on a separate executor).
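A more self-contained sketch of the same idea; the socket source and the batch interval are placeholders for the real streaming input:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RepartitionExample")
val ssc = new StreamingContext(conf, Seconds(60))

val input = ssc.socketTextStream("localhost", 9999)
val partitionedInput = input.repartition(numPartitions = 16)

partitionedInput.foreachRDD { rdd =>
  // With 16 partitions there are up to 16 tasks per batch, spread across the executors.
  println(s"partitions in this batch: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()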
