Color scheme in spark web UI - apache-spark

What does the blue zone represent? I understand that the green zone represents the computing time. Going by the legend, the blue zone should represent scheduler delay. However, the numbers do not match: as mentioned, scheduler delay is supposed to be negligible compared to the executor time. So what does it mean?

The scheduler is the part of the master that constructs the DAG of stages and tasks and interacts with the cluster to distribute them as efficiently as it can. Scheduler delay is the overhead of shipping tasks to the executors and getting the results back.
This is how it is calculated in the most recent branch:
private[ui] def getSchedulerDelay(
    info: TaskInfo, metrics: TaskMetricsUIData, currentTime: Long): Long = {
  if (info.finished) {
    val totalExecutionTime = info.finishTime - info.launchTime
    val executorOverhead = (metrics.executorDeserializeTime +
      metrics.resultSerializationTime)
    math.max(
      0,
      totalExecutionTime - metrics.executorRunTime - executorOverhead -
        getGettingResultTime(info, currentTime))
  } else {
    // The task is still running and the metrics like executorRunTime are not available.
    0L
  }
}
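For intuition, here is the same formula applied to some made-up task timings (the numbers are purely illustrative, not taken from the question):
// Made-up numbers: a task whose wall-clock time was 1000 ms, with 950 ms of
// executorRunTime, 30 ms of (de)serialization overhead and no getting-result
// time, shows up in the UI with a scheduler delay of 20 ms.
val totalExecutionTime = 1000L // finishTime - launchTime, in ms
val executorRunTime    = 950L
val executorOverhead   = 30L   // executorDeserializeTime + resultSerializationTime
val gettingResultTime  = 0L
val schedulerDelay =
  math.max(0L, totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime)
// schedulerDelay == 20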

Related

How to get maximum/minimum duration of all DagRun instances in Airflow?

Is there a way to find the maximum/minimum, or even the average, duration of all DagRun instances in Airflow? That is, all DAG runs from all DAGs, not just a single DAG.
I can't find anywhere to do this in the UI, or even a page with a programmatic/command-line example.
You can use the Airflow REST API to get all dag_runs per DAG and calculate statistics.
An example that fetches all dag_runs per DAG and computes each run's duration in seconds:
import datetime

import requests
from requests.auth import HTTPBasicAuth

airflow_server = "http://localhost:8080/api/v1/"
auth = HTTPBasicAuth("airflow", "airflow")

get_dags_url = f"{airflow_server}dags"
get_dag_params = {
    "limit": 100,
    "only_active": "true",
}
response = requests.get(get_dags_url, params=get_dag_params, auth=auth)
dags = response.json()["dags"]

get_dag_run_params = {
    "limit": 100,
    "state": "success",
}
for dag in dags:
    dag_id = dag["dag_id"]
    dag_run_url = f"{airflow_server}dags/{dag_id}/dagRuns"
    response = requests.get(dag_run_url, params=get_dag_run_params, auth=auth)
    dag_runs = response.json()["dag_runs"]
    for dag_run in dag_runs:
        execution_date = datetime.datetime.fromisoformat(dag_run["execution_date"])
        end_date = datetime.datetime.fromisoformat(dag_run["end_date"])
        duration = end_date - execution_date
        duration_in_s = duration.total_seconds()
        print(duration_in_s)
The easiest way is to query your Airflow metastore. All the scheduling, DAG runs, and task instances are stored there, and Airflow can't operate without it. I do recommend filtering on DAG/execution date if your use case allows; it's not obvious what you can do with just these three overarching numbers alone.
select
    min(runtime_seconds) as min_runtime,
    max(runtime_seconds) as max_runtime,
    avg(runtime_seconds) as avg_runtime
from (
    select extract(epoch from (d.end_date - d.start_date)) as runtime_seconds
    from public.dag_run d
    where d.execution_date between '2022-01-01' and '2022-06-30'
      and d.state = 'success'
) dag_run_durations
You might also consider joining to the task_instance table to get some task-level data, and perhaps use the min start and max end times for DAG tasks within a DAG run for your timestamps.

Killing spark streaming job when no activity

I want to kill my spark streaming job when there is no activity (i.e. the receivers are not receiving messages) for a certain time. I tried doing this
var counter = 0
myDStream.foreachRDD { rdd =>
  if (rdd.count() == 0L) {
    counter = counter + 1
    if (counter == 40) {
      ssc.stop(true, true)
    }
  } else {
    counter = 0
  }
}
Is there a better way of doing this? How would I make a variable available to all receivers and update the variable by 1 whenever there is no activity?
Use a NoSQL table such as Cassandra or HBase to keep the counter; you cannot reliably track this with a local variable inside the stream-polling loop. Implement the same logic against NoSQL or MariaDB and perform a graceful shutdown of your streaming job if no activity is happening.
The way I did it: I maintained a table in MariaDB for the streaming job, with a polling interval of 5 minutes. Every 5 minutes the job hits the database and writes the count of records it consumed, and the same method returns how many consecutive zero-record intervals there have been up to the latest timestamp. This helped me a lot with managing the streaming job. The table also lets me automatically restart the streaming job based on logic written in a shell script.
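A minimal sketch of that pattern in Scala, for illustration only: the external table and the two helpers (recordBatch, emptyBatchStreak) are assumptions standing in for your MariaDB/Cassandra access code, not part of the original answer.
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object StreamInactivityMonitor {
  // Hypothetical helper: persist the batch count to your external table.
  def recordBatch(count: Long, timestampMs: Long): Unit = ???
  // Hypothetical helper: read back how many consecutive batches had zero records.
  def emptyBatchStreak(): Int = ???

  def install(ssc: StreamingContext, stream: DStream[String], maxEmptyBatches: Int): Unit = {
    stream.foreachRDD { rdd =>
      val count = rdd.count() // driver-side action, runs once per batch
      recordBatch(count, System.currentTimeMillis())
      if (emptyBatchStreak() >= maxEmptyBatches) {
        // Graceful shutdown: let in-flight batches finish, then stop the SparkContext too.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }
  }
}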

Hazelcast mapreduce executor overload

I'm setting up a new cluster and I'm getting an error from the hazelcast mapreduce executor:
java.util.concurrent.RejectedExecutionException: Executor[mapreduce::hz::default] is overloaded
Using spring, I am configuring the jobtracker as follows:
<hz:jobtracker name="default" max-thread-size="8" queue-size="0"/>
Per the documentation, 0 is the default queue size, which means unbounded.
Thoughts? I am only sending about 100 jobs simultaneously.
The manual is wrong about that.
A value less than or equal to zero means the queue size is twice the partition count:
int queueSize = jobTrackerConfig.getQueueSize();
if (queueSize <= 0) {
    queueSize = ps.getPartitionCount() * 2;
}
Code snippet on github
Set queue-size to an integer that is big enough for your use case instead of relying on 0.

Spark RDD.isEmpty costs much time

I built a Spark cluster.
Workers: 2
Cores: 12
Memory: 32.0 GB total, 20.0 GB used
Each worker gets 1 CPU with 6 cores and 10.0 GB of memory.
My program reads its data from a MongoDB cluster. The Spark and MongoDB clusters are on the same LAN (1000 Mbps).
MongoDB document format:
{name:string, value:double, time:ISODate}
There are about 13 million documents.
I want to get the average value for a specific name over a specific hour, which contains 60 documents.
Here is my key function:
/*
 * rdd = sc.newAPIHadoopRDD(configOriginal, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
 * Apache-Spark-1.3.1 Scala doc: SparkContext.newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: Configuration = hadoopConfiguration): RDD[(K, V)]
 */
def findValueByNameAndRange(rdd: RDD[(Object, BSONObject)], name: String, time: Date): RDD[BasicBSONObject] = {
  val nameRdd = rdd.map(arg => arg._2).filter(_.get("name").equals(name))
  val timeRangeRdd1 = nameRdd.map(tuple => (tuple, tuple.get("time").asInstanceOf[Date]))
  val timeRangeRdd2 = timeRangeRdd1.map(tuple => (tuple._1, duringTime(tuple._2, time, getHourAgo(time, 1))))
  val timeRangeRdd3 = timeRangeRdd2.filter(_._2).map(_._1)
  val timeRangeRdd4 = timeRangeRdd3
    .map(x => (x.get("name").toString, x.get("value").toString.toDouble))
    .reduceByKey(_ + _)
  if (timeRangeRdd4.isEmpty()) {
    basicBSONRDD(name, time)
  } else {
    timeRangeRdd4.map { tuple =>
      val bson = new BasicBSONObject()
      bson.put("name", tuple._1)
      bson.put("value", tuple._2 / 60)
      bson.put("time", time)
      bson
    }
  }
}
Here is part of the job information:
My program runs very slowly. Is that because of isEmpty and reduceByKey? If yes, how can I improve it? If not, why is it slow?
======= update =======
timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
is on line 34.
I know reduceByKey is a global operation and may cost a lot of time; however, what it costs here is beyond my budget. How can I improve it, or is this a shortcoming of Spark? With the same calculation and hardware, it takes only a few seconds if I use multiple Java threads.
First, isEmpty is merely the point at which the RDD stage ends. The maps and filters do not create a need for a shuffle, and the method shown in the UI is always the one that triggers a stage change/shuffle, in this case isEmpty. Why it is running slowly is not easy to discern from this perspective, especially without seeing the composition of the originating RDD. I can tell you that isEmpty first checks the partition size and then does a take(1) and verifies whether data was returned or not. So the odds are that there is a bottleneck in the network or something else blocking along the way. It could even be a GC issue. Click into the isEmpty stage and see what more you can discern from there.
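For reference, the check described above boils down to roughly the following (a paraphrase for illustration, not the exact Spark source):
import org.apache.spark.rdd.RDD

// Cheap only when the RDD has no partitions at all; otherwise it launches a job
// that fetches a single element (take(1)) and checks whether anything came back.
def isEmptySketch[T](rdd: RDD[T]): Boolean =
  rdd.partitions.length == 0 || rdd.take(1).isEmpty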

Partial results from spark Async Interface?

Is it possible to cancel a Spark future and still get a smaller RDD containing the already-processed elements?
Spark's async actions are "documented" here:
http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.rdd.AsyncRDDActions
And the future itself has a rich set of functions
http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.FutureAction
The use case I was thinking of is a very large map that could be aborted after 30 minutes of computation, while still being able to collect (or even iterate over, or saveAsObjectFile) the subset of the RDD that has actually been mapped.
FutureAction.cancel causes a failure (see comment in JobWaiter.scala), so you cannot use it to get partial results. I don't think there's a way to do it through the async API.
Instead, you could stop processing the input after 30 minutes.
val stopTime = System.currentTimeMillis + 30 * 60 * 1000 // 30 minutes from now.
rdd.mapPartitions { partition =>
  if (System.currentTimeMillis < stopTime) partition.map {
    // Process it like usual.
    ???
  } else {
    // Time's up. Don't process anything.
    Iterator()
  }
}
Keep in mind that this only makes a difference once all the shuffle dependencies have completed. (It cannot stop the shuffle from being performed, even when 30 minutes have passed.)
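For completeness, a usage sketch tying this back to the question's collect/saveAsObjectFile idea. The pass-through body and the output path are placeholders, not part of the original answer; `rdd` and `stopTime` are the same values used above.
// Sketch only: the per-element processing from the snippet above is elided here.
val timeBoxed = rdd.mapPartitions { partition =>
  if (System.currentTimeMillis < stopTime) partition else Iterator()
}
// Any action now returns or writes only what was reached before the cutoff.
timeBoxed.saveAsObjectFile("/tmp/partial-output") // illustrative output path
// or: val partial = timeBoxed.collect()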
