We know Spark is used for data processing. I have a doubt here: when I want to check my data in Spark, sometimes the task-level details in the Spark UI disappear. What needs to be done here, and why do they disappear?
Thanks,
Divya
I am new to Spark and am currently trying to understand its architecture.
As far as I know, the Spark cluster manager assigns tasks to worker nodes and sends them partitions of the data. Once there, each worker node performs the transformations (like mapping, etc.) on its own specific partition of the data.
What I don't understand is: where do all the results of these transformations from the various workers go? Are they sent back to the cluster manager / driver and reduced there (e.g. the sum of the values for each unique key)? If so, is there a specific way this happens?
It would be nice if someone could enlighten me; neither the Spark docs nor other resources on the architecture have been able to do so.
Good question, I think you are asking how a shuffle works...
Here is a good explanation.
When does shuffling occur in Apache Spark?
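To make that concrete, here is a minimal sketch (not taken from the linked answer) of a key-sum job, assuming an existing SparkContext sc. The reduced partitions stay on the executors; results only reach the driver if you explicitly pull them back:

// Small example data; in practice these partitions would be read on the executors.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// reduceByKey triggers a shuffle: rows with the same key are moved so that each
// key ends up on a single executor, where the partial sums are combined.
val summed = pairs.reduceByKey(_ + _)

// The summed partitions still live on the executors. They are only sent to the
// driver if you ask for them, e.g. with collect(); otherwise you can write them
// out directly from the executors.
val atDriver = summed.collect()              // e.g. Array((a,4), (b,6)), order not guaranteed
summed.saveAsTextFile("hdfs:///tmp/sums")    // hypothetical output path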
I was running Spark SQL on YARN and I hit the same issue as the one in the link below:
Spark: long delay between jobs
There's a long delay after the action, which was saving a table.
On the Spark UI, I could see that the particular saveAsTable() job had completed, but no new job was submitted.
spark ui screenshot
In the first link, the answer says the I/O operations will occur on the master node, but I doubt that.
During the gap, I checked HDFS where I saved the tables, and I could see a _temporary file rather than a _success file. It looks like the answer is true and Spark was saving the table on the driver end. Why?!
I'm using the code below to save the table:
dataframe.write.partitionBy(partitionColumn).format(format)
  .mode(SaveMode.Overwrite)
  .saveAsTable(s"$tableName")
BTW, the format is ORC.
Can anyone give me some suggestions? :)
thx in advance.
Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?
As per the above link, partitionBy is used to partition data on disk, so this process cannot be monitored in the Spark UI.
I had increased the number of partitions before calling partitionBy(), so too many files were generated, which caused the delay.
I think that's the cause.
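For what it's worth, a common way to keep the file count down in this situation (not something the linked answer covers) is to repartition by the same column before writing, so that all rows for a given partition value land in one task and each output directory gets roughly one file. A sketch, reusing the names from the question:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

dataframe
  .repartition(col(partitionColumn))   // rows with the same value end up in one task, so ~1 file per output directory
  .write
  .partitionBy(partitionColumn)        // still lays the data out by that column on disk
  .format("orc")
  .mode(SaveMode.Overwrite)
  .saveAsTable(tableName)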
I'm no expert in Spark, so my apologies if I'm way off.
We are using Apache Spark to process different sections of a large file simultaneously. We don't need any aggregation of the results. The problem we are facing is that each worker processes records one by one, and we'd like to process them in groups. We can collect them into groups, but the last group will not be processed, as we get no signal from Spark that it is processing the last record. Is there a way to get Spark to call something after processing of a partition is completed, so that we could finish processing the last group?
Or maybe a totally different way of approaching this?
We are using Java, should you decide to provide some code examples.
Thanks
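One possible approach (just a sketch, not something confirmed in this thread) is to work at the partition level with foreachPartition: once the iterator runs out you know the partition is finished, so the final, possibly smaller group can still be flushed. Sketched in Scala for brevity; the asker mentioned Java, where JavaRDD.foreachPartition offers the same hook. groupSize, processGroup, and rdd are hypothetical placeholders:

val groupSize = 100                                        // hypothetical batch size

def processGroup(group: Seq[String]): Unit = {             // hypothetical stand-in for the real per-group work
  println(s"processing ${group.size} records")
}

rdd.foreachPartition { records =>                          // rdd is assumed to be an RDD[String]
  val buffer = scala.collection.mutable.ArrayBuffer.empty[String]
  records.foreach { record =>
    buffer += record
    if (buffer.size == groupSize) {
      processGroup(buffer.toSeq)
      buffer.clear()
    }
  }
  if (buffer.nonEmpty) processGroup(buffer.toSeq)          // iterator exhausted: flush the last, partial group
}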
Suppose I am running a simple WordCount application on Spark (actually Spark Streaming) with 2 worker nodes. By default, each task (from any stage) is scheduled to any available resource based on a scheduling algorithm. However, I want to change the default scheduling to pin each stage to a specific worker node.
Here is what I am trying to achieve -
Worker Node 'A' should only process the first Stage (like 'map' stage). So all the data that comes in must first go to worker 'A'
and Worker Node 'B' should only process the second stage (like 'reduce' stage). Effectively, the results of Worker A are processed by Worker B.
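To make the stages concrete, this is the kind of job I mean (a minimal Spark Streaming sketch; the socket source, host, and port are just placeholders). Everything up to map would be the first stage I want on Worker A, and the shuffle introduced by reduceByKey would begin the second stage I want on Worker B:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WordCount")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder input source
val words = lines.flatMap(_.split(" ")).map((_, 1))   // "map" stage: narrow transformations
val counts = words.reduceByKey(_ + _)                 // the shuffle here starts the "reduce" stage

counts.print()
ssc.start()
ssc.awaitTermination()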
My first question is - Is this sort of customisation possible on Spark or Spark Streaming by tuning the parameters or choosing a correct config option? (I don't think it is, but can someone confirm this?)
My second question is - Can I achieve this by making some change to the Spark scheduler code? I am OK with hardcoding the IPs of the workers if necessary. Any hints or pointers on this specific problem, or on understanding the Spark scheduler code in more detail, would be helpful.
I understand that this change defeats the efficiency goals of Spark to some extent but I am only looking to experiment with different setups for a project.
Thanks!
I am relatively new to Spark. However, I need to find out whether there is a way to see which DataFrame is being accessed at what time. Can this be achieved with native Spark logging?
If so, how do I implement this?
The DAG Visualization and the Event Timeline are two very useful built-in Spark tools, available since Spark 1.4, that you can use to see which DataFrame/RDD is used and in which steps. See more details here - Understanding your Spark application through visualization
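In addition, if you want the Event Timeline and job list to show which DataFrame each action came from, one option (a standard SparkContext call, though not mentioned in the linked article) is to set a job description right before triggering the action. A sketch with placeholder names and paths:

// `spark` is an existing SparkSession; the DataFrames and paths are hypothetical.
val ordersDf = spark.read.parquet("/data/orders")
val customersDf = spark.read.parquet("/data/customers")

spark.sparkContext.setJobDescription("count ordersDf")
ordersDf.count()                                            // this job appears in the UI under the description above

spark.sparkContext.setJobDescription("join ordersDf with customersDf")
ordersDf.join(customersDf, "customer_id").count()           // and this one under its own label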