Multiple jobs from a single action (Read, Transform, Write) - apache-spark

Currently using PySpark on a Databricks Interactive Cluster (with Databricks Connect to submit jobs) and Snowflake as the input/output data source.
My Spark application is supposed to read data from Snowflake, apply some simple SQL transformations (mainly F.when.otherwise, narrow transformations), then load it back to Snowflake. (FYI, schemas are passed to the Snowflake reader & writer.)
EDIT: There's also a sort transformation at the end of the process, before writing.
For testing purposes, I named my jobs like this (both the Writer and the Reader steps are supposed to be named):
sc.setJobDescription("Step Snowflake Reader")
I have trouble understanding what the Spark UI is showing me:
So, I get 3 jobs, all with the same job name (Writer).
I can understand that I have only one Spark action, so I'm supposed to have one job, and Spark named the jobs with the last value set by sc.setJobDescription (the Writer step, which triggers the Spark computation).
I also tagged my "ReaderClass":
sc = spark.sparkContext
sc.setJobDescription("Step Snowflake Reader")
Why doesn't it show?
Is the first job something like "Downloading data from Snowflake", the second "Applying SQL transformations", and the third "Uploading data to Snowflake"?
Why are all my jobs related to the same SQL query?
What is Query 0, which is related to... zero jobs?
Thanks for your help.

There are a few things in this.
First, a job is triggered by an action; transformations are not really jobs of their own (they're computed during an action, and a single action can cover multiple transformations).
In your case, reading, transformation and sorting all take place when the action is triggered.
Please note that reading from Snowflake doesn't trigger a job (this is an assumption, as the same behaviour is exhibited by Hive) because Snowflake already has the metadata that Spark would otherwise need to gather by traversing the files.
If you read a Parquet file directly, it will trigger a separate job, and you'll be able to see the job description.
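As a quick check (a minimal sketch, assuming your existing spark session and a hypothetical Parquet path), you can set a description before a direct Parquet read and look for it in the UI:
sc = spark.sparkContext
sc.setJobDescription("Step Parquet Reader")
# Reading Parquet directly typically launches a small job of its own
# (file listing / schema inference), so this description shows up in the UI.
df = spark.read.parquet("/tmp/example_data")  # hypothetical path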
Now comes the part where you name your job:
sc.setJobDescription("Step Snowflake Reader")
This will name the jobs that were triggered by your write action. This action in turn launches multiple jobs (which are still part of the last action you're performing; see this post for more details).
Similarly, the last configuration you set before calling an action is the one that is picked up (the same thing happens when setting spark.sql.shuffle.partitions, for instance: you may want a particular step to use more or fewer shuffle partitions, but for one complete action it will be set to a single value).
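To make that concrete, here is a minimal sketch of the scenario in the question, assuming the Snowflake Spark connector's "snowflake" source name, placeholder connection options and a hypothetical amount column; it is an illustration, not your exact code:
from pyspark.sql import functions as F

sc = spark.sparkContext
# Placeholder Snowflake connector options, as in the question.
sf_options = {"sfUrl": "XXX", "sfUser": "XXX", "sfPassword": "XXX",
              "sfDatabase": "XXX", "sfSchema": "XXX", "sfWarehouse": "XXX"}

sc.setJobDescription("Step Snowflake Reader")
df = (spark.read.format("snowflake")
      .options(**sf_options).option("dbtable", "XXX").load())                 # lazy, no job yet

sc.setJobDescription("Step Transform")
spark.conf.set("spark.sql.shuffle.partitions", "64")                          # last value wins
df2 = df.withColumn("flag", F.when(F.col("amount") > 0, 1).otherwise(0))      # lazy

sc.setJobDescription("Step Snowflake Writer")
# The write is the only action: every job it spawns shows up as
# "Step Snowflake Writer" and runs with spark.sql.shuffle.partitions = 64.
(df2.write.format("snowflake")
     .options(**sf_options).option("dbtable", "XXX")
     .mode("overwrite").save())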
Hope this answers your question.

Related

Spark dataframe : When does it materialize?

I have a spark question:
I have a job that errors out with: 403 Access Denied on S3
The spark job basically:
Gets data from LF resource linked tables from Glue Catalog
Creates temp views
Runs a bunch of transformations
Stores the data in an external location
I get sporadic errors in step 3, where we are doing a bunch of transformations. I say sporadic because sometimes I get no errors, and other times it pops up on any one of the functions in step 3.
Wouldn't running a Spark SQL select statement (and storing it as a temp view) on a Glue dynamic frame materialize the data within the Spark session in memory?
e.g.:
df = glueContext.create_dynamic_frame_from_catalog(args)
df = df.toDF()
df.createOrReplaceTempView("tbl1")
dfnew = spark.sql("select * from tbl1")
dfnew.createOrReplaceTempView("tbl2")
..step 3 transformations on tbl2 (this is where the error happens)
Is my understanding correct in that tbl1 has materialized into the spark session in-memory, but tbl2 is still lazily stored?
If so, then if I run spark sql statement on tbl2 it will materialize by querying from tbl1, not the glue catalog source tables, correct?
How can I ensure in the above script the LF tables are not accessed after getting them in a dynamic frame because the upstream data is continuously updated?
The understanding that you have of spark SQL views is not correct.
Spark SQL views are lazily evaluated and don't really materialize until you call an action. In fact, NONE of the lazily evaluated parts (also called transformations in Spark technical terms) are materialized until and unless you call an action.
All Spark does is build a DAG in the background with all the transformations you have applied so far, and it materializes all of that when you call an action.
df.createOrReplaceTempView("tbl1")       # lazily evaluated
dfnew = spark.sql("select * from tbl1")  # lazily evaluated
dfnew.createOrReplaceTempView("tbl2")    # lazily evaluated
dfnew.show()                             # action call --> materializes all the transformations done so far
The error you are getting is most likely because of permissions while reading from or writing to a particular S3 location.
I hope this answers the first half of your question. It can be explained better if you share what is happening in the transformations, whether you are using any actions during those transformations, or, best of all, the stack trace of the error, to get a more definitive answer.
Also, if you are using Spark 3.0 or higher, you can materialize your transformations by using the noop write format.
df.write.mode("overwrite").format("noop").save()
You can simply specify it as the write format; it will materialize the query and execute all the transformations, but it will not write the result anywhere.
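For instance, applied to the hypothetical tbl1/tbl2 pipeline from the question, a noop write placed after the step-3 transformations would force the whole chain to execute at that point (a sketch, reusing the names above):
# Continuing the example above (tbl1 / tbl2 temp views already registered):
dftransformed = spark.sql("select * from tbl2")   # step-3 transformations, still lazy

# Spark 3.0+: forces the whole plan (Glue Catalog read + views + transformations)
# to execute without writing the result anywhere; any S3 permission error in the
# chain would surface here rather than at some later action.
dftransformed.write.mode("overwrite").format("noop").save()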

How lazy Data structure works

I have some doubts about actions and transformations in Spark.
I have been using the Spark API for the last couple of months. I learned that the Spark API has the property that it won't load any data into memory until an action is taken to store the final transformed data somewhere. Is that a correct understanding?
A more refined definition:
Spark will create a DAG (Directed Acyclic Graph) using the applied operations, the source RDD and the functions used for transformation, and it will keep building this graph using the references until you apply an action operation on the last lined-up RDD. That is why transformations in Spark are lazy.
The moment an action is triggered (for example, writing to a file), data starts loading into memory from the source, is then transformed, and is finally written to the file. Is that the correct meaning of an action? Or is an action the point at which the driver program submits the transformation-and-action graph to the master, and the master then sends the respective data and code to different worker nodes to execute? Which understanding is correct?
I have read online posts, but it's still not clear.
You're right, Spark won't do anything until a certain action is taken (e.g. a write).
Every transformation returns a new RDD containing its DAG, and when you submit an action, Spark will execute the DAG (if you use the Dataset API it will apply optimizations too).
An action is a method that submits the DAG, as said (writing to a file, foreach, and other actions).
The driver is responsible for parallelizing the work, keeping the executors alive and sending them tasks.
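A minimal sketch of that idea (with hypothetical local file paths): the transformations only build up the DAG, and the final action is what actually runs it:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("/tmp/input.txt")              # transformation: nothing is read yet
words = lines.flatMap(lambda line: line.split())   # transformation: DAG keeps growing
pairs = words.map(lambda w: (w, 1))                # transformation: DAG keeps growing
counts = pairs.reduceByKey(lambda a, b: a + b)     # transformation: DAG keeps growing

# Only here is the DAG submitted: data is read, transformed and written out.
counts.saveAsTextFile("/tmp/output")               # action: triggers the job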

How does SparkSQL create Jobs/Stages

I am looking into Spark's implementation. I know that in Spark Core, when an action is called on an RDD, sc.runJob will finally be called, which submits the job to the DAGScheduler.
I want to know how the DAGScheduler schedules Spark SQL jobs, so I traced a call to an action on a Dataset. I found that it calls SparkPlan.executeCollect(), which will finally call doExecute(), depending on which specific SparkPlan it is.
However, when I look into objects.scala, I find no sc.runJob call, but lots of child.execute() calls. I don't know where it ends.
My question is: where does a Dataset action create a job?

Spark execution - Relationship between spark execution job and spark action

I have one question regarding Spark execution.
We all know that each Spark application (or driver program) may contain one or many actions.
My question is which one is correct: does a collection of jobs correspond to one action, or does each job correspond to one action? Here, "job" means the one that can be seen in the Spark execution UI.
I think the latter is true (each job corresponds to one action). Please validate.
Thanks.
Your understanding is correct.
Each action in Spark corresponds to a Spark job, and these actions are called by the driver program in the application.
An action can therefore involve many transformations on the dataset (or RDD), which create stages in the job.
A stage can be thought of as the set of calculations(tasks) that can each be computed on an executor without communication with other executors or with the driver.
In other words, a new stage begins whenever network travel between workers is required; for example in a shuffle. These dependencies that create stage boundaries are called ShuffleDependencies.
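A small sketch (with made-up data) that maps onto this: one action produces one job in the UI, and the shuffle introduced by reduceByKey splits that job into two stages:
rdd = sc.parallelize(range(1_000_000), 8)          # hypothetical data, 8 partitions

pairs = rdd.map(lambda x: (x % 10, x))             # narrow transformation: stays in the same stage
sums = pairs.reduceByKey(lambda a, b: a + b)       # shuffle -> ShuffleDependency -> stage boundary

# One action -> one job in the Spark UI, split into two stages:
#   stage 0: parallelize + map (and map-side combine), before the shuffle
#   stage 1: the reduce side of reduceByKey, then collect
result = sums.collect()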

Spark Web UI "take at SerDeUtil.scala:201" interpretation

I am creating a Spark RDD by loading data from Elasticsearch using the elasticsearch-hadoop connector in Python (importing pyspark), as follows:
es_cluster_read_conf = {
    "es.nodes": "XXX",
    "es.port": "XXX",
    "es.resource": "XXX"
}

es_cluster_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_cluster_read_conf)
Now, if I only have these 2 commands in my file and run it, I see one job on the Spark Web UI Application Details page: take at SerDeUtil.scala:201
I have 2 questions now:
1) I was under the impression that Spark RDDs are computed lazily, i.e. if no action is applied, no job would be launched. In the above scenario, I am not applying any action, yet I see a job being run on the web UI.
2) If this is a job, what does this "take" operation actually mean? Does it mean that the data is actually loaded from my Elasticsearch node and passed to the Spark node? I understand jobs being listed as collect, count, etc., because these are valid actions in Spark. However, even after doing extensive research, I still couldn't figure out the semantics of this take operation.
I was under the impression that Spark RDDs are computed lazily, i.e. if no action is applied, no job would be launched.
This is more or less true, although there are a few exceptions out there where a job can be triggered by a secondary task, like creating a partitioner or converting data between the JVM and guest languages. It is even more complicated when you work with the high-level Dataset API and DataFrames.
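For example (a small sketch in plain PySpark), the partitioner case can be observed with sortByKey, which typically runs a sampling job up front even though no explicit action has been called yet:
pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")] * 1000, 4)

# No explicit action here, yet a small job typically shows up in the UI:
# sortByKey samples the keys up front to compute the range-partition boundaries.
sorted_pairs = pairs.sortByKey()

# The "main" job only runs once an action is actually called.
sorted_pairs.collect()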
If this is a job, what does this "take" operation actually mean? Does this mean that the data is actually loaded from my ElasticSearch node and passed to Spark node?
It is a job, and some amount of data is actually fetched from the source. It is required to determine the serializer for the key-value pairs.
