Spark dataframe: When does it materialize?

I have a Spark question:
I have a job that errors out with a 403 Access Denied error on S3.
The Spark job basically:
1. Gets data from LF resource-linked tables in the Glue Catalog
2. Creates temp views
3. Runs a bunch of transformations
4. Stores the data in an external location
I get sporadic errors in step 3, where we are doing a bunch of transformations. I say sporadic because sometimes I get no errors at all, and other times the error pops up on any one of the functions in step 3.
Wouldn't running a Spark SQL select statement (and storing it as a temp view) on a Glue dynamic frame materialize the data in memory within the Spark session?
e.g.:
df = glueContext.create_dynamic_frame_from_catalog(args)
df = df.toDF()
df.createOrReplaceTempView("tbl1")
dfnew = spark.sql("select * from tbl1")
dfnew.createOrReplaceTempView("tbl2")
# ...step 3: transformations on tbl2 (this is where the error happens)
Is my understanding correct that tbl1 has been materialized in memory in the Spark session, but tbl2 is still stored lazily?
If so, then if I run a Spark SQL statement on tbl2, it will materialize by querying tbl1, not the Glue Catalog source tables, correct?
How can I ensure in the above script that the LF tables are not accessed again after loading them into a dynamic frame, given that the upstream data is continuously updated?

Your understanding of Spark SQL views is not correct.
Spark SQL views are lazily evaluated and don't really materialize until you call an action. In fact, NONE of the lazily evaluated parts (also called transformations in Spark terminology) are materialized until and unless you call an action.
All Spark does is build a DAG in the background with all the transformations you have applied so far, and it materializes them when you call an action.
df.createOrReplaceTempView("tbl1")       # lazily evaluated
dfnew = spark.sql("select * from tbl1")  # lazily evaluated
dfnew.createOrReplaceTempView("tbl2")    # lazily evaluated
dfnew.show()                             # action call --> materializes all the transformations done so far
The error you are getting is most likely due to permissions while reading from or writing to a particular S3 location.
I hope this answers the first half of your question. It could be explained better if you share what is happening in the transformations, whether you call any actions during them, or, best of all, the stack trace of the error, so as to give a more definitive answer.
Also, if you are using Spark 3.0 or higher, you can materialize your transformations by using the noop write format.
df.write.mode("overwrite").format("noop").save()
You simply specify it as the write format; it will materialize the query and execute all the transformations, but it will not write the result anywhere.
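If the goal is to avoid going back to the Lake Formation source tables once the data has been read, a minimal sketch (assuming the same glueContext and args as in the question) is to persist the DataFrame and force an action before step 3:

from pyspark import StorageLevel

df = glueContext.create_dynamic_frame_from_catalog(args).toDF()
df = df.persist(StorageLevel.MEMORY_AND_DISK)  # keep the rows on the executors once read
df.count()                                     # action: reads from the LF tables exactly once, here
df.createOrReplaceTempView("tbl1")             # later SQL on tbl1 reuses the persisted partitions

Note that evicted partitions can still be recomputed from the source, so writing an intermediate copy to a staging location (or checkpointing) is the only way to fully decouple from the upstream tables.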

Related

Multiple jobs from a single action (Read, Transform, Write)

I am currently using PySpark on a Databricks interactive cluster (with databricks-connect to submit jobs) and Snowflake as the input/output data source.
My Spark application is supposed to read data from Snowflake, apply some simple SQL transformations (mainly F.when.otherwise, narrow transformations), then load the result back to Snowflake. (FYI, schemas are passed to the Snowflake reader and writer.)
EDIT: There's also a sort transformation at the end of the process, before writing.
For testing purposes, I named my jobs like this (the Writer and Reader steps are each supposed to be named):
sc.setJobDescription("Step Snowflake Reader")
I have trouble understanding what the Spark UI is showing me:
So I get 3 jobs, all with the same job name (Writer).
I can understand that I have only one Spark action, so I would expect one job, and that Spark names the jobs with the last value set by sc.setJobDescription (Reader, which triggers the Spark computation).
I did also tag my "ReaderClass":
sc = spark.sparkContext
sc.setJobDescription("Step Snowflake Reader")
Why doesn't it show?
Is the first job something like "Download data from Snowflake", the second "Apply SQL transformations", and the third "Upload data to Snowflake"?
Why are all my jobs related to the same SQL query?
What is Query 0, which is related to ... zero jobs?
Thanks for the help.
There are a few things going on here.
First, a job is triggered by an action, and transformations are not really separate from it (they are computed during the action, and a single action can execute multiple transformations).
In your case, reading, transforming and sorting all take place when the action is triggered.
Please note that reading from Snowflake doesn't trigger a job (this is an assumption, since the same behaviour is exhibited by Hive), because Snowflake already has the metadata that Spark would otherwise gather by traversing the files.
If you read a parquet file directly, it will trigger a separate job, and you will be able to see that job's description.
Now comes the part where you name your job:
sc.setJobDescription("Step Snowflake Reader")
This will name the jobs that were triggered by your write action. That action in turn spawns multiple jobs (which are still part of the last action you're performing); see this post for more details.
Similarly, the last configuration you set before calling an action is the one that is picked up (the same thing happens when setting shuffle partitions, for instance: you may want a particular step to use more or fewer shuffle partitions, but for one complete action it will be set to a single value).
Hope this answers your question.
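A minimal sketch (the Snowflake options dict sf_options and the table names are hypothetical) of why the description set just before the action is the one that shows up:

from pyspark.sql import functions as F

sc = spark.sparkContext

sc.setJobDescription("Step Snowflake Reader")
df = spark.read.format("snowflake").options(**sf_options).option("dbtable", "SRC_TABLE").load()  # lazy: no job yet

sc.setJobDescription("Step Transform")
df = df.withColumn("flag", F.when(F.col("amount") > 0, 1).otherwise(0))  # lazy: still no job

sc.setJobDescription("Step Snowflake Writer")
df.write.format("snowflake").options(**sf_options).option("dbtable", "DST_TABLE").mode("overwrite").save()
# the write is the only action, so every job it spawns carries the "Writer" description in the UI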

Is there any performance hit when calling createOrReplaceTempView on a Spark Dataset?

In my code we use createOrReplaceTempView a lot so that we can invoke SQL on the generated view. This is done at multiple stages of the transformation. It also helps us keep the code in modules, each performing a particular operation. Sample code to put my question in context is shown below. So my questions are:
1. What is the performance penalty, if any, of creating the temp view from the Dataset?
2. When I create more than one per transformation, does this increase memory usage?
3. What is the life cycle of those views, and is there any function call to remove them?
val dfOne = spark.read.option("header", true).csv("/apps/cortex/landing/auth/cof_auth.csv")
dfOne.createOrReplaceTempView("dfOne")
val dfTwo = spark.sql("select * from dfOne where column_one=1234567890")
dfTwo.createOrReplaceTempView("dfTwo")
val dfThree = spark.sql("select column_two, count(*) as count_two from dfTwo group by column_two")
dfThree.createOrReplaceTempView("dfThree")
No.
From the manual's section on Running SQL Queries Programmatically:
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
In order to do this you register the DataFrame as a SQL temporary view. This is a "lazy" artefact, and there must already be a DataFrame / Dataset present. It just needs registering to enable the SQL interface.
createOrReplaceTempView takes some time to process.
UDFs also take time, since they have to be registered with the Spark application, and they can cause performance issues.
From my experience, it is best to look for built-in functions first; after that I would put the other two at roughly the same speed, UDF == SQL temp table/view, although the table/view approach does not cover all possibilities.
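On the life-cycle question: a temp view lives for the duration of the SparkSession (or until it is replaced or dropped) and holds only the query plan, not data. A small sketch in PySpark (hypothetical path; the same calls exist in the Scala API):

df = spark.read.option("header", True).csv("/tmp/example.csv")  # hypothetical file
df.createOrReplaceTempView("example")             # registers metadata only; nothing is computed or copied
spark.sql("select count(*) from example").show()  # the CSV is actually read here, at the action
spark.catalog.dropTempView("example")             # removes the view; the underlying DataFrame is unaffected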

How lazy data structures work

I have some doubts about actions and transformations in Spark.
I've been using the Spark API for the last couple of months. I've learned that the Spark API has the property that it won't load any data into memory until an action is taken to store the final transformed data somewhere. Is that a correct understanding?
A more refined definition:
Spark will create a DAG (directed acyclic graph) from the applied operations, the source RDD and the functions used for the transformations, and it will keep building this graph using the references until you apply an action operation on the last lined-up RDD. That is why transformations in Spark are lazy.
The moment an action is triggered (for example, writing to a file), data will start loading into memory from the source, then get transformed, and finally be written to the file. Is that the correct meaning of an action? Or is an action the point at which the driver program submits the transformation-and-action graph to the master, and the master then sends the respective data and code to different worker nodes to execute? Which understanding is correct?
I have read online posts, but it is still not clear.
You're right: Spark won't do anything until a certain action is taken (e.g. a write).
Every transformation returns a new RDD that carries its DAG, and when you submit an action, Spark executes that DAG (and if you use the Dataset API, it will apply optimizations too).
An action is a method that submits the DAG, as mentioned (writing to a file, foreach, and other actions).
The driver is responsible for parallelizing the work, keeping the executors alive, and sending them tasks.
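A minimal sketch of that laziness (illustrative only):

rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)     # transformation: only extends the DAG
big = doubled.filter(lambda x: x > 5)  # transformation: still nothing has executed
print(big.count())                     # action: the DAG is shipped to the executors and run here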

Spark UI - Spark SQL Query Execution

I am using the Spark SQL API. When I look at the SQL section of the Spark UI, which details the query execution plan, it shows the parquet scan stage multiple times even though I am reading the parquet only once.
Is there any logical explanation?
I would also like to understand the different operations like HashAggregate, SortMergeJoin, etc., and understand the Spark UI better as a whole.
If you are doing unions or joins, they may force your plan to be "duplicated" from the very beginning.
Since Spark doesn't keep intermediate state automatically (unless you cache), it will have to read the sources multiple times.
Something like:
df = spark.read.parquet("ParquetFile1")
dfFiltered = df.filter('active=1')
dfFiltered.union(df)
The plan will probably look like: readParquetFile1 --> union <-- filter <-- readParquetFile1
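A sketch (hypothetical path) of how caching changes this: once the cache is populated, both branches of the union reuse the cached partitions instead of rescanning the file.

df = spark.read.parquet("/data/ParquetFile1").cache()  # hypothetical path
df.count()                                             # action: scans the parquet once and fills the cache
dfFiltered = df.filter('active=1')
dfFiltered.union(df).count()                           # both branches now read the cached partitions, not the file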

What role does Spark SQL play? Is it an in-memory DB?

I have recently come to Spark SQL.
I read the Data Source API docs and am still confused about what role Spark SQL plays.
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory? That would mean Spark SQL is only an in-memory DB that works on data already loaded. Or does it scan the source every time?
I would really appreciate any answers.
Best regards.
I read the Data Source API docs and am still confused about what role Spark SQL plays.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over data that you store in Spark-specific row-based structures called DataFrames.
To run a SQL query via Spark, the first requirement is that the table you are querying is either present in the Hive metastore (i.e. the table exists in Hive) or is a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a DataFrame df and you want to run SQL queries over it, you can either use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory?
The data will be stored in memory or on disk depending upon the persistence strategy you use. If you choose to cache the table, the data will be stored in memory and operations will be considerably faster compared to the case where data is fetched from disk. That part is configurable and up to the user: you can basically tell Spark how you want it to store the data.
Spark SQL will only cache the rows that are pulled by the action, meaning it caches as many partitions as it has to read during that action. This makes your second call much faster than your first.
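A small sketch (using the temp_table view registered above) of opting into in-memory storage:

spark.catalog.cacheTable("temp_table")                  # marks the view/table for caching (still lazy)
spark.sql("SELECT count(*) FROM temp_table").show()     # first action reads the source and fills the cache
spark.sql("SELECT * FROM temp_table LIMIT 10").show()   # subsequent queries are served from the cached data
spark.catalog.uncacheTable("temp_table")                # release the memory when done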
