How lazy data structures work - apache-spark

I have some doubts about actions and transformations in Spark.
I have been using the Spark API for the last couple of months. As I learned it, Spark will not load any data into memory until an action is taken to store the final transformed data somewhere. Is this understanding correct?
A more refined definition:
Spark creates a DAG (Directed Acyclic Graph) from the applied operations, the source RDD, and the functions used for the transformations. It keeps building this graph by reference until you apply an action to the last RDD in the chain. That is why transformations in Spark are lazy.
The moment an action is triggered (for example, writing to a file), data starts loading into memory from the source, is transformed, and is finally written to the file. Is this the correct meaning of an action? Or is an action the point at which the driver program submits the transformation and action graph to the master, and the master then sends the respective data and code to the worker nodes for execution? Which understanding is correct?
I have read posts online, but it is still not clear to me.

You are right: Spark won't do anything until an action is taken (e.g. a write).
Every transformation returns a new RDD that carries its DAG, and when you submit an action, Spark executes the DAG (if you use a Dataset, it will apply optimizations too).
An action is a method that submits the DAG, as you said (writing to a file, foreach, and other actions).
The driver is responsible for parallelizing the work, keeping track of which executors are alive, and sending them tasks.
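The behaviour described above can be sketched in plain Python. This is only a toy model, not Spark code: the class LazyDataset and the function load are hypothetical names, and the "lineage" here is a simple list of steps rather than a real DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # zero-argument callable that yields the input records
        self._lineage = lineage    # tuple of (kind, function) steps recorded so far

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: only now is the source read and the recorded steps applied.
        records = self._source()
        for kind, fn in self._lineage:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)


reads = []  # one entry is appended every time the source is actually read


def load():
    reads.append("read")
    return iter([1, 2, 3, 4, 5])


ds = LazyDataset(load).map(lambda x: x * 10).filter(lambda x: x > 20)
reads_before_action = len(reads)   # the source has not been touched yet
result = ds.collect()              # the action triggers the read and the transformations
reads_after_action = len(reads)
```

Building ds only records the two steps; the source is read once, when collect() is called.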

Related

Spark dataframe : When does it materialize?

I have a Spark question:
I have a job that errors out with: 403 Access Denied on S3
The Spark job basically:
1. Gets data from LF resource-linked tables from the Glue Catalog
2. Creates temp views
3. Runs a bunch of transformations
4. Stores the data in an external location
I get sporadic errors in step 3, where we do a bunch of transformations. I say sporadic because sometimes I get no errors, and other times the error pops up in any one of the functions in step 3.
Wouldn't running a Spark SQL select statement (and storing it as a temp view) on a Glue dynamic frame materialize the data in memory within the Spark session?
e.g.:
df = glueContext.create_dynamic_frame_from_catalog(args)
df = df.toDF()
df.createOrReplaceTempView("tbl1")
dfnew = spark.sql("select * from tbl1")
dfnew.createOrReplaceTempView("tbl2")
# ..step 3 transformations on tbl2 (this is where the error happens)
Is my understanding correct that tbl1 has been materialized in memory in the Spark session, but tbl2 is still lazily evaluated?
If so, then if I run a Spark SQL statement on tbl2, it will materialize by querying tbl1, not the Glue Catalog source tables, correct?
How can I ensure in the above script that the LF tables are not accessed again after loading them into a dynamic frame, given that the upstream data is continuously updated?
Your understanding of Spark SQL views is not correct.
Spark SQL views are lazily evaluated and do not materialize until you call an action. In fact, NONE of the lazily evaluated parts (called transformations in Spark terminology) are materialized unless you call an action.
All Spark does is create a DAG in the backend with all the transformations you have applied so far, and it materializes all of that when you call an action.
df.createOrReplaceTempView("tbl1")  # lazily evaluated
dfnew = spark.sql("select * from tbl1")  # lazily evaluated
dfnew.createOrReplaceTempView("tbl2")  # lazily evaluated
dfnew.show()  # action call --> materializes all the transformations done so far
The error you are getting is most likely caused by permissions when reading from or writing to a particular S3 location.
I hope this answers the first half of your question. It could be explained better if you shared what happens in the transformations and whether you call any action during them. The best option is to share the stack trace of the error to get a more definitive answer.
Also, if you are using Spark 3.0 or higher, you can materialize your transformations by using the noop write format.
df.write.mode("overwrite").format("noop").save()
You can simply specify it as the write format and it will materialize the query and execute all the transformations but it will not write the result anywhere.
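Because views are lazy, each action re-runs the lineage from the source, so a continuously updated upstream table can look different from one action to the next. The following plain-Python sketch (not Spark code; source_table and view are hypothetical names) illustrates the difference between a lazy view and a materialized copy.

```python
source_table = [1, 2, 3]          # stands in for the continuously updated upstream table


def view():
    """A lazy 'view': the rows are only produced when the result is iterated."""
    return (row * 2 for row in source_table)


first_action = list(view())       # first action reads the source as it is now
snapshot = list(view())           # materialized copy, detached from the source

source_table.append(4)            # upstream data changes between actions

second_action = list(view())      # re-runs the lineage and sees the new row
```

The snapshot keeps the old rows while the second action sees the update. In Spark terms, one way to get such a detached copy is to write the DataFrame to a location you control and read it back from there; whether that fits this particular job is an assumption.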

Multiple jobs from a single action (Read, Transform, Write)

I am currently using PySpark on a Databricks interactive cluster (with Databricks-connect to submit jobs) and Snowflake for input/output data.
My Spark application is supposed to read data from Snowflake, apply some simple SQL transformations (mainly F.when.otherwise, a narrow transformation), then load the data back to Snowflake. (FYI, schemas are passed to the Snowflake reader and writer.)
EDIT: There is also a sort transformation at the end of the process, before writing.
For testing purposes, I named my jobs like this (both the Writer and the Reader are supposed to be named):
sc.setJobDescription("Step Snowflake Reader")
I have trouble understanding what the Spark UI is showing me :
So, I get 3 jobs, all with the same job name (Writer).
I understand that I have only one Spark action, so I am supposed to have one job, and that Spark named the jobs with the last value set by sc.setJobDescription (the Writer, which triggers the Spark computation).
I also tagged my "ReaderClass":
sc = spark.sparkContext
sc.setJobDescription("Step Snowflake Reader")
Why doesn't it show?
Is the first job something like "Downloading data from Snowflake", the second "Apply SQL transformation", and the third "Upload data to Snowflake"?
Why are all my jobs related to the same SQL query?
What is Query 0, which is related to ... zero jobs?
Thanks for help.
There are a few things here.
First, a job is triggered by an action, and transformations are not really jobs of their own (they are computed during an action, and a single action can execute multiple transformations).
In your case, the reading, the transformations, and the sorting all take place when the action is triggered.
Please note that reading from Snowflake doesn't trigger a job (this is an assumption, as Hive exhibits the same behaviour), because Snowflake already holds the metadata that Spark would otherwise have to obtain by traversing the files.
If you read a parquet file directly, it will trigger a separate job, and you will be able to see the job description.
Now for the part about naming your job:
sc.setJobDescription("Step Snowflake Reader")
This will name the job that is triggered by your write action. That action in turn launches multiple jobs, but they are all still part of the last action you perform (see the linked post for more details).
Similarly, the last configuration you set before calling an action is the one that is picked up. (The same thing happens when setting the number of shuffle partitions, for instance: you may want a particular step to use more or fewer shuffle partitions, but for one complete action it will be set to a single value.)
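The labelling behaviour can be pictured with a small plain-Python model. This is a toy sketch, not Spark's implementation: ToyContext, set_job_description, and run_action are hypothetical names, and the "Step Snowflake Writer" label is assumed by analogy with the reader label shown in the question.

```python
class ToyContext:
    """Toy model: jobs are labelled with whatever description is current when they run."""

    def __init__(self):
        self.description = None
        self.jobs = []          # descriptions of the jobs that actually ran

    def set_job_description(self, text):
        self.description = text

    def run_action(self, n_jobs):
        # Every job spawned by this action gets the description set most recently.
        self.jobs.extend([self.description] * n_jobs)


ctx = ToyContext()
ctx.set_job_description("Step Snowflake Reader")
# The lazy read and the transformations would be declared here; no job runs,
# so the reader label is never attached to anything.
ctx.set_job_description("Step Snowflake Writer")
ctx.run_action(n_jobs=3)        # the single write action spawns three jobs
jobs = ctx.jobs
```

All three jobs end up carrying the writer label, which matches what the question describes seeing in the Spark UI.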
Hope this answers your question.

Spark - do transformations also involve driver operations

My course notes have the following sentence: "RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset." But I think this is misleading because the transformation reduceByKey is performed locally on the workers and then on the driver as well (although the change does not take place until there's an action to be performed). Could you please correct me if I am wrong.
Here are the concepts:
In Spark, a transformation is an operation in which one RDD produces one or more new RDDs. A new RDD is created every time: RDDs are immutable, so any transformation on an RDD generates a new RDD, which is added to the DAG.
An action in Spark is a function that does not generate a new RDD. It produces other data types such as String, int, etc., and the result is returned to the driver or written to a storage system.
Transformations are lazy by nature, and nothing happens until an action is triggered.
reduceByKey - It is a transformation, as it generates an RDD from the input RDD, and it is a WIDE TRANSFORMATION. With reduceByKey, nothing happens until an action is triggered. Please see the image below
reduce - It is an action, as it generates a non-RDD type. Please see the image below
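The difference in return type can be illustrated with plain Python over a list of "partitions". This is a simplified, eager sketch rather than Spark code: combine_locally, reduce_by_key, and reduce_values are hypothetical helper names, and the merge step stands in for the shuffle, which in Spark runs on the executors rather than on the driver.

```python
from functools import reduce
from operator import add

# Two "partitions" of (key, value) records, as they might sit on two workers.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5)],
]


def combine_locally(partition, fn):
    """Pre-aggregate one partition by key (the worker-local part of reduceByKey)."""
    out = {}
    for key, value in partition:
        out[key] = fn(out[key], value) if key in out else value
    return out


def reduce_by_key(parts, fn):
    """Transformation-like: the result is again a dataset of (key, value) records."""
    merged = {}
    for local in (combine_locally(p, fn) for p in parts):
        for key, value in local.items():
            merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())


def reduce_values(parts, fn):
    """Action-like: the result is a single plain value, not a dataset."""
    return reduce(fn, (value for p in parts for _, value in p))


by_key = reduce_by_key(partitions, add)   # still a collection of records
total = reduce_values(partitions, add)    # a single number
```

by_key is a list of per-key records that further transformations could consume, whereas total is a single number of the kind an action hands back to the driver.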
As a matter of fact, the driver's first responsibility is managing the job. Moreover, the RDD's data is not located on the driver for it to act on, so all intermediate results stay on the workers until an action's turn comes. What I mean concerns Spark's lazy execution: at the start of execution, the plan is examined up to the first action, and if no action is found, the whole program produces nothing. Otherwise, the whole program is executed on the input data, which is represented as RDD objects on the worker nodes, until it reaches the action. During this period all the data stays on the workers, and only the result, depending on the type of the action, is sent to, or at least managed by, the driver.

Spark Transformation - Why is it lazy and what is the advantage?

Spark Transformations are lazily evaluated - when we call the action it executes all the transformations based on lineage graph.
What is the advantage of having the Transformations Lazily evaluated?
Will it improve performance and consume less memory compared to eager evaluation?
Is there any disadvantage of having the Transformation lazily evaluated?
For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.
One advantage of this is that Spark can make many optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it received it.
For example -- if you executed every transformation eagerly, what would that mean? Well, it means you would have to materialize that many intermediate datasets in memory. This is evidently not efficient -- for one, it will increase your GC costs. (Because you're really not interested in those intermediate results as such. Those are just convenient abstractions for you while writing the program.) So, what you do instead is -- you tell Spark what eventual answer you're interested in, and it figures out the best way to get there.
Consider a 1 GB log file containing error, warning, and info messages, stored in HDFS as blocks of 64 or 128 MB (it doesn't matter in this context). You first create an RDD called "input" from this text file. Then you create another RDD called "errors" by applying a filter on the "input" RDD to fetch only the lines containing error messages, and then you call the action first() on the "errors" RDD. Spark will optimize the processing of the log file here by stopping as soon as it finds the first occurrence of an error message in any of the partitions. If the same scenario were repeated with eager evaluation, Spark would have filtered all the partitions of the log file even though you were only interested in the first error message.
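The log-file scenario can be mimicked with Python generators. This is an analogy under simplifying assumptions, not Spark itself: the five-line LOG list stands in for the 1 GB file, and read_log and the counters are hypothetical names used to count how many lines each strategy touches.

```python
# A tiny stand-in for the log file.
LOG = ["INFO start", "WARN disk", "ERROR boom", "INFO more", "ERROR again"]


def read_log(counter):
    """Yield log lines one by one, counting how many were actually read."""
    for line in LOG:
        counter["lines"] += 1
        yield line


# Lazy: the filter is only a recipe; next() pulls lines until the first match.
lazy_counter = {"lines": 0}
errors = (line for line in read_log(lazy_counter) if line.startswith("ERROR"))
first_error = next(errors)

# Eager: the whole input is filtered before the first element is taken.
eager_counter = {"lines": 0}
all_errors = [line for line in read_log(eager_counter) if line.startswith("ERROR")]
eager_first_error = all_errors[0]
```

Both strategies return the same first error, but the lazy version stops after the third line while the eager one reads all five.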
From https://www.mapr.com/blog/5-minute-guide-understanding-significance-apache-spark
Lazy evaluation means that if you tell Spark to operate on a set of data, it listens to what you ask it to do, writes down some shorthand for it so it doesn’t forget, and then does absolutely nothing. It will continue to do nothing, until you ask it for the final answer. [...]
It waits until you’re done giving it operators, and only when you ask it to give you the final answer does it evaluate, and it always looks to limit how much work it has to do.
It saves time and avoids unnecessary processing.
Consider the case where Spark is not lazy.
For example: we have a 1 GB file to be loaded into memory from HDFS.
We have transformations like:
rdd1 = sc.textFile(path)  # path: location of the 1 GB file on HDFS
print(rdd1.first())  # print the first line of the file
In this case, when the first line is executed, an entry would be made in the DAG and the 1 GB file would be loaded into memory. The disaster in the second line is that the entire 1 GB file has been loaded into memory just to print the first line of the file.
Consider the case where Spark is lazy:
rdd1 = sc.textFile(path)  # path: location of the 1 GB file on HDFS
print(rdd1.first())  # print the first line of the file
In this case, when the first line is executed, an entry is made in the DAG and the entire execution plan is built, and Spark does its internal optimization. Instead of loading the entire 1 GB file, only as much of the file as is needed to produce the first line is read, and that line is printed.
This helps avoid unnecessary computation and makes room for optimization.
Advantages:
"Spark allows programmers to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern" - [Khan15]
"Since spark is based on DAG, it can follow a chain from child to parent to fetch any value like traversal" - [Khan15]
"DAG supports fault tolerance" - [Khan15]
Description:
(According to "Big data Analytics on Apache Spark" [SA16] and [Khan15])
"Spark will not compute RDDs until an action is called." - [SA16]
Example of actions: reduce(func), collect(), count(), first(), take(n), ... [APACHE]
"Spark keeps track of the lineage graph of transformations, which is used to compute each RDD on demand and to recover lost data." - [SA16]
Example of transformations: map(func), filter(func), flatMap(func), groupByKey([numPartitions]), reduceByKey(func, [numPartitions]), ... [APACHE]

How to update an RDD?

We are developing a Spark framework in which we move historical data into RDD sets.
Basically, an RDD is an immutable, read-only dataset on which we do operations.
Based on that, we have moved historical data into RDDs, and we do computations like filtering, mapping, etc. on those RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of an RDD.
I create another RDD based on the request scope and save the reference to that RDD in a ScopeCollection.
So far I have been able to think of the approaches below -
Approach 1: broadcast the change
1. For each change request, my server fetches the scope-specific RDD and spawns a job
2. In a job, apply a map phase on that RDD -
2.a. for each node in the RDD, do a lookup on the broadcast and create a new value which is now updated, thereby creating a new RDD
2.b. now I do all the computations again on this new RDD from step 2.a, like multiplication, reduction, etc.
2.c. I save this RDD's reference back in my ScopeCollection
Approach 2: create an RDD for the updates
1. For each change request, my server fetches the scope-specific RDD and spawns a job
2. On each RDD, do a join with the new RDD holding the changes
3. now I do all the computations again on this new RDD from step 2, like multiplication, reduction, etc.
Approach 3:
I had thought of creating a streaming RDD where I keep updating the same RDD and do the re-computation. But as far as I understand, it can take streams from Flume or Kafka, whereas in my case the values are generated in the application itself based on user interaction.
Hence I cannot see any integration points for a streaming RDD in my context.
Any suggestions on which approach is better, or on any other approach suitable for this scenario?
TIA!
The use case presented here is a good match for Spark Streaming. The two other options raise the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and to preserve that data in RDD form. Kafka and Flume are only two possible stream sources.
You could use socket communication with the SocketInputDStream, read files in a directory using the FileInputDStream, or even use a shared queue with the QueueInputDStream. If none of those options fits your application, you could write your own InputDStream.
In this use case, using Spark Streaming, you would read your base RDD and use the incoming dstream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform will allow you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation could help you build an in-memory state addressed by keys. See the documentation for further information.
Without more details on the application, it is hard to go down to the code level on what is possible using Spark Streaming. I'd suggest you explore this path and ask new questions on any specific topics.
I suggest taking a look at the IndexedRDD implementation, which provides an updatable RDD of key-value pairs. That might give you some insights.
The idea is based on knowledge of the key, which allows you to zip your updated chunk of data with the same keys of the already created RDD. During the update it is possible to filter out the previous version of the data.
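The key-based update idea can be sketched in plain Python. This is not the IndexedRDD API and not Spark code: historical, updates, and apply_updates are hypothetical names, and the dictionary lookup plays the role of the broadcast lookup from Approach 1 in the question.

```python
# Immutable historical records as (key, value) pairs; this list is never modified.
historical = [("k1", 10), ("k2", 20), ("k3", 30)]

# A small set of changed values, addressed by the same keys.
updates = {"k2": 25}


def apply_updates(records, changes):
    """Build a NEW list in which updated keys take their new value."""
    return [(key, changes.get(key, value)) for key, value in records]


updated = apply_updates(historical, updates)

# Downstream computations are then re-run on the new dataset.
total = sum(value for _, value in updated)
```

The original list is left untouched and a new dataset is derived from it, which mirrors how an RDD "update" is really the creation of a new RDD.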
Since you have historical data, I'd say you need some sort of identity for each event.
Regarding streaming and consumption, it is possible to use a TCP port. This way the driver can open the TCP connection that Spark expects to read from and send the updates there.

Resources