I am working on a project that includes a workflow consisting of multiple tasks, where a single task consists of multiple components.
For example:
For a join, we need 4 components: 2 for input (consider a two-table join), 1 for the join logic, and 1 for output (writing back to HDFS).
This is one task; similarly, "sort" can be another task.
Suppose a workflow with these two tasks, which are linked, i.e. after performing the join we use its output in the sorting task.
But "join" and "sort" invoke separate Spark sessions.
So the flow is: one Spark session is created for the join via spark-submit, the output is saved to HDFS, and the session is closed. For the sort, another session is created via spark-submit, and the output stored in HDFS by the join task is fetched for sorting.
The problem is the overhead of fetching the data from HDFS.
So is there any way I can share a session across tasks between two spark-submits, so that I don't lose the result DataFrame from the join and it can be used directly in the next spark-submit for the sort?
Basically, I have multiple spark-submits associated with different tasks, but I want to retain the result DataFrame in memory so that I don't need to persist it and it can be used in another linked task (spark-submit).
The SparkSession builder has a function to get or create a SparkSession:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("AzuleneTest")
  .getOrCreate()
But reading your description, you may be better off passing DataFrames between the functions (which hold a reference to a Spark session). This avoids creating multiple instances of Spark. For example:
//This is run within a single spark application
val df1 = spark.read.parquet("s3://bucket/1/")
val df2 = spark.read.parquet("s3://bucket/2/")
val df3 = df1.join(df2,"id")
df3.write.parquet("s3://bucket/3/")
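For illustration, a minimal PySpark sketch of that structure, with functions that take the shared session and DataFrames as parameters (the function names and the sort step are made up to mirror the join/sort workflow from the question):

from pyspark.sql import SparkSession

def load_inputs(spark):
    # Both reads happen inside the same Spark application, so the session is shared.
    df1 = spark.read.parquet("s3://bucket/1/")
    df2 = spark.read.parquet("s3://bucket/2/")
    return df1, df2

def join_task(df1, df2):
    return df1.join(df2, "id")

def sort_task(df):
    return df.orderBy("id")

spark = SparkSession.builder.appName("workflow").getOrCreate()
df1, df2 = load_inputs(spark)
sort_task(join_task(df1, df2)).write.parquet("s3://bucket/3/")

The join result is handed to the sort step in memory; nothing is written to or re-read from HDFS between the two tasks.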
I'm trying to read a list of directories each into its own dataframe. E.g.
dir_list = ['dir1', 'dir2', ...]
df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])
...
Each directory has data of varying schemas.
I want to do this in parallel, so just a simple for loop won't work. Is there any way of doing this?
Yes there is :))
"How" will depend on what kind of processing you do after read, because by itself the spark.read.csv(...) won't execute until you call an action (due to Spark's lazy evaluation) and putting multiple reads in a for loop will work just fine.
So, if the results of evaluating multiple dataframes have the same schema, parallelism can be simply achieved by UNION'ing them. For example,
from pyspark.sql.functions import lit

df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])

(df1.withColumn("dfid", lit("df1")).groupBy("dfid").count()
    .union(df2.withColumn("dfid", lit("df2")).groupBy("dfid").count())
    .show(truncate=False))
... will cause both dir_list[0] and dir_list[1] to be read in parallel.
If this is not feasible, then there is always a Spark Scheduling route:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action.
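A rough PySpark sketch of that route (assuming each DataFrame ultimately gets written somewhere, since the write is the action that actually triggers the read; the output paths are made up):

from concurrent.futures import ThreadPoolExecutor

def read_and_write(dir_path):
    # Each call runs on its own driver thread, so Spark can schedule these jobs concurrently.
    df = spark.read.csv(dir_path)
    df.write.mode("overwrite").parquet(dir_path + "_out")

with ThreadPoolExecutor(max_workers=len(dir_list)) as pool:
    list(pool.map(read_and_write, dir_list))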
I'm using Spark Structured Streaming in Databricks, and I'm using the foreach operation to perform some operations on every record of the data. But the function I'm passing to foreach uses SparkSession, and it throws an error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.
So, is there any way to use SparkSession inside foreach?
EDIT #1:
One example of a function passed to foreach would be something like:
def process_data(row):
    df = spark.createDataFrame([row])
    df.write.mode("overwrite").saveAsTable("T2")
    spark.sql("""
        MERGE INTO T1
        USING T2 ON T2.type="OrderReplace" AND T1.ReferenceNumber=T2.originalReferenceNumber
        WHEN MATCHED THEN
          UPDATE SET
            shares = T2.shares,
            price = T2.price,
            ReferenceNumber = T2.newReferenceNumber
    """)
So I need a SparkSession here, which is not available inside foreach.
From the description of the tasks you want to perform row by row on your streaming dataset, you can try using foreachBatch.
A streaming dataset may have thousands to millions of records, and if you hit external systems for each record, that is a massive overhead. From your example, it seems you are updating a base dataset T1 based on events in T2.
Using foreachBatch, you can do something like the below:
def process_data(df, epoch_id):
    df.write.mode("overwrite").saveAsTable("T2")
    spark.sql("""
        MERGE INTO T1
        USING T2 ON T2.type="OrderReplace" AND T1.ReferenceNumber=T2.originalReferenceNumber
        WHEN MATCHED THEN
          UPDATE SET
            shares = T2.shares,
            price = T2.price,
            ReferenceNumber = T2.newReferenceNumber
    """)

streamingDF.writeStream.foreachBatch(process_data).start()
Some additional points regarding what you are doing:
You are doing a foreach on a streaming dataset and overwriting the T2 Hive table. Since records are processed in parallel on cluster managers like YARN, T2 may end up corrupted, because multiple tasks may be updating it at the same time.
If you are persisting records in T2 (even temporarily), why not make the merge/update logic a separate batch process? That would be similar to the foreachBatch solution above.
Updating/creating Hive tables ultimately fires a MapReduce job, which requires resource negotiation and so on, and may be painfully slow if done per record. You may want to consider a different type of destination, for example a JDBC-based one.
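For illustration, a foreachBatch sink that appends each micro-batch to a JDBC table instead of firing a Hive merge per record (the URL, table name and credentials are placeholders):

def write_batch_to_jdbc(df, epoch_id):
    # One JDBC write per micro-batch instead of one operation per record.
    (df.write
       .mode("append")
       .jdbc("jdbc:postgresql://dbhost:5432/orders",   # placeholder URL
             "order_updates",                          # placeholder table
             properties={"user": "user", "password": "password"}))

streamingDF.writeStream.foreachBatch(write_batch_to_jdbc).start()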
I have a question on the inner workings of Spark.
If I define a DataFrame from a Hive table, e.g. df1 = spark_session.table('db.table'), is that table read just once?
What I mean is: if I create 4 or 5 new DataFrames from df1 and output them all to separate files, is that more efficient than running them as separate Spark jobs?
Is the first approach (one application that reads the table once and writes all the outputs) more efficient than the second (a separate Spark job per output, each reading the Hive table)? Does it result in less load on Hive because we read the data once, or is that not how it works?
If I define a DataFrame from a Hive table, e.g. df1 = spark_session.table('db.table'), is that table read just once?
You need to cache() it: df1 = spark_session.table('db.table').cache(). Then Spark reads the table once and caches the data when the first action is performed.
If you output df1 to 4 or 5 different files, Spark still reads the data from the Hive table only once, because the data is already cached.
Is the first approach more efficient? Does it result in less load on Hive because we read the data once, or is that not how it works?
Yes, with the first approach we put less load on Hive because we read the data only once.
With the second approach, writing a separate Spark job for each file means we read the Hive table in every job.
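A minimal sketch of the cached, single-application approach (the filter conditions and output paths are made up):

df1 = spark_session.table("db.table").cache()

# The first action triggers the Hive read and populates the cache.
df1.filter("category = 'a'").write.mode("overwrite").parquet("/out/a")

# Later outputs reuse the cached data instead of hitting Hive again.
df1.filter("category = 'b'").write.mode("overwrite").parquet("/out/b")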
I have the following code. It performs a join operation on two Datasets, one of which is filtered inside the join transformation.
activeUserProfileDataset.join(
  allJobModelsDataset.filter(jobModel => jobIdRecCandidatesBroadcasted.value.contains(jobModel.JobId)),
  $"notVisitedJobId" === col(JobModelFieldNames.jobId),
  "left_outer")
This caused a problem:
SparkException: Task not serializable
However, when I take the filter transformation out and create the second Dataset outside of the join, it works:
val jobIdRecCandidatesJobModels = allJobModelDataset.filter(jobModel => jobIdRecCandidatesBroadcasted.value.contains(jobModel.JobId))
val userJobPredictionsDataset3 = userJobPredictionsDataset2.join(
  jobIdRecCandidatesJobModels,
  $"notVisitedJobId" === col(JobModelFieldNames.jobId),
  "left_outer")
Could you explain why this is? Could you tell me how these transformation operations (like join and filter) work internally?
Thanks!
This is because in Spark you can't specify transformations inside other transformations.
The main idea:
The driver node processes the DAG and creates tasks
Workers execute transformations (in the form of tasks)
In the first example you try to process a DAG and create tasks INSIDE a transformation (on a worker node): essentially, you create a task that needs to create tasks on another DF. But remember, workers can't create new tasks. They only execute them.
In the second example you do everything correctly on the driver node: you first define the transformations on the DF, and then just use the resulting DF in the new task.
Hope that helps :)
I'm using Spark with MongoDB, and I want to know how the input RDD is split across the different worker nodes in the cluster. My job is to combine two records (one a request, the other a response) into one, based on the msg_id and flag fields (the flag indicates request or response); msg_id is the same in both records. Since Spark splits the input RDD with each split going to a node, how do I handle the case where the request record ends up on one node and the response record on another?
Firstly, the Spark master does not split the data. It just controls the workers.
Secondly, RDD splits (when reading from external sources) are decided by InputSplits, implemented through the input format. This part is fairly similar to MapReduce. So in your case, the RDD splits (or partitions, in Spark terms) are decided by the MongoDB input format.
In your case, I believe what you are looking for is to co-locate all records for a msg_id on one node. That can be achieved by partitioning on that key, e.g. with partitionBy on a key-value RDD, as in the sketch below.
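A rough PySpark sketch of that idea, assuming input_rdd is the RDD loaded from MongoDB with records laid out as msg_id,flag,payload (the partition count is arbitrary):

# Key each record by msg_id so related request/response records share a key.
pairs = input_rdd.map(lambda line: (line.split(",")[0], line))

# Hash-partition on the key: equal msg_ids land in the same partition (same node).
colocated = pairs.partitionBy(8)

# The request and response for each msg_id can now be combined locally.
merged = colocated.groupByKey()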
RDDs will be built based on your transformations (subject to the scenario), and the master has little role to play here. Refer to this link: How does Spark parallelize slices to tasks/executors/workers?
In your case, you may need to use the groupBy() or groupByKey() (the latter is not recommended) transformation to group your values based on the key (msg_id).
For example
val baseRDD = sc.parallelize(Array("1111,REQUEST,abcd", "1111,RESPONSE,wxyz", "2222,REQUEST,abcd", "2222,RESPONSE,wxyz"))
// Convert the base RDD to a key-value pair RDD
val keyValRDD = baseRDD.map { line => (line.split(",")(0), line) }
// Group it by message_id
val groupedRDD = keyValRDD.groupBy(keyvalue => keyvalue._1)
groupedRDD.saveAsTextFile("c:\\result")
Result :
(1111,CompactBuffer((1111,1111,REQUEST,abcd), (1111,1111,RESPONSE,wxyz)))
(2222,CompactBuffer((2222,2222,REQUEST,abcd), (2222,2222,RESPONSE,wxyz)))
In the above case, the probability of having all the values for a key in the same partition is high (subject to data volume and the computing resources available at run time).