RuntimeError: SparkContext should only be created and accessed on the driver - apache-spark

I am trying to execute the below code since I need to lookup the table and create a new column out of it. So, I am trying to go with udf as joining didn't work out.
In that, I am getting the RuntimeError: SparkContext should only be created and accessed on the driver. error.
To avoid this error I have included the config('spark.executor.allowSparkContext', 'true') inside the udf function.
But this time I am getting the pyspark.sql.utils.AnalysisException: Table or view not found: ser_definition; line 3 pos 5; error due to the temp table does not spread across the executors.
How to overcome this error or is there any other better approach.
Below is the code.
df_subsbill_label = spark.read.format("csv").option("inferSchema", True).option("header", True).option("multiLine", True)\
.load("file:///C://Users//test_data.csv")\
df_service_def = spark.read.format("csv").option("inferSchema", True).option("header", True).option("multiLine", True)\
.load("file:///C://Users//test_data2.csv")\
df_service_def.createGlobalTempView("ser_definition")
query = '''
SELECT mnthlyfass
FROM ser_definition
WHERE uid = {0}
AND u_soc = '{1}'
AND ser_type = 'SOC'
AND t_type = '{2}'
AND c_type = '{3}'
ORDER BY d_fass DESC, mnthlyfass DESC
LIMIT 1
'''
def lookup_fas(uid, u_soc, t_type, c_type, query):
spark = SparkSession.builder.config('spark.executor.allowSparkContext', 'true').getOrCreate()
query = query.format(uid, u_soc, t_type, c_type,)
df = spark.sql(query)
return df.rdd.flatMap(lambda x : x).collect()
udf_lookup = F.udf(lookup_fas, F.StringType())
df_subsbill_label = df_subsbill_label.withColumn("mnthlyfass", udf_lookup(F.col("uid"), F.col("u_soc"), F.col("t_type"), F.col("c_type"), F.lit(query)))
df_subsbill_label.show(20, False)
Error:
pyspark.sql.utils.AnalysisException: Table or view not found: ser_definition; line 3 pos 5;
'GlobalLimit 1
+- 'LocalLimit 1
+- 'Sort ['d_fass DESC NULLS LAST, 'mnthlyfass DESC NULLS LAST], true

Please add "global_temp", the database name followed by the table name in the SQL.
FROM global_temp.ser_definition
This should work.

First you shoud not get spark session on to executor if you are running spark in cluster mode as spark session object cannot be serialised thus cannot send it to executor. Also, it is against spark design principles to do so.
What you can do here is to broadcast your dataframe instead, this will create a copy of your dataframe inside each executor, then you can get the dataframe in the executor:
df_service_def = spark.read.format("csv").option("inferSchema", True).option("header", True).option("multiLine", True)\
.load("file:///C://Users//test_data2.csv")
broadcastVar = spark.broadcast(Array(0, 1, 2, 3))
broadcasted_df_service_def = spark.sparkContext.broadcast(df_service_def)
then inside your udf:
def lookup_fas(uid, u_soc, t_type, c_type, query):
df = broadcasted_df_service_def.value
# here apply your query on the dataframe ...
PS: Even though this should work I think it my impact the performance since an udf is called for each row, so maybe you should change the design of your solution.

Related

Is it possible to share/create Spark session in UDF?

Is it possible to somehow use spark session inside the UDF function? I will have to ingest the data from child tables as well on the basis of referenced table. It's looking like this:
def select(entity):
query = f"SELECT * FROM `{database.value}`.`{table.value}` WHERE id='{entity}'"
records = spark.sql(query)
# store records on S3 in CSVformat, filename table_name.csv
return records
ingestion = F.udf(select, ArrayType(StringType()))
try:
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
query = f"SELECT * FROM `{database}`.`{entity_table}`"
entities = spark.sql(query)
# store records on S3 in CSV format, filename entities.csv
tables_with_relations = ["entity.some_child_table", "another_child_table"]
for child_table in tables_with_relations:
table = spark.sparkContext.broadcast(child_table)
response = entities.withColumn("response", ingestion("id"))
response.show()
except Exception as e:
raise e
In this case I am getting some pickling error and if I try by creating/accessing another spark session as following:
def select(entity):
query = f"SELECT * FROM `{database.value}`.`{table.value}` WHERE id='{entity}'"
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
records = spark.sql(query)
return records
Then I got this error:
Exception: SparkContext should only be created and accessed on the driver.
I am not experienced in Spark, therefore if it is not something for which UDF has been introduced, then I am fine with nested for loops, otherwise can somebody recommend how to achieve this either by using UDF or any other approach?

Is spark.read or spark.sql lazy transformations?

In Spark if the source data has changed in between two action calls why I still get previous o/p not the most recent ones. Through DAG all operations will get executed including read operation once action is called. Isn't it?
e.g.
df = spark.sql("select * from dummy.table1")
#Reading from spark table which has two records into dataframe.
df.count()
#Gives count as 2 records
Now, a record inserted into table and action is called withou re-running command1 .
df.count()
#Still gives count as 2 records.
I was expecting Spark will execute read operation again and fetch total 3 records into dataframe.
Where my understanding is wrong ?
To contrast your assertion, this below does give a difference - using Databricks Notebook (cells). The insert operation is not known that you indicate.
But the following using parquet or csv based Spark - thus not Hive table, does force a difference in results as the files making up the table change. For a DAG re-compute, the same set of files are used afaik, though.
//1st time in a cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//1st time in another cell
val df2 = spark.sql("select * from tab2")
df2.count()
//4 is returned
//2nd time in a different cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//2nd time in another cell
df2.count()
//8 is returned
Refutes your assertion. Also tried with .enableHiveSupport(), no difference.
Even if creating a Hive table directly in Databricks:
spark.sql("CREATE TABLE tab5 (id INT, name STRING, age INT) STORED AS ORC;")
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
...
df.count()
...
spark.sql(""" INSERT INTO tab5 VALUES (2, 'Amy SmithS', 77) """)
df.count()
...
Still get updated counts.
However, the for a Hive created ORC Serde table, the following "hive" approach or using an insert via spark.sql:
val dfX = Seq((88,"John", 888)).toDF("id" ,"name", "age")
dfX.write.format("hive").mode("append").saveAsTable("tab5")
or
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
will sometimes show and sometimes not show an updated count when just the 2nd df.count() is issued. This is due to Hive / Spark lack of synchronization that may depend on some internal flagging of changes. In any event not consistent. Double-checked.
This is most related to inmutability as I see it. DataFrames are inmutables, hence changes in the original table are not reflected on them.
Once a dataframe is evaluated, it will be never calculated again. So once the dataframe named df is evaluated, it is the picture of table1 at the time of evaluation, it doesn't matter if table1 changes, df won't. So the second df.count does not trigger evaluation it just return the previous result, which is 2
If you want the desired results you have to load again the DF in a different variable:
val df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
val df2 = spark.sql("select * from dummy.table1")
df2.count() //Will trigger evaluation and return 3
Or using var instead of val (which is bad)
var df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 3
This said: yes, spark read and spark sql are lazy, those are not called until an action is found, but once that happens, evaluation won't be trigger ever again in that dataframe

How to prevent spark query against CSV glue catalog source from including headers?

I am attempting to build a Glue job that will execute a SQL query against an existing glue catalog, and store the results in another glue catalog (in the example below, only return the record with the highest cost for each value of sn.) When executing a spark query against CSV sourced data, however, it is including the header in the results. This issue does not occur when the source is parquet. The glue catalog Serde parameters includes skip.header.line.count 1, and executing the query against the source data through Athena does not include the headers.
Is there a way to explicitly tell spark to ignore header rows when using .sql()?
Here is the essence of the python code my glue job is executing:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
glue_source_database_name = 'source_database'
glue_destination_database_name = 'destination_database'
table_name = 'diamonds10_csv'
partition_count = 5
merge_query = 'SELECT SEQ.`sn`,SEQ.`carat`,SEQ.`cut`,SEQ.`color`,SEQ.`clarity`,SEQ.`depth`,SEQ.`table`,SEQ.`price`,SEQ.`x`,SEQ.`y`,SEQ.`z` FROM ( SELECT SUB.`sn`,SUB.`carat`,SUB.`cut`,SUB.`color`,SUB.`clarity`,SUB.`depth`,SUB.`table`,SUB.`price`,SUB.`x`,SUB.`y`,SUB.`z`, ROW_NUMBER() OVER ( PARTITION BY SUB.`sn` ORDER BY SUB.`price` DESC ) AS test_diamond FROM `diamonds10_csv` AS SUB) AS SEQ WHERE SEQ.test_diamond = 1'
spark_context = SparkContext.getOrCreate()
spark = SparkSession( spark_context )
spark.sql( f'use {glue_source_database_name}')
targettable = spark.sql(merge_query)
targettable.repartition(partition_count).write.option("path",f'{s3_output_path}/{table_name}').mode("overwrite").format("parquet").saveAsTable(f'`{glue_destination_database_name}`.`{table_name}`')

How to avoid excessive dataframe query

Consider there is spark job has multiple dataframe transitions
val baseDF1 = spark.sql(s"select * from db.table1 where condition1='blah'")
val baseDF2 = spark.sql(s"select * from db.table2 where condition2='blah'")
val df3 = basedDF1.join(baseDF12, basedDF1("col1") <=> basedDF1("col2"))
val df4 = df3.withcolumn("col3").withColumnRename("col4", "newcol4")
val df5 = df4.groupBy("groupbycol").agg(expr("coalesce(first(col5, false))"))
val df6 = df5.withColumn("level1", col("coalesce(first(col5, false))")(0))
.withColumn("level2", col("coalesce(first(col5, false))")(1))
.withColumn("level3", col("coalesce(first(col5, false))")(2))
.withColumn("level4", col("coalesce(first(col5, false))")(3))
.withColumn("level5", col("coalesce(first(col5, false))")(4))
.drop("coalesce(first(col5, false))")
I just wondering how Spark generate the spark SQL logic, is it going to generate the query-like transaction for each data frame, i.e
df1 = select * ....
df2 = select * ....
df3 = df1.join.df2 // spark takes content from df1/df2 instead run each query again for joining
....
df6 = ...
or generate large query by the end of the last dataframe
df6 = select coalesce(first(col5, false)).. from ((select * from table1) join (select * from table2 ) on blah ) group by blah 2...
All I trying to figure out, is how to avoid Spark generate huge query-like logic instead I can let Spark "Commit" somewhere to avoid huge long transaction
the reason behind the inquiry is because current spark job threw following exception
19/12/17 10:57:55 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 567, Column 28: Redefinition of parameter "agg_expr_21"
Spark has two operations - transformation and action.
Transformation happens when a DF is being built using various operations like - select, join, filter etc. It is read to be executed but has not done any work yet, it is being lazy. These transformations can be composed to make new transformation which you do while operating on predefined dataframes, like basedDF1.join(baseDF12, basedDF1("col1") <=> basedDF1("col2")). But again nothing has run.
Action happens when certain operations are called like save, collect, show etc. This is when real work happens. Here each and every 'transformation' that was defined before with be either executed or retrieved from cache. You can save a lot of work for Spark if you can cache some of the complex steps. This can also simplify the plan.
val baseDF1 = spark.sql(s"select * from db.table1 where condition1='blah'")
val baseDF2 = spark.sql(s"select * from db.table2 where condition2='blah'")
baseDF1.cache()
baseDF2.cache()
val df3 = basedDF1.join(baseDF12, basedDF1("col1") <=> basedDF1("col2"))
val df4 = baseDF1.join(baseDF12, basedDF1("col2") === basedDF1("col3"))// different join
When df4 is executed after df3, it won't be selecting from db.table1 and db.table2 but rather reading baseDF1 and baseDF2 from cache. The plan will look simpler too.
if some reason cache is gone then Spark will recompute baseDF1 and baseDF2 as they were defined, so it knows its lineage but didn't execute it.
You can also use checkpoint to break up the lineage of overall execution, hence simplify it. I think this can help your case.
I have also saved intermediate dataframe to a temporary file and read It back as a dataframe and use it down the line. This breaks up the complexity at the cost of extra io. I won’t recommend it unless other methods didn’t work.
I am not sure about the error you are getting.

Will a persisted-dataframe be calculated repeatedly many times?

I've got the following structured query:
val A = 'load somedata from HDFS'.persist(StorageLevel.MEMORY_AND_DISK_SER)
val B = A.filter('condition 1')
val C = A.filter('condition 2')
val D = A.filter('condition 3')
val E = A.filter('condition 4')
val F = A.filter('condition 5')
val G = A.filter('condition 6')
val H = A.filter('condition 7')
val I = B.union(C).union(D).union(E).union(F).union(G).union(H)
I persist the dataframe A, so that when I use B/C/D/E/F/G/H, the A dataframe should be calculated only once? But the DAG of this job is below:
From the DAG above, it seems that stage 6-12 are all executed and the dataframe A is calculated 7 times?
Why would this happen?
Maybe the DAG is just fake? I found that there are not lines on the top of stage 7-12 where stage 6 does have two lines from other stage
I didn't list all the operations. After union operation, I save the I dataframe to HDFS. Will this action on the I dataframe make the persist operation be done really? Or must I do an action operation such as count on the A dataframe to trigger the persist operation before reuse A dataframe?
Doing the following line won't persist your dataset.
val A = 'load somedata from HDFS'.persist(StorageLevel.MEMORY_AND_DISK_SER)
Caching/persistence is lazy when used with Dataset API so you have to trigger the caching using count operator or similar that in turn submits a Spark job.
After that all the following operators, filter including, should use InMemoryTableScan with the green dot in the plan (as shown below).
In your case even after union the dataset I is not cached since you have not triggered the caching (but merely marked it for caching).
After union operation, I save the I dataframe to HDFS. Will this action on the I dataframe make the persist operation be done really?
Yes. Only actions (like saving to an external storage) can trigger the persistence for future reuse.
Or must I do an action operation such as count on the A dataframe to trigger the persist operation before reuse A dataframe?
That's the point! In your case, since you want to reuse A dataframe across filter operators you should persist it first, count (to trigger the caching) followed by filters.
In your case, no filter will benefit from any performance increase due to persist. That persist is practically void of any impact on the performance and just makes a code reviewer think it's otherwise.
If you want to see when and if your dataset is cached, you can check out Storage tab in web UI or ask CacheManager about it.
val nums = spark.range(5).cache
nums.count
scala> spark.sharedState.cacheManager.lookupCachedData(nums)
res0: Option[org.apache.spark.sql.execution.CachedData] =
Some(CachedData(Range (0, 5, step=1, splits=Some(8))
,InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Range (0, 5, step=1, splits=8)
))

Resources