Is it possible to share/create Spark session in UDF? - apache-spark

Is it possible to somehow use the Spark session inside a UDF? I also have to ingest data from the child tables based on the referenced table. It looks like this:
def select(entity):
    query = f"SELECT * FROM `{database.value}`.`{table.value}` WHERE id='{entity}'"
    records = spark.sql(query)
    # store records on S3 in CSV format, filename table_name.csv
    return records

ingestion = F.udf(select, ArrayType(StringType()))
try:
    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session
    query = f"SELECT * FROM `{database}`.`{entity_table}`"
    entities = spark.sql(query)
    # store records on S3 in CSV format, filename entities.csv
    tables_with_relations = ["entity.some_child_table", "another_child_table"]
    for child_table in tables_with_relations:
        table = spark.sparkContext.broadcast(child_table)
        response = entities.withColumn("response", ingestion("id"))
        response.show()
except Exception as e:
    raise e
In this case I am getting a pickling error, and if I instead try to create/access another Spark session inside the UDF, as follows:
def select(entity):
    query = f"SELECT * FROM `{database.value}`.`{table.value}` WHERE id='{entity}'"
    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session
    records = spark.sql(query)
    return records
Then I got this error:
Exception: SparkContext should only be created and accessed on the driver.
I am not experienced with Spark, so if this is not what UDFs were introduced for, I am fine with nested for loops; otherwise, can somebody recommend how to achieve this, either with a UDF or with some other approach?
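For reference, a minimal driver-only sketch of the nested-loop approach (my own illustration, not from the thread): collect the entity ids once on the driver and push the id filter into the SQL for each child table, so spark.sql is never called inside a UDF. The S3 paths and write options are assumptions.

entities = spark.sql(f"SELECT * FROM `{database}`.`{entity_table}`")
entities.write.mode("overwrite").option("header", True).csv("s3://my-bucket/entities")  # hypothetical location

# collect the referenced ids once on the driver (assumes the id list is reasonably small)
ids = [row["id"] for row in entities.select("id").distinct().collect()]
id_list = ", ".join(f"'{i}'" for i in ids)

for child_table in tables_with_relations:
    records = spark.sql(f"SELECT * FROM `{database}`.`{child_table}` WHERE id IN ({id_list})")
    # one CSV output per child table, named after it (hypothetical location)
    records.write.mode("overwrite").option("header", True).csv(f"s3://my-bucket/{child_table}")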

Related

RuntimeError: SparkContext should only be created and accessed on the driver

I am trying to execute the code below since I need to look up a table and create a new column from it. I am trying to go with a UDF because a join didn't work out.
With that, I am getting RuntimeError: SparkContext should only be created and accessed on the driver.
To avoid this error I included config('spark.executor.allowSparkContext', 'true') inside the UDF.
But this time I am getting pyspark.sql.utils.AnalysisException: Table or view not found: ser_definition; line 3 pos 5; because the temp view is not available on the executors.
How can I overcome this error, or is there a better approach?
Below is the code.
df_subsbill_label = spark.read.format("csv").option("inferSchema", True).option("header", True).option("multiLine", True)\
    .load("file:///C://Users//test_data.csv")
df_service_def = spark.read.format("csv").option("inferSchema", True).option("header", True).option("multiLine", True)\
    .load("file:///C://Users//test_data2.csv")
df_service_def.createGlobalTempView("ser_definition")
query = '''
SELECT mnthlyfass
FROM ser_definition
WHERE uid = {0}
AND u_soc = '{1}'
AND ser_type = 'SOC'
AND t_type = '{2}'
AND c_type = '{3}'
ORDER BY d_fass DESC, mnthlyfass DESC
LIMIT 1
'''
def lookup_fas(uid, u_soc, t_type, c_type, query):
    spark = SparkSession.builder.config('spark.executor.allowSparkContext', 'true').getOrCreate()
    query = query.format(uid, u_soc, t_type, c_type)
    df = spark.sql(query)
    return df.rdd.flatMap(lambda x: x).collect()

udf_lookup = F.udf(lookup_fas, StringType())

df_subsbill_label = df_subsbill_label.withColumn("mnthlyfass", udf_lookup(F.col("uid"), F.col("u_soc"), F.col("t_type"), F.col("c_type"), F.lit(query)))
df_subsbill_label.show(20, False)
Error:
pyspark.sql.utils.AnalysisException: Table or view not found: ser_definition; line 3 pos 5;
'GlobalLimit 1
+- 'LocalLimit 1
+- 'Sort ['d_fass DESC NULLS LAST, 'mnthlyfass DESC NULLS LAST], true
Please add "global_temp", the database name followed by the table name in the SQL.
FROM global_temp.ser_definition
This should work.
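That is, applying the fix to the query from the question:

query = '''
SELECT mnthlyfass
FROM global_temp.ser_definition
WHERE uid = {0}
AND u_soc = '{1}'
AND ser_type = 'SOC'
AND t_type = '{2}'
AND c_type = '{3}'
ORDER BY d_fass DESC, mnthlyfass DESC
LIMIT 1
'''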
First, you should not try to get the Spark session on an executor when running Spark in cluster mode: the Spark session object cannot be serialized, so it cannot be sent to the executors. It is also against Spark's design principles to do so.
What you can do instead is broadcast the data of your dataframe. Since a DataFrame object itself cannot be pickled, collect it to the driver first (for example as a pandas DataFrame) and broadcast that; this creates a copy inside each executor, which you can then read in the UDF:
df_service_def = spark.read.format("csv").option("inferSchema", True).option("header", True).option("multiLine", True)\
    .load("file:///C://Users//test_data2.csv")

# collect the lookup data to the driver and broadcast it (the DataFrame object itself cannot be pickled)
broadcasted_df_service_def = spark.sparkContext.broadcast(df_service_def.toPandas())
Then inside your UDF:
def lookup_fas(uid, u_soc, t_type, c_type, query):
    df = broadcasted_df_service_def.value
    # apply your lookup/filtering logic on the broadcast (pandas) dataframe here ...
PS: Even though this should work, I think it may impact performance since a UDF is called for each row, so maybe you should change the design of your solution.
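For completeness, a minimal sketch of such a join-based design (my own illustration, not from the original answer): it replaces the per-row UDF with a broadcast join plus a window function that keeps the best-ranked match, mirroring the ORDER BY ... LIMIT 1 in the question. The column names come from the question's query; that both DataFrames share the key columns uid, u_soc, t_type, c_type is an assumption.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# keep only the best service-definition row per key, mimicking
# ORDER BY d_fass DESC, mnthlyfass DESC LIMIT 1 from the original query
w = Window.partitionBy("uid", "u_soc", "t_type", "c_type").orderBy(
    F.col("d_fass").desc(), F.col("mnthlyfass").desc()
)
best_def = (
    df_service_def
    .filter(F.col("ser_type") == "SOC")
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .select("uid", "u_soc", "t_type", "c_type", "mnthlyfass")
)

# one broadcast join instead of a spark.sql call per row
df_subsbill_label = df_subsbill_label.join(
    F.broadcast(best_def),
    on=["uid", "u_soc", "t_type", "c_type"],  # assumed shared key columns
    how="left",
)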

How to prevent spark query against CSV glue catalog source from including headers?

I am attempting to build a Glue job that will execute a SQL query against an existing Glue catalog and store the results in another Glue catalog (in the example below, only returning the record with the highest cost for each value of sn). When executing a Spark query against CSV-sourced data, however, the header is included in the results. This issue does not occur when the source is Parquet. The Glue catalog Serde parameters include skip.header.line.count = 1, and executing the query against the source data through Athena does not include the headers.
Is there a way to explicitly tell Spark to ignore header rows when using .sql()?
Here is the essence of the python code my glue job is executing:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
glue_source_database_name = 'source_database'
glue_destination_database_name = 'destination_database'
table_name = 'diamonds10_csv'
partition_count = 5
merge_query = 'SELECT SEQ.`sn`,SEQ.`carat`,SEQ.`cut`,SEQ.`color`,SEQ.`clarity`,SEQ.`depth`,SEQ.`table`,SEQ.`price`,SEQ.`x`,SEQ.`y`,SEQ.`z` FROM ( SELECT SUB.`sn`,SUB.`carat`,SUB.`cut`,SUB.`color`,SUB.`clarity`,SUB.`depth`,SUB.`table`,SUB.`price`,SUB.`x`,SUB.`y`,SUB.`z`, ROW_NUMBER() OVER ( PARTITION BY SUB.`sn` ORDER BY SUB.`price` DESC ) AS test_diamond FROM `diamonds10_csv` AS SUB) AS SEQ WHERE SEQ.test_diamond = 1'
spark_context = SparkContext.getOrCreate()
spark = SparkSession( spark_context )
spark.sql( f'use {glue_source_database_name}')
targettable = spark.sql(merge_query)
targettable.repartition(partition_count).write.option("path",f'{s3_output_path}/{table_name}').mode("overwrite").format("parquet").saveAsTable(f'`{glue_destination_database_name}`.`{table_name}`')
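No answer was posted for this one, but one generic workaround (my own sketch, not from the thread) is to drop any row whose values are just the column names repeated, since that is what a stray header row looks like once the CSV has been parsed, and then run the query against that filtered view:

from pyspark.sql import functions as F

# read the catalog table and flag rows that are actually the repeated CSV header
# (every column value equals its own column name)
src = spark.table(table_name)
is_header = None
for c in src.columns:
    cond = F.col(c) == F.lit(c)
    is_header = cond if is_header is None else (is_header & cond)

# shadow the catalog table with a temp view of the same name so merge_query picks it up
src.filter(~is_header).createOrReplaceTempView(table_name)
targettable = spark.sql(merge_query)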

Spark sql querying a Hive table from workers

I am trying to query a Hive table from a map operation in Spark, but when it runs the query the execution freezes.
This is my test code:
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes");
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp");
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") // This works
println(res.first())

val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084")) // Some existing NPIs
val rows = NPIs.mapPartitions { partition =>
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
  partition.map { code =>
    val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '" + code + "'") // The program stops here
    res.first()
  }
}
rows.collect().foreach(println)
It loads the data from a CSV file, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the driver it works perfectly, but if I try to do that inside the map operation the execution freezes.
It does not raise any error; it just keeps running without doing anything.
The Spark UI shows this situation.
Actually, I am not sure whether I can query a table in a distributed way at all; I cannot find it in the documentation.
Any suggestion?
Thanks.
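No answer is included here, but the root cause matches the threads above: a SparkSession cannot be used inside mapPartitions on the executors, so each spark.sql call issued from a worker hangs. A minimal sketch of the usual alternative is to turn the key list into a DataFrame and join it against the table once on the driver; it is written as a PySpark sketch for consistency with the rest of this page (the same join works in Scala), and the table name and NPI values come from the question.

# build a one-column DataFrame of the keys to look up
npis = spark.createDataFrame(
    [("1679576722",), ("1588667638",), ("1306849450",), ("1932102084",)], ["NPI"])

# one distributed query instead of one spark.sql(...) call per element on an executor
rows = spark.table("npicodes").join(npis, on="NPI", how="inner")
rows.show()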

How to evaluate spark Dstream objects with an spark data frame

I am writing a Spark app where I need to evaluate the streaming data against historical data, which sits in a SQL Server database.
The idea is that Spark will fetch the historical data from the database, persist it in memory, and evaluate the streaming data against it.
I am getting the streaming data as:
import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext,functions as func,Row
sc = SparkContext("local[2]", "realtimeApp")
ssc = StreamingContext(sc,10)
files = ssc.textFileStream("hdfs://RealTimeInputFolder/")
######## Let's get the data from the db which is relevant for streaming ########
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
dataurl = "jdbc:sqlserver://myserver:1433"
db = "mydb"
table = "stream_helper"
credential = "my_credentials"
########basic data for evaluation purpose ########
files_count = files.flatMap(lambda file: file.split( ))
pattern = '(TranAmount=Decimal.{2})(.[0-9]*.[0-9]*)(\\S+ )(TranDescription=u.)([a-zA-z\\s]+)([\\S\\s]+ )(dSc=u.)([A-Z]{2}.[0-9]+)'
tranfiles = "wasb://myserver.blob.core.windows.net/RealTimeInputFolder01/"
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']
def pre_parse(logline):
    """
    to read files as rows of sql in pyspark streaming using the pattern, for use of logging
    added 0,1 in case there is any failure in processing by this pattern
    """
    match = re.search(pattern, logline)
    if match is None:
        return (logline, 0)
    else:
        return (
            Row(
                customer_id=match.group(8),
                trantype=match.group(5),
                amount=float(match.group(2))
            ), 1)
def parse():
    """
    actual processing is happening here
    """
    parsed_tran = ssc.textFileStream(tranfiles).map(pre_parse)
    success = parsed_tran.filter(lambda s: s[1] == 1).map(lambda x: x[0])
    fail = parsed_tran.filter(lambda s: s[1] == 0).map(lambda x: x[0])
    if fail.count() > 0:
        print("no of non parsed files: %d" % fail.count())
    return success, fail

success, fail = parse()
Now I want to evaluate it against the data frame that I get from the historical data:
base_data = sqlContext.read.format("jdbc").options(driver=driver,url=dataurl,database=db,user=credential,password=credential,dbtable=table).load()
Since this is returned as a data frame, how do I use it for my purpose?
The streaming programming guide here says
"You have to create a SQLContext using the SparkContext that the StreamingContext is using."
This makes me even more confused about how to use the existing dataframe with the streaming object. Any help is highly appreciated.
To manipulate DataFrames, you always need a SQLContext, so you can instantiate it like:
sc = SparkContext("local[2]", "realtimeApp")
sqlc = SQLContext(sc)
ssc = StreamingContext(sc, 10)
These 2 contexts (SQLContext and StreamingContext) will coexist in the same job because they are associated with the same SparkContext.
But keep in mind, you can't instantiate two different SparkContexts in the same job.
Once you have created your DataFrame from your DStreams, you can join your historical DataFrame with the DataFrame created from your stream.
To do that, I would do something like:
yourDStream.foreachRDD(lambda rdd: sqlContext
    .createDataFrame(rdd)
    .join(historicalDF, ...)
    ...
)
Think about the amount of streamed data you need for your join when you manipulate streams; you may be interested in the windowed functions.
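A slightly fuller sketch of that pattern (my own illustration, not from the original answer): it reuses the names defined in the question and the snippet above, and the join key customer_id is an assumption.

# load the historical data once on the driver
historicalDF = sqlc.read.format("jdbc").options(
    driver=driver, url=dataurl, database=db,
    user=credential, password=credential, dbtable=table).load()

def evaluate(time, rdd):
    if rdd.isEmpty():
        return
    # turn the parsed Rows of this micro-batch into a DataFrame
    streamDF = sqlc.createDataFrame(rdd)
    # join the batch against the historical lookup data (assumed shared key: customer_id)
    streamDF.join(historicalDF, on="customer_id", how="left").show()

success.foreachRDD(evaluate)
ssc.start()
ssc.awaitTermination()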

Spark SQL cassandra delete records

Is there a way to delete some records based on a select query?
I have this query,
Select min(id) from ID having count(*) > 1, which will show the duplicates. I need to get those ids and delete them. How can I do it in Spark SQL?
Spark SQL does not support DELETE.
If the number of ids to delete is small, you can do it using the Cassandra driver instead of through Spark:
import scala.collection.JavaConverters._
import com.datastax.driver.core.{Cluster, Session, BatchStatement}
import com.datastax.driver.core.querybuilder.QueryBuilder

val cluster = Cluster.builder().addContactPoint(host_ip).build()
val session = cluster.connect(keyspace)
val idsToDelete = ... // perform your query and collect the ids
val queries = idsToDelete.map({ id => QueryBuilder.delete().from(keyspace, table).where(QueryBuilder.eq("id", id)) })
val batch = new BatchStatement().addAll(queries.asJava)
session.execute(batch)
cluster.close()
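If you are driving this from PySpark instead, the same idea with the DataStax Python driver looks roughly like this (a sketch; keyspace, table, host_ip and the id list are placeholders, as in the Scala snippet above):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

cluster = Cluster([host_ip])
session = cluster.connect(keyspace)

ids_to_delete = ...  # collect the ids from your Spark query
batch = BatchStatement()
for dup_id in ids_to_delete:
    # bind each id into a DELETE statement and add it to the batch
    batch.add(SimpleStatement("DELETE FROM {} WHERE id = %s".format(table)), (dup_id,))
session.execute(batch)
cluster.shutdown()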
