Error while passing dataframe to UDF in Structured Streaming - apache-spark

I am reading events from Kafka in Spark Structured Streaming and need to process the events one by one and write them to Redis. I wrote a UDF for that, but it fails with a SparkContext error.
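from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, from_json, udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
conf = SparkConf()\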
.setAppName(spark_app_name)\
.setMaster(spark_master_url)\
.set("spark.redis.host", "redis")\
.set("spark.redis.port", "6379")\
.set("spark.redis.auth", "abc")
spark = SparkSession.builder\
.config(conf=conf)\
.getOrCreate()
def func(element, event, timestamp):
    # redis i/o
    pass
schema = ArrayType(StructType(
[
StructField("element_id", StringType()),
StructField("event_name", StringType()),
StructField("event_time", StringType())
]
))
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka:9092") \
.option("subscribe", topic) \
.load()
#.option("includeTimestamp", value = True)\
ds = df.selectExpr(("CAST(value AS STRING)"))\
.withColumn("value", explode(from_json("value", schema)))
filter_func = udf(func, ArrayType(StringType()))
ds = ds.withColumn("column_name", filter_func(
ds['value']['element_id'],
ds['value']['event_name'],
ds['value']['event_time']
))
query = ds.writeStream \
.format("console") \
.start()
query.awaitTermination()
Error message: _pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Any help is appreciated.

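The problem was that I was accessing the SparkContext from within the user-defined function, which is not allowed: a UDF runs on the workers, and the SparkContext can only be used on the driver.
Inside the UDF, I was trying to write to Redis through spark-redis, which uses the SparkContext.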
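One way to avoid the error is to keep Spark objects out of the function entirely and open a plain client connection inside the code that runs on the workers. The sketch below is not the original poster's code: `write_partition`, `client_factory`, and `InMemoryClient` are hypothetical names, and the in-memory class only stands in for a real client (for example redis-py's `redis.Redis`) so that the example runs without a Redis server.

```python
from typing import Callable, Dict, Iterable, Tuple

Event = Tuple[str, str, str]  # (element_id, event_name, event_time)


class InMemoryClient:
    """Stand-in for a Redis client (assumption: only hset is needed)."""

    def __init__(self) -> None:
        self.store: Dict[str, Dict[str, str]] = {}

    def hset(self, name: str, key: str, value: str) -> None:
        self.store.setdefault(name, {})[key] = value


def write_partition(rows: Iterable[Event],
                    client_factory: Callable[[], InMemoryClient]) -> int:
    """Write every event of one partition and return how many were written.

    The client is created here, on the worker, so nothing from the driver
    (SparkContext, SparkSession, DataFrame) is captured and pickled.
    """
    client = client_factory()
    written = 0
    for element_id, event_name, event_time in rows:
        client.hset(element_id, event_name, event_time)
        written += 1
    return written


# Local demonstration with the stand-in client.
shared = InMemoryClient()
count = write_partition(
    [("e1", "click", "2020-01-01T00:00:00"), ("e1", "view", "2020-01-01T00:00:05")],
    lambda: shared,
)
print(count, shared.store)
```

In a streaming job, a function of this shape could be applied to each micro-batch through `writeStream.foreachBatch` and `DataFrame.foreachPartition`, with the factory returning a real Redis connection instead of the stand-in.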

Related

Unable to Connect to Apache Spark Kafka Streaming Using SSL on EMR Notebooks

I'm trying to connect to a Kafka consumer secured by SSL using spark structured streaming but I am having issues. I have the Kafka consumer working and reading events using confluent_kafka with the following code:
conf = {'bootstrap.servers': 'url:port',
'group.id': 'group1',
'enable.auto.commit': False,
'security.protocol': 'SSL',
'ssl.key.location': 'file1.key',
'ssl.ca.location': 'file2.pem',
'ssl.certificate.location': 'file3.cert',
'auto.offset.reset': 'earliest'
}
consumer = Consumer(conf)
consumer.subscribe(['my_topic'])
# Reads events without issues
msg = consumer.poll(timeout=0)
I'm having issues replicating this code with Spark Structured Streaming on EMR Notebooks.
This is the current setup I have on EMR Notebooks:
%%configure -f
{
"conf": {
"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
"livy.rsc.server.connect.timeout":"600s",
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
}
}
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkFiles
cert_file = 'file3.cert'
pem_file = 'file2.pem'
key_file = 'file3.key'
sc.addFile(f's3://.../{cert_file}')
sc.addFile(f's3://.../{pem_file}')
sc.addFile(f's3://.../{key_file}')
spark = SparkSession\
.builder \
.getOrCreate()
# SparkFiles.get() works for the files added above
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "url:port") \
.option("kafka.group.id", "group1") \
.option("enable.auto.commit", "False") \
.option("kafka.security.protocol", "SSL") \
.option("kafka.ssl.key.location", SparkFiles.get(key_file)) \
.option("kafka.ssl.ca.location", SparkFiles.get(pem_file)) \
.option("kafka.ssl.certificate.location", SparkFiles.get(cert_file)) \
.option("startingOffsets", "earliest") \
.option("subscribe", "my_topic") \
.load()
query = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.outputMode("append") \
.format("console") \
.start()
No rows appear in the Structured Streaming tab of the Spark UI, even though I expect them to show up immediately because startingOffsets is set to earliest.
My hypothesis is that readStream doesn't work because the SSL information is not set up correctly. I've looked and haven't found .option() parameters that directly correspond to the confluent_kafka API.
Any help would be appreciated.
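Spark's Kafka source is built on the Java Kafka client, whose SSL settings are named differently from the librdkafka settings that confluent_kafka accepts, so keys such as `ssl.ca.location` have no direct counterpart. The sketch below shows one possible mapping. It assumes a Kafka client version that supports PEM key stores and trust stores (2.7 or later) and that the private key and certificate chain have been concatenated into a single PEM file. `pem_ssl_options` and the file paths are hypothetical, and the mapping has not been verified on EMR.

```python
from typing import Dict


def pem_ssl_options(keystore_pem: str, truststore_pem: str) -> Dict[str, str]:
    """Build Kafka-source options for SSL with PEM files (Java client names)."""
    return {
        "kafka.security.protocol": "SSL",
        # Private key plus certificate chain, in one PEM file (assumption).
        "kafka.ssl.keystore.type": "PEM",
        "kafka.ssl.keystore.location": keystore_pem,
        # CA certificate(s) used to verify the brokers.
        "kafka.ssl.truststore.type": "PEM",
        "kafka.ssl.truststore.location": truststore_pem,
    }


# Placeholder paths for illustration only.
options = pem_ssl_options("/path/to/key_and_cert.pem", "/path/to/file2.pem")
for name, value in sorted(options.items()):
    print(name, "=", value)
```

The resulting dictionary could be passed in one call with `spark.readStream.format("kafka").options(**options)`, alongside the `kafka.bootstrap.servers` and `subscribe` options.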

How to create stream using pyspark and kafka and read it row by row

I'm trying to use PySpark to read a Kafka stream; in later stages I will process each row and store it in InfluxDB. The problem is that PySpark is not reading the stream, and no errors are shown.
Nothing is printed, although foreach(show_data) in my code is supposed to print 'test' for each row.
An example row of the stream sent by kafka is attached in the second picture
Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = (
    SparkSession.builder.appName("Kafka Pyspark Streaming")
    .master("local[*]")
    .getOrCreate()
)
spark.sparkContext.setLogLevel('ERROR')
# Read stream from json and fit schema
inputStream = spark\
.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "SWAT")\
.option("startingOffsets", "latest")\
.load()
inputStream = inputStream.select(col("value").cast("string").alias("data"))
# inputStream.printSchema()
#inputStream = inputStream.selectExpr("CAST(value AS STRING)")
print(inputStream)
# Read stream and process
def show_data(row):
    print("test")
print("> Reading the stream and storing ...")
query = (inputStream
.writeStream
.outputMode("append")
.foreach(show_data)
.option("checkpointLocation", "checkpoints")
.start())
query.awaitTermination()

How to transform dataframes to rdds in structured streaming?

I get data from Kafka using PySpark streaming, and the result is a DataFrame. When I convert the DataFrame to an RDD, it fails:
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 36, in <module>
df = df.rdd.map(lambda x: x.value.split(" ")).toDF()
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 91, in rdd
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
The working version of the code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.load()
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
df = df.withColumn("s", F.split(df['value'], " "))
df = df.withColumn('e', F.explode(df['s']))
# df = df.rdd.map(lambda x: x.value.split(" ")).toDF()
q = df.writeStream \
.format("console") \
.trigger(processingTime='30 seconds') \
.start()
q.awaitTermination()
This is the failing version of the code:
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.load()
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
# df = df.withColumn("s", F.split(df['value'], " "))
# df = df.withColumn('e', F.explode(df['s']))
df = df.rdd.map(lambda x: x.value.split(" ")).toDF()
q = df.writeStream \
.format("console") \
.trigger(processingTime='30 seconds') \
.start()
q.awaitTermination()
Why can't the DataFrame be converted to an RDD, and what should I do when I want to transform a DataFrame into an RDD in PySpark streaming?
If your Spark version is 2.4.0 or above, you can use the alternative below to work on each row of your DataFrame:
# process_row is your own function; it is called once per row, in place of the RDD map
query = df.writeStream.foreach(process_row).outputMode("update").start()
query.awaitTermination()
Converting to an RDD is simply not supported here. RDDs are legacy, and Spark Structured Streaming is DataFrame/Dataset based, a common abstraction for both streaming and batch.
To perform specific actions on your DataFrame fields, you can use UDFs, or you can even create custom Spark Transformers. Some DataFrame operations, such as converting to an RDD, are not supported on streaming DataFrames.
Structured Streaming runs on the Spark SQL engine. Converting a streaming DataFrame or Dataset to an RDD is not supported.
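The RDD-based line in the question splits each value on spaces. As a sketch of how the same per-row logic could be kept while staying within Structured Streaming, the example below puts it in a plain function that could be wrapped as a UDF or called from a `foreach` handler. `split_value` and `handle_row` are hypothetical names, and the `Row` stand-in exists only so that the example runs without Spark.

```python
from collections import namedtuple
from typing import List, Optional

# Stand-in for pyspark.sql.Row, which also exposes columns as attributes.
Row = namedtuple("Row", ["key", "value"])


def split_value(value: Optional[str]) -> List[str]:
    """Split a message on single spaces; a null value yields an empty list."""
    return [] if value is None else value.split(" ")


def handle_row(row: Row) -> List[str]:
    """Per-row handler of the kind that writeStream.foreach() calls."""
    return split_value(row.value)


print(handle_row(Row(key=None, value="hello structured streaming")))
print(handle_row(Row(key="k", value=None)))
```

With Spark available, wrapping `split_value` as `F.udf(split_value, ArrayType(StringType()))` would give a column expression comparable to the `F.split` call in the working version.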

How to format a pyspark connection string for Azure Eventhub with Kafka

I am trying to parse JSON messages with Pyspark from an Azure Eventhub with enabled Kafka compatibility. I can't find any documentation on how to establish the connection.
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc.stop() # Jupyter somehow created a context already..
sc = SparkContext(appName="PythonTest")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
# my connection string:
#Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=examplekeyname;SharedAccessKey=HERETHEJEY=;EntityPath=examplepathname - has a total of 5 partitions
kafkaStream = KafkaUtils.createStream(HOW DO I STRUCTURE THIS??)
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.count().map(lambda x:'Messages in this batch: %s' % x).pprint()
ssc.start()
ssc.awaitTermination()
See my answer (and question) here. That was about how to write to a Kafka-enabled Event Hub in PySpark, but I assume the configuration for reading is very similar. The tricky part was getting the security configuration right.
EH_SASL = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
# Source: https://github.com/Azure/azure-event-hubs-for-kafka/tree/master/tutorials/spark#running-spark
dfKafka \
.write \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.batch.size", 5000) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("kafka.request.timeout.ms", 120000) \
.option("topic", "raw") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()
You can find an official tutorial on how to set up a consumer here. It's for Scala instead of PySpark, but it's fairly easy to translate the code if you compare it with my example.
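For reading, the same security settings apply to the Kafka source. The sketch below is an illustration rather than code from the linked answer: it derives the bootstrap server and the JAAS string from an Event Hubs connection string. `eventhub_kafka_options` is a hypothetical helper, the connection string is a placeholder, and port 9093 is taken from the write example above.

```python
from typing import Dict


def eventhub_kafka_options(connection_string: str, topic: str) -> Dict[str, str]:
    """Build Kafka-source options for a Kafka-enabled Event Hub."""
    # Parse "Key=Value;Key=Value" pairs; values may themselves contain "=".
    parts = dict(
        item.split("=", 1)
        for item in connection_string.strip(";").split(";")
        if item
    )
    # "sb://example.servicebus.windows.net/" -> "example.servicebus.windows.net"
    host = parts["Endpoint"].replace("sb://", "").strip("/")
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    )
    return {
        "kafka.bootstrap.servers": f"{host}:9093",
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": topic,
    }


# Placeholder connection string in the format shown in the question.
conn = (
    "Endpoint=sb://example.servicebus.windows.net/;"
    "SharedAccessKeyName=examplekeyname;SharedAccessKey=PLACEHOLDER=;"
    "EntityPath=examplepathname"
)
opts = eventhub_kafka_options(conn, "examplepathname")
print(opts["kafka.bootstrap.servers"])
```

The options could then be passed to the Structured Streaming reader with `spark.readStream.format("kafka").options(**opts).load()`, which replaces the `KafkaUtils.createStream` call from the question.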

Pyspark Structured streaming processing

I am trying to build a Structured Streaming application with Spark. The main idea is to read from a Kafka source, process the input, and write back to another topic. I have successfully made Spark read from and write to Kafka; my problem is with the processing part. I tried the foreach function to capture every row and process it before writing back to Kafka, but it only ever runs the foreach part and never writes back to Kafka. If I remove the foreach part from the writeStream, it writes to Kafka again, but then I lose my processing.
If anyone can give me an example of how to do this, I would be extremely grateful.
Here is my code:
spark = SparkSession \
.builder \
.appName("StructuredStreamingTrial") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "KafkaStreamingSource") \
.load()
ds = df \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")\
.writeStream \
.outputMode("update") \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "StreamSink") \
.option("checkpointLocation", "./testdir")\
.foreach(foreach_function) \
.start().awaitTermination()
The foreach_function is simply:
def foreach_function(df):
    try:
        print(df)
    except:
        print('fail')
    pass
To process the data before writing it to a Kafka sink with the PySpark Structured Streaming API, you can use a UDF, which can handle any kind of complex transformation.
Example code is below. It reads JSON-formatted messages from a Kafka topic, parses each message to convert it from JSON into CSV format, and writes the result to another topic. You can put any other processing in place of the 'json_formatted' function.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Structured Streaming only needs the SparkSession (no StreamingContext):
spark = SparkSession.builder.appName('pda_inst_monitor_status_update').getOrCreate()
# Create the streaming DataFrame that reads from Kafka:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "KafkaStreamingSource") \
.load()
df1 = df.selectExpr("CAST(value AS STRING)")
df1.createOrReplaceTempView("test")
def json_formatted(s):
    # Parse the JSON message and join the selected fields with commas.
    val_dict = json.loads(s)
    return str([
        val_dict["after"]["ID"]
        , val_dict["after"]["INST_NAME"]
        , val_dict["after"]["DB_UNIQUE_NAME"]
        , val_dict["after"]["DBNAME"]
        , val_dict["after"]["MON_START_TIME"]
        , val_dict["after"]["MON_END_TIME"]
    ]).strip('[]').replace("'","").replace('"','')
spark.udf.register("JsonformatterWithPython", json_formatted)
squared_udf = udf(json_formatted)
df1 = spark.table("test")
# The Kafka sink expects the payload in a column named "value".
df2 = df1.select(squared_udf("value").alias("value"))
# Write the transformed stream to the sink topic:
query = df2.coalesce(1) \
.writeStream \
.outputMode("update") \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "StreamSink") \
.option("checkpointLocation", "./testdir")\
.start()
query.awaitTermination()
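The string manipulation in `json_formatted` breaks if a field itself contains a comma, a quote, or a bracket. As a hedged alternative, the sketch below builds the CSV line with the standard `csv` module. `json_to_csv_line` is a hypothetical name, the field list is copied from the answer, and the sample message is invented for illustration.

```python
import csv
import io
import json

FIELDS = ["ID", "INST_NAME", "DB_UNIQUE_NAME", "DBNAME",
          "MON_START_TIME", "MON_END_TIME"]


def json_to_csv_line(message: str) -> str:
    """Convert one JSON message into a single CSV line (no trailing newline)."""
    after = json.loads(message)["after"]
    buffer = io.StringIO()
    # csv.writer quotes fields that contain commas or quotes.
    csv.writer(buffer, lineterminator="").writerow([after[name] for name in FIELDS])
    return buffer.getvalue()


# Invented sample message; note the comma inside INST_NAME.
sample = json.dumps({"after": {
    "ID": 1, "INST_NAME": "inst, a", "DB_UNIQUE_NAME": "db1",
    "DBNAME": "db", "MON_START_TIME": "t0", "MON_END_TIME": "t1",
}})
print(json_to_csv_line(sample))
```

The function could be wrapped with `udf(json_to_csv_line)` and used in place of `json_formatted` in the pipeline above.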

Resources