Trying to consume Kafka streams using Spark Structured Streaming - apache-spark

I'm new to Kafka streaming. I set up a Twitter listener using Python, and it is running against the Kafka server at localhost:9092. I can consume the stream produced by the listener using a Kafka client tool (Conduktor) and also using the command "bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter --from-beginning".
But when I try to consume the same stream using Spark Structured Streaming, it captures nothing and throws the error: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
[Screenshots: command output (consumes data); Jupyter output for the Spark consumer (does not consume data)]
My producer (listener) code:
import pykafka
import tweepy
from tweepy import Stream
from tweepy.streaming import StreamListener
auth = tweepy.OAuthHandler("**********", "*************")
auth.set_access_token("*************", "***********************")
# session.set('request_token', auth.request_token)
api = tweepy.API(auth)
class KafkaPushListener(StreamListener):
    def __init__(self):
        # Kafka broker address (9092 is the default Kafka broker port, not a ZooKeeper port)
        self.client = pykafka.KafkaClient("0.0.0.0:9092")
        # Get a producer for the topic named "twitter"
        self.producer = self.client.topics[bytes("twitter", "ascii")].get_producer()
    def on_data(self, data):
        # Forward each raw tweet received from Twitter to the Kafka topic
        self.producer.produce(bytes(data, "ascii"))
        return True
    def on_error(self, status):
        # Print the error status and keep the stream alive
        print(status)
        return True
twitter_stream = Stream(auth, KafkaPushListener())
twitter_stream.filter(track=['#fashion'])
Consumer access from Spark Structured Streaming:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "twitter") \
    .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

I found what was missing: when I submitted the Spark job, I had to include the right version of the dependency package.
I have Spark 3.0.0.
Therefore, I included the org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 package.
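The coordinate in this answer follows a visible pattern: the artifact name ends with the Scala binary version, and the last field matches the Spark release. A minimal sketch of that pattern follows; kafka_package is a hypothetical helper name (not part of PySpark), and the default Scala version 2.12 is simply taken from the coordinate quoted above.

```python
def kafka_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the Structured Streaming Kafka connector.

    Hypothetical helper for illustration; it only formats a string.
    """
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


# Spark 3.0.0, as in the answer above
coordinate = kafka_package("3.0.0")
print(coordinate)

# The coordinate is then passed at submit time, for example:
#   spark-submit --packages <coordinate> your_app.py
# (your_app.py is a placeholder name.)
```

If the Spark or Scala version of the installation differs, the same two fields would need to change accordingly.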

Add a sink, and the query will start consuming data from Kafka.
See the code below.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "twitter") \
    .load()
# The console format is used here; change it as per your requirement.
query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
query.awaitTermination()

Related

Spark Structured Streaming inconsistent output to multiple sinks

I am using Spark Structured Streaming to read data from Kafka and apply a UDF to the dataset. The code is below:
calludf = F.udf(lambda x: function_name(x))
dfraw = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
    .option('subscribe', topic_name) \
    .load()
df = dfraw.withColumn("value", F.col('value').cast('string')).withColumn('value', calludf(F.col('value')))
ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .start()
dsf = df.selectExpr("CAST (value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location) \
    .start()
ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. 2 of them failed due to some issue with the data, which is understandable. The console displays the remaining 8 processed dataframes, BUT only 6 of those 8 are written to the Kafka topic by the dsf streaming query. I have added a checkpoint location to it, but it is still not working.
PS: Do let me know if you have any suggestions regarding the code as well. I am new to Spark Structured Streaming, so maybe there is something wrong with the way I am doing it.

How to guarantee sequence of execution of multiple sinks in spark structured streaming

In my scenario, I have a Structured Streaming application which reads from Kafka and writes to HDFS and Kafka using 3 different sinks. The primary sink is the HDFS one and the others are secondary. I want the primary sink to run first and then the secondary sinks. All have a trigger time of 60 seconds. Is there a way to achieve that in Spark Structured Streaming? Adding the code snippet:
val spark = SparkSession
  .builder
  .master(StreamerConfig.sparkMaster)
  .appName(StreamerConfig.sparkAppName)
  .getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers",StreamerConfig.kafkaBootstrapServer)
  .option("subscribe",StreamerConfig.topicName)
  .option("failOnDataLoss", false)
  .option("startingOffsets",StreamerConfig.kafkaStartingOffset)
  .option("maxOffsetsPerTrigger",StreamerConfig.maxOffsetsPerTrigger)
  .load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
  .queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
  .outputMode("append")
  .format("orc")
  .option("path",StreamerConfig.hdfsLandingPath)
  .option("checkpointLocation",StreamerConfig.checkpointLocation)
  .partitionBy("date","hour")
  .option("truncate","false")
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
  .writeStream
  .format("kafka") // Local Testing - "console"
  .option("topic", StreamerConfig.watermarkKafkaTopic)
  .option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
  .option("checkpointLocation", StreamerConfig.phase1Checkpoints)
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
  .writeStream
  .format("kafka") // Local Testing - "console"
  .option("topic", StreamerConfig.watermarkKafkaTopic)
  .option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
  .option("checkpointLocation", StreamerConfig.phase2Checkpoints)
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .start()
PS: I am using spark 2.3.2

PySpark Kafka - NoClassDefFound: org/apache/commons/pool2

I am encountering a problem with printing data from a Kafka topic to the console.
The error message I get was shown in a screenshot: after batch 0, no further batches are processed.
I don't understand the root cause of these errors. Please help me.
Following are kafka and spark version:
spark version: spark-3.1.1-bin-hadoop2.7
kafka version: kafka_2.13-2.7.0
I am using the following jars:
kafka-clients-2.7.0.jar
spark-sql-kafka-0-10_2.12-3.1.1.jar
spark-token-provider-kafka-0-10_2.12-3.1.1.jar
Here is my code:
spark = SparkSession \
.builder \
.appName("Pyspark structured streaming with kafka and cassandra") \
.master("local[*]") \
.config("spark.jars","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraLibrary","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.driver.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# streaming dataframe that reads from the Kafka topic
df_kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()
print("Printing schema of df_kafka:")
df_kafka.printSchema()
# converting data from the Kafka broker to string type
df_kafka_string = df_kafka.selectExpr("CAST(value AS STRING) as value")
# schema to read JSON format data
ts_schema = StructType() \
    .add("id_str", StringType()) \
    .add("created_at", StringType()) \
    .add("text", StringType())
# parse JSON data
df_kafka_string_parsed = df_kafka_string.select(from_json(col("value"), ts_schema).alias("twts"))
df_kafka_string_parsed_format = df_kafka_string_parsed.select("twts.*")
df_kafka_string_parsed_format.printSchema()
df = df_kafka_string_parsed_format.writeStream \
    .trigger(processingTime="1 seconds") \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start()
df.awaitTermination()
The error (NoClassDefFound, followed by the kafka010 package) is saying that spark-sql-kafka-0-10 is missing its transitive dependency, org.apache.commons:commons-pool2:2.6.2.
You can either download that JAR as well, or change your code to use --packages instead of the spark.jars option and let Ivy handle downloading the transitive dependencies:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache...'
spark = SparkSession.builder...
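A slightly fuller sketch of the environment-variable approach, under stated assumptions: set_submit_packages is a hypothetical helper name, and the coordinate used is inferred from the spark-sql-kafka-0-10_2.12-3.1.1.jar file named in the question rather than taken from the answer. When PYSPARK_SUBMIT_ARGS is set by hand for a plain Python process, the value normally needs to end with pyspark-shell, and it has to be set before the SparkSession is created.

```python
import os


def set_submit_packages(packages):
    """Set PYSPARK_SUBMIT_ARGS so PySpark resolves the given Maven packages.

    Hypothetical helper for illustration. The trailing "pyspark-shell"
    token is what PySpark expects at the end of the submit arguments.
    """
    args = "--packages " + ",".join(packages) + " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


# Assumed coordinate: matches the connector JAR version listed in the question.
submit_args = set_submit_packages(
    ["org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1"]
)
print(submit_args)

# Only after this point would the session be created, e.g.:
#   spark = SparkSession.builder.getOrCreate()
```

With --packages, the transitive dependencies (including commons-pool2) are resolved automatically, so the manual spark.jars list for the Kafka connector should no longer be needed.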

Spark Streaming Kafka - How to stop streaming after processing all existing messages (gracefully)

This is what I am trying to do:
Stream data from a Kafka topic, which keeps receiving data continuously.
Run the job twice a day to process all data existing at that point, and then stop the stream.
So I initially added a call to stop on the query, but it was throwing "TimeoutException".
Then I tried increasing the timeout dynamically, but now I am getting java.io.IOException: Caused by: java.lang.InterruptedException
So, is there any way to gracefully stop the stream without getting any exceptions?
Below is (part of) my current code, which is throwing the interrupted exception:
import os
import time
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", os.environ["KAFKA_SERVERS"])
    .option("subscribe", config.kafka.topic)
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 25000)
    .load()
)
# <do some processing and save the data>
def save_batch(batch_df, batch_id):
    pass
query = df.writeStream.foreachBatch(save_batch).start(
    outputMode="append",
    checkpointLocation=os.path.join(checkpoint_path, config.kafka.topic),
)
while query.isActive:
    progress = query.lastProgress
    # A batch well below maxOffsetsPerTrigger suggests the backlog is drained
    if progress and progress["numInputRows"] < 25000 * 0.9:
        timeout = sum(progress["durationMs"].values())
        # Clamp the stop timeout to between 15 seconds and 5 minutes
        timeout = min(5 * 60 * 1000, max(15000, timeout))
        spark.conf.set("spark.sql.streaming.stopTimeout", str(timeout))
        query.stop()
        break
    time.sleep(10)
Spark Version: 2.4.5
Scala Version: 2.1.1
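The timeout arithmetic in the loop above can be checked in isolation. The sketch below repeats the same clamp (at least 15 seconds, at most 5 minutes) as a plain function; stop_timeout_ms is a hypothetical name, and the durationMs dictionaries are made-up sample values, not output from a real query.

```python
def stop_timeout_ms(duration_ms, floor_ms=15_000, ceiling_ms=5 * 60 * 1000):
    """Sum the per-phase durations and clamp the result to [floor_ms, ceiling_ms]."""
    total = sum(duration_ms.values())
    return min(ceiling_ms, max(floor_ms, total))


# Hypothetical durationMs values, in milliseconds
print(stop_timeout_ms({"addBatch": 4000, "getBatch": 500}))    # 15000 (raised to the floor)
print(stop_timeout_ms({"addBatch": 40000, "getBatch": 2000}))  # 42000 (within range)
print(stop_timeout_ms({"addBatch": 400000}))                   # 300000 (capped at the ceiling)
```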
Update: As of Spark 3.3, .trigger(availableNow=True) is an option that plays nicely with .option("maxOffsetsPerTrigger", 25000).
I would recommend .trigger(once=True) and .awaitTermination() (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers).
Warning: .trigger(once=True) will not work with .option("maxOffsetsPerTrigger", 25000). If maxOffsetsPerTrigger is not set, the query defaults to pulling all offsets since it was last run and creates one large micro-batch.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", os.environ["KAFKA_SERVERS"]) \
    .option("subscribe", config.kafka.topic) \
    .option("startingOffsets", "earliest") \
    .load()
def foreach_batch_function(df, epoch_id):
    # Transform and write batchDF
    pass
df \
    .writeStream \
    .foreachBatch(foreach_batch_function) \
    .trigger(once=True) \
    .start(
        outputMode="append",
        checkpointLocation=os.path.join(checkpoint_path, config.kafka.topic),
    ) \
    .awaitTermination()
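To make the difference between the two triggers concrete, here is a deliberately simplified model of how many micro-batches each one would run over a fixed, non-empty backlog. It is an illustration of the behaviour described in this answer, not Spark code: expected_micro_batches is a hypothetical name, and the model ignores per-partition effects and data that arrives while the job runs.

```python
import math


def expected_micro_batches(backlog, trigger, max_offsets_per_trigger=None):
    """Rough count of micro-batches for a non-empty backlog (simplified model).

    - "once": a single micro-batch with everything, regardless of the limit.
    - "availableNow": the backlog is split according to maxOffsetsPerTrigger.
    """
    if backlog <= 0:
        raise ValueError("this sketch assumes a non-empty backlog")
    if trigger == "once" or max_offsets_per_trigger is None:
        return 1
    if trigger == "availableNow":
        return math.ceil(backlog / max_offsets_per_trigger)
    raise ValueError(f"unknown trigger: {trigger}")


# Hypothetical backlog of 60,000 offsets with the limit from the question
print(expected_micro_batches(60_000, "once", 25_000))          # 1
print(expected_micro_batches(60_000, "availableNow", 25_000))  # 3
```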

Pyspark Kafka structured streaming: error while writing out

I am able to read a stream from a Kafka topic and write the (transformed) data back to another Kafka topic in two different steps in PySpark. The code to do that is as follows:
# Define Stream:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "instream") \
    .load()
# Transform
matchdata = df.select(from_json(F.col("value").cast("string"), schema).alias("value")) \
    .select(F.col('value').cast("string"))
# Stream the data, from a Kafka topic to a Spark in-memory table
query = matchdata \
    .writeStream \
    .format("memory") \
    .queryName("PositionTable") \
    .outputMode("append") \
    .start()
query.awaitTermination(5)
# Create a new dataframe after stream completes:
tmp_df = spark.sql("select * from PositionTable")
# Write data to a different Kafka topic
tmp_df \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "outstream") \
    .save()
The code above works as expected: the data in Kafka topic "instream" is read in PySpark, and then PySpark can write out data to Kafka topic "outstream".
However, I would like to read the stream in and write the transformed data back out immediately (the stream will be unbounded and we would like insights immediately as the data rolls in). Following the documentation, I replaced the query above with the following:
query = matchdata \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "outstream") \
    .option("checkpointLocation", "/path/to/HDFS/dir") \
    .start()
This does not appear to work.
There is no error message, so I do not know what is wrong. I've also tried windowing and aggregating within windows, but that also does not work. Any advice will be appreciated!
Ok, I found the problem. The main reason was that the subdirectory "path/to/HDFS/dir" has to exist. After creating that directory the code ran as expected. It would have been nice if an error message stated something along those lines.
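Following this answer, one defensive step is to create the checkpoint directory before starting the query. The sketch below does that for a local filesystem path only, using a temporary directory so it can be run anywhere; ensure_checkpoint_dir and the "checkpoints/outstream" layout are hypothetical. For a real HDFS location the equivalent would be something like hdfs dfs -mkdir -p on the target path, since os.makedirs does not talk to HDFS.

```python
import os
import tempfile


def ensure_checkpoint_dir(path):
    """Create the checkpoint directory (and any parents) if it is missing."""
    os.makedirs(path, exist_ok=True)
    return path


# Temporary base directory stands in for the real checkpoint root
base = tempfile.mkdtemp()
checkpoint_dir = ensure_checkpoint_dir(os.path.join(base, "checkpoints", "outstream"))
print(os.path.isdir(checkpoint_dir))

# The path would then be passed to the writer, e.g.:
#   .option("checkpointLocation", checkpoint_dir)
```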
