How do I consume a Kafka topic inside a Spark streaming app? - apache-spark

When I create a stream from a Kafka topic and print its content:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(appName="PythonStreamingKafkaWords")
ssc = StreamingContext(sc, 10)
lines = KafkaUtils.createDirectStream(ssc, ['sample_topic'], {"bootstrap.servers": 'localhost:9092'})
lines.pprint()
ssc.start()
ssc.awaitTermination()
I get an empty result:
-------------------------------------------
Time: 2019-12-07 13:11:50
-------------------------------------------
-------------------------------------------
Time: 2019-12-07 13:12:00
-------------------------------------------
-------------------------------------------
Time: 2019-12-07 13:12:10
-------------------------------------------
Meanwhile, it works in the console:
kafka-console-consumer --topic sample_topic --from-beginning --bootstrap-server localhost:9092
correctly gives me all the lines of my text in the Kafka topic:
ham Ok lor... Sony ericsson salesman... I ask shuhui then she say quite gd 2 use so i considering...
ham Ard 6 like dat lor.
ham Why don't you wait 'til at least wednesday to see if you get your .
ham Huh y lei...
spam REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
spam This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
ham Will ü b going to esplanade fr home?
. . .
What is the proper way to stream data from a Kafka topic into a Spark streaming app?

Based on your code, we can't print the streaming RDD directly; the printing should be done via foreachRDD. DStream.foreachRDD is an "output operator" in Spark Streaming. It allows you to access the underlying RDDs of the DStream and execute actions that do something practical with the data.
See also: What's the meaning of DStream.foreachRDD function?
Note: you can also achieve this with Structured Streaming. Ref: Pyspark Structured streaming processing
Sample working code: this code tries to read messages from a Kafka topic and print them. You can change the code based on your requirements.
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

def handler(message):
    # Collect this micro-batch on the driver and print the value of each (key, value) record
    records = message.collect()
    for record in records:
        print(record[1])

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)
    # The default UTF-8 value decoder is used here; pass valueDecoder=... if your
    # messages need a custom deserializer.
    kvs = KafkaUtils.createDirectStream(ssc, ['topic_name'], {"metadata.broker.list": 'localhost:9192'})
    kvs.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()

The reason you are not seeing any data in the streaming output is that Spark Streaming starts reading from the latest offset by default. So if you start your Spark Streaming application first and then write data to Kafka, you will see output in the streaming job. Refer to the documentation here:
By default, it will start consuming from the latest offset of each Kafka partition
But you can also read data from any specific offset of your topic. Take a look at the createDirectStream method here. It takes a fromOffsets parameter where you can specify the starting offset per partition as a dictionary.
I have tested the code below with Kafka 2.2.0, Spark 2.4.3 and Python 3.7.3.
Start the pyspark shell with the Kafka dependencies:
pyspark --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.0
Run the code below:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
topicPartition = TopicAndPartition('test', 0)
fromOffset = {topicPartition: 0}
lines = KafkaUtils.createDirectStream(ssc, ['test'], {"bootstrap.servers": 'localhost:9092'}, fromOffsets=fromOffset)
lines.pprint()
ssc.start()
ssc.awaitTermination()
Also, you should consider using Structured Streaming instead of Spark Streaming if your Kafka broker version is 0.10 or higher. Refer to the Structured Streaming documentation here and Structured Streaming with Kafka integration here.
Below is sample code to run with Structured Streaming.
Please use the jar version that matches your Kafka and Spark versions.
I am using Spark 2.4.3 with Scala 2.11 and Kafka 0.10, so I am using the jar spark-sql-kafka-0-10_2.11:2.4.3.
Start the pyspark shell:
pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("startingOffsets", "earliest") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()

I recommend using Spark Structured Streaming. It's the new-generation streaming engine that comes with the release of Spark 2. You can check it at this link.
For Kafka integration, you can look at the docs at this link.
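For illustration, here is a minimal end-to-end sketch (assuming the spark-sql-kafka package shown in the previous answer is on the classpath; the broker address, topic names and checkpoint path are placeholders) that reads one Kafka topic and writes the results back to another:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredKafkaExample").getOrCreate()

# Read the source topic as a streaming DataFrame.
src = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sample_topic")
       .option("startingOffsets", "earliest")
       .load())

# The Kafka sink expects a string or binary "value" column (and optionally "key").
out = src.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Write the stream back to another Kafka topic; a checkpoint location is required.
query = (out.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output_topic")
         .option("checkpointLocation", "/tmp/kafka_example_checkpoint")
         .start())

query.awaitTermination()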

Related

PySpark Kafka - NoClassDefFound: org/apache/commons/pool2

I am encountering a problem printing data from a Kafka topic to the console.
The error I get (shown as screenshots in the original post) is a NoClassDefFoundError for org/apache/commons/pool2; after batch 0 it doesn't process any further.
I don't understand the root cause of these errors. Please help me.
The Kafka and Spark versions are:
spark version: spark-3.1.1-bin-hadoop2.7
kafka version: kafka_2.13-2.7.0
I am using the following jars:
kafka-clients-2.7.0.jar
spark-sql-kafka-0-10_2.12-3.1.1.jar
spark-token-provider-kafka-0-10_2.12-3.1.1.jar
Here is my code:
spark = SparkSession \
.builder \
.appName("Pyspark structured streaming with kafka and cassandra") \
.master("local[*]") \
.config("spark.jars","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraLibrary","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.driver.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# streaming dataframe that reads from the kafka topic
df_kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()

print("Printing schema of df_kafka:")
df_kafka.printSchema()

# converting data from the kafka broker to string type
df_kafka_string = df_kafka.selectExpr("CAST(value AS STRING) as value")

# schema to read json format data
ts_schema = StructType() \
    .add("id_str", StringType()) \
    .add("created_at", StringType()) \
    .add("text", StringType())

# parse json data
df_kafka_string_parsed = df_kafka_string.select(from_json(col("value"), ts_schema).alias("twts"))
df_kafka_string_parsed_format = df_kafka_string_parsed.select("twts.*")
df_kafka_string_parsed_format.printSchema()

df = df_kafka_string_parsed_format.writeStream \
    .trigger(processingTime="1 seconds") \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start()

df.awaitTermination()
The error (NoClassDefFound, followed by the kafka010 package) says that spark-sql-kafka-0-10 is missing its transitive dependency on org.apache.commons:commons-pool2:2.6.2, as you can see here.
You can either download that JAR as well, or change your code to use --packages instead of the spark.jars option and let Ivy handle downloading the transitive dependencies:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache...'
spark = SparkSession.builder...
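A fuller sketch of that approach (assuming the Spark 3.1.1 / Scala 2.12 build from the question, hence the coordinate org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1; adjust it to your own versions):

import os

# Must be set before the SparkSession (and its JVM) is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Pyspark structured streaming with kafka and cassandra")
         .master("local[*]")
         .getOrCreate())

# Ivy now resolves spark-sql-kafka-0-10 together with kafka-clients,
# spark-token-provider-kafka-0-10 and commons-pool2, so no manual spark.jars list is needed.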

spark structured streaming operation duration

I am running a Structured Streaming job with a Kafka source.
spark: 2.4.7
python: 3.6.8
spark = SparkSession.builder.getOrCreate()

ds = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafka_servers)
      .option("subscribe", topic_name)
      .load())

# data preprocessing
ds = ...

model = GBTClassificationModel.load(model_path)
ds = model.transform(ds)

query = (ds.writeStream
         .outputMode("update")
         .format("kafka")
         .option("kafka.bootstrap.servers", kafka_servers)
         .option("checkpointLocation", checkpoint_dir)
         .option("topic", output_topic)
         .trigger(processingTime="0 seconds")
         .start())

query.awaitTermination()
The spark web UI displays the following indicators:
AddBatch: 1955.0
GetBatch: 1.0
GetOffset: 0.0
QueryPlanning: 3555.0
TriggerExecution: 2.0
WalCommit: 5569.0
undefined: 20.0
The following is the description of these indicators from the official Spark website:
Operation Duration. The amount of time taken to perform various operations in milliseconds. The tracked operations are listed as follows.
addBatch: Time taken to read the micro-batch’s input data from the sources, process it, and write the batch’s output to the sink. This should take the bulk of the micro-batch’s time.
getBatch: Time taken to prepare the logical query to read the input of the current micro-batch from the sources.
latestOffset & getOffset: Time taken to query the maximum available offset for this source.
queryPlanning: Time taken to generate the execution plan.
walCommit: Time taken to write the offsets to the metadata log.
Why are WalCommit and QueryPlanning much larger than AddBatch?
Thanks!
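For reference, the same per-trigger durations shown in the web UI are also exposed programmatically through the streaming query's progress. A minimal sketch, assuming the query object started in the code above:

# Inspect the operation durations of the most recent trigger.
progress = query.lastProgress  # dict of the last StreamingQueryProgress, or None before the first trigger
if progress is not None:
    # Keys correspond to the UI metrics (addBatch, getBatch, queryPlanning, walCommit, ...).
    for op, ms in progress["durationMs"].items():
        print(op, ms)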

Writing Spark DataFrame to Kafka is ignoring the partition column and kafka.partitioner.class

I am trying to write a Spark DataFrame (batch DF) to Kafka, and I need to write the data to specific partitions.
I tried the following code:
myDF.write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaProps.getBootstrapServers)
  .option("kafka.security.protocol", "SSL")
  .option("kafka.truststore.location", kafkaProps.getTrustStoreLocation)
  .option("kafka.truststore.password", kafkaProps.getTrustStorePassword)
  .option("kafka.keystore.location", kafkaProps.getKeyStoreLocation)
  .option("kafka.keystore.password", kafkaProps.getKeyStorePassword)
  .option("kafka.partitioner.class", "util.MyCustomPartitioner")
  .option("topic", kafkaProps.getTopicName)
  .save()
And the schema of the DF I am writing is:
+---+---------+-----+
|key|partition|value|
+---+---------+-----+
+---+---------+-----+
I had to repartition "myDF" (to 1 partition) since I need to order the data based on a date column.
It is writing the data to a single Kafka partition, but not to the one in the DF's "partition" column or to the one returned by the custom partitioner (which is the same as the value in the partition column).
Thanks
Sateesh
The feature to use the column "partition" in your DataFrame is only available in version 3.x and not earlier, according to the 2.4.7 docs.
However, using the option kafka.partitioner.class will still work. Spark Structured Streaming allows you to pass plain Kafka producer configuration through the kafka. prefix, so this will also work on version 2.4.4.
The code below runs fine with Spark 3.0.1 and Confluent Community Edition 5.5.0. On Spark 2.4.4, the "partition" column does not have any impact, but my custom partitioner class applies.
case class KafkaRecord(partition: Int, value: String)

val spark = SparkSession.builder()
  .appName("test")
  .master("local[*]")
  .getOrCreate()

// create DataFrame
import spark.implicits._
val df = Seq((0, "Alice"), (1, "Bob")).toDF("partition", "value").as[KafkaRecord]

df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .save()
What you then see in the console-consumer:
# partition 0
$ kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic test --partition 0
Alice
and
# partition 1
$ kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic test --partition 1
Bob
I also get the same results when using a custom partitioner:
.option("kafka.partitioner.class", "org.test.CustomPartitioner")
where my custom partitioner is defined as:
package org.test

import java.util.{Map => JMap}
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

class CustomPartitioner extends Partitioner {
  override def partition(topic: String, key: Any, keyBytes: Array[Byte], value: Any, valueBytes: Array[Byte], cluster: Cluster): Int = {
    // Send records whose value is "Bob" to partition 0 and everything else to partition 1
    if (!valueBytes.isEmpty && valueBytes.map(_.toChar).mkString == "Bob") {
      0
    } else {
      1
    }
  }

  // Required by the Partitioner interface; nothing to configure or clean up here
  override def configure(configs: JMap[String, _]): Unit = {}
  override def close(): Unit = {}
}
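For readers working in PySpark, a rough equivalent of the batch write above might look like the following sketch (the broker, topic and partitioner class are the same placeholders as in the Scala example; the "partition" column only takes effect on Spark 3.x):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-partition-write").master("local[*]").getOrCreate()

# The Kafka sink expects a string/binary "value" column and an integer "partition" column.
df = spark.createDataFrame([(0, "Alice"), (1, "Bob")], ["partition", "value"])

(df.selectExpr("CAST(partition AS INT) AS partition", "CAST(value AS STRING) AS value")
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "test")
   # On 2.4.x, route via a custom partitioner on the classpath instead:
   # .option("kafka.partitioner.class", "org.test.CustomPartitioner")
   .save())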

StructuredStreaming with ForEachWriter creating duplicates

Hi, I'm trying to create a Neo4j sink using PySpark and Kafka, but for some reason this sink is creating duplicates in Neo4j and I'm not sure why this is happening. I expect to get only one node, but it looks like it's creating 4. If someone has an idea, please let me know.
Kafka producer code:
from kafka import KafkaProducer
import json

producer = KafkaProducer(bootstrap_servers='10.0.0.38:9092')
message = {
    'test_1': 'test_1',
    'test_2': 'test_2'
}
producer.send('test_topic', json.dumps(message).encode('utf-8'))
producer.close()
Kafka consumer code:
from kafka import KafkaConsumer
import findspark
from py2neo import Graph
import json

findspark.init()
from pyspark.sql import SparkSession

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        neo4j_uri = ''  # neo4j uri
        neo4j_auth = ('', '')  # neo4j user, password
        self.graph = Graph(neo4j_uri, auth=neo4j_auth)
        return True

    def process(self, msg):
        msg = json.loads(msg.value.decode('utf-8'))
        self.graph.run("CREATE (n: MESSAGE_RECEIVED) SET n.key = '" + str(msg).replace("'", '"') + "'")
        raise KeyError('received message: {}. finished creating node'.format(msg))

spark = SparkSession.builder.appName('test-consumer') \
    .config('spark.executor.instances', 1) \
    .getOrCreate()

ds1 = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '10.0.0.38:9092') \
    .option('subscribe', 'test_topic') \
    .load()

query = ds1.writeStream.foreach(ForeachWriter()).start()
query.awaitTermination()
(Screenshot: the neo4j graph after running the code.)
After doing some searching, I found this snippet of text in Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming, chapter 11, p. 151, right after it describes open, process, and close for ForeachWriter:
This contract is part of the data delivery semantics because it allows us to remove duplicated partitions that might already have been sent to the sink but are reprocessed by Structured Streaming as part of a recovery scenario. For that mechanism to properly work, the sink must implement some persistent way to remember the partition/version combinations that it has already seen.
On another note, from the Spark website: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (see the section on Foreach):
Note: Spark does not guarantee same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). e.g. source provides different number of partitions for some reasons, Spark optimization changes number of partitions, etc. See SPARK-28650 for more details. If you need deduplication on output, try out foreachBatch instead.
It seems that if I use ForeachWriter I need to implement a uniqueness check myself, because Structured Streaming automatically reprocesses partitions in case of a failure; otherwise I have to switch to foreachBatch instead.
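A rough sketch of the foreachBatch route, reusing ds1 from the question but replacing the Cypher CREATE with an idempotent MERGE keyed on the message value, so a reprocessed batch does not create duplicate nodes (the URI and credentials are placeholders):

from py2neo import Graph

def write_batch_to_neo4j(batch_df, batch_id):
    # Runs on the driver once per micro-batch.
    graph = Graph('bolt://localhost:7687', auth=('neo4j', 'password'))  # placeholders
    for row in batch_df.selectExpr("CAST(value AS STRING) AS value").collect():
        # MERGE is idempotent: replaying the same message does not create a second node.
        graph.run("MERGE (n:MESSAGE_RECEIVED {key: $key})", key=row.value)

query = ds1.writeStream.foreachBatch(write_batch_to_neo4j).start()
query.awaitTermination()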

Spark dataframe lose streaming capability after accessing Kafka source

I use Spark 2.4.3 and Kafka 2.3.0. I want to do Spark Structured Streaming with data coming from Kafka into Spark. In general it works in test mode, but since I have to do some processing of the data (and do not know another way to do it), the Spark data frames no longer have the streaming capability.
#!/usr/bin/env python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
# create schema for data
schema = StructType([StructField("Signal", StringType()),StructField("Value", DoubleType())])
# create spark session
spark = SparkSession.builder.appName("streamer").getOrCreate()
# create DataFrame representing the stream
dsraw = spark.readStream \
    .format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()
print("dsraw.isStreaming: ", dsraw.isStreaming)
# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")
print("ds.isStreaming: ", ds.isStreaming)
# Do query on the converted data
dsQuery = ds.writeStream.queryName("ds_query").format("memory").start()
df1 = spark.sql("select * from ds_query")
print("df1.isStreaming: ", df1.isStreaming)
# convert json into spark dataframe cols
df2 = df1.withColumn("value", from_json("value", schema))
print("df2.isStreaming: ", df2.isStreaming)
The output is:
dsraw.isStreaming: True
ds.isStreaming: True
df1.isStreaming: False
df2.isStreaming: False
So I lose the streaming capability when I create the first dataframe. How can I avoid it? How do I get a streaming Spark dataframe out of a stream?
It is not recommended to use the memory sink for production applications, as all the data will be stored on the driver.
There is also no reason to do this except for debugging purposes, as you can process your streaming dataframes like 'normal' dataframes. For example:
import pyspark.sql.functions as F
lines = spark.readStream.format("socket").option("host", "XXX.XXX.XXX.XXX").option("port", XXXXX).load()
words = lines.select(lines.value)
words = words.filter(words.value.startswith('h'))
wordCounts = words.groupBy("value").count()
wordCounts = wordCounts.withColumn('count', F.col('count') + 2)
query = wordCounts.writeStream.queryName("test").outputMode("complete").format("memory").start()
In case you still want to go with your approach: Even if df.isStreaming tells you it is not a streaming dataframe (which is correct), the underlying datasource is a stream and the dataframe will therefore grow with each processed batch.
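Applied to the Kafka source from the question, that means from_json can be run directly on the streaming dataframe before any sink is started. A minimal sketch reusing the question's schema and topic:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructField, StructType, StringType, DoubleType

schema = StructType([StructField("Signal", StringType()), StructField("Value", DoubleType())])

dsraw = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "test")
         .load())

# Parse the JSON while the dataframe is still streaming.
parsed = (dsraw.selectExpr("CAST(value AS STRING) AS value")
          .withColumn("value", from_json(col("value"), schema)))
print("parsed.isStreaming:", parsed.isStreaming)  # True

# Start a sink on the fully processed streaming dataframe.
query = parsed.writeStream.format("console").start()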
