write pyspark dataframe to kafka - Topic not present in metadata after 60000 ms - apache-spark

I use the following line to submit the PySpark application; my Spark version is 3.0.0:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark_streaming.py'
I run both PySpark and Kafka inside Docker containers. When trying to send the PySpark DataFrame to Kafka with
df_final.select(to_json(struct([col(c).alias(c) for c in df_final.columns])).alias("value")) \
    .writeStream \
    .format("kafka") \
    .outputMode("append") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "test") \
    .option("checkpointLocation", "checkpoints") \
    .start() \
    .awaitTermination()
I got the following error:
org.apache.kafka.common.errors.TimeoutException: Topic test not present in metadata after 60000 ms.
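This error usually means the topic is not visible from wherever the Spark driver and executors connect. A minimal diagnostic sketch, assuming the kafka-python client is available (not part of the original setup), that checks which topics the broker advertises and creates the topic if it is missing:

from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient, NewTopic

# Must resolve to the broker from inside the Spark container, not just from the host
bootstrap = "localhost:9092"

# List the topics the broker advertises to this client
print(KafkaConsumer(bootstrap_servers=bootstrap).topics())

# Create the topic explicitly in case auto-creation is disabled on the broker
admin = KafkaAdminClient(bootstrap_servers=bootstrap)
admin.create_topics([NewTopic(name="test", num_partitions=1, replication_factor=1)])

With Spark and Kafka in separate Docker containers, "localhost:9092" typically points at the Spark container itself; the broker's advertised listener has to match the address the Spark container actually uses (for example the Kafka service name on the Docker network).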

Related

I am facing "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with pyspark

I am using this tech stack:
Spark version: 3.3.1
Scala Version: 2.12.15
Hadoop Version: 3.3.4
Kafka Version: 3.3.1
I am trying to get data from a Kafka topic through Spark Structured Streaming, but I am facing the error mentioned above. The code I am using is:
For reading data from the Kafka topic:
result_1 = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sampleTopic1") \
    .option("startingOffsets", "latest") \
    .load()
For writing data to the console:
trans_detail_write_stream = result_1 \
    .writeStream \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start() \
    .awaitTermination()
For execution I am using the following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 streamer.py
I am facing this error: "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)"
and later in the logs it gives me this exception too:
"StreamingQueryException: Query [id = 600dfe3b-6782-4e67-b4d6-97343d02d2c0, runId = 197e4a8b-699f-4852-a2e6-1c90994d2c3f] terminated with exception: Writing job aborted"
Please suggest a fix.
Edit: Screenshot for Spark Version
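A NoSuchMethodError between Spark and the Kafka connector is almost always a version mismatch: the spark-sql-kafka-0-10 coordinate must match the Spark version and Scala build that are actually running. A small sketch, assuming a live SparkSession, to confirm both before picking the --packages coordinate (the _jvm access is a common PySpark trick, not a public API):

# Spark version of the running driver; the package version should match it exactly
print("Spark:", spark.version)
# Scala build Spark was compiled against; the artifact suffix (_2.12 / _2.13) should match it
print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionNumberString())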

Structured Streaming + Kafka: RuntimeError: Java gateway process exited before sending its port number + Failed to find data source: kafka

I am trying to use Kafka as a streamer and use Spark to process the data.
config:
python3.9
Kubuntu 21.10
echo $JAVA_HOME : /usr/lib/jvm/java-8-openjdk-amd64
echo $SPARK_HOME: /opt/spark
spark version: 3.2.0
pyspark version: pyspark-3.2.1-py2.py3
downloaded kafka version: kafka_2.13-3.1.0.tgz
kafka status:
:~$ sudo systemctl status kafka
kafka.service - Apache Kafka Server
Loaded: loaded (/etc/systemd/system/kafka.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2022-01-29 19:02:18 +0330; 4s ago
Docs: http://kafka.apache.org/documentation.html
Main PID: 5271 (java)
Tasks: 74 (limit: 19017)
Memory: 348.7M
CPU: 5.188s
my python program:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time
import os
import findspark as fs
fs.init()
spark_version = '3.2.0'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_3.1.0:{}'.format(spark_version)
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
# os.environ['PYSPARK_SUBMIT_ARGS'] = "--master local[2] pyspark-shell"
kafka_topic_name = "bdCovid19"
kafka_bootstrap_servers = 'localhost:9092'
if __name__ == "__main__":
    print("Welcome to DataMaking !!!")
    print("Stream Data Processing Application Started ...")
    print(time.strftime("%Y-%m-%d %H:%M:%S"))

    spark = SparkSession \
        .builder \
        .appName("PySpark Structured Streaming with Kafka and Message Format as JSON") \
        .master("local[*]") \
        .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    # Construct a streaming DataFrame that reads from test-topic
    orders_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
        .option("subscribe", kafka_topic_name) \
        .option("startingOffsets", "latest") \
        .load()
Running on PyCharm.
Error:
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
in this line: spark = SparkSession \
If I remove the os.environ lines from the code, that error disappears, but I get this:
raise converted from None
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
in this line: orders_df = spark \
I have read these:
Pyspark: Exception: Java gateway process exited before sending the driver its port number
Creating sparkContext on Google Colab gives: RuntimeError: Java gateway process exited before sending its port number
Spark + Python - Java gateway process exited before sending the driver its port number?
Exception: Java gateway process exited before sending the driver its port number
#743
Pyspark: Exception: Java gateway process exited before sending the driver its port number
Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka
None of them worked for me! Any suggestions?
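Two things in the posted setup stand out, and both can produce exactly these errors. The artifact coordinate spark-sql-kafka-0-10_3.1.0 does not exist (the suffix is the Scala build, not a version), and when PYSPARK_SUBMIT_ARGS is set from a plain Python process it has to end with pyspark-shell, otherwise the JVM gateway cannot start. A hedged sketch of what that line could look like, assuming Spark 3.2.0 built against Scala 2.12:

import os

# Suffix _2.12 is the Scala build of the Spark distribution; 3.2.0 matches the installed Spark;
# "pyspark-shell" must terminate the string when PySpark is launched directly from Python.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 pyspark-shell'
)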

How to load Kafka topic data into a Spark Dstream in Python

I am using Spark 3.0.0 with Python.
I have a test_topic in Kafka that I am producing to from a CSV.
The code below consumes from that topic into Spark, but I read somewhere that it needs to be in a DStream before I can do any ML on it.
import json
from json import loads
from kafka import KafkaConsumer
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "test")
ssc = StreamingContext(sc, 1)
consumer = KafkaConsumer('test_topic',
                         bootstrap_servers=['localhost:9092'],
                         api_version=(0, 10))
Consumer returns a <kafka.consumer.group.KafkaConsumer at 0x13bf55b0>
How do I edit the above code to give me a DStream?
I am fairly new so kindly point out any silly mistakes made.
EDIT:
Below is my producer code:
import json
import csv
from json import dumps
from kafka import KafkaProducer
from time import sleep
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
value_serializer=lambda x:dumps(x)
with open('test_data.csv') as file:
    reader = csv.DictReader(file, delimiter=';')
    for row in reader:
        producer.send('test_topic', json.dumps(row).encode('utf-8'))
        sleep(2)
        print('Message sent ', row)
It has been a long time since I've done any Spark, but let me help you!
First, as you are using Spark 3.0.0, you can use Spark Structured Streaming; the API will be much easier to use as it is based on DataFrames. As you can see in the linked docs, there is an integration guide for Kafka with PySpark in Structured Streaming mode.
It would be as simple as this query:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test_topic") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Then you can play with this DataFrame using ML pipelines to apply the ML techniques and models that you need (see the parsing sketch just below). As you can see in this Databricks notebook, they have some examples of Structured Streaming with ML. It is written in Scala, but it will be a good source of inspiration; you can combine it with the PySpark ML docs to translate it into Python.
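As a starting point, here is a minimal sketch of turning the raw Kafka value into typed columns that a pyspark.ml pipeline can consume; the schema and column names are hypothetical and should match whatever your producer actually writes:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema of the JSON messages produced from the CSV
schema = StructType([
    StructField("feature_a", DoubleType()),
    StructField("feature_b", DoubleType()),
    StructField("label", StringType()),
])

# Cast the Kafka value (bytes) to a string, parse the JSON, and flatten it into columns
parsed = (df
          .selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))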
EDIT: The actual STEPS to follow in order to make it work between PySpark and Kafka
1 - Kafka Setup
So first I set up my local Kafka:
wget https://archive.apache.org/dist/kafka/0.10.2.2/kafka_2.12-0.10.2.2.tgz
tar -xzf kafka_2.12-0.10.2.2.tgz
I open 4 shells to run the zookeeper / server / create_topic / write_topic scripts:
Zookeeper
cd kafka_2.12-0.10.2.2
bin/zookeeper-server-start.sh config/zookeeper.properties
Server
cd kafka_2.12-0.10.2.2
bin/kafka-server-start.sh config/server.properties
Create topic and check creation
cd kafka_2.12-0.10.2.2
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-topics.sh --list --zookeeper localhost:2181
Test messages in the topic (write them interactively in the shell for testing purposes):
cd kafka_2.12-0.10.2.2
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
2 - PySpark Setup
Getting the additional jars
Now that we have set up our Kafka, we will set up our PySpark with the following jar downloads:
spark-streaming-kafka-0-10-assembly_2.12-3.0.0.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10-assembly_2.12/3.0.0/spark-streaming-kafka-0-10-assembly_2.12-3.0.0.jar
spark-sql-kafka-0-10_2.12-3.0.0.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.0.0/spark-sql-kafka-0-10_2.12-3.0.0.jar
commons-pool2-2.8.0.jar
wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.8.0/commons-pool2-2.8.0.jar
kafka-clients-0.10.2.2.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/0.10.2.2/kafka-clients-0.10.2.2.jar
Run the PySpark shell command
Don't forget to specify the folder path for each jar if you're not in the jars folder when you execute the pyspark command.
PYSPARK_PYTHON=python3 $SPARK_HOME/bin/pyspark --jars spark-sql-kafka-0-10_2.12-3.0.0.jar,spark-streaming-kafka-0-10-assembly_2.12-3.0.0.jar,kafka-clients-0.10.2.2.jar,commons-pool2-2.8.0.jar
3 - Run the PySpark code
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()

query = df \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()
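If you run this as a script rather than in the interactive pyspark shell, one small addition (not part of the original steps) keeps the driver alive until the stream stops:

# Block the driver until the streaming query terminates; otherwise the script exits immediately
query.awaitTermination()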
Cheers
You need to use the org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 package to run it; spark-submit will download the related jars automatically.
You need to use KafkaUtils createDirectStream method.
Here is a code sample from the official Spark documentation:
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

Spark Cassandra Connector Error: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef

Spark version: 3.0.0
Scala: 2.12
Cassandra: 3.11.4
spark-cassandra-connector_2.12-3.0.0-alpha2.jar
I am not using DSE. Below is my test code to write the dataframe into my Cassandra database.
spark = SparkSession \
    .builder \
    .config("spark.jars", "spark-streaming-kafka-0-10_2.12-3.0.0.jar,spark-sql-kafka-0-10_2.12-3.0.0.jar,kafka-clients-2.5.0.jar,commons-pool2-2.8.0.jar,spark-token-provider-kafka-0-10_2.12-3.0.0.jar,spark-cassandra-connector_2.12-3.0.0-alpha2.jar") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

streamingInputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.56.1:9092") \
    .option("subscribe", "def") \
    .load()
## Dataset operations
def write_to_cassandra(streaming_df, E):
    streaming_df \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="a", keyspace="abc") \
        .save()

q1 = sites_flat.writeStream \
    .outputMode('update') \
    .foreachBatch(write_to_cassandra) \
    .start()

q1.awaitTermination()
I am able to do some operations on the DataFrame and print it to the console, but I am not able to save to, or even read from, my Cassandra database. The error I am getting is:
File "C:\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
I have tried with another Cassandra connector version (2.5) but am getting the same error.
Please help!
The problem is that you're using the spark.jars option, which puts only the listed jars onto the classpath. But the TableRef case class is in the spark-cassandra-connector-driver package, which is a dependency of spark-cassandra-connector. To fix this problem, it's better to start pyspark or spark-submit with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 (same for Kafka support) - in this case Spark will fetch all the necessary dependencies and put them onto the classpath.
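The same idea can also be expressed inside the SparkSession builder via the spark.jars.packages config instead of --packages on the command line; a minimal sketch of the question's builder rewritten that way (an alternative to the command-line flag, not something the answer above prescribes):

# Let Spark resolve the connector and all of its transitive dependencies
# (spark-cassandra-connector-driver, etc.) from Maven instead of listing jars by hand
spark = SparkSession \
    .builder \
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()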
P.S. With the alpha2 release you may get problems fetching some dependencies, like ffi, groovy, etc. - this is a known bug (mostly in Spark): SPARKC-599, which is already fixed, and we'll hopefully get a beta drop very soon.
Update (14.03.2021): It's better to use the assembly version of SCC, which includes all necessary dependencies.
P.P.S. For writing to Cassandra from Spark Structured Streaming, don't use foreachBatch, just use it as a normal data sink:
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
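Since the question itself is PySpark, here is a sketch of the same sink in Python; the keyspace, table, and checkpoint path are carried over from the Scala snippet above and are illustrative, not the asker's actual values:

# Same Cassandra streaming sink, expressed with the PySpark writeStream API
query = streamingCountsDF.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint") \
    .option("keyspace", "test") \
    .option("table", "sttest_tweets") \
    .start()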
I ran into the same problem; try this:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.3</version>
</dependency>
Version compatibility is presumed to be the cause.

Spark 3.x Integration with Kafka in Python

Kafka with spark-streaming throws an error:
from pyspark.streaming.kafka import KafkaUtils
ImportError: No module named kafka
I have already set up a Kafka broker and a working Spark environment with one master and one worker.
import os
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python2.7'
import findspark
findspark.init('/usr/spark/spark-3.0.0-preview2-bin-hadoop2.7')
import pyspark
import sys
from pyspark import SparkConf,SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    sc = SparkContext(appName="SparkStreamAISfromKAFKA")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 1)
    kvs = KafkaUtils.createStream(ssc, "my-kafka-broker", "raw-event-streaming-consumer", {'enriched_ais_messages': 1})
    lines = kvs.map(lambda x: x[1])
    lines.count().map(lambda x: 'Messages AIS: %s' % x).pprint()
    ssc.start()
    ssc.awaitTermination()
I assume from the error that something is missing related to Kafka, specifically with the versions. Can anyone help with this?
Spark version: 3.0.0-preview2
I execute with:
/usr/spark/spark-3.0.0-preview2-bin-hadoop2.7/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 --jars spark-streaming-kafka-0-10_2.11 spark_streamer.py spark://mysparkip:7077
According to the Spark Streaming + Kafka Integration Guide:
"Kafka 0.8 support is deprecated as of Spark 2.3.0."
In addition, the screenshot below shows that Python is not supported for Kafka 0.10 (and higher).
In your case you will have to use Spark 2.4 in order to get your code running.
PySpark supports Structured Streaming
If you plan to use the latest version of Spark (e.g. 3.x) and still want to integrate Spark with Kafka in Python, you can use Structured Streaming. You will find detailed instructions on how to use the Python API in the Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher):
Reading Data from Kafka
# Subscribe to 1 topic
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Writing Data to Kafka
# Write key-value data from a DataFrame to a specific Kafka topic specified in an option
ds = df \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("topic", "topic1") \
    .start()
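To actually run either snippet with Spark 3.x you still have to ship the Kafka connector, as the guide's deployment section describes; a minimal example command (the script name is hypothetical, and the package version should match your Spark build):

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 your_structured_streaming_app.py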
