I am facing "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with pyspark - apache-spark

I am using this tech stack:
Spark version: 3.3.1
Scala Version: 2.12.15
Hadoop Version: 3.3.4
Kafka Version: 3.3.1
I am trying to get data from a Kafka topic through Spark Structured Streaming, but I am facing the error mentioned in the title. The code I am using is:
For reading data from kafka topic
result_1 = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sampleTopic1") \
    .option("startingOffsets", "latest") \
    .load()
For writing data to the console:
trans_detail_write_stream = result_1 \
    .writeStream \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start() \
    .awaitTermination()
For execution I am using the following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 streamer.py
I am facing this error "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)"
and later in the logs it gives me this exception too:
"StreamingQueryException: Query [id = 600dfe3b-6782-4e67-b4d6-97343d02d2c0, runId = 197e4a8b-699f-4852-a2e6-1c90994d2c3f] terminated with exception: Writing job aborted"
Please suggest
Edit: Screenshot for Spark Version
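A java.lang.NoSuchMethodError almost always means a binary mismatch on the classpath, e.g. the spark-sql-kafka package built against a different Spark version than the one the job actually runs on. As a sanity check (this snippet is not from the original post), you can print the runtime version from inside the job and compare it with the version in the --packages coordinate:

# Compare this with the 3.3.1 in org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1
print("Runtime Spark version:", spark.version)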

Related

write pyspark dataframe to kafka - Topic not present in metadata after 60000 ms

I use the following line to submit the PySpark application; my Spark version is 3.0.0:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark_streaming.py'
I run both PySpark and Kafka inside Docker containers. When trying to send the PySpark dataframe to Kafka with
df_final.select(to_json(struct([col(c).alias(c) for c in df_final.columns])).alias("value")) \
    .writeStream.format("kafka").outputMode("append") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "test") \
    .option("checkpointLocation", "checkpoints") \
    .start().awaitTermination()
I got the following error:
org.apache.kafka.common.errors.TimeoutException: Topic test not present in metadata after 60000 ms.
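No answer is included here, but since both containers are involved, one common cause worth ruling out (an assumption, not a confirmed fix for this case): inside the PySpark container, localhost:9092 refers to the PySpark container itself, not to the Kafka broker, so the broker is unreachable and the topic lookup times out. A hedged sketch, where "kafka:9092" is a hypothetical address:

from pyspark.sql.functions import to_json, struct, col

# "kafka:9092" is a hypothetical broker address: the Kafka container's service/host
# name on the shared Docker network, which must also match the broker's advertised listener.
df_final.select(to_json(struct([col(c).alias(c) for c in df_final.columns])).alias("value")) \
    .writeStream.format("kafka").outputMode("append") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("topic", "test") \
    .option("checkpointLocation", "checkpoints") \
    .start().awaitTermination()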

How to load Kafka topic data into a Spark Dstream in Python

I am using Spark 3.0.0 with Python.
I have a test_topic in Kafka that I am producing to from a CSV file.
The code below is consuming from that topic into Spark but I read somewhere that it needs to be in a DStream before I can do any ML on it.
import json
from json import loads
from kafka import KafkaConsumer
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "test")
ssc = StreamingContext(sc, 1)

consumer = KafkaConsumer('test_topic',
                         bootstrap_servers=['localhost:9092'],
                         api_version=(0, 10))
The consumer returns a <kafka.consumer.group.KafkaConsumer at 0x13bf55b0>.
How do I edit the above code to give me a DStream?
I am fairly new so kindly point out any silly mistakes made.
EDIT:
Below is my producer code:
import json
import csv
from time import sleep
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

with open('test_data.csv') as file:
    reader = csv.DictReader(file, delimiter=';')
    for row in reader:
        # encode each CSV row as UTF-8 JSON ('utf-8', not 'utf=8')
        producer.send('test_topic', json.dumps(row).encode('utf-8'))
        sleep(2)
        print('Message sent', row)
It has been a long time since I've done any Spark, but let me help you!
First, as you are using Spark 3.0.0, you can use Spark Structured Streaming; the API will be much easier to use as it is based on DataFrames. As you can see in the linked docs, there is an integration guide for Kafka with PySpark in Structured Streaming mode.
It would be as simple as this query:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test_topic") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Then you can work with this DataFrame and apply whatever ML techniques and models you need. As you can see in this Databricks notebook, they have some examples of Structured Streaming with ML. It is written in Scala, but it is a good source of inspiration; you can combine it with the PySpark ML docs to translate it into Python.
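To make that concrete, here is a minimal sketch (not from the original answer) of turning the Kafka value column into something an ML stage can consume; the JSON schema and the feature column names are hypothetical and should be adapted to whatever the producer actually sends:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.feature import VectorAssembler

# Hypothetical schema for the JSON payload written by the producer.
schema = StructType([
    StructField("feature_a", DoubleType()),
    StructField("feature_b", DoubleType()),
])

# Parse the Kafka value (bytes) into typed columns.
parsed = df.selectExpr("CAST(value AS STRING) AS json") \
    .select(from_json(col("json"), schema).alias("data")) \
    .select("data.*")

# Transformers such as VectorAssembler work on streaming DataFrames;
# estimators that need .fit() must be trained on a static DataFrame first.
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
features = assembler.transform(parsed)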
EDIT: The actual STEPS to follow in order to make it work between PySpark and Kafka
1 - Kafka Setup
So first I set up my local Kafka:
wget https://archive.apache.org/dist/kafka/0.10.2.2/kafka_2.12-0.10.2.2.tgz
tar -xzf kafka_2.12-0.10.2.2.tgz
I open 4 shells to run the ZooKeeper / server / create-topic / write-topic steps:
Zookeeper
cd kafka_2.12-0.10.2.2
bin/zookeeper-server-start.sh config/zookeeper.properties
Server
cd kafka_2.12-0.10.2.2
bin/kafka-server-start.sh config/server.properties
Create topic and check creation
cd kafka_2.12-0.10.2.2
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-topics.sh --list --zookeeper localhost:2181
Test messages in the topic (write them interactively in the shell for testing purposes):
cd kafka_2.12-0.10.2.2
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
2 - PySpark Setup
Getting the additional jars
Now that we have set up Kafka, we will set up PySpark by downloading some specific jars:
spark-streaming-kafka-0-10-assembly_2.12-3.0.0.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10-assembly_2.12/3.0.0/spark-streaming-kafka-0-10-assembly_2.12-3.0.0.jar
spark-sql-kafka-0-10_2.12-3.0.0.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.0.0/spark-sql-kafka-0-10_2.12-3.0.0.jar
commons-pool2-2.8.0.jar
wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.8.0/commons-pool2-2.8.0.jar
kafka-clients-0.10.2.2.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/0.10.2.2/kafka-clients-0.10.2.2.jar
Run the PySpark shell command
Don't forget to specify the folder path for each jar if you're not in the jars folder when you execute the pyspark command.
PYSPARK_PYTHON=python3 $SPARK_HOME/bin/pyspark --jars spark-sql-kafka-0-10_2.12-3.0.0.jar,spark-streaming-kafka-0-10-assembly_2.12-3.0.0.jar,kafka-clients-0.10.2.2.jar,commons-pool2-2.8.0.jar
3 - Run the PySpark code
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()

query = df \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()
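One small addition (not in the original answer): in the interactive pyspark shell the console sink keeps printing on its own, but if you later move this into a script run with spark-submit you will probably want to block on the query so the driver doesn't exit immediately:

# Block until the streaming query is stopped.
query.awaitTermination()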
Cheers
You need to use the org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 package to run it; spark-submit will download the related jars for you.
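For example (the script name here is hypothetical, mirroring the submit command from the first question):
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 your_streaming_app.py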
You need to use the KafkaUtils createDirectStream method.
Here is a code sample from the official Spark documentation:
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

Error loading spark sql context for redshift jdbc url in glue

Hello, I am trying to fetch month-wise data from a bunch of heavy Redshift tables in a Glue job.
As far as I know, the Glue documentation on this is very limited.
The query works fine in SQL Workbench, which I have connected using the same JDBC connection ('myjdbc_url') being used in Glue.
Below is what I have tried, and I am seeing this error:
from pyspark.context import SparkContext
from pyspark.sql import SQLContext  # needed for SQLContext below

sc = SparkContext()
sql_context = SQLContext(sc)

df1 = sql_context.read \
    .format("jdbc") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
print("Total recs for month :"+str(mnthval)+" df1 -> "+str(df1.count()))
However, it shows me a driver error in the logs, as below:
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I have used the following too, but to no avail; it ends up in a "Connection refused" error.
sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
What is the correct driver to use? I am using Glue, which is a managed service with a transient cluster in the background, so I am not sure what I am missing. Please help: what is the right driver?
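No answer is included here, but for reference, Spark's JDBC source accepts an explicit driver option naming the JDBC driver class. A hedged sketch: the class name below is the Amazon Redshift JDBC 4.2 driver and assumes that driver jar is actually attached to the Glue job, so treat it as an assumption rather than a confirmed fix:

# "driver" names the JDBC driver class to load; the Redshift class here is an
# assumption about which driver jar is available to the Glue job.
df1 = sql_context.read \
    .format("jdbc") \
    .option("url", myjdbc_url) \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("query", mnth_query) \
    .load()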

Spark Cassandra Connector Error: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef

Spark version: 3.0.0
Scala: 2.12
Cassandra: 3.11.4
spark-cassandra-connector_2.12-3.0.0-alpha2.jar
I am not using DSE. Below is my test code to write the dataframe into my Cassandra database.
spark = SparkSession \
    .builder \
    .config("spark.jars", "spark-streaming-kafka-0-10_2.12-3.0.0.jar,spark-sql-kafka-0-10_2.12-3.0.0.jar,kafka-clients-2.5.0.jar,commons-pool2-2.8.0.jar,spark-token-provider-kafka-0-10_2.12-3.0.0.jar,spark-cassandra-connector_2.12-3.0.0-alpha2.jar") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

streamingInputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.56.1:9092") \
    .option("subscribe", "def") \
    .load()

## Dataset operations (omitted) that produce sites_flat

def write_to_cassandra(streaming_df, E):  # E is the micro-batch id passed by foreachBatch
    streaming_df \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="a", keyspace="abc") \
        .save()

q1 = sites_flat.writeStream \
    .outputMode('update') \
    .foreachBatch(write_to_cassandra) \
    .start()

q1.awaitTermination()
I am able to do some operations on the dataframe and print it to the console, but I am not able to save to, or even read from, my Cassandra database. The error I am getting is:
File "C:\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
I have tried with another Cassandra connector version (2.5) but I am getting the same error.
Please help!!!
The problem is that you're using the spark.jars option, which only puts the listed jars onto the classpath. But the TableRef case class is in the spark-cassandra-connector-driver package, which is a dependency of spark-cassandra-connector. To fix this, it's better to start pyspark or spark-submit with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 (and the same for Kafka support); in this case Spark will fetch all necessary dependencies and put them onto the classpath.
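As a sketch (not part of the original answer) of what that looks like when the session is created from a plain Python process rather than via spark-submit flags, using the package coordinates mentioned in the question and answer:

from pyspark.sql import SparkSession

# Let Spark resolve the connector plus its transitive dependencies (including
# spark-cassandra-connector-driver, which contains TableRef), instead of
# listing individual jars in spark.jars.
spark = SparkSession.builder \
    .appName("StructuredNetworkWordCount") \
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()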
P.S. With the alpha2 release you may get problems fetching some dependencies, like ffi, groovy, etc.; this is a known bug (mostly in Spark), SPARKC-599, which is already fixed, and we'll hopefully get a beta drop very soon.
Update (14.03.2021): It's better to use the assembly version of SCC, which includes all necessary dependencies.
P.P.S. For writing to Cassandra from Spark Structured Streaming, don't use foreachBatch, just use it as a normal data sink:
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
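Since the question is in PySpark, here is a rough Python equivalent of the Scala snippet above, using the question's sites_flat DataFrame and its keyspace/table (the checkpoint path is just the example path from the Scala version):

# Same Cassandra sink, written directly from the streaming DataFrame (no foreachBatch).
query = sites_flat.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint") \
    .option("keyspace", "abc") \
    .option("table", "a") \
    .start()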
I ran into the same problem; try this:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
Version compatibility is presumed to be the cause.

Spark 3.x Integration with Kafka in Python

Kafka with spark-streaming throws an error:
from pyspark.streaming.kafka import KafkaUtils
ImportError: No module named kafka
I have already set up a Kafka broker and a working Spark environment with one master and one worker.
import os
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python2.7'
import findspark
findspark.init('/usr/spark/spark-3.0.0-preview2-bin-hadoop2.7')
import pyspark
import sys
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="SparkStreamAISfromKAFKA")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 1)
    kvs = KafkaUtils.createStream(ssc, "my-kafka-broker", "raw-event-streaming-consumer", {'enriched_ais_messages': 1})
    lines = kvs.map(lambda x: x[1])
    lines.count().map(lambda x: 'Messages AIS: %s' % x).pprint()
    ssc.start()
    ssc.awaitTermination()
I assume from the error that something is missing related to Kafka, specifically the versions. Can anyone help with this?
Spark version: 3.0.0-preview2
I execute with:
/usr/spark/spark-3.0.0-preview2-bin-hadoop2.7/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 --jars spark-streaming-kafka-0-10_2.11 spark_streamer.py spark://mysparkip:7077
According to the Spark Streaming + Kafka Integration Guide:
"Kafka 0.8 support is deprecated as of Spark 2.3.0."
In addition, the integration guide shows that Python is not supported for Kafka 0.10 (and higher) with the DStream API.
In your case you will have to use Spark 2.4 in order to get your code running.
PySpark supports Structured Streaming
If you plan to use the latest version of Spark (e.g. 3.x) and still want to integrate Spark with Kafka in Python you can use Structured Streaming. You will find detailed instructions on how to use the Python API in the Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher):
Reading Data from Kafka
# Subscribe to 1 topic
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Writing Data to Kafka
# Write key-value data from a DataFrame to a specific Kafka topic specified in an option
ds = df \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("topic", "topic1") \
    .start()
