Spark Cassandra Connector Error: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef - apache-spark

Spark version: 3.0.0
Scala: 2.12
Cassandra: 3.11.4
spark-cassandra-connector_2.12-3.0.0-alpha2.jar
I am not using DSE. Below is my test code to write the dataframe into my Cassandra database.
spark = SparkSession \
    .builder \
    .config("spark.jars", "spark-streaming-kafka-0-10_2.12-3.0.0.jar,spark-sql-kafka-0-10_2.12-3.0.0.jar,kafka-clients-2.5.0.jar,commons-pool2-2.8.0.jar,spark-token-provider-kafka-0-10_2.12-3.0.0.jar,spark-cassandra-connector_2.12-3.0.0-alpha2.jar") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

streamingInputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.56.1:9092") \
    .option("subscribe", "def") \
    .load()

# Dataset operations

def write_to_cassandra(streaming_df, E):
    streaming_df \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="a", keyspace="abc") \
        .save()

q1 = sites_flat.writeStream \
    .outputMode('update') \
    .foreachBatch(write_to_cassandra) \
    .start()

q1.awaitTermination()
I am able to do some operations on the dataframe and print it to the console, but I am not able to save to or even read from my Cassandra database. The error I am getting is:
File "C:\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
I have tried another Cassandra connector version (2.5) but get the same error.
Please help!!!

The problem is that you're using the spark.jars option, which puts only the listed jars onto the classpath. But the TableRef case class is in the spark-cassandra-connector-driver package, which is a dependency of spark-cassandra-connector. To fix this, it's better to start pyspark or spark-submit with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 (and do the same for Kafka support) - in that case Spark will fetch all necessary dependencies and put them onto the classpath.
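For example, a launch command could look like this (a sketch; the Kafka package coordinate matches the Spark 3.0.0 / Scala 2.12 versions listed in the question, and the script name is a placeholder):
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 \
  your_streaming_job.py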
P.S. With the alpha2 release you may get problems fetching some dependencies, such as ffi, groovy, etc. - this is a known bug (mostly in Spark), SPARKC-599, that is already fixed, and we'll hopefully get a beta drop very soon.
Update (14.03.2021): It's better to use the assembly version of SCC, which includes all necessary dependencies.
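The assembly is published under the spark-cassandra-connector-assembly artifact; for example (an assumption, using the released 3.0.0 coordinates rather than the alpha above, with a placeholder script name):
spark-submit --packages com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0 your_streaming_job.py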
P.P.S. For writing to Cassandra from Spark Structured Streaming, don't use foreachBatch; just use it as a normal data sink:
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
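In PySpark, the equivalent for the stream in the question would look roughly like this (a sketch; sites_flat, the keyspace and the table come from the question, and the checkpoint path is a placeholder):
# sketch mirroring the Scala snippet above; /tmp/checkpoint is a placeholder checkpoint path
query = sites_flat.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .option("keyspace", "abc") \
    .option("table", "a") \
    .start()
query.awaitTermination()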

I ran into the same problem; try this:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
Version compatibility is presumed to be the cause.

Related

I am facing "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with pyspark

I am using this tech stack:
Spark version: 3.3.1
Scala Version: 2.12.15
Hadoop Version: 3.3.4
Kafka Version: 3.3.1
I am trying to get data from a Kafka topic through Spark Structured Streaming, but I am facing the mentioned error. The code I am using is:
For reading data from the Kafka topic:
result_1 = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sampleTopic1") \
    .option("startingOffsets", "latest") \
    .load()
For writing data to the console:
trans_detail_write_stream = result_1 \
    .writeStream \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start() \
    .awaitTermination()
For execution I am using the following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 streamer.py
I am facing this error: "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)"
and later in the logs it gives me this exception too:
"StreamingQueryException: Query [id = 600dfe3b-6782-4e67-b4d6-97343d02d2c0, runId = 197e4a8b-699f-4852-a2e6-1c90994d2c3f] terminated with exception: Writing job aborted"
Please suggest a fix.
Edit: Screenshot for Spark Version

Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found when trying to write data on S3 bucket from Spark

I am trying to write data on an S3 bucket from my local computer:
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", configuration.AWS_ACCESS_KEY_ID) \
    .config("spark.hadoop.fs.s3a.secret.key", configuration.AWS_ACCESS_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

lines = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', kafka_server) \
    .option('subscribe', kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()

streaming_query = lines.writeStream \
    .format('parquet') \
    .outputMode('append') \
    .option('path', configuration.S3_PATH) \
    .start()

streaming_query.awaitTermination()
Hadoop version: 3.2.1,
Spark version 3.2.1
I have added the dependency jars to pyspark jars:
spark-sql-kafka-0-10_2.12:3.2.1,
aws-java-sdk-s3:1.11.375,
hadoop-aws:3.2.1,
I get the following error when executing:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.start.
: java.io.IOException: From option fs.s3a.aws.credentials.provider
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
In my case, it worked in the end after adding the following statement:
.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
Also, all the Hadoop jars in site-packages/pyspark/jars must be the same version: hadoop-aws-3.2.2, hadoop-client-api-3.2.2, hadoop-client-runtime-3.2.2, hadoop-yarn-server-web-proxy-3.2.2.
For version 3.2.2 of hadoop-aws, the aws-java-sdk-s3:1.11.563 package is needed.
I also replaced guava-14.0.jar with guava-23.0.jar.
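Putting that together with the builder from the question, a minimal sketch (assuming static access/secret keys, as in the question; the placeholder values need to be filled in):
from pyspark.sql import SparkSession

# sketch only: replace the placeholder credentials with real values or your own config module
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>") \
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_ACCESS_SECRET_KEY>") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .getOrCreate()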
I used the same packages as you. In my case, when I added the line below:
config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
I got this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.parquet.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
....
To solve this I installed guava-30.0.
Try downloading these jar libraries and putting them into spark/jars:
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar
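For example (a sketch, assuming SPARK_HOME points at your Spark installation):
# download the two jars straight into Spark's jars directory
wget -P "$SPARK_HOME/jars" https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar
wget -P "$SPARK_HOME/jars" https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar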

Pyspark is unable to find bigquery datasource

This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.
spark = SparkSession \
    .builder \
    .appName(appName) \
    .config(conf=spark_conf) \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0') \
    .config('spark.jars.packages', 'com.google.cloud.bigdataoss:gcsio:1.5.4') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar,spark-bigquery-with-dependencies_2.12-0.21.1.jar,spark-bigquery-latest_2.11.jar') \
    .config('spark.jars', 'postgresql-42.2.23.jar,bigquery-connector-hadoop2-latest.jar') \
    .getOrCreate()
Then when I try to write a demo Spark dataframe to BigQuery:
df.write.format('bigquery') \
    .mode(mode) \
    .option("credentialsFile", "creds.json") \
    .option('table', table) \
    .option("temporaryGcsBucket", bucket) \
    .save()
It throws an error:
File "c:\sparktest\vnenv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
My problem was with faulty jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these are the Maven packages and code that worked for me.
# note: the jars listed in spark.jars below have to be downloaded manually
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-1.9.5,com.google.guava:guava:r05') \
    .config('spark.jars', 'guava-11.0.1.jar,gcsio-1.9.0-javadoc.jar') \
    .getOrCreate()
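With those versions in place, a read would then look roughly like this (a sketch; the table id and the credentials file path are placeholders, reusing the credentialsFile option shown in the question):
# placeholders: creds.json and my_dataset.my_table need to be replaced with real values
df = spark.read.format('bigquery') \
    .option('credentialsFile', 'creds.json') \
    .option('table', 'my_dataset.my_table') \
    .load()
df.show()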

Error loading spark sql context for redshift jdbc url in glue

Hello, I am trying to fetch month-wise data from a bunch of heavy Redshift tables in a Glue job.
As far as I know, the Glue documentation on this is very limited.
The query works fine in SQL Workbench, which I have connected using the same JDBC connection being used in Glue ('myjdbc_url').
Below is what I have tried, and the error I am seeing -
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

df1 = sql_context.read \
    .format("jdbc") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
print("Total recs for month :"+str(mnthval)+" df1 -> "+str(df1.count()))
However, it shows me a driver error in the logs, as below -
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I have used the following too, but to no avail; it ends up in a 'Connection refused' error.
sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
What is the correct driver to use? I am using Glue, which is a managed service with a transient cluster in the background, so I am not sure what I am missing.
Please help - what is the right driver?

Why does spark-submit ignore the package that I include as part of the configuration of my spark session?

I am trying to include the org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 package as part of my Spark code (via the SparkSession builder). I understand that I can download the JAR myself and include it, but I would like to figure out why the following is not working as expected:
from pyspark.sql import SparkSession
import pyspark
import json

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local") \
        .appName("App Name") \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3") \
        .getOrCreate()

    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "first_topic") \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    query = df \
        .writeStream \
        .format("console") \
        .outputMode("update") \
        .start()
When I run the job:
spark-submit main.py
I receive the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.load.
: org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If I instead include the packages via the --packages flag, the dependencies are downloaded and the code runs as expected:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 main.py
The code also works if I open the PySpark shell and paste the code above. Is there a reason that spark-submit ignores the configuration?
I think that configurations like "spark.jars.packages" should be set either in spark-defaults or passed as command-line arguments; setting them at runtime shouldn't work.
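For reference, the spark-defaults equivalent is a single line (assuming the standard $SPARK_HOME/conf/spark-defaults.conf location):
spark.jars.packages  org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3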
Against my better judgement: I remember some people claimed something like this worked for them, but I would say that the dependency was already there (installed in the local repo) and merely got loaded.
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3")

spark = SparkSession.builder \
    .master("local") \
    .appName("App Name") \
    .config(conf=conf) \
    .getOrCreate()
When you run spark-submit, it already creates a SparkSession that is reused by your code - thus you have to provide everything through spark-submit.
However, you do not need to actually use spark-submit to run your Spark code. Assuming your main method looks like this:
def main():
    spark = SparkSession.builder.config(...).getOrCreate()
    # your spark code below
    ...
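(To run it directly, you would also add the usual entry-point guard - an addition not shown in the original answer:)
if __name__ == "__main__":
    main()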
You can run this code just via python:
> python ./my_spark_script.py
This will run your program correctly
I faced the same problem; after googling I found this link:
https://issues.apache.org/jira/browse/SPARK-21752
According to Sean R. Owen (srowen): "At that point, your app has already launched. You can't change the driver classpath."
