Why does Structured Streaming fail with "java.lang.IncompatibleClassChangeError: Implementing class"? - apache-spark

I'd like to run a Spark application using Structured Streaming with PySpark.
I use Spark 2.2 with Kafka 0.10.
It fails with the following error:
java.lang.IncompatibleClassChangeError: Implementing class
The spark-submit command I use is:
/bin/spark-submit \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 \
--master local[*] \
/home/umar/structured_streaming.py localhost:2181 fortesting
structured_streaming.py code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming").config("spark.driver.memory", "2g").config("spark.executor.memory", "2g").getOrCreate()
raw_DF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:2181").option("subscribe", "fortesting").load()
values = raw_DF.selectExpr("CAST(value AS STRING)")
values.writeStream.trigger(processingTime="5 seconds").outputMode("append").format("console").start().awaitTermination()

You need spark-sql-kafka for structured streaming:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Also make sure that you use the same versions of Scala (2.11 above) and Spark (2.2.0) as you use on your cluster.
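For reference, the submit command from the question would then look something like this (untested sketch; note also that kafka.bootstrap.servers in the script should point at a Kafka broker, typically localhost:9092, rather than the ZooKeeper port 2181):
/bin/spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
--master local[*] \
/home/umar/structured_streaming.py localhost:2181 fortesting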

Please refer to this.
You're using spark-streaming-kafka-0-10, which does not currently support Python.

Related

neo4j spark connector doesn't work correctly

I want to integrate Spark GraphX with Neo4j using [1].
I tried to follow the steps in [2], but it doesn't work.
What exactly should I do with the neo4j-connector-apache-spark_2.12-4.0.0.jar file? I put it in the jars folder of the Spark installation.
In the shell I run:
C:\Spark\spark-3.1.1-bin-hadoop2.7\bin\spark-shell --jars neo4j-connector-apache-spark_2.12-4.0.0.jar
Any suggestions please?
Update no. 1
I tried this: C:\Spark\spark-3.1.1-bin-hadoop2.7\bin\spark-shell --packages neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.0
I think it works, but when I want to write the DataFrame as nodes of type Person in spark-shell:
import org.apache.spark.sql.{SaveMode, SparkSession}
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = Seq(
("John Doe"),
("Jane Doe")
).toDF("name")
df.write.format("org.neo4j.spark.DataSource")
.mode(SaveMode.ErrorIfExists)
.option("url", "bolt://localhost:7687")
.option("authentication.basic.username", "neo4j")
.option("authentication.basic.password", "neo4j")
.option("labels", ":Person")
.save()
It raises errors. What should I do?
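For comparison, here is the same write expressed in PySpark (a minimal sketch; it assumes the 4.0.x connector is on the classpath via --packages and that Neo4j is reachable at bolt://localhost:7687 with the neo4j/neo4j credentials used above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("John Doe",), ("Jane Doe",)], ["name"])

(df.write
 .format("org.neo4j.spark.DataSource")
 .mode("ErrorIfExists")  # mirrors SaveMode.ErrorIfExists in the Scala snippet
 .option("url", "bolt://localhost:7687")
 .option("authentication.basic.username", "neo4j")
 .option("authentication.basic.password", "neo4j")
 .option("labels", ":Person")
 .save())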
Update no. 2
I followed the steps in [3] and it gives an error when entering this:
val neo = Neo4j(sc)
as follows:
error: not found: value Neo4j
Use:
$SPARK_HOME\bin\spark-shell --conf spark.neo4j.password=<password> --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
instead of:
$SPARK_HOME\bin\spark-shell --conf spark.neo4j.bolt.password=<password> --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
Just remove the word bolt.
Update no. 3
Now I want to use the following package:
$SPARK_HOME/bin/spark-shell --packages neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3
As mentioned in [1].
The only one that works is the following (the old version):
$SPARK_HOME/bin/spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
But using it, Neo4jGraph.saveGraph does not work. The error is: Writing in read access mode not allowed.
Thanks for your help.

spark 3.x on HDP 3.1 in headless mode with hive - hive tables not found

How can I configure Spark 3.x on HDP 3.1, using the headless (https://spark.apache.org/docs/latest/hadoop-provided.html) version of Spark, to interact with Hive?
First, I have downloaded and unzipped the headless spark 3.x:
cd ~/development/software/spark-3.0.0-bin-without-hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/hdp/current/spark2-client/conf classpath)
ls /usr/hdp # note the version and add it below, replacing 3.1.x.x-xxx with it
./bin/spark-shell --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
spark.sql("show databases").show
// only showing default namespace, existing hive tables are missing
+---------+
|namespace|
+---------+
| default|
+---------+
spark.conf.get("spark.sql.catalogImplementation")
res2: String = in-memory # I want to see hive here - how? How to add hive jars onto the classpath?
NOTE
This is an updated version of How can I run spark in headless mode in my custom version on HDP? for Spark 3.x and HDP 3.1, and of custom spark does not find hive databases when running on yarn.
Furthermore, I am aware of the problems with ACID Hive tables in Spark. For now, I simply want to be able to see the existing databases.
Edit
We must get the Hive jars onto the classpath. Trying as follows:
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"
And now using spark-sql:
./bin/spark-sql --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
fails with:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
I.e. the line export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}" had no effect (same issue if it is not set at all).
As noted above and in custom spark does not find hive databases when running on yarn, the Hive JARs are needed. They are not supplied in the headless version.
I was unable to retrofit them.
Solution: instead of worrying about it, simply use the Spark build that bundles Hadoop 3.2 (it works on HDP 3.1).
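As a sanity check with that build, a minimal PySpark sketch for verifying that the Hive catalog is actually picked up (assuming hive-site.xml is made visible, e.g. via --files as above):
from pyspark.sql import SparkSession

# Hive support only works when the Hive jars ship with the build;
# the prebuilt spark-3.x-bin-hadoop3.2 download includes them.
spark = (SparkSession.builder
         .appName("hive-check")
         .enableHiveSupport()
         .getOrCreate())

print(spark.conf.get("spark.sql.catalogImplementation"))  # expect "hive"
spark.sql("show databases").show()  # existing Hive databases should now appear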

Error while writing PySpark DataFrame into HBase

I am trying to write a PySpark DataFrame into HBase and I am facing the error below.
The Spark and HBase versions on my cluster are:
Spark Version: 2.4.0
HBase Version: 1.4.8
Spark Submit
spark-submit --jars /tmp/hbase-spark-1.0.0.jar --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml to_hbase.py
error:
Any help would be much appreciated!
This is a known problem when using the spark-hbase-connector (SHC) with Spark 2.4.
There is a fix by @dhananjay_patka.
Check: SHC with Spark 2.4,
and his fix.
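For context, this is roughly what the SHC write path looks like from PySpark (a minimal sketch; the table name, column family, and columns below are made up for illustration and need to match your own catalog):
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to_hbase").getOrCreate()

# Hypothetical data and catalog: HBase table "test_table" with one column family "cf"
df = spark.createDataFrame([("row1", "value1")], ["key", "col1"])
catalog = json.dumps({
    "table": {"namespace": "default", "name": "test_table"},
    "rowkey": "key",
    "columns": {
        "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
        "col1": {"cf": "cf",     "col": "col1", "type": "string"}
    }
})

(df.write
 .options(catalog=catalog, newtable="5")  # newtable: number of regions if the table must be created
 .format("org.apache.spark.sql.execution.datasources.hbase")
 .save())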

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'],value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr',bootstrap_servers=['127.0.0.1:9092'],group_id='abc' )
I've tried to connect to the stream with both the Spark context and the Spark session.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
which gives me this error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR to my command, so I tried:
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar as the application itself, with pyspark-shell passed to it as an argument, which is why spark-submit complains about a missing main class.
As the first error says, you missed --packages after spark-submit, which means you would do
spark-submit --packages ... --class com.example.YourClass someApp.jar
(for a PySpark job, you would pass your .py file instead of a JAR and --class).
If you are just working locally in Jupyter, you may want to try kafka-python, for example, rather than PySpark: less overhead, and no Java dependencies.
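If you do want to stay with Spark inside Jupyter, here is a minimal Structured Streaming sketch that pulls in the Kafka source through spark.jars.packages instead of spark-submit flags (the broker address and topic are taken from the question; the package version is assumed to match the Spark 2.3.0 install above, and the config must be set before any SparkContext exists):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("kafka-stream")
         # fetched from Maven at startup; must match your Spark/Scala versions
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", "hr")
      .load())

# print incoming messages to the notebook's console output
query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())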

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job.
spark-submit --class com.biz.test \
--packages \
org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
org.apache.hbase:hbase-common:1.0.0 \
org.apache.hbase:hbase-client:1.0.0 \
org.apache.hbase:hbase-server:1.0.0 \
org.json4s:json4s-jackson:3.2.11 \
./test-spark_2.10-1.0.8.jar \
>spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and returning to a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, added _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
A list of packages should be separated by commas, without whitespace (breaking lines should work just fine), for example:
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0
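Applied to the command from the question, that would look something like this (untested; same coordinates as above, just comma-separated):
spark-submit --class com.biz.test \
    --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0,org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar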
I found it useful to set the packages through SparkSession in Spark 3.0.0 for MySQL and Postgres:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mysql-postgres').config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16').getOrCreate()
@Mohammad, thanks for this input. This worked for me too. I had to load the Kafka and MySQL packages in a single SparkSession. I did something like this:
spark = (SparkSession
    .builder
    ...
    .appName('myapp')
    # Add Kafka and MySQL packages
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,mysql:mysql-connector-java:8.0.26")
    .getOrCreate())
