Connecting Pyspark with Kafka - apache-spark

I'm having trouble understanding how to connect Kafka and PySpark.
I have a Kafka installation on Windows 10 with a topic that is streaming data nicely.
I've installed PySpark, which runs properly; I'm able to create a test DataFrame without a problem.
But when I try to connect to the Kafka stream it gives me this error:
AnalysisException: Failed to find data source: kafka. Please deploy
the application as per the deployment section of "Structured Streaming-
Kafka Integration Guide".
The Spark documentation is not really helpful - it says:
...
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.2.0
...
For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below.
And then when you go to the Deploying section it says:
As with any Spark applications, spark-submit is used to launch your application. spark-sql-kafka-0-10_2.12 and its dependencies can be directly added to spark-submit using --packages, such as,
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 ...
I'm developing an app; I don't want to deploy it.
Where and how do I add these dependencies if I'm developing a PySpark app?
I tried several tutorials and ended up more confused.
I saw an answer saying that
"You need to add kafka-clients JAR to your --packages" (so-answer)
A few more steps would be useful, because for someone who is new this is unclear.
versions:
kafka 2.13-2.8.1
spark 3.1.2
java 11.0.12
All environment variables and paths are correctly set.
EDIT
I've loaded:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.kafka:kafka-clients:2.8.1'
as suggested, but I'm still getting the same error.
I've triple-checked the Kafka, Scala and Spark versions and tried various combinations, but it didn't work; I'm still getting the same error:
AnalysisException: Failed to find data source: kafka. Please deploy
the application as per the deployment section of "Structured Streaming-Kafka Integration Guide".
EDIT 2
I installed the latest Spark 3.2.0 and Hadoop 3.3.1, and Kafka version kafka_2.12-2.8.1. I changed all environment variables and tested Spark and Kafka - both work properly.
My environment variable looks like this now:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.apache.kafka:kafka-clients:2.8.1'
Still no luck, I get the same error :(

Spark documentation is not really helpful - it says ... artifactId = spark-sql-kafka-0-10_2.12 version = 3.2.0 ...
Yes, that is correct... but for the latest version of Spark
versions:
spark 3.1.2
Have you tried looking at the version-specific docs?
In other words, you want the matching spark-sql-kafka version of 3.1.2.
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
Or in Python,
from pyspark.sql import SparkSession

scala_version = '2.12'
spark_version = '3.1.2'
# TODO: ensure these values match your Spark installation
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.2.1'
]
spark = SparkSession.builder \
    .master("local") \
    .appName("kafka-example") \
    .config("spark.jars.packages", ",".join(packages)) \
    .getOrCreate()
Or with an env-var:
import os
spark_version = '3.1.2'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:{}'.format(spark_version)
# init spark here (this must be set before the SparkSession is created)
need to add this above library and its dependencies
As you found in my previous answer, also append the kafka-clients package using a comma-separated list:
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.kafka:kafka-clients:2.8.1
I'm developing app, I don't want to deploy it.
"Deploy" is Spark terminology. Running locally is still a "deployment"

Related

Read data from Cassandra in spark-shell

I want to read data from a Cassandra node on my client node.
This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar
val df = spark.read.format("org.apache.spark.sql.cassandra")\
.option("keyspace","my_keyspace")\
.option("table","my_table")\
.option("spark.cassandra.connection.host","Hostname of my Cassandra node")\
.option("spark.cassandra.connection.port","9042")\
.option("spark.cassandra.auth.password","mypassword)\
.option("spark.cassandra.auth.username","myusername")\
.load
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException.
Am I missing any properties? What is this error about? How do I resolve it?
Spark-version:2.3.2, DSE version 6.7.8
The Spark Cassandra Connector itself depends on a number of other dependencies that could be missing here - this happens because you're providing only one jar and not all of the required dependencies.
Basically, in your case you have the following choices:
If you're running this on a DSE node, you can use the built-in Spark if the cluster has Analytics enabled - in this case, all jars and properties are already provided, and you only need to provide a username and password when starting the Spark shell via dse -u user -p password spark
If you're using external Spark, then it's better to use the so-called BYOS (bring your own spark) - a special version of the Spark Cassandra Connector with all dependencies bundled inside; you can download the jar from DataStax's Maven repo and use it with --jars
You can still use the open source Spark Cassandra Connector, but in this case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so Spark will be able to fetch all dependencies automatically (see the example after the P.S. below)
P.S. In the case of the open source Spark Cassandra Connector, I would recommend using version 2.5.1 or higher, although it requires Spark 2.4.x (2.3.x may work) - that version has improved support for DSE, plus a lot of new functionality not available in earlier versions. There is also a build of it that includes all required dependencies (the so-called assembly) that you can use with --jars if your machine doesn't have access to the internet.
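For the third option, the launch could look roughly like this (a sketch only; the host and credentials are placeholders for your own values):

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
  --conf spark.cassandra.connection.host=<cassandra-host> \
  --conf spark.cassandra.auth.username=<username> \
  --conf spark.cassandra.auth.password=<password>

With the connection settings passed as --conf, the read in the question no longer needs the per-read .option(...) calls for host and credentials.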

How to enable pyspark HIVE support on Google Dataproc master node

I created a Dataproc cluster and manually installed conda and Jupyter notebook. Then I installed pyspark with conda. I can successfully run Spark with
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
However, I cannot enable HIVE support. The following code gets stuck and doesn't return anything.
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())
Python version 2.7.13, Spark version 2.3.4
Any way to enable HIVE support?
I do not recommend manually installing pyspark. When you do this, you get a new Spark/PySpark installation that is different from Dataproc's own, and you do not get its configuration/tuning/classpath/etc. This is likely the reason Hive support does not work.
To get conda with a properly configured pyspark, I suggest selecting the ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.
Additionally, on 1.4 and later images Miniconda is the default user Python with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.
See https://cloud.google.com/dataproc/docs/tutorials/python-configuration
Also, as @Jayadeep Jayaraman points out, the Jupyter optional component works with Component Gateway, which means you can use it from a link in the Developers Console instead of opening ports to the world or SSH tunneling.
tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
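For reference, creating a cluster with those flags would look roughly like this (a sketch; the cluster name, region and image version are placeholders):

gcloud dataproc clusters create my-cluster \
  --region us-central1 \
  --image-version 1.4 \
  --optional-components ANACONDA,JUPYTER \
  --enable-component-gateway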
Cloud Dataproc now has the option to install optional components on the Dataproc cluster and also has an easy way of accessing them via the Component Gateway. You can find details on installing Jupyter and Conda here - https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
The details of the Component Gateway can be found here - https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways. Note that this is in Alpha.

Pyspark and Cassandra Connection Error

I'm stuck on a problem. When I write sample Cassandra connection code, importing the Cassandra connector gives an error.
I am starting the script as below (both commands gave the error):
./spark-submit --jars spark-cassandra-connector_2.11-1.6.0-M1.jar /home/beyhan/sparkCassandra.py
./spark-submit --jars spark-cassandra-connector_2.10-1.6.0.jar /home/beyhan/sparkCassandra.py
But it gives the error below on:
import pyspark_cassandra
ImportError: No module named pyspark_cassandra
Which part did I do wrong?
Note: I have already installed the Cassandra database.
You are mixing up DataStax's Spark Cassandra Connector (the jar you add to spark-submit) and TargetHolding's PySpark Cassandra project (which has the pyspark_cassandra module). The latter is deprecated, so you should probably use the Spark Cassandra Connector. Documentation for this package can be found here.
To use it, you can add the following flags to spark submit:
--conf spark.cassandra.connection.host=127.0.0.1 \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
Of course, use the IP address on which Cassandra is listening, and check which connector version you need to use: 2.0.0-M3 is the latest version and works with Spark 2.0 and most Cassandra versions. See the compatibility table in case you are using a different version of Spark. 2.10 or 2.11 is the Scala version your Spark build uses; if you use Spark 2, by default it is 2.11, and before 2.x it was 2.10.
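Put together with the script path from the question, the launch would look roughly like this (the connector version here is only an example; pick the one matching your Spark version as described above):

./spark-submit \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 \
  /home/beyhan/sparkCassandra.py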
Then the nicest way to work with the connector is to use it to read dataframes, which looks like this:
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
See the PySpark with DataFrames documentation for more details

Connecting to cassandra using pyspark

I am a beginner learning to work with spark and cassandra.
I am trying to connect to cassandra using pyspark. I am running cassandra 2.1 and spark 1.3.
I have cloned this repo https://github.com/TargetHolding/pyspark-cassandra and followed the instructions to get it working with the Spark shell as well as with spark-submit.
This is the command I am using: ./bin/spark-submit --packages pyspark-cassandra:1.3 --conf spark.cassandra.connection.host=127.0.0.1:9042 cassandra_test.py
and similarly with pyspark replacing spark-submit (without the script at the end).
I am getting this error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: pyspark-cassandra:1.3
I have tried looking up this error and going through related questions, but I am not able to get the connector working.
Any help will be greatly appreciated.
Thanks in advance.
Haven't tried it, but the spark packages page is here: http://spark-packages.org/package/TargetHolding/pyspark-cassandra
Seems to suggest:
$SPARK_HOME/bin/spark-shell --packages TargetHolding:pyspark-cassandra:0.1.5
Note the TargetHolding: bit. That might be it.
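In other words, the coordinate needs a groupId as well ('groupId:artifactId:version'), so the command from the question would become something like this (a sketch based on the package coordinate above; the version may need adjusting for your Spark/Cassandra setup):

./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.1.5 \
  --conf spark.cassandra.connection.host=127.0.0.1 cassandra_test.py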

Failed on running an example of Kafka & Spark Streaming named KafkaWordCount

I worked on the example named KafkaWordCount as found on http://rishiverma.com/software/blog/2014/07/31/spark-streaming-and-kafka-quickstart/
BTW, I modified some details that don't matter. When I went to the last step to build a Kafka consumer, it failed and said:
Exception in thread "main" org.apache.spark.SparkException: Could not parse Master URL: 'localhost:2181'
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1493)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:279)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:555)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:92)
at org.apache.spark.streaming.examples.KafkaWordCount$.main(KafkaWordCount.scala:54)
at org.apache.spark.streaming.examples.KafkaWordCount.main(KafkaWordCount.scala)
Has anyone encountered this failure?
Which version of Spark are you using? In Spark 1.0+, KafkaWordCount is under the org.apache.spark.examples.streaming package. From your stacktrace, it looks like your version is under org.apache.spark.streaming.examples, which suggests that you're using a pre-1.0 version of Spark.
In Spark 0.9.x (which was released prior to the introduction of SparkConf and spark-submit), this example's first argument was a Spark master URL (source), causing the problem that you're seeing, since the "Could not parse Master URL" error suggests that localhost:2181 isn't a valid Spark master URL.
If you can, I recommend using a newer version of Spark (the tutorial that you linked recommends Spark 1.0.1 or higher). Otherwise, follow the instructions at the top of your particular version of KafkaWordCount (example).
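For reference, on Spark 1.x the bundled example is launched with run-example and takes the ZooKeeper quorum, consumer group, topics and thread count as arguments; a rough sketch with placeholder values:

bin/run-example streaming.KafkaWordCount localhost:2181 my-consumer-group my-topic 1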
