Spark Streaming with Kafka (SSL enabled) - apache-spark

I have seen multiple questions on this topic but have been unable to find an answer. I have tried adding comments to those questions but have not received any response yet.
Requirement -> Spark Streaming with Kafka, SSL enabled
Below are the version details. I am running Spark locally.
Spark -> version 2.4.6
Kafka -> version 2.2.1
Code snippet:
import sys
import logging
from datetime import datetime

try:
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 20)

    brokers = "b1:p,b2:p"
    topic = "topic1"
    kafkaParams = {"security.protocol": "SSL", "metadata.broker.list": brokers}

    kvs = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams)
    lines = kvs.map(lambda x: x[1])
    print(lines)
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
except ImportError as e:
    print("Error importing Spark Modules :", e)
    sys.exit(1)
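For reference, a Kafka SSL client configuration usually also needs the truststore/keystore properties alongside security.protocol. Below is a minimal sketch of such a parameter map; the paths and passwords are placeholders rather than values from my setup, and whether the chosen Spark-Kafka integration honours them is exactly what I am trying to sort out.
# Sketch only: typical Kafka SSL client properties (placeholder paths/passwords).
kafkaParams = {
    "metadata.broker.list": brokers,
    "security.protocol": "SSL",
    "ssl.truststore.location": "/path/to/kafka.client.truststore.jks",
    "ssl.truststore.password": "truststore-password",
    "ssl.keystore.location": "/path/to/kafka.client.keystore.jks",
    "ssl.keystore.password": "keystore-password",
    "ssl.key.password": "key-password",
}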
I am submitting it as follows:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.10:2.0.0 --master local pysparkKafka.py
This is based on the reference I followed:
Enabling SSL between Apache spark and Kafka broker
I am getting the below error
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.6 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.6.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
After seeing the above error and picking a JAR from Maven, I now submit it as below.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 --master local pysparkKafka.py
I am getting the below error when connecting to the SSL port of the brokers.
20/07/24 11:58:46 INFO VerifiableProperties: Verifying properties
20/07/24 11:58:46 INFO VerifiableProperties: Property group.id is overridden to
20/07/24 11:58:46 WARN VerifiableProperties: Property security.protocol is not valid
20/07/24 11:58:46 INFO VerifiableProperties: Property zookeeper.connect is overridden to
20/07/24 11:58:46 INFO SimpleConsumer: Reconnect due to socket error: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
20/07/24 11:58:47 INFO SimpleConsumer: Reconnect due to socket error: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
Traceback (most recent call last):
Note -> The above submit works well without SSL, but my requirement is to enable SSL.
Any help is greatly appreciated. Thanks.

Related

Connecting to Cassandra on remote client using Spark

I have two PCs: one is an Ubuntu system that has Cassandra, and the other is a Windows PC.
I have installed the same versions of Java, Spark, Python and Scala on both PCs. My goal is to read data from the Cassandra instance on the other PC with Spark, using a Jupyter Notebook.
On the PC that has Cassandra, I was able to read data by connecting to Cassandra using Spark. But when I try to connect to that Cassandra instance from the remote client using Spark, I cannot connect and get an error.
Representation of the system
Commands run on the Ubuntu PC that has Cassandra:
~/spark/bin ./pyspark --master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executer.extraJavaOptions=-Xss512m
from pyspark.sql.functions import col
hosts = {"spark.cassandra.connection.host": '10.0.0.10,10.0.0.11,10.0.0.12', "table": "table_one", "keyspace": "log_keyspace"}
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
a = data_frame.filter(col("col_1") < 100000).select("col_1", "col_2", "col_3", "col_4", "col_5").toPandas()
As a result of running the above code, the data received from Cassandra can be displayed.
Commands trying to get data by connecting to Cassandra from the other PC:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' --master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executer.extraJavaOptions=-Xss512m spark.cassandra.connection.host=10.0.0.10 pyspark '
import findspark
findspark.init()
findspark.find()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('example')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
hosts ={"spark.cassandra.connection.host":'10.0.0.10',"table":"table_one","keyspace":"log_keyspace"}
sqlContext = SQLContext(sc)
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
As a result of running the above code, the error ":java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html" occurs.
What can I do to fix this error?
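For comparison, the connector package and the connection host can also be supplied directly on the SparkSession builder via spark.jars.packages instead of PYSPARK_SUBMIT_ARGS. A minimal sketch of that approach, reusing the coordinates, addresses, table and keyspace from the question; everything else is an assumption:
from pyspark.sql import SparkSession

# Sketch: supply the Cassandra connector package and host at session creation time.
spark = (SparkSession.builder
         .appName("example")
         .master("spark://10.0.0.10:7077")
         .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
         .config("spark.cassandra.connection.host", "10.0.0.10")
         .getOrCreate())

data_frame = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(table="table_one", keyspace="log_keyspace")
              .load())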

How to do streaming Kafka->Zeppelin->Spark with current versions

I have a Kafka 2.3 message broker and want to do some processing on the data of the messages within Spark. To begin with, I want to use the Spark 2.4.0 that is integrated in Zeppelin 0.8.1 and use the Zeppelin notebooks for rapid prototyping.
For this streaming task I need "spark-streaming-kafka-0-10" for Spark > 2.3 according to https://spark.apache.org/docs/latest/streaming-kafka-integration.html, which only supports Java and Scala (and not Python). But there are no default Java or Scala interpreters in Zeppelin.
If I try this code (taken from https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/)
%spark.pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
I get the following error
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/, Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0. Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
Fail to execute line 1: kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8982542851842620568.py", line 380, in
    exec(code, _zcUserQueryNameSpace)
  File "", line 1, in
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
    helper = KafkaUtils._get_helper(ssc._sc)
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
    return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable
So I wonder how to tackle the task:
1. Should I really use spark-streaming-kafka-0-8 despite it having been deprecated for some months? But spark-streaming-kafka-0-10 seems to be in the default Zeppelin jar directory.
2. Configure/create an interpreter in Zeppelin for Java/Scala, since spark-streaming-kafka-0-10 only supports these languages?
3. Ignore Zeppelin and do it on the console using spark-submit (see the sketch after this list)?
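If the console route in option 3 were taken, a minimal sketch of the submit command might pair the snippet above with the 0-8 package named in the error message; the script name here is a placeholder:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 --master local[*] kafka_stream_test.py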

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'],value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr',bootstrap_servers=['127.0.0.1:9092'],group_id='abc' )
I've tried to connect to the stream with both spark context and spark session.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR, so I ran:
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar, and trying to find pyspark-shell as a Java class inside of that.
As the first error says, you missed a --packages after spark-submit, which means you would do
spark-submit --packages ... someApp.jar com.example.YourClass
If you are just locally in Jupyter, you may want to try Kafka-Python, for example, rather than PySpark... Less overhead, and no Java dependencies.
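A minimal sketch of the kafka-python route mentioned above, reusing the topic, broker and group id from the question; the JSON value_deserializer mirrors the producer's serializer and is an assumption:
from kafka import KafkaConsumer
import json

# Sketch: plain kafka-python consumer, no Spark involved.
consumer = KafkaConsumer(
    'hr',
    bootstrap_servers=['127.0.0.1:9092'],
    group_id='abc',
    value_deserializer=lambda m: json.loads(m.decode('ascii')),
)

for message in consumer:
    print(message.value)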

How to add hbase-site.xml config file using spark-shell

I have the following simple code:
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.HBaseConfiguration
val hbaseconfLog = HBaseConfiguration.create()
val connectionLog = ConnectionFactory.createConnection(hbaseconfLog)
Which I'm running in spark-shell, and I'm getting the following error:
14:23:42 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:30)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
I get many of these errors, actually, and a few of these every now and then:
14:23:46 WARN client.ZooKeeperRegistry: Can't retrieve clusterId from
Zookeeper org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
Through Cloudera's VM I'm able to solve this by simply restarting the hbase-master, regionserver and thrift, but here at my company I'm not allowed to do that. I also solved it once by copying the file hbase-site.xml to the Spark conf directory, but I can't do that either. Is there a way to set the path to this specific file in the spark-shell parameters?
1) Make sure that your ZooKeeper is running.
2) Copy hbase-site.xml to the /etc/spark/conf folder, just like we copy hive-site.xml to /etc/spark/conf to access Hive tables.
3) export SPARK_CLASSPATH=/a/b/c/hbase-site.xml:/d/e/f/hive-site.xml
just as described in the Hortonworks forum,
or
open spark-shell without adding hbase-site.xml and execute these commands in spark-shell:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/home/spark/development/hbase/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, table_name)

Spark output when submitting via Yarn cluster vs. client

I am new to Spark and just got it running on my cluster (Spark 2.0.1 on a 9-node cluster running the Community version of MapR). I submit the wordcount example via
./bin/spark-submit --master yarn --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
and get the following output
17/04/07 13:21:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
: 68
help: 1
when: 1
Hadoop: 3
...
Looks like everything is working properly. When I add --deploy-mode cluster I get the following output:
17/04/07 13:23:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
So there are no errors, but I am not seeing the wordcount results. What am I missing? I see the job in my History Server and it says it completed successfully. Also, I checked my user directory in DFS, but no new files were written except for this empty directory: /user/myuser/.sparkStaging
Code (wordcount.py example shipped with Spark):
from __future__ import print_function
import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()
The reason for your output not printing is:
When you run in client mode, the node on which you initiate the job is the DRIVER, and when you collect the result it is collected on that node and printed there.
In yarn-cluster mode your driver is some other node, not the one from which you initiated the job. So when you call the .collect function, the result is collected on that node and printed there. You can find the result in the sys-out of the driver.
A better approach would be to write the output somewhere in HDFS, as sketched below.
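A minimal sketch of that approach, reusing the counts RDD from the wordcount code above; the HDFS output path is a placeholder:
# Instead of collect()-ing to the driver, persist the result so it is visible
# regardless of where the driver runs.
counts.saveAsTextFile("hdfs:///user/myuser/wordcount_output")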
The reason for your spark.yarn.jars warning is:
In order to run a Spark job, YARN needs certain binaries available on all nodes of the cluster. If these binaries are not available, then as part of job preparation Spark will create a zip file with all the jars under $SPARK_HOME/jars and upload it to the distributed cache.
To solve this:
By default, Spark on YARN uses the Spark jars installed locally, but the Spark jars can also be placed in a world-readable (chmod 777) location on HDFS. This allows YARN to cache them on the nodes so that they don't need to be distributed each time an application runs. To point to jars on HDFS, for example, set spark.yarn.jars to hdfs:///some/path.
After placing your jars, run your code like:
./bin/spark-submit --master yarn --conf spark.yarn.jars="hdfs:///some/path" --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
Source : http://spark.apache.org/docs/latest/running-on-yarn.html
