Kafka integration with Spark

I want to set up a streaming application using Apache Kafka and Spark Streaming. Kafka (version 0.9.0.1) is running on a separate Unix machine, and Spark v1.6.1 is part of a Hadoop cluster.
I have started ZooKeeper and the Kafka server, and I want to stream messages from a log file through the console producer and have them consumed by a Spark Streaming application using the direct approach (no receivers). I have written the code in Python and am executing it with the command below:
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar streamingDirectKafka.py
I am getting the error below:
File "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o38.createDirectStreamWithoutMessageHandler.
: java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
Could you please help?
Thanks!!
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    conf = SparkConf().setAppName("StreamingDirectKafka")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)

    topic = ['test']
    kafkaParams = {"metadata.broker.list": "apsrd7102:9092"}
    lines = (KafkaUtils.createDirectStream(ssc, topic, kafkaParams)
             .map(lambda x: x[1]))

    counts = (lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

It looks like you are using an incompatible version of Kafka. From the documentation, as of Spark 2.0 only Kafka 0.8.x is supported by this integration, while your broker is running 0.9.0.1:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Related

Spark: 'writeStream' can be called only on streaming Dataset/DataFrame

I'm trying to pull tweets from my Kafka cluster into Spark Streaming, where I perform some analysis and store them in an Elasticsearch index.
Versions:
Spark - 2.3.0
Pyspark - 2.3.0
Kafka - 2.3.0
Elasticsearch - 7.9
Elasticsearch Hadoop - 7.6.2
I run the following code in my Jupyter environment to write the streaming dataframe into Elasticsearch:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0,org.elasticsearch:elasticsearch-hadoop:7.6.2 pyspark-shell'

from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
import nltk
import logging
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def getSqlContextInstance(sparkContext):
    if 'sqlContextSingletonInstance' not in globals():
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def analyze_sentiment(tweet):
    scores = dict([('pos', 0), ('neu', 0), ('neg', 0), ('compound', 0)])
    sentiment_analyzer = SentimentIntensityAnalyzer()
    score = sentiment_analyzer.polarity_scores(tweet)
    for k in sorted(score):
        scores[k] += score[k]
    return json.dumps(scores)

def process(time, rdd):
    print("========= %s =========" % str(time))
    try:
        if rdd.count() == 0:
            raise Exception('Empty')
        sqlContext = getSqlContextInstance(rdd.context)
        df = sqlContext.read.json(rdd)
        df = df.filter("text not like 'RT #%'")
        if df.count() == 0:
            raise Exception('Empty')
        udf_func = udf(lambda x: analyze_sentiment(x), returnType=StringType())
        df = df.withColumn("Sentiment", lit(udf_func(df.text)))
        print(df.take(10))
        df.writeStream.outputMode('append').format('org.elasticsearch.spark.sql') \
            .option('es.nodes', 'localhost').option('es.port', 9200) \
            .option('checkpointLocation', '/checkpoint') \
            .option('es.spark.sql.streaming.sink.log.enabled', False) \
            .start('PythonSparkStreamingKafka_RM_01').awaitTermination()
    except Exception as e:
        print(e)
        pass

sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 20)

kafkaStream = KafkaUtils.createDirectStream(ssc, ['kafkaspark'], {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'spark-streaming',
    'fetch.message.max.bytes': '15728640',
    'auto.offset.reset': 'largest'})

parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.foreachRDD(process)

ssc.start()
ssc.awaitTermination(timeout=180)
But I get the error:
'writeStream' can be called only on streaming Dataset/DataFrame;
And it looks like I have to use .readStream, but how do I use it to read from Kafka without createDirectStream?
Could someone please help me write this dataframe into Elasticsearch? I am a beginner with Spark Streaming and Elasticsearch and find it quite challenging. I would be happy if someone could guide me through getting this done.
.writeStream is part of the Spark Structured Streaming API, so you need to use the corresponding API to start reading the data (spark.readStream), pass the options specific to the Kafka source that are described in the separate documentation, and also use the additional jar that contains the Kafka implementation. The corresponding code would look like this (full code is here):
val streamingInputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.0.10:9092")
.option("subscribe", "tweets-txt")
.load()
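
Since the question uses PySpark, a rough Python equivalent of the same flow is sketched below. This is a minimal sketch, not the answer's exact code: it assumes the org.apache.spark:spark-sql-kafka-0-10 and org.elasticsearch:elasticsearch-hadoop packages matching your Spark build are on the classpath, and the topic ('kafkaspark') and target index ('tweets') names are illustrative.
# Hedged PySpark sketch of the Structured Streaming flow described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TweetsToES").getOrCreate()

streaming_df = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "kafkaspark")
                .load()
                .selectExpr("CAST(value AS STRING) AS value"))  # Kafka delivers binary key/value

query = (streaming_df.writeStream
         .outputMode("append")
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "localhost")
         .option("es.port", "9200")
         .option("checkpointLocation", "/checkpoint")
         .start("tweets"))  # illustrative Elasticsearch index name
query.awaitTermination()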

sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper() TypeError: 'JavaPackage' object is not callable

I am learning how to integrate Spark with Kafka. Currently I have created a virtualenv and installed the pyspark and py4j packages.
I have also configured these environment variables:
PYSPARK_PYTHON : C:\learn_new\learn_utils\venv\Scripts\python.exe
SPARK_HOME : C:\spark-2.4.3-bin-hadoop2.7
Then I want to run the example Python source code under C:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\streaming\direct_kafka_wordcount.py.
The script code is this:
from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
The command line to run the Python code under the virtualenv is this:
python --default --client --host localhost --port 60614 c:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\streaming\direct_kafka_wordcount.py kafka_host_name:9092 topic_name
Then I got this error:
File "c:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\venv\lib\site-packages\pyspark\streaming\kafka.py", line 138, in createDirectStream
helper = KafkaUtils._get_helper(ssc._sc)
File "c:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\venv\lib\site-packages\pyspark\streaming\kafka.py", line 217, in _get_helper
return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable
What is the issue here?
Thanks very much.
I mainly want to debug the code locally, so I did not want to use spark-submit with --jars or --packages parameters to run it.
But it really does need the spark-streaming-kafka-0-8-assembly_2.11-2.4.3.jar package (change the package version according to your Spark version).
So I downloaded the package and saved it to C:\spark-2.4.3-bin-hadoop2.7\jars (change this to your Spark installation path and find the jars folder).
That solved the issue. Hope it helps other people.
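As an alternative that avoids copying jars by hand, the connector can also be pulled in through PYSPARK_SUBMIT_ARGS before the SparkContext is created, the same pattern other snippets in this thread use. A minimal sketch, assuming the 0-8 connector version that matches a Spark 2.4.3 / Scala 2.11 build:
# Hedged sketch: let pyspark fetch the Kafka connector itself.
# The version string is an assumption; match it to your Spark/Scala build.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.3 pyspark-shell'
# This must be set before the SparkContext is created, or it has no effect.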
I had a similar problem. I added the jar separately in two places: first where Spark keeps all its jars, and second into the jars folder of the pyspark package installed under the current Python version. And it worked.

Not able to view Kafka consumer output while executing in Eclipse: PySpark

I installed Kafka and ZooKeeper on a Windows system. I have started the Kafka and ZooKeeper servers, created the topic "javainuse-topic", and started the producer and consumer with the commands below:
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
.\bin\windows\kafka-server-start.bat .\config\server.properties
.\bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic javainuse-topic
.\bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic javainuse-topic
.\bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic javainuse-topic --from-beginning
I am able to transfer data successfully from producer to consumer. So I wrote the code below in Eclipse and tried to execute it locally, but I am not able to view the consumer data in my Eclipse console.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell'

import sys
import time
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

n_secs = 1
topic = "javainuse-topic"

conf = SparkConf().setAppName("KafkaStreamProcessor").setMaster("local[*]")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, n_secs)

kafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'javainuse-topic',  # group ID is completely arbitrary
    'fetch.message.max.bytes': '15728640',
    'auto.offset.reset': 'largest'})

lines = kafkaStream.map(lambda x: x[1])
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
time.sleep(6)  # run the stream for 6 seconds in case the producer is not detected
# ssc.awaitTermination()
ssc.stop(stopSparkContext=True, stopGraceFully=True)
You might try again, but this time setting auto.offset.reset to 'smallest', so the direct stream starts from the beginning of the topic instead of only reading new messages. (The 0.8 direct stream used here recognizes the old consumer's values 'smallest' and 'largest'; 'earliest' is the new consumer API's equivalent and is not honored by this connector.)
kafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'javainuse-topic',  # group ID is completely arbitrary
    'fetch.message.max.bytes': '15728640',
    'auto.offset.reset': 'smallest'})

How to enable backpressure in Spark Streaming (using pyspark)

I would like to know the correct way to enable backpressure in Spark Streaming through PySpark. It looks like too many messages are sent from Kafka in a short time and they overwhelm the application. Below is my code for Spark Streaming. Can anyone point me to the correct place to enable backpressure?
sc = SparkContext(appName="PythonStreamingDirectKafka")
ssc = StreamingContext(sc, 5)
ssc.checkpoint("/spark_check/")
kvs = KafkaUtils.createDirectStream(ssc, [kafka_topic],
{"metadata.broker.list": bootstrap_servers_ipaddress})
parsed_msg = kvs.map(lambda (key, value): json.loads(value))
## do something below
Here is how I set backpressure in my Kafka streaming code.
Hope it helps.
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("PythonStreamingDirectKafka") \
    .set("spark.streaming.backpressure.enabled", "true") \
    .set("spark.streaming.backpressure.initialRate", "500")
sc = SparkContext(conf=conf)
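If backpressure alone does not tame the initial burst, the direct Kafka stream also supports a hard per-partition rate cap. A minimal sketch (the cap of 1000 records per second per partition is illustrative, not from the thread):
# Hedged sketch: combine backpressure with a per-partition ceiling.
conf = (SparkConf().setAppName("PythonStreamingDirectKafka")
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.backpressure.initialRate", "500")
        # upper bound on records/sec read from each Kafka partition (illustrative value)
        .set("spark.streaming.kafka.maxRatePerPartition", "1000"))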

Kafka and PySpark Integration

I am new to big data and am trying to connect Kafka to Spark.
Here is my producer code:
import os
import sys
import pykafka

def get_text():
    ## This block generates my required text.
    text_as_bytes = text.encode('utf-8')  # encode the generated text as bytes
    producer.produce(text_as_bytes)

if __name__ == "__main__":
    client = pykafka.KafkaClient("localhost:9092")
    print("topics", client.topics)
    producer = client.topics[b'imagetext'].get_producer()
    get_text()
This prints my generated text on the console consumer when I run:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic imagetext --from-beginning
Now I want this text to be consumed using Spark, and this is my Jupyter code:
import findspark
findspark.init()

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /spark-2.1.1-bin-hadoop2.6/spark-streaming-kafka-0-8-assembly_2.11-2.1.0.jar pyspark-shell'

import json
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = SparkConf().setMaster("local[2]").setAppName("Streamer")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)
print('ssc =================== {} {}')

kstream = KafkaUtils.createDirectStream(ssc, topics=['imagetext'],
                                        kafkaParams={"metadata.broker.list": 'localhost:9092'})
print('contexts =================== {} {}')

lines = kstream.map(lambda x: x[1])
lines.pprint()

ssc.start()
ssc.awaitTermination()
ssc.stop(stopGraceFully=True)
But this produces only empty batches as output in my Jupyter notebook:
-------------------------------------------
Time: 2018-02-21 15:03:25
-------------------------------------------

-------------------------------------------
Time: 2018-02-21 15:03:30
-------------------------------------------
It does not show the text that is on my console consumer. Please help; I am unable to figure out the mistake.
I found another solution to it. While the solution of putting get_text() in a loop works, it is not the right one. Your data was not sent to Kafka in a continuous fashion, so Spark Streaming should not be expected to receive it that way.
The kafka-python library provides a get(timeout) facility so that the producer blocks until Kafka acknowledges the request:
producer.send(topic, data).get(timeout=10)
Since you are using pykafka, I am not sure whether it will work there. Nevertheless, you can still try it once, and don't put get_text() in a loop.
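For reference, a minimal sketch of that synchronous send with kafka-python (not pykafka); the topic name and payload are illustrative:
# Hedged sketch using kafka-python's synchronous send.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# .get(timeout=10) blocks until the broker acknowledges the record (or raises on failure)
record_metadata = producer.send('imagetext', b'some generated text').get(timeout=10)
producer.flush()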
Just change the port in your consumer from 9092 to 2181, as that is the ZooKeeper port. The producer side has to connect to Kafka on port 9092, while the streamer side has to connect to ZooKeeper on port 2181.
