Kafka and Pyspark Integration - apache-spark

I am new to big data and I am trying to connect Kafka to Spark.
Here is my producer code
import os
import sys
import pykafka
def get_text():
    ## This block generates my required text.
    text_as_bytes = text.encode("utf-8")  # encode() takes an encoding name, not the text itself
    producer.produce(text_as_bytes)
if __name__ == "__main__":
    client = pykafka.KafkaClient("localhost:9092")
    print("topics", client.topics)
    producer = client.topics[b'imagetext'].get_producer()
    get_text()
This prints my generated text on the console consumer when I run
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic imagetext --from-beginning
Now I want this text to be consumed using Spark, and this is my Jupyter code:
import findspark
findspark.init()
import os
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /spark-2.1.1-bin-hadoop2.6/spark-streaming-kafka-0-8-assembly_2.11-2.1.0.jar pyspark-shell'
conf = SparkConf().setMaster("local[2]").setAppName("Streamer")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc,5)
print('ssc =================== {} {}')
kstream = KafkaUtils.createDirectStream(ssc, topics = ['imagetext'],
kafkaParams = {"metadata.broker.list": 'localhost:9092'})
print('contexts =================== {} {}')
lines = kstream.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination()
ssc.stop(stopGraceFully = True)
But this is producing output on my Jupyter as
Time: 2018-02-21 15:03:25
-------------------------------------------
-------------------------------------------
Time: 2018-02-21 15:03:30
-------------------------------------------
The text shown on my console consumer does not appear.
Please help; I am unable to figure out the mistake.

I found another solution. While putting get_text() in a loop works, it is not the right solution. Your data was not sent to Kafka as a continuous stream, so Spark Streaming should not have to receive it that way.
The kafka-python library lets you call get(timeout) on the future returned by send(), so that the producer waits until the send is acknowledged or the timeout expires.
producer.send(topic,data).get(timeout=10)
Since you are using pykafka, I am not sure whether this will work for you. Nevertheless, you can try it once without putting get_text() in a loop.
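As a minimal sketch of the blocking send described above: in kafka-python, send() returns a future, and calling get(timeout=...) on it waits for the result. The helper name send_and_wait is hypothetical, and the producer argument is assumed to be a kafka-python KafkaProducer (pykafka's producer has a different API, so this does not carry over to it directly).

```python
def send_and_wait(producer, topic, data, timeout=10):
    """Send one record and block until the send completes.

    `producer` is assumed to behave like kafka-python's KafkaProducer:
    send() returns a future whose get(timeout=...) waits for the result
    and raises if the send failed or timed out.
    """
    if isinstance(data, str):
        # Kafka stores bytes, so encode text before sending.
        data = data.encode("utf-8")
    future = producer.send(topic, data)
    return future.get(timeout=timeout)
```

With a running broker this could be called as send_and_wait(KafkaProducer(bootstrap_servers="localhost:9092"), "imagetext", text).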

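Just change the port in your consumer from 9092 to 2181, since 2181 is the ZooKeeper port. The producer has to connect to the Kafka broker on port 9092, and the streamer has to connect to ZooKeeper on port 2181. (Note that this applies to the receiver-based KafkaUtils.createStream, which takes a ZooKeeper address; createDirectStream reads metadata.broker.list and talks to the brokers directly.)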

Related

Not able to view kafka consumer output while executing in ECLIPSE: PySpark

I installed Kafka and ZooKeeper on a Windows system. I started the Kafka and ZooKeeper servers, created the topic "javainuse-topic", and started a producer and a consumer with the commands below:
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
.\bin\windows\kafka-server-start.bat .\config\server.properties
.\bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic javainuse-topic
.\bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic javainuse-topic
.\bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic javainuse-topic --from-beginning
I am able to transfer data successfully from the producer to the consumer. I then wrote the code below in Eclipse and tried to run it locally, but I cannot see the consumed data in my Eclipse console.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell'
import sys
import time
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
n_secs = 1
topic = "javainuse-topic"
conf = SparkConf().setAppName("KafkaStreamProcessor").setMaster("local[*]")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, n_secs)
kafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {
'bootstrap.servers':'localhost:9092',
'group.id':'javainuse-topic',
'fetch.message.max.bytes':'15728640',
'auto.offset.reset':'largest'})
# Group ID is completely arbitrary
lines = kafkaStream.map(lambda x: x[1])
counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
time.sleep(6) # Run stream for 6 seconds (use 600 for 10 minutes) in case the producer is not detected right away
# ssc.awaitTermination()
ssc.stop(stopSparkContext=True,stopGraceFully=True)
You might try again, but this time set auto.offset.reset to 'earliest' (or 'smallest' if you are using the old consumer, which is what the spark-streaming-kafka-0-8 package in the question uses).
kafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {
'bootstrap.servers':'localhost:9092',
'group.id':'javainuse-topic',
'fetch.message.max.bytes':'15728640',
'auto.offset.reset':'earliest'})
# Group ID is completely arbitrary
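A small sketch of the parameter choice discussed above, assuming the spark-streaming-kafka-0-8 connector from the question, whose old-consumer settings use 'smallest' and 'largest' for auto.offset.reset. The function name build_kafka_params is hypothetical; the returned dict is meant to be passed as the kafkaParams argument of KafkaUtils.createDirectStream.

```python
def build_kafka_params(brokers, from_beginning=False):
    """Build a kafkaParams dict for the 0-8 direct stream (assumed connector)."""
    return {
        "metadata.broker.list": brokers,
        # Old-consumer values: 'smallest' starts at the earliest available
        # offset, 'largest' only sees records produced after startup.
        "auto.offset.reset": "smallest" if from_beginning else "largest",
    }

params = build_kafka_params("localhost:9092", from_beginning=True)
print(params)
```

Starting from the earliest offset matters when the producer has already finished writing before the stream starts, since the default only picks up new records.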

How can I convert this row form into JSON while pushing into kafka topic

I am using a Spark application to process text files that are dropped into the /home/user1/files/ folder on my system; it maps the comma-separated data in those files into a particular JSON format. I have written the following Python code using Spark to do this, but the output that arrives in Kafka looks like this:
Row(Name=Priyesh,Age=26,MailId=priyeshkaratha#gmail.com,Address=AddressTest,Phone=112)
Python Code :
import findspark
findspark.init('/home/user1/spark')
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.sql import Column, DataFrame, Row, SparkSession
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='server.kafka:9092')
def handler(message):
    records = message.collect()
    for record in records:
        producer.send('spark.out', str(record))
        print(record)
    producer.flush()
def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream('/home/user1/files/')
    fields = lines.map(lambda l: l.split(","))
    udr = fields.map(lambda p: Row(Name=p[0],Age=int(p[3].split('#')[0]),MailId=p[31],Address=p[29],Phone=p[46]))
    udr.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()
if __name__ == "__main__":
    main()
So how can I convert this row form into JSON while pushing into kafka topic?
You can convert Spark Row objects to dicts and then serialize those to JSON. For example, you could change this line:
producer.send('spark.out', str(record))
to this:
producer.send('spark.out', json.dumps(record.asDict()))
Alternatively, since your example code isn't using DataFrames, you could create each record as a dict to begin with instead of a Row.
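To illustrate the dict-based alternative, here is a minimal sketch using only the standard library. The helper name line_to_json_bytes and the column positions are hypothetical (the question's real indices such as p[31] are not used), and the result is encoded to UTF-8 bytes on the assumption that the producer has no value_serializer configured.

```python
import json

# Hypothetical column positions, for illustration only.
FIELDS = {"Name": 0, "Age": 1, "MailId": 2, "Address": 3, "Phone": 4}

def line_to_json_bytes(line):
    """Turn one comma-separated line into a UTF-8 encoded JSON object."""
    parts = line.split(",")
    record = {name: parts[index] for name, index in FIELDS.items()}
    record["Age"] = int(record["Age"])
    return json.dumps(record).encode("utf-8")

payload = line_to_json_bytes("Priyesh,26,user@example.com,AddressTest,112")
print(payload.decode("utf-8"))
```

Inside handler, each payload could then be passed to producer.send('spark.out', payload).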

Seek to Beginning of Kafka Topic Using PySpark

Using a Kafka Stream in PySpark, is it possible to seek to the beginning of a Kafka topic without creating a new consumer group?
For example, I have the following code snippet:
...
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 pyspark-shell'
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext('local[2]', appName="MyStreamingApp_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 30)
spark = SparkSession(sc)
kafkaStream = KafkaUtils.createStream(ssc, zookeeper_ip, 'group-id', {'messages': 1})
counted = kafkaStream.count()
...
My goal is to do something along the lines of
kafkaStream.seekToBeginningOfTopic()
Currently, I'm creating a new consumer group to re-read from the beginning of the topic, e.g.:
kafkaStream = KafkaUtils.createStream(ssc, zookeeper, 'group-id-2', {'messages': 1}, {"auto.offset.reset": "smallest"})
Is this the proper way to consume a topic from the beginning using PySpark?

Spark streaming with Kafka. Print statement prints out Bytes instead

I'm using Spark Streaming (1.6) with Kafka. I'm able to produce and receive messages, but the output of the kafkaStream.pprint() statement is displayed as follows:
(u'\x00\x00\x03\x13', u'Message_787')
(u'\x00\x00\x03\x14', u'Message_788')
(u'\x00\x00\x03\x15', u'Message_789')
(u'\x00\x00\x03\x16', u'Message_790')
(u'\x00\x00\x03\x17', u'Message_791')
Code:
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 5)
kafkaStream = KafkaUtils.createStream(ssc,"zookepper","consumer-group",{"TOPIC": 1})
kafkaStream.pprint()
How do I convert the messages to ASCII or a human-readable format?
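One observation about the output above: the first element of each tuple is the message key, and the four characters \x00\x00\x03\x13, read as a big-endian 32-bit integer, give 787, which matches Message_787. Assuming the producer really does write the key as a 4-byte integer (an inference from the sample, not something the question states), a decoder along these lines would make it readable. The name decode_int_key is hypothetical; KafkaUtils.createStream accepts keyDecoder and valueDecoder arguments where such a function could be supplied.

```python
import struct

def decode_int_key(raw):
    """Decode a Kafka key assumed to be a 4-byte big-endian signed integer."""
    if raw is None:
        return None  # records sent without a key
    (value,) = struct.unpack(">i", raw)
    return value

print(decode_int_key(b"\x00\x00\x03\x13"))
```

It could be passed as keyDecoder=decode_int_key in the createStream call, leaving the default UTF-8 decoder in place for the message values.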

Kafka integration with spark

I want to set up a streaming application using Apache Kafka and Spark Streaming. Kafka (version 0.9.0.1) is running on a separate Unix machine, and Spark v1.6.1 is part of a Hadoop cluster.
I have started the ZooKeeper and Kafka servers and want to stream messages from a log file using the console producer and consume them in a Spark Streaming application using the direct method (no receivers). I have written the code in Python and run it with the command below:
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar streamingDirectKafka.py
I get the error below:
/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o38.createDirectStreamWithoutMessageHandler.
: java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
Could you please help?
Thanks!!
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    conf = SparkConf().setAppName("StreamingDirectKafka")
    sc = SparkContext(conf = conf)
    ssc = StreamingContext(sc, 1)
    topic = ['test']
    kafkaParams = {"metadata.broker.list": "apsrd7102:9092"}
    lines = (KafkaUtils.createDirectStream(ssc, topic, kafkaParams)
             .map(lambda x: x[1]))
    counts = (lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a+b))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
It looks like you are using an incompatible version of Kafka. According to the documentation, as of Spark 2.0, Kafka 0.8.x is supported.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Resources