Lost messages in spark-streaming with checkpoint - apache-spark

Read from Kafka with Spark-Streaming and Checkpoint, but lost messages.
Code for generating a test stream:
from kafka import KafkaProducer
import time
p = KafkaProducer(bootstrap_servers='kafka.dev:9092')
for i in range(1000):
time.sleep(2)
p.send('y_test', value='{"test": ' + str(i) + '}')
Code for reading from kafka:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
def createContext():
sc = SparkContext(appName='test_app')
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum='kafka.dev:2181',
groupId='test_app', topics={'y_test': 1})
kafkaStream.saveAsTextFiles('test_dir/')
ssc.checkpoint('checkpoint_dir')
return ssc
context = StreamingContext.getOrCreate('checkpoint_dir', createContext)
context.start()
context.awaitTermination()
How I check:
1) start code for reading
2) start code for generating
3) restart code for reading
4) read hdfs and see data failure:
{"test": 1}
{"test": 2}
{"test": 3}
{"test": 4}
{"test": 8}
{"test": 9}
{"test": 10}
{"test": 11}
{"test": 12}
Kafka 0.9.0.0, Spark 1.6.0

Related

Spark Extension using AWS Glue

I have created a script locally that uses the spark extension 'uk.co.gresearch.spark:spark-extension_2.12:2.2.0-3.3' for comparing different DataFrames in a simple manner.
However, when I try this out on AWS Glue I ran into some issues and received this error:
ModuleNotFoundError: No module named 'gresearch'
I have tried copying the .jar file from my local disk that was referenced when I initialized the spark session locally and received this message:
... The jars for the packages stored in: /Users/["SOME_NAME"]/.ivy2/jars
uk.co.gresearch.spark#spark-extension_2.12 added as a dependency...
In that path I found a file named: uk.co.gresearch.spark_spark-extension_2.12-2.2.0-3.3.jar that I copied to S3 and referenced in the Jar lib path.
But this did not work... How would you go about setting this up in the correct manner?
The example code I've used to test this on AWS Glue looks like this:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
appName = 'test_gresearch'
spark_conf = SparkConf()
spark_conf.setAll([('spark.jars.packages', 'uk.co.gresearch.spark:spark-
extension_2.12:2.2.0-3.3')])
spark=SparkSession.builder.config(conf=spark_conf)\
.enableHiveSupport().appName(appName).getOrCreate()
from gresearch.spark.diff import *
df1 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "UK"],
[3, "GHI", 3000, "JPN"],
[4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])
df2 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "CAN"],
[3, "GHI", 3500, "JPN"],
[4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
df1.show()
df2.show()
options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
Any tips are more than welcome. Thank you in advance!
Regards
After some investigation with the AWS support team, I was instructed to include the package .jar file through the Python library path since the .jar file comprises embedded Python packages. The correct version of the .jar file shall therefore be downloaded (https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension_2.12/2.1.0-3.1 was the version I ended up using) and uploaded to S3 and referenced in under the Glue job setting for Python library path (for eg - s3://bucket-name/spark-extension_2.12-2.1.0-3.1.jar).
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
job.commit()
left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
from gresearch.spark.diff import *
left.diff(right, "id").show()

Spark : writeStream' can be called only on streaming Dataset/DataFrame

I'm trying to retrieve tweets from my Kafka cluster to Spark Streaming in which I perform some analysis to store them in an ElasticSearch Index.
Versions :
Spark - 2.3.0
Pyspark - 2.3.0
Kafka - 2.3.0
Elastic Search - 7.9
Elastic Search Hadoop - 7.6.2
I run the following code in my Jupyter env to write the streaming dataframe into Elastic Search .
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0,org.elasticsearch:elasticsearch-hadoop:7.6.2 pyspark-shell'
from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
import nltk
import logging
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer
def getSqlContextInstance(sparkContext):
if ('sqlContextSingletonInstance' not in globals()):
globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
return globals()['sqlContextSingletonInstance']
def analyze_sentiment(tweet):
scores = dict([('pos', 0), ('neu', 0), ('neg', 0), ('compound', 0)])
sentiment_analyzer = SentimentIntensityAnalyzer()
score = sentiment_analyzer.polarity_scores(tweet)
for k in sorted(score):
scores[k] += score[k]
return json.dumps(scores)
def process(time,rdd):
print("========= %s =========" % str(time))
try:
if rdd.count()==0:
raise Exception('Empty')
sqlContext = getSqlContextInstance(rdd.context)
df = sqlContext.read.json(rdd)
df = df.filter("text not like 'RT #%'")
if df.count() == 0:
raise Exception('Empty')
udf_func = udf(lambda x: analyze_sentiment(x),returnType=StringType())
df = df.withColumn("Sentiment",lit(udf_func(df.text)))
print(df.take(10))
df.writeStream.outputMode('append').format('org.elasticsearch.spark.sql').option('es.nodes','localhost').option('es.port',9200)\
.option('checkpointLocation','/checkpoint').option('es.spark.sql.streaming.sink.log.enabled',False).start('PythonSparkStreamingKafka_RM_01').awaitTermination()
except Exception as e:
print(e)
pass
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 20)
kafkaStream = KafkaUtils.createDirectStream(ssc, ['kafkaspark'], {
'bootstrap.servers':'localhost:9092',
'group.id':'spark-streaming',
'fetch.message.max.bytes':'15728640',
'auto.offset.reset':'largest'})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.foreachRDD(process)
ssc.start()
ssc.awaitTermination(timeout=180)
But I get the error :
'writeStream' can be called only on streaming Dataset/DataFrame;
And , it looks like I have to use .readStream , but how do I use it to read from KafkaStream without CreateDirectStream ?
Could someone please help me with writing this dataframe into Elastic Search . I am a beginner to Spark Streaming and Elastic Search and find it quite challenging . Would be happy if someone could guide me through getting this done.
.writeStream is a part of the Spark Structured Streaming API, so you need to use corresponding API to start reading the data - the spark.readStream, and pass options specific for the Kafka source that are described in the separate document, and also use the additional jar that contains the Kafka implementation. The corresponding code would look like that (full code is here):
val streamingInputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.0.10:9092")
.option("subscribe", "tweets-txt")
.load()

Failed to find leader for topics; java.lang.NullPointerException NullPointerException at org.apache.kafka.common.utils.Utils.formatAddress

When we are trying to stream the data from SSL enabled Kafka topic we are facing below error . Can you please help us on this issue .
19/11/07 13:26:54 INFO ConsumerFetcherManager: [ConsumerFetcherManager-1573151189884] Added fetcher for partitions ArrayBuffer()
19/11/07 13:26:54 WARN ConsumerFetcherManager$LeaderFinderThread: [spark-streaming-consumer_dvtcbddc101.corp.cox.com-1573151189725-d40a510f-leader-finder-thread], Failed to find leader for Set([inst_monitor_status_test,2], [inst_monitor_status_test,0], [inst_monitor_status_test,1])
java.lang.NullPointerException
at org.apache.kafka.common.utils.Utils.formatAddress(Utils.java:408)
at kafka.cluster.Broker.connectionString(Broker.scala:62)
at kafka.client.ClientUtils$$anonfun$fetchTopicMetadata$5.apply(ClientUtils.scala:89)
at kafka.client.ClientUtils$$anonfun$fetchTopicMetadata$5.apply(ClientUtils.scala:89)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:89)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Pyspark code :
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer
def handler(message):
records = message.collect()
for record in records:
print(record)
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
exit(-1)
sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 10)
zkQuorum, topic = sys.argv[1:]
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kvs.map(lambda x: x[1])
counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
counts.pprint()
kvs.foreachRDD(handler)
ssc.start()
ssc.awaitTermination()
Spark submit command :
Spark submit:
/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 dsstream2.py host:2181 inst_monitor_status_test
Thanks for your inputs . I have passed the SSL parameters in following method and working fine as expected.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
import time
# Spark Streaming context :
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"Clinet_id")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "password_rd") \
.option("kafka.sasl.kerberos.keytab","/home/path.keytab") \
.option("kafka.sasl.kerberos.principal","path") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
# Creating Writestream DataFrame :
df1.writeStream \
.option("path","target_directory") \
.format("csv") \
.option("checkpointLocation","chkpint_directory") \
.outputMode("append") \
.start()
ssc.awaitTermination()

Seek to Beginning of Kafka Topic Using PySpark

Using a Kafka Stream in PySpark, is it possible to seek to the beginning of a Kafka topic without creating a new consumer group?
For example, I have the following code snippet:
...
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 pyspark-shell'
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext('local[2]', appName="MyStreamingApp_01")
sc.setLogLevel("INFO")
ssc.StreamingContext(sc, 30)
spark = SparkSession(sc)
kafkaStream = KafkaUtils.createStream(ssc, zookeeper_ip, 'group-id', {'messages': 1})
counted = kafkaStream.count()
...
My goal is to do something along the lines of
kafkaStream.seekToBeginningOfTopic()
Currently, I'm creating a new consumer group to re-read from the beginning of the topic, e.g.:
kafkaStream = KafkaUtils.createStream(ssc, zookeeper, 'group-id-2', {'messages': 1}, {"auto.offset.reset": "smallest"})
Is this the proper way to consume a topic from the beginning using PySpark?

Kafka integration with spark

I want to setup a streaming application using Apache Kafka and Spark streaming. Kafka is running on a seperate unix machine version 0.9.0.1 and spark v1.6.1 is a part of a hadoop cluster.
I have started the zookeeper and kafka server and want to stream in messages from a log file using console producer and consumed by spark streaming application using direct method (no receivers). I have written code in python and executing using the below command:
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar streamingDirectKafka.py
getting below error:
/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o38.createDirectStreamWithoutMessageHandler.
: java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
Could you please help?
Thanks!!
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
conf = SparkConf().setAppName("StreamingDirectKafka")
sc = SparkContext(conf = conf)
ssc = StreamingContext(sc, 1)
topic = ['test']
kafkaParams = {"metadata.broker.list": "apsrd7102:9092"}
lines = (KafkaUtils.createDirectStream(ssc, topic, kafkaParams)
.map(lambda x: x[1]))
counts = (lines.flatMap(lambda line: line.split(" "))
.map(lambda word: (word, 1))
.reduceByKey(lambda a, b: a+b))
counts.pprint()
ssc.start()
ssc.awaitTermination()
Looks like you are using incompatible version of Kafka. From the documentation as of Spark 2.0 - Kafka 0.8.x is supported.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Resources