I deleted the checkpoint directory for my spark stream.
Now, there are no errors, but the stream doesn't pick up any files.
How can I fix my stupid mistake? :)
I have tried creating a new checkpoint directory and changing the query name, but it hasn't helped.
Below is the code I have implemented.
I don't understand why it doesn't just create a new directory.
Code:
#!/usr/bin/env python
# nohup spark-submit --master local --driver-memory 1g --executor-memory 1g streaming_log_monitor.py >streammon.log 2>stderr.log &
import sys
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, input_file_name
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.streaming import StreamingContext

# Python 2 only: make log lines decode as UTF-8 by default
reload(sys)
sys.setdefaultencoding('utf-8')
now = datetime.now()

# create a session that supports Hive
def create_session(appname):
    spark_session = SparkSession \
        .builder \
        .appName(appname) \
        .master('local') \
        .enableHiveSupport() \
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('streaming_monitor')
    ssc = StreamingContext(spark_session.sparkContext, 1)  # not actually used by the structured stream below
    print('start')
    print(datetime.now())

    myschema = StructType([
        StructField('text', StringType())
    ])

    # only pick up files that land after the stream starts
    df = spark_session \
        .readStream \
        .option('newFilesOnly', 'true') \
        .option('header', 'true') \
        .schema(myschema) \
        .text('file:///home/keenek1/analytics/logs/') \
        .withColumn("FileName", input_file_name())
    # map a log line to a friendly error label
    def errorcapture(text):
        try:
            text = str(text).lower()
            if 'cannot obtain block length for locatedblock' in text:
                return 'error: Cannot obtain block length for LocatedBlock'
            elif 'outofmemoryerror' in text:
                return 'error: OutOfMemoryError'
            elif 'gc overhead limit exceeded' in text:
                return 'error: OutOfMemoryError (GC Overhead Limit Exceeded)'
            elif 'o3' in text:
                return 'error o3: an UnsupportedEncodingException occurred when setting up stdout and stderr streams.'
            elif 'o10' in text:
                return 'error o10: an uncaught exception occurred'
            elif 'o11' in text:
                return 'error o11: more than spark.yarn.scheduler.reporterThread.maxFailures executor failures occurred'
            elif 'o13' in text:
                return 'error o13: the program terminated before the user had initialized the spark context or if the spark context did not initialize before a timeout.'
            elif 'o14' in text:
                return 'error o14: This is declared as EXIT_SECURITY but never used'
            elif 'o15' in text:
                return 'error o15: a user class threw an exception'
            elif 'o16' in text:
                return 'error o16: the shutdown hook called before final status was reported.'
            elif 'o52' in text:
                return 'error o52: The default uncaught exception handler was reached, and the uncaught exception was an OutOfMemoryError'
            elif 'o53' in text:
                return 'error o53: DiskStore failed to create local temporary directory after many attempts (bad spark.local.dir?)'
            elif 'o54' in text:
                return 'error o54: ExternalBlockStore failed to initialize after many attempts'
            elif 'o55' in text:
                return 'error o55: ExternalBlockStore failed to create a local temporary directory after many attempts'
            elif 'o56' in text:
                return 'error o56: Executor is unable to send heartbeats to the driver more than "spark.executor.heartbeat.maxFailures" times.'
            elif 'array index out of bounds' in text:
                return 'error Array Index Out of Bounds'
            elif 'string index out of bounds' in text:
                return 'error String Index Out of Bounds'
            elif 'error' in text:
                return 'Unidentified Error'
            else:
                return 'Success'
        except (AttributeError, UnicodeEncodeError):
            return text
    # suggest a fix for the identified error
    def errorsolution(text):
        try:
            text = str(text).lower()
            if 'cannot obtain block length for locatedblock' in text:
                return 'Find and resolve block issue'
            elif 'outofmemoryerror' in text:
                return 'Increase memory limit using --driver-memory 10g --executor-memory 10g in spark-submit'
            elif 'gc overhead limit exceeded' in text:
                return 'Increase memory limit using --driver-memory 10g --executor-memory 10g in spark-submit'
            elif 'total size of serialized results of' in text:
                return 'use this parameter in Spark-Submit --conf spark.driver.maxResultSize=0'
            else:
                return 'Unknown Solution'
        except (AttributeError, UnicodeEncodeError):
            return text
    udfdict = udf(errorcapture, StringType())
    errorsolutionudf = udf(errorsolution, StringType())

    df = df.withColumn('did_it_error', udfdict(df.text))
    df = df.withColumn('solution', errorsolutionudf(df.text))

    now = datetime.now()
    df.createOrReplaceTempView('log')
    hive_dump = spark_session.sql(
        "select '" + str(now) + "' as timestamp, FileName, did_it_error, solution, text from log")

    output = hive_dump \
        .writeStream \
        .format("csv") \
        .queryName('logsmonitor') \
        .option("checkpointLocation", "file:///home/keenek1/analytics/logs/chkpoint_dir") \
        .start('/user/hive/warehouse/design.db/streaming_log_monitor') \
        .awaitTermination()
Check whether trash is enabled on your cluster. If it is, a deleted file or directory is moved to the .Trash directory in the user's home directory instead of being removed outright.
Try the path below to check whether your checkpoint directory is still available:
hdfs://<host>/user/<username>/.Trash/Current/<your_checkpoint_path>
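If the checkpoint directory is sitting in the trash, you can list it and move it back. Below is a minimal sketch of my own (not part of the original answer); it assumes the hdfs client is on the PATH, and the placeholder path must be replaced with your own username and checkpoint path.

import subprocess

# placeholders: substitute your own username and checkpoint path
trash_path = "/user/<username>/.Trash/Current/<your_checkpoint_path>"

# list the trash entry to confirm the checkpoint survived the delete
subprocess.call(["hdfs", "dfs", "-ls", trash_path])

# if it is there, move it back to where the streaming query expects it
# subprocess.call(["hdfs", "dfs", "-mv", trash_path, "<your_checkpoint_path>"])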
Related
I am reading the Twitter stream from my Kafka topic and converting it to JSON in PySpark code, but data goes missing.
Code is provided below.
The code reads the Twitter stream from the Kafka topic and converts it to JSON format.
When accessing tweet['user'] I get an error (indices must be integers), and tweet[0] returns the first character of the message.
from __future__ import print_function
import sys
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

    lines = kvs.map(lambda x: json.loads(x[1]))
    status = lines.map(lambda tweets: tweets['user']['screen_name'])
    status.pprint()
    # status.map(lambda tweet: tweet['created_at']).pprint()

    # counts = lines.flatMap(lambda line: line.split(" ")) \
    #     .filter(lambda word: word.lower().startswith('#')) \
    #     .map(lambda word: (word.lower(), 1)) \
    #     .reduceByKey(lambda a, b: a + b)
    # counts.pprint()

    ssc.start()
    ssc.awaitTermination()
I get this output after converting the Kafka message to JSON:
{u'quote_count': 0, u'contributors': None, u'truncated': False, u'text': u'RT #hotteaclout: #TeenChoiceFOX my #TeenChoice vote for #ChoiceActionMovieActor is Chris Evans', u'is_quote_status': False, u'in_reply_to_status_id': None, u'reply_count': 0, u'id': 1149313606304976896, .....}
...
The actual message is:
{"created_at":"Thu Jul 11 13:44:55 +0000 2019","id":1149313623363338241,"id_str":"1149313623363338241","text":"RT #alisonpool_: Legit thought this was Mike Wazowski for a second LMFAO https://t.co/DMzMtOfW2I","source":"\u003ca href=\"http://twitter.com/download/iphone\" ....}
OK, I solved it. It was an encoding problem. Just
json.loads(tweets.encode('utf-8'))
would not work; we need to set the default encoding for the script so that everything it calls applies the same encoding.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Add the code above at the top of the script (note that reload(sys) and sys.setdefaultencoding only exist in Python 2).
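If you would rather not change the process-wide default encoding, an alternative sketch (my own suggestion, reusing the kvs stream from the question) is to decode each Kafka message value explicitly before parsing it:

import json

# x[1] is the message value; decode it only if it arrives as bytes
lines = kvs.map(lambda x: json.loads(x[1].decode('utf-8') if isinstance(x[1], bytes) else x[1]))
status = lines.map(lambda tweet: tweet['user']['screen_name'])
status.pprint()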
I am streaming Twitter data through Kafka. I managed to stream the data and consume the Twitter JSON, but how do I create a PySpark DataFrame containing both the Twitter data and the search keyword?
Below is how I write the Kafka producer.
I managed to create a DataFrame of the data I want from the Twitter object, but I don't know how to get the search keyword.
import sys
import time

from kafka import KafkaProducer
from pyspark.sql import SparkSession
from tweepy import StreamListener, Stream, OAuthHandler

# Twitter credentials (consumer_key, consumer_secret, access_token, access_token_secret)
# are defined elsewhere and omitted here.

class StdOutListener(StreamListener):
    def __init__(self, producer):
        self.producer_obj = producer

    # on_data is activated whenever a tweet has been heard
    def on_data(self, data):
        try:
            self.producer_obj.send("twitterstreamingdata", data.encode('utf-8'))
            print(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    # When an error occurs
    def on_error(self, status):
        print(status)
        return True

    # When we reach the rate limit
    def on_limit(self, track):
        # Print rate limiting error
        print("Rate limited, continuing")
        # Continue mining tweets
        return True

    # When timed out
    def on_timeout(self):
        # Print timeout message
        print('Timeout...', file=sys.stderr)
        # Wait before continuing
        time.sleep(120)
        return True  # To continue listening

    def on_disconnect(self, notice):
        # Called when Twitter sends a disconnect notice
        return

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("Kafka Producer Application") \
        .getOrCreate()

    # This is the initialization of the Kafka producer
    producer = KafkaProducer(bootstrap_servers='xx.xxx.xxx.xxx:9092')

    # This handles Twitter auth and the connection to the Twitter streaming API
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, StdOutListener(producer))

    print("Kafka Producer Application: ")
    WORDS = input("Enter any words: ")
    print("Is this what you just said?", WORDS)
    word = [u for u in WORDS.split(',')]

    # This line filters the Twitter stream to capture data by keywords
    stream.filter(track=word)
One way to resolve your problem is to change the StdOutListener constructor to receive a "keyword" parameter and add the keyword to the JSON in on_data before sending it to Kafka:
import json
import sys
import time

from kafka import KafkaProducer
from pyspark.sql import SparkSession
from tweepy import StreamListener, Stream, OAuthHandler

class StdOutListener(StreamListener):
    def __init__(self, producer: KafkaProducer = None, keyword=None):
        super(StdOutListener, self).__init__()
        self.producer = producer
        self.keyword = keyword

    # on_data is activated whenever a tweet has been heard
    def on_data(self, data):
        try:
            data = json.loads(data)
            data['keyword'] = self.keyword
            data = json.dumps(data)
            self.producer.send("twitterstreamingdata", data.encode('utf-8'))
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    # When an error occurs
    def on_error(self, status):
        print(status)
        return True

    # When we reach the rate limit
    def on_limit(self, track):
        # Print rate limiting error
        print("Rate limited, continuing")
        # Continue mining tweets
        return True

    # When timed out
    def on_timeout(self):
        # Print timeout message
        print('Timeout...', file=sys.stderr)
        # Wait before continuing
        time.sleep(120)
        return True  # To continue listening

    def on_disconnect(self, notice):
        # Called when Twitter sends a disconnect notice
        return

if __name__ == '__main__':
    CONSUMER_KEY = 'YOUR CONSUMER KEY'
    CONSUMER_SECRET = 'YOUR CONSUMER SECRET'
    ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
    ACCESS_SECRET = 'YOUR ACCESS SECRET'

    print("Kafka Producer Application: ")
    words = input("Enter any words: ")
    print("Is this what you just said?", words)
    word = [u for u in words.split(',')]

    spark = SparkSession \
        .builder \
        .appName("Kafka Producer Application") \
        .getOrCreate()

    # This is the initialization of the Kafka producer
    kafka_producer = KafkaProducer(bootstrap_servers='35.240.157.219:9092')

    # This handles Twitter auth and the connection to the Twitter streaming API
    auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = Stream(auth, StdOutListener(producer=kafka_producer, keyword=word))

    stream.filter(track=word)
Hope it helps you!
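On the consumer side, which the answer above does not cover, here is a hedged sketch of how you could then build a DataFrame that includes the keyword. Assumptions on my part: Structured Streaming with the spark-sql-kafka-0-10 package on the classpath, and a placeholder broker address.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("Kafka Consumer Application").getOrCreate()

raw = (spark
       .readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
       .option("subscribe", "twitterstreamingdata")
       .load()
       .selectExpr("CAST(value AS STRING) AS json"))

# pull out the fields you want, including the keyword the producer attached
tweets = raw.select(
    get_json_object(col("json"), "$.text").alias("text"),
    get_json_object(col("json"), "$.user.screen_name").alias("screen_name"),
    get_json_object(col("json"), "$.keyword").alias("keyword"))

query = tweets.writeStream.format("console").start()
query.awaitTermination()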
This is my first interaction with Kafka and Spark Streaming, and I am trying to run the WordCount script given below. The script is pretty standard, as given in many online blogs, but for whatever reason Spark Streaming is not printing the word counts. It does not throw any error; it just does not display the counts.
I have tested the topic via the console consumer, and the messages show up correctly there. I even tried foreachRDD to see the lines coming in, and that also shows nothing.
Thanks in advance!
Versions: kafka_2.11-0.8.2.2, Spark 2.2.1, spark-streaming-kafka-0-8-assembly_2.11-2.2.1
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.context import SQLContext
sc = SparkContext(appName="PythonStreamingKafkaWordCount")
sc.setCheckpointDir('c:\Playground\spark\logs')
ssc = StreamingContext(sc, 10)
ssc.checkpoint('c:\Playground\spark\logs')
zkQuorum, topic = sys.argv[1:]
print(str(zkQuorum))
print(str(topic))
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kvs.map(lambda x: x[1])
print(kvs)
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint(num=10)
ssc.start()
ssc.awaitTermination()
Producer Code:
import time

from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(bootstrap_servers="localhost:9092")
topic = "KafkaSparkWordCount"

def read_file(fileName):
    with open(fileName) as f:
        print('started reading...')
        contents = f.readlines()
        for content in contents:
            future = producer.send(topic, content.encode('utf-8'))
            try:
                future.get(timeout=10)
            except KafkaError as e:
                print(e)
                break
            print('.', end='', flush=True)
            time.sleep(0.2)
        print('done')

if __name__ == '__main__':
    read_file('C:\\PlayGround\\spark\\BookText.txt')
How many cores do you use?
Spark Streaming needs at least two cores: one for the receiver and one for processing.
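For example, with a local master (a minimal sketch of the point above, not the asker's exact setup):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# "local" gives the application a single thread, which the receiver occupies entirely;
# "local[2]" (or "local[*]") leaves at least one thread free to process the batches
sc = SparkContext(master="local[2]", appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 10)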
I am running a sample application that streams data from Kinesis, and I don't understand why it uses so much heap and crashes.
Here is the code:
from __future__ import print_function
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
from pyspark.sql.session import SparkSession
from datetime import datetime
# returns True if the DataFrame has no rows (or cannot be read)
def isDfEmpty(df):
    try:
        if not df.take(1):
            return True
    except Exception as e:
        return True
    return False

# union each micro-batch into the global refDf
def mergeTable(df):
    print("b:mergeTable")
    print(str(datetime.now()))
    try:
        global refDf
        if isDfEmpty(df):
            print("no record, waiting !")
        else:
            if isDfEmpty(refDf):
                refDf = df
            else:
                print(" before count %s" % refDf.count())
                refDf = df.unionAll(refDf)
                print(" after count %s" % refDf.count())
    except Exception as e:
        print(e)
    print(str(datetime.now()))
    print("e:mergeTable")

# per-batch processing entry point
def doWork(df):
    print("b:doWork")
    print(str(datetime.now()))
    try:
        mergeTable(df)
    except Exception as e:
        print(e)
    print(str(datetime.now()))
    print("e:doWork")

# read the records of one sensor type from the RDD into a DataFrame
def sensorFilter(sensorType, rdd):
    df = spark.read.json(rdd.filter(lambda x: sensorType in x))
    doWork(df)

def printRecord(rdd):
    print("========================================================")
    print("Starting new RDD")
    print("========================================================")
    sensorFilter("SensorData", rdd)

refDf = None

if __name__ == "__main__":
    reload(sys)
    # sys.setdefaultencoding('utf-8')
    if len(sys.argv) != 5:
        print("Usage: dump.py <app-name> <stream-name> <endpoint-url> <region-name>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sc = spark.sparkContext
    # sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl")
    ssc = StreamingContext(sc, 10)

    appName, streamName, endpointUrl, regionName = sys.argv[1:]
    dstream = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 10)
    dstream.foreachRDD(printRecord)

    ssc.start()
    ssc.awaitTermination()
After a while the Spark application slows down due to heap usage, but when I comment out the lines below, heap usage drops back to normal levels (according to the Spark UI):
print(" before count %s" % refDf.count())
print(" after count %s" % refDf.count())
I am really new to PySpark and trying to understand what is going on.
Merging DataFrames continuously may of course blow up memory eventually, but the heap problem shows up from the very beginning.
EDIT
Environment: tried on a single Ubuntu machine and on a CentOS VM hosted by macOS; nothing changed.
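One thing worth checking, offered only as a hedged sketch and not something from the original thread: every unionAll makes the refDf plan longer, and each count() re-evaluates that whole chain, so driver memory can climb even in the first few batches. Persisting the merged result and releasing the previous generation keeps a single materialized copy around, roughly like this:

def mergeTable(df):
    global refDf
    if isDfEmpty(df):
        print("no record, waiting !")
        return
    if isDfEmpty(refDf):
        refDf = df
        return
    merged = df.unionAll(refDf)
    merged.persist()   # materialize the union once
    merged.count()     # force evaluation while the previous refDf is still available
    refDf.unpersist()  # drop the previous generation's cached blocks (no-op if not cached)
    refDf = merged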
I'm new to Spark and MQTT. I'm trying to run the script below, which uses MQTTUtils and which I got online, named wordcount.py:
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print >> sys.stderr, "Usage: mqtt_wordcount.py <broker url> <topic>"
        exit(-1)

    sc = SparkContext(appName="PythonStreamingMQTTWordCount")
    ssc = StreamingContext(sc, 1)

    brokerUrl = sys.argv[1]
    topic = sys.argv[2]

    lines = MQTTUtils.createStream(ssc, brokerUrl, topic)
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
I followed the instructions to install the Mosquitto broker (it's working), downloaded spark-streaming-mqtt-assembly_2.11-1.6.2.jar, and ran the Python script with this command:
~$ spark-submit --jars spark-streaming-mqtt-assembly_*.jar wordcount.py
but the error shown:
from pyspark.streaming.mqtt import MQTTUtils
ImportError: No module named mqtt
Did I miss anything here?
Thank you
For Spark 2.x you can use MQTT with Structured Streaming by including the Apache Bahir jar.
From PySpark, connect to the MQTT broker like this:
(spark
.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("topic","mytopic")
.load("tcp://{}".format(broker_uri)))