Spark Streaming - updateStateByKey and caching data - apache-spark

I have a problem using the updateStateByKey function and caching some big data at the same time. Here is an example.
Let's say I get data (lastname, age) from Kafka. I want to keep the actual age for every person, so I use updateStateByKey. I also want to know the name of every person, so I join the output with an external table (lastname, name), e.g. from Hive. Let's assume it's a really big table, so I don't want to load it in every batch. And that's where the problem is.
Everything works well when I load the table in every batch, but when I try to cache the table, the StreamingContext doesn't start. I also tried using registerTempTable and later joining the data with SQL, but I got the same error.
It seems the problem is the checkpoint required by updateStateByKey. When I remove updateStateByKey but leave the checkpoint I get the error, but when I remove both it works.
The error I'm getting: pastebin
Here is the code:
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# function to keep the actual state
def updateFunc(channel, actualChannel):
    if actualChannel is None or channel is not None:
        try:
            actualChannel = channel[-1]
        except Exception:
            pass
    if channel is None:
        channel = actualChannel
    return actualChannel

def splitFunc(row):
    row = row.strip()
    lname, age = row.split()
    return (lname, age)

def createContext(brokers, topics):
    # some conf
    conf = SparkConf().setAppName(appName).set("spark.streaming.stopGracefullyOnShutdown", "true") \
        .set("spark.dynamicAllocation.enabled", "false") \
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .set("spark.sql.shuffle.partitions", '100')
    # create SparkContext
    sc = SparkContext(conf=conf)
    # create HiveContext
    sqlContext = HiveContext(sc)
    # create StreamingContext
    ssc = StreamingContext(sc, 5)

    # read big_df and cache (does not work, the StreamingContext does not start)
    big_df = sqlContext.sql('select lastname,name from `default`.`names`')
    big_df.cache().show(10)

    # join table
    def joinTable(time, rdd):
        if rdd.isEmpty() == False:
            df = HiveContext.getOrCreate(SparkContext.getOrCreate()).createDataFrame(rdd, ['lname', 'age'])
            # read big_df (works)
            #big_df = HiveContext.getOrCreate(SparkContext.getOrCreate()).sql('select lastname,name from `default`.`names`')
            # join DMS
            df2 = df.join(big_df, df.lname == big_df.lastname, "left_outer")
            return df2.map(lambda row: row)

    # streaming
    kvs = KafkaUtils.createDirectStream(ssc, [topics], {'metadata.broker.list': brokers})
    kvs.map(lambda (k, v): splitFunc(v)).updateStateByKey(updateFunc).transform(joinTable).pprint()
    return ssc

if __name__ == "__main__":
    appName = "SparkCheckpointUpdateSate"
    if len(sys.argv) != 3:
        print("Usage: SparkCheckpointUpdateSate.py <broker_list> <topic>")
        exit(-1)
    brokers, topics = sys.argv[1:]
    # getOrCreate the StreamingContext from the checkpoint
    checkpoint = 'SparkCheckpoint/checkpoint'
    ssc = StreamingContext.getOrCreate(checkpoint, lambda: createContext(brokers, topics))
    # start streaming
    ssc.start()
    ssc.awaitTermination()
Can you tell me how to properly cache data when checkpointing is enabled? Maybe there is some workaround I don't know about.
Spark ver. 1.6

I got this working using a lazily instantiated global instance of big_df. Something similar is done in recoverable_network_wordcount.py.
def getBigDf():
    if 'bigdf' not in globals():
        globals()['bigdf'] = HiveContext.getOrCreate(SparkContext.getOrCreate()).sql('select lastname,name from `default`.`names`')
    return globals()['bigdf']

def createContext(brokers, topics):
    ...
    def joinTable(time, rdd):
        ...
        # read big_df (works)
        big_df = getBigDf()
        # join DMS
        df2 = df.join(big_df, df.lname == big_df.lastname, "left_outer")
        return df2.map(lambda row: row)
    ...
It seems that in streaming, all data must be cached inside the streaming processing, not before it.
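Since the original goal was also to cache the table, one possible refinement (an untested sketch, applying the same lazy-singleton idea from the snippet above; the names come from that snippet) is to cache the DataFrame inside the lazily created global, so it is both built and cached within the streaming processing:
def getBigDf():
    # create the DataFrame lazily, once per process, so it is not captured by the checkpoint
    if 'bigdf' not in globals():
        globals()['bigdf'] = HiveContext.getOrCreate(SparkContext.getOrCreate()) \
            .sql('select lastname,name from `default`.`names`') \
            .cache()  # keep the lookup table in memory across batches
    return globals()['bigdf']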

Related

pyspark called from an imported method that calls another method gives empty dataframe

I have a module called test_pyspark_module with the following code:
class SparkTest:
    def __init__(self, spark, sql):
        self.spark = spark
        self.sql = sql

    def fetch_data(self, sql_text):
        data = self.sql(sql_text).toPandas()
        spark = self.spark
        print(len(data))

    def call_fetch_data(self):
        sql_text = """
            SELECT *
            FROM
            <TABLENAME>
            WHERE date BETWEEN '${date-15}' and '${date-1}'
            and app_id=1233
        """
        return self.fetch_data(sql_text)


def fetch_data(sql, sql_text):
    data = sql(sql_text).toPandas()
    print(len(data))
I have a PySpark kernel running, and I have the following code in my Jupyter notebook:
from pyspark.sql import SQLContext
from pyspark import SparkContext
sqlContext = SQLContext(spark)
sql = sqlContext.sql
sql_text = """
SELECT *
FROM
<TABLENAME>
WHERE date BETWEEN '${date-15}' and '${date-1}'
and app_id=1233
"""
from test_pyspark_module import *
st = SparkTest(spark, sql)
Now when I run st.fetch_data() I get 43000. However, when I run st.call_fetch_data() I get 0.
I wanted to see if something was going wrong with the import, so I implemented a duplicate of SparkTest locally, calling it SparkTest2. However, this works as I expect, with both functions returning 43000.
class SparkTest2:
    def __init__(self, spark, sql):
        self.spark = spark
        self.sql = sql

    def fetch_data(self, sql_text):
        data = self.sql(sql_text).toPandas()
        print(len(data))

    def call_fetch_data(self):
        sql_text = """
            SELECT *
            FROM
            <TABLE_NAME>
            WHERE date BETWEEN '${date-15}' and '${date-1}'
            and app_id=1233
        """
        return self.fetch_data(sql_text)

st2 = SparkTest2(spark, sql)
st2.fetch_data(sql_text) gives output 43000, and st2.call_fetch_data() also gives output 43000.
So it seems that if I try to run a class method that calls another method, and the class is imported, it fails to give correct results. Note that there is no error or exception; I just get 0 rows (I do get the correct number of columns, i.e. 28).

How to use Prefect's resource manager with a spark cluster

I have been messing around with Prefect for workflow management, but got stuck with building up and breaking down a Spark session within Prefect's resource manager.
I browsed Prefect's docs, and an example with Dask is available:
from prefect import Flow, Parameter, resource_manager
from dask.distributed import Client

@resource_manager
class DaskCluster:
    def __init__(self, n_workers):
        self.n_workers = n_workers

    def setup(self):
        "Create a local dask cluster"
        return Client(n_workers=self.n_workers)

    def cleanup(self, client):
        "Cleanup the local dask cluster"
        client.close()

with Flow("example") as flow:
    n_workers = Parameter("n_workers")
    with DaskCluster(n_workers=n_workers) as client:
        some_task(client)
        some_other_task(client)
However, I couldn't work out how to do the same with a Spark session.
The simplest way to do this is with Spark in local mode:
from prefect import task, Flow, resource_manager
from pyspark import SparkConf
from pyspark.sql import SparkSession

@resource_manager
class SparkCluster:
    def __init__(self, conf: SparkConf = SparkConf()):
        self.conf = conf

    def setup(self) -> SparkSession:
        return SparkSession.builder.config(conf=self.conf).getOrCreate()

    def cleanup(self, spark: SparkSession):
        spark.stop()

@task
def get_data(spark: SparkSession):
    return spark.createDataFrame([('look',), ('spark',), ('tutorial',), ('spark',), ('look',), ('python',)], ['word'])

@task(log_stdout=True)
def analyze(df):
    word_count = df.groupBy('word').count()
    word_count.show()

with Flow("spark_flow") as flow:
    conf = SparkConf().setMaster('local[*]')
    with SparkCluster(conf) as spark:
        df = get_data(spark)
        analyze(df)

if __name__ == '__main__':
    flow.run()
Your setup() method returns the resource being managed and the cleanup() method accepts the same resource returned by setup(). In this case, we create and return a Spark session, and then stop it. You don't need spark-submit or anything (though I find it a bit harder managing dependencies this way).
Scaling it up gets harder and is something I'm still working on figuring out. For example, Prefect won't know how to serialize Spark DataFrames for output caching or persisting results. Also, you have to be careful about using the Dask executor with Spark sessions because they can't be pickled, so you have to set the executor to use scheduler='threads' (see here).
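For the pickling issue specifically, here is a minimal sketch (assuming Prefect 1.x, where LocalDaskExecutor lives in prefect.executors and flow.run accepts an executor argument) of running the flow above with a thread-based executor so the SparkSession never has to be pickled:
from prefect.executors import LocalDaskExecutor

if __name__ == '__main__':
    # threads share the driver process, so the non-picklable SparkSession can be passed between tasks
    flow.run(executor=LocalDaskExecutor(scheduler="threads"))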

PYSPARK: Why am I getting Key error while reading from kafka broker through pyspark?

I am reading the Twitter stream from my Kafka topic, and while converting it to JSON in my PySpark code, data goes missing.
The code below reads the Twitter stream from the Kafka topic and converts it to JSON.
When accessing tweet['user'] I get a key error (indices must be integers), and tweet[0] gives the first character of the message.
from __future__ import print_function
import sys
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: json.loads(x[1]))
    status = lines.map(lambda tweets: tweets['user']['screen_name'])
    status.pprint()
    #status.map(lambda tweet: tweet['created_at']).pprint()
    #counts = lines.flatMap(lambda line: line.split(" ")) \
    #    .filter(lambda word: word.lower().startswith('#')) \
    #    .map(lambda word: (word.lower(), 1)) \
    #    .reduceByKey(lambda a, b: a+b)
    #counts.pprint()

    ssc.start()
    ssc.awaitTermination()
This is the output I get after converting the Kafka message to JSON:
{u'quote_count': 0, u'contributors': None, u'truncated': False, u'text': u'RT #hotteaclout: #TeenChoiceFOX my #TeenChoice vote for #ChoiceActionMovieActor is Chris Evans', u'is_quote_status': False, u'in_reply_to_status_id': None, u'reply_count': 0, u'id': 1149313606304976896, .....}
...
The actual message is:
{"created_at":"Thu Jul 11 13:44:55 +0000 2019","id":1149313623363338241,"id_str":"1149313623363338241","text":"RT #alisonpool_: Legit thought this was Mike Wazowski for a second LMFAO https://t.co/DMzMtOfW2I","source":"\u003ca href=\"http://twitter.com/download/iphone\" ....}
OK, I solved it. It was a problem with encoding. Just
json.loads(tweets.encode('utf-8'))
would not work; we need to set a default encoding for the script so that everything it calls applies the same encoding.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Add the above code at the top of the script.
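A minimal Python 2 sketch of where this fix sits in the script above (sys.setdefaultencoding does not exist in Python 3; the topic name and broker address here are placeholders):
from __future__ import print_function
import sys
import json

reload(sys)                      # re-expose setdefaultencoding, which site.py removes
sys.setdefaultencoding('utf-8')  # make utf-8 the default codec for str/unicode conversions

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
ssc = StreamingContext(sc, 2)
kvs = KafkaUtils.createDirectStream(ssc, ["tweets"], {"metadata.broker.list": "localhost:9092"})

# with the default encoding set, each payload parses into a dict and keys can be accessed by name
lines = kvs.map(lambda x: json.loads(x[1]))
lines.map(lambda tweet: tweet['user']['screen_name']).pprint()

ssc.start()
ssc.awaitTermination()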

PySpark Streaming + Kafka Word Count not printing any results

This is my first interaction with Kafka and Spark Streaming, and I am trying to run the WordCount script given below. The script is pretty standard, as given in many online blogs, but for whatever reason Spark Streaming is not printing the word counts. It does not throw any error; it just does not display the counts.
I have tested the topic via the console consumer, and there the messages show up correctly. I even tried to use foreachRDD to see the lines coming in, and that also does not show anything.
Thanks in advance!
Versions: kafka_2.11-0.8.2.2, Spark 2.2.1, spark-streaming-kafka-0-8-assembly_2.11-2.2.1
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.context import SQLContext

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
sc.setCheckpointDir('c:\Playground\spark\logs')
ssc = StreamingContext(sc, 10)
ssc.checkpoint('c:\Playground\spark\logs')

zkQuorum, topic = sys.argv[1:]
print(str(zkQuorum))
print(str(topic))

kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kvs.map(lambda x: x[1])
print(kvs)

counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a+b)
counts.pprint(num=10)

ssc.start()
ssc.awaitTermination()
Producer Code:
import sys, os
from kafka import KafkaProducer
from kafka.errors import KafkaError
import time

producer = KafkaProducer(bootstrap_servers="localhost:9092")
topic = "KafkaSparkWordCount"

def read_file(fileName):
    with open(fileName) as f:
        print('started reading...')
        contents = f.readlines()
        for content in contents:
            future = producer.send(topic, content.encode('utf-8'))
            try:
                future.get(timeout=10)
            except KafkaError as e:
                print(e)
                break
            print('.', end='', flush=True)
            time.sleep(0.2)
    print('done')

if __name__ == '__main__':
    read_file('C:\\\PlayGround\\spark\\BookText.txt')
How many cores do you use?
Spark Streaming needs at least two cores: one for the receiver and one for the processing.
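A minimal sketch of what that means in practice, assuming the job runs in local mode: give it at least two cores, either in code as below or via spark-submit --master local[2].
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# local[2] reserves one core for the Kafka receiver and leaves one for processing the batches
conf = SparkConf().setAppName("PythonStreamingKafkaWordCount").setMaster("local[2]")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)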

pyspark streaming from kinesis kills heap

I am running a sample application that streams data from Kinesis. I don't get why this application uses so much heap and crashes.
Here is the code:
from __future__ import print_function
import sys
from datetime import datetime
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
from pyspark.sql.session import SparkSession

# function declaration
def isDfEmpty(df):
    try:
        if not df.take(1):
            return True
    except Exception as e:
        return True
    return False

# function declaration
def mergeTable(df):
    print("b:mergeTable")
    print(str(datetime.now()))
    try:
        global refDf
        if isDfEmpty(df):
            print("no record, waiting !")
        else:
            if isDfEmpty(refDf):
                refDf = df
            else:
                print(" before count %s" % refDf.count())
                refDf = df.unionAll(refDf)
                print(" after count %s" % refDf.count())
    except Exception as e:
        print(e)
    print(str(datetime.now()))
    print("e:mergeTable")

# function declaration
def doWork(df):
    print("b:doWork")
    print(str(datetime.now()))
    try:
        mergeTable(df)
    except Exception as e:
        print(e)
    print(str(datetime.now()))
    print("e:doWork")

# function declaration
def sensorFilter(sensorType, rdd):
    df = spark.read.json(rdd.filter(lambda x: sensorType in x))
    doWork(df)

def printRecord(rdd):
    print("========================================================")
    print("Starting new RDD")
    print("========================================================")
    sensorFilter("SensorData", rdd)

refDf = None

if __name__ == "__main__":
    reload(sys)
    # sys.setdefaultencoding('utf-8')
    if len(sys.argv) != 5:
        print("Usage: dump.py <app-name> <stream-name> <endpoint-url> <region-name>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sc = spark.sparkContext
    # sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl")
    ssc = StreamingContext(sc, 10)

    appName, streamName, endpointUrl, regionName = sys.argv[1:]
    dstream = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 10)
    dstream.foreachRDD(printRecord)

    ssc.start()
    ssc.awaitTermination()
After a while the Spark application slows down due to heap usage, but when I comment out the lines below, heap usage decreases to normal levels (according to the Spark UI):
print(" before count %s" % refDf.count())
print(" after count %s" % refDf.count())
I am really new to PySpark and am trying to understand what is going on.
Continuously merging the data frame may of course explode the memory, but the heap problem occurs right at the beginning.
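For illustration only, here is a self-contained sketch (hypothetical data, local mode) of the merge pattern used above: because nothing is cached, every count() re-evaluates the whole chain of unionAll results accumulated so far, so the work and the query plan grow with each batch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

refDf = None
for batch in range(3):
    df = spark.createDataFrame([(batch, "SensorData")], ["batch", "type"])
    if refDf is None:
        refDf = df
    else:
        print(" before count %s" % refDf.count())  # evaluates the full union chain built so far
        refDf = df.unionAll(refDf)                 # the logical plan keeps growing batch after batch
        print(" after count %s" % refDf.count())   # evaluates the (now longer) chain again

spark.stop()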
EDIT
Environment: tried on a single Ubuntu machine and on a CentOS VM hosted by macOS; nothing changed.
