Convert Spark Structured DataFrame to Pandas using pandas_udf - apache-spark

I need to read CSV files as a stream and then convert them to a pandas DataFrame. Here is what I have done so far:
DataShema = StructType([ StructField("TimeStamp", LongType(), True), \
                         StructField("Count", IntegerType(), True), \
                         StructField("Reading", FloatType(), True) ])

group_columns = ['TimeStamp','Count','Reading']

@pandas_udf(DataShema, PandasUDFType.GROUPED_MAP)
def get_pdf(pdf):
    return pd.DataFrame([pdf[group_columns]],columns=[group_columns])

# getting Surge data from the files
SrgDF = spark \
    .readStream \
    .schema(DataShema) \
    .csv("ProcessdedData/SurgeAcc")

mydf = SrgDF.groupby(group_columns).apply(get_pdf)

qrySrg = SrgDF \
    .writeStream \
    .format("console") \
    .start() \
    .awaitTermination()
I gather from another source (Convert Spark Structure Streaming DataFrames to Pandas DataFrame) that converting a structured streaming DataFrame to pandas is not directly possible, and it seems that pandas_udf is the right approach, but I cannot figure out exactly how to achieve this. I need the pandas DataFrame to pass into my functions.
Edit
when I run the code (changing the query to mydf rather than SrgDF) then I get the following error: pyspark.sql.utils.StreamingQueryException: 'Writing job aborted.\n=== Streaming Query ===\nIdentifier: [id = 18a15e9e-9762-4464-b6d1-cb2db8d0ac41, runId = e3da131e-00d1-4fed-82fc-65bf377c3f99]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {FileStreamSource[file:/home/mls5/Work_Research/Codes/Misc/Python/MachineLearning_ArtificialIntelligence/00_Examples/01_ApacheSpark/01_ComfortApp/ProcessdedData/SurgeAcc]: {"logOffset":0}}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nFlatMapGroupsInPandas [Count#1], get_pdf(TimeStamp#0L, Count#1, Reading#2), [TimeStamp#10L, Count#11, Reading#12]\n+- Project [Count#1, TimeStamp#0L, Count#1, Reading#2]\n +- StreamingExecutionRelation FileStreamSource[file:/home/mls5/Work_Research/Codes/Misc/Python/MachineLearning_ArtificialIntelligence/00_Examples/01_ApacheSpark/01_ComfortApp/ProcessdedData/SurgeAcc], [TimeStamp#0L, Count#1, Reading#2]\n'
19/05/20 18:32:29 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
/usr/local/lib/python3.6/dist-packages/pyarrow/__init__.py:152: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use ".
EDIT-2
Here is the code to reproduce the error
import sys
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession, SQLContext
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.streaming import StreamingContext
from pyspark.sql.types import *
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pyarrow as pa
import glob

#####################################################################################

if __name__ == '__main__':

    spark = SparkSession \
        .builder \
        .appName("RealTimeIMUAnalysis") \
        .getOrCreate()

    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # reduce verbosity
    sc = spark.sparkContext
    sc.setLogLevel("ERROR")

    ##############################################################################
    # using the saved files to do the Analysis
    DataShema = StructType([ StructField("TimeStamp", LongType(), True), \
                             StructField("Count", IntegerType(), True), \
                             StructField("Reading", FloatType(), True) ])

    group_columns = ['TimeStamp','Count','Reading']

    @pandas_udf(DataShema, PandasUDFType.GROUPED_MAP)
    def get_pdf(pdf):
        return pd.DataFrame([pdf[group_columns]],columns=[group_columns])

    # getting Surge data from the files
    SrgDF = spark \
        .readStream \
        .schema(DataShema) \
        .csv("SurgeAcc")

    mydf = SrgDF.groupby('Count').apply(get_pdf)
    #print(mydf)

    qrySrg = mydf \
        .writeStream \
        .format("console") \
        .start() \
        .awaitTermination()
To run it, create a folder named SurgeAcc next to the script and put a CSV file inside with the following format:
TimeStamp,Count,Reading
1557011317299,45148,-0.015494
1557011317299,45153,-0.015963
1557011319511,45201,-0.015494
1557011319511,45221,-0.015494
1557011315134,45092,-0.015494
1557011315135,45107,-0.014085
1557011317299,45158,-0.015963
1557011317299,45163,-0.015494
1557011317299,45168,-0.015024

The DataFrame returned by your pandas_udf does not match the schema specified in the decorator.
Note that the input to the pandas_udf is a pandas DataFrame, and it must also return a pandas DataFrame.
You can use all pandas functions inside the pandas_udf; the only thing you have to make sure of is that ReturnDataShema matches the actual output of the function.
ReturnDataShema = StructType([StructField("TimeStamp", LongType(), True), \
                              StructField("Count", IntegerType(), True), \
                              StructField("Reading", FloatType(), True), \
                              StructField("TotalCount", FloatType(), True)])

@pandas_udf(ReturnDataShema, PandasUDFType.GROUPED_MAP)
def get_pdf(pdf):
    # The following stmt is causing the schema mismatch:
    # return pd.DataFrame([pdf[group_columns]],columns=[group_columns])

    # If you want to return all the rows of the pandas dataframe, you can simply:
    # return pdf

    # If you want to do any aggregation, you can do it like below, or use a pandas query,
    # but make sure the returned pandas dataframe complies with ReturnDataShema
    total_count = pdf['Count'].sum()
    return pd.DataFrame([(pdf.TimeStamp[0], pdf.Count[0], pdf.Reading[0], total_count)])

Related

How to call avro SchemaConverters in Pyspark

Although PySpark has Avro support, it does not expose the SchemaConverters method. I may be able to use Py4J to accomplish this, but I have never used a Java package from within Python.
This is the code I am using:
# Import SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def _test():
    # Create SparkSession
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("sparvro") \
        .getOrCreate()
    avroSchema = sc._jvm.org.apache.spark.sql.avro.SchemaConverters.toAvroType(StructType([ StructField("firstname", StringType(), True)]))

if __name__ == "__main__":
    _test()
however, I keep getting this error
AttributeError: 'StructField' object has no attribute '_get_object_id'
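The AttributeError comes from handing the Python StructType object straight to the JVM: Py4J can only pass JavaObject proxies, not PySpark's Python-side schema classes. One possible direction, shown as an untested sketch (it assumes Spark 2.4+ with the external spark-avro package on the classpath), is to rebuild the schema on the JVM side from its JSON representation before calling SchemaConverters:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").appName("sparvro").getOrCreate()
sc = spark.sparkContext

py_schema = StructType([StructField("firstname", StringType(), True)])

# Build the equivalent JVM StructType from the schema's JSON form, then hand
# that JVM object to SchemaConverters. All four arguments of toAvroType are
# passed explicitly because Scala default arguments are not visible over Py4J.
jvm_schema = sc._jvm.org.apache.spark.sql.types.DataType.fromJson(py_schema.json())
avro_type = sc._jvm.org.apache.spark.sql.avro.SchemaConverters.toAvroType(
    jvm_schema, False, "topLevelRecord", "")
print(avro_type.toString())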

Pyspark Streaming with checkpointing is failing

I am working on Spark streaming data received through a custom receiver, using PySpark. To make it fault tolerant I enabled checkpointing. Since then, the code that was running fine before checkpointing was introduced is throwing an error.
Error message:
pubsubStream.flatMap(lambda x : x).map(lambda x: convertjson(x)).foreachRDD(lambda rdd : dstream_to_rdd(rdd))
File "/home/test/spark_checkpointing/spark_checkpoint_test.py", line 227, in dstream_to_rdd
df = spark_session.read.option("multiline","true")\
NameError: name 'sparkContext' is not defined
The code is as below:
import sys
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pubsub import PubsubUtils
import json
import time
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType, FloatType, LongType, BooleanType)
from google.cloud import storage
import pyspark

conf_bucket_name = <bucket_name>

# Events list
events_list = ["Event1","Event2"]

# This chunk of schema creation will be automated later
# and most probably moved outside
full_schema = StructType([
    StructField('_id', StructType([
        StructField('_data', StringType(), True)
    ])),
    StructField('ct', StructType([
        StructField('$timestamp', StructType([
            StructField('i', LongType(), True),
            StructField('t', LongType(), True),
        ]))])),
    StructField('fg', StructType([
        StructField('sgs', StructType([
            StructField('col1', StringType(), True),
            StructField('col2', StringType(), True)
        ]))])),
    StructField('col6', StringType(), True),
    StructField('_corrupt_record', StringType(), True)
])

def convertjson(ele):
    temp = json.loads(ele.decode('utf-8'))
    if temp['col6'] == 'update':
        del temp['updateDescription']
        return temp
    return temp

def dstream_to_rdd(x):
    if not x.isEmpty():
        df = spark_session.read.option("multiline","true")\
            .option("mode", "PERMISSIVE")\
            .option("primitivesAsString", "false")\
            .schema(full_schema)\
            .option("columnNameOfCorruptRecord", "_corrupt_record")\
            .option("allowFieldAddition","true")\
            .json(x)
        df.show(truncate=True)
        #df.printSchema()

def createContext(all_config):
    # If you do not see this printed, that means the StreamingContext has been loaded
    # from the checkpoint
    print("Creating new context")
    ssc = StreamingContext(spark_session.sparkContext, 10)
    pubsubStream = PubsubUtils.createStream(ssc, <SUBSCRIPTION>, 10000, True)

    # Print the records of the dstream
    pubsubStream.pprint()  # DStreams are getting printed on the console

    # The DStream is transformed using flatMap to flatten it, as a tuple may have multiple records,
    # then converted to JSON format and finally pushed to BQ
    pubsubStream.flatMap(lambda x : x).map(lambda x: convertjson(x)).foreachRDD(lambda rdd : dstream_to_rdd(rdd))

    pubsubStream.checkpoint(50)
    return ssc

if __name__ == "__main__":
    # Declaration of the spark session and streaming context
    checkpointDir = <checkpointdir path on google cloud storage>
    spark_session = SparkSession.builder.appName("Test_spark_checkpoint").getOrCreate()
    spark_session.conf.set('temporaryGcsBucket', <temp bucket name>)

    ssc = StreamingContext.getOrCreate(checkpointDir, lambda: createContext(all_config))
    ssc.start()
    ssc.awaitTermination()
The error message says sparkContext is not defined. On doing dir(spark_session) I found that the returned list of attributes and methods does contain sparkContext. Am I supposed to pass it explicitly? What am I missing here?
Also, please help me understand whether the position of the checkpointing calls in the code is correct.
Update: I tried with a SparkContext instead of a SparkSession:
conf = SparkConf()
conf.setAppName("Test_spark_checkpoint")
conf.set('temporaryGcsBucket', <temp bucket>)
sc = SparkContext(conf=conf)
print(dir(sc))

ssc = StreamingContext.getOrCreate(checkpointDir, lambda: createContext(all_config))

df = sc.read.option("multiline","true")\
    .option("mode", "PERMISSIVE")\
    .option("primitivesAsString", "false")\
    .schema(full_schema)\
    .option("columnNameOfCorruptRecord", "_corrupt_record")\
    .option("allowFieldAddition","true")\
    .json(x)
df.show(truncate=True)
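One likely reason for the NameError is that, when the StreamingContext is restored from the checkpoint, functions such as dstream_to_rdd are replayed without the spark_session global that was created under __main__. The Spark Streaming programming guide suggests obtaining a lazily instantiated singleton SparkSession inside foreachRDD instead of capturing a global. A sketch of that pattern applied to dstream_to_rdd (untested, reusing the question's full_schema and read options) could look like this:

from pyspark.sql import SparkSession

def get_spark_session(spark_conf):
    # Lazily instantiated singleton SparkSession, rebuilt from the RDD's
    # SparkConf so it also exists after a restart from the checkpoint.
    if "sparkSessionSingleton" not in globals():
        globals()["sparkSessionSingleton"] = SparkSession \
            .builder \
            .config(conf=spark_conf) \
            .getOrCreate()
    return globals()["sparkSessionSingleton"]

def dstream_to_rdd(x):
    if not x.isEmpty():
        # derive the session from the RDD's own SparkContext instead of a global
        spark = get_spark_session(x.context.getConf())
        df = spark.read.option("multiline", "true") \
            .option("mode", "PERMISSIVE") \
            .option("primitivesAsString", "false") \
            .schema(full_schema) \
            .option("columnNameOfCorruptRecord", "_corrupt_record") \
            .option("allowFieldAddition", "true") \
            .json(x)
        df.show(truncate=True)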

How to check if n consecutive events from kafka stream is greater or less than threshold limit

I am new to PySpark. I have written a PySpark program to read a Kafka stream using a window operation. I publish the message below to Kafka every second, with different sources and temperatures along with the timestamp.
{"temperature":34,"time":"2019-04-17 12:53:02","source":"1010101"}
{"temperature":29,"time":"2019-04-17 12:53:03","source":"1010101"}
{"temperature":28,"time":"2019-04-17 12:53:04","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:05","source":"1010101"}
{"temperature":45,"time":"2019-04-17 12:53:06","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:07","source":"1010102"}
{"temperature":29,"time":"2019-04-17 12:53:08","source":"1010102"}
{"temperature":28,"time":"2019-04-17 12:53:09","source":"1010102"}
{"temperature":34,"time":"2019-04-17 12:53:10","source":"1010102"}
{"temperature":45,"time":"2019-04-17 12:53:11","source":"1010102"}
How do I check whether n consecutive temperature records for a source fall outside the threshold limits (<30 or >40) and then publish alerts to Kafka? Also, please let me know whether the program below is efficient for reading the Kafka stream or requires any changes.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid

schema = StructType([
    StructField("source", StringType(), True),
    StructField("temperature", FloatType(), True),
    StructField("time", StringType(), True)
])

spark = SparkSession \
    .builder.master("local[8]") \
    .appName("test-app") \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)

df1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

df2 = df1.select(from_json("value", schema).alias(
    "sensors")).select("sensors.*")

df3 = df2.select(df2.source, df2.temperature, from_unixtime(
    unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))

df4 = df3.groupBy(window(df3.time, "2 minutes", "1 minutes"),
                  df3.source).agg(avg("temperature"))

query1 = df4.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
    .start()

query1.awaitTermination()
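One way to approximate the "n consecutive readings" check with the windowed API already used above is to flag each out-of-range reading and require every reading in a short sliding window to be flagged; with one record per source per second, a fully flagged 5-second window corresponds to 5 consecutive breaches. The sketch below rests on that assumption and is untested; the thresholds and column names come from the question, while the window length, watermark, and the alerts output topic are illustrative:

from pyspark.sql import functions as F

# df3 is the parsed stream from the question: source, temperature, time (string)
events = df3.withColumn("event_time", F.col("time").cast("timestamp")) \
            .withColumn("breach",
                        F.when((F.col("temperature") < 30) | (F.col("temperature") > 40), 1)
                         .otherwise(0))

# a 5-second sliding window in which every reading breached the limits
# approximates "5 consecutive" breaches for a 1-record-per-second source
alerts = events \
    .withWatermark("event_time", "1 minute") \
    .groupBy(F.window("event_time", "5 seconds", "1 second"), "source") \
    .agg(F.sum("breach").alias("breaches"), F.count("*").alias("total")) \
    .where("total >= 5 AND breaches = total") \
    .select(F.col("source").alias("key"),
            F.to_json(F.struct("window", "source", "breaches")).alias("value"))

alert_query = alerts.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "alerts") \
    .option("checkpointLocation", "/tmp/alerts-checkpoint") \
    .outputMode("update") \
    .start()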

Spark Streaming Job is running very slow

I am running a Spark streaming job locally and it is taking approximately 4 to 5 minutes for one batch. Can someone suggest what could be the issue with the code below?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid

schema = StructType([
    StructField("source", StringType(), True),
    StructField("temperature", FloatType(), True),
    StructField("time", StringType(), True)
])

spark = SparkSession \
    .builder.master("local[8]") \
    .appName("poc-app") \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)

df1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "poc") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

df2 = df1.select(from_json("value", schema).alias(
    "sensors")).select("sensors.*")

df3 = df2.select(df2.source, df2.temperature,
                 from_unixtime(unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))

df4 = df3.groupBy(window(df3.time, "2 minutes", "1 minutes"), df3.source).count()

query1 = df4.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
    .start()

query1.awaitTermination()
With mini-batch streaming you usually want to reduce the number of output partitions. Since you are doing an aggregation (a wide transformation), every time it persists it will default to 200 partitions to disk because of
spark.conf.get("spark.sql.shuffle.partitions")
Try lowering this config to a smaller number of output partitions and place it at the beginning of your code, so that when the aggregation is performed it outputs 5 partitions to disk:
spark.conf.set("spark.sql.shuffle.partitions", 5)
You can also get a feel for this by looking at the number of files in the output write-stream directory, as well as by checking the number of partitions in your aggregated DataFrame:
df3.rdd.getNumPartitions()
By the way, since you are using local mode for testing, try setting local[8] instead of local[4] so it increases the parallelism across your CPU cores (I assume you have 4).

How to calculate lag difference in Spark Structured Streaming?

I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference.
To reproduce my issue, I provide a code snippet. This code consumes the data.json file stored in the data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

diff_window = Window.partitionBy("id").orderBy("timestamp")
ds = ds.withColumn("prev_timestamp", func.lag(ds.timestamp).over(diff_window))

query = ds \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()
I get this error:
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not
supported on streaming DataFrames/Datasets;;\nWindow
[lag(timestamp#71L, 1, null) windowspecdefinition(host_id#68,
timestamp#71L ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING) AS prev_timestamp#129L]
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets
This means that your window should be based on a timestamp column. So if you have a data point for each second, and you make a 30s window with a stride of 10s, your resulting window would create a new window column, with start and end columns that contain timestamps 30s apart.
You should use the window in this way:
words = words.withColumn('date_time', F.col('date_time').cast('timestamp'))

w = F.window('date_time', '30 seconds', '10 seconds')

words = words \
    .withWatermark('date_time', '1 minutes') \
    .groupBy(w).agg(F.mean('value'))
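Since lag over a row-based window is not supported on a streaming DataFrame, one hedged workaround is to aggregate inside time windows instead. The following is only an illustrative, untested sketch that reuses the question's ds and column names; it yields a per-id, per-window min/max timestamp difference rather than a true row-by-row lag:

import pyspark.sql.functions as func

# interpret the epoch-seconds field as an event-time column, then compute the
# spread between the earliest and latest timestamp seen per id in each window
ds_evt = ds.withColumn("event_time", func.col("timestamp").cast("timestamp"))

diffs = ds_evt \
    .withWatermark("event_time", "1 minute") \
    .groupBy(func.window("event_time", "30 seconds", "10 seconds"), "id") \
    .agg(func.min("timestamp").alias("first_ts"),
         func.max("timestamp").alias("last_ts")) \
    .withColumn("ts_diff", func.col("last_ts") - func.col("first_ts"))

query = diffs.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()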
