How to calculate lag difference in Spark Structured Streaming? - apache-spark

I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference.
To reproduce my issue, I provide the code snippet. This code consumes data.json file stored in data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("Test") \
.master("local[2]") \
.getOrCreate()
schema = StructType([
StructField("id", IntegerType()),
StructField("type", StringType()),
StructField("timestamp", LongType())
])
ds = spark \
.readStream \
.format("json") \
.schema(schema) \
.load("data/")
diff_window = Window.partitionBy("id").orderBy("timestamp")
ds = ds.withColumn("prev_timestamp", func.lag(ds.timestamp).over(diff_window))
query = ds \
.writeStream \
.format('console') \
.start()
query.awaitTermination()
I get this error:
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not
supported on streaming DataFrames/Datasets;;\nWindow
[lag(timestamp#71L, 1, null) windowspecdefinition(host_id#68,
timestamp#71L ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING) AS prev_timestamp#129L]

pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets
Meaning that your window should be based on a timestamp column. So it you have a data point for each second, and you make a 30s window with a stride of 10s, your resultant window would create a new window column, with start and end columns which will contain timestamps with a difference of 30s.
You should use the window in this way:
words = words.withColumn('date_time', F.col('date_time').cast('timestamp'))
w = F.window('date_time', '30 seconds', '10 seconds')
words = words \
.withWatermark('date_format', '1 minutes') \
.groupBy(w).agg(F.mean('value'))

Related

Why the performance of Redis is worse than Hive?

I'm using Hadoop to work on a big data project.
I can use spark to send some SQL command to Hive.
Since this process is slow, I try to write my data into Redis which is an open-source database and use spark to query my data from this database to speed up this process.
I have deployed redis server in my virtual machine, and I can use spark session to read, write and run sql command on redis by using spark-redis module.
https://github.com/RedisLabs/spark-redis
Here's my testing script. I use spark session to get table from hive and write into redis.
from pyspark.sql import SparkSession
import time
import pandas as pd
spark = SparkSession.builder \
.appName("read_and_write") \
.config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
# read table from hive
sparkDF = spark.sql("SELECT * FROM hive_table")
sparkDF.show()
# write table into redis
sparkDF.write.format("org.apache.spark.sql.redis") \
.option("table", "redis_table") \
.mode("overwrite") \
.save()
After writing process finish, I write two script to compare speed between redis and hive.
This script is to test hive:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
.appName("hive_spark_test") \
.config("hive.metastore.uris", "thrift://localhost:9083") \
.config("spark.debug.maxToStringFields", "500") \
.config("spark.sql.execution.arrow.enabled", True) \
.config("spark.sql.shuffle.partitions", 20) \
.config("spark.default.parallelism", 20) \
.config("spark.storage.memoryFraction", 0.5) \
.config("spark.shuffle.memoryFraction", 0.3) \
.config("spark.shuffle.consolidateFiles", False) \
.config("spark.shuffle.sort.bypassMergeThreshold", 200) \
.config("spark.shuffle.file.buffer", "32K") \
.config("spark.reducer.maxSizeInFlight", "48M") \
.enableHiveSupport() \
.getOrCreate()
for i in range(20):
# you can use your own sql command
sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM hive_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
readDF = spark.sql(sql_command)
df_json = readDF.toJSON()
df_collect = df_json.collect()
res = [json.loads(i) for i in df_collect]
print(res)
Here's the result. Duration is 0.2s to 0.5s after few round.
enter image description here
This script is to test redis:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
.appName("redis_spark_test") \
.config("spark.redis.host", "localhost") \
.config("spark.redis.port", "6379") \
.config("spark.redis.max.pipeline.size", 200) \
.config("spark.redis.scan.count", 200) \
.config("spark.debug.maxToStringFields", "500") \
.config("spark.sql.execution.arrow.enabled", True) \
.config("spark.sql.shuffle.partitions", 20) \
.config("spark.default.parallelism", 20) \
.config("spark.storage.memoryFraction", 0.5) \
.config("spark.shuffle.memoryFraction", 0.3) \
.config("spark.shuffle.consolidateFiles", False) \
.config("spark.shuffle.sort.bypassMergeThreshold", 200) \
.config("spark.shuffle.file.buffer", "32K") \
.config("spark.reducer.maxSizeInFlight", "48M") \
.getOrCreate()
sql_command = """CREATE OR REPLACE TEMPORARY VIEW redis_table (
testColumn1 STRING,
testColumn2 INT,
testColumn3 STRING,
testColumn4 STRING,
date DATE,)
USING org.apache.spark.sql.redis OPTIONS (table 'redis_table')
"""
spark.sql(sql_command)
for i in range(20):
# you can use your own sql command
sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM redis_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
readDF = spark.sql(sql_command)
df_json = readDF.toJSON()
df_collect = df_json.collect()
res = [json.loads(i) for i in df_collect]
print(res)
Here's the result. Duration is 1s to 2s after few round.
enter image description here
This result is conflicted with my survey. Redis should be faster than Hive, but I get the opposite result.
I want to know the reason and try to make Redis can run faster than Hive through Spark if that's possible.
Thank you.

Spark structed streaming window issue

I have a problem regrading the window in Spark Structed Streaming. I want to group the data i'm receiving continuously from kafka source in sliding window and count the number of data. The issue is that writestream streams the window dataframe each time there is data coming and update the count of the current window.
I'm using the following code to create the window:
#Define schema of the topic to be consumed
jsonSchema = StructType([ StructField("State", StringType(), True) \
, StructField("Value", StringType(), True) \
, StructField("SourceTimestamp", StringType(), True) \
, StructField("Tag", StringType(), True)
])
spark = SparkSession \
.builder \
.appName("StructuredStreaming") \
.config("spark.default.parallelism", "100") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "10.129.140.23:9092") \
.option("subscribe", "SIMULATOR.SUPERMAN.TOTO") \
.load() \
.select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
.select("data.*")
df = df.withColumn("time", current_timestamp())
Window = df \
.withColumn("window",window("time","4 seconds","1 seconds")).groupBy("window").count() \
.withColumn("time", current_timestamp())
#Write back to kafka
query = Window.select(to_json(struct("count","window","time")).alias("value")) \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "10.129.140.23:9092") \
.outputMode("update") \
.option("topic", "structed") \
.option("checkpointLocation", "/home/superman/notebook/checkpoint") \
.start()
The windows are not sorted and are updated each time there is a change in count. How can we wait for the end of the window and stream the final count one time. Instead of this output:
{"count":21,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:18.000Z","end":"2019-05-13T09:39:22.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":37,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:21.939Z"}
{"count":18,"window":{"start":"2019-05-13T09:39:21.000Z","end":"2019-05-13T09:39:25.000Z"},"time":"2019-05-13T09:39:21.939Z"}
I would like this:
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
The expected ouput wait for the window to be closed based on comparaison between the end timestamp and the current time.

How to check if n consecutive events from kafka stream is greater or less than threshold limit

I an new to pyspark. I have written a pyspark program to read kafka stream using window operation. I am publishing the below message to kafka every second with different sources and temperatures along with the timestamp.
{"temperature":34,"time":"2019-04-17 12:53:02","source":"1010101"}
{"temperature":29,"time":"2019-04-17 12:53:03","source":"1010101"}
{"temperature":28,"time":"2019-04-17 12:53:04","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:05","source":"1010101"}
{"temperature":45,"time":"2019-04-17 12:53:06","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:07","source":"1010102"}
{"temperature":29,"time":"2019-04-17 12:53:08","source":"1010102"}
{"temperature":28,"time":"2019-04-17 12:53:09","source":"1010102"}
{"temperature":34,"time":"2019-04-17 12:53:10","source":"1010102"}
{"temperature":45,"time":"2019-04-17 12:53:11","source":"1010102"}
How do I check if n consecutive temperature records for a source crosses threshold limit (<30 and >40) and then publish the alerts to Kafka. Also please let me know if the below program is efficient to read the kafka stream or require any changes?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid
schema = StructType([
StructField("source", StringType(), True),
StructField("temperature", FloatType(), True),
StructField("time", StringType(), True)
])
spark = SparkSession \
.builder.master("local[8]") \
.appName("test-app") \
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 5)
df1 = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.load() \
.selectExpr("CAST(value AS STRING)")
df2 = df1.select(from_json("value", schema).alias(
"sensors")).select("sensors.*")
df3 = df2.select(df2.source, df2.temperature, from_unixtime(
unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))
df4 = df3.groupBy(window(df3.time, "2 minutes", "1 minutes"),
df3.source).agg(avg("temperature"))
query1 = df4.writeStream \
.outputMode("complete") \
.format("console") \
.option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
.start()
query1.awaitTermination()

Spark Streaming Job is running very slow

I am running a spark streaming job in my local and it is taking approximately 4 to 5 min for one batch. Can someone suggest what could be the issue with the bellow code?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid
schema = StructType([
StructField("source", StringType(), True),
StructField("temperature", FloatType(), True),
StructField("time", StringType(), True)
])
spark = SparkSession \
.builder.master("local[8]") \
.appName("poc-app") \
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 5)
df1 = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "poc") \
.load() \
.selectExpr("CAST(value AS STRING)")
df2 = df1.select(from_json("value", schema).alias(
"sensors")).select("sensors.*")
df3=df2.select(df2.source,df2.temperature,from_unixtime(unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))
df4 = df3.groupBy(window(df3.time, "2 minutes","1 minutes"), df3.source).count()
query1 = df4.writeStream \
.outputMode("complete") \
.format("console") \
.option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
.start()
query1.awaitTermination()
with mini-batch streaming you usually want to reduce the # of output partitions ... since you are doing some aggregation (wide transformation) every time you persist it will default to 200 partitions to disk because of
spark.conf.get("spark.sql.shuffle.partitions")
try lowering this config to a smaller output partition and place it at the beginning of your code so when the aggregation is performed it outputs 5 partitions to disk
spark.conf.set("spark.sql.shuffle.partitions", 5)
you can also get a feel by looking at the # of files in the output write stream directory as well as identifying the # of partitions in your aggregated df
df3.rdd.getNumPartitions()
btw since you are using a local mode for testing try setting to local[8] instead of local[4] so it increases the parallelism on your cpu cores (i assume you have 4)

why spark python udf execution time 10x difference on different partition strategy?

I got huge (over 10x~100x) execution time difference between 2 jobs with only difference on partition strategy, wanting to know why :)
Observation:
repartition by partition number with equalized record runs 10~100x slower than 2.
repartition by column: phone_country_code
from spark history, only difference are 1. got minor larger(10~20%) shuffle read size.
My environment:
Spark 1.6.1 on EMR 4.7
Python 2.7
submit job using pyspark
Spark Job:
python udf to parse phone number for time zone info
read data from redshift via spark-redshift and write back
code sample:
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import DateType, TimestampType, StringType
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, udf
conf = SparkConf().setAppName("extract_local_time")
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)
sc.addPyFile("s3://xxx/xxx.zip")
def local_time(phone_number, datetime_org):
from util import phonenumber_util
local_time = phonenumber_util.convert_to_local_datetime_by_phone_number(
phone_number,
datetime_org)
return local_time.replace(tzinfo=None)
local_time_func = udf(local_time, TimestampType())
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://xxx") \
.option("query", "select * from xxx") \
.option("tempdir", "s3n://xxx") \
.load()
# df = df.repartition(12*10) # partition strategy 1
df = df.repartition('phone_country_code') # partition strategy 2
df2 = df.withColumn("datetime_local", local_time_func(col("phone_number"), col("datetime")))
df2.registerTempTable("xxx")
sql_context.sql("SELECT * FROM xxx") \
.write.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://xxx") \
.option("tempdir", "s3n://xxx") \
.option("dbtable", "xxx") \
.mode("overwrite") \
.save()
data sample:
phone_number, phone_country_code
55-82981399971, 55
1-7073492922, 1
90-5395889859, 90
My guess:
some optimization on jvm-py level on udf that depends on partitions's record distribution?
Thanks for any further suggestions :)

Resources