why spark python udf execution time 10x difference on different partition strategy? - apache-spark

I got huge (over 10x~100x) execution time difference between 2 jobs with only difference on partition strategy, wanting to know why :)
Observation:
repartition by partition number with equalized record runs 10~100x slower than 2.
repartition by column: phone_country_code
from spark history, only difference are 1. got minor larger(10~20%) shuffle read size.
My environment:
Spark 1.6.1 on EMR 4.7
Python 2.7
submit job using pyspark
Spark Job:
python udf to parse phone number for time zone info
read data from redshift via spark-redshift and write back
code sample:
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import DateType, TimestampType, StringType
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, udf
conf = SparkConf().setAppName("extract_local_time")
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)
sc.addPyFile("s3://xxx/xxx.zip")
def local_time(phone_number, datetime_org):
from util import phonenumber_util
local_time = phonenumber_util.convert_to_local_datetime_by_phone_number(
phone_number,
datetime_org)
return local_time.replace(tzinfo=None)
local_time_func = udf(local_time, TimestampType())
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://xxx") \
.option("query", "select * from xxx") \
.option("tempdir", "s3n://xxx") \
.load()
# df = df.repartition(12*10) # partition strategy 1
df = df.repartition('phone_country_code') # partition strategy 2
df2 = df.withColumn("datetime_local", local_time_func(col("phone_number"), col("datetime")))
df2.registerTempTable("xxx")
sql_context.sql("SELECT * FROM xxx") \
.write.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://xxx") \
.option("tempdir", "s3n://xxx") \
.option("dbtable", "xxx") \
.mode("overwrite") \
.save()
data sample:
phone_number, phone_country_code
55-82981399971, 55
1-7073492922, 1
90-5395889859, 90
My guess:
some optimization on jvm-py level on udf that depends on partitions's record distribution?
Thanks for any further suggestions :)

Related

Why the performance of Redis is worse than Hive?

I'm using Hadoop to work on a big data project.
I can use spark to send some SQL command to Hive.
Since this process is slow, I try to write my data into Redis which is an open-source database and use spark to query my data from this database to speed up this process.
I have deployed redis server in my virtual machine, and I can use spark session to read, write and run sql command on redis by using spark-redis module.
https://github.com/RedisLabs/spark-redis
Here's my testing script. I use spark session to get table from hive and write into redis.
from pyspark.sql import SparkSession
import time
import pandas as pd
spark = SparkSession.builder \
.appName("read_and_write") \
.config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
# read table from hive
sparkDF = spark.sql("SELECT * FROM hive_table")
sparkDF.show()
# write table into redis
sparkDF.write.format("org.apache.spark.sql.redis") \
.option("table", "redis_table") \
.mode("overwrite") \
.save()
After writing process finish, I write two script to compare speed between redis and hive.
This script is to test hive:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
.appName("hive_spark_test") \
.config("hive.metastore.uris", "thrift://localhost:9083") \
.config("spark.debug.maxToStringFields", "500") \
.config("spark.sql.execution.arrow.enabled", True) \
.config("spark.sql.shuffle.partitions", 20) \
.config("spark.default.parallelism", 20) \
.config("spark.storage.memoryFraction", 0.5) \
.config("spark.shuffle.memoryFraction", 0.3) \
.config("spark.shuffle.consolidateFiles", False) \
.config("spark.shuffle.sort.bypassMergeThreshold", 200) \
.config("spark.shuffle.file.buffer", "32K") \
.config("spark.reducer.maxSizeInFlight", "48M") \
.enableHiveSupport() \
.getOrCreate()
for i in range(20):
# you can use your own sql command
sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM hive_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
readDF = spark.sql(sql_command)
df_json = readDF.toJSON()
df_collect = df_json.collect()
res = [json.loads(i) for i in df_collect]
print(res)
Here's the result. Duration is 0.2s to 0.5s after few round.
enter image description here
This script is to test redis:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
.appName("redis_spark_test") \
.config("spark.redis.host", "localhost") \
.config("spark.redis.port", "6379") \
.config("spark.redis.max.pipeline.size", 200) \
.config("spark.redis.scan.count", 200) \
.config("spark.debug.maxToStringFields", "500") \
.config("spark.sql.execution.arrow.enabled", True) \
.config("spark.sql.shuffle.partitions", 20) \
.config("spark.default.parallelism", 20) \
.config("spark.storage.memoryFraction", 0.5) \
.config("spark.shuffle.memoryFraction", 0.3) \
.config("spark.shuffle.consolidateFiles", False) \
.config("spark.shuffle.sort.bypassMergeThreshold", 200) \
.config("spark.shuffle.file.buffer", "32K") \
.config("spark.reducer.maxSizeInFlight", "48M") \
.getOrCreate()
sql_command = """CREATE OR REPLACE TEMPORARY VIEW redis_table (
testColumn1 STRING,
testColumn2 INT,
testColumn3 STRING,
testColumn4 STRING,
date DATE,)
USING org.apache.spark.sql.redis OPTIONS (table 'redis_table')
"""
spark.sql(sql_command)
for i in range(20):
# you can use your own sql command
sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM redis_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
readDF = spark.sql(sql_command)
df_json = readDF.toJSON()
df_collect = df_json.collect()
res = [json.loads(i) for i in df_collect]
print(res)
Here's the result. Duration is 1s to 2s after few round.
enter image description here
This result is conflicted with my survey. Redis should be faster than Hive, but I get the opposite result.
I want to know the reason and try to make Redis can run faster than Hive through Spark if that's possible.
Thank you.

PySpark Structured Streaming Query - query in dashbord visibility

I wrote some example code which connect to kafka broker, read data from topic and sink it to snappydata table.
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext, Row, SparkSession
from pyspark.sql.snappy import SnappySession
from pyspark.rdd import RDD
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import col, explode, split
import time
import sys
def main(snappy):
logger = logging.getLogger('py4j')
logger.info("My test info statement")
sns = snappy.newSession()
df = sns \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "10.0.0.4:9092") \
.option("subscribe", "test_import3") \
.option("failOnDataLoss", "false") \
.option("startingOffsets", "latest") \
.load()
bdf = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
streamingQuery = bdf\
.writeStream\
.format("snappysink") \
.queryName("Devices3") \
.trigger(processingTime="30 seconds") \
.option("tablename","devices2") \
.option("checkpointLocation","/tmp") \
.start()
streamingQuery.awaitTermination()
if __name__ == "__main__":
from pyspark.sql.snappy import SnappySession
from pyspark import SparkContext, SparkConf
sc = SparkSession.builder.master("local[*]").appName("test").config("snappydata.connection", "10.0.0.4:1527").getOrCreate()
snc = SnappySession(sc)
main(snc)
I`m submitting it with command
/opt/snappydata/bin/spark-submit --master spark://10.0.0.4:1527 /path_to/file.py --conf snappydata.connection=10.0.0.4:1527
Everything works, data is readed from Kafka Topic and writed in snappydata table.
I don't understand why i don't see this streaming query in the SnappyData dashboard UI - after submitting pyspark code in the console i saw new Spark Master UI its started.
How can i connect to SnappyData internal Spark Master from pySpark it is possible?
SnappyData supports Python jobs to be submitted only in Smart Connector mode, which means it'll always be launched via a separate Spark Cluster to talk to SnappyData cluster. Hence, you see that your Python job is seen on this Spark cluster's UI and not on SnappyData's dashboard.

Pyspark Structured streaming processing

I am trying to make a structured streaming application with spark the main idea is to read from a kafka source, process the input, write back to another topic. i have successfully made spark read and write from and to kafka however my problem is with the processing part. I have tried the foreach function to capture every row and process it before writing back to kafka however it always only does the foreach part and never writes back to kafka. If i however remove the foreach part from the writestream it would continue writing but now i lost my processing.
if anyone can give me an example on how to do this with an example i would be extremely grateful.
here is my code
spark = SparkSession \
.builder \
.appName("StructuredStreamingTrial") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "KafkaStreamingSource") \
.load()
ds = df \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")\
.writeStream \
.outputMode("update") \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "StreamSink") \
.option("checkpointLocation", "./testdir")\
.foreach(foreach_function)
.start().awaitTermination()
and the foreach_function simply is
def foreach_function(df):
try:
print(df)
except:
print('fail')
pass
Processing the data before writing into Kafka sink in Pyspark based Structured Streaming API,we can easily handle with UDF function for any kind of complex transformation .
example code is in below . This code is trying to read the JSON format message Kafka topic and parsing the message to convert the message from JSON into CSV format and rewrite into another topic. You can handle any processing transformation in place of 'json_formatted' function .
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('pda_inst_monitor_status_update').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Creating readstream DataFrame :
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "KafkaStreamingSource") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
df1.registerTempTable("test")
def json_formatted(s):
val_dict = json.loads(s)
return str([
val_dict["after"]["ID"]
, val_dict["after"]["INST_NAME"]
, val_dict["after"]["DB_UNIQUE_NAME"]
, val_dict["after"]["DBNAME"]
, val_dict["after"]["MON_START_TIME"]
, val_dict["after"]["MON_END_TIME"]
]).strip('[]').replace("'","").replace('"','')
spark.udf.register("JsonformatterWithPython", json_formatted)
squared_udf = udf(json_formatted)
df1 = spark.table("test")
df2 = df1.select(squared_udf("value"))
# Declaring the Readstream Schema DataFrame :
df2.coalesce(1).writeStream \
.writeStream \
.outputMode("update") \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "StreamSink") \
.option("checkpointLocation", "./testdir")\
.start()
ssc.awaitTermination()

Spark Streaming Job is running very slow

I am running a spark streaming job in my local and it is taking approximately 4 to 5 min for one batch. Can someone suggest what could be the issue with the bellow code?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid
schema = StructType([
StructField("source", StringType(), True),
StructField("temperature", FloatType(), True),
StructField("time", StringType(), True)
])
spark = SparkSession \
.builder.master("local[8]") \
.appName("poc-app") \
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 5)
df1 = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "poc") \
.load() \
.selectExpr("CAST(value AS STRING)")
df2 = df1.select(from_json("value", schema).alias(
"sensors")).select("sensors.*")
df3=df2.select(df2.source,df2.temperature,from_unixtime(unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))
df4 = df3.groupBy(window(df3.time, "2 minutes","1 minutes"), df3.source).count()
query1 = df4.writeStream \
.outputMode("complete") \
.format("console") \
.option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
.start()
query1.awaitTermination()
with mini-batch streaming you usually want to reduce the # of output partitions ... since you are doing some aggregation (wide transformation) every time you persist it will default to 200 partitions to disk because of
spark.conf.get("spark.sql.shuffle.partitions")
try lowering this config to a smaller output partition and place it at the beginning of your code so when the aggregation is performed it outputs 5 partitions to disk
spark.conf.set("spark.sql.shuffle.partitions", 5)
you can also get a feel by looking at the # of files in the output write stream directory as well as identifying the # of partitions in your aggregated df
df3.rdd.getNumPartitions()
btw since you are using a local mode for testing try setting to local[8] instead of local[4] so it increases the parallelism on your cpu cores (i assume you have 4)

How to calculate lag difference in Spark Structured Streaming?

I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference.
To reproduce my issue, I provide the code snippet. This code consumes data.json file stored in data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("Test") \
.master("local[2]") \
.getOrCreate()
schema = StructType([
StructField("id", IntegerType()),
StructField("type", StringType()),
StructField("timestamp", LongType())
])
ds = spark \
.readStream \
.format("json") \
.schema(schema) \
.load("data/")
diff_window = Window.partitionBy("id").orderBy("timestamp")
ds = ds.withColumn("prev_timestamp", func.lag(ds.timestamp).over(diff_window))
query = ds \
.writeStream \
.format('console') \
.start()
query.awaitTermination()
I get this error:
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not
supported on streaming DataFrames/Datasets;;\nWindow
[lag(timestamp#71L, 1, null) windowspecdefinition(host_id#68,
timestamp#71L ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING) AS prev_timestamp#129L]
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets
Meaning that your window should be based on a timestamp column. So it you have a data point for each second, and you make a 30s window with a stride of 10s, your resultant window would create a new window column, with start and end columns which will contain timestamps with a difference of 30s.
You should use the window in this way:
words = words.withColumn('date_time', F.col('date_time').cast('timestamp'))
w = F.window('date_time', '30 seconds', '10 seconds')
words = words \
.withWatermark('date_format', '1 minutes') \
.groupBy(w).agg(F.mean('value'))

Resources