I'm using Hadoop for a big data project.
I can use Spark to send SQL commands to Hive.
Since this process is slow, I am trying to write my data into Redis, an open-source in-memory database, and query it from Spark to speed things up.
I have deployed a Redis server in my virtual machine, and I can use a Spark session to read, write and run SQL commands against Redis via the spark-redis module:
https://github.com/RedisLabs/spark-redis
Here's my test script. I use a Spark session to read a table from Hive and write it into Redis.
from pyspark.sql import SparkSession
import time
import pandas as pd
spark = SparkSession.builder \
    .appName("read_and_write") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

# read table from hive
sparkDF = spark.sql("SELECT * FROM hive_table")
sparkDF.show()

# write table into redis
sparkDF.write.format("org.apache.spark.sql.redis") \
    .option("table", "redis_table") \
    .mode("overwrite") \
    .save()
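For reference, the spark-redis data source can also read the table back directly into a DataFrame (a short sketch; it assumes the default spark.redis.host/port, as in the write above):
# read the table back from redis
redisDF = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "redis_table") \
    .load()
redisDF.show()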
After the write finishes, I wrote two scripts to compare the query speed of Redis and Hive.
This script tests Hive:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
    .appName("hive_spark_test") \
    .config("hive.metastore.uris", "thrift://localhost:9083") \
    .config("spark.debug.maxToStringFields", "500") \
    .config("spark.sql.execution.arrow.enabled", True) \
    .config("spark.sql.shuffle.partitions", 20) \
    .config("spark.default.parallelism", 20) \
    .config("spark.storage.memoryFraction", 0.5) \
    .config("spark.shuffle.memoryFraction", 0.3) \
    .config("spark.shuffle.consolidateFiles", False) \
    .config("spark.shuffle.sort.bypassMergeThreshold", 200) \
    .config("spark.shuffle.file.buffer", "32K") \
    .config("spark.reducer.maxSizeInFlight", "48M") \
    .enableHiveSupport() \
    .getOrCreate()

for i in range(20):
    # you can use your own sql command
    sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM hive_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
    readDF = spark.sql(sql_command)
    df_json = readDF.toJSON()
    df_collect = df_json.collect()
    res = [json.loads(i) for i in df_collect]
    print(res)
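The timing code is not shown in the loop above; here is a minimal sketch of how each round could be measured, simply wrapping the query and collect with time.time():
sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM hive_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
for i in range(20):
    start = time.time()
    res = [json.loads(row) for row in spark.sql(sql_command).toJSON().collect()]
    print("round %d: %.3f s" % (i, time.time() - start))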
Here's the result: the duration is 0.2s to 0.5s after a few rounds.
This script tests Redis:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
    .appName("redis_spark_test") \
    .config("spark.redis.host", "localhost") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.max.pipeline.size", 200) \
    .config("spark.redis.scan.count", 200) \
    .config("spark.debug.maxToStringFields", "500") \
    .config("spark.sql.execution.arrow.enabled", True) \
    .config("spark.sql.shuffle.partitions", 20) \
    .config("spark.default.parallelism", 20) \
    .config("spark.storage.memoryFraction", 0.5) \
    .config("spark.shuffle.memoryFraction", 0.3) \
    .config("spark.shuffle.consolidateFiles", False) \
    .config("spark.shuffle.sort.bypassMergeThreshold", 200) \
    .config("spark.shuffle.file.buffer", "32K") \
    .config("spark.reducer.maxSizeInFlight", "48M") \
    .getOrCreate()
sql_command = """CREATE OR REPLACE TEMPORARY VIEW redis_table (
testColumn1 STRING,
testColumn2 INT,
testColumn3 STRING,
testColumn4 STRING,
date DATE,)
USING org.apache.spark.sql.redis OPTIONS (table 'redis_table')
"""
spark.sql(sql_command)
for i in range(20):
    # you can use your own sql command
    sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM redis_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
    readDF = spark.sql(sql_command)
    df_json = readDF.toJSON()
    df_collect = df_json.collect()
    res = [json.loads(i) for i in df_collect]
    print(res)
Here's the result: the duration is 1s to 2s after a few rounds.
This result conflicts with what I had read: Redis should be faster than Hive, but I get the opposite result.
I want to know the reason and, if possible, how to make Redis run faster than Hive through Spark.
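One thing worth checking is the physical plan of the Redis query, to see whether the date filter and the selected columns are pushed down to the spark-redis source or whether the whole table is scanned into Spark first (a small diagnostic sketch):
# print the parsed, analyzed, optimized and physical plans for the redis query
readDF = spark.sql(sql_command)
readDF.explain(True)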
Thank you.
Related
When I use local Spark on Windows as below, it works and I can see the output of df.count():
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_csv

spark = SparkSession \
    .builder \
    .appName("Structured Streaming ") \
    .master("local[*]") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()
flower_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")
flower_schema_string = "sepal_length DOUBLE,sepal_width DOUBLE,petal_length DOUBLE,petal_width DOUBLE,species STRING"
flower_df2 = flower_df1.select(from_csv(col("value"), flower_schema_string).alias("flower"), "timestamp").select("flower.*", "timestamp")
flower_df2.createOrReplaceTempView("flower_find")
song_find_text = spark.sql("SELECT * FROM flower_find")
flower_agg_write_stream = song_find_text \
    .writeStream \
    .option("truncate", "false") \
    .format("memory") \
    .outputMode("update") \
    .queryName("testedTable") \
    .start()

while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)
But when I use Spark on my VirtualBox Ubuntu machine, I never see any data.
Below are the modifications I made when using the Ubuntu Spark cluster (a sketch of the modified code follows this list):
Set the SparkSession's master URL to "spark://192.168.15.2:7077"
Inserted flower_agg_write_stream.awaitTermination() above the "while True:" loop
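For clarity, this is roughly what the relevant part looks like after those two changes:
spark = SparkSession \
    .builder \
    .appName("Structured Streaming ") \
    .master("spark://192.168.15.2:7077") \
    .getOrCreate()

# ... same Kafka read, from_csv parsing and memory-sink writeStream as above ...

flower_agg_write_stream.awaitTermination()  # added above the polling loop

while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)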
Did I do something wrong?
Addendum: when I run the modified code, the log shows:
...
org.apache.spark.sql.AnalysisException: Table or view not found: testedTable;
...
Unfortunately, I have already tried createOrReplaceGlobalTempView(), but that does not work either.
I have a problem regarding windows in Spark Structured Streaming. I want to group the data I receive continuously from a Kafka source into sliding windows and count the number of records. The issue is that writeStream emits the windowed DataFrame every time new data arrives and keeps updating the count of the current window.
I'm using the following code to create the window:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, from_json, struct, to_json, window
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema of the topic to be consumed
jsonSchema = StructType([
    StructField("State", StringType(), True),
    StructField("Value", StringType(), True),
    StructField("SourceTimestamp", StringType(), True),
    StructField("Tag", StringType(), True)
])

spark = SparkSession \
    .builder \
    .appName("StructuredStreaming") \
    .config("spark.default.parallelism", "100") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .option("subscribe", "SIMULATOR.SUPERMAN.TOTO") \
    .load() \
    .select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
    .select("data.*")

df = df.withColumn("time", current_timestamp())

Window = df \
    .withColumn("window", window("time", "4 seconds", "1 seconds")) \
    .groupBy("window").count() \
    .withColumn("time", current_timestamp())

# Write the counts back to Kafka
query = Window.select(to_json(struct("count", "window", "time")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .outputMode("update") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint") \
    .start()
The windows are not sorted and are updated each time the count changes. How can I wait for the end of a window and stream the final count only once, instead of this output:
{"count":21,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:18.000Z","end":"2019-05-13T09:39:22.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":37,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:21.939Z"}
{"count":18,"window":{"start":"2019-05-13T09:39:21.000Z","end":"2019-05-13T09:39:25.000Z"},"time":"2019-05-13T09:39:21.939Z"}
I would like this:
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
The expected output waits for the window to be closed, based on a comparison between the window's end timestamp and the current time.
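A minimal sketch of the usual way to get this behavior, starting from the df defined above and assuming the same 4-second windows with a 1-second slide: add an event-time watermark and switch to append output mode, so each window is emitted once, only after the watermark passes its end (the checkpoint path below is a placeholder):
from pyspark.sql.functions import struct, to_json, window

# Watermark + append mode: a window is emitted once, after the watermark
# (here 10 seconds behind the latest observed "time") passes the window's end.
windowed = df \
    .withWatermark("time", "10 seconds") \
    .groupBy(window("time", "4 seconds", "1 seconds")) \
    .count()

query = windowed \
    .select(to_json(struct("count", "window")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint_append") \
    .outputMode("append") \
    .start()
The tradeoff is latency: a window's count appears only after the watermark delay has passed its end, and records arriving later than the watermark are dropped from their window.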
I have a Spark shell script that invokes a PySpark script and creates a global temp view.
This is what I am doing in the first script:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Spark SQL Parllel load example") \
    .config("spark.jars", "/u/user/graghav6/sqljdbc4.jar") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.sql.shuffle.partitions", "50") \
    .config("hive.metastore.uris", "thrift://xxxxx:9083") \
    .config("spark.sql.join.preferSortMergeJoin", "true") \
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .enableHiveSupport() \
    .getOrCreate()
# After some transformations, I try to create a global temp view of the dataframe:
df1.createGlobalTempView("df1_global_view")
spark.stop()
exit()
This is my second Spark shell script:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Spark SQL Parllel load example") \
    .config("spark.jars", "/u/user/graghav6/sqljdbc4.jar") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.sql.shuffle.partitions", "50") \
    .config("hive.metastore.uris", "thrift://xxxx:9083") \
    .config("spark.sql.join.preferSortMergeJoin", "true") \
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .enableHiveSupport() \
    .getOrCreate()
newSparkSession = spark.newSession()
# Reading data from the global temp view
data_df_save = newSparkSession.sql(""" select * from global_temp.df1_global_view""")
data_df_save.show()
newSparkSession.close()
exit()
I am getting the error below:
Stdoutput pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`df1_global_view`; line 1 pos 15;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`df1_global_view`\n"
It looks like I am missing something. How can I share the same global temp view across multiple sessions?
Am I closing the Spark session incorrectly in the first script?
I have already found a couple of answers on Stack Overflow but was not able to figure out the cause.
You're using createGlobalTempView, so it's a temporary view and won't be available after you close the application.
In other words, it will be available to another SparkSession within the same application, but not to another PySpark application.
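A minimal sketch illustrating this: within a single application, a global temp view created in one session is visible from a new session, but it disappears once the application stops.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global_temp_view_demo").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df1.createGlobalTempView("df1_global_view")

# a new session in the SAME application can see the global temp view
other = spark.newSession()
other.sql("SELECT * FROM global_temp.df1_global_view").show()

spark.stop()  # after this, the view is gone; a separate application can never see it
If the data has to be shared between applications, it needs to be persisted somewhere external, for example as a Hive table or as files on HDFS.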
I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference.
To reproduce my issue, here is a code snippet. It consumes the data.json file stored in the data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

diff_window = Window.partitionBy("id").orderBy("timestamp")
ds = ds.withColumn("prev_timestamp", func.lag(ds.timestamp).over(diff_window))

query = ds \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()
I get this error:
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets;;
Window [lag(timestamp#71L, 1, null) windowspecdefinition(host_id#68, timestamp#71L ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_timestamp#129L]
pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets
This means that your window should be based on a timestamp column. So if you have a data point for each second, and you make a 30-second window with a stride of 10 seconds, the result will have a new window column, whose start and end columns contain timestamps 30 seconds apart.
You should use the window in this way:
import pyspark.sql.functions as F

words = words.withColumn('date_time', F.col('date_time').cast('timestamp'))

w = F.window('date_time', '30 seconds', '10 seconds')
words = words \
    .withWatermark('date_time', '1 minutes') \
    .groupBy(w).agg(F.mean('value'))
I got a huge (over 10x~100x) execution time difference between two jobs whose only difference is the partition strategy, and I would like to know why :)
Observation:
Strategy 1, repartition by partition number (records distributed equally), runs 10~100x slower than strategy 2, repartition by the column phone_country_code.
In the Spark history UI, the only difference is a slightly larger (10~20%) shuffle read size.
My environment:
Spark 1.6.1 on EMR 4.7
Python 2.7
submit job using pyspark
Spark Job:
a Python UDF parses phone numbers for time zone info
data is read from Redshift via spark-redshift and written back
Code sample:
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import DateType, TimestampType, StringType
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, udf
conf = SparkConf().setAppName("extract_local_time")
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)
sc.addPyFile("s3://xxx/xxx.zip")
def local_time(phone_number, datetime_org):
    from util import phonenumber_util
    local_time = phonenumber_util.convert_to_local_datetime_by_phone_number(
        phone_number,
        datetime_org)
    return local_time.replace(tzinfo=None)
local_time_func = udf(local_time, TimestampType())
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://xxx") \
    .option("query", "select * from xxx") \
    .option("tempdir", "s3n://xxx") \
    .load()
# df = df.repartition(12*10) # partition strategy 1
df = df.repartition('phone_country_code') # partition strategy 2
df2 = df.withColumn("datetime_local", local_time_func(col("phone_number"), col("datetime")))
df2.registerTempTable("xxx")
sql_context.sql("SELECT * FROM xxx") \
.write.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://xxx") \
.option("tempdir", "s3n://xxx") \
.option("dbtable", "xxx") \
.mode("overwrite") \
.save()
Data sample:
phone_number, phone_country_code
55-82981399971, 55
1-7073492922, 1
90-5395889859, 90
My guess:
some JVM/Python-level optimization of the UDF that depends on the distribution of records across partitions?
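A minimal sketch for checking how records actually end up distributed across partitions under each strategy (one count per partition, using the RDD API available in Spark 1.6):
def records_per_partition(dataframe):
    # returns one record count per partition, in partition order
    return dataframe.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()

print(records_per_partition(df.repartition(12 * 10)))                 # strategy 1
print(records_per_partition(df.repartition('phone_country_code')))    # strategy 2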
Thanks for any further suggestions :)