Delta lake stream-stream left outer join using watermarks - apache-spark

My aim here is to:
Read two delta lake tables as streams.
Add a timestamp column to the data frames.
Create a watermark using the timestamp columns.
Use the watermarks for a left outer join.
Write the join as a stream to a delta lake table.
Here is my code:
from delta import *
from pyspark.sql.functions import *
import pyspark
my_working_dir = "/working"
spark = (
    pyspark.sql.SparkSession.builder.appName("test_delta_stream").master("yarn").getOrCreate()
)
df_stream_1 = spark.readStream.format("delta").load(f"{my_working_dir}/table_1").alias("table_1")
df_stream_2 = spark.readStream.format("delta").load(f"{my_working_dir}/table_2").alias("table_2")
df_stream_1 = df_stream_1.withColumn("inserted_stream_1", current_timestamp())
df_stream_2 = df_stream_2.withColumn("inserted_stream_2", current_timestamp())
stream_1_watermark = df_stream_1.selectExpr(
    "ID AS STREAM_1_ID", "inserted_stream_1"
).withWatermark("inserted_stream_1", "10 seconds")
stream_2_watermark = df_stream_2.selectExpr(
    "ID AS STREAM_2_ID", "inserted_stream_2"
).withWatermark("inserted_stream_2", "10 seconds")
left_join_df = stream_1_watermark.join(
    stream_2_watermark,
    expr(
        """
        STREAM_1_ID = STREAM_2_ID AND
        inserted_stream_1 <= inserted_stream_2 + interval 1 hour
        """
    ),
    "leftOuter",
).select("*")
left_join_df.writeStream.format("delta").outputMode("append").option(
    "checkpointLocation", f"{my_working_dir}/output_delta_table/mycheckpoint"
).start(f"{my_working_dir}/output_delta_table")
Unfortunately, I get the error:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/some-dir/python/pyspark/sql/streaming.py", line 1493, in start
return self._sq(self._jwrite.start(path))
File "/some-dir/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/some-dir/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Stream-stream LeftOuter join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;
It seems that even when I create watermarks, I still cannot do a left outer join. Any ideas why I am getting this error even though I have clearly defined watermarks on the timestamp columns used in the join condition?
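For comparison, the stream-stream left outer join example in the Structured Streaming programming guide bounds the event-time column of the nullable (right) side from both directions. Below is a minimal sketch adapting that pattern to the column names above; whether it removes the error in your setup may depend on your Spark version:
left_join_df = stream_1_watermark.join(
    stream_2_watermark,
    expr(
        """
        STREAM_1_ID = STREAM_2_ID AND
        inserted_stream_2 >= inserted_stream_1 AND
        inserted_stream_2 <= inserted_stream_1 + interval 1 hour
        """
    ),
    "leftOuter",
)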

Related

Writing a dictionary of Spark data frames to S3 bucket

Suppose we have a dictionary of PySpark dataframes. Is there a way to write this dictionary to an S3 bucket? The purpose of this is to read these PySpark data frames and then convert them into pandas data frames. Below is some code and the errors I get:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df1 = rdd.toDF(columns)
df1.printSchema()
columns = ["language","users_count"]
data = [("C", "2000"), ("Java", "10000"), ("Lisp", "300")]
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df2 = rdd.toDF(columns)
df2.printSchema()
spark_dict = {df1: '1', df2: '2'}
import boto3
import pickle
s3_resource = boto3.resource('s3')
bucket='test'
key='pickle_list.pkl'
pickle_byte_obj = pickle.dumps(spark_dict)
try:
    s3_resource.Object(bucket, key).put(Body=pickle_byte_obj)
except:
    print("Error in writing to S3 bucket")
with this error:
An error was encountered:
can't pickle _thread.RLock objects
Traceback (most recent call last):
TypeError: can't pickle _thread.RLock objects
I also tried dumping the dictionary of PySpark data frames to a JSON file:
import json
flatten_dfs_json = json.dumps(spark_dict)
and got this error:
An error was encountered:
Object of type DataFrame is not JSON serializable
Traceback (most recent call last):
File "/usr/lib64/python3.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib64/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib64/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/lib64/python3.7/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type DataFrame is not JSON serializable
Suppose we have a dictionary of PySpark dataframes. Is there a way to write this dictionary to an S3 bucket?
Yes (you might need to configure access key and secret key)
df.write.format('json').save('s3a://bucket-name/path')
The purpose of this is to read these PySpark data frames and then convert them into pandas data frames.
My 2 cents: this sounds wrong to me; you don't have to convert the data to pandas, as that defeats the purpose of using Spark in the first place.
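Putting the first part together, here is a minimal sketch that loops over the dictionary and writes each DataFrame as JSON under its own prefix (the bucket name and prefix are placeholders, and s3a credentials are assumed to be configured already):
for df, name in spark_dict.items():
    # each DataFrame gets its own prefix, keyed by the dictionary value
    df.write.format("json").mode("overwrite").save(f"s3a://test/dataframes/{name}")
Each prefix can later be read back with spark.read.json(...) and, if pandas is really needed, converted with .toPandas().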

How to take a subset of parquet files to create a deltatable using deltalake_rs python library

I am using the deltalake 0.4.5 Python library to read .parquet files into a DeltaTable and then convert it into a pandas dataframe, following the instructions here: https://pypi.org/project/deltalake/.
Here's the Python code to do this:
from deltalake import DeltaTable
table_path = "s3://example_bucket/data/poc"
dt = DeltaTable(table_path)
files = dt.files() # OK, returns the list of parquet files with full s3 path
# ['s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00001-8765abc67.parquet',
# 's3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00002-7643adc87.parquet',
# ........]
total_file_count = len(files) # OK, returns 115530
pt = dt.to_pyarrow_table() # hangs
df = dt.to_pyarrow_table().to_pandas() # hangs
I believe it hangs because the number of files is high (115K+).
So for my PoC I wanted to read the files for only a single day or hour. I tried setting the table_path variable to a path down to the hour, but it gives a Not a Delta table error, as shown below:
table_path = "s3://example_bucket/data/poc"
dt = DeltaTable(table_path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.7/site-packages/deltalake/table.py", line 19, in __init__
self._table = RawDeltaTable(table_path, version=version)
deltalake.PyDeltaTableError: Not a Delta table
How can I achieve this?
If the deltalake Python library can't be used to achieve this, what other tools/libraries should I try?
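Not an authoritative answer, but one possible workaround under these assumptions: since dt.files() already returns full S3 paths, you could filter that list down to a single hour's prefix and read only those parquet files with pyarrow and s3fs, bypassing to_pyarrow_table() for the full table. A rough sketch (the prefix is taken from the file listing above):
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
# keep only the files for one hour, based on the partition layout shown by dt.files()
hour_prefix = "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/"
subset = [f for f in dt.files() if f.startswith(hour_prefix)]
# read just that subset into a pandas dataframe
df = pq.ParquetDataset(subset, filesystem=fs).read().to_pandas()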

load jalali date from string in pyspark

I need to load a Jalali date from a string and then return it as a Gregorian date string. I'm using the following code:
def jalali_to_gregorian(col, format=None):
    if not format:
        format = "%Y/%m/d"
    gre = jdatetime.datetime.strptime(col, format=format).togregorian()
    return gre.strftime(format=format)
# register the function
spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType())
# load the date and show it:)
df = df.withColumn("financial_date", jalali_to_gregorian(df.PersianCreateDate))
df.select(['PersianCreateDate', 'financial_date']).show()
It throws ValueError: time data 'Column<PersianCreateDate>' does not match format '%Y/%m/%d' at me.
The string in the column matches the format and I have tested it; the problem is in how Spark is sending the column value to my function. Is there any way to solve it?
to test:
df=spark.createDataFrame([('1399/01/02',),('1399/01/01',)],['jalali'])
df = df.withColumn("gre", jalali_to_gregorian(df.jalali))
df.show()
should result in
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
Instead, I get:
Fail to execute line 2: df = df.withColumn("financial_date", jalali_to_gregorian(df.jalali))
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6468469233020961307.py", line 375, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "<stdin>", line 7, in jalali_to_gregorian
File "/usr/local/lib/python2.7/dist-packages/jdatetime/__init__.py", line 929, in strptime
(date_string, format))
ValueError: time data 'Column<jalali>' does not match format '%Y/%m/%d''%Y/%m/%d'
Your problem is that you're trying to apply the function to the Column object itself, not to the values inside the column.
The code that you have used, spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType()), registers your function for use in Spark SQL (via spark.sql(...)), not in the PySpark DataFrame API.
To get a function that you can use inside withColumn, select, etc., you need to create a wrapper with the udf function and use that wrapper in withColumn:
from pyspark.sql.functions import udf
jalali_to_gregorian_udf = udf(jalali_to_gregorian, StringType())
df = df.withColumn("gre", jalali_to_gregorian_udf(df.jalali))
>>> df.show()
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
See the documentation for more details.
You also have an error in the time format: instead of format = "%Y/%m/d" it should be format = "%Y/%m/%d".
P.S. If you're running on Spark 3.x, then I recommend looking at vectorized UDFs (aka pandas UDFs); they are much faster than regular UDFs and will provide better performance if you have a lot of data.
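For illustration, here is a sketch of that pandas UDF approach, assuming Spark 3.x and the %Y/%m/%d format from the question:
import pandas as pd
import jdatetime
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def jalali_to_gregorian_pudf(dates: pd.Series) -> pd.Series:
    # convert each Jalali date string in the batch to a Gregorian date string
    return dates.apply(
        lambda d: jdatetime.datetime.strptime(d, "%Y/%m/%d").togregorian().strftime("%Y/%m/%d")
    )

df = df.withColumn("gre", jalali_to_gregorian_pudf(df.jalali))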

Delta Lake Spark compaction after merge operation gives 'DeltaTable' object has no attribute '_get_object_id' error

I am doing a Delta Lake merge operation using the Python API and PySpark. After the merge I run a compaction operation, but the compaction gives the following error:
Error:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 170, in load
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1212, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1199, in _get_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_collections.py", line 501, in convert
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in <listcomp>
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Code
delta_table = "delta_lake_path"
df = spark.read.csv("s3n://input_file.csv",header=True)
delta_table = DeltaTable.forPath(spark, delta_table)
delta_table.merge(df, "df.id = delta_table.id" ).whenNotMatchedInsertAll().execute()
#compaction
spark.read.format("delta").load(delta_table).repartition(1).write.option("dataChange",
"False").format("delta").mode("overwrite").save(delta_table)
Can anyone suggest why the Spark session is not able to create another Delta table instance?
I need to perform both the merge and the compaction in the same script, since I want to run the compaction only on the partitions in which the merge was performed. The partitions are derived from the unique values present in the dataframe df created from input_file.csv.
I think your problem lies with the delta_table variable: at first it is a string containing the Delta Lake path, but then you reassign it to a DeltaTable object and pass that object into the .load() method, which expects a path string. Separating those variables could help:
delta_table_path = "delta_lake_path"
df = spark.read.csv("s3n://input_file.csv",header=True)
delta_table = DeltaTable.forPath(spark, delta_table_path)
delta_table.merge(df, "df.id = delta_table.id" ).whenNotMatchedInsertAll().execute()
#compaction
spark.read.format("delta").load(delta_table_path).repartition(1).write.option("dataChange",
"False").format("delta").mode("overwrite").save(delta_table_path )

get columns post group by in pyspark with dataframes

I see a couple of posts, post1 and post2, which are relevant to my question. However, while following the post1 solution I am running into the error below.
joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'
Entire code snippet
df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
joinedDF = df.join(df_agg, "company")
On the second line you have .show() at the end:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
remove it like this:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False)
and your code should work.
You called an action on that DataFrame and assigned the result to the df_agg variable; that's why your variable is NoneType (in Python) or Unit (in Scala).
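Putting it together, a minimal sketch of the corrected flow (assuming func is the usual pyspark.sql.functions alias from the question):
import pyspark.sql.functions as func

df_agg = (
    df.groupby("company")
    .agg(func.sum("raisedAmt").alias("TotalRaised"))
    .orderBy("TotalRaised", ascending=False)
)
joinedDF = df.join(df_agg, "company")
joinedDF.show()  # call the action on the joined result instead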
