Upsert (merge) Delta with Spark Structured Streaming

I need to upsert data in real time (with Spark Structured Streaming) in Python.
The data is read in real time (CSV format) and then written as a Delta table (we want to update existing rows, which is why we use Delta's MERGE INTO).
I am using the Delta engine on Databricks.
I coded this:
from pyspark.sql import SparkSession
from delta.tables import *

spark = SparkSession.builder \
    .config("spark.sql.streaming.schemaInference", "true") \
    .appName("SparkTest") \
    .getOrCreate()

# CSV data that we read in real time
sourcedf = spark.readStream.format("csv") \
    .option("header", True) \
    .load("/mnt/user/raw/test_input")

spark.conf.set("spark.sql.shuffle.partitions", "1")

# create an empty Delta table with the schema of the source
spark.createDataFrame([], sourcedf.schema) \
    .write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("deltaTable")

def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

sourcedf.writeStream \
    .format("delta") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
    .option("path", "/mnt/user/raw/PARQUET/output") \
    .start() \
    .awaitTermination()
But nothing gets written to the output path as expected. The checkpoint path gets filled in as expected, and a display of the Delta table gives me results too:
display(table("deltaTable"))
In the Spark UI I see the writeStream step:
sourcedf.writeStream \ .format("delta") \ ...
first at Snapshot.scala:156
RDD: Delta Table State #1 - dbfs:/user/hive/warehouse/deltatable/_delta_log
Any idea how to fix this so I can upsert CSV data into Delta tables in S3 in real time with Spark?
Best regards

Apologies for a late reply, but just in case anyone else has the same problem: I have found the below worked for me. I wonder if it is because you didn't use "cloudFiles" on your readStream to make use of Auto Loader?
%python
# csvSchema is the schema of the incoming CSV files, defined elsewhere
sourcedf = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.includeExistingFiles", "true") \
    .schema(csvSchema) \
    .load("/mnt/user/raw/test_input")
%sql
CREATE TABLE IF NOT EXISTS deltaTable (
    col1 int NOT NULL,
    col2 string NOT NULL,
    col3 bigint,
    col4 int
)
USING DELTA
LOCATION '/mnt/user/raw/PARQUET/output'
%python
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
%python
sourcedf.writeStream \
    .format("delta") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
    .start("/mnt/user/raw/PARQUET/output")
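A side note on the same pattern: the foreachBatch upsert can also be written with the DeltaTable Python API instead of a temp view plus SQL. This is only a minimal sketch, assuming the target table lives at the LOCATION used above and that the key column is Id, as in the question:

from delta.tables import DeltaTable

def upsertToDelta(microBatchOutputDF, batchId):
    # load the target Delta table by path and merge the micro-batch into it
    deltaTable = DeltaTable.forPath(spark, "/mnt/user/raw/PARQUET/output")
    deltaTable.alias("t") \
        .merge(microBatchOutputDF.alias("s"), "s.Id = t.Id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()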

Related

How to filter data using spark.read in place?

I try to read data in Delta format from ADLS. I want to read some portion of that data using a filter in place. The same approach worked for me when reading the JDBC format:
query = f"""
    select * from {table_name}
    where
        createdate < to_date('{createdate}', 'YYYY-MM-DD HH24:MI:SS') or
        modifieddate < to_date('{modifieddate}', 'YYYY-MM-DD HH24:MI:SS')
"""

return spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
So I tried to read the Delta data in a similar way using a query, but it reads the whole table.
return spark.read \
    .format("delta") \
    .option("query", query) \
    .load(path)
How could I solve this issue without reading the full DataFrame and then filtering it?
Thanks in advance!
Spark uses a feature called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database.
Delta does not work that way. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying Parquet files, you have to read all of them into memory and filter afterwards.
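For what it's worth, a minimal sketch of the read-then-filter approach described above, assuming path, createdate and modifieddate are the same variables used in the question and that createdate/modifieddate are timestamp columns of the Delta table:

from pyspark.sql import functions as F

df = spark.read.format("delta").load(path) \
    .filter(
        (F.col("createdate") < F.to_timestamp(F.lit(createdate))) |
        (F.col("modifieddate") < F.to_timestamp(F.lit(modifieddate)))
    )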

Spark Structured Streaming inconsistent output to multiple sinks

I am using Spark Structured Streaming to read data from Kafka and apply a UDF to the dataset. The code is as below:
calludf = F.udf(lambda x: function_name(x))

dfraw = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
    .option('subscribe', topic_name) \
    .load()

df = dfraw.withColumn("value", F.col('value').cast('string')) \
    .withColumn('value', calludf(F.col('value')))

ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .start()

dsf = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location) \
    .start()

ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. 2 of them fail due to some issue with the data, which is understandable. The console displays the remaining 8 processed dataframes, BUT only 6 of those 8 are written to the Kafka topic by the dsf streaming query. Even though I have added a checkpoint location to it, it is still not working.
PS: Do let me know if you have any suggestions regarding the code as well. I am new to Spark Structured Streaming, so maybe there is something wrong with the way I am doing it.
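One general point worth checking in a fan-out like this, offered only as a hedge and not as a confirmed diagnosis for the missing records: each streaming query usually gets its own checkpoint location, and the console query above has none. A sketch of what that could look like, reusing the names from the question (the sub-directories are hypothetical):

ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .option('checkpointLocation', checkpoint_location + '/console') \
    .start()

dsf = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location + '/kafka') \
    .start()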

How to make Spark streams execute sequentially

Issue
I have a job that executes two streams in total, but I want the second one to start after the first stream has finished, since the first stream saves events from the readStream into a Delta table that serves as input for the second stream. The problem is that what is added by the first stream is not available to the second stream in the current notebook run, because they start simultaneously.
Is there a way to enforce the order while running it from the same notebook?
I've tried the awaitTermination function but discovered this does not solve my problem. Some pseudocode:
def main():
    # Read eventhub
    metricbeat_df = spark \
        .readStream \
        .format("eventhubs") \
        .options(**eh_conf) \
        .load()

    # Save raw events
    metricbeat_df.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query1") \
        .table("my_db.raw_events")

    # Parse events
    metricbeat_df = spark.readStream \
        .format("delta") \
        .option("ignoreDeletes", True) \
        .table("my_db.raw_events")

    # *Do some transformations here*

    metricbeat_df.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query2") \
        .table("my_db.joined_bronze_events")
TLDR
To summarize the issue: when I run the code above, query1 and query2 start at the same time, which means that my_db.joined_bronze_events lags a bit behind my_db.raw_events, because what is added by query1 is not available to query2 in the current run (it will be in the next run, of course).
Is there a way to enforce that query2 will not start until query1 has finished, while still running it in the same notebook?
As you are using the option Trigger.once, you can make use of the processAllAvailable method in your StreamingQuery:
def main():
    # Read eventhub
    # note that I have changed the variable name to metricbeat_df1
    metricbeat_df1 = spark \
        .readStream \
        .format("eventhubs") \
        .options(**eh_conf) \
        .load()

    # Save raw events
    metricbeat_df1.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query1") \
        .table("my_db.raw_events") \
        .processAllAvailable()

    # Parse events
    # note that I have changed the variable name to metricbeat_df2
    metricbeat_df2 = spark.readStream \
        .format("delta") \
        .option("ignoreDeletes", True) \
        .table("my_db.raw_events")

    # *Do some transformations here*

    metricbeat_df2.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query2") \
        .table("my_db.joined_bronze_events") \
        .processAllAvailable()
Note that I have changed the dataframe names, as they should not be the same for both streaming queries.
The method processAllAvailable is described as:
"Blocks until all available data in the source has been processed and committed to the sink. This method is intended for testing. Note that in the case of continually arriving data, this method may block forever. Additionally, this method is only guaranteed to block until data that has been synchronously appended data to a org.apache.spark.sql.execution.streaming.Source prior to invocation. (i.e. getOffset must immediately reflect the addition)."

Databricks: Structured Stream fails with TimeoutException

I want to create a structured stream in Databricks with a Kafka source.
I followed the instructions as described here. My script seems to start, however it fails on the first element of the stream. The stream itself works fine, produces results, and works (in Databricks) when I use confluent_kafka, so there seems to be a different issue I am missing:
After the initial stream is processed, the script times out:
java.util.concurrent.TimeoutException: Stream Execution thread for stream [id = 80afdeed-9266-4db4-85fa-66ccf261aee4,
runId = b564c626-9c74-42a8-8066-f1f16c7ab53d] failed to stop within 36000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.
WHAT I TRIED: looking at SO and finding this answer, from which I added
spark.conf.set("spark.sql.streaming.stopTimeout", 36000)
to my setup, which changed nothing.
Any input is highly appreciated!
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Define a data schema
schema = StructType() \
    .add('PARAMETERS_TEXTVALUES_070_VALUES', StringType()) \
    .add('ID', StringType()) \
    .add('PARAMETERS_TEXTVALUES_001_VALUES', StringType()) \
    .add('TIMESTAMP', TimestampType())

df = spark \
    .readStream \
    .format("kafka") \
    .option("host", "stream.xxx.com") \
    .option("port", 12345) \
    .option('kafka.bootstrap.servers', 'stream.xxx.com:12345') \
    .option('subscribe', 'stream_test.json') \
    .option("startingOffset", "earliest") \
    .load()

df_word = df.select(F.col('key').cast('string'),
                    F.from_json(F.col('value').cast('string'), schema).alias("parsed_value"))

df_word \
    .writeStream \
    .format("parquet") \
    .option("path", "dbfs:/mnt/streamfolder/stream/") \
    .option("checkpointLocation", "dbfs:/mnt/streamfolder/check/") \
    .outputMode("append") \
    .start()
my stream output data looks like this:
"PARAMETERS_TEXTVALUES_070_VALUES":'something'
"ID":"47575963333908"
"PARAMETERS_TEXTVALUES_001_VALUES":12345
"TIMESTAMP": "2020-10-22T15:06:42.507+02:00"
Furthermore, the stream and check folders are filled with 0-byte files, except for metadata, which includes the id from the error above.
Thanks and stay safe.
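As a point of comparison only, not a confirmed fix for the timeout: a Kafka source definition that sticks to the documented source options looks roughly like the sketch below. Note that the documented option name is startingOffsets (plural), and that host and port are not among the documented Kafka source options, since the bootstrap servers string already carries both.

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "stream.xxx.com:12345") \
    .option("subscribe", "stream_test.json") \
    .option("startingOffsets", "earliest") \
    .load()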

Upsert data in postgresql using spark structured streaming

I am trying to run a structured streaming application using (py)spark. My data is read from a Kafka topic, and then I am running a windowed aggregation on event time.
# I have been able to create the data frame pn_data_df after reading data from Kafka

Schema of pn_data_df:
- id: StringType
- source: StringType
- source_id: StringType
- delivered_time: TimestampType
from pyspark.sql.functions import window, unix_timestamp

windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
    .withWatermark("delivered_time", "24 hours") \
    .groupBy('source_id', window('delivered_time', '15 minute')) \
    .count()

windowed_report_df = windowed_report_df \
    .withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
    .withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
    .selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')
I am writing this windowed aggregation to my PostgreSQL database, in which I have already created this table:
CREATE TABLE pn_delivery_report (
    source_id bigint not null,
    start_ts bigint not null,
    end_ts bigint not null,
    count integer not null,
    unique (source_id, start_ts)
);
Writing to PostgreSQL using Spark JDBC allows me to either append or overwrite. Append mode fails if a row with the composite key already exists in the database, and overwrite just overwrites the entire table with the current batch output.
def write_pn_report_to_postgres(df, epoch_id):
    df.write \
        .mode('append') \
        .format('jdbc') \
        .option("url", "jdbc:postgresql://db_endpoint/db") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "pn_delivery_report") \
        .option("user", "postgres") \
        .option("password", "PASSWORD") \
        .save()

windowed_report_df.writeStream \
    .foreachBatch(write_pn_report_to_postgres) \
    .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
    .outputMode('update') \
    .start()
How can I execute a query like

INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
VALUES (1001, 125000000001, 125000050000, 128),
       (1002, 125000000001, 125000050000, 127)
ON CONFLICT (source_id, start_ts)
DO UPDATE SET count = excluded.count;

in foreachBatch?
Spark has a JIRA feature ticket open for this, but it seems it has not been prioritised so far:
https://issues.apache.org/jira/browse/SPARK-19335
That's what worked for me:
def _write_streaming(df, epoch_id) -> None:
    # note: foreachBatch calls this with (batch_df, epoch_id), so it must not take self
    df.write \
        .mode('append') \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/postgres") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", 'table_test') \
        .option("user", 'user') \
        .option("password", 'password') \
        .save()

df_stream.writeStream \
    .foreachBatch(_write_streaming) \
    .start() \
    .awaitTermination()
You need to add ".awaitTermination()" at the end.
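Note that the snippet above still only appends, so it does not give the ON CONFLICT behaviour the question asks for. One common workaround, sketched here under the assumption that psycopg2 is available on the cluster (the connection details are just the placeholders from the question), is to run the upsert SQL yourself inside the foreachBatch function and plug it into the existing writeStream call:

import psycopg2

def upsert_pn_report_to_postgres(df, epoch_id):
    # collect the micro-batch on the driver; windowed aggregates are usually small
    rows = df.select("source_id", "start_ts", "end_ts", "count").collect()
    if not rows:
        return
    conn = psycopg2.connect(host="db_endpoint", dbname="db",
                            user="postgres", password="PASSWORD")
    try:
        # the connection context manager commits on success and rolls back on error
        with conn, conn.cursor() as cur:
            cur.executemany(
                """
                INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
                VALUES (%s, %s, %s, %s)
                ON CONFLICT (source_id, start_ts)
                DO UPDATE SET count = excluded.count
                """,
                [(r["source_id"], r["start_ts"], r["end_ts"], r["count"]) for r in rows],
            )
    finally:
        conn.close()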
