How to print/log outputs within foreachBatch function? - apache-spark

Using table streaming, I am trying to write a stream using foreachBatch:
df.writeStream
    .format("delta")
    .foreachBatch(WriteStreamToDelta)
    ...
WriteStreamToDelta looks like:
def WriteStreamToDelta(microDF, batch_id):
    microDFWrangled = microDF."some_transformations"
    print(microDFWrangled.count())  # <-- How do I achieve the equivalent of this?
    microDFWrangled.writeStream...
I would like to view the number of rows in:
- the notebook, below the writeStream cell,
- the driver log,
- a list that collects the number of rows for each micro-batch.
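One way to get all three, as a minimal sketch: the function passed to foreachBatch runs in the driver process, so a plain print() ends up in the driver log (whether it also shows up below the notebook cell depends on the platform), and a driver-side Python list can collect the per-batch counts. The target table name and checkpoint path below are placeholders.
batch_counts = []  # driver-side list, one (batch_id, count) entry per micro-batch

def WriteStreamToDelta(microDF, batch_id):
    microDFWrangled = microDF  # apply your transformations here
    n = microDFWrangled.count()
    print(f"batch {batch_id}: {n} rows")  # ends up in the driver log
    batch_counts.append((batch_id, n))    # inspect batch_counts from the notebook afterwards
    # inside foreachBatch you use the batch writer (.write), not .writeStream
    microDFWrangled.write.format("delta").mode("append").saveAsTable("target_table")

(df.writeStream
    .foreachBatch(WriteStreamToDelta)
    .option("checkpointLocation", "/tmp/checkpoint")  # placeholder path
    .start())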

Related

How to process a large delta table with UDF?

I have a delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column.
My code is something like this:
def my_udf(data):
    pass
udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table """)
data = data.withColumn('new_column', udf_func(data.value))
The issue is that this takes a long time, as Spark will process all 300 billion rows and only then write the output. Is there a way to do some micro-batching and write the output of those batches to the output delta table regularly?
The first rule usually is to avoid UDFs as much as possible - what kind of transformation do you need to perform that isn't available in Spark itself?
Second rule - if you can't avoid using a UDF, at least use Pandas UDFs, which process data in batches and don't have such a big serialization/deserialization overhead - usual UDFs handle data row by row, encoding & decoding the data for each row.
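A minimal sketch of what the Pandas UDF variant could look like (the transformation inside is just a placeholder - substitute your own logic):
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def my_pandas_udf(values: pd.Series) -> pd.Series:
    # receives a whole batch of values as a pandas Series instead of one row at a time
    return values.astype(str)  # placeholder transformation

data = data.withColumn('new_column', my_pandas_udf(data.value))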
If your table was built up over time and consists of many files, you can try to use Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3+), something like this:
maxNumFiles = 10  # max number of parquet files processed at once
df = spark.readStream \
    .option("maxFilesPerTrigger", maxNumFiles) \
    .table("large_table")
df = df.withColumn('new_column', udf_func(df.value))
df.writeStream \
    .option("checkpointLocation", "/some/path") \
    .trigger(availableNow=True) \
    .toTable("my_destination_table")
This will read the source table chunk by chunk, apply your transformation, and write the data into the destination table.

How to get new/updated records from Delta table after upsert using merge?

Is there any way to get the updated/inserted rows after an upsert using merge into a Delta table in a Spark streaming job?
val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")

def upsertToDelta(events: DataFrame, batchId: Long): Unit = {
  deltaTable.as("table")
    .merge(
      events.as("event"),
      "event.entityId == table.entityId")
    .whenMatched()
    .updateExpr(...)
    .whenNotMatched()
    .insertAll()
    .execute()
}
df
  .writeStream
  .format("delta")
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .start()
I know I can create another job to read the updates from the delta table. But is it possible to do it in the same job? From what I can see, execute() returns Unit.
You can enable Change Data Feed on the table, and then have another stream or batch job fetch the changes, so you'll be able to receive information on which rows were changed/deleted/inserted. It can be enabled with:
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
If the table isn't registered, you can use the path instead of the table name:
ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
The changes will be available if you add the .option("readChangeFeed", "true") option when reading the stream from the table:
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("table_name")
This adds three columns describing the change: _change_type, _commit_version and _commit_timestamp. The most important is _change_type (please note that an update produces two rows, with the types update_preimage and update_postimage).
If you're worried about having another stream - it's not a problem, as you can run multiple streams inside the same job; you just shouldn't wait on a single query with .awaitTermination, but use something like spark.streams.awaitAnyTermination() to wait on multiple streams.
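A minimal sketch of that setup in PySpark syntax (the Scala API is analogous); process_changes is a hypothetical handler for the change rows and the checkpoint path is a placeholder:
# change-feed stream running next to the original upsert stream, in the same job
changes_query = (spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("table_name")
    .writeStream
    .foreachBatch(process_changes)                        # hypothetical handler for changed rows
    .option("checkpointLocation", "/tmp/cdf_checkpoint")  # placeholder path
    .start())

# wait on all active streams instead of calling awaitTermination() on one query
spark.streams.awaitAnyTermination()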
P.S. But maybe this answer will change if you explain why you need to get changes inside the same job?

How to convert Row to Dictionary in foreach() in pyspark?

I have a dataframe generated by Spark that I want to use with writeStream and also save to a database.
I have the following code:
output = (
    spark_event_df
    .writeStream
    .outputMode('update')
    .foreach(writerClass(**job_config_data))
    .trigger(processingTime="2 seconds")
    .start()
)
output.awaitTermination()
As I am using foreach(), writerClass gets a Row, and I cannot convert it into a dictionary in Python.
How can I get a Python data type (preferably a dictionary) from the Row in my writerClass, so that I can manipulate it according to my needs and save it to the database?
If you're just looking to save to a database as part of your stream, you could do that using foreachBatch and the built-in JDBC writer. Just do your transformations to shape your data according to the desired output schema, then:
def writeBatch(input, batch_id):
    (input
        .write
        .format("jdbc")
        .option("url", url)
        .option("dbtable", tbl)
        .mode("append")
        .save())
output = (spark_event_df
    .writeStream
    .foreachBatch(writeBatch)
    .start())
output.awaitTermination()
If you absolutely need custom logic for writing to your database that is not supported by the built-in JDBC writer, then you should use the DataFrame foreachPartition method to write your rows in bulk rather than one at a time. If you're using this method, you can convert the Row objects into a dict by simply calling asDict().
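A minimal sketch of that pattern, applied per micro-batch (get_connection and insert_record are hypothetical placeholders for your own database code):
def writePartition(rows):
    # rows is an iterator over the Row objects of a single partition
    conn = get_connection()          # hypothetical helper: open one DB connection per partition
    for row in rows:
        record = row.asDict()        # plain Python dict built from the Row
        insert_record(conn, record)  # hypothetical helper: custom write logic
    conn.close()

output = (spark_event_df
    .writeStream
    .foreachBatch(lambda batch_df, batch_id: batch_df.foreachPartition(writePartition))
    .start())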

rdd of DataFrame could not change partition number in Spark Structured Streaming python

I have the following PySpark code in Spark Structured Streaming to get a DataFrame from Redis:
def process(stream_batch, batch_id):
    stream_batch.persist()
    length = stream_batch.count()
    # b_to_ndarray is a single-threaded method to convert the bytes stored in Redis to an ndarray
    record_rdd = stream_batch.rdd.map(lambda x: b_to_ndarray(x['data']))
    record_rdd = record_rdd.coalesce(4)  # does not work
    print(record_rdd.getNumPartitions())  # output: 1
    # some other code
Why? How can I fix it? The code in main is:
loadedDf = spark.readStream.format('redis')...
query = loadedDf.writeStream \
    .foreachBatch(process).start()
query.awaitTermination()
Since the partition number is 1 in the first place, coalesce cannot help: it can only reduce the number of partitions, never increase it. So no matter how you call it, you will stay at 1 partition - unless you use repartition, which shuffles the data and can increase the partition count.
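A minimal sketch of the repartition variant of the process function from the question (b_to_ndarray as defined there):
def process(stream_batch, batch_id):
    stream_batch.persist()
    # repartition shuffles the data and can increase the partition count,
    # unlike coalesce, which can only reduce it
    record_rdd = (stream_batch.rdd
        .repartition(4)
        .map(lambda x: b_to_ndarray(x['data'])))
    print(record_rdd.getNumPartitions())  # 4
    # some other code
    stream_batch.unpersist()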

structured streaming writing to multiple streams

My scenario:
Gets data from a stream and calls a UDF which returns a JSON string. One of the attributes in the JSON string is UniqueId, which the UDF generates as guid.newGuid() (C#).
The DataFrame output of the UDF is written to multiple streams/sinks based on some filter.
Issue:
1. Each sink gets a new value for the UniqueId that was generated by the UDF. How can I maintain the same UniqueId for all sinks?
2. If each sink is getting different values for UniqueId, does that mean my UDF is getting called multiple times, once for each sink?
3. If the UDF is getting invoked twice, what is the option to get it called once and then just write the same data to different sinks?
inData = spark.readStream().format("eventhub")
udfdata = inData.select(from_json(myudf("column"), schema)).as("result").select(result.*)
filter1 = udfdata.filter("column == 'filter1'")
filter2 = udfdata.filter("column == 'filter2'")
# write filter1 to two different sinks
filter1.writeStream().format(delta).start(table1)
filter1.writeStream().format(eventhub).start()
# write filter2 to two different sinks
filter2.writeStream().format(delta).start(table2)
filter2.writeStream().format(eventhub).start()
Each time you call .writeStream()....start() you are creating a new, independent streaming query.
This means that for each output sink you define, Spark will read the input source again and process the dataframe again.
If you want to read and process the data only once and then output it to multiple sinks, you can use the foreachBatch sink as a workaround:
inData = spark.readStream().format("eventhub")
udfdata = inData.select(from_json(myudf("column"), schema)).as("result").select(result.*)
udfdata.writeStream().foreachBatch(filter_and_output).start()

def filter_and_output(udfdata, batchId):
    # At this point udfdata is a batch dataframe, no longer a streaming dataframe
    udfdata.cache()
    filter1 = udfdata.filter("column == 'filter1'")
    filter2 = udfdata.filter("column == 'filter2'")
    # write filter1
    filter1.write().format(delta).save(table1)
    filter1.write().format(eventhub).save()
    # write filter2
    filter2.write().format(delta).save(table2)
    filter2.write().format(eventhub).save()
    udfdata.unpersist()
You can learn more about foreachBatch in the Spark Structured Streaming documentation.
To answer your questions:
1. If you use foreachBatch, your data will be processed only once and you will have the same UniqueId for all sinks.
2. Yes.
3. Using foreachBatch will solve the issue.
