How to process a large delta table with UDF? - apache-spark

I have a delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column.
My code is something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_udf(data):
    pass  # placeholder for the actual transformation

udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table""")
data = data.withColumn('new_column', udf_func(data.value))
The issue is that this takes a long time, as Spark will process all 300 billion rows and only then write the output. Is there a way to do some micro-batching and regularly write the output of those batches to the destination delta table?

The first rule usually is to avoid UDFs as much as possible - what kind of transformation do you need to perform that isn't available in Spark itself?
Second rule - if you can't avoid using a UDF, at least use Pandas UDFs that process data in batches and don't have such a big serialization/deserialization overhead - regular UDFs handle data row by row, encoding & decoding the data for every single row.
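For illustration, a rough sketch of the same column rewritten as a Pandas UDF - the function body here is just a hypothetical placeholder, not your actual transformation:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def my_pandas_udf(values: pd.Series) -> pd.Series:
    # receives a whole batch of values as a pandas Series instead of one row at a time
    return values.str.upper()  # hypothetical transformation - replace with your own logic

data = data.withColumn('new_column', my_pandas_udf(data.value))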
If your table was built up over time and consists of many files, you can try to use Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3 or 10.4), something like this:
maxNumFiles = 10  # max number of parquet files processed at once

df = spark.readStream \
    .option("maxFilesPerTrigger", maxNumFiles) \
    .table("large_table")
df = df.withColumn('new_column', udf_func(df.value))
df.writeStream \
    .option("checkpointLocation", "/some/path") \
    .trigger(availableNow=True) \
    .toTable("my_destination_table")
This will read the source table chunk by chunk, apply your transformation, and write the data into the destination table.

Related

How to print/log outputs within foreachBatch function?

Using table streaming, I am trying to write a stream using foreachBatch:
df.writeStream
    .format("delta")
    .foreachBatch(WriteStreamToDelta)
    ...
WriteStreamToDelta looks like:
def WriteStreamToDelta(microDF, batch_id):
    microDFWrangled = microDF."some_transformations"
    print(microDFWrangled.count())  # <-- How do I achieve the equivalent of this?
    microDFWrangled.writeStream...
I would like to view the number of rows in:
1. the Notebook, below the writeStream cell
2. the Driver Log
Create a list and append the number of rows of each micro batch to it.
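A rough sketch of that idea (the sink table and checkpoint path are placeholders) - since the foreachBatch function runs on the driver, the print shows up in the driver log and the list can be inspected from another notebook cell:
batch_counts = []  # lives on the driver

def WriteStreamToDelta(microDF, batch_id):
    microDFWrangled = microDF  # apply your transformations here
    n = microDFWrangled.count()
    batch_counts.append((batch_id, n))    # inspect this list from another notebook cell
    print(f"batch {batch_id}: {n} rows")  # visible in the driver log
    microDFWrangled.write.format("delta").mode("append").saveAsTable("target_table")  # hypothetical sink

df.writeStream \
    .foreachBatch(WriteStreamToDelta) \
    .option("checkpointLocation", "/some/checkpoint/path") \
    .start()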

How to get new/updated records from Delta table after upsert using merge?

Is there any way to get the updated/inserted rows after an upsert using merge into a Delta table in a Spark streaming job?
val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")

def upsertToDelta(events: DataFrame, batchId: Long) {
  deltaTable.as("table")
    .merge(
      events.as("event"),
      "event.entityId == table.entityId")
    .whenMatched()
    .updateExpr(...)
    .whenNotMatched()
    .insertAll()
    .execute()
}
df
  .writeStream
  .format("delta")
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .start()
I know I can create another job to read the updates from the delta table. But is it possible to do it within the same job? From what I can see, execute() returns Unit.
You can enable Change Data Feed on the table and then have another stream or batch job fetch the changes, so you'll be able to receive information on which rows were changed/deleted/inserted. It can be enabled with:
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
If the table isn't registered, you can use the path instead of the table name:
ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
The changes will be available if you add the .option("readChangeFeed", "true") option when reading the stream from the table:
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("table_name")
and it will add three columns to the table describing the change - the most important is _change_type (please note that there are two different change types for the update operation).
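For example, a small sketch that keeps only inserts, deletes and the post-update images (the _change_type values come from the Change Data Feed itself; the table name is a placeholder):
changes = spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("table_name") \
    .filter("_change_type != 'update_preimage'")  # drop the pre-update image of updated rows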
If you're worried about having another stream - it's not a problem, as you can run multiple streams inside the same job. You just shouldn't use .awaitTermination on a single query, but rather something like spark.streams.awaitAnyTermination() to wait on multiple streams.
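A rough sketch of that pattern, shown in Python to match the snippets above (the checkpoint locations and table names are placeholders):
# stream 1: the merge/upsert via foreachBatch, as in the question
q1 = df.writeStream \
    .foreachBatch(upsertToDelta) \
    .option("checkpointLocation", "/checkpoints/upsert") \
    .outputMode("update") \
    .start()

# stream 2: consume the Change Data Feed of the merged table
q2 = spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("table_name") \
    .writeStream \
    .option("checkpointLocation", "/checkpoints/cdf") \
    .toTable("changes_table")

# block until any of the running streams terminates (instead of q1.awaitTermination())
spark.streams.awaitAnyTermination()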
P.S. But maybe this answer will change if you explain why you need to get changes inside the same job?

Databricks spark best practice for writing a structured stream to a lot of sinks?

I'm using Databricks Spark 3.x, and I am reading a very large number of streams (100+); each stream has its own contract and needs to be written out to its own delta/parquet/sql/whatever table. While this is a lot of streams, the activity per stream is low - some streams might see only hundreds of records a day. I do want to stream because I am aiming for a fairly low-latency approach.
Here's what I'm talking about (code abbreviated for simplicity; I'm using checkpoints, output modes, etc. correctly).
Assume a schemas variable contains the schema for each topic. I've tried this approach, where I create a ton of individual streams, but it takes a lot of compute and most of it is wasted:
def batchprocessor(topic, schema):
    def F(df, batchId):
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
            UPDATE SET *
        WHEN NOT MATCHED
            INSERT *
        '''
        df.createOrReplaceTempView(f"SOME MERGE TABLE")
        df._jdf.sparkSession().sql(sql)
    return F

for topic in topics:
    query = (spark
        .readStream
        .format("delta")
        .load(f"/my-stream-one-table-per-topic/{topic}")
        .withColumn('json', from_json(col('value'), schemas[topic]))
        .select(col('json.*'))
        .writeStream
        .format("delta")
        .foreachBatch(batchprocessor(topic, schemas[topic]))
        .start())
I also tried to create just one stream that did a ton of filtering, but performance was pretty abysmal even in a test environment where I pushed a single message to a single topic:
def batchprocessor(df, batchId):
    df.cache()
    for topic in topics:
        filteredDf = (df.filter(f"topic == '{topic}'")
            .withColumn('json', from_json(col('value'), schemas[topic]))
            .select(col('json.*')))
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
            UPDATE SET *
        WHEN NOT MATCHED
            INSERT *
        '''
        filteredDf.createOrReplaceTempView(f"SOME MERGE TABLE")
        filteredDf._jdf.sparkSession().sql(sql)
    df.unpersist()

query = (spark
    .readStream
    .format("delta")
    .load(f"/my-stream-all-topics-in-one-but-partitioned")
    .writeStream
    .format("delta")
    .foreachBatch(batchprocessor)
    .start())
Is there any good way to essentially demultiplex a stream like this? It's already partitioned, so I assume the query planner isn't doing too much redundant work, but it seems like there's a huge amount of overhead nonetheless.
I ran a bunch of benchmarks, and option 2 is more efficient. I don't entirely know why yet.
Ultimately, performance still wasn't what I wanted - each topic runs in order, no matter the size, so a single record on each topic would lead the FIFO scheduler to queue up a lot of very inefficient small operations. I solved that using parallelisation:
import threading

def writeTable(table, df, poolId, sc):
    # give each write its own scheduler pool so the writes can run concurrently
    sc.setLocalProperty("spark.scheduler.pool", poolId)
    df.write.mode('append').format('delta').saveAsTable(table)
    sc.setLocalProperty("spark.scheduler.pool", None)

def processBatch(df, batchId):
    df.cache()
    dfsToWrite = {}
    for row in df.select('table').distinct().collect():
        table = row.table
        filteredDf = df.filter(f"table = '{table}'")
        dfsToWrite[table] = filteredDf
    threads = []
    for table, tableDf in dfsToWrite.items():  # don't shadow the outer df, it still needs unpersisting
        threads.append(threading.Thread(target=writeTable, args=(table, tableDf, table, spark.sparkContext)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    df.unpersist()
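For completeness, a rough sketch of how this might be wired up - the FAIR scheduler setting, source path and checkpoint location are assumptions on my part, since scheduler pools only take effect when spark.scheduler.mode is FAIR:
# assumption: the cluster/session is configured with spark.scheduler.mode=FAIR
query = (spark
    .readStream
    .format("delta")
    .load("/my-stream-all-topics-in-one-but-partitioned")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/demux")  # hypothetical checkpoint path
    .foreachBatch(processBatch)
    .start())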

RDD of DataFrame cannot change partition number in Spark Structured Streaming (Python)

I have the following PySpark code in Spark Structured Streaming to get a DataFrame from Redis:
def process(stream_batch, batch_id):
    stream_batch.persist()
    length = stream_batch.count()
    record_rdd = stream_batch.rdd.map(lambda x: b_to_ndarray(x['data']))
    # b_to_ndarray is a single-threaded method to convert bytes from Redis to an ndarray
    record_rdd = record_rdd.coalesce(4)  # does not work
    print(record_rdd.getNumPartitions())  # output 1
    # some other code
Why? And how can I fix it? The code in main is:
loadedDf = spark.readStream.format('redis')...
query = loadedDf.writeStream \
    .foreachBatch(process).start()
query.awaitTermination()
Since the number of partitions is 1 in the first place, coalesce cannot produce more partitions - it can only reduce them. So no matter what value you pass, it stays at 1 partition, unless you use repartition instead, as sketched below.
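A minimal sketch of that fix inside the same process function (repartition triggers a shuffle, which is what allows the partition count to grow):
def process(stream_batch, batch_id):
    stream_batch.persist()
    record_rdd = stream_batch.rdd.map(lambda x: b_to_ndarray(x['data']))
    record_rdd = record_rdd.repartition(4)  # repartition shuffles, so it can go from 1 to 4 partitions
    print(record_rdd.getNumPartitions())    # 4
    # some other code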

pyspark df.count() taking a very long time (or not working at all)

I have the following code that is simply doing some joins and then outputting the data:
from pyspark.sql.functions import udf, struct
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.functions import broadcast

conf = SparkConf()
conf.set('spark.logConf', 'true')
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("Generate Parameters") \
    .getOrCreate()
spark.sparkContext.setLogLevel("OFF")
df1 = spark.read.parquet("/location/mydata")
df1 = df1.select([c for c in df1.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df2 = spark.read.csv("/location/mydata2")
cond1 = [(df1.sender == df2._c1) | (df1.receiver == df2._c1)]
df3 = df1.join(broadcast(df2), cond1)
df3 = df3.select([c for c in df3.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df1 is 1,862,412,799 rows and df2 is 8679 rows
When I then call:
df3.count()
It just seems to sit there with the following
[Stage 33:> (0 + 200) / 200]
Assumptions for this answer:
df1 is the dataframe containing 1,862,412,799 rows.
df2 is the dataframe containing 8679 rows.
df1.count() returns a value quickly (as per your comment)
There may be three areas where the slowdown is occurring:
The imbalance of data sizes (1,862,412,799 vs 8679):
Although Spark is amazing at handling large quantities of data, it doesn't deal well with very small sets. If not specifically set, Spark attempts to partition your data into multiple parts, and on small files this partition count can be excessively high in comparison to the actual amount of data each part holds. I recommend trying the following and seeing if it improves speed.
df2 = spark.read.csv("/location/mydata2")
df2 = df2.repartition(2)
Note: The number 2 here is just an estimate, based on how many partitions would suit the number of rows in that set.
Broadcast Cost:
The delay in the count may be due to the actual broadcast step. Your data is being saved and copied to every node within your cluster before the join, and all of this happens once count() is called. Depending on your infrastructure, this could take some time. If the repartition above doesn't work, try removing the broadcast call. If that ends up being the delay, it is worth confirming that there are no bottlenecks within your cluster, or whether the broadcast is necessary at all.
Unexpected Merge Explosion:
I do not imply that this is an issue, but it is always good to check that the join condition you have set is not creating unexpected duplicates. It is possible that this is happening and creating the slowdown you are experiencing when df3 is actually processed; a quick check is sketched below.
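A small sketch of such a check, assuming df2._c1 is the join key from the snippet above - if a key appears many times on either side, the join output can explode:
# how often does each join key occur in the small table?
df2.groupBy('_c1').count().orderBy('count', ascending=False).show(10)

# and how skewed are the keys on the large side?
df1.groupBy('sender').count().orderBy('count', ascending=False).show(10)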
