Databricks spark best practice for writing a structured stream to a lot of sinks? - apache-spark

I'm using databricks spark 3.x, and I am reading a very large number of streams (100+), and each stream has its own contract, and needs to be written out to its own delta/parquet/sql/whatever table. While this is a lot of streams, the activity per stream is low - some streams might see only hundreds of records a day. I do want to stream because I am aiming for a fairly low-latency approach.
Here's what I'm talking about (code abbreviated for simplicity; I'm using checkpoints, output modes, etc. correctly).
Assume a schemas variable contains the schema for each topic. I've tried this approach, where I create a ton of individual streams, but it takes a lot of compute and most of it is wasted:
def batchprocessor(topic, schema):
    def F(df, batchId):
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
            UPDATE SET *
        WHEN NOT MATCHED
            INSERT *
        '''
        df.createOrReplaceTempView("SOME MERGE TABLE")
        df._jdf.sparkSession().sql(sql)
    return F

for topic in topics:
    query = (spark
        .readStream
        .format("delta")
        .load(f"/my-stream-one-table-per-topic/{topic}")
        .withColumn('json', from_json(col('value'), schemas[topic]))
        .select(col('json.*'))
        .writeStream
        .format("delta")
        .foreachBatch(batchprocessor(topic, schemas[topic]))
        .start())
I also tried to create just one stream that did a ton of filtering, but performance was pretty abysmal even in a test environment where I pushed a single message to a single topic:
def batchprocessor(df, batchId):
    df.cache()
    for topic in topics:
        filteredDf = (df.filter(f"topic == '{topic}'")
            .withColumn('json', from_json(col('value'), schemas[topic]))
            .select(col('json.*')))
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
            UPDATE SET *
        WHEN NOT MATCHED
            INSERT *
        '''
        filteredDf.createOrReplaceTempView("SOME MERGE TABLE")
        filteredDf._jdf.sparkSession().sql(sql)
    df.unpersist()

query = (spark
    .readStream
    .format("delta")
    .load("/my-stream-all-topics-in-one-but-partitioned")
    .writeStream
    .format("delta")
    .foreachBatch(batchprocessor)
    .start())
Is there any good way to essentially demultiplex a stream like this? It's already partitioned, so I assume the query planner isn't doing too much redundant work, but it seems like there's a huge amount of overhead nonetheless.

I ran a bunch of benchmarks, and option 2 is more efficient. I don't entirely know why yet.
Ultimately, performance still wasn't what I wanted - each topic runs in order, no matter the size, so a single record on each topic would lead the FIFO scheduler to queue up a lot of very inefficient small operations. I solved that using parallelisation:
import threading

def writeTable(table, df, poolId, sc):
    sc.setLocalProperty("spark.scheduler.pool", poolId)
    df.write.mode('append').format('delta').saveAsTable(table)
    sc.setLocalProperty("spark.scheduler.pool", None)

def processBatch(df, batchId):
    df.cache()
    dfsToWrite = {}
    for row in df.select('table').distinct().collect():
        table = row.table
        dfsToWrite[table] = df.filter(f"table = '{table}'")
    threads = []
    for table, tableDf in dfsToWrite.items():
        # one scheduler pool per table so the small writes can run concurrently
        threads.append(threading.Thread(target=writeTable, args=(table, tableDf, table, spark.sparkContext)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    df.unpersist()
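One caveat worth noting: the spark.scheduler.pool local property only takes effect when the fair scheduler is enabled; with the default FIFO mode the per-table jobs still queue behind each other. A minimal sketch of that config, assuming it can be set when building the session (on Databricks it is usually a cluster-level Spark config; the app name and allocation-file path are placeholders):

from pyspark.sql import SparkSession

# Assumption: fair scheduling must be enabled for the per-table pools above
# to actually run concurrently instead of queueing FIFO.
spark = (SparkSession.builder
    .appName("demux-streams")                  # hypothetical app name
    .config("spark.scheduler.mode", "FAIR")    # default is FIFO
    # optionally define named pools, weights and minShare in an XML file:
    # .config("spark.scheduler.allocation.file", "/dbfs/path/fairscheduler.xml")
    .getOrCreate())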

Related

How to process a large delta table with UDF?

I have a Delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column.
My code is something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_udf(data):
    # placeholder transformation
    return None

udf_func = udf(my_udf, StringType())
data = spark.sql("SELECT * FROM large_table")
data = data.withColumn('new_column', udf_func(data.value))
The issue is that this takes a long time, as Spark processes all 300 billion rows and only then writes the output. Is there a way to do some micro-batching and write the output of those batches to the destination Delta table regularly?
The first rule usually is to avoid UDFs as much as possible - what kind of transformation do you need to perform that isn't available in Spark itself?
Second rule - if you can't avoid using a UDF, at least use a Pandas UDF, which processes data in batches and doesn't have as much serialization/deserialization overhead - plain UDFs handle data row by row, encoding and decoding each row individually.
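For illustration, a minimal sketch of the question's UDF rewritten as a Pandas UDF (the body is still a placeholder, as in the question; only the batching changes):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# A Pandas UDF receives a whole pandas Series per batch of rows instead of
# one value at a time, which avoids per-row serialization overhead.
@pandas_udf(StringType())
def my_pandas_udf(values: pd.Series) -> pd.Series:
    return values.astype(str)  # placeholder transformation

data = spark.table("large_table")
data = data.withColumn("new_column", my_pandas_udf(data.value))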
If your table was built up over time and consists of many files, you can try to use Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3 or later), something like this:
maxNumFiles = 10  # max number of parquet files processed at once

df = spark.readStream \
    .option("maxFilesPerTrigger", maxNumFiles) \
    .table("large_table")
df = df.withColumn('new_column', udf_func(df.value))
df.writeStream \
    .option("checkpointLocation", "/some/path") \
    .trigger(availableNow=True) \
    .toTable("my_destination_table")
This will read the source table chunk by chunk, apply your transformation, and write the data into the destination table.

How to get new/updated records from Delta table after upsert using merge?

Is there any way to get the updated/inserted rows after an upsert using merge into a Delta table in a Spark streaming job?
val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")

def upsertToDelta(events: DataFrame, batchId: Long) {
  deltaTable.as("table")
    .merge(
      events.as("event"),
      "event.entityId == table.entityId")
    .whenMatched()
    .updateExpr(...)
    .whenNotMatched()
    .insertAll()
    .execute()
}

df
  .writeStream
  .format("delta")
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .start()
I know I can create another job to read the updates from the Delta table. But is it possible to do it in the same job? From what I can see, execute() returns Unit.
You can enable Change Data Feed on the table, and then have another stream or batch job to fetch the changes, so you'll be able to receive information on which rows were changed/deleted/inserted. It can be enabled with:
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
If the table isn't registered, you can use the path instead of the table name:
ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
The changes will be available if you add the .option("readChangeFeed", "true") option when reading a stream from the table:
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("table_name")
and it will add three columns to the table describing the change - the most important is _change_type (note that there are two different types for the update operation).
If you're worried about having another stream - it's not a problem, as you can run multiple streams inside the same job; just don't use .awaitTermination on a single query, but something like spark.streams.awaitAnyTermination() to wait on multiple streams.
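A rough sketch of what the two streams could look like in one job (PySpark syntax for brevity; upsert_to_delta and handle_changes stand in for your foreachBatch functions, and the table names and checkpoint paths are placeholders):

# First stream: the existing merge/upsert logic.
upsert_query = (spark.readStream.format("delta").table("source_table")
    .writeStream
    .foreachBatch(upsert_to_delta)               # your existing merge logic
    .option("checkpointLocation", "/checkpoints/upsert")
    .start())

# Second stream: consume the Change Data Feed of the target table.
cdf_query = (spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("target_table")
    .writeStream
    .foreachBatch(handle_changes)                # inspect _change_type etc.
    .option("checkpointLocation", "/checkpoints/cdf")
    .start())

# Wait on all active streams instead of a single query.
spark.streams.awaitAnyTermination()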
P.S. But maybe this answer will change if you explain why you need to get changes inside the same job?

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
    print("Fetching now")
    streamingInputDF = (
        spark
        .readStream
        .format("delta")
        .option("maxBytesPerTrigger", 1024)
        .table(source_table)
        .where("measurementId IN (1351,1350)")
        .where("year >= '2021'")
    )
    query = (
        streamingInputDF
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/streaming_checkpoints/5")
        .foreachBatch(customWriter)
        .start()
        .awaitTermination()
    )
    return query

def customWriter(batchDF, batchId):
    print(batchId)
    print(batchDF.count())
    batchDF.show(10)
    length = batchDF.count()
    print("batchId, batch size:", batchId, length)
If I change the where clause in streamingInputDF to add more measurementIds, the Structured Streaming job doesn't always acknowledge the change and fetch the new data values. It continues to run as if nothing has changed, whereas at other times it starts fetching the new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of delta table:
col_name        data_type
measurementId   int
year            int
time            timestamp
q               smallint
v               string
"In structured streaming, will the checkpoints will keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory, in the folder "offsets", you will see that Spark stores the progress per batchId. For example, it will look like this:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
If you re-start your Structured Streaming query with an additional filter condition, it will therefore not be applied to historic records, but only to those that were added to the Delta Table after version 2.
In order to see this behavior in action, you can use the code below and analyse the content of the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"
// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)
// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
  .format("delta")
  .load(deltaPath)
  .filter(col("id").isin("1")) // in the second run add "2"
  .writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", checkpointLocation)
  .start()

query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart it with the new filter condition, the filter will be applied to the new data.

Spark streaming performance issue. Every minute process time increasing

We are facing a performance issue in my streaming application. I am reading my data from a Kafka topic using DirectStream and converting the data into a dataframe. After doing some aggregation operations on the dataframe, I save the result with registerTempTable. The temp table is used for the next minute's dataframe comparison, the compared result is saved to HDFS, and the data in the existing temp table is overwritten.
This is where I am facing the performance issue. My streaming job runs in 15 sec in the first minute, 18 sec in the second, 20 sec in the third, and the processing time keeps increasing like that. Over time my streaming job ends up queued.
About my application.
Streaming runs every 60 sec.
Spark version 2.1.1 (I am using PySpark).
The Kafka topic has four partitions.
To solve the issue, I have tried the steps below.
Step 1: While submitting my job I am setting "spark.sql.shuffle.partitions=4".
Step 2: While saving my dataframe as a text file I am using coalesce(4).
When I look at the Spark UI for each minute's run, the save-as-file stages keep doubling up: 14 stages in the first minute, 28 in the second and 42 in the third.
Spark UI Result.
Hi, thanks for the reply.
Sorry, I am new to Spark and I'm not sure what exactly I need to change - can you please help me? Do I need to unpersist my "df_data"?
I cache and unpersist my "df_data" data frame as well, but I am still facing the same issue.
Do I need to enable checkpointing, e.g. by adding ssc.checkpoint("/user/test/checkpoint")? Does createDirectStream support checkpointing? Or do I need to manage the offsets myself? Please let me know what change is required here.
if __name__ == "__main__":
    sc = SparkContext(appName="PythonSqlNetworkWordCount")
    sc.setLogLevel("ERROR")
    sqlCtx = SQLContext(sc)
    ssc = StreamingContext(sc, 60)
    zkQuorum = {"metadata.broker.list": "m2.hdp.com:9092"}
    topic = ["kpitmumbai"]
    kvs = KafkaUtils.createDirectStream(ssc, topic, zkQuorum)
    schema = StructType([StructField('A', StringType(), True), StructField('B', LongType(), True), StructField('C', DoubleType(), True), StructField('D', LongType(), True)])
    first_empty_df = sqlCtx.createDataFrame(sc.emptyRDD(), schema)
    first_empty_df.registerTempTable("streaming_tbl")
    lines = kvs.map(lambda x: x[1])
    lines.foreachRDD(lambda rdd: empty_rdd() if rdd.count() == 0 else CSV(rdd))
    ssc.start()
    ssc.awaitTermination()

def CSV(rdd1):
    spark = getSparkSessionInstance(rdd1.context.getConf())
    psv_data = rdd1.map(lambda l: l.strip("\s").split("|"))
    data_1 = psv_data.map(lambda l: Row(
        A=l[0],
        B=l[1],
        C=l[2],
        D=l[3]))
    hasattr(data_1, "toDF")
    df_2 = data_1.toDF()
    df_last_min_data = sqlCtx.sql("select A,B,C,D,sample from streaming_tbl")  # first time it will be empty, from the next minute onwards it has values
    df_data = df_2.groupby(['A', 'B']).agg(func.sum('C').alias('C'), func.sum('D').alias('D'))
    df_data.registerTempTable("streaming_tbl")
    con = (df_data.A == df_last_min_data.A) & (df_data.B == df_last_min_data.B)
    path1 = "/user/test/str" + str(Starttime).replace(" ", "").replace("-", "").replace(":", "")
    df_last_min_data.join(df_data, con, "inner").select(df_last_min_data.A, df_last_min_data.B, df_data.C, df_data.D).write.csv(path=path1, mode="append")
Once again thanks for the reply.
After doing some aggregation operations on the dataframe, I save the result with registerTempTable. The temp table is used for the next minute's dataframe comparison. And the compared result is saved to HDFS and the data in the existing temp table is overwritten.
The most likely problem is that you don't checkpoint the table, so the lineage keeps growing with each iteration. This makes each iteration more and more expensive, especially when the data is not cached.
Overall, if you need stateful operations you should prefer the existing stateful transformations. Both the "old" streaming and Structured Streaming come with their own variants, which can be used in a variety of scenarios.
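A minimal sketch of the checkpointing idea in PySpark, reusing the sc/sqlCtx handles from the question (the checkpoint path and helper name are hypothetical):

# Truncate the lineage at the end of every micro-batch so the plan does not
# accumulate the whole history of previous joins/aggregations.
sc.setCheckpointDir("/user/test/df_checkpoints")   # hypothetical HDFS path

def publish_for_next_batch(df_data):
    # checkpoint() materializes the data and cuts the logical plan; the next
    # iteration reads from the checkpoint files instead of replaying lineage
    truncated = df_data.checkpoint(eager=True)
    truncated.registerTempTable("streaming_tbl")
    return truncated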

Can spark execute table operations on remote nodes? (vs. row operations)

Most of Spark's Dataset functions are per-row operations. However, I'd like to distribute the execution of ML tasks to run on Spark -- most ML tasks are naturally functions of tables, not functions of rows. (I've looked at MLlib -- it's way too limited, and in many cases execution is made orders of magnitude slower in Spark by distributing operations over many cores that could otherwise fit on a single core.)
It's important that ML algorithms process collections of rows, not single rows, so I'd like to materialize a table into memory on a node. (I pinky promise it will fit in memory.) How can I do this?
Functionally, I'd like to do:
def mlsubtask(table, arg2, arg3):
    data = table.collect()
    ...

sc = SparkContext(...)
sqlctx = SQLContext(sc)
...
df = sqlctx.sql("SELECT ...")
results = sc.parallelize([(df,arg2,arg3),(df,arg2,arg3),(df,arg2,arg3)]).map(mlsubtask).collect()
I can perform execution like this:
sc = SparkContext(...)
sqlctx = SQLContext(sc)
...
df = sqlctx.sql("SELECT ...")
df = df.collect()
results = sc.parallelize([(df,arg2,arg3),(df,arg2,arg3),(df,arg2,arg3)]).map(mlsubtask).collect()
... but this brings the data to the client, where it is then re-serialized, which is quite inefficient.
For a single task:
def mlsubtask(iter_rows):
    data_table = list(iter_rows)  # or another way of bringing the partition into memory
    ...

df.repartition(1).rdd.mapPartitions(mlsubtask)
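A minimal usage sketch of that pattern, with a hypothetical train_model routine; the whole single-partition table is materialized in memory on one executor and only the result comes back to the driver:

def ml_on_partition(iter_rows):
    rows = list(iter_rows)        # the whole table, in memory on one executor
    model = train_model(rows)     # placeholder for the single-core ML routine
    return iter([model])          # mapPartitions expects an iterator back

# In PySpark, mapPartitions lives on the underlying RDD; only the trained
# result is collected back to the driver.
result = df.repartition(1).rdd.mapPartitions(ml_on_partition).collect()[0]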
