can we do pivot when we read data from delta in a streaming way and write to delta table(streaming sink) or pl suggest alternate to parse array col - apache-spark

I wanted to parse a array column of type: ["key: value key1: value1 key2: value2 .. key_35: value_35\n", ""] and then take these key columns and put that as column names along with other columns and value as the corresponding key columns' values
I am reading from silver layer of delta lake and writing back to again silver layer of delta lake
This is the code I am using but I am getting error like Queries with streaming sources must be executed with writeStream.start(); tahoe
dfInput = (spark.readStream
.option("maxFilesPerTrigger", 1)
.format("delta")
.load(path)
.withColumn("status", col("status").getItem(0))
.select(col("date"), col("id"), col("status"))
.withColumn("status", trim(col("status")))
.withColumn("status", split(col("status"), " "))
)
(dfInput.writeStream
.trigger(processingTime='1 seconds')
.option("checkpointLocation", "checkpointPath")
.foreachBatch(forEachFunc)
.start()
)
def forEachFunc(split_df,batch_id):
dfParsed = (dfInput
.withColumn("kv", explode(col("status")))
.withColumn("key", concat(lit("status_"), split(col("kv"), ":")[0]))
.withColumn("val", split(col("kv"), ":")[1]))
(dfParsed.groupBy("date", "id")
.pivot("key")
.agg(first("val"))
.write.format("delta")
.mode("append")
.save(destPath))
pass
I tried pivoting inside foreachBatch function on seeing stack over flow solution (Pivot a streaming dataframe pyspark) and also this link (https://www.mssqltips.com/sqlservertip/6563/pivot-transformations-for-spark-streaming/#:~:text=Spark%20streaming%20supports%20wide%20range,am%20going%20to%20describe%20here.) but I am still getting error - Queries with streaming sources must be executed with writeStream.start(); tahoe

Related

PySpark dataframe not getting saved in Hive due to file format mismatch

I want to write the streaming data from kafka topic to hive table.
I am able to create dataframes by reading kafka topic, but the data is not getting written to Hive Table due to file-format mismatch. I have specified dataframe.format("parquet") and the hive table is created with stored as parquet.
Below is the code snippet:
def hive_write_batch_data(data, batchId):
data.write.format("parquet").mode("append").saveAsTable(table)
def write_to_hive(data,kafka_sink_name):
global table
table = kafka_sink_name
data.select(col("key"),col("value"),col("offset")) \
.writeStream.foreachBatch(hive_write_batch_data) \
.start().awaitTermination()
if __name__ == '__main__':
kafka_sink_name = sys.argv[1]
kafka_config = {
....
..
}
spark = SparkSession.builder.appName("Test Streaming").enableHiveSupport().getOrCreate()
df = spark.readStream \
.format("kafka") \
.options(**kafka_config) \
.load()
df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)","offset","timestamp","partition")
write_to_hive(df1,kafka_sink_name)
Hive table is created as Parquet:
CREATE TABLE test.kafka_test(
key string,
value string,
offset bigint)
STORED AS PARQUET;
It is giving me the Error:
pyspark.sql.utils.AnalysisException: "The format of the existing table test.kafka_test is `HiveFileFormat`. It doesn\'t match the specified format `ParquetFileFormat`.;"
How do I write the dataframe to hive table ?
I dropped the hive table, and ran the Spark-streaming job. Table got created with the correct format.

How to process a large delta table with UDF?

I have a delta table with about 300 billion rows. Now I am performing some operations on a column using UDF and creating another column
My code is something like this
def my_udf(data):
return pass
udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table """)
data = data.withColumn('new_column', udf_func(data.value))
The issue now is this take a long amount of time as Spark will process all 300 billion rows and then write the output. Is there a way where we can do some Mirco batching and write output of those regularly to the output delta table
The first rule usually is to avoid UDFs as much of possible - what kind of transformation do you need to perform that isn't available in the Spark itself?
Second rule - if you can't avoid using UDF, at least use Pandas UDFs that process data in batches, and don't have so big serialization/deserialization overhead - usual UDFs are handling data row by row, encoding & decoding data for each of them.
If your table was built over the time, and consists of many files, you can try to use Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3 or 10.4), something like this:
maxNumFiles = 10 # max number of parquet files processed at once
df = spark.readStream \
.option("maxFilesPerTrigger", maxNumFiles) \
.table("large_table")
df = df.withColumn('new_column', udf_func(data.value))
df.writeStream \
.option("checkpointLocation", "/some/path") \
.trigger(availableNow=True) \
.toTable("my_destination_table")
this will read the source table chunk by chunk, apply your transformation, and write data into a destination table.

How to get new/updated records from Delta table after upsert using merge?

Is there any way to get updated/inserted rows after upsert using merge to Delta table in spark streaming job?
val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")
def upsertToDelta(events: DataFrame, batchId: Long) {
deltaTable.as("table")
.merge(
events.as("event"),
"event.entityId == table.entityId")
.whenMatched()
.updateExpr(...))
.whenNotMatched()
.insertAll()
.execute()
}
df
.writeStream
.format("delta")
.foreachBatch(upsertToDelta _)
.outputMode("update")
.start()
I know I can create another job to read updates from the delta table. But is it possible to do the same job? From what I can see, execute() returns Unit.
You can enable Change Data Feed on the table, and then have another stream or batch job to fetch the changes, so you'll able to receive information on what rows has changed/deleted/inserted. It could be enabled with:
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
if thable isn't registered, you can use path instead of table name:
ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
The changes will be available if you add the .option("readChangeFeed", "true") option when reading stream from a table:
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.table("table_name")
and it will add three columns to table describing the change - the most important is _change_type (please note that there are two different types for update operation).
If you're worried about having another stream - it's not a problem, as you can run multiple streams inside the same job - you just don't need to use .awaitTermination, but something like spark.streams.awaitAnyTermination() to wait on multiple streams.
P.S. But maybe this answer will change if you explain why you need to get changes inside the same job?

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
print("Fetching now")
streamingInputDF = (
spark
.readStream
.format("delta")
.option("maxBytesPerTrigger",1024)
.table(source_table)
.where("measurementId IN (1351,1350)")
.where("year >= '2021'")
)
query = (
streamingInputDF
.writeStream
.outputMode("append")
.option("checkpointLocation", "/streaming_checkpoints/5")
.foreachBatch(customWriter)
.start()
.awaitTermination()
)
return query
def customWriter(batchDF,batchId):
print(batchId)
print(batchDF.count())
batchDF.show(10)
length = batchDF.count()
print("batchId,batch size:",batchId,length)
If I change the where clause in the streamingInputDF to add more measurentId, the structured streaming job doesn't always acknowledge the change and fetch the new data values. It continues to run as if nothing has changed, whereas at times it starts fetching new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of delta table:
col_name
data_type
measurementId
int
year
int
time
timestamp
q
smallint
v
string
"In structured streaming, will the checkpoints will keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
Re-starting your Structured Streaming query with an additional filter condition will therefore not be applied to historic records but only to those that were added to the Delta Table after version 2.
In order to see this behavior in action you can use below code and analyse the content in the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"
// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)
// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
.format("delta")
.load(deltaPath)
.filter(col("id").isin("1")) // in the second run add "2"
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart is with the new filter condition it will be applied to the new data.

How to save kafka data into different location based on a column value in spark structured streaming?

I have a usecase in which I am consuming data from Kafka using spark structured streaming. I have multiple topics to subscribe and based on the topic name the dataframe should be dumped to a defined location(different location for different topics). I saw if this can be solved using some kind of split/filter function in spark dataframe but could not find any.
As of now I am only subscribed to one topic and I am using my own written method to dump the data into a location in parquet's format. Here is the code I am currently using :
def save_as_parquet(cast_dataframe: DataFrame,output_path:
String,checkpointLocation: String): Unit = {
val query = cast_dataframe.writeStream
.format("parquet")
.option("failOnDataLoss",true)
.option("path",output_path)
.option("checkpointLocation",checkpointLocation)
.start()
.awaitTermination()
}
When I will be subscribed to different topics, then this cast_dataframe will also have values from different topics. I wish to dump the data from a topic to only the location it is assigned location. How can this be done ?
As explained in the official documentation Dataset to be written might contain optional topic column, which can be used for message routing:
* The topic column is required if the “topic” configuration option is not specified.
The value column is the only required option. If a key column is not specified then a null valued key column will be automatically added (see Kafka semantics on how null valued key values are handled). If a topic column exists then its value is used as the topic when writing the given row to Kafka, unless the “topic” configuration option is set i.e., the “topic” configuration option overrides the topic column.
According to the documentation each Row from Kafka source has the following schema:
Column
Type
key
binary
value
binary
topic
string
...
...
Assuming you are reading from multiple topics using the source option
val kafkaInputDf = spark.readStream.format("kafka").[...]
.option("subscribe", "topic1, topic2, topic3")
.start()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic")
you can then apply a filter to on the column topic to split the data accordingly:
val df1 = kafkaInputDf.filter(col("topic") === "topic1")
val df2 = kafkaInputDf.filter(col("topic") === "topic2")
val df3 = kafkaInputDf.filter(col("topic") === "topic3")
Then you can sink those three streaming Dataframes df1, df2 and df3 into their required sinks. As this will create three in parallel running streaming queries it is important that each writeStream get its own checkpoint location.

Resources