I'm streaming data from Kafka and trying to merge ~30 million records into a Delta Lake table.
def do_the_merge(microBatchDF, partition):
    deltaTable.alias("target") \
        .merge(microBatchDF.alias("source"), "source.id1 = target.id2 AND source.id = target.id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()
I see that Spark is stuck for almost an hour on a task named SynapseLoggingShim.
Once this stage completes, the actual write to the Delta table starts and takes one more hour.
I'm trying to understand: what does this SynapseLoggingShim stage do?
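For context, the merge function above is meant to be attached to the stream via foreachBatch; here is a minimal sketch of that wiring (the source DataFrame name, checkpoint path, and trigger interval are placeholders, not from the original job):

# Hypothetical wiring: foreachBatch calls do_the_merge once per micro-batch,
# passing the micro-batch DataFrame and the batch id.
(kafka_df.writeStream
    .foreachBatch(do_the_merge)
    .option("checkpointLocation", "/mnt/checkpoints/delta_merge")  # placeholder path
    .trigger(processingTime="1 minute")                            # placeholder interval
    .start())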
Answering my own question: the SynapseLoggingShim stage was simply waiting on the merge task to complete.
It's just an OpenTelemetry wrapper that collects metrics.
The real problem is that we are bottlenecked by the source! The Event Hub we are reading from has 32 partitions, and Spark's read parallelism is constrained by the Event Hub partition count.
In simple words, increasing the number of Spark cores doesn't reduce the time, because the source Event Hub limits the parallelism to the topic's partition count.
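One way to confirm the source-side limit is to log the partition count of each micro-batch inside the merge function. A sketch of that check (the Delta table path is a placeholder; the second foreachBatch argument is the batch id, which the original snippet calls "partition"):

from delta.tables import DeltaTable

def do_the_merge(microBatchDF, batch_id):
    # With a 32-partition Event Hub this usually prints 32, no matter how many
    # cores the cluster has: the source caps the read parallelism.
    print(f"batch {batch_id}: {microBatchDF.rdd.getNumPartitions()} source partitions")

    deltaTable = DeltaTable.forPath(spark, "/mnt/delta/target_table")  # placeholder path
    (deltaTable.alias("target")
        .merge(microBatchDF.alias("source"),
               "source.id1 = target.id2 AND source.id = target.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())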
I am using Spark Structured Streaming on a Databricks cluster to extract data from Azure Event Hub, process it, and write it to Snowflake using foreachBatch, with the epoch_id/batch_id passed to the foreach batch function.
My code looks something like this:
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(EVENT_HUB_CONNECTION_STRING)
ehConf['eventhubs.consumerGroup'] = consumergroup
# Read stream data from event hub
spark_df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
Some transformations...
Write to Snowflake
def foreach_batch_function(df, epoch_id):
    df.write \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", snowflake_table) \
        .mode('append') \
        .save()

processed_df.writeStream \
    .outputMode('append') \
    .trigger(processingTime='10 seconds') \
    .option("checkpointLocation", f"checkpoint/P1") \
    .foreachBatch(foreach_batch_function) \
    .start()
Currently I am facing 2 issues:
1. When a node failure occurs. Although the official Spark docs mention that when you use foreachBatch along with the epoch_id/batch_id there shouldn't be any duplicates during recovery from a node failure, I do find duplicates getting populated in my Snowflake tables. Link for reference: [Spark Structured Streaming ForEachBatch With Epoch Id][1].
2. I am very frequently encountering the errors (a) TransportClient: Failed to send RPC RPC 5782383376229127321 to /30.62.166.7:31116: java.nio.channels.ClosedChannelException and (b) TaskSchedulerImpl: Lost executor 1560 on 30.62.166.7: worker decommissioned: Worker Decommissioned on my Databricks cluster. No matter how many executors I allocate or how much executor memory I add, the cluster reaches its max worker limit and I receive one of the two errors, with duplicates being populated in my Snowflake table after it recovers.
Any solution or suggestion on either of the above points would be helpful.
Thanks in advance.
foreachBatch is by definition not idempotent: when the currently executing batch fails it is retried, and partial results may already have been written, which matches your observations. Idempotent writes in foreachBatch are only guaranteed for Delta Lake tables, not for all sink types (in some cases, like Cassandra, it can work as well). I'm not that familiar with Snowflake, but maybe you can implement something similar to what you would do with another database: write each batch into a temporary/staging table (overwriting it on every batch) and then merge from that staging table into the target table.
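Here is a rough sketch of what that staging-table approach could look like in the foreach_batch_function above. It assumes the Snowflake Spark connector's postactions option behaves as documented (a semicolon-separated list of SQL statements run after the load); the table, column, and key names are made up:

def foreach_batch_function(df, epoch_id):
    # Hypothetical MERGE from a staging table into the target; the key and
    # column names are placeholders for whatever makes a row unique in your data.
    merge_sql = """
        MERGE INTO TARGET_EVENTS t
        USING STG_EVENTS s
        ON t.event_id = s.event_id
        WHEN MATCHED THEN UPDATE SET t.payload = s.payload
        WHEN NOT MATCHED THEN INSERT (event_id, payload) VALUES (s.event_id, s.payload)
    """

    # Overwrite the staging table with this micro-batch; a retried batch just
    # overwrites the same data, so retries don't accumulate duplicate rows.
    # The connector's postactions option (assumed available here) then runs the
    # MERGE right after the load, making the overall write effectively idempotent.
    (df.write
        .format(SNOWFLAKE_SOURCE_NAME)
        .options(**sfOptions)
        .option("dbtable", "STG_EVENTS")      # hypothetical staging table
        .option("postactions", merge_sql)
        .mode("overwrite")
        .save())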
Regarding the 2nd issue: it looks like you're using an autoscaling cluster. In that case workers can be decommissioned because the cluster manager detects that the cluster isn't fully loaded. To avoid that, you can disable autoscaling and use a fixed-size cluster.
I have a Spark Structured Streaming job reading from Kafka that has task durations that vary greatly.
I don't know why this is the case, since the topic partitions are not skewed and I am using maxOffsetsPerTrigger on the readStream to cap the batch size. I think each executor should be getting the same amount of data.
Yet it is common for a stage to have a minimum task duration of 0.8s and maximum of 12s. In the Spark UI under Event Timeline I can see the green bars for Executor Computing Time show the variation.
Details of the job:
is running on Spark-Kubernetes
uses PySpark via Jupyter Notebook
reads from a Kafka topic with n partitions
creates n executors to match the topic partition number
sets maxOffsetsPerTrigger on the readStream
has enough memory and CPU
to isolate where the lag is happening, the output sink is noop but normally this would be a Kafka sink
How can I even out the task durations?
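For reference, here is a minimal PySpark sketch of the setup described above (brokers, topic, paths, and the offset cap are placeholders). The commented-out minPartitions option of the Kafka source is one knob that can split Kafka partitions into more, smaller Spark tasks:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-task-skew-check").getOrCreate()

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
    .option("subscribe", "events_topic")                 # placeholder topic
    .option("maxOffsetsPerTrigger", 100000)              # cap rows per micro-batch
    # .option("minPartitions", 64)  # optionally split Kafka partitions into
    #                               # more, smaller tasks to smooth durations
    .load())

# noop sink isolates source read and processing time from any sink latency
(df.writeStream
    .format("noop")
    .option("checkpointLocation", "/tmp/chk/noop-test")  # placeholder path
    .start())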
For every spark.streaming.blockInterval (say, 1 minute), receivers listen to the streaming sources for data. Suppose the current micro-batch is taking an unnaturally long time to complete (intentionally, say 20 minutes). During this micro-batch, would the receivers still listen to the streaming source and store the data in Spark memory?
The current pipeline runs in Azure Databricks using Spark Structured Streaming.
Can anyone help me understand this?
In the above scenario, Spark will continue to consume/pull data from Kafka, and micro-batches will continue to pile up, eventually causing out-of-memory (OOM) issues.
To avoid this scenario, enable the back pressure setting:
spark.streaming.backpressure.enabled=true
For more details on the Spark back pressure feature, see https://spark.apache.org/docs/latest/streaming-programming-guide.html
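A minimal sketch of enabling it when building the session (the app name is a placeholder). Note that these spark.streaming.* back pressure settings apply to the older DStream API; a Structured Streaming job would typically cap input with maxOffsetsPerTrigger (Kafka) or maxEventsPerTrigger (Event Hubs) instead:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("backpressure-example")                              # placeholder name
    .config("spark.streaming.backpressure.enabled", "true")       # enable back pressure
    .config("spark.streaming.backpressure.initialRate", "1000")   # optional initial rate cap
    .getOrCreate())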
I am using Spark Structured Streaming to get streaming data from Kafka. I need to aggregate various metrics (say, 6 metrics) and write them as Parquet files. I see a huge delay between metric 1 and metric 2: for example, if metric 1 was updated recently, metric 2 contains data that is one hour old. How do I improve this so the metrics are processed in parallel?
Also, I write Parquet files which should be read by another application. How do I constantly purge old Parquet data? Should I have a different application for that?
Dataset<Row> lines_topic = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .load();

Dataset<Row> data = lines_topic.select(functions.from_json(lines_topic.col("value"), schema).alias(topics));

// assign the aggregation to its own Dataset so the aggregated result is what gets written
Dataset<Row> counts = data.withWatermark(---).groupBy(----).count();

query = counts.writeStream()
        .format("parquet")
        .option("path", ---)
        .option("truncate", "false")
        .outputMode("append")
        .option("checkpointLocation", checkpointFile)
        .start();
Since each query runs independently of the others, you need to ensure you're giving each query enough resources to execute. What could be happening is that, with the default FIFO scheduler, all triggers run sequentially rather than in parallel.
As described here, you should set a FAIR scheduler on your SparkContext and then define a separate pool for each query.
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
Also, in terms of purging old Parquet files, you may want to partition the data and then periodically delete old partitions as needed (a sketch follows below). Otherwise you can't just delete rows if all the data is being written to the same output path.
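A rough PySpark sketch of that idea (the same applies to the Java API): partition the streamed output by a date column so old data can be dropped folder by folder. The column name, paths, and retention window are hypothetical:

import datetime
import shutil  # for local paths; on cloud storage use dbutils.fs.rm or the Hadoop FS API

# Write the aggregated stream partitioned by date so old data can be dropped per folder.
(counts.writeStream                                      # "counts" = the aggregated stream
    .format("parquet")
    .option("path", "/data/metrics")                     # placeholder output path
    .option("checkpointLocation", "/data/metrics_chk")   # placeholder checkpoint path
    .partitionBy("event_date")                           # hypothetical date column
    .outputMode("append")
    .start())

# From a separate scheduled job: drop one expired daily partition
# (in practice, loop over every partition older than the retention window).
expired = (datetime.date.today() - datetime.timedelta(days=7)).isoformat()
shutil.rmtree(f"/data/metrics/event_date={expired}", ignore_errors=True)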
Below is the scenario I would need suggestions on:
Scenario:
Data ingestion is done through Nifi into Hive tables.
Spark program would have to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from Nifi is a continuous stream, I would like the Spark jobs to run on the ingested data every 1 or 2 minutes.
Which is the best option to use?
Trigger spark-submit jobs every 1 min using a scheduler?
How do we reduce the over head and time lag in submitting the job recursively to the spark cluster? Is there a better way to run a single program recursively?
Run a Spark Streaming job?
Can a Spark Streaming job be triggered automatically every 1 minute to process the data from Hive? [Can Spark Streaming be triggered only on a time basis?]
Is there any other efficient mechanism to handle such scenario?
Thanks in Advance
If you need something that runs every minute, you are better off using Spark Streaming rather than batch.
You may also want to get the data directly from Kafka rather than from the Hive table, since that is faster.
As for your question of whether batch or streaming is better: you can think of Spark Streaming as a micro-batch process that runs every "batch interval".
Read this: https://spark.apache.org/docs/latest/streaming-programming-guide.html
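To make the time-based trigger concrete, here is a minimal Structured Streaming sketch (in PySpark; brokers, topic, and paths are placeholders) where a processing-time trigger fires a micro-batch every minute:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-minute-etl").getOrCreate()

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
    .option("subscribe", "ingest_topic")                 # placeholder topic
    .load())

# ... ETL transformations and joins would go here ...

# The processing-time trigger starts a new micro-batch every minute,
# i.e. the job is triggered purely on a time basis.
(raw.writeStream
    .format("parquet")
    .option("path", "/data/etl_output")                  # placeholder output path
    .option("checkpointLocation", "/data/etl_chk")       # placeholder checkpoint path
    .trigger(processingTime="1 minute")
    .start())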