Spark Structured Streaming Scalability and Duplicates Issue - apache-spark

I am using Spark Structured Streaming on a Databricks cluster to extract data from Azure Event Hub, process it, and write it to Snowflake using foreachBatch, with the epoch_id/batch_id passed to the foreach batch function.
My code looks something like below:
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(EVENT_HUB_CONNECTION_STRING)
ehConf['eventhubs.consumerGroup'] = consumergroup

# Read stream data from event hub
spark_df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
# Some transformations...

# Write to Snowflake
def foreach_batch_function(df, epoch_id):
    df.write \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", snowflake_table) \
        .mode('append') \
        .save()
processed_df.writeStream \
    .outputMode('append') \
    .trigger(processingTime='10 seconds') \
    .option("checkpointLocation", f"checkpoint/P1") \
    .foreachBatch(foreach_batch_function) \
    .start()
Currently I am facing 2 issues:
Duplicates after node failure: although the official Spark documentation says that using foreachBatch along with the epoch_id/batch_id should prevent duplicates during recovery from a node failure, I do find duplicates getting populated in my Snowflake tables. Link for reference: [Spark Structured Streaming ForEachBatch With Epoch Id][1].
I am very frequently encountering the errors a) TransportClient: Failed to send RPC RPC 5782383376229127321 to /30.62.166.7:31116: java.nio.channels.ClosedChannelException and b) TaskSchedulerImpl: Lost executor 1560 on 30.62.166.7: worker decommissioned: Worker Decommissioned on my Databricks cluster. No matter how many executors I allocate or how much executor memory I add, the cluster reaches its max worker limit and I receive one of the two errors, with duplicates being populated in my Snowflake table after recovery.
Any solution/ suggestion to any of the above points would be helpful.
Thanks in advance.

foreachBatch is by definition not idempotent: when the currently executing batch fails it is retried, and partial results could be observed, which matches your observations. Idempotent writes in foreachBatch are applicable only to Delta Lake tables, not to all sink types (in some cases, like Cassandra, it could work as well). I'm not so familiar with Snowflake, but maybe you can implement something similar to other databases - write the data into a temporary table (each batch does an overwrite) and then merge from that temporary table into the target table.
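A minimal PySpark sketch of that staging-table pattern, assuming the Snowflake Spark connector's Utils.runQuery helper is reachable through the JVM gateway; the staging/target table names and the ID/VALUE columns are hypothetical:
def foreach_batch_function(df, epoch_id):
    # 1) Overwrite a staging table; a retried batch simply overwrites it again,
    #    so partial results from a failed attempt are discarded
    df.write \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", "STAGING_TABLE") \
        .mode('overwrite') \
        .save()

    # 2) MERGE the staging table into the target on a business key, so replaying
    #    the same micro batch does not insert the same rows twice
    merge_sql = """
        MERGE INTO TARGET_TABLE t
        USING STAGING_TABLE s
        ON t.ID = s.ID
        WHEN MATCHED THEN UPDATE SET t.VALUE = s.VALUE
        WHEN NOT MATCHED THEN INSERT (ID, VALUE) VALUES (s.ID, s.VALUE)
    """
    sc._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, merge_sql)
The overwrite makes a retried batch harmless and the MERGE keeps the target free of duplicates, at the cost of an extra write per micro batch.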
Regarding the 2nd issue - it looks like you're using an autoscaling cluster; in this case workers can be decommissioned because the cluster manager detects that the cluster isn't fully loaded. To avoid that you can disable autoscaling and use a fixed-size cluster.

Related

Spark stuck on SynapseLoggingShim.scala while writing into delta table

I'm streaming data from Kafka and trying to merge ~30 million records into a Delta Lake table.
def do_the_merge(microBatchDF, partition):
    deltaTable.alias("target") \
        .merge(microBatchDF.alias("source"), "source.id1 = target.id2 and source.id = target.id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()
I see that Spark is stuck for almost an hour on a task named SynapseLoggingShim.
Once this stage completes, the write to the delta table actually starts and takes one more hour.
I'm trying to understand what this SynapseLoggingShim stage does.
Answering my own question: the SynapseLoggingShim scala task was simply waiting on the merge task to complete.
It's just an open telemetry wrapper to collect the metrics.
The problem is that we are bottlenecked by the source! The event hub we are reading from has 32 partitions, and Spark's parallelism is constrained by the event hub partitions.
In simple words, increasing the Spark cores doesn't decrease the time, because the source event hub limits the parallelism to the topic partition count.
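A small sketch to make this concrete (not part of the answer above; the repartition count and the single id join key are assumptions): each micro batch arrives with one Spark partition per source partition, so the read stage is capped at 32 tasks regardless of cluster size, while later shuffle stages can be widened, for example with an explicit repartition.
def do_the_merge(microBatchDF, batch_id):
    # With a 32-partition topic this typically prints 32: the read cannot use
    # more parallel tasks than the source has partitions.
    print(f"batch {batch_id}: {microBatchDF.rdd.getNumPartitions()} input partitions")

    # Repartitioning only widens the downstream shuffle/merge stages;
    # the read itself stays capped at the topic's partition count.
    source_df = microBatchDF.repartition(128, "id")

    deltaTable.alias("target") \
        .merge(source_df.alias("source"), "source.id = target.id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()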

Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata that is also in HDFS, both in Parquet.
My metadata sometimes gets updated and I need to join against the freshest version, which ideally means reading the metadata from HDFS on every stream micro batch.
I tried to test this, but unfortunately Spark reads the metadata once and (supposedly) caches the files, even if I try with spark.sql.parquet.cacheMetadata=false.
Is there a way to read it on every micro batch? ForeachWriter is not what I'm looking for.
Here's code examples:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ is updated by Spark in append mode.
As far as I understand, if the metadata were accessed through a JDBC source with Spark Structured Streaming, Spark would query it on each micro batch.
As far as I found, there are two options:
Create a temp view and refresh it on an interval:
metadata.createOrReplaceTempView("metadata")
and trigger the refresh in a separate thread:
spark.catalog.refreshTable("metadata")
NOTE: in this case Spark will only re-read the same path; it does not work if you need to read metadata from different folders on HDFS, e.g. paths with timestamps. See the sketch after this list.
Restart the stream on an interval, as Tathagata Das suggested.
This way is not suitable for me, since my metadata might be refreshed several times per hour.
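For reference, a minimal PySpark sketch of the first option; the view name, path, and refresh interval are assumptions:
import threading
import time

# Register the metadata as a temp view once
spark.read.parquet("/tmp/metadata/").createOrReplaceTempView("metadata")

def refresh_metadata(interval_seconds=60):
    # Periodically invalidate the cached file listing so the next micro batch
    # picks up files newly appended under /tmp/metadata/
    while True:
        time.sleep(interval_seconds)
        spark.catalog.refreshTable("metadata")

threading.Thread(target=refresh_metadata, daemon=True).start()

stream = spark.readStream.parquet("/tmp/streaming/")
joined = stream.join(spark.table("metadata"), ["id"])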

How does spark copy data between cassandra tables?

Can anyone please explain the internal workings of Spark when reading data from one table and writing it to another in Cassandra?
Here is my use case:
I am ingesting data coming in from an IoT platform into Cassandra through a Kafka topic. I have a small Python script that parses each message from Kafka to get the table name it belongs to, prepares a query, and writes it to Cassandra using Datastax's cassandra-driver for Python. With that script I am able to ingest around 300000 records per minute into Cassandra. However my incoming data rate is 510000 records per minute, so the Kafka consumer lag keeps increasing.
The Python script is already making concurrent calls to Cassandra. If I increase the number of Python executors, cassandra-driver starts failing because the Cassandra nodes become unavailable to it. I am assuming there is a limit on Cassandra calls per second that I am hitting there. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a pyspark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. Pyspark job completed in around 5 hours. So it processed over 550000 records per min.
Here is the pyspark code I am using:
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table=sourcetable, keyspace=sourcekeyspace) \
    .load().cache()

df.createOrReplaceTempView("data")

query = ("select dev_id, datetime, DATE_FORMAT(datetime, 'yyyy-MM-dd') as day, " + field + " as value from data")
vgDF = spark.sql(query)
vgDF.show(50)

vgDF.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table=newtable, keyspace=newkeyspace) \
    .save()
Versions:
Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11
Cluster:
Spark setup with 3 workers and 1 master node.
3 worker nodes also have a cassandra cluster installed. (each cassandra node with one spark worker node)
Each worker was allowed 10 GB ram and 3 cores.
So I am wondering:
Does Spark read all the data from Cassandra first and then write it to the new table, or is there some kind of optimization in the Spark Cassandra connector that allows it to move the data between Cassandra tables without reading all the records?
If I replace my Python script with a Spark streaming job in which I parse the packet to get the table name for Cassandra, will that help me ingest data more quickly into Cassandra?
The Spark connector is optimized because it parallelizes processing and reads/inserts data on the nodes that own the data. You may get better throughput by using the Cassandra Spark Connector, but this will require more resources.
Talking about your task - 300000 inserts/minute is 5000/second, and frankly speaking this is not a very big number - you can increase throughput by applying different optimizations (see the sketch after this list):
Use asynchronous calls to submit requests. You only need to make sure that you don't submit more requests than a single connection can handle (you can also increase that limit - I'm not sure how to do it in Python, but please check the Java driver docs to get an idea).
Use the correct consistency level (LOCAL_ONE should give you very good performance).
Use the correct load balancing policy.
Run several copies of your script in parallel, making sure that they are all in the same Kafka consumer group.
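A minimal Python sketch combining the first three suggestions with the Datastax driver; the keyspace, table, and column names are made up and the concurrency value is just a placeholder:
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token-aware + DC-aware routing sends each write to a replica that owns the data
cluster = Cluster(
    ["10.128.1.1", "10.128.1.3"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="datacenter1")),
)
session = cluster.connect("my_keyspace")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO sensor_data (dev_id, datetime, value) VALUES (?, ?, ?)"
)
insert.consistency_level = ConsistencyLevel.LOCAL_ONE

def write_batch(rows):
    # rows: iterable of (dev_id, datetime, value) tuples parsed from Kafka.
    # execute_concurrent_with_args keeps up to `concurrency` requests in flight
    # instead of waiting for each insert before sending the next one.
    results = execute_concurrent_with_args(session, insert, rows, concurrency=100)
    for ok, result in results:
        if not ok:
            print("insert failed:", result)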

Structured streaming performance and purging the parquet files

I am using Spark Structured Streaming to get streaming data from Kafka. I need to aggregate various metrics (say 6 metrics) and write them as parquet files. I do see a huge delay between metric 1 and metric 2: for example, if metric 1 was updated recently, metric 2's data is an hour old. How do I improve this so the queries work in parallel?
Also, I write parquet files which should be read by another application. How do I purge old parquet data constantly? Should I have a different application for it?
Dataset<String> lines_topic = spark.readStream().format("kafka").option("kafka.bootstrap.servers", bootstrapServers)

Dataset<Row> data = lines_topic.select(functions.from_json(lines_topic.col("value"), schema).alias(topics));

data.withWatermark(---).groupBy(----).count();

query = data.writeStream().format("parquet").option("path", ---).option("truncate", "false").outputMode("append").option("checkpointLocation", checkpointFile).start();
Since each query runs independently of the others, you need to ensure you're giving each query enough resources to execute. What could be happening is that, if you're using the default FIFO scheduler, all triggers run sequentially rather than in parallel.
Just as described here, you should set a FAIR scheduler on your SparkContext and then define new pools for each query.
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
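For completeness, a hedged PySpark sketch of how FAIR mode itself is usually enabled (the allocation file path and pool definitions are assumptions):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallel-streaming-queries")
    # Switch the scheduler from the default FIFO to FAIR
    .config("spark.scheduler.mode", "FAIR")
    # Optional: a fairscheduler.xml that defines pool1/pool2 and their weights
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .getOrCreate()
)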
Also, in terms of purging old parquet files: you may want to partition the data and then periodically delete old partitions as needed. Otherwise you can't just delete rows when all the data is being written to the same output path.
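A rough PySpark sketch of that partition-and-purge idea; the output path, the event_date column, and the retention policy are all assumptions:
# Write the aggregated stream partitioned by date so old data can be dropped per directory
query = (
    aggregated_df.writeStream
    .format("parquet")
    .option("path", "/data/metric1")              # hypothetical output path
    .option("checkpointLocation", "/chk/metric1")
    .partitionBy("event_date")                    # assumes an event_date column exists upstream
    .outputMode("append")
    .start()
)

# A separate housekeeping job can then delete partition directories older than the
# retention window, e.g. /data/metric1/event_date=2023-01-01, without touching the
# files the downstream application is still reading.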

SparkStreaming keep processing even no data in kafka

I'm using Spark Streaming to consume data from Kafka, with a code snippet like:
stream.foreachRDD { rdd => rdd.foreachPartition { ... } }
I'm using foreachPartition because I need to create a connection to HBase, and I don't want to open/close a connection for each record.
But I found that when there is no data in Kafka, Spark Streaming still runs foreachRDD and foreachPartition.
This causes many HBase connections to be created even though no data was consumed. I really don't like this; how should I make Spark stop doing this when no data was consumed from Kafka?
Simply check that there are items in the RDD. So your code could be:
stream.foreachRDD { rdd => if (!rdd.isEmpty()) rdd.foreachPartition { ... } }
