Databricks streaming from DELTA to KAFKA keeps showing "Stream initializing..." - azure

I used Autoloader to implement an ingestion process from the RAW layer (plain parquet files) into a BRONZE layer (DELTA). I receive between 1 and 2,000 files per minute, roughly 10 million records per hour in total, since some files contain just one record while others contain several thousand each. This process works well with a 4+1 cluster of DS3V2 virtual machines on Azure.
Now I need to read from the same bronze layer and send the data to a Kafka queue. So I provisioned an Event Hubs queue and implemented the following code:
bootstrap_servers = f"{eh_namespace_name}.servicebus.windows.net:9093"

from pyspark.sql.functions import struct, to_json

df_for_kafka_bronze_streaming_wt = (spark
    .readStream
    .format("delta")
    .load(bronze_root_wt)
)

bronze_write_eh_sasl_wt = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule' \
    + f' required username="$ConnectionString" password="{bronze_write_connection_string_wt}";'

bronze_write_kafka_options_wt = {
    "kafka.bootstrap.servers": bootstrap_servers,
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.jaas.config": bronze_write_eh_sasl_wt,
    "topic": bronze_topic_wt,
}

(
    df_for_kafka_bronze_streaming_wt
    .select(to_json(struct("*")).alias("value"))
    .writeStream
    .trigger(processingTime="60 seconds")
    .format("kafka")
    .option("checkpointLocation", bronze_checkpoint_kafka_wt)
    .outputMode("append")
    .options(**bronze_write_kafka_options_wt)
    .start()
)
It seems to work: about 15 minutes (not less) after I run the command above, the queue starts receiving messages. The problem is that, even after 5 hours, the command above keeps saying "Stream initializing...", and sometimes after a few more hours it stops with a generic error or an "out of memory" error, even though the cluster logs don't report any issues.
The export from bronze to kafka is run on another identical cluster (4+1 DS3V2, dedicated to this task).
I suppose my bronze DataFrame is being output in the wrong way, but I need some help troubleshooting this error.
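The thread does not include a resolution. One commonly suggested mitigation, not from the original post, is to cap how much the Delta source reads per micro-batch (for example with the Delta streaming source option maxFilesPerTrigger), so that the first batch after a (re)start doesn't try to plan the entire existing Bronze table at once. A minimal sketch, reusing the variable names from the question and an illustrative limit:

# Hedged sketch, not from the original post: limit each micro-batch from the
# Delta source so the first batch after a restart stays small.
# maxFilesPerTrigger is a Delta streaming source option; 1000 is illustrative.
df_for_kafka_bronze_streaming_wt = (spark
    .readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1000)
    .load(bronze_root_wt)
)

The writeStream above can stay the same; each 60-second trigger then ships a bounded slice of the table instead of everything that has accumulated so far.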

Related

How to generate one hour length parquet files with Spark Structured Streaming (Python)?

I want to generate parquet files every hour with all the information received during that hour for further processing using Spark NLP.
I have streaming data coming from Kafka, and when I writeStream with the trigger processing time set to one hour, it just generates many small parquet files. I've read that setting coalesce to 1 together with the trigger would generate one big parquet file, but it still gives me many small parquet files.
I've also read that one can set a minimum number of rows, but the amount of rows I receive changes from hour to hour.
This is how I'm writing my stream:
df.writeStream \
    .format("parquet") \
    .option("checkpointLocation", "s3a://datalake-twitter-app/spark_checkpoints/") \
    .option("path", "s3a://datalake-twitter-app/raw_datalake/") \
    .trigger(processingTime='60 minutes') \
    .start()
Any idea how I can write one-hour parquet files with all the information received during the given hour using Spark Structured Streaming? Maybe I should use something different?
I also thought it might be better to handle this in the other Spark process that reads the files from the S3 bucket, and just read the files from the previous hour there.
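The question is left open in the thread. One approach that is sometimes suggested (an assumption here, not from the original post, and it requires Spark 2.4+) is to coalesce each hourly micro-batch to a single partition inside foreachBatch, so every trigger produces one parquet file. A sketch using the S3 paths from the question:

# Hedged sketch: write each hourly micro-batch as a single parquet file.
# batch_df is the micro-batch DataFrame, epoch_id the batch id Spark passes in.
def write_hourly_batch(batch_df, epoch_id):
    (batch_df
        .coalesce(1)
        .write
        .mode("append")
        .parquet("s3a://datalake-twitter-app/raw_datalake/"))

(df.writeStream
    .foreachBatch(write_hourly_batch)
    .option("checkpointLocation", "s3a://datalake-twitter-app/spark_checkpoints/")
    .trigger(processingTime="60 minutes")
    .start())

Coalescing to 1 funnels the write through a single task, so this only makes sense if one hour of data comfortably fits on a single executor.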

Restarting a PySpark job doesn't get the records which were inserted into Kafka Topic while the pyspark consumer is down

I am running a PySpark job and the data is streaming from Kafka.
I am trying to replicate a scenario on my Windows system to find out what happens when the consumer goes down while data is continuously being fed into Kafka.
Here is what I expect:
The producer is started and produces messages 1, 2 and 3.
The consumer is online and consumes messages 1, 2 and 3.
Now the consumer goes down for some reason while the producer produces messages 4, 5, 6 and so on.
When the consumer comes back up, I expect it to resume where it left off, i.e. read messages 4, 5, 6 and so on.
My PySpark application is not able to achieve what I expect. Here is how I created the stream:
(session.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickapijson")
    .option("startingOffsets", "latest")
    .load())
I googled and gathered quite a bit of information. It seems like the groupID is relevant here. Kafka keeps track of the offsets read by each consumer in a particular groupID. If a consumer subscribes to a topic with a groupID, say G1, Kafka registers this group and consumer ID and keeps track of them. If the consumer goes down for some reason and restarts with the same groupID, Kafka has the information about the offsets already read, so the consumer reads the data from where it left off.
This is exactly what happens when I use the following command to invoke the consumer job from the CLI.
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic "clickapijson" --consumer-property group.id=test
Now when my producer produces messages 1, 2 and 3, the consumer is able to consume them. I killed the running consumer job (CLI .bat file) after the 3rd message was read. My producer then produced messages 4, 5, 6 and so on.
Now I bring back my consumer job (CLI .bat file) and it is able to read the data from where it left off (from message 4). This behaves as I expect.
I am unable to do the same thing in PySpark.
When I include the option ("group.id", "test"), it throws an error saying Kafka option group.id is not supported as user-specified consumer groups are not used to track offsets.
Upon observing the console output, I see that each time my PySpark consumer job is kicked off, it creates a new groupID. If my PySpark job previously ran with a groupID and failed, it does not pick up that same groupID when restarted; it randomly gets a new one. Kafka has the offset information for the previous groupID but not for the newly generated one. Hence my PySpark application is not able to read the data fed into Kafka while it was down.
If this is the case, won't I lose data when the consumer job goes down due to some failure?
How can I give my own group ID to the PySpark application, or how can I restart my PySpark application with the same old group ID?
In the current Spark version (2.4.5) it is not possible to provide your own group.id, as it gets created automatically by Spark (as you already observed). The full details on offset management when Spark reads from Kafka are given here and summarised below:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
group.id: Kafka source will create a unique group id for each query automatically.
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
enable.auto.commit: Kafka source doesn’t commit any offset.
For Spark to be able to remember where it left off reading from Kafka, you need to have checkpointing enabled and provide a path location to store the checkpointing files. In Python this would look like:
aggDF \
    .writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "path/to/HDFS/dir") \
    .format("memory") \
    .start()
More details on checkpointing are given in the Spark docs on Recovering from Failures with Checkpointing.
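Putting the pieces together, a minimal end-to-end sketch of the pattern this answer describes (no group.id, startingOffsets only for the first run, and a checkpoint directory so a restart resumes from Spark's own recorded offsets) might look like the following; the console sink and the checkpoint path are placeholders:

# Hedged sketch: topic and bootstrap server are taken from the question;
# the checkpoint path and console sink are illustrative only.
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickapijson")
    .option("startingOffsets", "latest")   # only used when no checkpoint exists yet
    .load())

(stream_df.writeStream
    .format("console")
    .option("checkpointLocation", "path/to/HDFS/dir")
    .start())

On restart with the same checkpointLocation, Spark ignores startingOffsets and continues from the offsets stored in the checkpoint, which is the resume-from-message-4 behaviour the question asks for.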

Spark structured streaming maxOffsetsPerTrigger does not seem to work

I had an issue with a Spark Structured Streaming (SSS) application that crashed due to a program bug and did not process anything over the weekend. When I restarted it, there were many messages on the topics to reprocess (about 250'000 messages on each of 3 topics, which need to be joined).
On restart, the application crashed again with an OutOfMemory exception. I learned from the docs that the maxOffsetsPerTrigger configuration on the read stream is supposed to help in exactly those cases. I changed the PySpark code (running on SSS 2.4.3, btw) to something like the following for all 3 topics:
rawstream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topicName)
    .option("maxOffsetsPerTrigger", 10000)
    .option("startingOffsets", "earliest")
    .load())
My expectation would be that the SSS query would now load ~33'000 offsets from each of the topics and join them in the first batch. Then in the second batch it would clean up the state records from the first batch which are subject to expiration due to the watermark (which would clean up most of the records from the first batch) and read another ~33k from each topic. So after ~8 batches it should have processed the lag with a "reasonable" amount of memory.
But the application still kept crashing with OOM, and when I checked the DAG in the application master UI, it reported that it again tried to read all 250'000 messages.
Is there something more that I need to configure? How can I check that this option is really used? (When I check the plan, it is unfortunately truncated and just shows (Options: [includeTimestamp=true,subscribe=IN2,inferSchema=true,failOnDataLoss=false,kafka.b...); I couldn't find out how to show the part after the dots.)

How to set optimal config values - trigger time, maxOffsetsPerTrigger - for Spark Structured Streaming while reading messages from Kafka?

I have a Structured Streaming Application reading messages from Kafka. The total count of messages per day is approximately 18 Billion with peak message count per minute = 12,500,000.
The Max message size is 2 KB.
How do I make sure my Structured Streaming app is able to handle this much volume and velocity of data? Basically, I just want to know how to set the optimal trigger time, maxOffsetsPerTrigger, or any other config which makes the job proceed smoothly, and is able to handle failures and restarts.
You can run a Spark Structured Streaming application in either fixed-interval micro-batch mode or continuous mode. Here are some of the options you can use for tuning streaming applications:
Kafka Configurations:
Number of partitions in Kafka:
You can increase the number of partitions in Kafka so that more consumers can read data simultaneously. Set this to an appropriate number based on the input rate and the number of bootstrap servers.
Spark Streaming Configurations:
Driver and executor memory configuration:
Calculate the size of data (#records * size of each message) in each batch and set the memory accordingly. For example, at the stated peak of 12,500,000 messages per minute and 2 KB per message, a one-minute batch is roughly 25 GB of raw input.
Number of executors:
Set the number of executors to the number of partitions in the Kafka topic; this increases parallelism, i.e. the number of tasks that read data simultaneously.
Limit number of offsets:
Use maxOffsetsPerTrigger to set a rate limit on the maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topicName")
  .option("startingOffsets", "latest")
  .option("maxOffsetsPerTrigger", "1000000")
  .load()
Recovering from Failures with Checkpointing:
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query and continue where it left off. This is done using checkpointing and write-ahead logs.
finalDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()
Trigger:
The trigger settings of a streaming query define the timing of streaming data processing: whether the query is executed as a micro-batch query with a fixed batch interval or as a continuous processing query.
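For illustration, here is how those trigger modes look in PySpark; the DataFrame name and console sink are placeholders, and continuous processing (still marked experimental) only works with supported sources and sinks:

# Hedged sketch of the trigger options described above.
query = (df.writeStream
    .format("console")
    .trigger(processingTime="1 minute")     # fixed-interval micro-batches
    # .trigger(once=True)                   # process available data once, then stop
    # .trigger(continuous="1 second")       # continuous processing, experimental
    .start())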

Structured Streaming 2.1.0 stream to Parquet creates many small files

I am running Structured Streaming (2.1.0) on a 3-node YARN cluster, streaming JSON records to parquet. My code fragment looks like this:
val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .start("/data/kafka_streaming.parquet")
I notice it quickly creates thousands of small files for only 1,000 records. I suspect it has to do with the frequency of the trigger, so I changed it:
val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .trigger(ProcessingTime("60 seconds"))
  .start("/data/kafka_streaming.parquet")
The difference is very obvious. Now I see a much smaller number of files created for the same number of records.
My question: Is there any way to have low latency for the trigger and still keep a smaller number of larger output files?
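One pattern that is sometimes suggested for this trade-off (an assumption here, not part of the original thread) is to keep the short trigger for latency and periodically compact the small files with a separate batch job. A PySpark sketch, where the compacted output path and the partition count are illustrative:

# Hedged sketch: periodic compaction job, run outside the streaming query.
# The compacted output path and the partition count (8) are placeholders.
small_files_df = spark.read.parquet("/data/kafka_streaming.parquet")

(small_files_df
    .coalesce(8)
    .write
    .mode("overwrite")
    .parquet("/data/kafka_streaming_compacted.parquet"))

Downstream readers would then point at the compacted path, while the streaming query keeps writing small files with low latency.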
