When using netcat to process logs in spark streaming, it is dropping last few lines? - apache-spark

I have a spark-streaming service, where I am processing and detecting anomalies on the basis of some offline generated model. I feed data into this service from a log file, which is streamed using the following command
tail -f <logfile>| nc -lk 9999
Here the spark streaming service is taking data from port 9999. However, I observe that the last few lines are being dropped, i.e. spark streaming does not receive those log lines or they are not processed.
However, I also observed that if I simply take the logfile as standard input instead of tailing it, no lines are dropped:
nc -q 10 -lk 9999 < logfile
Can anyone explain why this behavior is happening? And what could be a better resolution to the problem of streaming log data to spark streaming instance?

In Spark Streaming, data comes in over the wire, and constitutes a block on every block interval. This block is replicated on other machines (according to your storage level as soon as formed. Once a batch interval elapses, each block formed since the last batch interval tick forms part of a new RDD. It is once you have formed this RDD that you can schedule a job, so the data collected during the batch interval n is then processed during batch interval n+1.
So, the possible culprits for "losing a bit of data towards the end" could be:
you are observing your input file at the same time as you are monitoring the input for Spark. If you consider your monitoring at instant t, a bit after n batch intervals have elapsed, your log file has produced the data for n batches and then some ("a little bit more"). Except, the beginning of the next batch (n+1) is at this stage in the data collection phase, in the form of blocks on your Receiver. No data has been lost, the processing of batch n+1 has simply not started yet.
or your application assumes it's receiving a similar number of elements in each RDD and does not process the potentially (much) smaller last batch's RDD correctly.
or you're stopping your application or data before the last batch interval elapses (you need to wait n+1 batch intervals to see the processing of n batches of data).
or there is something weird occurring with the system clock of your executors. Have you thought of synchronizing them with ntp ?

Related

Does Spark configs spark.streaming.receiver.maxRate has any effect in a Kafka Beam pipeline

I was wondering if somebody has any experience with rate limiting in Beam KafkaIO component when the runner is a SparkRunner. The versions I am using are:Beam 2.29, Spark 3.2.0 and Kafka client 2.5.0?
I have the Beam parameter maxRecordsPerBatch set to a large number, 100000000. But even when the pipeline stops for 45 minutes, this value is never hit. But when there is a high burst of data above the normal, the Kafka lag increases till it eventually catches up. In the SparkUI I see that parameter batchIntervalMillis=300000 (5 min) is not reached, batches take a maximum of 3 min. It looks like the KafkaIO stops reading at some point, even when the lag is very large. My Kafka parameters --fetchMaxWaitMs=1000
--maxPollRecords=5000 should be able to bring plenty of data. Specially because KafkaIO creates one consumer per partition. In my system there are multiple topics with a total of 992 partitions and my spark.default.parallelism=600. Some partitions have very little data, while others have a large number. Topics are per region and when a region goes down the data is sent through another region/topic. That is when the lag happens.
Does the configuration values for spark.streaming.receiver.maxRate and spark.streaming.receiver.maxRatePerPartition plus spark.streaming.backpressure.enabled play any role at all?
For what I have seen, it looks like Beam controls the whole reading from Kafka with the operator KafkaIO. This component creates its own consumers, therefore the rate of the consumer can only be set by using consumer configs which include fetchMaxWaitMs and maxPollRecords.
The only way those Spark parameters could have any effect if in the rest of the pipeline after the IO source. But I am not sure.
So I finally figure out how it all works. First I want to state that the Spark configuration values: spark.streaming.receiver.maxRate, spark.streaming.receiver.maxRatePerPartition, spark.streaming.backpressure.enabled do not play a factor in Beam because they only work if you are using the source operators from Spark to read from Kafka. Since Beam has its own operator KafkaIO they do not play a role.
So Beam has a set of parameters defined in the class SparkPipelineOptions that are used in the SparkRunner to setup reading from Kafka. Those parameters are:
#Description("Minimum time to spend on read, for each micro-batch.")
#Default.Long(200)
Long getMinReadTimeMillis();
#Description(
"A value between 0-1 to describe the percentage of a micro-batch dedicated to reading from UnboundedSource.")
#Default.Double(0.1)
Double getReadTimePercentage();
Beam create a SourceDStream object that it will pass to spark to use as a source to read from Kafka. In this class the method boundReadDuration returns the result of calculating the larger of two reading values:
proportionalDuration and lowerBoundDuration. The first one is calculated by multiplying BatchIntervalMillis from readTimePercentage. The second is just the value in mills from minReadTimeMillis. Below is the code from SourceDStream. The time duration returned from this function will be used to read from Kafka alone the rest of the time will be allocated to the other tasks in the pipeline.
Last but no least the following parameter also control how many records are process during a batch maxRecordsPerBatch. The pipeline would not process more than those records in a single batch.
private Duration boundReadDuration(double readTimePercentage, long minReadTimeMillis) {
long batchDurationMillis = ssc().graph().batchDuration().milliseconds();
Duration proportionalDuration = new Duration(Math.round(batchDurationMillis * readTimePercentage));
Duration lowerBoundDuration = new Duration(minReadTimeMillis);
Duration readDuration = proportionalDuration.isLongerThan(lowerBoundDuration) ? proportionalDuration: lowerBoundDuration;
LOG.info("Read duration set to: " + readDuration);
return readDuration;
}

Spark structured streaming asynchronous batch blocking

I’m using Apache Spark structured streaming for reading from Kafka. Sometimes my micro batches get processed in a greater time than specified, due to heavy writes IO operations. I was wondering if there’s an option of starting the next batch before the first one has finished, but make the second batch blocked by the first?
I mean that if the first one took 7 seconds and the batch is set for 5 seconds, then start the second batch on the fifth second. But if the second batch finishes block it so it won’t write before it’s previous batch (because of the will to keep the correct messages order).
No. Next batch only starts if previous completed. I think you mean term interval. It would become a mess otherwise.
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers

Spark Streaming Execution Flow

I am a newbie to Spark Streaming and I have some doubts regarding the same like
Do we need always more than one executor or with one we can do our job
I am pulling data from kafka using createDirectStream which is receiver less method and batch duration is one minute , so is my data is received for one batch and then processed during other batch duration or it is simultaneously processed
If it is processed simultaneously then how is it assured that my processing is finished in the batch duration
How to use the that web UI to monitor and debugging
Do we need always more than one executor or with one we can do our job
It depends :). If you have a very small volume of traffic coming in, it could very well be that one machine code suffice in terms of load. In terms of fault tolerance that might not be a very good idea, since a single executor could crash and make your entire stream fault.
I am pulling data from kafka using createDirectStream which is
receiver less method and batch duration is one minute , so is my data
is received for one batch and then processed during other batch
duration or it is simultaneously processed
Your data is read once per minute, processed, and only upon the completion of the entire job will it continue to the next. As long as your batch processing time is less than one minute, there shouldn't be a problem. If processing takes more than a minute, you will start to accumulate delays.
If it is processed simultaneously then how is it assured that my
processing is finished in the batch duration?
As long as you don't set spark.streaming.concurrentJobs to more than 1, a single streaming graph will be executed, one at a time.
How to use the that web UI to monitor and debugging
This question is generally too broad for SO. I suggest starting with the Streaming tab that gets created once you submit your application, and start diving into each batch details and continuing from there.
To add a bit more on monitoring
How to use the that web UI to monitor and debugging
Monitor your application in the Streaming tab on localhost:4040, the main metrics to look for are Processing Time and Scheduling Delay. Have a look at the offical doc : http://spark.apache.org/docs/latest/streaming-programming-guide.html#monitoring-applications
batch duration is one minute
Your batch duration a bit long, try to adjust it with lower values to improve your latency. 4 seconds can be a good start.
Also it's a good idea to monitor these metrics on Graphite and set alerts. Have a look at this post https://stackoverflow.com/a/29983398/3535853

Spark Streaming Kafka backpressure

We have a Spark Streaming application, it reads data from a Kafka queue in receiver and does some transformation and output to HDFS. The batch interval is 1min, we have already tuned the backpressure and spark.streaming.receiver.maxRate parameters, so it works fine most of the time.
But we still have one problem. When HDFS is totally down, the batch job will hang for a long time (let us say the HDFS is not working for 4 hours, and the job will hang for 4 hours), but the receiver does not know that the job is not finished, so it is still receiving data for the next 4 hours. This causes OOM exception, and the whole application is down, we lost a lot of data.
So, my question is: is it possible to let the receiver know the job is not finishing so it will receive less (or even no) data, and when the job finished, it will start receiving more data to catch up. In the above condition, when HDFS is down, the receiver will read less data from Kafka and block generated in the next 4 hours is really small, the receiver and the whole application is not down, after the HDFS is ok, the receiver will read more data and start catching up.
You can enable back pressure by setting the property spark.streaming.backpressure.enabled=true. This will dynamically modify your batch sizes and will avoid situations where you get an OOM from queue build up. It has a few parameters:
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good but I simulated the response of the algorithm to various parameters here

What is spark.streaming.receiver.maxRate? How does it work with batch interval

I am working with spark 1.5.2. I understand what a batch interval is, essentially the interval after which the processing part should start on the data received from the receiver.
But I do not understand what is spark.streaming.receiver.maxRate. From some research it is apparently an important parameter.
Lets consider a scenario. my batch interval is set to 60s. And spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load. What would happen? Will the additional 60*1000 records be dropped? Or would the processing happen twice during that batch interval?
Property spark.streaming.receiver.maxRate applies to number of records per second.
The receiver max rate is applied when receiving data from the stream - that means even before batch interval applies. In other words you will never get more records per second than set in spark.streaming.receiver.maxRate. The additional records will just "stay" in the stream (e.g. Kafka, network buffer, ...) and get processed in the next batch.

Resources