Can we change the unit of spark stream batch interval? - apache-spark

When we initialize a Spark Streaming context, we use code like:
ssc = StreamingContext(sc, 1)
The 1 here is the batch interval, which means 1 second. The unit of the batch interval is time (seconds). But can we change the interval to something else, for example a number of files?
Say we have a folder into which files arrive, but we do not know when. What we want is to process each file as soon as it appears, so the interval would not be a specific time range; I would like it to be a number of files.
Can we do that?

That's not possible. Spark Streaming essentially executes batch jobs repeatedly in a given time interval. Additionally, all window operations are time-based as well, so the notion of time cannot be ignored in Spark Streaming.
In your case you would try to optimize the job for the lowest processing time possible and then simply have a number of batches with 0 records when no new files are available.
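A minimal sketch of that approach with a file-based DStream; the directory path, the 1-second interval, and the processing function are placeholders, not from the question:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "FileWatcher")
# Keep the batch interval as small as the achievable processing time allows.
ssc = StreamingContext(sc, 1)

# textFileStream picks up files that appear in the directory after the stream
# starts; intervals with no new files simply produce empty batches, which are cheap.
lines = ssc.textFileStream("/data/incoming")

def process(time, rdd):
    if not rdd.isEmpty():
        # Placeholder processing: count the lines of the newly arrived files.
        print("batch %s: %d lines" % (time, rdd.count()))

lines.foreachRDD(process)

ssc.start()
ssc.awaitTermination()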

Related

Increase the output size of Spark Structured Streaming job

Context: I have a Spark Structured Streaming job with Kafka as the source and S3 as the sink. The outputs in S3 are in turn picked up as input by other MapReduce jobs.
I therefore want to increase the size of the output files on S3 so that the MapReduce jobs work efficiently.
Currently, because of the small input sizes, the MapReduce jobs are taking way too long to complete.
Is there a way to configure the streaming job to wait for at least 'X' number of records to process?
You probably want the micro-batch trigger to wait until sufficient data is available at the source. You can use the minOffsetsPerTrigger option to wait for sufficient data to be available in Kafka.
Make sure to set a sufficient maxTriggerDelay as per your application's needs.
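A minimal sketch of those two options on a Kafka source; the broker address, topic, S3 paths, and threshold values are made up for illustration (the options themselves require a newer Spark 3.x release, as noted in the next answer):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      # Wait until at least this many offsets are available before firing a trigger...
      .option("minOffsetsPerTrigger", 100000)
      # ...but never wait longer than this, even if the minimum has not been reached.
      .option("maxTriggerDelay", "15m")
      .load())

query = (df.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/output/")
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
         .start())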
No, there is not, in reality.
No for Spark prior to 3.x.
Yes and no for Spark 3.x, which effectively equates to no.
minOffsetsPerTrigger was introduced, but it has a catch, as described below. That means the overall answer still remains no.
From the manuals:
Minimum number of offsets to be processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. Note, if the maxTriggerDelay is exceeded, a trigger will be fired even if the number of available offsets doesn't reach minOffsetsPerTrigger.

How to configure backpressure in Spark 3 Structured Streaming with a Kafka/File source and the Trigger.Once option

In Spark 3 the behavior of the backpressure options for the Kafka and File sources in the Trigger.Once scenario was changed.
But I have a question.
How can I configure backpressure for my job when I want to use Trigger.Once?
In Spark 2.4 I have a use case: backfill some data and then start the stream.
So I use Trigger.Once, but my backfill scenario can be very large and sometimes puts too much load on my disks because of shuffles, and on driver memory because the FileIndex is cached there.
So I use maxOffsetsPerTrigger and maxFilesPerTrigger to control how much data Spark can process; that's how I configure backpressure.
And now this ability has been removed, so can someone suggest a new way to go?
Trigger.Once ignores these options right now (in Spark 3), so it will always read everything on the first load.
You can work around that. For example, you can start the stream with a periodic trigger, with some value like 1 hour, not call .awaitTermination, and instead run a parallel loop that checks whether the first batch is done and then stops the stream. Or you can set it to continuous mode, check whether batches have 0 rows, and then terminate the stream. After that initial load you can switch the stream back to Trigger.Once.
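A rough sketch of the first workaround (periodic trigger plus a polling loop instead of awaitTermination); the Kafka and S3 details, the rate limit, and the polling interval are placeholders, not from the original answer:
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("backfill-then-stream").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      # Rate limiting is honoured with a periodic trigger, unlike with Trigger.Once.
      .option("maxOffsetsPerTrigger", 100000)
      .load())

query = (df.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/backfill/")
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/backfill/")
         .trigger(processingTime="1 hour")  # periodic trigger instead of Trigger.Once
         .start())

# Instead of query.awaitTermination(), poll until the first micro-batch has
# completed, then stop the stream; afterwards the job can be restarted with
# Trigger.Once against the same checkpoint.
while not query.recentProgress:
    time.sleep(10)
query.stop()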

Apache Spark Delay Between Jobs

As you can see, my small application has 4 jobs, which run for a total duration of 20.2 seconds; however, there is a big delay between jobs 1 and 2, causing the total time to be over a minute. Job number 1, runJob at SparkHadoopMapReduceWriter.scala:88, is performing a bulk upload of HFiles into an HBase table. Here is the code I used to load the files:
// Directory where the HFiles were written by the Spark job
val outputDir = new Path(HBaseUtils.getHFilesStorageLocation(resolvedTableName))
// MapReduce job configuration used only to carry the HFile output settings
val job = Job.getInstance(hBaseConf)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, resolvedTableName)
job.setOutputFormatClass(classOf[HFileOutputFormat2])
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
// HBase connection, admin and table handles needed for the bulk load
val connection = ConnectionFactory.createConnection(job.getConfiguration)
val hBaseAdmin = connection.getAdmin
val table = TableName.valueOf(Bytes.toBytes(resolvedTableName))
val tab = connection.getTable(table).asInstanceOf[HTable]
val bulkLoader = new LoadIncrementalHFiles(job.getConfiguration)
preBulkUploadCallback.map(callback => callback())
// Move the prepared HFiles into the HBase regions
bulkLoader.doBulkLoad(outputDir, hBaseAdmin, tab, tab.getRegionLocator)
If anyone has any ideas, I would be very grateful.
I can see there are 26 tasks in job 1, which is based on the number of HFiles created. Even though job 2 shows as completed in 2s, it takes some time to copy these files to the target location, and that's why you are getting a delay between jobs 2 and 3. This can be avoided by reducing the number of tasks in job 1.
Decrease the number of regions for the output table in HBase, which will reduce the number of tasks for your second job.
TableOutputFormat determines the splits based on the number of regions of a given table in HBase.
Job number 1, runJob at SparkHadoopMapReduceWriter.scala:88, is performing a bulk upload
This is not quite true. This job merely creates the HFiles outside of HBase. The gap you see between this job and the next one could be explained by the actual bulk loading at bulkLoader.doBulkLoad. That operation involves only a metadata transfer and usually performs faster (in my experience), so you should check the driver logs to see where it hangs.
Thanks for your input, guys. I lowered the number of HFiles created in task 0, which decreased the lag by about 20%. I used
HFileOutputFormat2.configureIncrementalLoad(job, tab, tab.getRegionLocator)
which automatically calculates the number of reduce tasks to match the current number of regions for the table. I will say that we are using HBase backed by S3 in AWS EMR instead of classic HDFS. I am going to investigate now whether this could be contributing to the lag.

Spark streaming : batch processing time slowly increase

I use Spark with the Cassandra Spark connector and direct Kafka.
And I see the batch processing time increasing slowly over time.
Even when there is nothing incoming from Kafka to process.
It is only a few milliseconds per batch, but after a long time a batch can take several more seconds, until it reaches the batch interval and finally crashes.
At first I thought it was a memory leak, but in that case I would expect the processing time to grow exponentially rather than roughly linearly.
I don't really know whether it is the stages that become longer and longer, or the latency between stages that increases.
I use Spark 1.4.0.
Any pointers about this?
EDIT:
I took a closer look at the evolution of the processing time of each batch, comparing it against the total job processing time.
It appears that even though the batch processing time increases, the job processing times are not increasing.
Example: for a batch that takes 7s, the sum of the individual job processing times is 1.5s (as shown in the image below).
Is it because the computation time on the driver side increases, rather than the computation time on the executor side?
And is this driver computation time not shown in the job processing UI?
If that is the case, how can I correct it?
I finally found the solution to my problem.
I had this code in the function that adds the filters and transformations to my RDD:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
Because this is called for each batch, the same objects end up registered in TypeConverter over and over again.
I don't know exactly how the Cassandra Spark converter works, but it looks like it performs reflection internally over these objects.
Doing that increasingly slow reflection on every batch makes the total processing time of each batch grow; registering the converters once, outside the per-batch code, fixed the problem.

How is a Spark Streaming application loaded and run?

Hi, I am new to Spark and Spark Streaming.
From the official documentation I could understand how to manipulate input data and save it.
The problem is that the Spark Streaming quick example confused me.
I know the job should get data from the DStream you have set up and do something with it, but since it runs 24/7, how will the application be loaded and run?
Will it run every n seconds, or just run once at the beginning and then enter a [read-process-loop] cycle?
BTW, I am using Python, so I checked the Python code of that example. If it is the latter case, how does Spark's executor know which code snippet is the loop part?
Spark Streaming is actually micro-batch processing. That means a new batch is executed each interval, which you can customize.
Look at the code of the example you mentioned:
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
You define a streaming context with a micro-batch interval of 1 second.
Then the subsequent code, which uses the streaming context,
lines = ssc.socketTextStream("localhost", 9999)
...
gets executed every second.
The streaming process is initially triggered by this line:
ssc.start() # Start the computation
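Putting this together, a minimal sketch of the full structure, following the NetworkWordCount quick example: the transformations are only defined once, and the streaming scheduler then re-runs them against every 1-second batch after start().
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# These transformations are only *defined* here; they form a plan that the
# streaming scheduler re-executes against each 1-second batch of data.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()             # start the batch scheduler
ssc.awaitTermination()  # block the driver so the 24/7 loop keeps running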
