How is a Spark Streaming application loaded and run? - apache-spark

Hi, I am new to Spark and Spark Streaming.
From the official documentation I could understand how to manipulate input data and save it.
The problem is that the Spark Streaming quick example confused me.
I know the job should get data from the DStream you have set up and do something with it, but since it runs 24/7, how will the application be loaded and run?
Will it run every n seconds, or will it run once at the beginning and then enter a [read-process-loop] cycle?
BTW, I am using Python, so I checked the Python code of that example. If it is the latter case, how does Spark's executor know which code snippet is the loop part?

Spark Streaming is actually micro-batch processing. That means each interval, which you can customize, a new batch is executed.
Look at the code of the example you mentioned:
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
You define a streaming context with a micro-batch interval of 1 second.
That means the subsequent code, which uses the streaming context,
lines = ssc.socketTextStream("localhost", 9999)
...
gets executed every second.
The streaming process is initially triggered by this line:
ssc.start() # Start the computation
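For completeness, here is a minimal sketch based on the official NetworkWordCount quick example. The transformations between creating the DStream and calling start() only define the processing pipeline; Spark then re-runs that pipeline on every 1-second batch until the context is stopped.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local context with two worker threads and a 1-second batch interval
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# These lines only *define* the pipeline; nothing runs yet
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.pprint()

ssc.start()             # Start executing the pipeline on every 1-second batch
ssc.awaitTermination()  # Block the driver until the streaming context is stopped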

Related

Spark Structured Streaming - How to ignore checkpoint?

I'm reading messages from a Kafka stream using micro-batching (readStream), processing them, and writing the results to another Kafka topic via writeStream. The job (streaming query) is designed to run "forever", processing micro-batches of 10 seconds (of processing time). The checkpointDirectory option is set, since Spark requires checkpointing.
However, when I try to submit another query with the same source stream (same topic etc.) but a possibly different processing algorithm, Spark finishes the previously running query and creates a new one with the same ID (so it starts from the very offset on which the previous job "finished").
How can I tell Spark that the second job is different from the first one, so there is no need to restore from the checkpoint (i.e. the intended behaviour is to create a completely new streaming query not connected to the previous one, and to keep the previous one running)?
You can make the two streaming queries independent by setting the checkpointLocation option in their respective writeStream calls. You should not set the checkpoint location centrally on the SparkSession.
That way, they can run independently and will not interfere with each other.
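A minimal sketch of that setup; the topic names, bootstrap servers, and checkpoint paths are illustrative, and the Kafka connector package is assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwoIndependentQueries").getOrCreate()

# Both queries read from the same source topic (illustrative names/servers)
source = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input-topic")
          .load())

# Each query gets its own checkpointLocation, so their offsets and state never collide
query1 = (source.writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("topic", "output-topic-1")
          .option("checkpointLocation", "/tmp/checkpoints/query1")
          .start())

query2 = (source.writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("topic", "output-topic-2")
          .option("checkpointLocation", "/tmp/checkpoints/query2")
          .start())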

Spark Streaming first job requires more time than following jobs

I noticed that when I start a Spark Streaming application, the first job takes more time than the following ones, even when there is no input data. I also noticed that the first job after input data arrives requires more processing time than the following ones. Is there a reason for this behavior?
Thank you

Processing Batch files coming every 2 minutes in system using spark Streaming

I am trying to process near-real-time batch CSV files using Spark Streaming. I am reading these files in batches of 100 files, doing some operations, and writing the results to output files. I am using spark.readStream and spark.writeStream to read and write the streaming files.
I am trying to find out how I can stop the Spark streaming job.
stream_df = spark.readStream.option("maxFilesPerTrigger", 10).csv(filepath_directory)
query = (stream_df.writeStream.format("parquet")
         .outputMode("append")
         .option("path", "output filepath")
         .start())
I faced an issue while the streaming job was running: for some reason my job fails or throws exceptions.
I tried try/except.
I am planning to stop the query in case of any exceptions and process the same code again.
I used query.stop(); is this the right way to stop a streaming job?
I read one post, but I am not sure whether it is for DStreams and how to use it in my PySpark code.
https://www.linkedin.com/pulse/how-shutdown-spark-streaming-job-gracefully-lan-jiang
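A minimal sketch of the try/except pattern described in the question; the restart loop, checkpoint path, and retry policy are illustrative assumptions, not a definitive shutdown recipe.
while True:
    query = (stream_df.writeStream.format("parquet")
             .outputMode("append")
             .option("path", "output filepath")
             .option("checkpointLocation", "/tmp/checkpoints/csv-job")  # illustrative path
             .start())
    try:
        query.awaitTermination()  # blocks until the query stops; raises if it failed
    except Exception as exc:
        print("Streaming query failed, stopping and retrying:", exc)
        query.stop()  # query.stop() is the supported way to stop a running streaming query
        # add a sleep/backoff or a retry limit here instead of looping forever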

Spark streaming with predefined ordering

I have two streaming dataframes, firstDataframe and secondDataframe. I want to stream firstDataframe completely, and only if that first stream finishes successfully would I like to stream the other dataframe.
For example, in the code below, I would like the first streaming action to execute completely and only then the second to begin:
firstDataframe.writeStream.format("console").start
secondDataframe.writeStream.format("console").start
Spark follows FIFO job scheduling by default. This means it gives priority to the first streaming job. However, if the first streaming job does not require all the available resources, it starts the second streaming job in parallel. I essentially want to avoid this parallelism. Is there a way to do this?
Reference: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
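One possible sketch in PySpark terms, assuming the first query eventually terminates (for example because you stop it or its source is finite): block on the first query before starting the second.
# Start the first query and block the driver until it terminates.
first_query = firstDataframe.writeStream.format("console").start()
first_query.awaitTermination()  # raises if the first query fails, so the second never starts

# Reached only after the first query has terminated successfully.
second_query = secondDataframe.writeStream.format("console").start()
second_query.awaitTermination()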

Can we change the unit of spark stream batch interval?

When we initialize a Spark Streaming context, we use code like:
ssc = StreamingContext(sc, 1)
The 1 here is the batch interval, meaning 1 second. The unit of the batch interval is time (seconds). But can we change the interval to something else, for example a number of files?
Say we have a folder into which files will arrive, but we do not know when. What we want is that as soon as there is a file, we process it, so the interval is not a specific time range; I would like it to be a number of files.
Can we do that?
That's not possible. Spark Streaming essentially executes batch jobs repeatedly at a given time interval. Additionally, all window operations are time-based as well, so the notion of time cannot be ignored in Spark Streaming.
In your case you would try to optimize the job for the lowest processing time possible and then just have several batches with 0 records when there are no new files available.
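A minimal sketch of that approach in PySpark, assuming files land in an illustrative directory such as /tmp/incoming: choose the shortest interval the job can sustain, and intervals with no new files simply produce empty batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "FileWatcher")
ssc = StreamingContext(sc, 1)  # 1-second batches; there is no file-count-based trigger

# Any file newly placed in the monitored directory is picked up in the next batch
files = ssc.textFileStream("/tmp/incoming")
files.count().pprint()  # prints 0 for batches with no new data

ssc.start()
ssc.awaitTermination()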
