Processing Batch files coming every 2 minutes in system using spark Streaming - apache-spark

I am trying to process my near-real-time batch CSV files using Spark Structured Streaming. I read these files in batches of 100 files, do some operations, and write the results to output files. I use spark.readStream to read the streaming files and the DataFrame's writeStream to write them.
I am trying to find out how I can stop the streaming query.
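# `schema` is assumed to be a StructType defined elsewhere; streaming file sources need one.
stream_df = spark.readStream.schema(schema).option("maxFilesPerTrigger", 10).csv(filepath_directory)
# "checkpoint filepath" is a placeholder; file sinks require a checkpoint location.
query = (stream_df.writeStream.format("parquet").outputMode("append")
         .option("path", "output filepath").option("checkpointLocation", "checkpoint filepath")
         .start())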
The issue I face is that, while the streaming job is running, it sometimes fails or throws exceptions for reasons I have not identified.
I tried wrapping the code in try/except.
I am planning to stop the query whenever an exception occurs and then run the same code again.
I used query.stop(). Is this the right way to stop a streaming job?
I read one post, but I am not sure whether it applies only to DStreams, or how to run its code from my PySpark code.
https://www.linkedin.com/pulse/how-shutdown-spark-streaming-job-gracefully-lan-jiang
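The stop-and-retry idea described above could be sketched roughly as follows. This is only a sketch: `run_with_restart`, `start_query`, `max_restarts` and `backoff_seconds` are hypothetical names, and the only calls assumed on the query handle are `awaitTermination()` and `stop()`, which a Structured Streaming query handle provides.

```python
import time


def run_with_restart(start_query, max_restarts=3, backoff_seconds=0.0):
    """Start a streaming query and restart it if it terminates with an error.

    start_query is a hypothetical zero-argument callable that builds and
    starts the query (its chain ends in .start()) and returns the handle.
    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        query = start_query()
        try:
            query.awaitTermination()  # blocks; re-raises the query's failure
            return restarts           # the query ended without an error
        except Exception:
            query.stop()              # stop the failed query before retrying
            if restarts >= max_restarts:
                raise                 # give up and surface the last error
            restarts += 1
            time.sleep(backoff_seconds)
```

With PySpark, `start_query` would be a function that returns the result of the `stream_df.writeStream ... .start()` chain. If every restart uses the same checkpoint location, the restarted query should pick up from the last committed batch rather than from the beginning.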

Related

How to get spark streaming to continue where spark batch left off

I have monthly directories of parquet files (~10TB each directory). Files are being atomically written to this directory every minute or so. When we get to a new month, a new directory is created and data is written there. Once data is written, it cannot be moved.
I easily run batch queries on this data using spark (batch mode). I can also easily run spark streaming queries.
I am wondering how I can reconcile the two modes: batch and stream.
For example: Let's say I run a batch query on the data. I get the results of the query and do something with them. I can then checkpoint this dataframe. Now let's say I want to start a streaming job to only process new files relative to what was processed in the batch job, i.e. only files not processed in the batch job should now be processed.
Is this possible with spark streaming? If I start a spark streaming job and use the same checkpoint that the batch job used, will it proceed as I want it to?
Or, with the batch job, do I need to keep track of which files were processed and then somehow pass this to spark streaming so that it knows not to process them?
This seems like a pretty common problem, so I am asking here to see what some other big data software developers have done.
I apologize for not having any code to post in this question, but I hope that my explanation is all it takes for someone to see a potential solution. If needed, I can come up with some snippets.

How to implement exactly-once processing when reading from directory using Spark Structured Streaming?

I'd like to use the concept of stream processing to read files from a local directory and then publish to Apache Kafka. I thought about using Spark Structured Streaming.
How does checkpointing behave when the streaming fails after reading 50 lines of a file? Will it start from the 51st line of the file when started next time, or will it read from the start of the file again?
Also, will we have any issues with checkpointing in Structured Streaming when there is an upgrade or a change in the code?
when the streaming fails after reading 50 lines of a file. Will it start from 51st line of the file when started next time Or will it again read from the start of the file.
Either the entire file is fully processed or none at all. That's how FileFormat works in general in Spark SQL and has little to do with Spark Structured Streaming in particular (since they share the underlying execution infrastructure).
In short, the engine will "again read from the start of the file."
That is also to say that there is no concept of a single line while processing files in Spark Structured Streaming. You process a streaming DataFrame that is an entire file (or even a couple of files) all at once, and whether you want to process the dataset line by line or in its entirety is up to you, a Spark developer.
Also, will we have any issues if we use checkpointing in structured streaming when there is any upgrade or any change in the code.
In theory, you should not. The purpose of the new checkpointing mechanism in Spark Structured Streaming (as compared to the legacy Spark Streaming's) was to allow for restarts and upgrades in a more comfortable way. The checkpointing uses just a little information (usually stored in JSON files) to restart processing from the point of the last successful checkpoint.
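To make the "entire file or none at all" behaviour concrete, here is a toy model in plain Python. It is not how Spark implements checkpointing; `process_new_files`, `handle_file` and the JSON checkpoint layout are made up for illustration. The only point it shows is that progress is recorded per file, so a failure partway through a file means that file is read again from its first line.

```python
import json
import os


def process_new_files(input_dir, checkpoint_path, handle_file):
    """Process each unseen file as a whole; record it only after success."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))      # names of fully processed files
    for name in sorted(os.listdir(input_dir)):
        if name in done:
            continue                      # already committed, skip it
        with open(os.path.join(input_dir, name)) as f:
            # handle_file may raise midway; the file is then NOT recorded
            handle_file(name, f.read().splitlines())
        done.add(name)
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)    # commit only after the whole file
```

In this sketch the checkpoint file is assumed to live outside `input_dir`. If `handle_file` fails on line 50 of a file, the next run hands it the same file again starting from line 1, because nothing was committed for that file.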

Alternate to recursively Running Spark-submit jobs

Below is the scenario I would need suggestions on,
Scenario:
Data ingestion is done through Nifi into Hive tables.
Spark program would have to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from Nifi is continuous streaming, I would like the Spark jobs to run every 1 or 2 mins on the ingested data.
Which is the best option to use?
Trigger spark-submit jobs every 1 min using a scheduler?
How do we reduce the overhead and time lag of repeatedly submitting the job to the spark cluster? Is there a better way to run a single program repeatedly?
Run a spark streaming job?
Can spark-streaming job get triggered automatically every 1 min and process the data from hive? [Can Spark-Streaming be triggered only time based?]
Is there any other efficient mechanism to handle such scenario?
Thanks in Advance
If you need something that runs every minute, you are better off using Spark Streaming rather than batch.
You may want to get the data directly from Kafka and not from a Hive table, since it is faster.
As for your question of which is better, batch or stream: you can think of Spark Streaming as a micro-batch process that runs every "batch interval".
Read this : https://spark.apache.org/docs/latest/streaming-programming-guide.html

Using Spark in while loop to process log files

I have a server that generates log files every second, and I want to process these files using Apache Spark.
I wrote a Spark application in Python that processes a group of log files in a while loop.
I stop the SparkContext in each iteration and start it again for the next step.
My question is: what is the best approach for this kind of application, which runs indefinitely and processes batches or groups of generated files? Should I use an infinite while loop, run my code as a cron job, or use a scheduling framework like Airflow?
The best possible way to solve this is to use "Spark Streaming". Spark Streaming enables you to process live data streams. Spark Streaming currently works with Kafka, Flume, HDFS, S3, Amazon Kinesis and Twitter. Hence, you should first insert these logs into Kafka and then write a Spark Streaming program which processes the live stream of logs. This is a cleaner solution than using infinite loops and starting and stopping the SparkContext multiple times.

How a Spark Streaming application be loaded and run?

Hi, I am new to Spark and Spark Streaming.
From the official documentation I could understand how to manipulate input data and save it.
The problem is that the quick example in the Spark Streaming guide confused me.
I know the job should get data from the DStream you have set up and do something with it, but since it is running 24/7, how will the application be loaded and run?
Will it run every n seconds, or just run once at the beginning and then enter a [read-process-loop] cycle?
BTW, I am using Python, so I checked the Python code of that example. If it is the latter case, how does Spark's executor know which code snippet is the loop part?
Spark Streaming is actually micro-batch processing. That means a new batch is executed at each interval, and you can customize the interval.
Look at the code of the example you mentioned:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
This defines a streaming context with a micro-batch interval of 1 second.
The code that follows uses the streaming context to declare the processing
lines = ssc.socketTextStream("localhost", 9999)
...
and that processing gets executed every second.
The streaming process is initially triggered by this line:
ssc.start() # Start the computation
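As a rough illustration of "declared once, executed every interval", here is a toy scheduler in plain Python. It is not the Spark API: `ToyStreamingContext` and `foreach_batch` are invented names, and the batches are a fixed list instead of live data arriving on a timer. It only shows that the setup lines run a single time and merely register the work, while `start()` is what runs the registered work once per batch.

```python
class ToyStreamingContext:
    """Toy stand-in for a micro-batch scheduler; NOT the Spark API."""

    def __init__(self, batches):
        self._batches = batches      # pre-made input, one list per interval
        self._steps = []

    def foreach_batch(self, fn):
        self._steps.append(fn)       # only records the function, runs nothing

    def start(self):
        for batch in self._batches:  # one iteration per batch interval
            for fn in self._steps:
                fn(batch)


word_counts = []
ctx = ToyStreamingContext([["a b", "c"], ["d e f"]])
# This line runs once and only declares what to do with each batch.
ctx.foreach_batch(
    lambda lines: word_counts.append(sum(len(l.split()) for l in lines))
)
counts_before_start = list(word_counts)  # still empty: nothing has run yet
ctx.start()                              # now the step runs once per batch
```

After `start()` returns, `word_counts` holds one entry per batch, which mirrors how the declared DStream operations are applied to every micro-batch.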
