I have a Spark batch job that talks to Cassandra. After the batch job completes, I need to verify a few entries in Cassandra, and this cycle repeats two or three times. How do I know when the batch job ends? I don't want to track the status of the batch job by adding an entry to the database.
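A minimal sketch of one way to do this, assuming the job is launched with spark-submit in client deploy mode (where the command blocks until the driver exits) and that the DataStax Python driver is available; the script name, keyspace, and table below are hypothetical:

```python
import subprocess
from cassandra.cluster import Cluster  # DataStax Python driver, assumed installed

def run_batch_and_verify():
    # In client deploy mode, spark-submit blocks until the driver finishes,
    # so a zero exit code signals that the batch job has ended.
    result = subprocess.run(
        ["spark-submit", "--deploy-mode", "client", "my_batch_job.py"])  # hypothetical job script
    assert result.returncode == 0, "batch job failed"

    # Verify the entries the batch job was supposed to write (hypothetical keyspace/table).
    session = Cluster(["127.0.0.1"]).connect("my_keyspace")
    count = session.execute("SELECT count(*) FROM my_table").one()[0]
    assert count > 0, "expected entries were not written"

# The run-then-verify cycle repeats two or three times.
for _ in range(3):
    run_batch_and_verify()
```

If the job is submitted in cluster mode instead, the exit code no longer reflects completion, so you would poll the application state (for example via the YARN CLI or the SparkLauncher API) rather than relying on it.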
How do I write a functional test in Spark?
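For the functional-test part, a common pattern is to run the same transformations against a local SparkSession inside a test framework. A minimal sketch with pytest; the transformation under test is a stand-in for your own logic:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A local master keeps the test self-contained; no cluster is needed.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("functional-test")
               .getOrCreate())
    yield session
    session.stop()

def test_filter_keeps_only_matching_rows(spark):
    # Exercise the same transformation the batch job applies.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    result = df.filter(df.id > 1).collect()
    assert len(result) == 1
    assert result[0].value == "b"
```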
I have monthly directories of Parquet files (~10 TB per directory). Files are atomically written to the current directory every minute or so. When we get to a new month, a new directory is created and data is written there. Once data is written, it cannot be moved.
I can easily run batch queries on this data using Spark (batch mode), and I can just as easily run Spark streaming queries.
I am wondering how I can reconcile the two modes: batch and streaming.
For example, let's say I run a batch query on the data, get the results, and do something with them. I can then checkpoint this DataFrame. Now let's say I want to start a streaming job that only processes new files relative to what was processed in the batch job, i.e. only files not processed in the batch job should now be processed.
Is this possible with Spark streaming? If I start a Spark streaming job and use the same checkpoint that the batch job used, will it proceed the way I want?
Or, with the batch job, do I need to keep track of what files were processed and then somehow pass this to Spark streaming so it knows not to process them?
This seems like a pretty common problem, so I am asking here to see what some other big data software developers have done.
I apologize for not having any code to post in this question, but I hope that my explanation is all it takes for someone to see a potential solution. If needed, I can come up with some snippets.
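One sketch of how this is often approached (my own assumption, not something stated in the question): instead of a separate batch job, run a Structured Streaming query with a one-shot trigger, so the file source's checkpoint records which files have been processed and later runs pick up only new ones. Paths below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monthly-parquet").getOrCreate()

# File sources need an explicit schema for readStream; borrow it from a batch read.
schema = spark.read.parquet("/data/2019-06").schema  # hypothetical monthly directory

stream_df = spark.readStream.schema(schema).parquet("/data/2019-06")

# trigger(once=True) processes everything available and stops, like a batch job,
# but the checkpoint remembers every file it has seen. Rerunning the same query
# with the same checkpointLocation processes only files added since the last run.
query = (stream_df.writeStream
         .format("parquet")
         .option("path", "/output/2019-06")                  # hypothetical sink
         .option("checkpointLocation", "/checkpoints/2019-06")
         .trigger(once=True)
         .start())
query.awaitTermination()
```

As far as I know, a plain batch query does not write a file-source checkpoint, so the streaming job cannot reuse one; the trigger-once pattern sidesteps that by making the "batch" run itself a streaming run.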
I noticed that when I start a Spark Streaming application, the first job takes more time than the following ones, even when there is no input data. I also noticed that the first job after input data arrives takes longer to process than the following ones. Is there a reason for this behavior?
Thank you.
I am trying to optimize a Spark Streaming application which collects data from a Kafka cluster, processes it and saves the results to various database tables. The Jobs tab in the Spark UI shows the duration of each job as well as the time it was submitted.
I would expect that for a specific batch, a job starts processing when the previous job is done. However, in the attached screenshot, the "Submitted" time of a job is not right after the previous job finishes. For example, job 1188 has a duration of 1 second and it was submitted at 12:02:12. I would expect that the next job would be submitted one second later, or at least close to it, but instead it was submitted six seconds later.
Any ideas on how this delay can be explained? These jobs belong to the same batch and are done sequentially. I know that there is some scheduling delay between jobs and tasks, but I would not expect it to be that large. Moreover, the Event Timeline of a Stage does not show large Scheduling Delay.
I am using PySpark in standalone mode.
Below is the scenario I need suggestions on:
Scenario:
Data ingestion is done through NiFi into Hive tables.
The Spark program has to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from NiFi is a continuous stream, I would like the Spark jobs to run every 1 or 2 minutes on the ingested data.
Which is the best option to use?
1. Trigger spark-submit jobs every minute using a scheduler?
   How do we reduce the overhead and time lag of repeatedly submitting the job to the Spark cluster? Is there a better way to run a single program repeatedly?
2. Run a Spark Streaming job?
   Can a Spark Streaming job be triggered automatically every minute to process the data from Hive? [Can Spark Streaming be triggered purely on a time basis?]
3. Is there any other efficient mechanism to handle such a scenario?
Thanks in advance.
If you need something that runs every minute, you are better off using Spark Streaming rather than batch.
You may also want to read the data directly from Kafka rather than from the Hive table, since it is faster.
As for your question of which is better, batch or streaming: you can think of Spark Streaming as a micro-batch process that runs on every "batch interval".
Read this: https://spark.apache.org/docs/latest/streaming-programming-guide.html
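As a rough illustration of the micro-batch idea, here is a sketch under my own assumptions: Spark 2.4+, the spark-sql-kafka connector on the classpath, the records also available on a Kafka topic, and hypothetical broker, topic, and table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nifi-etl").enableHiveSupport().getOrCreate()

# Read the same stream NiFi ingests, directly from Kafka (hypothetical broker/topic).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "ingest-topic")
          .load())

parsed = events.selectExpr("CAST(value AS STRING) AS payload")
# ... ETL and joins against Hive tables would go here ...

def save_batch(batch_df, batch_id):
    # Each micro-batch is appended to a Hive table (hypothetical name).
    batch_df.write.mode("append").saveAsTable("etl_results")

# A single long-running job fires every minute; no repeated spark-submit overhead.
query = (parsed.writeStream
         .foreachBatch(save_batch)
         .option("checkpointLocation", "/checkpoints/nifi-etl")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()
```

The processingTime trigger is exactly the time-based firing asked about: the query stays resident and only the micro-batches recur, which avoids the per-submission startup cost of scheduled spark-submit runs.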
My question is this:
I use Spark Streaming to read data from Kafka with the direct stream API, process the RDDs, and then update the ZooKeeper offsets manually.
The data read from Kafka is inserted into a Hive table.
Now I have run into a problem.
Sometimes the Hive metastore process exits for some reason (there is currently only a single metastore instance).
Some batch jobs fail because of this, and the Spark streaming job doesn't exit; it just logs some warnings.
Then, when I restart the Hive metastore process, the program carries on and the new batch jobs succeed.
But I find that the data read from Kafka by the failed batches is missing.
I can see this from the metadata in the job details.
Imagine that each batch job reads 20 offsets from Kafka: the batch 1 job reads offsets 1-20, and the batch 2 job reads offsets 21-40.
If batch 1 fails and batch 2 succeeds, the failed batch's data is lost.
How can I handle this? How can I rerun the failed batch job?
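A minimal sketch of one way to avoid losing the failed range, assuming the DStream direct API (pyspark.streaming.kafka, Spark 1.x/2.x); write_to_hive and save_offsets_to_zk below are hypothetical stand-ins for your own insert and ZooKeeper-commit logic:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-to-hive")
ssc = StreamingContext(sc, 60)  # 60-second batch interval

# In a real job, fromOffsets would be loaded from ZooKeeper here (omitted).
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker:9092"})  # hypothetical topic/broker

def process(rdd):
    if rdd.isEmpty():
        return
    offset_ranges = rdd.offsetRanges()     # exact offsets covered by this batch
    try:
        write_to_hive(rdd)                 # hypothetical: your Hive insert logic
        save_offsets_to_zk(offset_ranges)  # hypothetical: commit ONLY after success
    except Exception:
        # Offsets were not committed, so restarting from the saved offsets
        # replays exactly this range. Re-raise (or stop the context) so that
        # later batches cannot commit offsets past the failed gap.
        raise

stream.foreachRDD(process)
ssc.start()
ssc.awaitTermination()
```

The key point is that offsets advance only after a batch succeeds; if batch 2 were allowed to commit while batch 1 had failed, the gap would still be lost, which is why a failure should stop the stream before the next commit.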