Triggering spark job from UI - apache-spark

Requirement:
Trigger a Spark job from the UI on a user action (say, a submit button click).
Once the Spark job is finished, a summary of its status has to be displayed in the UI.
Design approach:
1. Once the user initiates a job run by clicking the submit button in the UI, we will insert a row into an Impala queue table using Impala JDBC (a minimal sketch of this insert follows below).
The simplified structure of the queue table is as follows:
JOB_RUN_QUEUE (REQUEST_ID, STATUS, INPUT_PARAM_1, INPUT_PARAM_2, SUMMARY)
The initial request will have STATUS='SUBMIT'.
2. Oozie will be configured to orchestrate the request handling and Spark job execution.
Once Oozie finds an entry in the queue table JOB_RUN_QUEUE with STATUS='SUBMIT', it will pull the arguments from the queue table and trigger the Spark job.
It will update the status in the queue table to 'IN PROGRESS'. Upon successful completion it will update the summary and status in the queue table.
On failure it will update the status to 'FAILURE'.
3. The UI will read the data from the queue table and display it.
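A minimal sketch of the "submit" insert from step 1, assuming the Cloudera Impala JDBC driver is on the classpath and the table matches the JOB_RUN_QUEUE structure above; the connection URL and helper name are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class JobRunQueueClient {

        // Hypothetical connection string; adjust host/port/auth for your cluster.
        private static final String IMPALA_URL = "jdbc:impala://impala-host:21050/default";

        public static void submitJobRequest(String requestId, String param1, String param2) throws Exception {
            String sql = "INSERT INTO JOB_RUN_QUEUE "
                       + "(REQUEST_ID, STATUS, INPUT_PARAM_1, INPUT_PARAM_2, SUMMARY) "
                       + "VALUES (?, 'SUBMIT', ?, ?, NULL)";
            try (Connection conn = DriverManager.getConnection(IMPALA_URL);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, requestId);
                ps.setString(2, param1);
                ps.setString(3, param2);
                ps.executeUpdate();
            }
        }
    }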
Questions:
1. Is there an alternative and better design approach?
2. Do I need to have a queue mechanism for the initial request, or can I leverage some inbuilt functionality?

Related

How to conditionally poll messages from Kafka Topic

I have a few task notifications in a MongoDB database. Each task has a due_date and a reminder flag. I am pushing these tasks to a Kafka topic. There is a Node JS app that polls from this topic and pushes notifications to a frontend app based on the due_date and reminder flag. The due_date could be past-dated or upcoming.
From Kafka, we need to send notifications to the listening Node app whenever these time-based conditions occur:
Reminder = true and it is X time before the Due Date
Due Date = now
The Task still exists and is Past Due
How can this be done with Kafka?
The DB-to-Kafka interaction should be via a source connector. DB connectors can publish events to Kafka whenever there is a change in the underlying table, i.e. when new rows are created or any column is updated.
So the ideal solution would be to introduce some more columns in the table, or a new utility table, with columns that identify the conditions you mentioned above, maybe a boolean column like "IsDueDate". Create a scheduler in the DB (I'm not sure about Mongo, but most DBs have an option for this) or in any batch system (like a Spring Batch/Boot app) to validate your data and populate these columns.
Once these columns are updated, the connector will publish a message to Kafka; your front-end app polls Kafka for new messages and can use these flags in the payload to identify which condition triggered the message and act accordingly in the front end.
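A rough sketch of the batch-side scheduler idea, using the MongoDB Java driver; the flag fields (isReminderDue, isPastDue) and the one-hour reminder window are assumptions, not part of the original design:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Date;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import static com.mongodb.client.model.Filters.*;
    import static com.mongodb.client.model.Updates.set;

    public class TaskFlagScheduler {

        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> tasks =
                    client.getDatabase("notifications").getCollection("tasks");

            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                Date now = new Date();
                long windowMs = TimeUnit.HOURS.toMillis(1); // "X time before the due date"

                // Reminder = true and we are within X of the due date.
                tasks.updateMany(
                        and(eq("reminder", true),
                            gt("due_date", now),
                            lte("due_date", new Date(now.getTime() + windowMs)),
                            ne("isReminderDue", true)),
                        set("isReminderDue", true));

                // The task still exists and is past due (also covers "due date = now" once it passes).
                tasks.updateMany(
                        and(lt("due_date", now), ne("isPastDue", true)),
                        set("isPastDue", true));
            }, 0, 1, TimeUnit.MINUTES);
        }
    }

Each update flips a flag column, which the source connector then emits as a change event for the Node app to consume.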

Email alert for spark streaming delay

We have Spark jobs that load data from Kafka into a Hive database. Sometimes our streaming jobs get too much data or hang, causing a delay in live streaming.
We are able to see the active processes and the pending processes in the queue in the Spark UI.
I want to consolidate this information and send an email alert in case of any delay.
Thanks
You can use the GitHub package below for Spark email alerts:
https://github.com/NikhilSuthar/Scala-Spark-Mail
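If you would rather roll your own, here is a rough sketch of the idea using a StreamingListener that checks the batch scheduling delay and mails out when it crosses a threshold. It assumes a Scala 2.12+ build of Spark (so the trait's other callbacks keep their no-op defaults), the javax.mail dependency, and placeholder SMTP host/addresses:

    import org.apache.spark.streaming.scheduler.StreamingListener;
    import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted;

    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Transport;
    import javax.mail.internet.InternetAddress;
    import javax.mail.internet.MimeMessage;
    import java.util.Properties;

    public class DelayAlertListener implements StreamingListener {

        private static final long MAX_DELAY_MS = 60_000; // alert threshold, tune as needed

        @Override
        public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
            // Scheduling delay grows when the job falls behind the batch interval.
            scala.Option<Object> delayOpt = batchCompleted.batchInfo().schedulingDelay();
            long delayMs = delayOpt.isDefined() ? (Long) delayOpt.get() : 0L;
            if (delayMs > MAX_DELAY_MS) {
                sendMail("Streaming delay alert", "Scheduling delay is " + delayMs + " ms");
            }
        }

        private void sendMail(String subject, String body) {
            try {
                Properties props = new Properties();
                props.put("mail.smtp.host", "smtp.example.com"); // placeholder SMTP host
                MimeMessage msg = new MimeMessage(Session.getInstance(props));
                msg.setFrom(new InternetAddress("alerts@example.com"));
                msg.setRecipient(Message.RecipientType.TO, new InternetAddress("oncall@example.com"));
                msg.setSubject(subject);
                msg.setText(body);
                Transport.send(msg);
            } catch (Exception e) {
                e.printStackTrace(); // never let alerting failures kill the streaming job
            }
        }
    }

Register it once with jssc.addStreamingListener(new DelayAlertListener()); before starting the context.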

Hazelcast Jet pipeline created on an app with multiple instances causing problem

I have an app in which I have created a Jet instance and a pipeline job to aggregate the result of streaming data. I am running multiple instances of this app.
The problem I am facing is that since there are 2 instances, 2 pipeline jobs run, and hence the result is computed twice and is incorrect, even though both Jet instances figure out that they are part of the same cluster.
Does the Jet pipeline not check the pipeline job and, if it is the same, consider it as one, just like Kafka Streams does with its topology?
Job submission in Jet 0.7 is to the entire cluster. If you submit the same Pipeline/DAG twice, the job will execute twice.
The upcoming version adds a newJobIfAbsent() method: if the job has a name, it will only submit the job if there is no active job with an equal name. If there already is a job with an equal name, it will return a Job handle to the already-running job.
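A minimal sketch of that usage, assuming a Jet version that ships newJobIfAbsent() and a pipeline built elsewhere in the app; the job name is illustrative:

    import com.hazelcast.jet.JetInstance;
    import com.hazelcast.jet.Job;
    import com.hazelcast.jet.config.JobConfig;
    import com.hazelcast.jet.pipeline.Pipeline;

    public class SingleJobSubmit {

        public static Job submitOnce(JetInstance jet, Pipeline pipeline) {
            JobConfig config = new JobConfig();
            config.setName("streaming-aggregation"); // the name is what deduplicates submissions

            // Every app instance can call this at startup: only the first submission
            // actually starts the job; the rest get a handle to the running job.
            return jet.newJobIfAbsent(pipeline, config);
        }
    }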

Spring Batch design advice for processing 50k files

We have more than 50k files coming in every day that need to be processed. For that we have developed POC apps with a design like this:
1. A polling app continuously picks up files from the FTP zone.
2. It validates each file and creates metadata in a DB table.
3. Another poller picks 10-20 files from the DB (only file id and status) and delivers them to slave apps as messages.
4. A slave app takes a message and launches a Spring Batch job, which reads the data, does business validation in processors, and writes the validated data to the DB/another file.
We used Spring Integration and Spring Batch for this POC.
Is it a good idea to launch a Spring Batch job in the slaves, or to implement the read, process and write logic directly as plain Java or Spring bean objects?
I need some insight on launching this job, where each slave can have 10-25 MDPs (Spring message-driven POJOs) and each of these MDPs launches a job.
Note: each file will have at most 30-40 thousand records.
Generally, using Spring Integration and Spring Batch for such tasks is a good idea. This is what they are intended for.
With regard to Spring Batch, you get the whole retry, skip and restart handling out of the box. Moreover, you have all these readers and writers that are optimised for bulk operations. This works very well, and you only have to concentrate on writing the appropriate mappers and similar glue code.
If you use plain Java or Spring bean objects instead, you will probably end up developing such infrastructure code yourself... including all the effort needed for testing and so on.
Concerning your design:
Besides validating and creating the metadata entry, you could consider loading the entries directly into a database table. This would give you better "transactional" control if something fails. Your load job could look something like this (a condensed Spring Batch sketch follows below):
step 1:
a tasklet creates an entry in the metadata table, with columns like
FILE_TO_PROCESS: XY.txt
STATE: START_LOADING
DATE: ...
ATTEMPT: ... (first attempt)
step 2:
read and validate each line of the file and store it in a data table, with columns like
DATA: ...
STATE: ...
FK_META_TABLE: foreign key to the meta table
step 3:
update the metadata entry:
STATE: LOAD_COMPLETED
So, as soon as your metadata entry reaches the state LOAD_COMPLETED, you know that all entries of the file have been validated and are ready for further processing.
If something fails, you can just fix the file and reload it.
Then, to process further, you could have jobs that poll periodically and check whether there is new data in the database that should be processed. If more than one file was loaded during the last period, simply process all files that are ready.
You could even have several slave processes polling from time to time. Just do a read-for-update on the state of the metadata table, or use an optimistic locking approach, to prevent several slaves from trying to process the same entries.
With this solution, you don't need a messaging infrastructure and you can still scale the whole application without any problems.
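A condensed sketch of the three-step load job described above, in Spring Batch Java configuration; the bean names, chunk size and table names are illustrative, and the reader/writer beans are assumed to be defined elsewhere:

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    @EnableBatchProcessing
    public class FileLoadJobConfig {

        @Bean
        public Job loadFileJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                               ItemReader<String> fileReader, ItemWriter<String> dataTableWriter) {
            // step 1: tasklet inserts the metadata row (FILE_TO_PROCESS, STATE=START_LOADING, ...)
            Step createMetadata = steps.get("createMetadata")
                    .tasklet((contribution, chunkContext) -> {
                        // INSERT INTO META_TABLE (...) VALUES (..., 'START_LOADING', ...)
                        return RepeatStatus.FINISHED;
                    })
                    .build();

            // step 2: chunk-oriented read/validate/write of file lines into the data table
            Step loadData = steps.get("loadData")
                    .<String, String>chunk(1000)
                    .reader(fileReader)      // e.g. a FlatFileItemReader over the file
                    .writer(dataTableWriter) // e.g. a JdbcBatchItemWriter into the DATA table
                    .build();

            // step 3: tasklet flips the metadata row to LOAD_COMPLETED
            Step markCompleted = steps.get("markCompleted")
                    .tasklet((contribution, chunkContext) -> {
                        // UPDATE META_TABLE SET STATE = 'LOAD_COMPLETED' WHERE ...
                        return RepeatStatus.FINISHED;
                    })
                    .build();

            return jobs.get("loadFileJob")
                    .start(createMetadata)
                    .next(loadData)
                    .next(markCompleted)
                    .build();
        }
    }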

Is it possible to implement a reliable receiver which supports non-graceful shutdown?

I'm curious whether it is an absolute must that a Spark Streaming application be brought down gracefully, or whether it otherwise runs the risk of causing duplicate data via the write-ahead log. In the scenario below I outline the sequence of steps in which a queue receiver interacts with a queue that requires acknowledgements for messages.
1. The Spark queue receiver pulls a batch of messages from the queue.
2. The Spark queue receiver stores the batch of messages in the write-ahead log.
3. The Spark application is terminated before an ack is sent to the queue.
4. The Spark application starts up again.
5. The messages in the write-ahead log are processed through the streaming application.
6. The Spark queue receiver pulls a batch of messages from the queue which have already been seen in step 1, because they were not acknowledged as received.
...
Is my understanding correct of how custom receivers should be implemented and of the duplication problems that come with them, and is it normal to require a graceful shutdown?
Bottom line: It depends on your output operation.
Using the Direct API approach, introduced in Spark 1.3, eliminates inconsistencies between Spark Streaming and Kafka, so each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints.
In order to achieve exactly-once semantics for the output of your results, the output operation that saves the data to an external data store must be either idempotent or an atomic transaction that saves both results and offsets.
For further information on the Direct API and how to use it, check out this blog post by Databricks.
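A minimal sketch of the Direct API approach, using the spark-streaming-kafka-0-10 integration; the topic name, Kafka settings and checkpoint path are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class DirectStreamExample {

        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("direct-stream-example");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            jssc.checkpoint("/tmp/checkpoints"); // offsets survive restarts via checkpoints

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "kafka-host:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "direct-stream-example");

            // No receiver and no write-ahead log: the stream computes its own offsets.
            JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(Arrays.asList("my-topic"), kafkaParams));

            // The output operation must still be idempotent (or transactional) for
            // end-to-end exactly-once, as noted above.
            stream.foreachRDD(rdd -> rdd.foreach(record ->
                    System.out.println(record.key() + " -> " + record.value())));

            jssc.start();
            jssc.awaitTermination();
        }
    }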
