I have the scenario below and am currently using Spring Integration to implement it.
I have around 18000 staff IDs.
For each staff member, a process needs to make 1 HTTP call to retrieve staff profile information from the mail calendar server, then 1 HTTP call to retrieve some other information, and may then need to send 3-5 more HTTP calls, all within a single task.
I need to finish this process for the above 18000 staff in 15 mins.
I need this whole batch process to run every 15 mins, again and again.
Assuming each job takes 5 seconds to finish, I would still need 30 mins with my current pool of 50 threads (18000 × 5 s / 50 = 1800 s).
=================
Initial thinking
I can use Spring Integration to do something like this:
- Create one job for each staff member: 18000 jobs. The job request likely contains only a staff ID, so each request is very lightweight.
- Add all the jobs to the int:queue at once so that it feeds the input channel, calenderSynRequestChannel.
- Have a poller with 100 concurrent workers to clear the queue within 15 mins.
Questions:
Is this a good way to do this kind of batch processing? One concern I have is the size of the queue needed to hold 18000 jobs at once.
Should I use a file-based approach instead, storing all the staff IDs in multiple files to be picked up later by the poller? However, this would complicate the design, as there could be concurrency issues when the workers read, write, and delete the files.
Current solution:
<int:service-activator ref="synCalenderService" method="synCalender" input-channel="calenderSynRequestChannel">
    <int:poller fixed-delay="50" time-unit="MILLISECONDS" task-executor="taskExecutor" receive-timeout="0" />
</int:service-activator>
<task:executor id="taskExecutor" pool-size="50" keep-alive="120" queue-capacity="500"/>
Anyone who has encountered a similar problem might be able to give some insight into how to address this using Spring Integration.
Why not write a Spring Batch job with:
A reader that reads the staff data
A processor that makes the HTTP calls
A writer that writes the results to a log file (for example)
Then use the TaskScheduler (part of the Spring Framework) to schedule an execution every 15 minutes, or maybe even better, with a fixed delay.
If you want to do more of the work in parallel, use org.springframework.batch.integration.async.AsyncItemProcessor (together with the matching AsyncItemWriter).
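To make the sizing and the "do it in parallel" idea concrete, here is a minimal plain-JDK sketch. It is not Spring Batch or Spring Integration code: the class StaffSyncSketch, the methods workersNeeded and processAll, and the syncOne function standing in for the HTTP calls are all hypothetical names. The worker estimate assumes the 5 seconds per job mentioned in the question, with the jobs spread evenly over the workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class StaffSyncSketch {

    // Smallest worker count that fits jobCount jobs of secondsPerJob each into windowSeconds.
    // Assumption: jobs are spread evenly and scheduling overhead is ignored.
    static int workersNeeded(int jobCount, int secondsPerJob, int windowSeconds) {
        long totalWorkSeconds = (long) jobCount * secondsPerJob;
        return (int) ((totalWorkSeconds + windowSeconds - 1) / windowSeconds); // ceiling division
    }

    // Runs syncOne for every staff ID on a fixed-size pool and returns the results in input order.
    static <R> List<R> processAll(List<String> staffIds, int workers, Function<String, R> syncOne) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String id : staffIds) {
                tasks.add(() -> syncOne.apply(id));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pool.invokeAll(tasks)) { // blocks until every task is done
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while syncing staff", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a staff sync task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(workersNeeded(18000, 5, 15 * 60)); // 100
        // The lambda is a placeholder for the real HTTP calls.
        System.out.println(processAll(Arrays.asList("s1", "s2", "s3"), 2, id -> id + ":synced"));
    }
}
```

Under these assumptions, 18000 jobs of 5 seconds each need 100 workers to fit into a 15 minute window, while 50 workers need 30 minutes.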
Related
I am new to Spark Streaming and have some questions:
Do we always need more than one executor, or can we do our job with just one?
I am pulling data from Kafka using createDirectStream, which is the receiver-less method, and the batch duration is one minute. Is my data received during one batch and then processed during the next batch duration, or is it processed simultaneously?
If it is processed simultaneously, how is it ensured that my processing finishes within the batch duration?
How do I use the web UI for monitoring and debugging?
Do we always need more than one executor, or can we do our job with just one?
It depends :). If you have a very small volume of traffic coming in, it could very well be that one machine could suffice in terms of load. In terms of fault tolerance that might not be a very good idea, since a single executor could crash and bring down your entire stream.
I am pulling data from Kafka using createDirectStream, which is the receiver-less method, and the batch duration is one minute. Is my data received during one batch and then processed during the next batch duration, or is it processed simultaneously?
Your data is read once per minute, processed, and only upon the completion of the entire job will it continue to the next. As long as your batch processing time is less than one minute, there shouldn't be a problem. If processing takes more than a minute, you will start to accumulate delays.
If it is processed simultaneously, how is it ensured that my processing finishes within the batch duration?
As long as you don't set spark.streaming.concurrentJobs to more than 1, only a single streaming job is executed at a time, so batches are processed one after another.
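The effect of a processing time that exceeds the batch duration can be shown with a small calculation. This is a simplified model in plain Java, not Spark code: the class BatchDelaySketch and the method schedulingDelays are hypothetical names, and the model assumes batches run strictly one at a time and that a batch becomes ready when its interval closes.

```java
import java.util.Arrays;

final class BatchDelaySketch {

    // Delay (in seconds) each batch waits before it starts when batches run strictly one at a time.
    static long[] schedulingDelays(long batchIntervalSec, long[] processingSec) {
        long[] delays = new long[processingSec.length];
        long previousFinish = 0;
        for (int i = 0; i < processingSec.length; i++) {
            long ready = (i + 1) * batchIntervalSec;      // batch i is ready once its interval closes
            long start = Math.max(ready, previousFinish); // it cannot start before the previous batch is done
            delays[i] = start - ready;
            previousFinish = start + processingSec[i];
        }
        return delays;
    }

    public static void main(String[] args) {
        // Processing faster than the one-minute batch duration: no delay builds up.
        System.out.println(Arrays.toString(schedulingDelays(60, new long[] {50, 50, 50}))); // [0, 0, 0]
        // Processing slower than the batch duration: the delay grows with every batch.
        System.out.println(Arrays.toString(schedulingDelays(60, new long[] {90, 90, 90}))); // [0, 30, 60]
    }
}
```

In this model, a 90 second processing time with a 60 second batch duration adds 30 seconds of delay per batch, which is the accumulation described above.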
How do I use the web UI for monitoring and debugging?
This question is generally too broad for SO. I suggest starting with the Streaming tab that gets created once you submit your application, then diving into the details of each batch and continuing from there.
To add a bit more on monitoring
How do I use the web UI for monitoring and debugging?
Monitor your application in the Streaming tab on localhost:4040. The main metrics to look for are Processing Time and Scheduling Delay. Have a look at the official docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html#monitoring-applications
batch duration is one minute
Your batch duration is a bit long; try lowering it to improve your latency. 4 seconds can be a good start.
Also it's a good idea to monitor these metrics on Graphite and set alerts. Have a look at this post https://stackoverflow.com/a/29983398/3535853
I have multiple Spring Batch jobs that I need to schedule so that they are triggered at the same time. So far I am not having any luck with this, and I do not understand why. I read somewhere that triggering multiple jobs on the same cron expression (e.g. every 10 mins) is not possible with Quartz, and that I have to increase the size of the thread pool somehow to make it possible. I just want to understand: is this correct? If yes, is there an alternative way of achieving this sort of concurrency? If not, how can I increase the thread pool size in the XML configuration? Many thanks in advance.
How do I configure a Spring sftp:inbound-channel-adapter to run only between specific times, let's say 8am-7pm?
Currently I just have the configuration below, and I need to poll only between 8am and 7pm:
<int:poller fixed-rate="300000" max-messages-per-poll="1" />
I heard that Spring Batch would help. Any suggestions?
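Please be more specific. Describe your requirements in plain words.
E.g. you need to poll every 5 minutes, starting at 8am and ending at 7pm, every day.
In this case the cron expression will look like:
0 0/5 8-19 * * *
(Spring cron expressions have six fields, starting with seconds. Note that the hour range 8-19 keeps firing until 19:55; use 8-18 if the last poll should happen before 7pm.)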
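To see what the minute and hour fields of such an expression mean in practice, here is a small plain-Java sketch that lists the times of day at which a poll would fire. It does not use Spring's cron parser: the class PollWindowSketch and the method fireTimes are hypothetical names that only model "every N minutes, for every hour in an inclusive range".

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

final class PollWindowSketch {

    // Times of day matched by minute field "0/stepMinutes" and hour field "firstHour-lastHour" (inclusive).
    static List<LocalTime> fireTimes(int stepMinutes, int firstHour, int lastHour) {
        List<LocalTime> times = new ArrayList<>();
        for (int hour = firstHour; hour <= lastHour; hour++) {
            for (int minute = 0; minute < 60; minute += stepMinutes) {
                times.add(LocalTime.of(hour, minute));
            }
        }
        return times;
    }

    public static void main(String[] args) {
        List<LocalTime> until19 = fireTimes(5, 8, 19);
        System.out.println(until19.size() + " polls, last at " + until19.get(until19.size() - 1)); // 144 polls, last at 19:55
        List<LocalTime> until18 = fireTimes(5, 8, 18);
        System.out.println(until18.size() + " polls, last at " + until18.get(until18.size() - 1)); // 132 polls, last at 18:55
    }
}
```

The hour range is inclusive, so an upper bound of 19 still polls throughout the 7pm hour, while an upper bound of 18 stops at 18:55.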
My problem statement: read a CSV file with 10 million records and store it in a database in as little time as possible.
I had implemented it using a simple multi-threaded executor in Java, and the logic is almost the same as Spring Batch's chunk processing: read a preconfigured number of records from the CSV file, then create a thread and pass the data to it; the thread validates the data and writes it to a file, with several such threads running in parallel. Once all the tasks are done, I call SQL*Loader to load each file. Now I want to move this code to Spring Batch (I'm new to Spring Batch).
Here are my questions:
1. Within a task, is it possible to make the path from ItemReader to ItemWriter multi-threaded (as I read the file, create a new thread to process the data before that thread writes it)? If not, I need to create two steps: a first, single-threaded step that reads the file, and a second, multi-threaded step that writes to individual files. But how do I pass the list of data from the previous task to the next one?
2. If there is a failure in a single thread, how can I stop the whole batch job?
3. How do I retry the batch job after a certain interval in case of failure? I know there is a retry option for failures, but I could not find an option to retry the task after a certain interval. I'm not talking about the scheduler here, because my batch job already runs under a scheduler; on failure it has to be re-run after 3 minutes or so.
Here is how I solved the problem.
Read the file and split it into chunks using buffered and FileChannel readers and writers (the fastest way to read/write files; even Spring Batch uses the same). I implemented this so that it runs before the job starts (however, it could also run as a step of the job using a method invoker).
Start the job with the directory location as a job parameter.
Use MultiResourcePartitioner, which takes the directory location and creates a slave step in a separate thread for each file.
In the slave step, get the file passed from the partitioner and use Spring Batch's ItemReader to read it.
Use a database ItemWriter (I'm using the MyBatis batch ItemWriter) to push the data to the database.
It's better to make the split count equal to the commit-count of the step.
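The split-then-load-in-parallel flow described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Spring Batch code: the class PartitionSketch and the methods split and loadInParallel are hypothetical names, the rows are kept in memory instead of in files, and the "write" is a placeholder that just counts rows where a real batch insert would go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitionSketch {

    // Splits the rows into consecutive partitions of at most chunkSize rows each.
    static <T> List<List<T>> split(List<T> rows, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<List<T>> partitions = new ArrayList<>();
        for (int from = 0; from < rows.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, rows.size());
            partitions.add(new ArrayList<>(rows.subList(from, to)));
        }
        return partitions;
    }

    // Handles each partition as its own task on the pool and returns the total number of rows "written".
    static int loadInParallel(List<List<String>> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<String> partition : partitions) {
                Callable<Integer> writeTask = () -> partition.size(); // placeholder for a real batch insert
                pending.add(pool.submit(writeTask));
            }
            int total = 0;
            for (Future<Integer> result : pending) {
                total += result.get(); // a failed partition surfaces here as an ExecutionException
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading partitions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a partition failed to load", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> partitions = split(Arrays.asList("r1", "r2", "r3", "r4", "r5"), 2);
        System.out.println(partitions.size() + " partitions, " + loadInParallel(partitions, 3) + " rows"); // 3 partitions, 5 rows
    }
}
```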
About multi-threaded reading: see the answer to "How to set up multi-threading in Spring Batch?"; it will point you in the right direction. That sample also has some considerations about restarting with a CSV file.
The job should fail automatically if an error occurs on a thread: I have never tried it, but this should be the default behaviour.
"Spring Batch How to set time interval between each call in a Chunk tasklet" can be a start. Also, from the official docs about backoff policies: "When retrying after a transient failure it often helps to wait a bit before trying again, because usually the failure is caused by some problem that will only be resolved by waiting. If a RetryCallback fails, the RetryTemplate can pause execution according to the BackoffPolicy in place."
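The idea behind a fixed backoff (wait a set interval, then try again) can be sketched in plain Java as follows. This is not the RetryTemplate or BackoffPolicy API: the class FixedBackoffRetrySketch and the method retry are hypothetical names, and the sketch only retries on unchecked exceptions.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class FixedBackoffRetrySketch {

    // Calls action up to maxAttempts times, sleeping backoffMillis between failed attempts.
    static <T> T retry(Supplier<T> action, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    pause(backoffMillis); // wait before the next attempt
                }
            }
        }
        throw lastFailure; // every attempt failed
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical job that fails twice with a transient error and then succeeds.
        String result = retry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls.get() + " attempts"); // ok after 3 attempts
    }
}
```

For the "re-run after 3 minutes or so" case from the question, the same sketch would be called with a backoff of 180000 milliseconds.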
Let me know if this helps or how you solved the problem, because I'm interested for my (future) work!
I hope my suggestions are helpful.
You can split your input file into many files, then use a Partitioner and load the small files with threads, but on error you must restart the whole job after cleaning the DB.
<batch:job id="transformJob">
    <batch:step id="deleteDir" next="cleanDB">
        <batch:tasklet ref="fileDeletingTasklet" />
    </batch:step>
    <batch:step id="cleanDB" next="split">
        <batch:tasklet ref="countThreadTasklet" />
    </batch:step>
    <batch:step id="split" next="partitionerMasterImporter">
        <batch:tasklet>
            <batch:chunk reader="largeCSVReader" writer="smallCSVWriter" commit-interval="#{jobExecutionContext['chunk.count']}" />
        </batch:tasklet>
    </batch:step>
    <batch:step id="partitionerMasterImporter" next="partitionerMasterExporter">
        <batch:partition step="importChunked" partitioner="filePartitioner">
            <batch:handler grid-size="10" task-executor="taskExecutor" />
        </batch:partition>
    </batch:step>
</batch:job>
Full example code (on Github).
Hope this helps.
We need to rebill x customers on any given day.
Currently, we run a cron job every 5 mins to bill 20 people, send invoices, etc.
However, as the number of customers grows, extending this to 100 people per 5 mins may result in cron runs overlapping and customers being billed twice.
I have two thoughts:
Running the cron job once, but making it sleep for x amount of time after every 20 customers billed/invoiced so that we don't spam the API.
Using a message queue where people are added to the queue and then "workers" process the queue. The problem is that I have no experience with this, so I'm not sure what the best route is.
Does anyone have any experience with this?
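For what it's worth, the second idea (a queue drained by workers) can be sketched in plain Java. This is an in-process illustration only, with a JDK queue standing in for a real message broker: the class BillingQueueSketch and the method billAll are hypothetical names, and "billing" is reduced to counting how often each customer is handled. Each customer ID is taken off the queue by exactly one worker, so no customer is handled twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class BillingQueueSketch {

    // Puts every customer on a queue, lets the workers drain it, and returns how often each customer was billed.
    static Map<String, Integer> billAll(List<String> customerIds, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(customerIds);
        Map<String, Integer> billCount = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                String customerId;
                // poll() hands each queued ID to exactly one worker; null means the queue is empty.
                while ((customerId = queue.poll()) != null) {
                    billCount.merge(customerId, 1, Integer::sum); // placeholder for the real billing call
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return billCount;
    }

    public static void main(String[] args) {
        List<String> customers = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            customers.add("customer-" + i);
        }
        Map<String, Integer> counts = billAll(customers, 4);
        System.out.println(counts.size() + " customers billed"); // 100 customers billed
    }
}
```

The number of workers, rather than the cron interval, then controls how hard the API is hit.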