How to ingest all the flowfiles in a CRON-driven DetectDuplicate?

In NiFi, I have a CRON-driven sequence of processors that daily provides a set of flowfiles containing 2 attributes I am interested in: product_code and publication_date.
My need is to keep only one flowfile per product_code: the one with the most recent publication_date.
Ex:
For this input:
flow_1: product_code: A / publication_date : 2018-01-01
flow_2: product_code: B / publication_date : 2018-01-01
flow_3: product_code: C / publication_date : 2018-01-01
flow_4: product_code: A / publication_date : 2018-04-12
flow_5: product_code: A / publication_date : 2000-12-31
flow_6: product_code: B / publication_date : 2018-02-02
flow_7: product_code: B / publication_date : 2018-03-03
The expected output should be:
flow_3: product_code: C / publication_date : 2018-01-01
flow_4: product_code: A / publication_date : 2018-04-12
flow_7: product_code: B / publication_date : 2018-03-03
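The selection rule above (keep, per product_code, only the flowfile with the most recent publication_date) can be sketched outside NiFi as a plain Java reduction; the Flow record below is my own stand-in for the two flowfile attributes, not a NiFi type:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupByLatestDate {
    // Minimal stand-in for the two flowfile attributes of interest.
    // Dates are ISO yyyy-MM-dd, so string comparison matches date order.
    record Flow(String productCode, String publicationDate) {}

    // Keep only the flowfile with the most recent publication_date per product_code.
    static Map<String, Flow> latestPerProduct(List<Flow> flows) {
        Map<String, Flow> latest = new LinkedHashMap<>();
        for (Flow f : flows) {
            latest.merge(f.productCode(), f,
                (old, cur) -> cur.publicationDate().compareTo(old.publicationDate()) > 0 ? cur : old);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Flow> input = List.of(
            new Flow("A", "2018-01-01"), new Flow("B", "2018-01-01"),
            new Flow("C", "2018-01-01"), new Flow("A", "2018-04-12"),
            new Flow("A", "2000-12-31"), new Flow("B", "2018-02-02"),
            new Flow("B", "2018-03-03"));
        latestPerProduct(input).values().forEach(f ->
            System.out.println(f.productCode() + " / " + f.publicationDate()));
        // A / 2018-04-12, B / 2018-03-03, C / 2018-01-01
    }
}
```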
The algorithm I tested
Use an UpdateAttribute processor to add a priority attribute to each flowfile, based on its publication_date.
These updated flowfiles are routed to a queue using the PriorityAttributePrioritizer.
The flowfiles stay in this queue because the only consuming processor is CRON driven. This way, I am sure the flowfiles in the queue are ordered according to publication_date.
Then the CRON triggers the next processor, a DetectDuplicate keyed on the product_code attribute. As the flowfiles are processed from the most recent to the oldest, I am sure that when a product_code is detected as a duplicate, the same product_code has already passed through with a more recent publication_date.
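One way to derive the priority attribute in UpdateAttribute is to map the publication_date to "days remaining until some far-future date", so a newer date yields a smaller number (if I read the docs correctly, PriorityAttributePrioritizer processes the lowest priority value first). A sketch of that computation, where the MAX_DATE reference point is my own arbitrary choice:

```java
import java.time.LocalDate;

public class PriorityFromDate {
    // Far-future reference date; an arbitrary choice for this sketch.
    static final LocalDate MAX_DATE = LocalDate.of(9999, 12, 31);

    // Newer publication_date -> smaller number -> processed earlier
    // (assuming the prioritizer takes the lowest priority value first).
    static long priority(String publicationDate) {
        return MAX_DATE.toEpochDay() - LocalDate.parse(publicationDate).toEpochDay();
    }

    public static void main(String[] args) {
        System.out.println(priority("2018-04-12") < priority("2018-01-01")); // newer first
    }
}
```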
The issue
Sadly, when the CRON triggers the DetectDuplicate processor, only one flowfile is consumed, and the others stay in the queue.
If I change the "Scheduling strategy" to "Timer driven" with a "Run schedule" of 0, all my flowfiles are consumed and the output is what is expected.
Is there a way to ask my DetectDuplicate processor to consume all the messages in the queue when it starts working (and not only one message)?
Or is there a way to set up a scheduling strategy like "start working at 2:00 AM and stop at 4:00 AM"?
Can you think of better strategies to meet this need?
Regards,
Val.
Update 1
(2018-04-13) More information, in addition to Bryan Bende's comments.
I know CRON is not the best solution, but I do not know how to improve my algorithm to get rid of it.
In my case, the flowFiles that are queued to be deduplicated are generated via a sequence of 3 REST calls:
1st call to "GetAllCategories",
then for each category, call the "GetSubCategories",
and for each subCategory, call the "GetProducts".
This flowFiles generation part lasts generally around 5 minutes: last night the first flowFile arrived in the queue at 2:00:16 AM and the last one at 2:04:58 AM. (That's why I scheduled the DetectDuplicate to run at 3:00 AM.)
If my DetectDuplicate processor were "Timer driven", the first flowFiles arriving in the queue would be consumed before all the flowFiles were there.
And this would break the ordering of the full set of flowFiles.
I feel like I have to wait for all the flowfiles to be in the queue before the DetectDuplicate processor starts working.
Do you have potential suggestions to improve my algorithm?

You should generally use CRON scheduling for the source processor that starts the flow and then all other processors should be Timer Driven with Run Schedule of 0.
For example, if you pick up files from a directory every day at 2:00 AM, then GetFile should be scheduled with a CRON expression to start the flow at 2:00 AM, but nothing beyond that needs CRON scheduling because they will never receive data unless GetFile runs.
In the case where you want a processor to wait to execute until all flow files are available, you may be able to use the Wait/Notify processors, such that all the flow files build up in front of a Wait processor before being released to the DetectDuplicate processor.

The reason only one message gets consumed is that when CRON scheduling is enabled on all the processors (source and consuming/downstream), it executes like this:
Ex: You have set up a CRON schedule in all processors to run every day at 2 PM. At each trigger, a consuming processor pulls exactly one flowfile from its upstream processor (ex: GetFile) at 2 PM; the rest of the flowfiles stay in the queue, and the next one will only be consumed the next day at 2 PM, and so on. The same applies to every further downstream processor: each consumes only one flowfile per trigger, every day at 2 PM, which is essentially a disaster in the making. Who wants processing at a snail's pace?
That's why you have to follow the approach @Bryan mentioned. The flow pipeline should only have its source processor CRON driven; the rest of the processors should be Timer driven with a run schedule of your choice, though generally 0 sec is used so flowfiles are consumed as they arrive.

Related

Prevent concurrent cron jobs in pg-boss

I’m considering pg-boss for running and distributing event-based jobs between instances of the same service. One of my use cases, apart from event-based jobs, is scheduled jobs. Some of them can take a while and still be running when it’s time for the next invocation, e.g. a job set to run every 5 minutes that takes 8 minutes to complete. In that case I need the system to realize that the previous run is still in progress and not trigger the same job again while it is. Using the example of a job scheduled every 5 minutes but taking 8 minutes, I’d like something like the following to happen:
13:00 job triggered
13:05 job still runs, system sees it and doesn’t trigger once more even though it’s time
13:08 job done
13:10 next job run triggered
Is there an elegant way to achieve it with pg-boss without implementing my own locking mechanism?

AnyLogic: Resource (from the pool) is not being released when its availability time (as per a certain schedule) is over

In my AnyLogic model there are certain services (some delay the agent 10 to 15 minutes, others 4 to 8 hours) that use resources from a resource pool.
The resources (pool) are available on a well-defined schedule (available the entire week except Sunday, 10 am to 1:30 pm and then 2:00 pm to 6 pm).
I can see that once a service starts, it continues until it finishes, even after the resource availability window is over.
For example:
A resource is available the entire week except Sunday, 10 am to 1:30 pm and then 2:00 pm to 6 pm.
A service with an 8-hour delay starts at 12:30 pm; once it starts, it continues until it finishes. In practice, it should release the resource from 1:30 to 2 pm, and if the task is not over by 6 pm it should not continue beyond that either; it should stop and resume the next day (or on the next availability window).
But it does continue, once it starts, until it finishes.
Could you suggest the specific area to target in code, or any other option that is available?
Define your resource-pool downtimes using a Downtime block and tick its "May preempt other tasks" option.
NOTE: play around with preemption, as it interacts with Seize preemption, resource-pool preemption, and priorities. Start simple and add complexity only when you fully understand how things work under the hood.

Scheduled execution at a certain point of a flow

I need a functionality which does the following:
At a certain point of a flow, execution is paused until a specified time.
(It's like parking/staging: all messages remain in place until the specified time.)
So if you set 2016-04-20 11:12:00 as that time (ideally specified by a cron expression), everything is paused until then (the flow does not continue processing messages). When the specified time elapses, the workflow continues execution from the point where this "staging" component resides.
Is it possible to do that with Spring Integration?
How should be implemented?
Actually the defaultDelay for the DelayHandler can be calculated from the date value:
@Autowired
@Qualifier("myDelayer.handler")
private DelayHandler myDelayer;
...
Date nextDate = ...
myDelayer.setDefaultDelay(nextDate.getTime() - System.currentTimeMillis());
and use this code somewhere after your application starts, e.g. on a ContextRefreshedEvent.
Or you can just place a desired Date to the message header and use delay-expression.
Alternatively, you can place your messages in a QueueChannel and use the desired cron expression on the <poller> of the endpoint that polls messages from that queue.
If you have such a long delay time for those messages, you should consider using a persistent MessageStore on that QueueChannel.
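The delay arithmetic in the snippet above is just "target time minus now"; a self-contained sketch of that calculation, using java.time instead of the legacy Date (the method name is mine, not Spring's):

```java
import java.time.Duration;
import java.time.Instant;

public class DelayUntil {
    // Milliseconds from `now` until `target`, clamped at zero so a target
    // in the past releases the message immediately instead of producing
    // a negative delay.
    static long delayMillis(Instant now, Instant target) {
        return Math.max(0, Duration.between(now, target).toMillis());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2016-04-20T11:00:00Z");
        Instant target = Instant.parse("2016-04-20T11:12:00Z");
        System.out.println(delayMillis(now, target)); // 720000 (12 minutes)
    }
}
```

The result of this calculation is what would be passed to setDefaultDelay, or carried in a message header for delay-expression.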

Linux task schedule to Hour, minute, second

I'm trying to run a shell script at a specific time down to the second (H:M:S), but so far all the programs I have found, such as at, only go down to a specific minute (not second).
I don't want to use sleep since it's not accurate: for some reason it once ended a couple of hours earlier than it was supposed to!
Your question doesn't define how much accuracy you need, and there is always some jitter in scheduling on electronic devices. You might use Quartz to schedule to the second. You could also use at or cron to schedule to the minute and then sleep the appropriate number of seconds.
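The "schedule to the minute, then sleep the remainder" idea can also be done in one step from a long-running process with a ScheduledExecutorService, computing the delay to the target wall-clock time at second precision; a sketch, where the target H:M:S is a placeholder:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RunAtSecond {
    // Delay in milliseconds from `now` until `target`; if `target` has
    // already passed, roll it over to the same time tomorrow.
    static long millisUntil(LocalDateTime now, LocalDateTime target) {
        if (!target.isAfter(now)) target = target.plusDays(1);
        return Duration.between(now, target).toMillis();
    }

    public static void main(String[] args) {
        LocalDateTime target = LocalDateTime.now()
                .withHour(14).withMinute(30).withSecond(15).withNano(0); // placeholder H:M:S
        long delay = millisUntil(LocalDateTime.now(), target);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.schedule(() -> System.out.println("running the script now"),
                delay, TimeUnit.MILLISECONDS);
        // Default policy still runs already-scheduled delayed tasks after shutdown.
        scheduler.shutdown();
    }
}
```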

Autosys box issue

I have an Autosys box job, auto_task_box, which has 3 child jobs:
: auto_task1_wd - runs every 5 minutes, Monday to Friday
: auto_task2_dly - runs at 02:00 every day
: auto_task3_sa - runs at 03:00 every Saturday
The issue is that after the first run of auto_task1_wd, the box waits for the completion of auto_task2_dly and auto_task3_sa, so the next iteration of auto_task1_wd (i.e. 5 minutes later) won't happen.
How should I tackle this issue?
I am using Autosys R11 on Linux.
It sounds like the three jobs should run independently of each other. In that case I would not use a box at all, but just three separate jobs, as I tend to think of boxes as a way of ensuring relationships between jobs.
I agree with the first answer: these jobs should not be grouped into the same box. Boxes are containers for jobs with like starting conditions. It is a very bad idea to have date/time conditions on jobs inside a box; you will get some unexpected runs that way.
