Process all file change events within specified time frame with TPL Dataflow - multithreading

I am monitoring multiple log files across multiple directories. I need to trigger an SSIS package when a file has fired an onchange event. Easy enough, but the complication is that I don't want to trigger the SSIS package every time the file changes. I want to wait and capture at least 5 minutes' worth of changes to a specific file.
Having used FileSystemWatcher before, I know it raises each onchange event on a new thread. My thought is to pass these events into a TPL Dataflow block, have it wait for a specified time interval, and then trigger an SSIS package. Basically, I want to trigger the related SSIS package every 5 minutes if there have been file change events.
If anyone could point me in the right direction as a starting point I would greatly appreciate it!

I can't tell from your question whether TPL Dataflow is a requirement or just an idea.
I'd probably keep it simple and avoid TPL dataflow. Just poll using a System.Threading.Timer and have it check System.IO.File.GetLastWriteTime on the file.
If you want to get fancy you could use Rx, convert the FileSystemWatcher event to an Observable, and use the Buffer(TimeSpan) method.
TPL Dataflow doesn't have any intrinsic support for time windows, so you'd have to roll your own, probably using one of the two methods above to build it out. My experience with TPL Dataflow is that it's too big and cumbersome for small tasks and too rudimentary for big tasks, so I'd avoid taking that approach.
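The polling idea can be sketched roughly as follows. This is a minimal illustration in Python rather than C#, so it uses os.path.getmtime and a threading.Event wait loop in place of File.GetLastWriteTime and System.Threading.Timer. The names changed_files and poll are hypothetical, and the on_change callback stands in for whatever launches the SSIS package.

```python
import os
import threading


def changed_files(paths, last_seen):
    """Return the paths whose modification time changed since the last poll.

    last_seen maps path -> mtime from the previous poll and is updated in place.
    The first sighting of a file only records a baseline and is not reported.
    """
    changed = []
    for path in paths:
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            continue  # file missing or unreadable on this poll
        previous = last_seen.get(path)
        last_seen[path] = mtime
        if previous is not None and previous != mtime:
            changed.append(path)
    return changed


def poll(paths, interval_seconds, on_change, stop_event):
    """Every interval_seconds, call on_change(path) once per changed file.

    on_change is a placeholder for the code that triggers the package.
    Set stop_event to end the loop.
    """
    last_seen = {}
    changed_files(paths, last_seen)  # record the baseline
    while not stop_event.wait(interval_seconds):
        for path in changed_files(paths, last_seen):
            on_change(path)
```

With interval_seconds set to 300, each file triggers at most one callback per five-minute window, however many writes happened inside that window.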

Related

Best way to implement background “timer” functionality in Python/Django

I am trying to implement a Django web application (on Python 3.8.5) which allows a user to create “activities” where they define an activity duration and then set the activity status to “In progress”.
The POST action to the View writes the new status, the duration and the start time (end time, based on start time and duration is also possible to add here of course).
The back-end should then keep track of the duration and automatically change the status to “Finished”.
User actions can also change the status to “Finished” before the calculated end time (i.e. the timer no longer needs to be tracked).
I am fairly new to Python so I need some advice on the smartest way to implement such a concept?
It needs to be efficient and scalable – I’m currently using a Heroku Free account so have limited system resources, but efficiency would also be important for future production implementations of course.
I have looked at the Python threading Timer, and this seems to work on a basic level, but I’ve not been able to determine what kind of constraints this places on the system – e.g. whether the spawned Timer thread might prevent the main thread from finishing and releasing resources (i.e. Heroku Dyno threads), etc.
I have read that persistence might be a problem (if the server goes down), and I haven’t found a way to cancel the timer from another process (the .cancel() method seems to rely on having the original object to cancel, and I’m not sure if this is achievable from another process).
I was also wondering about a more “background” approach, i.e. a single process which is constantly checking the database looking for activity records which have reached their end time and swapping the status.
But what would be the best way of implementing such a server?
Is it practical to read the database every second to find records with an end time of “now”? I need the status to change in real-time when the end time is reached.
Is something like Celery a good option, or is it overkill for a single process like this?
As I said I’m fairly new to these technologies, so I may be missing other obvious solutions – please feel free to enlighten me!
Thanks in advance.
To achieve this you need some kind of task scheduling functionality. For a quick, simple implementation, the Timer object from the threading module is a good solution.
A more complete solution is to use Celery. Even if you are new to it, learning Celery is worthwhile: it works as a queue manager that lets you distribute your work easily across several threads or processes.
You mentioned that you want it to be efficient and scalable, so I guess you will want to implement similar features that require multiprocessing and scheduling. For that reason my recommendation is to use Celery.
You can integrate it into your Django application easily by following the documentation: Integrate Django with Celery.
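For the simpler Timer route, a minimal sketch might look like the following. The names start_activity, finish_early and the on_finish callback are hypothetical; on_finish stands in for the code that writes the "Finished" status to the database. The registry lives in one process only, so timers are lost on restart and cannot be cancelled from another process, which are the limitations raised in the question.

```python
import threading

_timers = {}              # activity_id -> threading.Timer (this process only)
_lock = threading.Lock()


def start_activity(activity_id, duration_seconds, on_finish):
    """Schedule on_finish(activity_id) to run after duration_seconds."""
    def fire():
        with _lock:
            if _timers.get(activity_id) is not timer:
                return  # cancelled or replaced before the timer fired
            del _timers[activity_id]
        on_finish(activity_id)

    timer = threading.Timer(duration_seconds, fire)
    timer.daemon = True   # do not keep the process alive just for the timer
    with _lock:
        _timers[activity_id] = timer
    timer.start()


def finish_early(activity_id):
    """Cancel a pending timer. Return True if one was registered."""
    with _lock:
        timer = _timers.pop(activity_id, None)
    if timer is None:
        return False
    timer.cancel()
    return True
```

A view that marks an activity as finished by user action would call finish_early(activity_id) so the callback never runs for it.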

fs.watch vs setInterval in node.js

I have an application where I read data from a CSV file every 500 ms.
The CSV file is changed every 300 ms by another desktop application.
So which one is better to use in this case: fs.watch or setInterval?
In this situation I'd go with fs.watch, since it helps create a more robust architecture.
Let's assume we are using timers (setTimeout or setInterval). We need to hardcode the delay, and if the other application later scales up and updates the CSV faster or slower, you will need to modify your code. With fs.watch you don't care how many change events occur, and your application will not need any changes.
The biggest issue I see at the moment with fs.watch is that if the other application updates the CSV so fast that a new event is dispatched before your import finishes, you will have a hard time dealing with race conditions. Until that point, fs.watch is a good call, in my opinion.
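One common way to handle the overlap described above is to coalesce events: while an import is running, later change notifications only set a flag, and a single extra import runs afterwards. The sketch below shows the idea in Python rather than Node.js; the class name CoalescingImporter and the do_import callback are hypothetical, and this is one possible approach, not something fs.watch provides.

```python
import threading


class CoalescingImporter:
    """Run do_import for change events without overlapping imports."""

    def __init__(self, do_import):
        self._do_import = do_import   # placeholder for the CSV import routine
        self._lock = threading.Lock()
        self._running = False
        self._pending = False

    def notify_change(self):
        """Call this from the file-change handler."""
        with self._lock:
            if self._running:
                self._pending = True  # remember that one more pass is needed
                return
            self._running = True
        while True:
            try:
                self._do_import()
            except Exception:
                with self._lock:
                    self._running = False
                    self._pending = False
                raise
            with self._lock:
                if not self._pending:
                    self._running = False
                    return
                self._pending = False  # changes arrived mid-import: run again
```

Any number of events that arrive during one import collapse into a single follow-up import, so the importer always ends on the latest file contents without running two imports at once.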

Loop until data set is not in use with JCL

I am working on a mainframe and I need to wait until a dataset is released before automatically executing a JOB. Do you know any simple way to loop in JCL until a dataset is no longer in use? I was looking on the web and I found some solutions with REXX, but they seemed too complicated for something as simple as what I need. Also, I have never used REXX.
Regards!
P.S. Also, the data set might not exist.
Edit: I need this because I run an XCOM JOB which transfers a file from another system to a mainframe dataset. The problem is that when this JOB finishes, the file may still be being transferred, and I would like to wait for the transfer to complete before starting the next JOB. Maybe by editing the statement of the next JOB associated with the dataset.
The easy way to do this is to ensure that your file transfer package allocates the dataset with an OLD disposition. That will create a system-level enqueue on the dataset and prevent your job from running until the enqueue is released.
Many file transfer packages offer some sort of 'file complete' exit that can also trigger a job once a dataset transmission is fully complete.
But you can't loop in JCL. You can in REXX, but it has a host of issues that you have to deal with, not at all simple.

Best scheduling practice using VC++ for capture in directshow

I have a capture application that performs MP4 capture. I need to schedule this application to capture video every 30 minutes (or at some dynamic interval).
I read the MSDN article for IReferenceClock::AdviseTime. I am not sure, but I assume it will trigger an event when the end time elapses. However, it does not seem to work. Please advise me if my understanding of it is incorrect. Or is there a better way to schedule a repeated video capture every 30 minutes?
Thanks
IReferenceClock::AdviseTime is what lets you schedule an event to be set at a given reference time. Filters might take advantage of this internally as part of the streaming operation. For you, however, this method is of no use. There are a number of ways to trigger an action every 30 minutes. In a running application you would typically use the SetTimer + WM_TIMER API. If you want your app started every 30 minutes, Task Scheduler is there for you.

How to design a filewatcher /directory watcher in VC++?

I am new to VC++ and programming. I have a task in which I am supposed to design a file watcher in VC++.
The problem goes this way:
I have to monitor some log files continuously; whenever a particular log file gets deleted (this deletion is done by some other program), I have to open a text file and write some data and the timestamp into it.
How do I go about it? Please help!!
First, you need to set up a system to monitor for file events from that folder.
To get started, take a look at FindFirstChangeNotification().
You'll basically get a waitable handle from that.
Then, were it me, I'd have a thread that waited on that event. Each time the event triggers, the thread resumes, queries for the change details (which file), performs the needed actions, and goes back to sleeping on that handle.
You'll need some additional semaphore or similar to interrupt this worker thread and wake it so that you can tell it to quit. It's simple to do: have your thread's main loop call WaitForMultipleObjects on both the "wake up semaphore" and the FindFirstChangeNotification handle. When you wake up, check which event notified you, then either process the file change or quit.
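The shape of that worker loop can be sketched portably in Python. Note that this swaps in periodic directory polling for the FindFirstChangeNotification handle, and a threading.Event for the "wake up semaphore", so it illustrates the loop structure rather than the Win32 calls. The function names and the report file format are hypothetical.

```python
import os
import threading
from datetime import datetime


def detect_deletions(directory, known):
    """Return names that were in `known` but are gone now; update `known` in place."""
    current = set(os.listdir(directory))
    deleted = sorted(known - current)
    known.clear()
    known.update(current)
    return deleted


def watch_deletions(directory, report_path, stop_event, interval_seconds=1.0):
    """Worker loop: log each deleted file with a timestamp until stop_event is set.

    stop_event.wait() doubles as the sleep and the quit signal, so setting the
    event wakes the worker immediately instead of waiting out the interval.
    """
    known = set(os.listdir(directory))
    while not stop_event.wait(interval_seconds):
        for name in detect_deletions(directory, known):
            with open(report_path, "a", encoding="utf-8") as report:
                report.write(f"{datetime.now().isoformat()} deleted: {name}\n")
```

To use it, run watch_deletions in a threading.Thread, keep the report file outside the watched directory, and call stop_event.set() followed by join() to shut the worker down.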
MFC has a slightly different way of handling it, but to do this using the Win32 API what you'd typically do is use the Directory Management Functions to set up a change notification handle for the directory the file goes in. Then you can wait on the handle; when something happens inside that directory your wait completes, and you can check whether it was a change to the file that you care about.
Look at the docs for FindFirstChangeNotification and ReadDirectoryChangesW for more information.
Try the Windows Management Instrumentation (WMI) if you have enough privileges. AFAIK it is also the most efficient way to handle the filesystem events.
Handle or query the __InstanceDeletionEvent, __InstanceModificationEvent or __InstanceCreationEvent for the deletion, modification or creation events respectively and filter the files and target path that you want.
Take a look at the WMI Reference/C++ invocation.
For a full-scale example take a look at codeproject querying example.
I strongly recommend you consider using the implementation here. This API is not 100% reliable, but this code does a good job of wrapping it. If your filesystem traffic is local and not too frequent, it should work well for you.

Resources