Best process to retrieve files daily from outside Azure

I've asked this question in the Azure Logic Apps (LA) forum since I've used LA to implement this process, but I may also get some valuable input here.
High-level description: for one specific client, we need to download dozens of files daily from an SFTP location to our servers in order to process their data. In the past, this workflow was built with tools from a different technology than Azure, but what we aimed to have was a general process that could be used for different source systems, different files, etc. With that in mind, our process starts by retrieving, from a database, the variables to be applied to each execution, such as:
Business date
Remote location path - sftp location
Local location path - internal server location
File extension - .csv, .zip, etc
Number of iterations
Wait time between iterations
Dated files - whether or not the file names include the business date
Once all this is defined at the beginning of the process (there's some extra logic to it, not as straightforward as just getting variables, but let's assume this for example purposes), the following logic is applied (the image below may help understand the LA flow, and a code sketch of the loop follows it):
Search for the file in the SFTP location
If the file is there, get its size, wait X amount of time, and check the size again.
If the file isn't there, try again until the maximum number of iterations is reached or the file is found
If the file sizes match, proceed to download the file
If the file sizes don't match, try again until the maximum number of iterations is reached or the sizes match
[LA flow diagram]
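For illustration, here is a minimal sketch of that loop, assuming the paramiko library; the host, credentials, paths, iteration count, and wait time are hypothetical stand-ins for the values the process reads from the database.

    # Sketch of the poll / size-stability / download loop described above.
    # All connection details and paths below are hypothetical placeholders.
    import time
    import paramiko

    HOST, USER, PASSWORD = "sftp.example.com", "user", "secret"   # hypothetical
    REMOTE_PATH = "/outbound/daily_report.csv"                    # hypothetical
    LOCAL_PATH = "/data/incoming/daily_report.csv"                # hypothetical
    MAX_ITERATIONS, WAIT_SECONDS = 10, 60                         # hypothetical

    def remote_size(sftp, path):
        """Return the remote file size, or None if the file isn't there yet."""
        try:
            return sftp.stat(path).st_size
        except IOError:
            # For this sketch, treat any stat failure as "not there yet".
            return None

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USER, password=PASSWORD)
    sftp = client.open_sftp()

    downloaded = False
    for _ in range(MAX_ITERATIONS):
        size_before = remote_size(sftp, REMOTE_PATH)
        if size_before is None:                # file not there yet: wait and retry
            time.sleep(WAIT_SECONDS)
            continue
        time.sleep(WAIT_SECONDS)               # file found: wait, then re-check size
        if remote_size(sftp, REMOTE_PATH) == size_before:
            sftp.get(REMOTE_PATH, LOCAL_PATH)  # size is stable: safe to download
            downloaded = True
            break
        # otherwise the file is still being written: loop and try again

    client.close()
    if not downloaded:
        raise RuntimeError("file not found or still changing after all iterations")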
In our LA this process is implemented and working fine. We have two parameters in the LA, the filename and the source system, and based on these all variables are retrieved at the beginning of the process. Those two parameters can be changed from LA to LA, and through scripting we can automatically deploy multiple LAs (one for each file we need to download). The process uses a schedule trigger since we want to run it at a specific time each day; we don't want to use the trigger that fires when a file is placed in the remote location, since several files we aren't interested in may be placed there.
One limitation I can see compared to our current process is that we can't group multiple LAs under one pipeline, where we could group multiple executions and check the state of them all without needing to check each LA individually. I'm aware that we can monitor LAs with OMS and, potentially, call multiple LAs from a Data Factory pipeline, but I'm not exactly sure how that would work in this case.
Anyway, here is where my QUESTION comes in: what would be the best feature in Azure to implement this type of process? LA works, since I already have it built; I'm going to look at replicating the same process in Data Factory, but I'm afraid it may be a bit more complicated to set up this kind of logic there. What else could potentially be used? I'm really open to all kinds of suggestions; I just want to make sure I consider all valid options, which is hard considering how many different features Azure offers and how hard it is to keep track of them all.
Appreciate any input, cheers

Related

Managing large quantity of files between two systems

We have a large repository of files that we want to keep in sync between one central location and multiple remote locations. Currently, this is being done using rsync, but it's a slow process mainly because of how long it takes to determine the changes.
My current thought is to find a VCS-like solution where, instead of having to check all of the files, we can check the diffs between revisions to determine what gets sent over the wire. My biggest concern, however, is that we'd have to re-sync all of the files that are currently in sync, which is a significant effort. I've been told that the current repository is about 0.5 TB and consists of a variety of files of different sizes. I understand that an initial commit will most likely take a significant amount of time, but I'd rather avoid re-syncing between clusters if possible.
One thing I did look at briefly is git-annex, but my first concern is that it may not like dealing with thousands of files. Also, one thing I didn't see is what would happen if the file already exists on both systems. If I create a repo using git-annex on the central system and then set up repos on the remote clusters, will pushing from central to a remote repo cause it to sync all of the files?
If anyone has alternative solutions/ideas, I'd love to see them.
Thanks.
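To illustrate the "only send what changed" idea mentioned above, here is a minimal sketch that keeps a manifest of (size, mtime) per file from the previous run and writes out the list of paths that differ, which could then be fed to a transfer tool (for example rsync's --files-from). The repository root and manifest locations are hypothetical.

    # Build a cheap change list by comparing the current tree against the
    # manifest saved by the previous run. ROOT and MANIFEST are hypothetical.
    import os
    import json

    ROOT = "/data/central"                       # hypothetical repository root
    MANIFEST = "/var/tmp/central.manifest.json"  # hypothetical manifest location

    def scan(root):
        """Map relative path -> (size, mtime) for every file under root."""
        snapshot = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                st = os.stat(full)
                snapshot[os.path.relpath(full, root)] = (st.st_size, int(st.st_mtime))
        return snapshot

    try:
        with open(MANIFEST) as f:
            previous = {k: tuple(v) for k, v in json.load(f).items()}
    except OSError:
        previous = {}                            # first run: everything is "changed"

    current = scan(ROOT)
    changed = [path for path, sig in current.items() if previous.get(path) != sig]

    with open("/var/tmp/changed-files.txt", "w") as f:
        f.write("\n".join(changed))              # candidate input for rsync --files-from

    with open(MANIFEST, "w") as f:
        json.dump(current, f)

The initial scan still walks the whole tree, but subsequent runs avoid hashing or transferring files whose size and mtime are unchanged.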

QFileSystemWatcher - does it need to run in another thread?

I have a class that does some parsing of two large CSV files (~90K rows and 11 columns in the first, ~20K rows and 5 columns in the second). According to the specification I'm working with, the CSV files can be changed externally (rows removed/added; the columns remain constant, as do the paths). Such updates can happen at any time (though it's highly unlikely that updates will arrive less than a couple of minutes apart), and an update to either of the two files has to terminate the current processing of all that data (CSV, XML from an HTTP GET request, UDP telegrams), followed by re-parsing the content of both files (or just the one that has changed).
I keep the CSV data (quite reduced, since I apply multiple filters to remove unwanted entries) in memory to speed up working with it and to avoid unnecessary I/O operations (opening, reading, and closing files).
Right now I'm looking into QFileSystemWatcher, which seems to be exactly what I need. However, I'm unable to find any information on how it actually works internally.
Since all I need is to monitor 2 files for changes, the number of files shouldn't be an issue. Do I need to run it in a separate thread (since the watcher is part of the same class where the CSV parsing happens), or is it safe to say that it can run without too much fuss (that is, it works asynchronously like QNetworkAccessManager)? My dev environment for now is a 64-bit Ubuntu VM (VirtualBox) on a relatively powerful host (an HP Z240 workstation); however, the target system is an embedded one. While the whole parsing of the CSV files takes just 2-3 seconds at most, I don't know how much performance impact there will be once the application gets deployed, so additional overhead is something of a concern of mine.
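For what it's worth, QFileSystemWatcher delivers its fileChanged/directoryChanged notifications as signals through the normal Qt event loop, so the watcher itself doesn't need its own thread; only the work done in the connected slot might. A minimal sketch, shown in Python/PySide6 for brevity (the C++ API has the same shape); the file paths are hypothetical:

    # QFileSystemWatcher runs off the event loop: no extra thread required
    # for the watcher itself. The two CSV paths below are hypothetical.
    import sys
    from PySide6.QtCore import QCoreApplication, QFileSystemWatcher

    app = QCoreApplication(sys.argv)

    watcher = QFileSystemWatcher(["/tmp/first.csv", "/tmp/second.csv"])

    def on_file_changed(path):
        # This slot runs in the thread that owns the watcher (here, the main
        # event-loop thread); kick off re-parsing here, or hand the heavy work
        # to a worker thread if it must not block the loop.
        print(f"{path} changed, re-parsing...")

    watcher.fileChanged.connect(on_file_changed)
    sys.exit(app.exec())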

VBA: Coordinate batch jobs between several computers

I have a VBA script that extracts information from huge text files and does a lot of data manipulation and calculations on the extracted info. I have about 1000 files and each takes an hour to finish.
I would like to run the script on as many computers (among others, EC2 instances) as possible to reduce the time needed to finish the job. But how do I coordinate the work?
I have tried two ways. First, I set up Dropbox as a network drive with one txt file holding the number of the last job started; VBA reads it, starts the next job, and updates the number, but there is apparently too much lag before a file updated on one computer is updated on the rest for this to be practical. The second was to find a simple "private" counter service online that increments on each visit, so each machine would access the page, read the number, and the page would update the number for the next visit from another computer. But I have found no such service.
Any suggestions on how to coordinate such tasks between different computers in VBA?
First of all, if you can, use a proper programming language, for example C#, for easy parallel processing.
If you must use VBA, then first optimize your code. Can you show us the code?
Edit:
If you must, then you could do the following. First, you need some sort of file server to store all the text files in one folder.
Then, in the macro, for each .txt file in the folder:
try to open it in exclusive mode; if the file can be opened, run your code and, once your code is finished, move the file elsewhere; otherwise, move on to the next .txt file.
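A minimal sketch of that claim-a-file pattern, written in Python rather than VBA just to illustrate the idea: each worker atomically moves a job file into a "claimed" folder, and whoever loses the race simply skips to the next file. The share paths and process_file() are hypothetical.

    # Each worker claims a job by atomically renaming it; the loser of a race
    # gets an OSError (the source is already gone) and moves on.
    import os
    import glob

    SHARED_IN = r"\\fileserver\jobs\incoming"   # hypothetical shared job folder
    CLAIMED   = r"\\fileserver\jobs\claimed"    # hypothetical claim folder

    def process_file(path):
        """Placeholder for the hour-long extraction/calculation step."""
        print("processing", path)

    for src in glob.glob(os.path.join(SHARED_IN, "*.txt")):
        dst = os.path.join(CLAIMED, os.path.basename(src))
        try:
            os.rename(src, dst)    # atomic claim: only one worker succeeds
        except OSError:
            continue               # someone else claimed it first; skip
        process_file(dst)

The same exclusive-open/move idea works in VBA as the answer describes; the key point is that the claim step must be atomic on the shared file server, which is why a synced folder like Dropbox (with propagation lag) is unreliable for this.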

Daemon for file watching / reporting in the whole UNIX OS

I have to write a Unix/Linux daemon which should watch for a particular set of files (e.g. *.log) in any directory, across various locations, and report them to me. Then I have to read all the newly modified files, process them, and push the grepped data into Elasticsearch.
Any suggestion on how this can be achieved?
I tried various Perl modules (e.g. File::ChangeNotify, File::Monitor) but for these I need to specify the directories, which I don't want: I need the list of files to be dynamically generated and I also need the content.
Is there any way I can hook into the OS system calls for file creation and then read the newly generated/modified file?
Not as easy as it sounds, unfortunately. You have hooks into inotify (on some platforms) that let you trigger an event on a particular inode changing.
But for wider-scope change tracking, you're really talking about audit and accounting - this isn't a small topic though; not a lot of people do auditing, and there's a reason for that. It's complicated and very platform-specific (even different versions of Linux do it differently). Your favourite search engine should be able to help you find answers relevant to your platform.
It may be simpler to run a scheduled task in cron - but not too frequently, because sweeping the whole filesystem like that is expensive - along with File::Find or similar to just run a search occasionally (see the sketch below).
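A minimal sketch of that cron-driven sweep, in Python rather than Perl's File::Find; the directories to scan, the state file, and report() are hypothetical placeholders.

    # Periodic scan for *.log files modified since the previous run.
    # Meant to be invoked from cron; keeps a timestamp in STATE_FILE.
    import os
    import time
    import fnmatch

    STATE_FILE = "/var/tmp/logscan.last"       # hypothetical timestamp store
    ROOTS = ["/var/log", "/srv"]               # hypothetical directories to sweep

    def report(path):
        """Placeholder: read/grep the file and push the data to Elasticsearch."""
        print("changed:", path)

    try:
        with open(STATE_FILE) as f:
            last_run = float(f.read().strip())
    except (OSError, ValueError):
        last_run = 0.0                          # first run: report everything

    now = time.time()
    for root in ROOTS:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in fnmatch.filter(filenames, "*.log"):
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getmtime(path) > last_run:
                        report(path)
                except OSError:
                    pass                        # file vanished between walk and stat

    with open(STATE_FILE, "w") as f:
        f.write(str(now))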

Persistent storage values handling in Linux

I have a QSPI flash on my embedded board.
I have a driver plus a process "Q" to handle reading from and writing to it.
I want to store variables like SW revisions, IP, operation time, etc.
I would like to ask for suggestions on how to handle the different access rights for reading and writing values from user space and other processes.
I was thinking of having a file for each variable. Then I can assign access rights to those files, and process Q can change the value in a file whenever the value changes. So process Q would be the only writer, and other processes or users could only read.
But I'm not sure about the writing side. I was thinking about using a message queue or ZeroMQ and building the software around it, but I'm not sure whether that's overkill. And I'm not sure how to manage access rights with that approach anyway.
What would be the best approach? I would really appreciate if you could propose even totally different approach.
Thanks!
This question will probably be downvoted / flagged due to the "Please suggest an X" nature.
That said, if a file per variable is what you're after, you might want to look at implementing a FUSE file system that wraps your SPI driver/utility "Q" (or build it into "Q" if you get to compile/control the source of "Q"). I'm doing this to store settings in an EEPROM on a current work project and it's turned out nicely. So I have, for example, a file that, when read, retrieves 6 bytes from the EEPROM (or a cached copy) and provides a MAC address in standard hex/colon-separated notation.
The biggest advantage here is that it becomes trivial to access all your configuration/settings data from shell scripts (e.g. your init process) or other scripting languages.
Another neat feature of doing it this way is that you can use inotify (which comes "free", no extra code in the fusefs) to create applications that efficiently detect when settings are changed.
A disadvantage of this approach is that it's non-trivial to do atomic transactions on multiple settings and still maintain normal file semantics.
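To make the shape of that concrete, here is a minimal read-only sketch using the Python fusepy package (the answer above is language-agnostic; a C FUSE daemon works the same way). The settings dict is a hypothetical stand-in for whatever process "Q" would actually read from the QSPI flash, and the mount point comes from the command line.

    # File-per-setting FUSE filesystem: each setting appears as a read-only
    # file under the mount point. Values here are hypothetical placeholders.
    import errno
    import stat
    import sys
    import time
    from fuse import FUSE, FuseOSError, Operations

    class SettingsFS(Operations):
        def __init__(self, read_setting, names):
            self.read_setting = read_setting   # callable: name -> bytes
            self.names = list(names)

        def getattr(self, path, fh=None):
            now = time.time()
            if path == "/":
                return dict(st_mode=(stat.S_IFDIR | 0o555), st_nlink=2,
                            st_ctime=now, st_mtime=now, st_atime=now)
            name = path.lstrip("/")
            if name not in self.names:
                raise FuseOSError(errno.ENOENT)
            size = len(self.read_setting(name))
            return dict(st_mode=(stat.S_IFREG | 0o444), st_nlink=1, st_size=size,
                        st_ctime=now, st_mtime=now, st_atime=now)

        def readdir(self, path, fh):
            return [".", ".."] + self.names

        def read(self, path, size, offset, fh):
            data = self.read_setting(path.lstrip("/"))
            return data[offset:offset + size]

    if __name__ == "__main__":
        # Hypothetical stand-in for values process "Q" would read from QSPI flash.
        store = {"sw_revision": b"1.4.2\n", "ip_address": b"192.168.1.10\n"}
        FUSE(SettingsFS(store.get, store), sys.argv[1], foreground=True, ro=True)

Mounted with, say, "python settings_fs.py /mnt/settings", reading /mnt/settings/ip_address from a shell script just works; write access could be added the same way, with the FUSE layer (and normal file permissions) deciding who may open a given file for writing.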
