Automated processing of large text file(s) - Azure

The scenario is as follows: a large text file is put somewhere. At a certain time of day (or manually, or after x number of files), a Virtual Machine with BizTalk installed should start automatically to process these files. Afterwards, the files should be put in some output location and the VM should be shut down. I don't know how long processing these files will take.
What is the best way to build such a solution? The solution is preferably to be used for similar scenarios in the future.
I was thinking of Logic Apps for the workflow, blob storage or FTP for input/output of the files, and an API App for starting/shutting down the VM. Can Azure Functions be used in some way?
EDIT:
I also asked the question elsewhere, see link.
https://social.msdn.microsoft.com/Forums/en-US/19a69fe7-8e61-4b94-a3e7-b21c4c925195/automated-processing-of-large-text-files?forum=azurelogicapps

Just create an Azure Automation Runbook with a Schedule. Make that Runbook check for specific files in a Storage Account; if they exist, start the VM and wait until the files are gone (i.e., BizTalk has processed them, deleted them, and put them where they belong). The Runbook then stops the VM.
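For illustration, here is a minimal sketch of such a Runbook written as a Python runbook with the Azure SDK. The subscription ID, resource group, VM name, container name, and polling interval are all assumptions to replace with your own:

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.storage.blob import BlobServiceClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, SUBSCRIPTION_ID)
inbox = BlobServiceClient(
    "https://mystorage.blob.core.windows.net", credential=credential
).get_container_client("inbox")

if any(inbox.list_blobs()):          # files waiting to be processed?
    compute.virtual_machines.begin_start("my-rg", "biztalk-vm").result()
    while any(inbox.list_blobs()):   # BizTalk removes files as it finishes them
        time.sleep(60)
    # Deallocate (not just power off) so you stop paying for compute.
    compute.virtual_machines.begin_deallocate("my-rg", "biztalk-vm").result()
```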

Related

Azure Storage - File Share - Move 16m files in nested folders

Posting here as Server Fault doesn't seem to have the detailed Azure knowledge.
I have an Azure storage account with a file share. The file share is mounted on an Azure VM as a mapped drive, and an FTP server on the VM accepts a stream of files and stores them directly in the file share.
There are no other connections. Only I have Azure admin access; a limited set of support people have access to the VM.
Last week, for unknown reasons, 16 million files, nested in many sub-folders (by origin and date), moved instantly into an unrelated subfolder three levels deep.
I'm baffled as to how this can happen. There is a clear, instant cut-off when the files moved.
As a result, I'm seeing increased costs on LRS, presumably because Azure Storage is internally replicating the change at my expense.
I have attempted to copy the files back using a VM and AzCopy. This process crashed midway through, leaving me with a half-completed copy operation. This failed attempt took days, which makes me confident it wasn't the support guys accidentally dragging a folder.
Questions:
Is it possible to instantly move so many files (and how)?
Is there a solid way to move the files back, taking into account the half-copied files? I mean an Azure back-end operation rather than writing an app / PowerShell / AzCopy.
Is there a cost-efficient way of doing this (I'm on the Transaction Optimized tier)?
Do I have a case to get Microsoft to do something? We didn't move the files; I assume something messed up internally.
Thanks
A tool that supports server-side copy (like AzCopy) can move the files quickly because only the metadata is updated. If you want to investigate the root cause, I recommend opening a support case; the Azure support team can help with this on a best-effort basis.
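As a rough illustration of server-side copy with the azure-storage-file-share Python package (share and path names are made up, and depending on your auth setup the source URL may need a SAS token appended):

```python
from azure.storage.fileshare import ShareFileClient

CONN_STR = "<storage-connection-string>"   # placeholder

# Source (where a file ended up) and destination (where it belongs).
src = ShareFileClient.from_connection_string(
    CONN_STR, share_name="myshare", file_path="wrong/nested/folder/data.csv")
dst = ShareFileClient.from_connection_string(
    CONN_STR, share_name="myshare", file_path="origin/2023-01-01/data.csv")

# Server-side copy: the bytes never leave the storage account, so this is
# fast and cheap per file. The source URL may need a SAS token appended.
dst.start_copy_from_url(src.url)
```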

Azure Automation Use Case

I have a Python script, which needs to be automated, that is relatively memory- and CPU-intensive. For a monthly process, it runs ~300 times, and each run takes somewhere from 10 to 24 hours to complete, depending on input. It takes certain CSV file(s) as input and produces certain file(s) as output after processing. Each run is independent.
We need to use configs and be able to pass command-line arguments to the script. Certain packages, which are not in the Python standard library, need to be installed as well (requirements.txt). We also need to take care of the logging pipeline (EFK) setup: Elasticsearch and Kibana can be centralized, but where do we keep the log files and the Fluentd config?
The last bit is monitoring: will we be able to restart in case of an unexpected shutdown?
What is the best way to automate this, and with which tools and technologies?
My thoughts
Create a Docker image of the whole setup (Python script, Fluentd config, Python packages, etc.). Then somehow auto-deploy this image (to a VM, or something else?), execute the Python process, save the output files to some central location (a data lake, for example), and destroy the instance upon successful completion of the process.
So, is what I'm thinking possible in Azure? If it is, which cloud components do I need to explore to answer my somehows and somethings? If not, what is probably the best solution for my use case?
Any lead would be much appreciated. Thanks.
Normally, for short-lived jobs, I'd say use an Azure Function. The thing is, they have a maximum runtime of 10 minutes unless you put them on an App Service Plan, but that will cost more unless you manually stop/start the plan.
If you can containerize the whole thing, I recommend using Azure Container Instances, because you then only pay for what you actually use. You can use an Azure Function to start the container, based on an HTTP request, a timer, or something like that.
You can set a restart policy to indicate what should happen in case of unexpected failures; see the docs.
Configuration can be passed from the Azure Function to the container instance, or you could leverage the Azure App Configuration service.
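As a rough sketch of starting such a container from Python with azure-mgmt-containerinstance (the image name, resource group, location, and sizes are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    Container, ContainerGroup, EnvironmentVariable,
    ResourceRequests, ResourceRequirements)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
client = ContainerInstanceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

container = Container(
    name="processor",
    image="myregistry.azurecr.io/processor:latest",   # hypothetical image
    resources=ResourceRequirements(
        requests=ResourceRequests(cpu=4.0, memory_in_gb=16.0)),
    # Pass run-specific configuration as environment variables.
    environment_variables=[
        EnvironmentVariable(name="INPUT_FILE", value="input_001.csv")])

group = ContainerGroup(
    location="westeurope",
    os_type="Linux",
    restart_policy="OnFailure",   # rerun the container on unexpected failures
    containers=[container])

client.container_groups.begin_create_or_update("my-rg", "processor-run-001", group)
```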
Though I don't know all the details, this sounds like a good candidate for Azure Batch. There is no additional charge for using Batch. You only pay for the underlying resources consumed, such as the virtual machines, storage, and networking. Batch works well with intrinsically parallel (also known as "embarrassingly parallel") workloads.
The following high-level workflow is typical of nearly all applications and services that use the Batch service for processing parallel workloads:
Basic Workflow
Upload the data files that you want to process to an Azure Storage account. Batch includes built-in support for accessing Azure Blob storage, and your tasks can download these files to compute nodes when the tasks are run.
Upload the application files that your tasks will run. These files can be binaries or scripts and their dependencies, and are executed by the tasks in your jobs. Your tasks can download these files from your Storage account, or you can use the application packages feature of Batch for application management and deployment.
Create a pool of compute nodes. When you create a pool, you specify the number of compute nodes for the pool, their size, and the operating system. When each task in your job runs, it's assigned to execute on one of the nodes in your pool.
Create a job. A job manages a collection of tasks. You associate each job to a specific pool where that job's tasks will run.
Add tasks to the job. Each task runs the application or script that you uploaded to process the data files it downloads from your Storage account. As each task completes, it can upload its output to Azure Storage.
Monitor job progress and retrieve the task output from Azure Storage.
(source)
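To make the pool/job/task steps of that workflow concrete, here is a hedged sketch using the azure-batch Python SDK; the account details, VM image, pool size, and file names are all assumptions:

```python
import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

creds = SharedKeyCredentials("<batch-account>", "<batch-key>")
client = BatchServiceClient(
    creds, batch_url="https://<batch-account>.<region>.batch.azure.com")

# Create a pool of compute nodes, sized to how many runs you want in parallel.
client.pool.add(batchmodels.PoolAddParameter(
    id="monthly-pool",
    vm_size="standard_d2s_v3",
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical",
            offer="0001-com-ubuntu-server-focal",
            sku="20_04-lts"),
        node_agent_sku_id="batch.node.ubuntu 20.04"),
    target_dedicated_nodes=10))

# Create a job bound to the pool.
client.job.add(batchmodels.JobAddParameter(
    id="monthly-job",
    pool_info=batchmodels.PoolInformation(pool_id="monthly-pool")))

# Add one task per independent run (~300 per month in this scenario).
tasks = [
    batchmodels.TaskAddParameter(
        id=f"run-{i:03d}",
        command_line=f"python3 process.py --input input_{i:03d}.csv")
    for i in range(300)]
for i in range(0, len(tasks), 100):   # add_collection caps at 100 tasks per call
    client.task.add_collection("monthly-job", tasks[i:i + 100])
```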
I would go with Azure DevOps and a custom agent pool. This agent pool could include some virtual machines (maybe only one) with Docker installed. I would then install all the necessary packages that you mentioned in a Docker container, along with the DevOps agent (it is needed to communicate with the agent pool).
You could pass every parameter needed to the containerized build agents through Azure DevOps tasks and also have a common storage layer for the build and release pipelines. This way you could manipulate/process your files in the build pipeline and then, using the same folder, create a task in the release pipeline to export/upload those files somewhere.
As this script should run many times throughout the month, you could have many containers so that more than one job can run at a given time.
I follow the same procedure in a corporate environment. I keep a VM running Windows with multiple Docker machines to compile different code frameworks. Each container includes different tools and is registered to a custom agent pool. Jobs are distributed across those containers, and the build and release pipelines integrate the various processing steps.
You are probably supposed to use Azure Data Factory for moving and transforming data.
You can then also use ADF to call Azure Batch, which will run your Python script.
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Adding more info to the question would probably yield other, better suggestions.

The best way to read and process email attachments in Azure?

We have a few third-party companies sending us emails with CSV/Excel data files attached. I want to build a pipeline (preferably in ADF) to get the attachments, load the raw files (attachments) to blob storage, process/transform them, and finally load the processed files to another directory in blob storage.
To get the attachments, I think I can use the instructions (using a Logic App) in this link, then trigger an ADF pipeline using a storage trigger, get the file, process it, and do the rest of the work.
However, first, I'm not sure how reliable storage triggers are.
Second, although it seems OK, this approach makes it difficult to monitor the runs and make sure things are working properly. For example, if the Logic App doesn't read/load the attachments for any reason and fails, you can't pick that up in ADF, as nothing has been written to the blob to trigger the pipeline.
Anyway, is this approach good, or there are better ways to do this?
Thanks
If you are able to save the attachments into a blob or similar, you can schedule an ADF pipeline that imports every file in the blob every minute, or every 5 minutes or so.
Do the files have the same data structure every time? (That makes things much easier.)
It is most common to schedule imports in ADF rather than trigger them based on external events.
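As a small illustration of the programmatic side, kicking off one run of such a pipeline with azure-mgmt-datafactory could look like the sketch below. The resource group, factory, pipeline, and parameter names are invented; a schedule trigger defined on the factory would achieve the same thing every N minutes without any code:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Start one run of the import pipeline; an ADF schedule trigger would
# normally do this on a fixed cadence instead.
run = adf.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-data-factory",
    pipeline_name="ImportAttachments",
    parameters={"sourceContainer": "attachments"})
print(run.run_id)   # use this to monitor the run
```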

Copy files from one Azure VM to another with a file watch

I'm trying to set up a situation where I drop files into a folder on one Azure VM, and they're automatically copied to another Azure VM. I was thinking about mapping a drive from the receiver to the sender and using a file watch/copy program to send the files over the mapped drive.
What's a good recommendation for a file watch/copy program that's simple and efficient, and what security setups do I need to get the two Azure boxes to "talk" to each other? They're in the same account/resource group/etc, so I'm not going outside of a virtual network or anything like that.
By default, VMs in the same virtual network can talk to each other (this is true even if default NSGs are applied). So you wouldn't have to do anything special to get that type of communication working.
To answer the second part, you might want to consider just using built-in FCI (File Classification Infrastructure) rules to execute a short script that does the copy. See this link for a short intro to FCI rules.
Alternatively, you could use a service such as Azure Files to share files between those servers over SMB/CIFS. It really depends on why you are trying to have a copy of the file on two servers.
Hope that helps!
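If you'd rather use a small script than a dedicated tool, a minimal file-watch/copy sketch with the Python watchdog package could look like this (the two paths are placeholders for your drop folder and the mapped drive):

```python
import shutil

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = r"D:\outbox"   # drop folder on the sending VM
DEST_DIR = r"Z:\inbox"     # mapped drive pointing at the receiving VM

class CopyNewFiles(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            # copy2 preserves timestamps; add retry logic for files that
            # are still being written when the event fires.
            shutil.copy2(event.src_path, DEST_DIR)

observer = Observer()
observer.schedule(CopyNewFiles(), WATCH_DIR)
observer.start()
observer.join()   # blocks forever; run as a service or scheduled task
```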

Running an existing program on the Windows Azure cloud platform

I have an existing program that I would like to upload to the cloud without rewriting it, and I'm wondering if that is possible.
For example, can I upload and run a Photoshop instance in the cloud and use it?
Of course not the GUI, but Photoshop has a communication SDK, so a web program should be able to control it!
As far as I can see, Worker Roles look good, but they have to be written in a specific way, and I can't rewrite Photoshop!
Thanks for your attention!
As long as your existing program is 64-bit compatible and either has an installer that supports unattended/silent install or is xcopy-deployable, you can use it in Azure.
For a program that requires installation and supports unattended/silent install, you can use a Startup Task.
For a program that is just xcopy-deployable, put it in a folder of your worker role and make sure the "Copy to Output" attribute of all required files is set to "Copy always". Then you can use it.
However, the bigger question is: what are you going to do with that "existing program" in Azure if you do not have APIs to work with?
Here's the thing: the Worker Role should be what you need. It's essentially a virtual machine running a slightly different version of Windows that you can RDP to and use normally. You can safely run more or less anything up there, but you need to automate the deployment (e.g., using startup tasks). As this can prove a bit problematic, Microsoft has created the VM Role: you create your own deployment image, and that's what gets booted when you instantiate the machine.
However! This machine is stateless, meaning that files it creates aren't saved if it gets restarted. So you need to ensure the files are saved somewhere else, e.g., in blob storage (intended for just such a purpose).
What I would do in your case is create a VM Role with Photoshop installed and, next to it, a custom piece of software that accepts requests via Azure Queues, does the processing, saves the file to blob storage, and then sends the file onward to whoever requested it.
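As a rough sketch of that queue-driven worker loop with the Azure Storage Python SDK — the queue and container names are invented, and process_with_photoshop is a hypothetical stand-in for driving the application through its automation SDK:

```python
import json
import time

from azure.storage.blob import BlobClient
from azure.storage.queue import QueueClient

CONN_STR = "<storage-connection-string>"   # placeholder
queue = QueueClient.from_connection_string(CONN_STR, "render-jobs")

while True:
    for msg in queue.receive_messages():
        job = json.loads(msg.content)
        # Hypothetical: invoke the installed application via its SDK.
        result_path = process_with_photoshop(job["input"])
        with open(result_path, "rb") as data:
            BlobClient.from_connection_string(
                CONN_STR, container_name="results", blob_name=job["output"]
            ).upload_blob(data)
        queue.delete_message(msg)   # done; remove the request from the queue
    time.sleep(5)                   # idle back-off between polls
```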
