I have a scenario I think could be a fit for Service Fabric. I'm working on premises.
I read messages from a queue.
Each message contains details of a file that needs to be downloaded.
Files are downloaded and added to a zip archive.
There are 40,000 messages so that's 40,000 downloads and files.
I will add these to 40 zip archives, so that's 1000 files per archive.
Would Service Fabric be a good fit for this workload?
I plan to create a service that takes a message off the queue, downloads the file and saves it somewhere.
I'd then scale that service to have 100 instances.
Once all the files have been downloaded I'd kick off a different process to add the files to a zip archive.
Bonus if you can tell me a way to incorporate adding the files to a zip archive as part of the service
Would Service Fabric be a good fit for this workload?
If the usage is just downloading and compressing files, I think it would be overkill to set up and manage a cluster to sustain such a simple application. There are many alternatives where you don't have to set up an environment just to keep your application running and processing messages from the queue.
I'd then scale that service to have 100 instances.
A higher number of instances does not mean the downloads will be faster; you also have to consider the network limit. Otherwise you will just end up with servers sitting on idle CPU and memory while the network is the bottleneck.
I plan to create a service that takes a message off the queue, downloads the file and saves it somewhere.
If you want to stick with Service Fabric and the queue approach, I would suggest this answer I gave a while ago:
Simulate 10,000 Azure IoT Hub Device connections from Azure Service Fabric cluster
The information is not exactly what you plan to do, but it might give you direction given the scale you have and how to process a large number of messages from a queue (IoT Hub messaging is very similar to Service Bus).
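To make the consumer side concrete, here is a minimal sketch of one such worker in Python; the queue name, container name, connection strings, and message shape are all placeholder assumptions, and the same loop translates directly to the .NET Service Bus SDK if that's what your service uses:

```python
# Minimal sketch: one worker instance pulling download jobs from a Service Bus queue
# and saving the results to Blob storage. Connection strings, queue/container names,
# and the message format ({"url": ..., "blob_name": ...}) are placeholder assumptions.
import json
import os

import requests
from azure.servicebus import ServiceBusClient
from azure.storage.blob import BlobServiceClient

SERVICEBUS_CONN = os.environ["SERVICEBUS_CONNECTION_STRING"]
STORAGE_CONN = os.environ["STORAGE_CONNECTION_STRING"]

def run_worker() -> None:
    sb = ServiceBusClient.from_connection_string(SERVICEBUS_CONN)
    blobs = BlobServiceClient.from_connection_string(STORAGE_CONN)
    container = blobs.get_container_client("downloads")

    with sb.get_queue_receiver(queue_name="file-downloads") as receiver:
        while True:
            batch = receiver.receive_messages(max_message_count=10, max_wait_time=30)
            if not batch:
                break                                   # queue drained, stop this worker
            for msg in batch:
                job = json.loads(str(msg))              # message body as JSON
                resp = requests.get(job["url"], timeout=60)
                resp.raise_for_status()
                # Save the file somewhere durable so a later step can zip it.
                container.upload_blob(name=job["blob_name"], data=resp.content, overwrite=True)
                receiver.complete_message(msg)          # remove the message only after a successful save

if __name__ == "__main__":
    run_worker()
```

Scaling out is then just a matter of running more instances of this worker, with the caveat about network bandwidth above.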
For the other questions, I would suggest posting them as separate questions.
I agree with Diego: using Service Fabric for this would be overkill, and it won't be the best utilization of resources. Moreover, this seems to be more of a disk-intensive problem where you need a lot of storage to download those files and then compress them into a zip. One idea is to use Azure Functions, as the computation seems minimal in your case. Download the file to an Azure file share and then upload it to whatever storage you want (Blob storage, for example). This way you won't be using many resources, and you can scale the function and the Azure file share according to your needs.
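As a rough illustration of that suggestion, the body of such a queue-triggered function could look like this in Python (the trigger binding lives in function.json; the connection setting, queue/container names, and message shape are assumptions, and a C# function would follow the same shape):

```python
# __init__.py of a Service Bus queue-triggered Azure Function (Python v1 programming
# model; the trigger binding is declared in function.json). The connection settings,
# queue/container names, and message shape are placeholder assumptions.
import json
import logging
import os

import azure.functions as func
import requests
from azure.storage.blob import BlobServiceClient

def main(msg: func.ServiceBusMessage) -> None:
    job = json.loads(msg.get_body().decode("utf-8"))    # e.g. {"url": ..., "blob_name": ...}

    resp = requests.get(job["url"], timeout=60)
    resp.raise_for_status()

    blobs = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
    blobs.get_container_client("downloads").upload_blob(
        name=job["blob_name"], data=resp.content, overwrite=True
    )
    logging.info("Stored %s (%d bytes)", job["blob_name"], len(resp.content))
```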
Related
I'm looking for some best practices on how to implement the following pattern using (micro)services and a service bus. In this particular case that means Service Fabric services and an Azure Service Bus instance, but from a pattern point of view that might not even be that important.
Suppose a scenario in which we get a work package with a number of files in it. For processing, we want each individual file to be processed in parallel so that we can easily scale this operation onto a number of services. Once all processing completes, we can continue our business process.
So it's a fan-out, followed by a fan-in to gather all results and continue. For example's sake, let's say we have a ZIP file: we unzip it, have each file processed, and once all are done we can continue.
The fan-out bit is easy. Unzip the file, for n files post n messages onto a service bus queue and have a number of services handle those in parallel. But now how do we know that these services have all completed their work?
A number of options I'm considering:
1. Next to sending a service bus message for each file, we also store the file names in a table of sorts, along with the name of the originating ZIP file. Once a worker is done processing, it removes that file from the table again and checks whether it was the last one. When it was, we can post a message to indicate the entire ZIP has now been processed.
2. Similar to 1., but instead we have the worker reply that it's done, and the ZIP processing service then checks whether there is any work left (sketched below). A little cleaner, as the responsibility for that table now clearly lies with the ZIP processing service.
3. Have the ZIP processing service actively wait for all the reply messages in separate threads, but just typing this already makes my head hurt a bit.
4. Introduce a specific orchestrator service which takes the n messages and takes care of the fan-out / fan-in pattern. This would still require solution 2 as well, but it's now located in a separate service, so we don't have any of this logic (+ storage) in the ZIP processing service itself.
I looked into whether Service Bus already has a feature of some sort to support this pattern, but could not find anything suitable. Durable Functions seems to support a scenario like this, but we're not using Functions in this project and I'd rather not start doing so just to implement this one pattern.
We're surely not the first ones to implement such a thing, so I was really hoping to find some real-world advice as to what works and what should be avoided at all costs.
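For what it's worth, the bookkeeping behind option 2 boils down to a small "outstanding work" tracker. Below is a minimal in-process sketch of that logic; in a real deployment this state would live in a shared store (a storage table or database) with optimistic concurrency, since several service instances update it at once.

```python
# Sketch of the fan-in bookkeeping from option 2: the ZIP processing service records
# which files it fanned out, and each "done" reply removes one entry; when the set
# is empty, the whole ZIP is complete. In production this state belongs in a shared
# store (e.g. a storage table) with optimistic concurrency, not in process memory.
import threading

class ZipCompletionTracker:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: dict[str, set[str]] = {}   # zip name -> outstanding file names

    def register_fan_out(self, zip_name: str, file_names: list[str]) -> None:
        with self._lock:
            self._pending[zip_name] = set(file_names)

    def mark_done(self, zip_name: str, file_name: str) -> bool:
        """Record one completed file; return True if the whole ZIP is now finished."""
        with self._lock:
            outstanding = self._pending[zip_name]
            outstanding.discard(file_name)
            if not outstanding:
                del self._pending[zip_name]
                return True        # the caller posts the "ZIP processed" message here
            return False

# Usage: register_fan_out("batch-42.zip", names) before posting the n messages,
# then call mark_done(...) for every "file processed" reply that comes back.
```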
I am developing a distributed application in Python. The application has two major packages, Package A and Package B, that work separately but communicate with each other through a queue. In other words, Package A generates some files and enqueues (pushes) them to a queue, and Package B dequeues (pops) the files on a first-come-first-served basis and processes them. Both Package A and B are going to be deployed on Google Cloud as Docker containers.
I need to decide on the best storage option to keep the files and the queue. Both the files and the queue only need to be stored and used temporarily.
I think my options are Cloud Storage buckets or Google Datastore, but I have no idea how to choose between them or which would be the best option. Ideally the solution would be low cost, reliable, and easy to use from a development perspective.
Any suggestion is welcome... Thanks!
Google Cloud Storage sounds like the right option for you because it supports large files. You don't need the features provided by Datastore and the like, such as querying by other fields.
If you only need to process a file once, when it is first uploaded, you could use GCS Pub/Sub notifications and trigger your processor from Pub/Sub.
If you need more complex tasks, e.g. one task that dispatches to multiple child tasks which all operate on the same file, then it's probably better to use a separate task system like Celery and pass the GCS URL in the task definition.
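For illustration, the bucket side of that flow is only a few lines with the google-cloud-storage client; the bucket and object names below are placeholders.

```python
# Sketch: Package A drops a generated file into a GCS bucket; Package B picks it up
# later (triggered by a Pub/Sub notification on the bucket, or by a Celery task that
# carries the object name). Bucket and object names are placeholders.
from google.cloud import storage

def publish_file(local_path: str, object_name: str, bucket_name: str = "my-work-queue-files") -> str:
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{object_name}"      # pass this URL in the queue message / task

def consume_file(object_name: str, local_path: str, bucket_name: str = "my-work-queue-files") -> None:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.download_to_filename(local_path)
    blob.delete()                                   # the files are temporary, so clean up after processing
```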
We have an application which is quite scalable as it is. Basically, you have one or more stateless nodes that all do some independent work on files that are read from and written to a shared NFS share.
This NFS share can be a bottleneck, but with an on-premises deployment customers just buy a big enough box to get sufficient performance.
Now we are moving this to Azure, and I would like a better, more "cloudy" way of sharing data :) since running our own Linux NFS servers isn't an ideal scenario if we have to manage them.
Is the Azure Blob storage the right tool for this job (https://azure.microsoft.com/en-us/services/storage/blobs/)?
We need good scalability, e.g. up to 10k files written per minute.
Files are quite small, less than 50 KB per file on average.
Files are created and read, not changed.
Files are short-lived; we purge them every day.
I am looking for more practical experience with this kind of storage and how good it really is.
There are two possible solutions for your requirements: Azure Blob storage (recommended for your scenario) or Azure Files.
Azure Blob storage has published scaling targets; see the storage scalability and performance targets documentation for the current numbers. Two points to keep in mind:
It can't be attached to a server as a network share the way an SMB/NFS mount can.
Blobs don't have a real folder hierarchy beyond containers. Virtual folders work for addressing, but you can't delete a virtual folder in a single operation; you delete the blobs it contains (relevant to your daily purge, which is easy enough to do with your own code; see the sketch after the links below).
Azure Files: unlike blobs, a file share can be mounted over SMB like a traditional network share, which is the closest match to your current NFS setup.
Links recommended:
Comparison between Azure Files and Blobs: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
Informative SO post here
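On the purging point mentioned above, the "purge with your own code" approach is small; here is a sketch using the azure-storage-blob Python SDK (the container name, connection string, and retention window are placeholders).

```python
# Sketch: daily purge of short-lived blobs, deleting anything older than 24 hours.
# Container name and connection string are placeholders; run this from a scheduled job.
import os
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient

def purge_old_blobs(container_name: str = "work-files", max_age_hours: int = 24) -> int:
    client = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
    container = client.get_container_client(container_name)
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)

    deleted = 0
    for blob in container.list_blobs():            # optionally filter with name_starts_with=...
        if blob.last_modified < cutoff:
            container.delete_blob(blob.name)
            deleted += 1
    return deleted
```

Blob storage also offers lifecycle management rules that can delete blobs by age without any code, which may be even simpler for a daily purge.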
Situation:
A user with a TB worth of files in our Azure Blob storage and gigabytes of data in our Azure databases decides to leave our service. At this point, we need to export all their data into 2 GB packages and deposit them in Blob storage for a short period (two weeks or so).
This should happen very rarely, and we're trying to cut costs. Where would it be optimal to implement a task that over the course of a day or two downloads the corresponding user's blobs (240 KB files) and zips them into the packages?
I've looked at a separate web app running a dedicated continuous WebJob, but WebJobs seem to shut down when the app unloads, and I need this to hibernate and not use resources when it isn't running, so "Always On" is out. Plus, I can't seem to find a complete tutorial on how to implement the interface so that I can cancel the running task and so on.
Our last resort is abandoning web apps (three of them) and running it all on a virtual machine, but this adds up to greater cost. Is there a method I've missed that could get the job done?
This sounds like a job for a serverless model on Azure Functions to me. You get the compute scale you need without paying for idle resources.
I don't believe that there are any time limits on running the function (unlike AWS Lambda), but even so you'll probably want to implement something to split the job up first so it can be processed in parallel (and to provide some resilience to failures). Queue these tasks up and trigger the function off the queue.
It's worth noting that they're still in 'preview' at the moment though.
Edit: I just noticed your comment on file size... that might be a problem, but in theory you should be able to use local storage rather than doing it all in memory.
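To make the splitting idea concrete: each queued task could describe one package (a list of blob names adding up to roughly 2 GB), and the function builds the zip on local temp storage rather than in memory before uploading it. A rough Python sketch, where the container names and message shape are assumptions (a C# function with ZipArchive would follow the same pattern):

```python
# Sketch: build one export package from a list of small blobs, using a temp file on
# local disk instead of holding the whole archive in memory, then upload the result.
# Container names ("user-data", "exports") and the blob-name list are placeholders.
import os
import tempfile
import zipfile

from azure.storage.blob import BlobServiceClient

def build_package(blob_names: list[str], package_name: str) -> None:
    client = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
    source = client.get_container_client("user-data")
    exports = client.get_container_client("exports")

    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
        zip_path = tmp.name
    try:
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for name in blob_names:                      # ~240 KB each, so per-file readall is fine
                data = source.download_blob(name).readall()
                archive.writestr(name, data)
        with open(zip_path, "rb") as f:
            exports.upload_blob(name=f"{package_name}.zip", data=f, overwrite=True)
    finally:
        os.remove(zip_path)
```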
I have Azure log files, each more than 250 MB in size, in one container (6 files per hour). I have a C# program to access and process these log files, but at the moment I only take 100 lines from each log file (created in one hour). If I process the whole files, I have to access almost 1.5 GB of data. How can I handle this situation? My plan is to use a WebJob to automatically create smaller files from these log files, store them in a different container, and access those files from my C# program. Do you have any ideas?
To tell the truth, I don't understand the problem: are you worried about traffic or about processing time?
In any case, you can try to reduce the file size by removing some fields in the IIS log setup. Another option would be setting the log file rollover to a smaller size; this can minimize the download time for processing.
What you are suggesting is doable with a WebJob, but it sounds like the size of the files is the real issue. Would a different log type be better for your scenario, possibly "Failed Request Tracing"? You may also be able to change the verbosity level in the diagnostics config. For more info see the links below; a rough sketch of the splitting approach follows them.
Enable diagnostics logging for web apps in Azure App Service
Azure Web App (Website) Logging - Tips and Tools
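If you do go the WebJob route, the splitting itself is straightforward. A rough sketch of the idea, shown here in Python for brevity (a C# WebJob with the blob SDK follows the same shape); the container names, chunk size, and connection string are placeholders:

```python
# Sketch: split one large log blob into smaller blobs of N lines each and write them
# to a different container, so the downstream program only reads the pieces it needs.
# Container names, chunk size, and the connection string are placeholders.
import os

from azure.storage.blob import BlobServiceClient

def split_log_blob(blob_name: str, lines_per_chunk: int = 50_000) -> None:
    client = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
    source = client.get_container_client("raw-logs")
    target = client.get_container_client("split-logs")

    # ~250 MB fits in memory for a WebJob; for bigger blobs, stream with .chunks() instead.
    lines = source.download_blob(blob_name).readall().decode("utf-8").splitlines()

    for i in range(0, len(lines), lines_per_chunk):
        chunk = "\n".join(lines[i:i + lines_per_chunk])
        target.upload_blob(
            name=f"{blob_name}.part{i // lines_per_chunk:04d}.log",
            data=chunk,
            overwrite=True,
        )
```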