I am developing a distributed application in Python. The application has two major packages, Package A and Package B, that work separately but communicate with each other through a queue. In other words, Package A generates some files and enqueues (pushes) them to a queue, and Package B dequeues (pops) the files on a first-come-first-served basis and processes them. Both Package A and B are going to be deployed on Google Cloud as Docker containers.
I need to plan the best storage option to keep the files and the queue. The files and the queue only need to be stored and used temporarily.
I think my options are Cloud Storage buckets or Google Datastore, but I have no idea how to choose between them or what the best option would be. The best option would be a solution that is low-cost, reliable, and easy to use from a development perspective.
Any suggestion is welcome... Thanks!
Google Cloud Storage sounds like the right option for you because it supports large files; you have no need for the features provided by Datastore, such as querying by other fields.
If you only need to process a file once, when it is first uploaded, you could use GCS Pub/Sub notifications and trigger your processor from Pub/Sub.
If you need more complex tasks, e.g. one task that dispatches to multiple child tasks that all operate on the same file, then it's probably better to use a separate task system like Celery and pass the GCS URL in the task definition.
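For the simpler case, a minimal sketch of the GCS + Pub/Sub notification approach on the Package B side might look like the following; it assumes Pub/Sub notifications are already configured on the bucket, and the project, subscription, and path names are placeholders.

from google.cloud import pubsub_v1, storage

storage_client = storage.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "package-b-sub")

def handle_notification(message):
    # GCS notifications carry the bucket and object name as attributes.
    if message.attributes.get("eventType") == "OBJECT_FINALIZE":
        bucket_name = message.attributes["bucketId"]
        object_name = message.attributes["objectId"]
        blob = storage_client.bucket(bucket_name).blob(object_name)
        blob.download_to_filename("/tmp/" + object_name.replace("/", "_"))
        # ... process the downloaded file here ...
    message.ack()

# Blocks and handles notifications as they arrive (Pub/Sub does not
# strictly guarantee ordering).
future = subscriber.subscribe(subscription, callback=handle_notification)
future.result()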
I'm dealing with a legacy piece of software that is totally not cloud-friendly.
The local workflow is as follows:
Run Software1
Software1 creates some helper files to be used by Software2
Software2 runs and generates a result file
Software2 is a simulation model compiled as an executable.
I now need to run hundreds of simulations, and since this software doesn't even support multi-threading, I'm looking at running it in the cloud. I have little to no experience with cloud computing. Our company mainly works with Azure, but I don't have a problem using AWS or another cloud computing service.
What I'm thinking as possible solution is:
Run a virtual machine that runs Software1
Software1 creates several folders. Each folder contains all the necessary files to perform a single simulation.
Each folder is uploaded to a blob storage folder
A Function App is triggered by the creation of the blob storage folder, and a run is performed for each folder by running Software2
Once Software2 is done with the simulation, the Function App copies the result file back to blob storage, into the same folder as the corresponding run.
I tested the Function App and it does what I need, but I'm not quite sure how to run it several times in parallel. Do you have any suggestions on how to achieve this? Or maybe I should be using something other than Function Apps.
Thank you in advance for your help,
Guido
If I have understood this correctly, you want to run this Function App multiple times in parallel to "simulate" parallel execution. I think you need to look at Event Grid and re-think your architecture.
If you use a blob trigger, your function will be triggered each time an operation is performed in the blob container. If 1 file = 1 run for Software2, a blob trigger is OK and Azure will scale and run your function in parallel. The issue is that Software2 needs to write the results back to blob storage, which creates new triggers.
Another way would be to have Software1 send a message to a Storage Queue or Service Bus, or an event via Event Grid, and have your function be triggered by that. You would then write a Durable Function using the "fan out/fan in" pattern to run Software2 in parallel, as sketched below.
You can also look at creating parallel branches in Logic Apps.
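For illustration, a rough Python sketch of that fan out/fan in orchestration with Durable Functions follows; the activity names "ListSimulationFolders" and "RunSoftware2" are hypothetical and would wrap your own logic (listing the folders Software1 produced, running one simulation, and copying its result back to blob storage).

import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Ask an activity function for the folders produced by Software1.
    folders = yield context.call_activity("ListSimulationFolders", None)

    # Fan out: start one RunSoftware2 activity per simulation folder.
    tasks = [context.call_activity("RunSoftware2", folder) for folder in folders]

    # Fan in: wait for all simulations to finish before completing.
    results = yield context.task_all(tasks)
    return results

main = df.Orchestrator.create(orchestrator_function)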
I have a scenario that I think could be a fit for Service Fabric. I'm working on-premises.
I read messages from a queue.
Each message contains details of a file that needs to be downloaded.
Files are downloaded and added to a zip archive.
There are 40,000 messages so that's 40,000 downloads and files.
I will add these to 40 zip archives, so that's 1000 files per archive.
Would Service Fabric be a good fit for this workload?
I plan to create a service that takes a message off the queue, downloads the file and saves it somewhere.
I'd then scale that service to have 100 instances.
Once all the files have been downloaded, I'd kick off a different process to add the files to a zip archive.
Bonus points if you can tell me a way to incorporate adding the files to a zip archive as part of the service.
Would Service Fabric be a good fit for this workload?
If the usage is just for downloading and compressing the files, I think it would be overkill to set up a cluster and manage it to sustain an application that is very simple. I think you could find many alternatives where you don't have to set up an environment just to keep your application running and processing messages from the queue.
I'd then scale that service to have 100 instances.
The number of instances does not mean the download will be faster; you also have to consider the network limit, otherwise you will just end up with servers with idle CPU and memory where the network is the bottleneck.
I plan to create a service that takes a message off the queue, downloads the file and saves it somewhere.
If you want to stick with service fabric and the queue approach, I would suggest this answer I gave a while ago:
Simulate 10,000 Azure IoT Hub Device connections from Azure Service Fabric cluster
The information is not exactly what you plan to do, but it might give you direction based on the scale you have and how to process a large number of messages from a queue (IoT Hub messaging is very similar to Service Bus).
For the other questions, I would suggest creating them as separate topics.
I agree with Diego: using Service Fabric for this would be overkill, and it won't be the best utilization of resources. Moreover, this seems to be more of a disk-intensive problem, where you need a lot of storage to download those files and then compress them into a zip. One idea is to use Azure Functions, as the computation seems minimal in your case. Download the files to an Azure file share and then upload them to whatever storage you want (blob storage, for example). This way you won't be using many resources, and you can scale the function and the Azure file share according to your needs.
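As a rough illustration of that kind of lightweight worker (independent of Service Fabric), here is a Python sketch that pulls messages from an Azure Storage Queue, downloads each file, and appends it straight into a zip archive, which also covers the bonus question. It assumes each message body is simply the file's URL; the queue and archive names are placeholders, and splitting across the 40 archives and retry handling are left out for brevity.

import os
import zipfile

import requests
from azure.storage.queue import QueueClient

conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
queue = QueueClient.from_connection_string(conn_str, "download-jobs")

# Append mode creates the archive if it does not exist yet.
archive = zipfile.ZipFile("batch-001.zip", mode="a", compression=zipfile.ZIP_DEFLATED)

for msg in queue.receive_messages(visibility_timeout=300):
    url = msg.content                    # assumed: the message body is the file URL
    filename = url.rsplit("/", 1)[-1]
    data = requests.get(url, timeout=60).content
    archive.writestr(filename, data)     # add the download straight into the zip
    queue.delete_message(msg)            # delete only once the file is archived

archive.close()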
We're using Google App Engine Standard Environment for our application. The runtime we are using is Python 2.7. We have a single service which uses multiple versions to deploy the app.
Most of our long-running tasks are done via Task Queues. Most of those tasks do a lot of Cloud Datastore CRUD operations. Whenever we have to send the results back to the front end, we use Firebase Cloud Messaging for that.
I wanted to try out Cloud Functions for those tasks, mostly to take advantage of the serverless architecture.
So my question is: what sort of benefits can I expect if I migrate the tasks from Task Queues to Cloud Functions? Is there any guideline that tells when to use which option? Or should we stay with Task Queues?
PS: I know that migrating code written in Python to Node.js will be troublesome, but I am ignoring that for the time being.
Apart from the advantage of being serverless, Cloud Functions respond to specific events, "gluing" elements of your architecture in a logical way. They are elastic and scale automatically, spinning up and down depending on the current demand (therefore they incur costs only when they are actually used). On the other hand, Task Queues are a better choice if managing execution concurrency is important for you:
Push queues dispatch requests at a reliable, steady rate. They guarantee reliable task execution. Because you can control the rate at which tasks are sent from the queue, you can control the workers' scaling behavior and hence your costs.
This level of control is not possible with Cloud Functions, which handle one request at a time and simply scale out in parallel. Another thing for which Task Queues would be a better choice is handling retry logic for operations that didn't succeed.
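To make that concrete, here is a small sketch of what staying with Task Queues looks like from Python 2.7 App Engine code; the queue name and worker URL are placeholders, and the dispatch rate and retry policy come from that queue's configuration in queue.yaml.

from google.appengine.api import taskqueue

def enqueue_datastore_task(entity_key):
    # Enqueue work on a named push queue so the dispatch rate (and retry
    # behaviour) is governed by the queue configuration, not the caller.
    taskqueue.add(
        queue_name='datastore-crud',    # hypothetical queue defined in queue.yaml
        url='/tasks/process-entity',    # hypothetical worker handler
        params={'key': entity_key},
    )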
Something you can also do with Cloud Functions together with App Engine Cron jobs is to run the function based on a time interval, not an event trigger.
Just as a side note, Google is also working on Python support for Cloud Functions. It is not known when that will be ready, but it will surely be announced on the Google Cloud Platform Blog.
Situation:
A user with a terabyte's worth of files in our Azure blob storage and gigabytes of data in our Azure databases decides to leave our service. At this point, we need to export all of his data into 2 GB packages and deposit them in blob storage for a short period (two weeks or so).
This should happen very rarely, and we're trying to cut costs. Where would it be optimal to implement a task that over the course of a day or two downloads the corresponding user's blobs (240 KB files) and zips them into the packages?
I've looked at a separate web app running a dedicated continuous WebJob, but WebJobs seem to shut down when the app unloads, and I need this to hibernate and not use resources when it isn't running, so "Always on" is out. Plus, I can't seem to find a complete tutorial on how to implement the interface so that I can cancel the running task and such.
Our last resort is abandoning web apps (three of them) and running it all on a virtual machine, but that comes at a greater cost. Is there a method I've missed that could get the job done?
This sounds like a job for a serverless model on Azure Functions to me. You get the compute scale you need without paying for idle resources.
I don't believe that there are any time limits on running the function (unlike AWS Lambda), but even so you'll probably want to implement something to split the job up first so it can be processed in parallel (and to provide some resilience to failures). Queue these tasks up and trigger the function off the queue.
It's worth noting that they're still in 'preview' at the moment though.
Edit: I have just noticed your comment on file size... that might be a problem, but in theory you should be able to use local storage rather than doing it all in memory.
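If it helps, here is a rough Python sketch of that splitting step: it walks the user's blobs, groups them into roughly 2 GB batches, writes each batch list to a manifest blob (a queue message is too small to hold thousands of blob names), and enqueues the manifest name for a queue-triggered function to zip. The container, queue, and prefix names are placeholders.

import itertools
import json
import os

from azure.storage.blob import ContainerClient
from azure.storage.queue import QueueClient

conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container = ContainerClient.from_connection_string(conn_str, "user-data")
queue = QueueClient.from_connection_string(conn_str, "zip-jobs")

BATCH_LIMIT = 2 * 1024 ** 3          # roughly 2 GB per package
counter = itertools.count(1)

def flush(names):
    # Store the file list as a manifest blob and enqueue only its name.
    manifest_name = "manifests/batch-%04d.json" % next(counter)
    container.upload_blob(manifest_name, json.dumps(names))
    queue.send_message(manifest_name)

batch, batch_size = [], 0
for blob in container.list_blobs(name_starts_with="users/12345/"):
    batch.append(blob.name)
    batch_size += blob.size
    if batch_size >= BATCH_LIMIT:
        flush(batch)
        batch, batch_size = [], 0

if batch:                            # flush the final partial batch
    flush(batch)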
I have many images that I need to run through a Java program to create more image files, an embarrassingly parallel case. Each input file is about 500 MB, needs about 4 GB of memory during processing, and takes 30 seconds to 2 minutes to run. The Java program is multithreaded, but more gain comes from parallelizing over the input files than from using more threads. I need to kick off processes several times a day (I do not want to turn the cluster on and off manually, nor pay for it 24/7).
I'm a bit lost in the variety of cloud options out there:
AWS Lambda has insufficient system resources (not enough memory).
Google Cloud Dataflow: it appears that I would have to write my own pipeline source to use their Cloud Storage buckets. Fine, but I don't want to waste time doing that if it's not an appropriate solution (which it might be, I can't tell yet).
Amazon Data Pipeline looks to be the equivalent of Google Cloud Dataflow. (Added in edit for completeness.)
Google Cloud Dataproc: this is not a map/reduce Hadoop-y situation, but it might work nonetheless. I'd rather not manage my own cluster, though.
Google Compute Engine or AWS with autoscaling, where I just kick off processes for each core on the machine. More management from me, but no APIs to learn.
Microsoft Data Lake is not released yet and looks Hadoop-y.
Microsoft Batch seems quite appropriate (but I'm asking because I remain curious about other options).
Can anyone advise what appropriate solution(s) would be for this?
You should be able to do this with Dataflow quite easily. The pipeline could look something like this (assuming your files are located on Google Cloud Storage, GCS):
class ImageProcessor {
    // Return Void (and null) rather than void so that the method
    // reference below is compatible with MapElements' output type.
    public static Void process(GcsPath path) {
        // Open the image, do the processing you want, write
        // the output to where you want.
        // You can use GcsUtil.open() and GcsUtil.create() for
        // reading and writing paths on GCS.
        return null;
    }
}

// p is your Pipeline, created earlier from your PipelineOptions.
// This will work fine until a few tens of thousands of files.
// If you have more, let me know.
List<GcsPath> filesToProcess = GcsUtil.expand(GcsPath.fromUri("..."));
p.apply(Create.of(filesToProcess))
 .apply(MapElements.via(ImageProcessor::process)
     .withOutputType(new TypeDescriptor<Void>() {}));
p.run();
This is one of a common family of cases where Dataflow is used as an embarrassingly parallel orchestration framework rather than a data processing framework, but it should work.
You will need Dataflow SDK 1.2.0 to use the MapElements transform (support for Java 8 lambdas is new in 1.2.0).