Parallel file processing using cloud services - Azure

I have many images that I need to run through a Java program to create more image files -- an embarrassingly parallel case. Each input file is about 500 MB, needs about 4 GB of memory during processing, and takes 30 seconds to 2 minutes to run. The Java program is multithreaded, but more gain comes from parallelizing over the input files than from using more threads. I need to kick off the processing several times a day (and I do not want to turn a cluster on and off manually, nor pay for it 24/7).
I'm a bit lost in the variety of cloud options out there:
AWS Lambda has insufficient system resources (not enough memory).
With Google Cloud Dataflow, it appears that I would have to write my own pipeline source to use their Cloud Storage buckets. Fine, but I don't want to waste time doing that if it's not an appropriate solution (which it might be, I can't tell yet).
AWS Data Pipeline looks to be the equivalent of Google Cloud Dataflow. (Added in an edit for completeness.)
Google Cloud Dataproc: this is not a map/reduce, Hadoop-y situation, but it might work nonetheless. I'd rather not manage my own cluster, though.
Google Compute Engine or AWS with autoscaling, where I just kick off a process for each core on the machine: more management from me, but no APIs to learn.
Microsoft's Data Lake is not released yet and looks Hadoop-y.
Microsoft Azure Batch seems quite appropriate (but I'm asking because I remain curious about the other options).
Can anyone advise what appropriate solution(s) would be for this?

You should be able to do this with Dataflow quite easily. The pipeline could look something like (assuming your files are located on Google Cloud Storage, GCS):
class ImageProcessor {
    // Returning Void (rather than void) lets the method reference below
    // satisfy MapElements' SerializableFunction<GcsPath, Void>.
    public static Void process(GcsPath path) {
        // Open the image, do the processing you want, and write the
        // output to where you want. You can use GcsUtil.open() and
        // GcsUtil.create() for reading and writing paths on GCS.
        return null;
    }
}

// Create the pipeline as usual (configure the Dataflow runner via options).
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// This will work fine up to a few tens of thousands of files.
// If you have more, let me know.
List<GcsPath> filesToProcess = GcsUtil.expand(GcsPath.fromUri("..."));

p.apply(Create.of(filesToProcess))
 .apply(MapElements.via(ImageProcessor::process)
                   .withOutputType(new TypeDescriptor<Void>() {}));
p.run();
This is one of the common family of cases where Dataflow is used as an embarrassingly-parallel orchestration framework rather than a data-processing framework, but it should work.
You will need Dataflow SDK 1.2.0 to use the MapElements transform (support for Java 8 lambdas is new in 1.2.0).
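For reference, the body of process() might look roughly like the following. This is only a minimal sketch: it assumes the SDK 1.x GcsUtil API (open() returning a readable channel, create(path, contentType) returning a writable one) obtained from GcsOptions, and the output naming and the image transformation itself are placeholders.
import java.io.IOException;
import java.nio.channels.SeekableByteChannel;
import java.nio.channels.WritableByteChannel;
import com.google.cloud.dataflow.sdk.options.GcsOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.util.GcsUtil;
import com.google.cloud.dataflow.sdk.util.gcsfs.GcsPath;

class ImageProcessor {
    public static Void process(GcsPath inputPath) {
        // Sketch only: builds a GcsUtil from default options; in a real job you
        // would wire your pipeline options through instead.
        GcsUtil gcsUtil = PipelineOptionsFactory.create().as(GcsOptions.class).getGcsUtil();
        // Placeholder output naming; use whatever scheme your job needs.
        GcsPath outputPath = GcsPath.fromUri(inputPath.toUri().toString() + ".out");
        try (SeekableByteChannel in = gcsUtil.open(inputPath);
             WritableByteChannel out = gcsUtil.create(outputPath, "application/octet-stream")) {
            // Read the input image from `in`, run your existing Java image
            // processing on it, and write the generated file(s) to `out`.
        } catch (IOException e) {
            throw new RuntimeException("Failed to process " + inputPath, e);
        }
        return null;
    }
}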

Related

Advice on Scaling OptaPlanner using Azure Functions

I am trying to lift an OptaPlanner project into the cloud as an Azure Function. My goal in this would be to enhance the scaling so that our company can process more solutions in parallel.
Background: We currently have a project running in a Docker container using the optaplanner-spring-boot-starter MVN package. This has been successful when limited to solving one solution at a time. However, we need to dramatically scale the system so that a higher number of solutions can be solved in a limited time frame. Therefore, I'm looking for a cloud-based solution for the extra CPU resources needed.
I created an Azure Function using the optaplanner-core MVN package and our custom domain objects for our existing solution as a proof of concept. The Azure Function uses an HTTP trigger; this seems to work for getting a solution, but the performance is seriously degraded. I'm expecting to need to upgrade the consumption plan so that we can specify CPU and memory requirements. However, it appears that Azure is not scaling out additional instances as expected, leading to OptaPlanner blocking itself.
Here is the driver of the code:
@FunctionName("solve")
public HttpResponseMessage run(
        @HttpTrigger(name = "req", methods = {HttpMethod.POST}, authLevel = AuthorizationLevel.FUNCTION)
        HttpRequestMessage<Schedule> request,
        final ExecutionContext context) {
    SolverConfig config = SolverConfig.createFromXmlResource("solverConfig.xml");
    //SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("2");
    //SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("10");
    //SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("400");
    SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("AUTO");
    SolverManager<Schedule, UUID> solverManager = SolverManager.create(config, managerConfig);

    // The problem presumably comes from the request body.
    Schedule problem = request.getBody();
    SolverJob<Schedule, UUID> solverJob = solverManager.solve(UUID.randomUUID(), problem);
    // This is a blocking call until the solving ends
    // (checked exceptions from getFinalBestSolution() omitted here).
    Schedule solution = solverJob.getFinalBestSolution();

    return request.createResponseBuilder(HttpStatus.OK)
            .header("Content-Type", "application/json")
            .body(solution)
            .build();
}
Question 1: Does anyone know how to set up Azure so that each HTTP call causes a new instance to be scaled out? I would like this to happen so that each solver isn't competing for resources. I have tried to configure this by setting FUNCTIONS_WORKER_PROCESS_COUNT=1 and maxConcurrentRequests=1. I have also tried changing OptaPlanner's parallelSolverCount and moveThreadCount to different values without any noticeable difference.
Question 2: Should I be using Quarkus with Azure instead of the core MVN package? I've read that Geoffrey De Smet answered, "As for AWS Lambda (serverless): Quarkus is your friend".
I'm out of my element here as I haven't coded with Java for over 20 years AND I'm new to both Azure Functions and OptaPlanner. Any advice would be greatly appreciated.
Thanks!
Consider using OptaPlanner's Quarkus integration to compile natively. That is better for serverless deployments because it dramatically reduces the startup time. The README of the OptaPlanner quickstarts that use Quarkus explains how.
By switching from OptaPlanner in plain Java to OptaPlanner in Quarkus (which isn't a big difference), a few magical things will happen:
The parsing of solverConfig.xml with an XML parser won't happen at runtime during bootstrap, but at build time. If it's in src/main/resources/solverConfig.xml, Quarkus will automatically pick it up to configure the SolverManager it injects (see the sketch after this list).
No reflection at runtime
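To make that concrete, the Quarkus version of the endpoint might look roughly like this. It is only a sketch, assuming the optaplanner-quarkus extension plus a JAX-RS resource and solverConfig.xml in src/main/resources; the class name and path are illustrative.
import java.util.UUID;
import javax.inject.Inject;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import org.optaplanner.core.api.solver.SolverJob;
import org.optaplanner.core.api.solver.SolverManager;
// (Newer Quarkus versions use the jakarta.* namespace instead of javax.*.)

@Path("/solve")
public class ScheduleResource {

    // Built at build time from solverConfig.xml by the Quarkus extension;
    // no XML parsing or reflection at bootstrap.
    @Inject
    SolverManager<Schedule, UUID> solverManager;

    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    @Produces(MediaType.APPLICATION_JSON)
    public Schedule solve(Schedule problem) throws Exception {
        SolverJob<Schedule, UUID> job = solverManager.solve(UUID.randomUUID(), problem);
        // Blocks until solving ends, same as in the Azure Function above.
        return job.getFinalBestSolution();
    }
}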
You will want to start one run per dataset, so parallelSolverCount shouldn't be higher than 1 and no run should handle two datasets (not even sequentially). If a run gets 8000 cpuMillis, you can use moveThreadCount=4 for it to get better results faster. If it only gets 1000 cpuMillis (= 1 core), don't use move threads. Verify that each run gets enough memory. (A plain-Java sketch of these settings follows below.)
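If you stay on plain Java for the moment, that advice maps roughly onto your existing driver like this (illustrative values only):
SolverConfig config = SolverConfig.createFromXmlResource("solverConfig.xml");
// Only use move threads if a single run really gets multiple vCPUs;
// on a 1-core plan leave this unset (or set it to "NONE").
config.setMoveThreadCount("4");
// One dataset per run, never two (not even sequentially).
SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("1");
SolverManager<Schedule, UUID> solverManager = SolverManager.create(config, managerConfig);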
As for your Question 1: unfortunately, I don't have a solution for Azure Functions, but let me point you to a blog post about running (and scaling) OptaPlanner workloads on OpenShift, which could address some of your concerns at the architecture level.
Scaling is only static for now (the number of replicas is specified manually), but it can be paired with KEDA to scale based on the number of pending datasets.
Important to note: the optaplanner-operator is only experimental at this point.

Is it possible to use a different implementation of Python (other than the standard one) in Google Cloud Functions?

I am new to Google Cloud Functions. I want to write a small but execution-intensive application. I researched the documentation, and it is unclear whether I can use PyPy or CPython when deploying to Google Cloud Functions.
With Cloud Functions, you can't customize the runtime; it is standard to the service.
If you want more control over your environment, choose Cloud Run: it is serverless but container-based, so you can do what you want in your Docker build.
In addition, you always have 1 vCPU dedicated to the process, with a customizable amount of memory. With Cloud Functions, if you want the full power of the CPU, you have to pay for 2 GB of memory. Finally, your process can run for up to 15 minutes with Cloud Run but only 9 minutes with Cloud Functions.
I wrote an article on this if you want to know more.

Google Cloud: Choosing the right storage option

I am developing a distributed application in Python. The application has two major packages, Package A and Package B, that work separately but communicate with each other through a queue. In other words, Package A generates some files and enqueues (pushes) them to a queue, and Package B dequeues (pops) the files on a first-come-first-served basis and processes them. Both Package A and Package B are going to be deployed on Google Cloud as Docker containers.
I need to decide on the best storage option for keeping the files and the queue. Both the files and the queue only need to be stored and used temporarily.
I think my options are Cloud Storage buckets or Google Datastore, but I have no idea how to choose between them or what the best option would be. The best option would be a solution that is low-cost, reliable, and easy to use from a development perspective.
Any suggestion is welcome... Thanks!
Google Cloud Storage sounds like the right option for you because it supports large files. You have no need for the features provided by Datastore, etc., such as querying by other fields.
If you only need to process a file once, when it is first uploaded, you could use GCS Pub/Sub notifications and trigger your processor from Pub/Sub.
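For that simple one-file-one-task case, Package B's consumer can be little more than a Pub/Sub subscriber reading the object notifications. A minimal sketch (in Java here, though the same Pub/Sub client library exists for Python; the project, subscription, and processing step are placeholders):
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class PackageBWorker {
    public static void main(String[] args) {
        // Placeholder names; the subscription is fed by a GCS Pub/Sub
        // notification configured on the bucket that Package A writes to.
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "package-a-file-events");

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            // GCS notifications carry the bucket and object name as attributes.
            String bucket = message.getAttributesOrDefault("bucketId", "");
            String object = message.getAttributesOrDefault("objectId", "");
            // Download gs://<bucket>/<object>, process it, then acknowledge
            // so the message is not redelivered.
            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
        // Keep the container process alive.
        subscriber.awaitTerminated();
    }
}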
If you need more complex tasks, e.g. one task that dispatches to multiple child tasks that all operate on the same file, then it's probably better to use a separate task system like Celery and pass the GCS URL in the task definition.

Implementing LONG background tasks on Azure webapps

Situation:
A user with a terabyte's worth of files in our Azure Blob Storage and gigabytes of storage in our Azure databases decides to leave our services. At this point, we need to export all of his data into 2 GB packages and deposit them in blob storage for a short period (two weeks or so).
This should happen very rarely, and we're trying to cut costs. Where would it be optimal to implement a task that, over the course of a day or two, downloads the corresponding user's blobs (240 KB files) and zips them into the packages?
I've looked at a separate web app running a dedicated continuous WebJob, but WebJobs seem to shut down when the app unloads, and I need this to hibernate and not use resources when it's not up and running, so "Always on" is out. Plus, I can't seem to find a complete tutorial on how to implement the interface so that I can cancel the running task and such.
Our last resort is abandoning web apps (three of them) and running it all on a virtual machine, but this comes with greater costs. Is there a method I've missed that could get the job done?
This sounds like a job for a serverless model on Azure Functions to me. You get the compute scale you need without paying for idle resources.
I don't believe that there are any time limits on running the function (unlike AWS Lambda), but even so you'll probably want to implement something to split the job up first so it can be processed in parallel (and to provide some resilience to failures). Queue these tasks up and trigger the function off the queue (a sketch follows at the end of this answer).
It's worth noting that they're still in 'preview' at the moment though.
Edit: I have just noticed your comment on file size... That might be a problem, but in theory you should be able to use local storage rather than doing it all in memory.
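To make the queue idea concrete, the worker function could look roughly like this. It is only a sketch, written against the Java Functions binding for illustration; the function name, queue name, message format, and connection setting are all placeholders rather than anything from your setup.
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.QueueTrigger;

public class ExportPackageZipper {

    // One queue message per ~2 GB package to build; the splitter that enqueues
    // these messages is the "split the job up first" step mentioned above.
    @FunctionName("zipExportPackage")
    public void run(
            @QueueTrigger(name = "task", queueName = "export-packages",
                          connection = "AzureWebJobsStorage") String task,
            final ExecutionContext context) {
        context.getLogger().info("Building export package: " + task);
        // 1. List the user's ~240 KB blobs that belong to this package.
        // 2. Stream them into a zip on the function's local temp storage
        //    (not in memory, per the file-size concern above).
        // 3. Upload the finished zip back to blob storage for pickup.
    }
}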

How to host a long-running process in the Azure cloud?

I have a C# console application which extracts a 15 GB Firebird database file on a server location to multiple files and loads the data from those files into a SQL Server database. The console application uses the System.Threading.Tasks.Parallel class to perform the data load from the files to the SQL Server database in parallel.
It is a weekly process and it takes 6 hours to complete.
What is the best option for moving this (console application) process to the Azure cloud - a WebJob, a WorkerRole, or any other cloud service?
How can the execution time (6 hrs) be reduced after moving to the cloud?
How should the suggested option be implemented? Please provide pointers, code samples, etc.
Your help in the form of detailed comments is very much appreciated.
Thanks
Bhanu.
Let me give some thoughts on this question of yours:
"What is the best option for moving this (console application) process to the Azure cloud - a WebJob, a WorkerRole, or any other cloud service?"
First, you can achieve the task with either a WebJob or a WorkerRole, but I would suggest you go with a WebJob.
The pros of a WebJob are:
Deployment is quicker: you can turn your console app, without any changes, into a continuously running WebJob within minutes (https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/).
Built-in timer support, whereas with a WorkerRole you would need to handle that on your own.
Fault tolerance: when your WebJob fails, there is built-in resume logic.
You might want to check out Azure Functions. You pay only for the processing time you use and there doesn't appear to be a maximum run time (unlike AWS Lambda).
They can be set up on a schedule or kicked off from other events.
If you are already doing work in parallel, you could break out some of the parallel tasks into separate Azure Functions. Aside from that, how to speed things up would require specific knowledge of what you are trying to accomplish.
In the past, when I've tried to speed up work like this, I would start by emitting log messages during processing that contain the current time or that calculate the duration (using the Stopwatch class), and then find out which areas can be improved. The slowness may also be due to a slowdown on the SQL Server side, so more investigation would be needed on your part. But the first step is always capturing metrics.
Since Azure Functions can scale out horizontally, you might first want to break the data from the files into smaller chunks and let the functions handle each chunk, then spin up processing of many of those chunks in parallel (a rough sketch follows). Be sure not to spin up more than your SQL Server can handle.
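A rough sketch of that fan-out step, with illustrative names and schedule (it assumes the Azure Storage Queue SDK for enqueuing; a separate queue-triggered function, not shown, would load each chunk into SQL Server, and its queue batch size is what caps how many chunks hit SQL Server at once):
import java.util.List;
import com.azure.storage.queue.QueueClient;
import com.azure.storage.queue.QueueClientBuilder;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.TimerTrigger;

public class WeeklyLoadFanOut {

    @FunctionName("enqueueLoadChunks")
    public void run(
            @TimerTrigger(name = "weeklyTimer", schedule = "0 0 2 * * Sun") String timerInfo,
            final ExecutionContext context) {
        QueueClient queue = new QueueClientBuilder()
                .connectionString(System.getenv("AzureWebJobsStorage"))
                .queueName("load-chunks") // placeholder queue name
                .buildClient();

        // In reality these would come from listing the files extracted from Firebird
        // and splitting them into row ranges; hard-coded here for illustration.
        List<String> chunks = List.of("export-part-001.csv", "export-part-002.csv");
        for (String chunk : chunks) {
            queue.sendMessage(chunk);
            context.getLogger().info("Enqueued chunk: " + chunk);
        }
    }
}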
