I am trying to lift an OptaPlanner project into the cloud as an Azure Function. My goal is to improve scaling so that our company can solve more problems in parallel.
Background: We currently have a project running in a Docker container using the optaplanner-spring-boot-starter MVN package. This has been successful when limited to solving one problem at a time. However, we need to dramatically scale the system so that a higher number of problems can be solved in a limited time frame. Therefore, I'm looking for a cloud-based solution to provide the extra CPU resources needed.
I created an Azure Function using the optaplanner-core MVN package and our custom domain objects from our existing solution as a proof of concept. The Azure Function uses an HTTP trigger; it does return a solution, but performance is seriously degraded. I expect to need to upgrade the Consumption plan so that we can specify CPU and memory requirements. However, it appears that Azure is not scaling out additional instances as expected, leading to the OptaPlanner solvers blocking each other.
Here is the driver code:
@FunctionName("solve")
public HttpResponseMessage run(
        @HttpTrigger(name = "req", methods = {HttpMethod.POST}, authLevel = AuthorizationLevel.FUNCTION)
        HttpRequestMessage<Schedule> request,
        final ExecutionContext context) {
    SolverConfig config = SolverConfig.createFromXmlResource("solverConfig.xml");
    // Also tried parallelSolverCount values of "2", "10" and "400":
    SolverManagerConfig managerConfig = new SolverManagerConfig().withParallelSolverCount("AUTO");
    SolverManager<Schedule, UUID> solverManager = SolverManager.create(config, managerConfig);
    Schedule problem = request.getBody();
    SolverJob<Schedule, UUID> solverJob = solverManager.solve(UUID.randomUUID(), problem);
    try {
        // This is a blocking call until the solving ends.
        Schedule solution = solverJob.getFinalBestSolution();
        return request.createResponseBuilder(HttpStatus.OK)
                .header("Content-Type", "application/json")
                .body(solution)
                .build();
    } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException("Solving failed.", e);
    }
}
Question 1: Does anyone know how to set up Azure so that each HTTP call scales out a new instance? I would like this to happen so that the solvers aren't competing for resources. I have tried to configure this by setting FUNCTIONS_WORKER_PROCESS_COUNT=1 and maxConcurrentRequests=1. I have also tried changing OptaPlanner's parallelSolverCount and moveThreadCount to different values without any noticeable difference.
Question 2: Should I be using Quarkus with Azure instead of the core MVN package? I've read that Geoffrey De Smet answered, "As for AWS Lambda (serverless): Quarkus is your friend".
I'm out of my element here as I haven't coded with Java for over 20 years AND I'm new to both Azure Functions and OptaPlanner. Any advice would be greatly appreciated.
Thanks!
Consider using OptaPlanner's Quarkus integration to compile to a native executable. That is better for serverless deployments because it dramatically reduces startup time. The README of the OptaPlanner quickstarts that use Quarkus explains how.
By switching from OptaPlanner in plain java to OptaPlanner in Quarkus (which isn't a big difference), a few magical things will happen:
The parsing of solverConfig.xml with an XML parser won't happen at runtime during bootstrap, but at build time. If it's in src/main/resources/solverConfig.xml, Quarkus will automatically pick it up to configure the SolverManager to inject.
No reflection at runtime
You will want to start 1 run per dataset, so parallelSolverCount shouldn't be higher than 1, and no run should handle 2 datasets (not even sequentially). If a run gets 8000 cpuMillis, you can use moveThreadCount=4 for it to get better results faster. If it only gets 1000 cpuMillis (= 1 core), don't use move threads. Verify that a run gets enough memory.
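For illustration, a minimal sketch of what the Quarkus variant could look like (assuming the optaplanner-quarkus extension, a REST extension such as quarkus-resteasy-jackson, a solverConfig.xml under src/main/resources, and your existing Schedule domain class; whether the imports are javax or jakarta depends on your Quarkus version):
import java.util.UUID;
import java.util.concurrent.ExecutionException;
import javax.inject.Inject;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import org.optaplanner.core.api.solver.SolverJob;
import org.optaplanner.core.api.solver.SolverManager;

@Path("/solve")
public class SolveResource {

    // Built at build time from solverConfig.xml by the optaplanner-quarkus
    // extension; no XML parsing or reflection at runtime.
    @Inject
    SolverManager<Schedule, UUID> solverManager;

    @POST
    public Schedule solve(Schedule problem) throws InterruptedException, ExecutionException {
        SolverJob<Schedule, UUID> job = solverManager.solve(UUID.randomUUID(), problem);
        return job.getFinalBestSolution(); // blocks until solving ends
    }
}
The injected SolverManager is configured at build time, so the per-request XML parsing from the Azure Function snippet above disappears entirely.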
As for your Question 1: unfortunately, I don't have a solution for Azure Functions, but let me point you to a blog post about running (and scaling) OptaPlanner workloads on OpenShift, which could address some of your concerns at the architecture level.
Scaling is only static for now (the number of replicas is specified manually), but it can be paired with KEDA to scale based on the number of pending datasets.
Note that the optaplanner-operator is only experimental at this point.
Related
We have a service running as an Azure Function (Event and Service Bus triggers) that we feel would be better served by a different model, because it takes a few minutes to run and loads a lot of objects into memory. It seems to reload them every time it gets called instead of keeping them in memory, which would perform better.
What is the best Azure service to move to, with the following goals in mind?
Easy to move to, without too many code changes.
We have a long-term goal of being able to run this on-prem (Kubernetes might help us here).
Appreciate your help.
To achieve the first goal:
Move your Azure Function code inside a continuously running WebJob. It has no maximum execution time, and it can run continuously, caching objects in its context.
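For a rough idea, here is a hedged sketch of such a continuous WebJob as a plain Java console app (assuming the azure-messaging-servicebus SDK; the queue name, connection variable, and loadHeavyObjects() are hypothetical placeholders):
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;

public class ContinuousWorker {
    // Loaded once per process and kept in memory across messages --
    // the caching benefit a per-call Function doesn't give you.
    private static final Object HEAVY_OBJECTS = loadHeavyObjects();

    public static void main(String[] args) {
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
                .connectionString(System.getenv("SERVICEBUS_CONNECTION"))
                .processor()
                .queueName("jobs") // hypothetical queue name
                .processMessage(context ->
                        process(context.getMessage().getBody().toString()))
                .processError(context ->
                        System.err.println(context.getException()))
                .buildProcessorClient();
        processor.start(); // runs until the WebJob is stopped
    }

    private static Object loadHeavyObjects() { return new Object(); } // placeholder
    private static void process(String message) { /* real work here */ }
}
Because the process stays alive between messages, the expensive objects are loaded once and reused across calls.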
To achieve the second goal (on-premise):
You need to explain this better, but a WebJob can be run as a console program on-premise, and you can also wrap it in a Docker container to move it from on-premise to any cloud. However, if you need to consume messages from an Azure Service Bus, you will need a hybrid on-premise/Azure approach, connecting your local server to the cloud with a VPN or ExpressRoute.
Regards.
There are a couple of ways to solve this, each requiring a slightly bigger change from where you are.
If you are just trying to separate out the heavy initial load, then you can do that work once, put the result in a Redis Cache instance, and then reference it from there.
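A minimal sketch of that load-once-then-reference pattern, using the Jedis client against a (hypothetical) Redis instance; the key name and loadHeavyObjects() are placeholders:
import redis.clients.jedis.Jedis;

public class StartupCache {
    public static String getHeavyData(Jedis redis) {
        String cached = redis.get("heavy-data");
        if (cached == null) {
            cached = loadHeavyObjects();     // expensive, done only once
            redis.set("heavy-data", cached); // later calls hit the cache
        }
        return cached;
    }

    private static String loadHeavyObjects() {
        return "..."; // placeholder for the few-minutes loading work
    }
}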
If you are concerned about how long your worker can run, then WebJobs (as explained above) can work. However, I'd suggest avoiding them, since they are not where Microsoft is putting its resources; look at Durable Functions instead, where an orchestrator function can drive a worker function. (Even here, be careful: because Durable Functions retain history, after running for a very, very long time the history tables might get too large. So program in something like restarting the orchestrator after, say, 50,000 runs; obviously the number will vary based on your case.) Also see this.
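To make the restart idea concrete, here is a hedged sketch using the Durable Functions for Java API (com.microsoft.durabletask); the "ProcessBatch" activity and the 50,000 threshold are placeholders, and the package/annotation names are assumptions worth verifying against the current SDK:
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.durabletask.TaskOrchestrationContext;
import com.microsoft.durabletask.azurefunctions.DurableOrchestrationTrigger;

public class EternalOrchestrator {
    private static final int RUNS_PER_GENERATION = 50_000; // tune to your case

    @FunctionName("EternalOrchestrator")
    public void run(
            @DurableOrchestrationTrigger(name = "ctx") TaskOrchestrationContext ctx) {
        for (int run = 0; run < RUNS_PER_GENERATION; run++) {
            // "ProcessBatch" is a hypothetical activity function doing the real work.
            ctx.callActivity("ProcessBatch", run, Void.class).await();
        }
        // Restart as a brand-new instance with an empty history,
        // so the history tables never grow unbounded.
        ctx.continueAsNew(null);
    }
}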
If you then add the constraint of portability, you can run this function in a Docker image on an AKS cluster in Azure. This might not work well for Durable Functions (try it out, who knows :) ), but it will surely work for the worker functions (which would cost you the most compute anyway).
If you want to bring the workloads completely on-prem, then Azure Functions might not be a good choice. You can create an HTTP server using the platform of your choice (Node, Python, C#...) and have it invoke the worker routine. Then you can run the whole setup inside an image on an on-prem AKS cluster, and to the user it looks just like a load-balanced web server :) You can decide whether to keep the data on Azure or bring it down on-prem as well, but beware of egress costs if you decide to move it out once you've moved it up.
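A bare-bones sketch of such a server, using only the JDK's built-in com.sun.net.httpserver; processRequest() stands in for your real worker routine:
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class WorkerServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/work", exchange -> {
            byte[] result = processRequest(exchange.getRequestBody().readAllBytes());
            exchange.sendResponseHeaders(200, result.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(result);
            }
        });
        server.start(); // containerize this and run it behind a load balancer
    }

    private static byte[] processRequest(byte[] payload) {
        return payload; // placeholder for the real worker routine
    }
}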
It appears that the functions are affected by cold starts:
Serverless cold starts within Azure
Upgrading to the Premium plan would move your functions to pre-warmed instances, which should counter the problem you are experiencing:
Pre-warmed instances for Azure Functions
However, if you potentially want to deploy your functions/triggers on-prem, you should spin them out as microservices and deploy them in containers.
Currently, the fastest way would probably be to deploy the containerized triggers via Azure Container Instances if you don't already have a Kubernetes cluster running. With some tweaking, you can deploy them on-prem later on.
There are a few options:
Move your function app onto the Premium plan. But it will not help you much under heavy load and scale-out.
Issue: in that case you will still face cold-startup issues, and the problem will persist under heavy load.
Redis Cache: it will resolve most of your issues, as the main concern is the heavy loading.
Issue: if your system is multi-tenant, your cache may become heavy over time.
Create small, micro durable functions. This is not the direct answer to your question, as you don't want lots of changes, but it will resolve most of your issues.
I have a Node.js web server where there seem to be a couple of bottlenecks preventing it from fully functioning under peak load:
logging multiple events to our SQL server
logging multiple events to our elastic cluster
Under heavy load, both SQL and Elastic seem to reject my requests and/or performance slows considerably. So I've decided to reduce the load on these DBs via Logstash (for Elastic) and an async task queue (for SQL).
Since I'm working with limited time, I'm wondering if I can solve both issues with just an async task queue (like Kue) to which I can push both SQL and Elastic logs.
Do I need to implement Logstash as well, or does an async queue solve the same thing as Logstash?
You can try the Async library's queue feature and try to make it run in a child process, or better yet on a separate server as a microservice for queues. That way you can move the load to another location, giving your app server a boost.
As you mentioned you are using Azure, I would strongly recommend using their queue solution plus a couple of Azure Functions to handle reading from the queue and the processing.
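A hedged sketch of that setup in Java (the thread is about Node.js, but the shape is the same in any Functions language): a queue-triggered function drains log events at its own pace. The queue name and writeToElastic() are hypothetical.
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.QueueTrigger;

public class LogDrain {
    @FunctionName("drainLogQueue")
    public void run(
            @QueueTrigger(name = "message", queueName = "log-events",
                    connection = "AzureWebJobsStorage") String message,
            final ExecutionContext context) {
        // Each dequeued message is one log event; write it at the pace the
        // backing store can handle instead of hammering it at peak load.
        context.getLogger().info("Dequeued one log event");
        writeToElastic(message); // hypothetical sink
    }

    private void writeToElastic(String message) {
        // placeholder: batch/forward to Elasticsearch or SQL here
    }
}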
I've rolled my own solution before using Node.js and RabbitMQ with Node workers to read from the queue and write to Elasticsearch, because that solution couldn't go into the cloud.
It works and is robust, but it takes quite a bit of configuration and custom code that is a maintenance nightmare. I ripped it out as soon as I could.
The benefits of using the azure service are:
Very little bespoke configuration is required.
Less custom code === fewer bugs & less maintenance.
Scaling isn't an issue; some huge businesses rely on this.
No 2 a.m. support calls; if Azure is down, they are going to fix it... fast.
Much cheaper: unless the throughput is massive and constant, the scaling model is much cheaper, and Azure Functions are perfect, as you won't have running servers sitting there doing nothing when the queue is empty.
Same could be said for AWS and Google Cloud.
I am building a small utility that packages Locust, a performance testing tool (https://locust.io/), and deploys it on Azure Functions. Just a fun side project to get some hands-on experience with the serverless craze.
Here's the Git repo: https://github.com/amanvirmundra/locust-serverless.
Now I am thinking that it would be great to run Locust tests in distributed mode on serverless architecture (Azure Functions Consumption plan). Locust supports distributed mode, but it needs the slaves to communicate with the master using its IP. That's the problem!
I can provision multiple functions, but I'm not quite sure how I can make them talk to each other on the fly (without manual intervention).
Thinking out loud:
Somehow get the IP of the master function and pass it on to the slave functions. I'm not sure if that's possible in Azure Functions, but some people have figured out a way to get the IP of an Azure Function using .NET libraries. Mine is a Python version, but I'm sure that if it can be done using .NET, there is a Python way as well.
Create some sort of VPN and map a function to a private IP. I'm not sure if this sort of mapping is possible in Azure.
Someone has done this using AWS Lambdas (https://github.com/FutureSharks/invokust). Ask that person, or try to understand the code.
I need advice figuring out what's possible while keeping things serverless. Open to ideas and/or code contributions :)
Update
This is the current setup:
The performance test session is triggered by an HTTP request, which takes in the number of requests to make, the base URL, and the number of concurrent users to simulate.
The Locustfile defines the test setup and orchestration.
Run.py triggers the tests.
What I want to do now is have a master/slave setup (cluster) for massive-scale perf tests.
I would imagine that the master function is triggered by an http request, with a similar payload.
The master will in turn trigger slaves.
When the slaves join the cluster, the performance session would start.
What you describe doesn't sound like a good use case for Azure Functions.
Functions are supposed to be:
Triggered by an event
Short running (max 10 minutes)
Stateless and ephemeral
That said, Functions are indeed good for load testing, but the setup should be different:
You define a trigger for your Function (e.g. HTTP, or Event Hub)
Each function execution makes a given amount of requests, in parallel or sequentially, and then quits
There is an orchestrator somewhere (e.g. just a console app), who sends "commands" (HTTP call or Event) to trigger the Function
So, Functions are "multiplying" the load as per schedule defined by the orchestrator. You rely on Consumption Plan scalability to make sure that enough executions are provisioned at any given time.
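To make that concrete, here is a hedged sketch of one such load-multiplying execution as a Java Azure Function (the repo above is Python; this only illustrates the shape, and the function name and payload format are made up):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import com.microsoft.azure.functions.*;
import com.microsoft.azure.functions.annotation.*;

public class LoadShot {
    @FunctionName("loadShot")
    public HttpResponseMessage run(
            @HttpTrigger(name = "req", methods = {HttpMethod.POST},
                    authLevel = AuthorizationLevel.FUNCTION)
            HttpRequestMessage<Optional<String>> request,
            final ExecutionContext context) throws Exception {
        // Hypothetical payload: "<requestCount> <baseUrl>"
        String[] parts = request.getBody().orElse("10 https://example.com").split(" ");
        int count = Integer.parseInt(parts[0]);
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < count; i++) {
            // Sequential here; could also fan out in parallel per execution.
            client.send(HttpRequest.newBuilder(URI.create(parts[1])).GET().build(),
                    HttpResponse.BodyHandlers.discarding());
        }
        return request.createResponseBuilder(HttpStatus.OK)
                .body(count + " requests sent").build();
    }
}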
The biggest difference is that function executions don't talk to each other, so they don't need IPs.
I think the example based on AWS Lambda that you mentioned is just calling Lambdas too; it does not set up master/client Lambdas talking to each other.
I guess my point is that you might not need that Locust framework at all, and instead leverage the built-in capabilities of autoscaled FaaS.
I'm currently writing a Node library to execute untrusted code within Docker containers. It basically maintains a pool of containers running, and provides an interface to run code in one of them. Once the execution is complete, the corresponding container is destroyed and replaced by a new one.
The four main classes of the library are:
Sandbox. Exposes a constructor with various options including the pool size, and two public methods: executeCode(code, callback) and cleanup(callback)
Job. A class with two attributes, code and callback (to be called when the execution is complete)
PoolManager, used by the Sandbox class to manage the pool of containers. Provides the public methods initialize(size, callback) and executeJob(job, callback). It has internal methods related to the management of the containers (_startContainer, _stopContainer, _registerContainer, etc.). It uses an instance of the dockerode library, passed in the constructor, to do all the docker related stuff per se.
Container. A class with the attributes tmpDir, dockerodeInstance, IP, and a public method executeCode(code, callback) which basically sends an HTTP POST request to ContainerIP:3000/compile along with the code to compile (a minimalist API runs inside each Docker container).
In the end, the final users of the library will only be using the Sandbox class.
Now, my question is: how should I test this?
First, it seems pretty clear to me that I should begin by writing functional tests against my Sandbox class:
it should create X containers, where X is the required pool size
it should correctly execute code (including the security aspects: handling timeouts, fork bombs, etc. which are in the library's requirements)
it should correctly cleanup the resources it uses
But then I'm not sure what else it would make sense to test, how to do it, and if the architecture I'm using is suitable to be correctly tested.
Any idea or suggestion related to this is highly appreciated! :) And feel free to ask for a clarification if anything looks unclear.
Christophe
Try and separate your functional and unit testing as much as you can.
If you make a minor change to your constructor on Sandbox then I think testing will become easier. Sandbox should take a PoolManager directly. Then you can mock the PoolManager and test Sandbox in isolation, which it appears is just the creation of Jobs, calling PoolManager for Containers and cleanup. Ok, now Sandbox is unit tested.
PoolManager may be harder to unit test, as the dockerode client might be hard to mock (the API is fairly big). Regardless of whether you mock it or not, you'll want to test:
Growing/shrinking the pool size correctly
Testing sending more requests than available containers in the pool
How stuck containers are handled, both when starting and stopping
Handling of network failures (easier when you mock things)
Retries
Any other failure cases you can think of
The Container can be tested by firing up the API from within the tests (in a container or locally). If it's that minimal, recreating it should be straightforward. Once you have that, it sounds like it's really just testing an HTTP client.
The source code for the actual API within the container can be tested however you like with standard unit tests. Because you're dealing with untrusted code there are a lot of possibilities:
Doesn't compile
Never completes execution
Never starts
All sorts of bombs
Uses all host's disk space
Is a bot and talks over the network
The code could do basically anything. You'll have to pick the things you care about. Try and restrict everything else.
Functional tests are going to be important too; there are a lot of pieces to deal with here, and mocking Docker isn't going to be easy.
Code isolation is a difficult problem; I wish Docker was around last time I had to deal with it. Just remember that your customers will always do things you didn't expect! Good luck!
I have many images that I need to run through a Java program to create more image files -- an embarrassingly parallel case. Each input file is about 500 MB, needs about 4 GB of memory during processing, and takes 30 seconds to 2 minutes to run. The Java program is multithreaded, but more gain comes from parallelizing across the input files than from using more threads. I need to kick off processes several times a day (I do not want to turn the cluster on/off manually, nor pay for it 24/7).
I'm a bit lost in the variety of cloud options out there:
AWS Lambda has insufficient system resources (not enough memory).
Google Cloud Dataflow: it appears that I would have to write my own pipeline source to use their Cloud Storage buckets. Fine, but I don't want to waste time doing that if it's not an appropriate solution (which it might be, I can't tell yet).
AWS Data Pipeline looks to be the equivalent of Google Cloud Dataflow. (Added in edit for completeness.)
Google Cloud Dataproc: this is not a map/reduce Hadoop-y situation, but it might work nonetheless. I'd rather not manage my own cluster, though.
Google Compute Engine or AWS with autoscaling, where I just kick off processes for each core on the machine. More management for me, but no APIs to learn.
Microsoft Data Lake is not released yet and looks Hadoop-y.
Azure Batch seems quite appropriate (but I'm asking because I remain curious about other options).
Can anyone advise what appropriate solution(s) would be for this?
You should be able to do this with Dataflow quite easily. The pipeline could look something like (assuming your files are located on Google Cloud Storage, GCS):
class ImageProcessor {
    public static Void process(GcsPath path) {
        // Open the image, do the processing you want, write
        // the output to where you want.
        // You can use GcsUtil.open() and GcsUtil.create() for
        // reading and writing paths on GCS.
        return null; // return Void so the method reference fits MapElements
    }
}

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// This will work fine until a few tens of thousands of files.
// If you have more, let me know.
List<GcsPath> filesToProcess = GcsUtil.expand(GcsPath.fromUri("..."));

p.apply(Create.of(filesToProcess))
 .apply(MapElements.via(ImageProcessor::process)
     .withOutputType(new TypeDescriptor<Void>() {}));
p.run();
This is one of a common family of cases where Dataflow is used as an embarrassingly parallel orchestration framework rather than a data processing framework, but it should work.
You will need Dataflow SDK 1.2.0 to use the MapElements transform (support for Java 8 lambdas is new in 1.2.0).