Azure Machine Learning Studio : Deployed Service gives OutOfMemoryLimit exception

Azure Machine Learning Studio : Deployed Service gives OutOfMemoryLimit exception - azure

Using Azure Machine Learning Studio I have published a few models that are having OutOfMemoryLimit issues in the deployed services (not during training).
The model type I am using is the "multiclass decision forest", and I do create some decent sized forests. When stored to a blob they take up to around 150 MB in size. Closing in on 150 I get OOM every time. At around 140 Maybe 1 in 10, and even at 120 MB they happens now and then.
The thing is it runs fine in the studio, and when deployed as a service it is not very consistent in when it gives exceptions. I can run requests against a service and get a reply 9 out of 10 cases, but then in the 10% remaining cases I will get an exception that looks like this:
{"error":{"code":"MemoryQuotaViolation","message":"The model had
exceeded the memory quota assigned to
it.","details":[{"code":"OutOfMemoryLimit","message":"The model
consumed more memory than was appropriated for it. Maximum allowed
memory for the model is 2560 MB. Please check your model for
issues."}]}}
Now I do run this as a request-response, as opposed to a batch job, and I suspect it might do just fine as a batch job. The reason for R-R is that I do need these data in real time and batch jobs are simply too slow.
I suspect the "right" approach is to further handicap my forest by reducing tree count or increasing leaf node sizes, but obviously this will reduce the model accuracy (further). Before I do so I am looking for some advice around:
Is it possible to pay for MORE than the 2,5 GB limit for the Azure ML SaaS ? (If not when is that coming??)
Is there any way to test whether a deployed model will break this limit or not before actually deploying it? We are trying to run retraining automatically and this reduces our reliability drastically.
Any other advice on what to try/test/think of
Thanks in advance!

Related

Why change feed lag estimator showing lag in millions?

I am working on cosmos db change feed for a real time project. we are running our webjobs in azure app service with P3V2 specification. there are multiple webjobs running using change feed. So to monitor the processes we have used the change feed lag estimator for monitoring record lags. the implementation is according to following document.
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-use-change-feed-estimator
For one of the webjob in the .net core code we have put a delay of 10 mins using await Task.delay() function. for that specific webjob we are getting estimation in millions even though the records which we are processing are not more than 100.
This is kind of uncertain behavior we are observing. can anyone help to find the exact reason?

Is the Estimator matching a processor that is currently running and processing documents? Normally what you describe matches a scenario where the Processor is not running/never ran or never completed a successful run on some of the leases.
You can use the detailed estimation to understand how the lag is distributed across leases: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/how-to-use-change-feed-estimator#as-an-on-demand-detailed-estimation

What is the optimal architecture design on Azure for an infrequently used backend that needs a robust configuration?

I'm trying to find the optimal cloud architecture to host a software on Microsoft Azure.
The scenario is the following:
A (containerised) REST API is exposed to the users through which they can submit POST and GET requests. POST requests trigger a backend that needs a robust configuration to operate properly and GET requests are sent to fetch the result of the backend, if any. This component of the solution is currently hosted on an Azure Web App Service which does the job perfectly.
The (containerised) backend (triggered by POST requests) perform heavy calculations during a short amount of time (typically 5-10 minutes are allotted for the calculation). This backend needs (at least) 4 cores and 16 Gb RAM, but the more the better.
The current configuration consists in the backend hosted together with the REST API on the App Service with a plan that accommodates the backend's requirements. This is clearly not very cost-efficient, as the backend is idle ~90% of the time. On top of that it's not really scalable despite an automatic scaling rule to spawn new instances based on the CPU use: it's indeed possible that if several POST requests come at the same time, they are handled by the same instance and make it crash due to a lack of memory.
Azure Functions doesn't seem to be an option: the serverless (consumption plan) solution they propose is restricted to 1.5 Gb RAM and doesn't have Docker support.
Azure Container Instances neither, because first the max number of CPUs is 4 (which is really few for the needs here, although acceptable) and second there are cold starts of approximately 2 minutes (I imagine due to the creation of the container group, pull of the image, and so on). Despite the process is async from a user perspective, a high latency is not allowed as the result is expected within 5-10 minutes, so cold starts are a problem.
Azure Batch, which at first glance appears to be a perfect fit (beefy configurations available, made for hpc, cost effective, made for time limited tasks, ...) seems to be slow too (it takes a couple of minutes to create a pool and jobs don't run immediately when submitted).
Do you have any idea what I could use?
Thanks in advance!

Azure Functions
You could look at Azure Functions Elastic Premium plan. EP3 has 4 cores, 14GB of RAM and 250GB of storage.
Premium plan hosting provides the following benefits to your functions:
Avoid cold starts with perpetually warm instances
Virtual network connectivity.
Unlimited execution duration, with 60 minutes guaranteed.
Premium instance sizes: one core, two core, and four core instances.
More predictable pricing, compared with the Consumption plan.
High-density app allocation for plans with multiple function apps.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-premium-plan?tabs=portal
Batch Considerations
When designing an application that uses Batch, you must consider the possibility of Batch not being available in a region. It's possible to encounter a rare situation where there is a problem with the region as a whole, the entire Batch service in the region, or your specific Batch account.
If the application or solution using Batch always needs to be available, then it should be designed to either failover to another region or always have the workload split between two or more regions. Both approaches require at least two Batch accounts, with each account located in a different region.
https://learn.microsoft.com/en-us/azure/batch/high-availability-disaster-recovery

Quickly running into high memory usage Node.js

I am running a small Node.js server on a Heroku Hobby tier dyno. This tier allocates 512mb of RAM. This service takes in a moderate size of JSON data (~5000 JSON objects, each is ~5kb, 5000*5kb = 25mb), runs some analysis on it and outputs about 20 metrics. It's not a trivial amount of input data but it's also definitely not GBs of files. I'm confused where I'm running into the memory limits here. I run one continuous request for all 20 metrics. Does garbage collection not happen until the end of the request? I'm creating a lot of Date objects probably around 2 per JSON object so about 10,000 total and I do this upfront and save them for the whole process so I'm not constantly re-creating these dates. Could this be the issue I'm running into? Any suggestions for how to optimize there?
By the way, I know Node.js isn't the best tool for data processing in general and we're already looking to move to a Python based server for the libraries and multithreaded environment. But, until that's up and running, I'd love to be able to understand and improve the situation I'm currently running into on Node.js.
Thanks!

Azure Analysis Services reached maximum allowable memory allocation when creating partitions for table

We have an SSAS tabular model that we want to add partitions to. The server is hosted on Azure with 100GB of memory (the highest tier). We manage to create 5 out of 20 partitions, but when we try to create the sixth partition we get the following error:
Failed to save modifications to the server. Error returned: 'Memory error: You have reached the maximum allowable memory allocation for your tier. Consider upgrading to a tier with more available memory.
Technical Details:
RootActivityId: b2ae04c9-f0eb-4f62-93f9-adcda143a25d
Date (UTC): 9/13/2017 7:43:46 AM
The strange thing is that the memory usage is just around 17gb out of 100gb when we check the server monitoring logs.
I have seen a similar issue in Azure Analysis Services maximum allowable memory issue, but I don't think this is the same problem.
Another funny thing is that we have managed to process another model with the same type of data, but the tables used in that model are even bigger than the tables in this model. The server that is hosting that model has the same amount of memory as the server that is hosting the model that fails partitioning.
If it is of any help, we upgraded this server's tier, so perhaps there is a bug in Azure so it thinks we have the old pricing tier with the lower amount of memory?

The strange thing is that our on-premise data gateway computer was the cause of this problem.. I don’t know why but we got rid of this error once we restarted the gateway computer...

Number of instances needed for windows azure application

I'm fairly new to Windows Azure and want to host a survey application that will be filled out by appr. 30.000 users simultaniously.
The application consists of 1 .aspx page that will be sent to the client once, asks 25 questions and will give a wrap-up of the given answers at the end. When the user has given the answer and hits the 'next question' buttons the given answer will be send via an .ashx handler to the server. The response is the next question and answers. The wrap-up is sent to the client after a full postback.
The answer is saved in an Azure Table that is partitioned so that each partition can hold a max of 450 users.
I would like to ask if someone can give an estimated guess about how many web-role instances we need to start in order to have this application keep running. (If that is too hard to say, is it more likely to start 5, 50 or 500 instances?)
What is a better way to go: 20 small instances or 5 large instances?
Thanks for your help!

The most obvious answer: you would be best served by testing this yourself and see how your application holds up. You can easily get performance counters and other diagnostics out of Windows Azure; for instance, you can connect Microsoft SCOM (System Center Operations Manager) to monitor your environment during test. Site Hammer is a simple load testing tool for Windows Azure (on MSDN code gallery).
Apart from this very obvious answer, I will share some guesstimates: given the type of load, you are probably better of with more small instances as opposed to a lower number of large ones, especially since you already have your storage partitioned. If you are really going to have 30K visitors simultaneously and give them a ~15 second interval between reading the questions & posting their answers you are looking at 2,000 requests per second. 10 nodes should be more than enough to handle that load. Remember that this is just a simple estimate, lacking any form of insight in your architecture, etc. For these types of loads, caching is a very good idea; it will dramatically increase the load each node can handle.
However, the best advice I can give you is to make sure that you are actively monitoring. It takes less than 30 minutes to spin up additional instances, so if you monitor your environment and/or make sure that you are notified whenever it starts to choke, you can easily upgrade your setup. Keep in mind that you do need to contact customer support to be able to go over 20 instances (this is a default limit, in place to protect you from over-spending).

Aside from the sage advice tijmenvdk gave you, let me add my opinion on instance size. In general, go with the smallest size that will support your app, and then scale out to handle increased traffic. This way, when you scale back down, your minimum compute cost is kept low. If you ran, say, a pair of extra-large instances as your baseline (since you always want minimum two instances to get the uptime SLA), your cost footprint starts at 0.12 x 8 x 2 = $1.92 per hour, even during low-traffic times. If you go with small instances, you'd be at 0.12 x 1 x 2 = $0.24 per hour.
Each VM size as associated CPU, memory, and local 9non-durable) disk storage, so pick the smallest size unit that your app works efficiently in.
For load/performance-testing, you might also want to consider a hosted solution such as Loadstorm.

How simultaneous are the requests in reality?
Will they all type the address in at exactly the same time?
That said, profile your app locally, this will enable you to estimate CPU, Network and Memory usage on Azure. Then, rather than looking at how many instances you need, look at how you can reduce the requirement! Apply these tips, and profile locally again.
Most performance tips have a tradeoff between cpu, memory or bandwith usage, the idea is to ensure that they scale equally. If you're application runs out of memory, but you have loads of CPU and network, dont
For a single page survey, ensure your html, css & js is minified, ensure its cacheable.
Combine them if possible, and to get really scaleable, push static files (css,js & images) to a CDN. This all reduces the number of requests the webserver has to deal with, and therefore reduces the number of webroles you will need = less network.
How does the ashx return the response? i.e. is it sending html, xml or json?
personally, I'd get it to return JSON, as this will require less network bandwidth, and most likely less server side processing = less mem and network.
Use Asyncronous API's to access azure storage (this uses IO completion ports to free up the iis thread to handle more requests until azure storage comes back = enabling cpu to scale)
tijmenvdk has already mentioned using queues to write. Do the list of questions change? if not, cache them, so that the app only has to read from table storage once on start-up and once for each client for the final wrap-up = saves network and cpu at the expense of memory.
All of these tips are equally applicable to a normal web application, on a single server or web-farm environment.
The point I'm trying to make is that what you can't measure, you cant improve, and measurement, improvement and cost all go hand in hand. Dynamic scaling will reduce costs, but fundamentally if your application hasn't been measured and resource usage optimised, asking how many instances you need is pointless.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string