How to upload a large PyTorch model to Azure?

I have a locally pretrained PyTorch model (.pth) that is ~500 MB in size. I have created an Azure Function that loads the model from Azure Storage, but loading takes over 5 minutes, which causes the HTTP request to the function to time out. How should I approach storing/loading a model of this size?
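One common approach is to load the model once per function instance, at module scope, so only the first cold invocation pays the download and deserialization cost and warm invocations reuse the cached model. Below is a minimal Python sketch, assuming the azure-storage-blob SDK; the container/blob names and the AzureWebJobsStorage connection string setting are assumptions, not values from the question.

    import os
    import tempfile

    import azure.functions as func
    import torch
    from azure.storage.blob import BlobClient

    _model = None  # cached for the lifetime of this function instance


    def _load_model():
        global _model
        if _model is None:
            blob = BlobClient.from_connection_string(
                conn_str=os.environ["AzureWebJobsStorage"],  # assumed connection string setting
                container_name="models",                     # hypothetical container
                blob_name="model.pth",                       # hypothetical blob
            )
            # Stream the ~500 MB blob to a temp file rather than buffering it twice in memory.
            with tempfile.NamedTemporaryFile(suffix=".pth", delete=False) as f:
                blob.download_blob().readinto(f)
                path = f.name
            # Assumes the .pth file is a full pickled model; if it only holds a state_dict,
            # build the model class first and call load_state_dict instead.
            _model = torch.load(path, map_location="cpu")
        return _model


    def main(req: func.HttpRequest) -> func.HttpResponse:
        model = _load_model()
        # ... run inference with `model` on the request payload ...
        return func.HttpResponse("ok")

A cold start still pays the full download once; the sketch only removes the per-request cost.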

Related

HTTP API on Google Cloud using App Engine or Cloud Functions

I want to build an API using Python and host it on Google Cloud. The API will basically read some data from a bucket, do some processing on it, and return the result. I am hoping that I can read the data into memory, so that when a request comes in I just process it and send the response back with low latency. Assume I will read a few thousand records from some database/storage, and when a request comes in I process them and send 10 back based on the request parameters. I don't want to connect to or read from storage when the request comes in, as that would take time and I want to serve responses as fast as possible.
Will Google Cloud Functions work for this need, or should I go with App Engine? (Basically I want to be able to read the data once and hold it for incoming requests.) The data will mostly be less than 1-2 GB (max).
Thanks,
Manish
You have to keep the static data alongside your function code. Increase the Cloud Functions memory so the data can be loaded into memory, kept warm, and accessed very quickly.
Then you have two ways to achieve this:
Load the data at startup. You load it only once: the first call has high latency because it downloads the data (from GCS, for instance) and loads it into memory. The advantage is that when the data is updated you don't have to redeploy your function, only update the data in its location; at the next function (cold) start, the new data will be loaded. A minimal sketch of this pattern is shown below.
Deploy the function with the static data included in the deployment. This time the startup is much faster (no download), only the load into memory. But when you want to update the data, you have to redeploy your function.
A final word: if you have two sets of static data, you should have two functions. The responsibilities are different, so the deployments are different.
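A minimal Python sketch of the first option, using the functions-framework; the bucket and object names are hypothetical. The data is fetched from GCS on the first (cold) invocation and reused by warm instances:

    import json

    import functions_framework
    from google.cloud import storage

    _DATA = None  # cached per instance; survives across warm invocations


    def _get_data():
        global _DATA
        if _DATA is None:
            client = storage.Client()
            blob = client.bucket("my-static-data-bucket").blob("records.json")  # hypothetical names
            _DATA = json.loads(blob.download_as_bytes())
        return _DATA


    @functions_framework.http
    def serve(request):
        records = _get_data()
        # ... filter `records` based on request parameters ...
        return {"results": records[:10]}

Updating records.json in the bucket is then picked up by new instances without a redeploy, which is exactly the trade-off described in the first option.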

Node.js server getting latent when any backend dependent service gets latent

Our infra for the web application looks like this:
Node.js web application -> GraphQL + Node.js as middleware (BE for FE) -> lots of backend services in RoR -> DB/ES etc.
We have witnessed the whole GraphQL + Node.js middleware layer getting latent whenever any of the crucial backend services gets latent and request queuing starts happening. When we compared this with the number of requests during the latent period, it was <1k requests, which is much lower than the claimed 10k concurrent requests Node.js can handle. Looking for pointers to debug this issue further.
Analysis done so far from our end:
As per Datadog and the other APMs used to monitor system health, CPU and memory usage show no abnormal behaviour when the server gets latent.
We are using various request-tracking methods from the topmost layer to the last layer, and it is confirmed that request queuing is happening on this middleware layer only.

Should I use a Google Cloud Function or App Engine for connecting with Azure Cognitive Services and getting fast results?

Introduction
I am using the Azure Face API in my Google Cloud Function (I make around 3 or 4 HTTPS requests every time my function is called), but I am getting a really slow execution time: around 5 seconds.
Function execution took 5395 ms, finished with status: 'ok'
Function execution took 3957 ms, finished with status: 'ok'
Function execution took 2512 ms, finished with status: 'ok'
Basically, what I am doing in my cloud function is (roughly sketched in code after the list):
1. Detect a face using Azure
2. Save the face in the Azure LargeFaceList
3. Find 20 similar faces using Azure
4. Train the updated Azure LargeFaceList (if it is not being trained already)
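Roughly, those four calls look like the following sketch against the Face REST API v1.0; the endpoint, key, and list id are placeholders, and request/response details may differ by API version:

    import os

    import requests

    # Hypothetical values; the real endpoint/key come from the Azure portal.
    ENDPOINT = os.environ["FACE_ENDPOINT"]   # e.g. https://<resource>.cognitiveservices.azure.com
    KEY = os.environ["FACE_KEY"]
    HEADERS = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}
    LIST_ID = "my-large-face-list"           # hypothetical LargeFaceList id


    def process(image_url: str):
        # 1. Detect a face
        detect = requests.post(
            f"{ENDPOINT}/face/v1.0/detect",
            params={"returnFaceId": "true"},
            headers=HEADERS,
            json={"url": image_url},
        ).json()
        face_id = detect[0]["faceId"]

        # 2. Add the face to the LargeFaceList
        requests.post(
            f"{ENDPOINT}/face/v1.0/largefacelists/{LIST_ID}/persistedfaces",
            headers=HEADERS,
            json={"url": image_url},
        )

        # 3. Find up to 20 similar faces
        similar = requests.post(
            f"{ENDPOINT}/face/v1.0/findsimilars",
            headers=HEADERS,
            json={"faceId": face_id, "largeFaceListId": LIST_ID,
                  "maxNumOfCandidatesReturned": 20},
        ).json()

        # 4. Kick off training of the updated list
        requests.post(f"{ENDPOINT}/face/v1.0/largefacelists/{LIST_ID}/train", headers=HEADERS)

        return similar

Note that the four calls run sequentially, so each cross-region round trip between us-central1 and north-central-us adds to the total execution time.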
I have the Google Cloud Function located in us-central1 ('near' my Azure Face service, which is in north-central-us). I have assigned it 2 GB of memory and a timeout of 540 secs. I am in Europe.
Problem
As I said before, the function takes too long to complete its execution (from 3.5 to 5 seconds). I don't know if this is because of the "cold start" or because it simply takes that long to run the algorithm.
PS: The LargeFaceList currently only contains 10 faces (for 1,000 faces the training duration is 1 second, and for 1 million faces, 30 minutes).
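To tell cold start from per-call latency, one option is to log a timing around each Azure call inside the function; a minimal sketch, where detect_face and find_similar are hypothetical helpers standing in for the actual calls:

    import time


    def timed(label, fn, *args, **kwargs):
        # Log how long each external call takes so slow steps show up in the function logs.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
        return result


    # usage inside the cloud function, e.g.:
    # face = timed("detect", detect_face, image_url)        # detect_face is a hypothetical helper
    # similar = timed("find_similar", find_similar, face)   # find_similar is a hypothetical helper

If the first invocation after a deploy or an idle period is much slower than the rest, that difference is the cold start; the per-call timings show how much time the Face API calls themselves take.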
My Options
Run the code on:
1- Google Cloud Function (doing this now)
2- Google Cloud App Engine
I have been experimenting with Cloud Functions for the last 3 months, and I have never used the App Engine service.
My Question
Is it possible to use Firestore triggers on App Engine? And will I get a faster execution time if I move this code to App Engine?
With Cloud Functions, one instance of the function processes only one request at a time. If you have 2 concurrent requests, Cloud Functions creates 2 instances, and each request is processed on its own instance.
Thus, if you have 180 concurrent requests, you will have 180 function instances at the same time (up to 1,000 instances with the default quota).
Cloud Run runs on the same underlying infrastructure as Cloud Functions, but runs containers. One Cloud Run instance can handle up to 80 requests concurrently.
Therefore, for 180 concurrent requests you would need only 3 or 4 instances, not 180 as with Cloud Functions. And because you pay for processing time (CPU + memory), 180 Cloud Functions instances are more expensive than 3 Cloud Run instances.
I wrote an article on this.
In summary, serverless architectures are highly scalable and process requests in parallel. Think about the processing time of a single request, not about the maximum number of concurrent requests (consider concurrency mainly from a cost perspective).
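For illustration, a minimal Cloud Run style service in Python (Flask); the expensive setup (here just a shared HTTP session, standing in for a real Face client) happens once per container instance and is reused by the up to 80 requests that instance can serve concurrently. Names and routes are hypothetical:

    import os

    import requests
    from flask import Flask, request

    app = Flask(__name__)

    # Created once per container instance; shared by all concurrent requests
    # handled by this instance (up to the configured concurrency).
    session = requests.Session()


    @app.route("/", methods=["POST"])
    def handle():
        payload = request.get_json()
        # ... call the Face API (or another backend) through `session` and build a response ...
        return {"status": "ok"}


    if __name__ == "__main__":
        # Cloud Run provides the port via the PORT environment variable.
        app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))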

Tensorflow Serving number of requests in queue

I have my own TensorFlow Serving server for multiple neural networks. Now I want to estimate the load on it. Does somebody know how to get the current number of requests in the queue in TensorFlow Serving? I tried using Prometheus, but there is no such option.
Actually, TF Serving doesn't have a request queue, which means that TF Serving won't queue up requests if there are too many of them.
The only thing TF Serving does is allocate a thread pool when the server is initialized.
When a request comes in, TF Serving uses an unused thread to handle it; if there are no free threads, TF Serving returns an unavailable error, and the client should retry later.
You can find this information in the comments of tensorflow_serving/batching/streaming_batch_scheduler.h.
What's more, you can set the number of threads with --rest_api_num_threads, or leave it unset and let TF Serving configure it automatically.
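For example, on the client side one can back off and retry when the server reports it has no free thread; a rough Python sketch against the TF Serving REST API, with the URL and model name as placeholders (the server-side pool size would be set with --rest_api_num_threads when launching tensorflow_model_server):

    import time

    import requests

    # Hypothetical TF Serving REST endpoint and model name.
    URL = "http://localhost:8501/v1/models/my_model:predict"


    def predict_with_retry(instances, retries=5, backoff=0.2):
        # TF Serving returns an error (e.g. 503) when no worker thread is free,
        # so the client backs off and retries instead of queuing server-side.
        for attempt in range(retries):
            resp = requests.post(URL, json={"instances": instances})
            if resp.status_code == 200:
                return resp.json()
            time.sleep(backoff * (2 ** attempt))
        resp.raise_for_status()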

One controller, service, and model instance per request in Node.js

I am working on a large project, and we are reviewing the performance of the system. We are thinking of creating one (separate) instance of the controller, service, and model for every single request to make the code more readable, but I think it would affect the performance of the system.
Is it so?
