Databricks REST API throttling and capacity restrictions/limits - apache-spark

I've scaled up the hardware on an Azure Databricks cluster (an "all-purpose" cluster) so that it should handle a very large amount of work. The application is designed so that incoming data is processed in smallish, discrete chunks. Each job runs in roughly 20 to 30 seconds, but there is a high degree of concurrency: anywhere from 0 to 50 jobs may need to execute at the same time.
The only approach for delivering jobs to the cluster in Azure Databricks seems to be their REST API (doc: https://docs.databricks.com/dev-tools/api/latest/jobs.html ).
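For illustration, a one-off run submission through that API looks roughly like this (a minimal sketch; the workspace URL, token, cluster id, and notebook path are placeholders):

```typescript
// Minimal sketch: submit one chunk as a one-off run via the Jobs REST API
// (Runs Submit). Workspace URL, token, cluster id, and notebook path are placeholders.
const WORKSPACE = "https://adb-0000000000000000.0.azuredatabricks.net"; // placeholder
const TOKEN = process.env.DATABRICKS_TOKEN!;                            // personal access token

async function submitChunk(chunkId: string): Promise<number> {
  const resp = await fetch(`${WORKSPACE}/api/2.0/jobs/runs/submit`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      run_name: `chunk-${chunkId}`,
      existing_cluster_id: "0000-000000-placeholder", // the all-purpose cluster
      notebook_task: {
        notebook_path: "/Jobs/ProcessChunk",          // placeholder notebook
        base_parameters: { chunk_id: chunkId },
      },
    }),
  });
  if (!resp.ok) throw new Error(`Submit failed: ${resp.status}`);
  const { run_id } = await resp.json();
  return run_id; // poll runs/get with this id to track completion
}
```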
Everything behaves normally until the number of concurrent jobs reaches 10 or so. At that point I see an unreasonable deterioration in throughput, but when I check Ganglia or our custom telemetry, nothing explains the degraded performance.
My suspicion is that the REST API itself is introducing an artificial bottleneck and that Databricks is throttling the number of jobs I can send to my cluster. This was not self-evident to me. If I am paying for a large cluster, I should be allowed to send jobs to it. The REST API seems to do little more than serve as a communication channel that transmits my requests to my cluster, and it is the last place I would expect to find a resource bottleneck; a Spark developer would naturally investigate their code first, then the cluster hardware. The REST API is not a reasonable place for Databricks to introduce additional, undocumented limitations.
Does anyone know of another way to transmit distinct jobs to a cluster without going through the REST API? E.g., is there a way for the driver node in the cluster to spawn additional, distinct, first-class jobs without being counted against our REST API allowance?
This issue seems silly and artificial, and the undocumented nature of these limits is bothersome as well. If they are throttling the REST API, then there should be a warning, an error, or a Ganglia chart for it; otherwise developers are left to diagnose the performance problem by trial and error and guesswork.
Any help is appreciated. I'd prefer not to go all the way back to the drawing board because of an artificial restriction in their REST API (one that was probably put in place to protect an underpowered "control plane").

Spark is awesome, but it isn't designed to be a high-concurrency database. The folks at Databricks have done a lot to lift the concurrency limitations of Spark, but it still isn't a high-concurrency solution.
In other words, your problem isn't the REST API ... it's the Spark engine in Databricks.
I know you don't want to go back to the drawing board, but the choices here are all bad ones:
You can run multiple Databricks clusters ( https://docs.databricks.com/clusters/index.html ) and use NGINX or some other load balancer to distribute the API requests. This will get expensive quickly, but it avoids a redesign (see the sketch after this list).
If your use case supports it, try a real-time database that supports high concurrency. I like Druid (see https://druid.apache.org , or https://imply.io if you want a managed version), but there are others in the same category.
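For the first option, the "load balancer" can be as simple as client-side round-robin over cluster ids. A rough sketch, where the cluster ids are placeholders and submitToCluster stands in for the Runs Submit call shown in the question:

```typescript
// Rough sketch: spread the one-off run submissions across several
// all-purpose clusters via client-side round-robin.
const CLUSTER_IDS = [
  "0000-000000-cluster0", // placeholder ids
  "0000-000000-cluster1",
  "0000-000000-cluster2",
];

let next = 0;
function pickCluster(): string {
  const id = CLUSTER_IDS[next % CLUSTER_IDS.length];
  next += 1;
  return id;
}

// Stand-in for the Runs Submit call from the question, with
// existing_cluster_id set to `clusterId` instead of a fixed value.
async function submitToCluster(clusterId: string, chunkId: string): Promise<void> {
  console.log(`submit chunk ${chunkId} to cluster ${clusterId}`);
}

async function dispatch(chunkIds: string[]): Promise<void> {
  await Promise.all(chunkIds.map((c) => submitToCluster(pickCluster(), c)));
}
```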

Related

Pacemaker/Corosync/PostgreSQL cluster failovers during heavy load

At our company we're running PostgreSQL on 4-node clusters using Pacemaker and Corosync.
During heavy batch loads we suffer from cluster failovers because the built-in resource monitoring times out when it tries to access the database (because, well, the server is overloaded...).
On the one hand, it's understandable cluster behaviour that a 'self-induced denial of service' should trigger a master switchover; on the other hand, we'd rather not see our batches and service (temporarily) aborted because of it. A standalone server would have just pulled through. Obviously we're looking into optimizing and spreading out the batches, but that's like putting out one fire only to have another pop up elsewhere.
I looked into Linux cgroups, but that doesn't seem to be a viable solution, since all it does is CPU/IO-limit your PostgreSQL resource, which is part of the problem :-)
Any ideas or suggestions very much appreciated!

Logstash vs. a task queue for a Node.js application

I have a Node.js web server where there seem to be a couple of bottlenecks preventing it from functioning fully at peak load:
logging multiple events to our SQL server
logging multiple events to our Elastic cluster
Under heavy load, both SQL and Elastic seem to reject my requests and/or slow down considerably, so I've decided to reduce the load on these DBs via Logstash (for Elastic) and an async task queue (for SQL).
Since I'm working with limited time, I'm wondering whether I can solve these issues with just an async task queue (like kue) to which I can push both the SQL and Elastic logs.
Do I need to implement Logstash as well, or does an async queue solve the same problem as Logstash?
You can try the Async library's queue feature and run it in a child process or, much better, on a separate server as a queue microservice. That way you move the load to another location, giving your app server a boost.
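A minimal sketch of the in-process version, assuming the "async" npm package; writeToSql and writeToElastic are stand-ins for your existing logging calls, and the concurrency of 5 is an arbitrary starting point:

```typescript
// Buffer log writes behind an in-process async queue so request handlers
// never wait on SQL or Elasticsearch directly.
import { queue } from "async";

interface LogEvent {
  target: "sql" | "elastic";
  payload: Record<string, unknown>;
}

async function writeToSql(payload: Record<string, unknown>): Promise<void> {
  /* existing SQL insert goes here */
}
async function writeToElastic(payload: Record<string, unknown>): Promise<void> {
  /* existing Elasticsearch index call goes here */
}

// Worker drains one event at a time; at most 5 writes in flight against the DBs.
const logQueue = queue<LogEvent>(async (event) => {
  if (event.target === "sql") await writeToSql(event.payload);
  else await writeToElastic(event.payload);
}, 5);

// In request handlers, push and return immediately instead of awaiting the DB.
export function logEvent(event: LogEvent): void {
  logQueue.push(event);
}
```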
Since you mentioned you are using Azure, I would strongly recommend using their queue solution plus a couple of Azure Functions to handle reading from the queue and processing.
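A minimal sketch of the producer side, assuming the @azure/storage-queue package; the connection string and queue name are placeholders, and the consumer would be a queue-triggered Azure Function:

```typescript
// Push log events onto an Azure Storage queue and let a queue-triggered
// Azure Function do the actual SQL/Elastic writes.
import { QueueClient } from "@azure/storage-queue";

const queueClient = new QueueClient(
  process.env.AZURE_STORAGE_CONNECTION_STRING!, // placeholder connection string
  "log-events"                                  // placeholder queue name
);

export async function enqueueLogEvent(event: object): Promise<void> {
  await queueClient.createIfNotExists();
  // Messages are plain text (keep them under ~64 KB); base64-encode to match
  // the queue trigger's default message encoding.
  await queueClient.sendMessage(Buffer.from(JSON.stringify(event)).toString("base64"));
}
```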
I've rolled my own solution before, using Node.js and RabbitMQ with Node workers to read from the queue and write to Elasticsearch, because that solution couldn't go into the cloud. It works and is robust, but it takes quite a bit of configuration and custom code that becomes a maintenance nightmare. I ripped it out as soon as I could.
The benefits of using the Azure service are:
Very little bespoke configuration is required.
Less custom code === fewer bugs & less maintenance.
Scaling isn't an issue; some huge businesses rely on this.
No 2am support calls; if Azure is down, they are going to fix it... fast.
Much cheaper: unless the throughput is massive and constant, the scaling model costs less, and Azure Functions are perfect since you won't have servers sitting there doing nothing when the queue is empty.
Same could be said for AWS and Google Cloud.

Node.js scaling out on Kubernetes

I built an app on Node.js using Docker, and I'm not sure how to scale it on a Kubernetes cluster so that I get the most out of my cluster hardware.
From a performance perspective, which of the following is better:
clusterize my Node app and run as many containers as needed,
or
just run as many containers as needed without clustering?
When I say clustering I mean this: https://nodejs.org/api/cluster.html
My app is a simple CRUD API backed by MongoDB. We estimate that it will have 1000 concurrent users. Our cluster has 3 nodes.
The Node.js cluster mechanism lets Node.js make effective use of more than a single core, so it may benefit you, but that depends heavily on your code, your dependencies, and how well they work (or not) with clustering.
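To make the comparison concrete, this is roughly what the clustered variant looks like (a minimal sketch; inside a Kubernetes pod you would typically size the fork count to the pod's CPU limit rather than the node's core count):

```typescript
// Minimal sketch of the Node.js cluster mechanism: the primary process forks
// one worker per CPU core and the workers share the same listening socket.
import cluster from "node:cluster";
import { cpus } from "node:os";
import http from "node:http";

if (cluster.isPrimary) {
  for (let i = 0; i < cpus().length; i++) {
    cluster.fork();
  }
  cluster.on("exit", () => cluster.fork()); // replace a crashed worker
} else {
  http
    .createServer((req, res) => res.end(`handled by pid ${process.pid}\n`))
    .listen(3000);
}
```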
As a general practice, if you can break your containers down into nicely parallelized workloads that can run as pods within Kubernetes, then I'd recommend the following process to see what works for you:
Set up a single pod with your code in it and run a load test against it. Use the data Kubernetes collects from cAdvisor to characterize how many resources (CPU & memory) your pod wants.
Set a resource limit for CPU and memory based on what you see above.
Run a load test to validate how much load your single pod handles.
From there, you have a baseline you can use to scale horizontally with Kubernetes and validate the 1000-concurrent-user target you want to achieve.
There's a good talk on this process from the 2017 KubeCon called "Load Testing Kubernetes: How to optimize your cluster resource allocation in production".
Once you have a baseline, you can run a prototype that leverages the clustering in your code and compare it against the non-clustered version. If you do this, I'd double-check that any CPU limits you set are > 1 core, or you'll be limiting access to multiple cores from outside the NodeJS runtime, which would defeat the purpose of using clustering.
Depending on what you're doing in your code, there may be significant rework needed to enable clustering, as it wants to leverage its own worker concept, and it's not clear what frameworks you're using or whether they'll fit reasonably into that structure.

Dynamic Service Creation to Distribute Load

Background
The problem we're facing is that we are doing video encoding and want to distribute the load to multiple nodes in the cluster.
We would like to constrain the number of video encoding jobs on a particular node to some maximum value. We would also like to have small video encoding jobs sent to a certain grouping of nodes in the cluster, and long video encoding jobs sent to another grouping of nodes in the cluster.
The idea behind this is to help maintain fairness amongst clients by partitioning the large jobs into a separate pool of nodes. This helps ensure that the small video encoding jobs are not blocked / throttled by a single tenant running a long encoding job.
Using Service Fabric
We plan on using an ASF service for the video encoding. With this in mind we had an idea of dynamically creating a service for each job that comes in. Placement constraints could then be used to determine which pool of nodes a job would run in. Custom metrics based on memory usage, CPU usage ... could be used to limit the number of active jobs on a node.
With this method, the node distributing the jobs would have to poll to determine whether a new service satisfying the placement constraints and metrics can currently be created.
Questions
What happens when a service can't be placed on a node? (Using CreateServiceAsync I assume?)
Will this polling be prohibitively expensive?
Our video encoding executable is packaged along with the service and is approximately 80 MB. Will this make spinning up a new service take a long time (minutes vs. seconds)?
As an alternative to this we could use a reliable queue based system, where the large jobs pool pulls from one queue and the small jobs pool pulls from another queue. This seems like the simpler way, but I want to explore all options to make sure I'm not missing out on some of the features of Service Fabric. Is there another better way you would suggest?
I have no experience with placement constraints and dynamic services, so I can't speak to that.
Polling the perf counters isn't terribly expensive; that being said, it's not a free operation. A one-second poll interval shouldn't cause any huge perf impact while still providing a decent degree of resolution.
The service packages get copied to each node at deployment time rather than when services get spun up, so it'll make the deployment a bit slower but not affect service creation.
You're going to want to put the job data in reliable collections any way you structure it, but the question is how. One idea I just had that might be worth considering is making the job-processing service a partitioned service and basing your partitioning strategy on encoding job size and/or tenant, so that large jobs from the same tenant end up in the same queue and smaller jobs for other tenants go elsewhere (a rough sketch of the idea follows below).
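To make that concrete, the partition key could be derived from tenant plus a size bucket, something like this rough sketch (the bucket boundary and partition count are placeholders; in Service Fabric this value would map onto your ranged partition scheme):

```typescript
// Rough sketch of deriving a partition key from tenant + job size, so large
// jobs from one tenant land in the same partition/queue and small jobs from
// other tenants go elsewhere.
const PARTITION_COUNT = 8;         // placeholder
const LARGE_JOB_SECONDS = 10 * 60; // placeholder: >10 min counts as "large"

function hash(s: string): number {
  let h = 0;
  for (const ch of s) h = (h * 31 + ch.charCodeAt(0)) | 0;
  return Math.abs(h);
}

function partitionKey(tenantId: string, estimatedSeconds: number): number {
  const bucket = estimatedSeconds > LARGE_JOB_SECONDS ? "large" : "small";
  // Same tenant + same bucket always maps to the same partition.
  return hash(`${tenantId}:${bucket}`) % PARTITION_COUNT;
}
```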
As an aside, one thing I've dealt with in the past is that SF remoting limits the size of the messages sent and throws if a message is too big, so if your video files are being passed from service to service, you're going to want to consider a paging strategy for inter-service communication.

How much latency is there transferring data to the Windows Azure Worker Role External Endpoint?

I have an app that I'm thinking about moving to Azure as a Worker Role with an external-facing endpoint. It's a small process that runs in about 200-400 ms, but our users would like to start running it 50K-100K times a day, per user. Before I go building the Azure prototype, I need to figure out what kind of latency I can expect when communicating with an Azure external endpoint. Obviously, the latency depends on the size of the information I'm sending and receiving, and on the speed of my internet connection, but I can't find any metrics anywhere. Are there any baseline numbers out there?
For the sake of argument, let's say I'm on a T1 and I'm sending 10K up and 10K down with each job run.
I don't think latency is exactly the term you're looking for; that's the delay in getting each packet across the network, which is affected more by your distance from the server and the nature of your network.
Having said that, everyone's results with respect to latency will be different; the only way to be sure is to set up a prototype and run some performance tests against it. Also remember that with Azure you can specify your data center, so select one near you.
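If you just want a rough sense of scale before prototyping, here's a back-of-envelope sketch with the T1 / 10 KB numbers from the question (the round-trip time is a placeholder; real results will vary with TLS handshakes and datacenter distance):

```typescript
// Back-of-envelope estimate for the T1 / 10 KB example from the question.
const T1_BITS_PER_SEC = 1.544e6; // nominal T1 line rate
const PAYLOAD_BYTES = 10 * 1024; // 10 KB each way
const RTT_MS = 50;               // placeholder round-trip time to the datacenter

const transferMsOneWay = ((PAYLOAD_BYTES * 8) / T1_BITS_PER_SEC) * 1000; // ~53 ms
const roughTotalMs = 2 * transferMsOneWay + RTT_MS;                      // ~156 ms

console.log(`~${transferMsOneWay.toFixed(0)} ms per direction, ~${roughTotalMs.toFixed(0)} ms of network overhead per call`);
```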
