Apache Pulsar: storage write latency vs publish latency

While looking at the broker metrics, I found that the Pulsar broker exposes two different latency metrics: storage write latency and publish latency.
I'm aware that Apache Pulsar in general guarantees <5ms publish latency, and I've more or less confirmed that on AWS (I ran my own benchmark).
Nonetheless, the storage write latency metric is somewhat confusing, as shown below, because it generally does not fall below 5ms.
It would be helpful if anyone could explain what storage write latency means specifically and where (ZooKeeper, BookKeeper, broker, etc.) it comes from.

Related

Databricks REST API throttling and capacity restrictions/limits

I've scaled up the hardware on an azure-databricks cluster ("all-purpose" cluster) so that it should handle a very large amount of work. The application is designed so that incoming data is processed in smallish, discrete chunks. Each job runs in ~20 to 30 seconds, but many jobs need to execute concurrently (e.g. anywhere from 0 to 50 simultaneous jobs).
The only approach for delivering jobs to the cluster seems to be by way of their REST API in azure databricks (doc: https://docs.databricks.com/dev-tools/api/latest/jobs.html )
Everything behaves normally until the number of concurrent jobs reaches 10 or so. At that point I see an unreasonable deterioration in throughput. But if I check ganglia or custom telemetry, there appears to be no reason for the deteriorated performance.
My suspicion is that the REST API itself is introducing an artificial bottleneck and they are throttling the number of jobs I can send over to my cluster. This was not self-evident to me. If I am paying for a large cluster, I should be allowed to send jobs to it. The REST API seems to be doing little more than serving as a communication channel that allows me to transmit my requests to my cluster. That API is the last place I would expect to find a resource bottleneck. A Spark developer would naturally investigate their code, then the cluster hardware. The REST API is not a reasonable place for Databricks to be introducing some additional, secretive limitations.
Does anyone know of another way to transmit distinct jobs to a cluster without going through the REST API? E.g. is there a way for the driver node in the cluster to spawn additional/distinct/first-class jobs without being counted against our REST API allowance?
This issue seems silly and artificial. The secretive nature of these limits is bothersome to me as well. If they are throttling the REST API then there should be a warning, error, or ganglia chart for that. Otherwise developers will struggle with the performance issues using trial and error and guesswork.
Any help is appreciated. I'd prefer not to go all the way back to the drawing board, because of an artificial restriction in their REST API (one that was probably put in place to protect an underpowered "control plane").
Spark is awesome, but it isn't designed to be a high-concurrency database. The folks at Databricks have done a lot to lift the concurrency limitations of Spark, but it still isn't a high-concurrency solution.
In other words, your problem isn't the REST API ... it's the Spark engine in Databricks.
I know you don't want to go back to the drawing board, but the choices here are all bad ones:
You can run multiple Databricks clusters ( https://docs.databricks.com/clusters/index.html ) and use NGINX or some other load balancer to distribute the API requests. This will get expensive quickly, but it avoids a redesign.
If your use case supports it, try using a real-time database that supports high concurrency. I like Druid (see https://druid.apache.org or https://imply.io if you want a managed version), but there are others in the same category.
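The first option can also be approximated on the client side. The sketch below is only an illustration of round-robin distribution, not the NGINX setup itself: `distribute_round_robin`, the endpoint names, and the `submit` callback are all hypothetical, and `submit` stands in for whatever code actually calls the Jobs REST API of one cluster.

```python
import itertools
from typing import Callable, Iterable, List, Sequence, Tuple


def distribute_round_robin(
    jobs: Iterable[str],
    endpoints: Sequence[str],
    submit: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Hand each job to the next endpoint in turn.

    `submit(endpoint, job)` is a hypothetical callback that sends one job
    to one cluster and returns some identifier for the run.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    # itertools.cycle repeats the endpoints forever; zip stops when jobs run out.
    rotation = itertools.cycle(endpoints)
    return [(endpoint, submit(endpoint, job)) for job, endpoint in zip(jobs, rotation)]


if __name__ == "__main__":
    # Fake submit function so the sketch runs without any cluster.
    runs = distribute_round_robin(
        ["job-1", "job-2", "job-3"],
        ["cluster-a", "cluster-b"],
        lambda endpoint, job: f"{job}@{endpoint}",
    )
    for endpoint, run_id in runs:
        print(endpoint, run_id)
```

Round-robin ignores how busy each cluster is; a real load balancer would usually also take health checks or in-flight counts into account.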

How to get better performance with Azure ServiceBus Standard plan

I can't get more than 14 msg/second with the Azure ServiceBus Standard Plan. I'm running some benchmark tests with the Azure-Sample tool that I found in this question:
The test is done with a ServiceBus resource with a single Queue and all default configurations:
If I read this correctly, you've got a maximum concurrency of one (MaxInflightReceives) with 5 receivers (ReceiverCount). Increasing concurrency and enabling prefetch on the clients will increase the overall throughput. But keep the following in mind:
Testing should be done within the same Azure data centre. If you're testing from a local machine, you're introducing a substantial latency that cannot be avoided.
The receive mode used is PeekLock. It is slower than ReceiveAndDelete. Not suggesting to switch, but this needs to be taken into consideration as you're trading throughput for safety by using PeekLock.
The standard tier has a cap on the number of operations per second. In addition to that, your namespace is deployed in a shared environment with entities scattered in various deployment containers. Performance will vary and cannot be guaranteed. If you want to have a guaranteed throughput, use Premium SKU.
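To see why concurrency and round-trip latency dominate here, a rough upper bound follows from Little's law: throughput is at most the number of operations in flight divided by the time each one takes. The sketch below is back-of-the-envelope arithmetic only. The 70 ms round trip is an assumed figure chosen for illustration (it is not measured from the question), and `max_throughput` is a hypothetical helper, not part of any Service Bus SDK.

```python
def max_throughput(in_flight: int, round_trip_seconds: float) -> float:
    """Upper bound on messages per second from Little's law.

    in_flight: how many receive operations are outstanding at once.
    round_trip_seconds: assumed time for one receive-and-complete cycle.
    """
    if in_flight < 1:
        raise ValueError("in_flight must be at least 1")
    if round_trip_seconds <= 0:
        raise ValueError("round_trip_seconds must be positive")
    return in_flight / round_trip_seconds


if __name__ == "__main__":
    assumed_round_trip = 0.070  # assumption: 70 ms per message from a remote machine
    for in_flight in (1, 4, 16):
        rate = max_throughput(in_flight, assumed_round_trip)
        print(f"{in_flight:>2} in flight: at most {rate:.1f} msg/s")
```

With one operation in flight and a round trip of that order, the bound lands in the low teens of messages per second, which is why testing from inside the same data centre and raising concurrency matter so much. The Standard tier's own cap still applies on top of this.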

Store and forward to eventHub a lot of data

We're developing an Internet of Things application. Currently we get the data from the device and send it to EventHub on Azure.
We could have a lot of connection problems in the field, sometimes with offline periods that can last for days.
Is there a library that can store our messages locally and forward them to EventHub when a connection is available? We've heard about RabbitMQ, but the volume of data here is huge (around 3/4 GB per day). Can it handle that volume? Does anyone have experience with a similar scenario?
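For reference, the store-and-forward pattern being asked about can be sketched with nothing but a local SQLite file. This is a minimal illustration, not a recommendation of a specific library: `StoreAndForwardBuffer` and the `send` callback are hypothetical names, and `send` stands in for whatever code actually publishes one message to EventHub and raises an `OSError` (for example `ConnectionError`) while the device is offline.

```python
import sqlite3
from typing import Callable


class StoreAndForwardBuffer:
    """Durable local outbox: store every message, forward when possible."""

    def __init__(self, path: str = ":memory:") -> None:
        # Use a file path on the device so messages survive restarts.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload BLOB NOT NULL)"
        )
        self._db.commit()

    def store(self, payload: bytes) -> None:
        self._db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
        self._db.commit()

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def forward(self, send: Callable[[bytes], None], batch_size: int = 100) -> int:
        """Send stored messages oldest first; stop at the first failure.

        Returns the number of messages sent and removed from the outbox.
        """
        sent = 0
        while True:
            rows = self._db.execute(
                "SELECT id, payload FROM outbox ORDER BY id LIMIT ?", (batch_size,)
            ).fetchall()
            if not rows:
                return sent
            for row_id, payload in rows:
                try:
                    send(payload)  # hypothetical publisher supplied by the caller
                except OSError:
                    return sent  # still offline: keep the rest for later
                # Delete only after a successful send (at-least-once delivery).
                self._db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
                self._db.commit()
                sent += 1


if __name__ == "__main__":
    buffer = StoreAndForwardBuffer()
    buffer.store(b"reading-1")
    buffer.store(b"reading-2")
    print(buffer.forward(lambda payload: print("sent", payload)))
```

Because a message is deleted only after `send` succeeds, a crash between the two steps can produce a duplicate, so the receiving side should tolerate repeated messages.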

Deleting items from Azure queue painfully slow

My application relies heavily on a queue in Windows Azure Storage (not Service Bus). Until two days ago, it worked like a charm, but all of a sudden my worker role is no longer able to process all the items in the queue. I've added several counters and from that data deduced that deleting items from the queue is the bottleneck. For example, deleting a single item from the queue can take up to 1 second!
In the SO post "How to achive more 10 inserts per second with azure storage tables" and on the MSDN blog http://blogs.msdn.com/b/jnak/archive/2010/01/22/windows-azure-instances-storage-limits.aspx I found some info on how to speed up the communication with the queue, but those posts only look at insertion of new items. So far, I haven't been able to find anything on why deletion of queue items should be slow. So the questions are:
(1) Does anyone have a general idea why deletion suddenly may be slow?
(2) On Azure's status pages (https://azure.microsoft.com/en-us/status/#history) there is no mention of any service disruption in West Europe (which is where my stuff is situated); can I rely on the service pages?
(3) In the same storage, I have a lot of data in blobs and tables. Could that amount of data interfere with the ability to delete items from the queue? Also, does anyone know what happens if you're pushing the data limit of 2TB?
1) Sorry, no. Not a general one.
2) Can you rely on the service pages? They certainly will give you information, but there is always a lag from the time an issue occurs and when it shows up on the status board. They are getting better at automating the updates and in the management portal you are starting to see where they will notify you if your particular deployments might be affected. With that said, it is not unheard of that small issues crop up from time to time that may never be shown on the board as they don't break SLA or are resolved extremely quickly. It's good you checked this though, it's usually a good first step.
3) In general, no: the amount of data you have within a storage account should NOT affect your throughput. However, there is a limit to the amount of throughput you'll get on a storage account (regardless of the amount of data stored). You can read about the Storage Scalability and Performance targets, but the throughput target is up to 20,000 entities or messages a second for all access to a storage account. If you have a LOT of applications or systems attempting to access data in this same storage account, you might see some throttling or failures as you approach that limit. Note that, as you saw in the posts on improving insert throughput, these are performance targets, and how your code is written and the configuration you use have a drastic effect on what you actually get. The data limit for a storage account (everything in it) is 500 TB, not 2TB. I believe once you hit the actual storage limit all writes will simply fail until more space is available (I've never even got close to it, so I'm not 100% sure on that).
Throughput is also limited at the partition level, and for a queue that's a target of up to 2,000 messages per second, which you clearly aren't getting at all. Since you have only a single worker role, I'll take a guess that you don't have that many producers of the messages either, at least not enough to get near the 2,000 msgs per second.
I'd turn on storage analytics to see if you are getting throttled as well as check out the AverageE2ELatency and AverageServerLatency values (as Thomas also suggested in his answer) being recorded in the $MetricsMinutePrimaryTransactionQueue table that the analytics turns on. This will help give you an idea of trends over time as well as possibly help determine if it is a latency issue between the worker roles and the storage system.
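One way to read those two metrics side by side: AverageE2ELatency covers the whole request as seen by the storage service, including network transfer, while AverageServerLatency covers only the processing time inside the service, so the difference roughly approximates the time spent outside the server. The sketch below assumes you have already read both values (in milliseconds) from the metrics table; `break_down` and `likely_bottleneck` are hypothetical helpers, and the "larger share wins" rule is a deliberately crude assumption.

```python
from typing import NamedTuple


class LatencyBreakdown(NamedTuple):
    server_ms: float          # time spent inside the storage service
    outside_server_ms: float  # remainder: network transfer and client side


def break_down(e2e_ms: float, server_ms: float) -> LatencyBreakdown:
    """Split an end-to-end latency into server time and everything else."""
    if server_ms > e2e_ms:
        raise ValueError("server latency cannot exceed end-to-end latency")
    return LatencyBreakdown(server_ms, e2e_ms - server_ms)


def likely_bottleneck(e2e_ms: float, server_ms: float) -> str:
    """Crude heuristic: blame whichever side accounts for more of the time."""
    parts = break_down(e2e_ms, server_ms)
    if parts.server_ms >= parts.outside_server_ms:
        return "server"
    return "network or client"


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the question.
    print(break_down(1000.0, 20.0))
    print(likely_bottleneck(1000.0, 20.0))
    print(likely_bottleneck(1000.0, 900.0))
```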
The reason I asked about the size of the VM for the worker role is that there is an (unpublished) amount of throughput per VM based on its size. An XS VM gets much less of the total throughput on the NIC than larger sizes. You can sometimes get more than you expect across the NIC, but only if the other deployments on the physical machine aren't using their portion of that bandwidth at the time. This can often lead to varying performance for network-bound work when testing. I'd still expect much better throughput than what you are seeing though.
There is a network in between you and the Azure storage, which might degrade the latency.
Sudden peaks (e.g. from 20ms to 2s) can happen often, so you need to deal with this in your code.
To pinpoint this problem further down the road (e.g. client issues, network errors etc.), you can turn on storage analytics to see where the problem lies. There you can also see whether the end-to-end latency is too high or whether the server latency alone is the limiting factor. The former usually points to network issues, the latter to something being wrong on the Queue itself.
Usually those latency issues are transient (just temporary) and there is no need to announce them as a service disruption, because they aren't one. If performance is constantly bad, you should open a support ticket.
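Since such spikes are usually transient, a common way to deal with them in code is to retry the operation with exponential backoff and jitter. The following is a generic sketch under stated assumptions, not the retry policy built into any Azure SDK: `call_with_backoff` is a hypothetical helper, and `operation` stands in for whatever call deletes the queue message and raises `TimeoutError` or `ConnectionError` on a transient failure.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with exponential backoff.

    The wait before retry n is a random value between 0 and
    min(max_delay, base_delay * 2**n) seconds ("full jitter").
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_delete() -> str:
        # Simulated operation: fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TimeoutError("simulated latency spike")
        return "deleted"

    print(call_with_backoff(flaky_delete))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in production the default `time.sleep` is used.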

How much latency is there transferring data to the Windows Azure Worker Role External Endpoint?

I have an app that I'm thinking about moving to Azure as a Worker Role with an external facing endpoint. It's a small little process that runs in about 200-400ms, but our users would like to start running the little job 50K-100K times a day, per user. Before I go building the Azure prototype, I need to figure out what kind of latency I can expect communicating with an Azure external endpoint. Obviously, the latency depends on the size of information that I'm sending and receiving, and it depends on the speed of my internet connection, but I can't find any metrics anywhere. Are there any kind of base line numbers out there?
For the sake of argument, let's say I'm on a T1 and I'm sending 10K up and 10K down with each job run.
I don't think latency is exactly the term you're looking for. Latency is the delay in sending each packet over the network, which is affected more by your distance from the server and the nature of your network.
Having said that, everyone's latency results will be different; the only way to be sure is to set up a prototype and run some performance tests on it. Also remember that with Azure you can choose your data center, so select one near you.
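Before building the prototype, the numbers in the question allow a rough estimate of the parts that do not depend on Azure at all. The sketch below is arithmetic only and rests on assumptions: "10K" is read as 10,000 bytes, a T1 is taken at its nominal 1.544 Mbit/s, and protocol overhead and network round-trip time are ignored, so the real figures will be higher.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate
PAYLOAD_BYTES = 10_000          # assumption: "10K" means 10,000 bytes


def transfer_seconds(payload_bytes: int, link_bits_per_second: int) -> float:
    """Time to push a payload through the link, ignoring overhead and RTT."""
    return payload_bytes * 8 / link_bits_per_second


def daily_busy_hours(runs_per_day: int, seconds_per_run: float) -> float:
    """Total processing time per user per day, in hours."""
    return runs_per_day * seconds_per_run / 3600


if __name__ == "__main__":
    one_way = transfer_seconds(PAYLOAD_BYTES, T1_BITS_PER_SECOND)
    print(f"one-way transfer: {one_way * 1000:.1f} ms")
    print(f"up + down:        {2 * one_way * 1000:.1f} ms")
    for runs, seconds in ((50_000, 0.2), (100_000, 0.4)):
        hours = daily_busy_hours(runs, seconds)
        print(f"{runs} runs at {seconds} s each: {hours:.1f} busy hours per user per day")
```

Under these assumptions the payload transfer alone adds on the order of 100 ms per job, and the upper end of the workload keeps a single worker busy for roughly 11 hours per user per day, so the number of users matters at least as much as the network latency.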

Resources