After seeing some latency on our Azure Storage, and especially on some File Shares, we configured a Log Analytics workspace to see what's going on. I was surprised by the number of failed queries (over a 4-hour window):
To explain the situation: we have one entity that scans this shared folder to check whether there is a file to consume, so the "ErrorCount" might represent the number of times the entity scans the folder.
Is that a common behavior?
"StatusCode" = 0 , what does it mean ?
Azure experts,
I am a DS recently trying to manage the pipelines after our DE left, so I am quite new to Azure. I recently ran into an issue where our Azure Storage usage increased quite significantly. I found some clues but am now stuck, and I would be really grateful if an expert could help me through it.
Looking into the usage, the increase seems to be mainly due to a significant rise in transactions and ingress for the ListFilesystemDir API.
I traced the usage profile back to when our pipelines run and located the main usage in a Copy Data activity, which copies data from a folder in ADLS to a SQL DW staging table. The amount of data actually transferred is only tens of MB, similar to before, so that should not be the reason for the substantial increase. The main difference I found is in the transactions and ingress for the ListFilesystemDir API, which show an ingress volume of hundreds of MB and ~500k transactions during the pipeline run. The surprising thing is that the usage of ListFilesystemDir was very small before (tens of kB and ~1k transactions).
[Screenshot: recent storage usage of the ListFilesystemDir API]
Looking at the pipeline, I noticed it uses a file path wildcard to filter and select the most recently updated files within a higher-level folder.
[Screenshot: the pipeline]
My questions are:
Has there been any change to the usage, limits, or definition of the ListFilesystemDir API that could cause such an issue?
Does applying a wildcard path filter mean Azure needs to list all the files under the root folder and then filter them, leading to a huge number of transactions (see the sketch just after these questions)? Although it didn't seem to cause this issue before, has there been any change to how this wildcard filter uses the API?
Any suggestions for solving the problem?
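To make question 2 concrete, here is a rough sketch of what a "list everything, then filter" pass looks like client-side, using the azure-storage-file-datalake SDK. The container, folder, and wildcard are made-up placeholders, and this only illustrates why a broad wildcard can translate into many list transactions; it is not a description of what Data Factory actually does internally.

# Illustration only: enumerate a folder tree, then filter client-side.
# Container name, folder, and pattern below are hypothetical placeholders.
import fnmatch
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string("<connection-string>")
fs = service.get_file_system_client("my-container")
pattern = "landing/*/2023/*.csv"  # hypothetical wildcard

matched = []
# get_paths(recursive=True) walks the whole subtree; every page of results
# is a separate list call against the storage account.
for path in fs.get_paths(path="landing", recursive=True):
    if not path.is_directory and fnmatch.fnmatch(path.name, pattern):
        matched.append(path.name)

print(f"{len(matched)} files matched the wildcard")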
Thank you very much!
I use Azure Kubernetes Service, which is connected to a Log Analytics workspace.
The Log Analytics workspace collects too much data, which is quite expensive.
After googling ways to reduce the costs, I found a few recommendations, but most of them are about reducing the ContainerLogs size. My case is different: I do need all of ContainerLogs, but nothing else.
In the image you can see the result of running this query:
union withsource = tt *
| where TimeGenerated > ago(1day)
| where _IsBillable == true
| summarize BillableDataMBytes = sum(_BilledSize)/ (1000. * 1000.) by tt
| render piechart
As you can see, I need only 3% of all the stored data; the remaining 97% I want to disable or reduce (collect new values much less frequently). Is that possible?
Is it at least possible to disable or reduce the "Perf" table?
The Basic Logs and Archive Logs options can help you. Please take a look at this article.
You can find an explanation of Basic Logs below.
The first option is the introduction of "Basic Logs". This sits alongside the current option, called "Analytics Logs". Basic logs allow you to designate specific tables in your Log Analytics workspace as "basic". This then means that the cost of the data ingestion is reduced; however, some of the features of Log Analytics are not available on that data. This means that you cannot configure alerts on this data, and the options for querying the data are limited. You also pay a cost for running a query against the data.
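As a concrete illustration, switching a table to the Basic plan is a per-table setting on the workspace. A minimal sketch using the ARM tables endpoint follows; the subscription, resource group, workspace, and table names are placeholders, and the api-version may have changed since this was written, so treat it as an outline rather than a verified call.

# Sketch: set a Log Analytics table to the "Basic" plan via the ARM tables API.
# Subscription, resource group, workspace, and table names are placeholders.
import requests
from azure.identity import DefaultAzureCredential

sub, rg, ws, table = "<subscription-id>", "<resource-group>", "<workspace>", "ContainerLogV2"
url = (
    f"https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}"
    f"/providers/Microsoft.OperationalInsights/workspaces/{ws}/tables/{table}"
    "?api-version=2022-10-01"  # check the current api-version before using
)

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
resp = requests.patch(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"properties": {"plan": "Basic"}},
)
resp.raise_for_status()
print(resp.json()["properties"]["plan"])

Note that only certain tables support the Basic plan, so check the documentation for the table you want to change.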
For the Archive Logs:
The second change is adding an option for archiving logs. It is often necessary to keep certain logs for an extended period due to company or regulatory requirements. Log Analytics does support retention of logs for up to two years, but you pay a retention cost that is relatively high because the data is kept in live tables that can be accessed at any time.

Archive logs allow you to move the data into an offline state where it cannot be accessed directly but is significantly cheaper. Archive data is charged at $0.025 per GB per month, compared to $0.12 per GB per month for standard data retention. If you need to access the archive data, you pay an additional fee for either querying the archive or restoring it to active tables. This costs $0.007 per GB of data scanned for querying, or $0.123 per GB per day of data restored (so effectively the same as standard data retention). These prices were correct at the time of publishing; please check the Azure Monitor pricing page.
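To make the difference concrete, here is a quick back-of-the-envelope comparison using the per-GB prices quoted above; the retained volume is an arbitrary example.

# Rough cost comparison using the per-GB prices quoted above.
retained_gb = 1000                          # example: keep 1 TB of logs for 12 months
archive_cost = retained_gb * 0.025 * 12     # $0.025/GB/month archive tier
standard_cost = retained_gb * 0.12 * 12     # $0.12/GB/month standard retention
print(f"archive: ${archive_cost:,.0f} vs standard: ${standard_cost:,.0f}")
# -> archive: $300 vs standard: $1,440 per year for this example volume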
The storage consumption on our ADLS Gen2 account rose from 5 TB to 314 TB within 10 days and has remained steady since then. It has just two containers: the $logs container and a container with all the directories for data storage. The $logs container looks empty. I have tried looking at Folder Statistics in Azure Storage Explorer on the other container, and none of the directories seems big enough to account for it.
Interestingly, Folder Statistics ran on one of the directories for a few hours, so I cancelled it. On cancellation, the partial result showed 200+ TB and 88k+ blobs in it. A visual inspection of the directory showed just a handful of blobs that would barely add up to 1 GB. This directory had been present for months without issue. Regardless, I deleted it and checked the storage consumption after a few hours, but could not see any change.
This brings me to my questions:
If I cancel an ongoing Folder Statistics run, could it show an incorrect partial result (in the above case it showed 200+ TB, whereas in reality it looked like barely 1 GB)? I have cancelled it on previous occasions, and even the partial stats seemed plausible then.
Could there be hidden blobs in ADLS Gen2 that might not show up on visual inspection? (I have Read, Write, and Delete access, if that matters.)
I have run Folder Statistics in Azure Storage Explorer for all folders individually, but is there a better way to get the storage consumption in one go (at least broken down by directory and sub-directory; I suppose blob level would be overkill, but whatever works)? I have access to Databricks with a mount point to this container and can create a cluster with the required runtime if such code is specific to one.
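Since a Databricks mount point is available, one possible approach, sketched below for a Databricks notebook (where dbutils is predefined), is to walk the mount and sum file sizes per top-level directory. The mount path is a placeholder and the recursion is naive, so it will be slow on very large trees; treat it as a starting point rather than a polished tool.

# Sketch for a Databricks notebook: per-top-level-directory size under a mount.
# "/mnt/datalake" is a placeholder for the actual mount point of the container.
def tree_size(path):
    """Recursively sum file sizes (bytes) under the given DBFS path."""
    total = 0
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            total += tree_size(entry.path)
        else:
            total += entry.size
    return total

root = "/mnt/datalake"
for entry in dbutils.fs.ls(root):
    if entry.isDir():
        print(f"{entry.path}: {tree_size(entry.path) / 1024**4:.2f} TiB")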
Update: We found the cause of the increase. It was, in fact, a few copy activities created by our team.
Interestingly, when we deleted the data, it took about 48 hours before the storage graph actually started going down, although the files disappeared immediately.
This was not a delay in refreshing the consumption graph; rather, it actually took that long before we saw the expected sharp dip in storage.
We raised a Microsoft case and they confirmed that such an amount of data can take time to actually delete in the background.
We have an application which is quite scalable as it is. Basically, you have one or more stateless nodes that all do independent work on files that are read from and written to a shared NFS share.
This NFS share can be a bottleneck, but with on-premises deployments customers just buy a big enough box to get sufficient performance.
Now we are moving this to Azure, and I would like a better, more "cloudy" way of sharing data :) since running Linux NFS servers isn't an ideal scenario if we need to manage them ourselves.
Is Azure Blob storage the right tool for this job (https://azure.microsoft.com/en-us/services/storage/blobs/)?
we need good scalability, e.g. up to 10k files written per minute
files are quite small, less than 50KB per file on average
files are created and read, not changed
files are short-lived; we purge them every day
I am looking for more practical experience with this kind of storage and how good it really is.
There are two possible solutions to your request: Azure Storage Blobs (recommended for your scenario) or Azure Files.
Azure Blobs has the following scaling targets: [scalability targets table not reproduced here]
It doesn't support being attached to a server as a network share.
Blobs do not support a hierarchical file structure beyond containers. Virtual folders can be used, but the downside is that you can't delete a container while it still contains blobs (relevant to your point about purging); however, there are ways to do the purging with your own code (see the sketch below).
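As one illustration of purging with your own code, here is a minimal sketch using the azure-storage-blob Python SDK that deletes blobs older than one day; the container name and connection string are placeholders, and for large volumes a lifecycle management policy or batched deletes would likely be preferable.

# Sketch: delete blobs older than one day from a container.
# Container name and connection string are placeholders.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>", "work-files")
cutoff = datetime.now(timezone.utc) - timedelta(days=1)

for blob in container.list_blobs():
    if blob.last_modified < cutoff:
        container.delete_blob(blob.name)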
Azure Files:
Recommended links:
Comparison between Azure Files and Blobs: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
Informative SO post here
My application relies heavily on a queue in Windows Azure Storage (not Service Bus). Until two days ago it worked like a charm, but all of a sudden my worker role is no longer able to process all the items in the queue. I've added several counters, and from that data I deduced that deleting items from the queue is the bottleneck. For example, deleting a single item from the queue can take up to 1 second!
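For reference, here is a minimal sketch of the kind of per-delete timing described above, written against the current azure-storage-queue Python SDK rather than the classic worker-role libraries this setup presumably uses; the queue name and connection string are placeholders.

# Sketch: time how long each queue message delete takes.
# Queue name and connection string are placeholders.
import time
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<connection-string>", "work-queue")

for msg in queue.receive_messages(messages_per_page=32):
    start = time.perf_counter()
    queue.delete_message(msg)
    elapsed = time.perf_counter() - start
    if elapsed > 0.5:
        print(f"slow delete: {elapsed:.2f}s for message {msg.id}")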
In the SO post "How to achive more 10 inserts per second with azure storage tables" and on the MSDN blog post http://blogs.msdn.com/b/jnak/archive/2010/01/22/windows-azure-instances-storage-limits.aspx I found some info on how to speed up communication with the queue, but those posts only look at inserting new items. So far, I haven't been able to find anything on why deleting queue items should be slow. So the questions are:
(1) Does anyone have a general idea why deletion may suddenly be slow?
(2) On Azure's status pages (https://azure.microsoft.com/en-us/status/#history) there is no mention of any service disruption in West Europe (which is where my resources are located); can I rely on the status pages?
(3) In the same storage account, I have a lot of data in blobs and tables. Could that amount of data interfere with the ability to delete items from the queue? Also, does anyone know what happens if you're pushing the data limit of 2 TB?
1) Sorry, no. Not a general one.
2) Can you rely on the status pages? They certainly will give you information, but there is always a lag between the time an issue occurs and when it shows up on the status board. They are getting better at automating the updates, and in the management portal you are starting to see them notify you if your particular deployments might be affected. With that said, it is not unheard of for small issues to crop up from time to time that never show up on the board because they don't break the SLA or are resolved extremely quickly. It's good you checked this, though; it's usually a good first step.
3) In general, no: the amount of data you have within a storage account should NOT affect your throughput; however, there is a limit to the throughput you'll get on a storage account (regardless of the amount of data stored). You can read about the Storage Scalability and Performance targets, but the throughput target is up to 20,000 entities or messages per second for all access to a storage account. If you have a LOT of applications or systems attempting to access data in this same storage account, you might see some throttling or failures as you approach that limit. Note that, as you saw with the posts on improving insert throughput, these are performance targets, and how your code is written and the configuration you use have a drastic effect on this. The data limit for a storage account (everything in it) is 500 TB, not 2 TB. I believe that once you hit the actual storage limit all writes will simply fail until more space is available (I've never even gotten close to it, so I'm not 100% sure on that).
Throughput is also limited at the partition level, and for a queue the target is up to 2,000 messages per second, which you clearly aren't getting at all. Since you have only a single worker role, I'll take a guess that you don't have that many message producers either, at least not enough to get near 2,000 messages per second.
I'd turn on storage analytics to see if you are getting throttled, and check out the AverageE2ELatency and AverageServerLatency values (as Thomas also suggested in his answer) recorded in the $MetricsMinutePrimaryTransactionQueue table that analytics turns on. This will give you an idea of trends over time and may help determine whether it is a latency issue between the worker roles and the storage system.
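If it helps, the minute-metrics table can be read like any other table through the Table endpoint once it exists. A rough sketch with the azure-data-tables Python SDK is below; the connection string is a placeholder, the table only appears after minute metrics have been enabled for the queue service, and if the SDK objects to the $-prefixed name, a raw REST call against the table endpoint works the same way.

# Sketch: read queue minute-metrics and print E2E vs. server latency.
# Requires Storage Analytics minute metrics to be enabled for the queue service.
from azure.data.tables import TableClient

metrics = TableClient.from_connection_string(
    "<connection-string>", table_name="$MetricsMinutePrimaryTransactionQueue"
)

# Each row is one minute of aggregated metrics per API/transaction type.
for row in metrics.query_entities("RowKey eq 'user;All'"):
    print(row["PartitionKey"],               # timestamp bucket, e.g. 20240101T0000
          "E2E:", row.get("AverageE2ELatency"),
          "Server:", row.get("AverageServerLatency"))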
The reason I asked about the size of the VM for the worker role is that there is an (unpublished) amount of throughput per VM based on its size. An XS VM gets much less of the total NIC throughput than larger sizes. You can sometimes get more than you expect across the NIC, but only if the other deployments on the physical machine aren't using their portion of that bandwidth at the time. This can often lead to varying performance for network-bound work when testing. I'd still expect much better throughput than what you are seeing, though.
There is a network between you and Azure Storage, which can add latency.
Sudden spikes (e.g. from 20 ms to 2 s) can happen often, so you need to handle this in your code.
To pinpoint the problem further (e.g. client issues, network errors, etc.), you can turn on storage analytics to see where the problem lies. There you can also see whether the end-to-end latency is too high or just the server latency is the limiting factor. The former usually points to network issues, the latter to something being wrong with the queue itself.
Usually such latency issues are transient (just temporary), and there is no need to announce them as a service disruption, because they aren't one. If performance is consistently bad, you should open a support ticket.