What are the limitations for zip file extraction in Azure Search?

My indexer shows an error for two zip files (size of 800 MB) saying that the zip file has "the size of ... bytes, which exceeds the maximum size for document extraction for your current service tier." Azure Search is already set to a Standard tier. Is the solution to go to a higher tier? Because from the documentation I gather that the limit is the same across all Standard tiers. If there is a limit to the size of zip files to be extracted, then what is it? And would the solution for bigger files then be to unzip them before indexing with Azure Search?

Maximum running times exist to provide balance and stability to the service as a whole, but larger data sets might need more indexing time than the maximum allows. If an indexing job cannot complete within the maximum time allowed, try running it on a schedule. Here are the indexer limits you can refer to.
Note: A service is provisioned at a specific tier. Jumping tiers to gain capacity involves provisioning a new service (there is no in-place upgrade). For more information, see Choose a SKU or tier. To learn more about adjusting capacity within a service you've already provisioned, see Scale resource levels for query and indexing workloads.
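To make "running it on a schedule" concrete, here is a minimal sketch using the Azure Search REST API from Python's requests library. The service name, indexer name, data source, index name, and API version below are placeholders/assumptions, so adjust them to your environment.

    import requests

    # Assumed placeholders - replace with your own service, key, and resource names.
    SERVICE = "my-search-service"
    API_KEY = "<admin-api-key>"
    API_VERSION = "2020-06-30"  # adjust to the API version your service uses

    url = (f"https://{SERVICE}.search.windows.net/indexers/blob-zip-indexer"
           f"?api-version={API_VERSION}")

    indexer = {
        "name": "blob-zip-indexer",           # hypothetical indexer name
        "dataSourceName": "blob-datasource",  # hypothetical data source
        "targetIndexName": "docs-index",      # hypothetical target index
        # Run every two hours so large data sets can be processed incrementally
        # instead of in one long-running job.
        "schedule": {"interval": "PT2H"},
    }

    resp = requests.put(
        url,
        json=indexer,
        headers={"api-key": API_KEY, "Content-Type": "application/json"},
    )
    resp.raise_for_status()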

Related

DTU utilization is not going above 56%

I'm doing a data load on Azure SQL Server using Azure Data Factory v2. I started the data load with the database set to the Standard pricing tier with 800 DTUs. It was slow, so I increased the DTUs to 1600. (My pipeline has been running for 7 hours.)
I decided to change the pricing tier. I changed it to Premium with the DTUs set to 1000. (I didn't make any additional changes.)
The pipeline failed as it lost the connection, so I reran it.
Now, when I monitor the pipeline, it is working fine, but when I monitor the database, the DTU usage on average is not going above 56%.
I am dealing with a tremendous amount of data. How can I speed up the process?
I expected the DTUs to max out, but the average utilization is around 56%.
Please follow this document: Copy activity performance and scalability guide.
It walks through the performance tuning steps.
One of the ways is to increase the Azure SQL Database tier to get more DTUs. You have already increased the tier to 1000 DTUs, but the average utilization is around 56%, so I don't think you need a higher pricing tier.
You need to think about other ways to improve performance, such as setting more Data Integration Units (DIUs).
A Data Integration Unit is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Azure Data Factory. Data Integration Unit only applies to Azure integration runtime, but not self-hosted integration runtime.
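For reference, this is roughly where the DIU setting lives in a copy activity definition. The fragment is expressed below as a Python dict for illustration only; the activity name, dataset references, and source/sink types are assumptions and vary by connector.

    # Sketch of a copy activity fragment (as a Python dict) showing where
    # dataIntegrationUnits is set. Names and dataset references are hypothetical.
    copy_activity = {
        "name": "CopyToAzureSql",
        "type": "Copy",
        "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SinkDataset", "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": "SqlSource"},   # source/sink types depend on the connector
            "sink": {"type": "SqlSink"},
            # Request more copy power; the service may scale this down
            # depending on the source/sink pair and data pattern.
            "dataIntegrationUnits": 32,
            # Parallel copies can also be tuned independently of DIUs.
            "parallelCopies": 8,
        },
    }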
Hope this helps.
The standard answer from Microsoft seems to be that you need to tune the target database or scale up to a higher tier. This suggests that Azure Data Factory is not a limiting factor in the copy performance.
However, we've done some testing with a single table, a single copy activity, and ~15 GB of data. The table did not contain varchar(max) or high-precision columns, just plain and simple data.
Conclusion: it barely matters which tier you choose (not too low, of course); roughly above S7 / 800 DTU / 8 vCores, the throughput of the copy activity is ~10 MB/s and does not go up. The load on the target database is 50%-75%.
Our assumption is that since we could keep throwing higher database tiers at this problem without seeing any improvement in copy activity performance, this is Azure Data Factory related.
Our solution, since we are loading a lot of separate tables, is to scale out instead of up by using a ForEach loop with the batch count set to at least 4 (see the sketch after this answer).
The approach of increasing DIUs is only applicable in some cases:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units
Setting of DIUs larger than four currently applies only when you copy
multiple files from Azure Storage, Azure Data Lake Storage, Amazon S3,
Google Cloud Storage, cloud FTP, or cloud SFTP to any other cloud data
stores.
In our case we are copying data from relational databases.
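A scale-out setup along those lines might look like the fragment below (again shown as a Python dict for illustration; the parameter and activity names are assumptions): a ForEach over a list of tables, running up to 4 copies in parallel.

    # Sketch of a ForEach activity (as a Python dict) that runs one copy
    # activity per table, with up to 4 copies in flight at a time.
    # Parameter and activity names are hypothetical.
    for_each_activity = {
        "name": "LoadAllTables",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@pipeline().parameters.tableList",
                "type": "Expression",
            },
            "isSequential": False,   # run iterations in parallel
            "batchCount": 4,         # at most 4 concurrent copy activities
            "activities": [
                {
                    "name": "CopyOneTable",
                    "type": "Copy",
                    # inputs/outputs/typeProperties for the per-table copy are
                    # omitted for brevity; each iteration copies @item()
                    # (one table) to the target database.
                }
            ],
        },
    }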

Estimating Azure Search Cost

I'm fairly new to Azure platform and need some help with cost estimates for Azure Search service.
Every month we will have about 500 GB worth of files that will be dropped into Azure Blob Storage. We would like to index these files using Azure Search based only on the file names.
When I look at the Standard S2 pricing, it has the following text for storage:
100 GB/partition (max 1.2 TB documents per service). What does this mean? Does it mean that once my storage crosses 1.2 TB, I'll need to purchase another service?
Any help would be greatly appreciated.
If a tier's capacity turns out to be too low based on your needs, you will need to provision a new service at the higher tier and then reload your indexes. Kindly note that there is no in-place upgrade of the same service from one SKU to another.
Storage is constrained by disk space or by a hard limit on the maximum number of indexes, documents, or other high-level resources, whichever comes first.
A service is provisioned at a specific tier. Jumping tiers to gain capacity involves provisioning a new service (there is no in-place upgrade). For more information, see Choose a SKU or tier. To learn more about adjusting capacity within a service you've already provisioned, see Scale resource levels for query and indexing workloads.
Check the document for more details on this topic.
At this time, S2 offers 100 GB of storage per partition with up to 12 partitions per service, which is where the 1.2 TB per-service maximum comes from (12 × 100 GB).
You could use Pricing Calculator (https://azure.microsoft.com/pricing/calculator/) for the cost estimation as well.
The storage limit for Azure Search refers to the size of the documents in the index, which can be bigger or smaller than your original blobs depending on your use case.
For example, if you want to do text searches on blob content, the index will be bigger than your original blobs. Whereas if you only want to search on file names, the size of the original blobs becomes irrelevant and only the number of blobs will affect your Azure Search index size.
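To make that concrete, an index that only supports file-name search can stay very small. Below is a minimal sketch (Python requests against the Azure Search REST API); the service name, index name, field names, and API version are assumptions, and in practice you would map the blob indexer's metadata_storage_name / metadata_storage_path metadata into these fields.

    import requests

    # Assumed placeholders - replace with your own service and key.
    SERVICE = "my-search-service"
    API_KEY = "<admin-api-key>"
    API_VERSION = "2020-06-30"  # adjust as needed

    # A minimal index for file-name-only search: one key field plus one
    # searchable field. Blob content is never ingested, so index size is
    # driven by the number of blobs, not their size.
    index_def = {
        "name": "file-names-index",  # hypothetical index name
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "fileName", "type": "Edm.String",
             "searchable": True, "retrievable": True},
        ],
    }

    resp = requests.put(
        f"https://{SERVICE}.search.windows.net/indexes/file-names-index"
        f"?api-version={API_VERSION}",
        json=index_def,
        headers={"api-key": API_KEY, "Content-Type": "application/json"},
    )
    resp.raise_for_status()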

How big can a single Azure Function be?

My Azure Function will require resources to run properly (ranging from 5 to 30 MB). I'm thinking of putting them on Blob Storage and loading them from there each time the function runs, but since it will run quite often (a couple of times per day), I'm also considering simply putting them as resources in the function project.
The question is, how much space can I allocate for the function itself? How big (in terms of disk space) can an Azure Function be?
It depends on which hosting plan your function app is on.
For a dedicated App Service plan (pricing tiers Basic, Standard, and so on), we can see the size of disk storage allocated for the plan.
As @Slava Utesinov mentioned, we can also check it under Platform features -> App Service plan -> File system storage.
For the Consumption plan, File system storage shows 0 KB used out of 1 GB available. That is because function apps on the Consumption plan are stored in an Azure Files share, which is specified by the WEBSITE_CONTENTAZUREFILECONNECTIONSTRING and WEBSITE_CONTENTSHARE application settings in the Azure portal (check the doc). As for space, the maximum size of a file share is 5 TB (see the doc).
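If you do bundle the resources with the function project rather than pulling them from Blob Storage on every run, loading them relative to the function's own directory is straightforward. A minimal Python sketch, where the resources/model.bin path and the function name are hypothetical:

    import os
    import azure.functions as func

    # Hypothetical resource bundled inside the function project, e.g.
    #   MyFunction/
    #     __init__.py
    #     resources/model.bin   (the 5-30 MB asset mentioned above)
    RESOURCE_PATH = os.path.join(os.path.dirname(__file__), "resources", "model.bin")


    def main(req: func.HttpRequest) -> func.HttpResponse:
        # The file is part of the deployed package, so no Blob Storage round
        # trip is needed at runtime; it simply counts against the plan's
        # file-system quota discussed above.
        with open(RESOURCE_PATH, "rb") as f:
            data = f.read()
        return func.HttpResponse(f"Loaded {len(data)} bytes of resources.")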

Sync mechanism to Azure Search - how reliable is Azure Search insertion?

How reliable is the insertion mechanism for Azure Search?
Say, an average call to upload a document to Azure Search. Are there any SLAs on this? What is the average insertion time for one document, and the average failure rate for one document?
I'm trying to send data from my database to Azure Search, and I was wondering whether it is more reliable to send data directly to Azure Search, or to do a dual write, for example to a highly available queue like Kafka, and read from there.
From SLA for Azure Search:
We guarantee at least 99.9% availability for index query requests when
an Azure Search Service Instance is configured with two or more
replicas, and index update requests when an Azure Search Service
Instance is configured with three or more replicas. No SLA is provided
for the Free tier.
Your client code needs to follow best practices: batch indexing requests, retry on transient failures with an exponential back-off policy, and scale the service appropriately based on the size of the documents and the indexing load.
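A minimal sketch of that pattern (batching plus exponential back-off) using the documents-index REST endpoint through Python requests; the service name, index name, document shape, and retry parameters below are assumptions:

    import time
    import requests

    # Assumed placeholders - replace with your own service, key, and index.
    SERVICE = "my-search-service"
    API_KEY = "<admin-api-key>"
    API_VERSION = "2020-06-30"
    URL = (f"https://{SERVICE}.search.windows.net/indexes/docs-index"
           f"/docs/index?api-version={API_VERSION}")


    def index_batch(docs, max_retries=5):
        """Upload one batch of documents, retrying transient failures
        with exponential back-off."""
        body = {"value": [{"@search.action": "mergeOrUpload", **d} for d in docs]}
        delay = 1.0
        for attempt in range(max_retries):
            resp = requests.post(
                URL, json=body,
                headers={"api-key": API_KEY, "Content-Type": "application/json"},
            )
            # 429 (throttled) and 503 (unavailable) are worth retrying;
            # a 207 response means some documents failed and the
            # per-document status in the body should be inspected.
            if resp.status_code not in (429, 503):
                resp.raise_for_status()
                return resp.json()
            time.sleep(delay)
            delay *= 2
        raise RuntimeError("Batch indexing failed after retries")


    # Usage: send documents in batches (e.g. a few hundred at a time)
    # rather than one request per document.
    # index_batch([{"id": "1", "title": "hello"}, {"id": "2", "title": "world"}])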
Whether or not to use an intermediate buffer depends not so much on the SLA as on how spiky your indexing load will be and how decoupled you want your search indexing component to be.
You may also find Capacity planning for Azure Search useful.

What Azure Table IOPS limitations exist with the different Web App (or function) sizes?

I'm looking at high-scale Azure Table operations, and I am looking for documentation that describes the max IOPS to expect from the Azure instance sizes for Azure Web Apps, Functions, etc.
Web Roles and their corresponding limitations are well documented. For example, see this comment in the linked question:
so, we ran our tests on different instance sizes and yes that makes a huge difference. at medium we get around 1200 writes per second, on extra large we get around 7200. We are looking at building a distributed read/write controller possibly using the dcache as the middle man. – JTtheGeek Aug 9 '13 at 22:39
Question
What is the corresponding limitation for Web Apps (logic, mobile, etc.) and Azure Table IOPS?
According to the official documentation, the total request rate (assuming 1 KB object size) per storage account is up to 20,000 IOPS, entities per second, or messages per second. We can also get the VM max IOPS limits from the Azure VM sizes document. Web Apps are based on an App Service plan, and within the plan we can choose different pricing tiers that correspond to different VM sizes, which may be useful as a reference. For more Azure limits, please refer to Azure subscription and service limits, quotas, and constraints.
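Since the 20,000 IOPS ceiling is per storage account, one way to go beyond it regardless of the Web App or Function size is to spread tables or partitions across several storage accounts. A rough sketch using the azure-data-tables SDK; the connection strings, table name, and entity shape here are assumptions:

    import hashlib
    from azure.data.tables import TableServiceClient

    # Hypothetical connection strings for several storage accounts; each
    # account has its own ~20,000 IOPS ceiling, so spreading load across
    # them raises the aggregate throughput available to the app.
    CONNECTION_STRINGS = [
        "<connection-string-account-1>",
        "<connection-string-account-2>",
        "<connection-string-account-3>",
    ]

    clients = [
        TableServiceClient.from_connection_string(cs).create_table_if_not_exists("events")
        for cs in CONNECTION_STRINGS
    ]


    def write_entity(entity):
        """Route an entity to a storage account based on its PartitionKey."""
        bucket = int(hashlib.md5(entity["PartitionKey"].encode()).hexdigest(), 16)
        table = clients[bucket % len(clients)]
        table.create_entity(entity)


    # Usage (hypothetical entity):
    # write_entity({"PartitionKey": "device-42",
    #               "RowKey": "2024-01-01T00:00:00Z",
    #               "reading": 3.14})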
