How to store more than 100 records using cosmos batch from azure cosmos db using query - azure

I tried to create more than 100 documents in a batch and received a 400 (Bad Request) result from the server with the error Batch request has more operations than what is supported.
Creating 100 documents works fine. Clearly, there is a limit of 100 operations per batch.There's no documentation I could find anywhere the solution.
I can not store them in different batches because even if one doc fail to store I want others also to rollback. Any somebody please guide me how to achieve this using cosmos db?

Transactional Batch has two upper-limits, size-wise (aside from the restriction that a batch must be within the same partition of the same collection):
100 items
2MB payload
Going beyond 100 items (or 2MB) will require you to iterate through multiple batches, checkpointing with each successfully-written batch. How you accomplish this is really up to you, as there is no mechanism built-in.
The limitations on batch item count and size are documented here.

Related

Slow bulk insert to Azure database

We are running an elastic pool in Azure running multiple databases, when running 1 of our larger imports this seems to take longer than we are used to. During these imports we ran at 6 cores as a test. All databases are allowed to use all cores.
On our local enviroment, it inserts about 100k records per second, however, the same dataset on Azure does about 1k per second (our vm) to 4k per second (dev laptop).
During this insert, the database only uses 14% log IO, 5% CPU and 0% DataIO.
When setting up a new database using DTU model in P2 we are noticing the same experience. So we are not even hitting the limits of the database
The table contains about 36 columns which are all required.
We have tried this using BulkInsert in the following way using different batchsizes
BulkConfig b = new BulkConfig();
b.BatchSize = 100000;
await dbcontext.BulkInsertAsync(entities, b);
As well as using standard EntityFramework addranges using smaller batches. We even went as far as using the manually written SqlBulkCopy methods, however all with no dice.
Now the question is mainly, is this a software issue? Are we running into issues in our AzureDB? Do we need to change the way we do Bulk imports?
Edit:
Attempted to run the import using the TempDB Setting in BulkInsert, however this also does not increase performance. LogIO is still at 14%.
Iterate through the dataset on the application layer, invoking a
stored procedure for each row that will perform an INSERT/UPDATE
action based on the existence of a record with a certain key. If the
number of records to upsert is limited, this strategy may work well;
otherwise, roundtrips and log writes will have a major influence on
speed.
To minimise roundtrips and log writes and increase throughput, use
bulk insert approaches like the SqlBulkCopy class in ADO.NET to
upload the full dataset to Azure SQL Database and then execute all
the INSERT/UPDATE (or MERGE) operations in a single batch. Overall
execution times may be reduced from hours to minutes/seconds using
this method.
Here, is a discussion related to same scenario: Optimize Azure SQL Database Bulk Upsert scenarios - link.

Azure search index storage size stops at 8MB

I am trying to ingest a load of 13k json documents into azure search engine, but the index stops at around 6k documents without any error for the indexer and the index storage size is 7.96MB and it doesn't surpass this limit no matter what.
I have tried using smaller batches of 3k/indexer and after that 1k/indexer, but I got the same result.
In my json I have around 10 simple fields, and 20 complex fields (which have other nested complex fields, but up to level 5).
Do you have any idea if there is a limit per size for an index? And where I can set it up?
As SLA, I think we are using S1 plan (based on what limits we have - 50 indexers, and so on)
Thanks
Really hard to help without seeing it, but I remember I faced a problem like this in the past. In my case, it was a problem of duplicating with the key field.
I also recommend you smaller batches (~500 documents)
PS: Take a look if your nested jsons are not too big (in case it's marked as retrievable).

Is there a way to increase the request size limit when inserting data into cosmosdb?

I have a requirement of reading multiple files (105 files) from ADLS(Azure data lake storage); parsing them and subsequently adding the parsed data directly to multiple collections in azure cosmos db for mongodb api. All this needs to be done in one request. Average file size is 120kb.
The issue is that after multiple documents are added,an error is raised "request size limit too large"
Please let me know if someone has any inputs on this.
It's unclear how you're performing multi-document inserts but... You can't increase maximum request size. You'll need to perform individual inserts, or insert in smaller batches.

Azure Cosmos DB - incorrect and variable document count

I have inserted exactly 1 million documents in an Azure Cosmos DB SQL container using the Bulk Executor. No errors were logged. All documents share the same partition key. The container is provisioned for 3,200 RU/s, unlimited storage capacity and single-region write.
When performing a simple count query:
select value count(1) from c where c.partitionKey = #partitionKey
I get varying results varying from 303,000 to 307,000.
This count query works fine for smaller partitions (from 10k up to 250k documents).
What could cause this strange behavior?
It's reasonable in cosmos db. Firstly, what you need to know is that Document DB imposes limits on Response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large data from Document DB, you have to consider the query performance issue, please refer to this article:Tuning query performance with Azure Cosmos DB.
By looking at the Document DB REST API, you can observe several important parameters which has a significant impact on query operations : x-ms-max-item-count, x-ms-continuation.
So, your error is resulted of bottleneck of RUs setting. The count query is limited by the number for RUs allocated to your collection. The result that you would have received will have a continuation token.
You may have 2 solutions:
1.Surely, you could raise the RUs setting.
2.For cost, you could keep looking for next set of results via continuation token and keep on adding it so that you will get total count.(Probably in sdk)
You could set value of Max Item Count and paginate your data using continuation tokens. The Document Db sdk supports reading paginated data seamlessly. You could refer to the snippet of python code as below:
q = client.QueryDocuments(collection_link, query, {'maxItemCount':10})
results_1 = q._fetch_function({'maxItemCount':10})
#this is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
results_2 = q._fetch_function({'maxItemCount':10,'continuation':token})
I imported exactly 30k documents into my database.Then I tried to run the query
select value count(1) from c in Query Explorer. It turns out only partial of total documents every page. So I need to add them all by clicking Next Page button.
Surely, you could do this query in the sdk code via continuation token.

How fast is Azure Search Indexer and how I can index faster?

Each index batch is limited from 1 to 1000 documents. When I call it from my local machine or azure VM, I got 800ms to 3000ms per 1000 doc batch. If I submit multiple batches with async, the time spent is roughly the same. That means it would take 15 - 20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service and although there are a lot of things that can impact how fast data can be ingested. I would expect to see ingestion to a single partition search service at a rate of about 700 docs / second for an average index, so I think your numbers are not far off from what I would expect, although please note that these are purely rough estimates and you may see different results based on any number of factors (such as number of fields, quantity of facets, etc)..
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure, and it would likely be faster if you did this directly from Azure but if this is just a one time-upload that probably is not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have and the S2 Search Service will also ingest data faster. Although both of these come at a cost.
By the way, if you have 50 M documents, please make sure that you allocate enough partitions since a single S1 partition can handle 15M documents or 25GB so you will definitely need extra partitions for this service.
Also as another side note, when you are uploading your content (and especially if you choose to do parallelized uploads), keep an eye on the HTTP responses because if the search service exceeds the resources available you could get HTTP 207 (indicating one or more item failed to apply) or 503's indicating the whole batch failed due to throttling. If throttling occurs, you would want to back off a bit to let the service catch up.
I think you're reaching the request capacity:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (s1, s2). If you still face the same problem, try get in touch with support team.
Another option:
Instead of pushing data, try to add your data to the blob storage, documentDb or Sql Database, and then, use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/

Resources