Is there a way to increase the request size limit when inserting data into cosmosdb? - azure

I have a requirement to read multiple files (105 files) from ADLS (Azure Data Lake Storage), parse them, and then add the parsed data directly to multiple collections in Azure Cosmos DB for MongoDB API. All of this needs to be done in one request. The average file size is 120 KB.
The issue is that after multiple documents are added, an error is raised: "request size limit too large".
Please let me know if anyone has any input on this.

It's unclear how you're performing multi-document inserts, but you can't increase the maximum request size. You'll need to perform individual inserts, or insert in smaller batches.
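For illustration, here is a minimal sketch of batched inserts against the MongoDB API using pymongo; the connection string, database name, collection name, and batch size are all placeholders you would adjust:

from pymongo import MongoClient

# Placeholders: substitute your Cosmos DB for MongoDB API connection string,
# database name, and collection name.
client = MongoClient("<cosmosdb-mongodb-connection-string>")
collection = client["mydb"]["mycollection"]

BATCH_SIZE = 100  # tune this down if a batch still trips the request size limit

def insert_in_batches(docs):
    # Insert documents in small batches instead of one huge request.
    for start in range(0, len(docs), BATCH_SIZE):
        collection.insert_many(docs[start:start + BATCH_SIZE], ordered=False)

With ordered=False the remaining documents in a batch are still attempted if one insert fails, which is usually what you want for a bulk load.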

Related

How to store more than 100 records using a Cosmos batch in Azure Cosmos DB

I tried to create more than 100 documents in a batch and received a 400 (Bad Request) result from the server with the error "Batch request has more operations than what is supported".
Creating 100 documents works fine, so clearly there is a limit of 100 operations per batch. I couldn't find any documentation describing a solution.
I can't store them in different batches because if even one document fails to store, I want the others to roll back as well. Can somebody please guide me on how to achieve this using Cosmos DB?
Transactional Batch has two upper-limits, size-wise (aside from the restriction that a batch must be within the same partition of the same collection):
100 items
2MB payload
Going beyond 100 items (or 2MB) will require you to iterate through multiple batches, checkpointing with each successfully-written batch. How you accomplish this is really up to you, as there is no mechanism built-in.
The limits on batch item count and payload size are covered in the Cosmos DB transactional batch documentation.
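As an illustration of that chunk-and-checkpoint approach, here is a minimal Python sketch; write_batch is a placeholder for whatever transactional-batch call your SDK exposes, and the checkpoint is just an in-memory counter that you would replace with durable storage if you need to resume after a failure:

MAX_BATCH_ITEMS = 100  # per-batch operation limit; the 2 MB payload cap also applies

def chunked(items, size=MAX_BATCH_ITEMS):
    # Yield successive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

def write_all(items, write_batch):
    completed = 0  # checkpoint: number of items already committed
    for batch in chunked(items):
        write_batch(batch)  # each batch is atomic on its own; the whole set is not
        completed += len(batch)
    return completed

Note that this does not give you the all-or-nothing rollback asked for in the question: once a batch has committed, a failure in a later batch will not undo it, so you would need your own compensation logic if that matters.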

Azure Cosmos DB - incorrect and variable document count

I have inserted exactly 1 million documents in an Azure Cosmos DB SQL container using the Bulk Executor. No errors were logged. All documents share the same partition key. The container is provisioned for 3,200 RU/s, unlimited storage capacity and single-region write.
When performing a simple count query:
select value count(1) from c where c.partitionKey = #partitionKey
I get results varying from 303,000 to 307,000.
This count query works fine for smaller partitions (from 10k up to 250k documents).
What could cause this strange behavior?
This is expected behavior in Cosmos DB. Firstly, what you need to know is that Document DB imposes limits on response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large amounts of data from Document DB, you have to consider query performance; please refer to this article: Tuning query performance with Azure Cosmos DB.
Looking at the Document DB REST API, you can see several important parameters that have a significant impact on query operations: x-ms-max-item-count and x-ms-continuation.
So what you're seeing is the result of the RU throughput allocated to your collection: the count query is limited by the RUs, and the response you receive carries a continuation token.
You have two options:
1. Simply raise the RU setting.
2. To keep costs down, keep requesting the next set of results via the continuation token and add the partial counts together to get the total count (this can be done in the SDK).
You can set the Max Item Count value and paginate your data using continuation tokens. The DocumentDB SDK supports reading paginated data seamlessly. You can refer to the Python snippet below (it follows the legacy pydocumentdb client, where _fetch_function is an internal method returning the page of results together with the response headers):
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
results_1 = q._fetch_function({'maxItemCount': 10})
# results_1[1] holds the response headers; the continuation token is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
results_2 = q._fetch_function({'maxItemCount': 10, 'continuation': token})
I imported exactly 30k documents into my database and then ran the query select value count(1) from c in Query Explorer. It turns out each page only returns a partial count of the total documents, so I had to add them all up by clicking the Next Page button.
Of course, you could do this query in SDK code via the continuation token, as sketched below.
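Here is a minimal sketch of that summation loop, continuing the legacy-SDK pattern from the earlier snippet (it reuses the internal _fetch_function and assumes client and collection_link are set up as above, so treat it as illustrative rather than production code):

count_query = 'SELECT VALUE COUNT(1) FROM c'
options = {'maxItemCount': 1000}

q = client.QueryDocuments(collection_link, count_query, options)
page = q._fetch_function(options)      # page[0] = partial results, page[1] = headers
total = sum(page[0])
token = page[1].get('x-ms-continuation')

while token:
    options['continuation'] = token
    page = q._fetch_function(options)
    total += sum(page[0])              # each page contributes a partial count
    token = page[1].get('x-ms-continuation')

print(total)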

Max number of records in DynamoDB

I'm writing a script that saves a CSV into a DynamoDB table. I'm using Node.js and the aws-sdk module. Everything seems to be correct, but while I'm sending over 50k records to DynamoDB, only 1,181 are saved and shown in the web console.
I've tried with different numbers of records, and this is the biggest count I get, no matter whether I try saving 100k, 10k or 50k.
According to AWS's documentation, there shouldn't be any limit on the number of records, so any idea what other factors could cause this hard limit?
BTW, my code is catching errors from the insert actions, and I'm not picking up any when inserting past the 1,181 mark, so the module is not really helping.
Any extra ideas would be appreciated.
If you're using DynamoDB batchWriteItem or another batch insert, you need to check the "UnprocessedItems" element in the response. Batch writes sometimes exceed the provisioned write capacity of your table, in which case not all of your inserts get processed; that sounds like what is happening here.
You should check the response of each batch write and, if there are unprocessed items, set up a retry with an exponential backoff strategy in your code. This will allow the remaining items to be inserted until your entire CSV is processed.
Here is the reference link for DynamoDB BatchWriteItem if you want to take a closer look at the response elements. Good luck!
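The question uses the Node.js aws-sdk, but the retry pattern is the same in any language; here is a minimal sketch using Python and boto3, assuming the items are already in DynamoDB attribute-value format (e.g. {'id': {'S': '1'}}):

import time
import boto3

dynamodb = boto3.client('dynamodb')

def batch_write_with_retry(table_name, items, max_retries=8):
    # BatchWriteItem accepts at most 25 items per call.
    for start in range(0, len(items), 25):
        request_items = {
            table_name: [{'PutRequest': {'Item': item}} for item in items[start:start + 25]]
        }
        retries = 0
        while request_items:
            response = dynamodb.batch_write_item(RequestItems=request_items)
            # Anything the table could not absorb comes back here and must be retried.
            request_items = response.get('UnprocessedItems') or {}
            if request_items:
                if retries >= max_retries:
                    raise RuntimeError('Gave up with unprocessed items remaining')
                time.sleep(0.05 * (2 ** retries))  # exponential backoff before retrying
                retries += 1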

Cassandra: Issue with blob creation for large file

We are trying to load a file into a blob column in Cassandra. When we load files of 1-2 MB, it goes through fine. While loading a large file, say around 50 MB, we get the following error:
Cassandra failure during write query at consistency LOCAL_QUORUM (1 responses were required but only 0 replica responded, 1 failed)
It is a single-node development DB. Any hints or support would be appreciated.
50 MB is pretty big for a cell. Although a little out of date, this is still accurate: http://cassandra.apache.org/doc/4.0/faq/#can-large-blob
There is no mechanism for streaming out of cells in Cassandra, so a cell's contents need to be serialized in a single response, in memory. You're probably hitting a limit or bug somewhere that's throwing an exception and causing the failed query (check Cassandra's system.log; there may be an exception in there that describes what's occurring better).
If you have a CQL collection or logged batch, there are additional, lower limits.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
You can try chunking your blobs into parts. I'd actually recommend something like 64 KB and, on the client side, iterating through them to generate a stream (which also prevents loading the blob completely into memory on your side).
CREATE TABLE exampleblob (
    blobid text,
    chunkid int,
    data blob,
    PRIMARY KEY (blobid, chunkid)
);
Then just SELECT * FROM exampleblob WHERE blobid = 'myblob'; and iterate through the results. Inserting gets more complex, though, since you need logic to split up your file; this can also be done in a streaming fashion and be memory-efficient on your app side. A sketch of both sides follows below.
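As a rough illustration of the chunking idea, here is a Python sketch using the cassandra-driver; the contact point, keyspace name, and 64 KB chunk size are assumptions to adjust for your setup:

from cassandra.cluster import Cluster

CHUNK_SIZE = 64 * 1024  # 64 KB per chunk, as suggested above

cluster = Cluster(['127.0.0.1'])          # assumed single-node dev cluster
session = cluster.connect('mykeyspace')   # 'mykeyspace' is a placeholder keyspace

insert_chunk = session.prepare(
    "INSERT INTO exampleblob (blobid, chunkid, data) VALUES (?, ?, ?)")

def store_blob(blob_id, path):
    # Stream the file into the table in fixed-size chunks, never holding it all in memory.
    with open(path, 'rb') as f:
        chunk_id = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            session.execute(insert_chunk, (blob_id, chunk_id, chunk))
            chunk_id += 1

def read_blob(blob_id):
    # Rows come back ordered by the chunkid clustering column, so joining them rebuilds the file.
    rows = session.execute(
        "SELECT data FROM exampleblob WHERE blobid = %s", (blob_id,))
    return b''.join(row.data for row in rows)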
Another alternative is to just upload the blob to S3 or some other distributed file store, using a hash of the file as the bucket/filename, and in Cassandra store only the filename as a reference to it.

Limited results when using Excel BigQuery connector

I've pulled a data extract from BigQuery using the Excel connector but my results have been limited to 230,000 records.
Is this a limitation of the connector or something I have not done properly?
BigQuery does have a maximum response size of 64 MB (compressed), so depending on the size of your rows, it's quite possible that 230,000 rows is the largest response BigQuery can return for your query.
See more info on quotas here:
https://developers.google.com/bigquery/docs/quota-policy
What's the use case, and how many rows are you expecting to be returned? Generally BigQuery is used for large aggregate analysis rather than for queries that return tons of unaggregated rows. If you're after the raw dataset, you can also dump the entire table as CSV into Google Cloud Storage.
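If you go the export route, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",      # source table to export
    "gs://my-bucket/my_table-*.csv",       # wildcard lets large tables shard into multiple CSV files
)
extract_job.result()  # block until the export job finishes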
Also, you may want to try running the query in the UI at:
https://bigquery.cloud.google.com/
