While performing a simple Get operation by Id, where a single document is returned (not an array with one document), I get the following x-ms-resource-usage:
x-ms-resource-usage:documentSize:0;documentsSize:288;collectionSize=307;
Questions:
Why is documentSize 0?
What is the unit of measure? Bytes?
What is the difference between documentSize and documentsSize? Please note the query only returns one document.
What is the collectionSize? Is that the total number of documents in the collection?
What is the difference between x-ms-resource-usage and x-ms-resource-quota?
I'm fairly sure the numbers are as follows, and all in KB:
documentSize: Size of the document
documentsSize: Combined size of all documents in collection
collectionSize: Combined size of all documents in collection, along with overhead such as indexes
x-ms-resource-usage is about consumed resources within the collection, while x-ms-resource-quota is going to give you your limits. So with quota, you'll see documentsSize and collectionSize both set to something like 10485760, which is 10 GB (10,485,760 KB).
documentSize and documentsSize report the same size, the first in MB and the second in kB (which is why documentSize shows 0 for a 288 kB usage). Apparently, documentSize is being deprecated.
collectionSize = documentsSize + metadata (in kB)
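For reference, here is a minimal Python sketch of parsing these header values, assuming the value is a ;-separated list of key/value pairs reported in KB (the parser tolerates both the = and : separators seen in the sample above):

def parse_resource_header(value):
    """Parse an x-ms-resource-usage / x-ms-resource-quota header value into a dict.

    Entries are separated by ';'; each entry is key=value (the sample above
    also shows ':' as a separator, so both are accepted). Values are in KB.
    """
    result = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        sep = "=" if "=" in entry else ":"
        key, _, raw = entry.partition(sep)
        result[key.strip()] = int(raw.strip())
    return result

usage = parse_resource_header("documentSize=0;documentsSize=288;collectionSize=307")
quota = parse_resource_header("documentsSize=10485760;collectionSize=10485760")
print(usage["documentsSize"], "of", quota["documentsSize"], "KB used")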
Related
I can't find any information on whether "item size" refers to the original document size or to the result size of the query after projection.
I can observe that simple queries like these
documents.find({ /*...*/ }, { name: 1 })
consume more than 1000 RU for results of 400 items (the query fields are indexed). The original documents are pretty large, about 500 KB each. The data actually received is tiny due to the projection. If I remove the projection, the query runs for several seconds but doesn't consume significantly more RUs (it's actually slightly more, but that seems to be because it's split into more GetMore calls).
It sounds really strange to me that the cost of a query mainly depends on the size of the original documents in the collection, not on the data retrieved. Is that really true? Can I reduce the cost of this query without splitting data into multiple collections? The logic is basically: "Just get the 'name' of all these big documents in the collection".
(No partitioning on the db...)
Microsoft unfortunately doesn't seem to publish their formula for determining RU costs, just broad descriptions. They do say about RU considerations:
As the size of an item increases, the number of RUs consumed to read or write the item also increases.
So it is the case that cost depends on the raw size of the item, not just the portion of it output from a read operation. If you use the Data Explorer to run some queries and inspect the Query Stats, you'll see two metrics, Retrieved Document Size and Output Document Size. By projecting a subset of properties, you reduce the output size, but not the retrieved size. In tests on my data, I see a very small decrease in RU charge by selecting the return properties -- definitely not a savings in proportion to the reduced output.
Fundamentally, keeping items small is probably the most important thing to work towards, both in terms of property data size and the number of properties. You definitely don't want 500 KB items if you can avoid it.
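As a rough way to verify this on your own data, here is a sketch using the azure-cosmos Python SDK; it assumes the SDK exposes the last response headers via client_connection.last_response_headers (which include x-ms-request-charge), and the endpoint, key, database, and container names are placeholders:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

def charge_for(query):
    # Run the query, drain the results, then read the RU charge of the last response.
    # Note: for paged / cross-partition queries this is the charge of the last page only.
    items = list(container.query_items(query=query, enable_cross_partition_query=True))
    charge = container.client_connection.last_response_headers.get("x-ms-request-charge")
    return len(items), charge

# Projection vs. full documents: expect only a small difference in RU charge,
# because the retrieved document size is the same in both cases.
print(charge_for("SELECT c.name FROM c"))
print(charge_for("SELECT * FROM c"))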
I am trying to ingest about 13k JSON documents into Azure Search, but the index stops at around 6k documents without any indexer error; the index storage size is 7.96 MB and it doesn't grow beyond that no matter what I try.
I have tried using smaller batches of 3k/indexer and after that 1k/indexer, but I got the same result.
In my JSON I have around 10 simple fields and 20 complex fields (which contain other nested complex fields, up to 5 levels deep).
Do you have any idea whether there is a size limit per index, and where I can configure it?
As for the service tier, I think we are using the S1 plan (based on the limits we have: 50 indexers, and so on).
Thanks
Really hard to help without seeing it, but I remember I faced a problem like this in the past. In my case, it was a problem of duplicate values in the key field: documents that share the same key overwrite each other, so the index ends up with fewer documents than were ingested.
I also recommend smaller batches (~500 documents).
PS: Check whether your nested JSON objects are too big (especially if they are marked as retrievable).
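Following up on the duplicate-key and batching suggestions, here is a sketch of checking for duplicate keys and pushing documents in ~500-document batches with the azure-search-documents Python SDK; the endpoint, index name, admin key, and the assumption that the key field is called "id" are all placeholders:

from collections import Counter

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="<index>",
    credential=AzureKeyCredential("<admin-key>"),
)

def check_duplicate_keys(documents, key_field="id"):
    # Documents that share a key overwrite each other in the index.
    counts = Counter(doc[key_field] for doc in documents)
    return [key for key, n in counts.items() if n > 1]

def upload_in_batches(documents, batch_size=500):
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        results = client.upload_documents(documents=batch)
        failed = [r for r in results if not r.succeeded]
        if failed:
            print(f"batch starting at {start}: {len(failed)} documents failed")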
As there is a size limit in Cosmos DB for a single item, how can I add data larger than 2 MB as a single entry?
The 2MB limit is a hard-limit, not expandable. You'll need to work out a different model for your storage. Also, depending on how your data is encoded, it's likely that the actual limit will be under 2MB (since data is often expanded when encoded).
If you have content within an array (the typical reason why a document would grow so large), consider refactoring this part of your data model (perhaps store references to other documents, within the array, vs the subdocuments themselves). Also, with arrays, you have to deal with an "unbounded growth" situation: even with documents under 2MB, if the array can keep growing, then eventually you'll run into a size limit issue.
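To sketch that refactoring with a hypothetical order/item model (field names are illustrative, not from the question): instead of embedding an unbounded array, each array entry becomes its own document and the parent keeps only references.

# One huge document with an ever-growing embedded array (what hits the 2 MB limit):
order_embedded = {
    "id": "order-1001",
    "customerId": "cust-42",
    "items": [
        {"sku": "A-1", "qty": 2, "notes": "..."},
        # ... potentially thousands more entries
    ],
}

# Refactored: the parent stores references, each item is its own (small) document.
order = {
    "id": "order-1001",
    "customerId": "cust-42",
    "itemIds": ["order-1001-item-0001", "order-1001-item-0002"],
}
order_item = {
    "id": "order-1001-item-0001",
    "orderId": "order-1001",  # back-reference so items can be queried by order
    "sku": "A-1",
    "qty": 2,
    "notes": "...",
}

# If even itemIds could grow without bound, drop it entirely and rely on the
# back-reference (query items by orderId) to avoid the same unbounded-growth issue.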
I'm using ArangoDB for a Web Application through Strongloop.
I've got some performance problem when I run this query:
FOR result IN Collection SORT result.field ASC RETURN result
I added an index to speed up the query: a skiplist index on the sorted field.
My Collection contains more than 1M records.
The application is hosted on n1-highmem-2 on Google Cloud.
Below some specs:
2 CPUs - Xeon E5 2.3Ghz
13 GB of RAM
10GB SSD
Unfortunately, my query takes a long time to finish.
What can I do?
Best regards,
Carmelo
Summarizing the discussion above:
If there is a skiplist index present on the field attribute, it could be used for the sort. However, if it was created sparse, it can't be. This can be revalidated by running
db.Collection.getIndexes();
in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required - which can be revalidated using Explain.
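For example, a sketch of those two checks with the python-arango driver (database name and credentials are placeholders); indexes() corresponds to getIndexes(), and db.aql.explain() returns the chosen execution plan:

from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db("<db>", username="<user>", password="<password>")

# Check whether a non-sparse skiplist index exists on "field".
for index in db.collection("Collection").indexes():
    if index["type"] == "skiplist" and "field" in index["fields"]:
        print("skiplist on field, sparse =", index.get("sparse"))

# Ask the optimizer how it would execute the query; if the index is usable for
# sorting, the applied rules should include something like "use-index-for-sort".
plan = db.aql.explain("FOR result IN Collection SORT result.field ASC RETURN result")
print(plan.get("rules"))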
However, the query will still build a huge result in memory which will take time and consume RAM.
If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.
For example, first iteration:
FOR result IN Collection SORT result.field LIMIT 10000 RETURN result
Then process these first 10,000 documents offline, and note the value of result.field for the last processed document.
Now run the query again, but now with an additional FILTER:
FOR result IN Collection
  FILTER result.field > @lastValue
  SORT result.field
  LIMIT 10000
  RETURN result
Repeat this until there are no more documents. That should work fine if result.field is unique.
If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be at least an approximation.
Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.
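Putting the chunked approach together, here is a sketch of the driving loop with the python-arango driver; the connection details are placeholders, process() stands for your own offline handling, and result.field is assumed unique as discussed above:

from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db("<db>", username="<user>", password="<password>")

FIRST_CHUNK = """
FOR result IN Collection
  SORT result.field
  LIMIT 10000
  RETURN result
"""

NEXT_CHUNK = """
FOR result IN Collection
  FILTER result.field > @lastValue
  SORT result.field
  LIMIT 10000
  RETURN result
"""

def process(docs):
    ...  # offline processing of one chunk

docs = list(db.aql.execute(FIRST_CHUNK))
while docs:
    process(docs)
    last_value = docs[-1]["field"]  # remember where this chunk ended
    docs = list(db.aql.execute(NEXT_CHUNK, bind_vars={"lastValue": last_value}))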
I have created a search project based on Lucene 4.5.1.
There are about 1 million documents, each a few KB, and I index them with the fields docname (stored), lastmodified, and content. The overall size of the index folder is about 1.7 GB.
I used one document (the original one) as a sample and queried its content against the index. The problem is that each query comes back slowly. After some tests, I found that my queries are too large even though I removed stopwords, but I have no idea how to reduce the query string size; plus, the smaller the query string, the less accurate the results.
This is not limited to a specific file; I also tested with other original files, and search performance is still relatively slow (often 1-8 seconds).
Also, I have tried copying the entire index directory into a RAMDirectory at search time, but that didn't help.
In addition, I have a single index searcher shared across multiple threads, but in testing I only used one thread as a benchmark; the expected response time should be a few ms.
So, how can I improve search performance in this case?
Hint: I'm retrieving the top 1000 hits.
If the number of fields is large, a nice solution is to not store them individually but instead serialize the whole object into a single binary field.
The plus is that when projecting the object back out after a query, you read a single field rather than many. getField(name) iterates over the entire field set (O(n/2) on average) before you can get the values and set your object's fields; with one field, you just deserialize.
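To illustrate the serialize-to-one-field idea independently of Lucene's API (the record shape below is hypothetical): instead of storing each property as its own stored field, the whole object is serialized once and restored with a single field lookup after the query.

import json

record = {"docname": "report.pdf", "lastmodified": 1395878400, "author": "...", "pages": 12}

# Many stored fields: rebuilding the object after a query means one
# field lookup per property.
stored_fields = {name: str(value) for name, value in record.items()}

# One binary field: serialize the whole record once at index time ...
blob = json.dumps(record).encode("utf-8")

# ... and after a query, a single fetch plus deserialize restores the object.
restored = json.loads(blob.decode("utf-8"))
assert restored == record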
Second, it might be worth looking at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700