I am trying to ingest about 13k JSON documents into Azure Search, but the index stops at around 6k documents without any error from the indexer, and the index storage size sits at 7.96 MB and never goes past that, no matter what I try.
I have tried smaller batches of 3k per indexer run and then 1k per indexer run, but I got the same result.
In my JSON I have around 10 simple fields and 20 complex fields (which contain other nested complex fields, up to 5 levels deep).
Do you have any idea whether there is a size limit per index, and where I can configure it?
As for the tier, I think we are on the S1 plan (based on the limits we have - 50 indexers, and so on).
Thanks
It's really hard to help without seeing it, but I remember facing a problem like this in the past. In my case it was caused by duplicate values in the key field, so documents were overwriting each other instead of being added.
I also recommend using smaller batches (~500 documents).
PS: Check whether your nested JSON objects are too big (especially if they are marked as retrievable).
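If it helps, here is a rough sketch of pushing the documents in small batches and checking the per-document results, so duplicate keys (which overwrite rather than add) and silent failures become visible. It assumes the Python azure-search-documents SDK; the endpoint, index name, admin key, and the "id" key field are placeholders:

import itertools
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(endpoint="https://my-service.search.windows.net",
                      index_name="my-index",
                      credential=AzureKeyCredential("<admin-key>"))

documents = [...]  # your ~13k JSON objects; the key field is assumed to be "id"

def batches(docs, size=500):
    it = iter(docs)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

seen = set()
for batch in batches(documents):
    for doc in batch:
        if doc["id"] in seen:
            print("duplicate key (will overwrite, not add):", doc["id"])
        seen.add(doc["id"])
    for result in client.upload_documents(documents=batch):
        if not result.succeeded:
            print("failed:", result.key, result.status_code)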
Hi, I've got a simple collection with 40k records in it. It's just an import of a CSV (c. 4 MB), so it has a consistent object per document, and it's for an Open Data portal.
I need to be able to offer a full download of the data as well as the capabilities of AQL for querying, grouping, aggregating etc.
If I set batchSize to the full dataset then it takes around 50 seconds to return, and the response is unsurprisingly about 12 MB because the column names are repeated in every document.
e.g.
{"query":"for x in dataset return x","batchSize":50000}
I've tried things like caching and balancing between a larger batchSize and using the cursor to build up the whole dataset, but I can't get the response time down very much.
Today I came across the attributes() and values() functions and created this AQL statement:
{"query":"return union(
for x in dataset limit 1 return attributes(x,true),
for x in dataset return values(x,true))","batchSize":50000}
It will mean I have to unparse the objects, but I use PapaParse so that should be no issue (not proven yet).
Is this the best / only way to offer a full CSV download and still have a response that performs well?
I am trying to avoid storing the data multiple times, e.g. once as the raw CSV and again as documents in a collection. I guess there may be a dataset that is too big to cope with this approach, but this is one of our bigger datasets.
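For reference, this is roughly how I picture the client side rebuilding the CSV from that result (a sketch using the python-arango driver and Python's csv module in place of PapaParse; connection details and names are made up):

import csv
from arango import ArangoClient

# Hypothetical connection details - adjust to your own deployment.
db = ArangoClient(hosts="http://localhost:8529").db("opendata", username="root", password="pw")

query = """
return union(
  (for x in dataset limit 1 return attributes(x, true)),
  (for x in dataset return values(x, true))
)
"""

# The query returns a single array: its first element is the list of attribute
# names, and every following element is one document's values in the same order.
cursor = db.aql.execute(query, batch_size=10000)
rows = list(cursor)[0]

with open("dataset.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # header row first, then one row per document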
Thanks
I use Azure Table storage as a time series database. The database is constantly extended with more rows (approximately 20 rows per second per partition). Every day I create new partitions for that day's data, so that all partitions have a similar size and never get too big.
Until now everything worked flawlessly: when I wanted to retrieve data from a specific partition it never took more than 2.5 seconds for 1,000 values, and on average it took about 1 second.
When I tried to query all the data of a partition, though, things got really slow; towards the middle of the procedure each query took 30-40 seconds for 1,000 values.
So I cancelled the procedure just to restart it for a smaller range. But now all queries take too long; from the very beginning, every query needs 15-30 seconds. Could that mean the data got rearranged in an inefficient way, and that is why I am seeing this dramatic drop in performance? If so, is there a way to handle such a rearrangement?
I would definitely recommend going over the links Jason pointed to above. You have not given much detail about how you generate your partition keys, but from the sounds of it you are falling into several anti-patterns, including the Append (or Prepend) anti-pattern and putting too many entities in a single partition. I would recommend reducing your partition size and also putting either a hash or a random prefix on your partition keys so they are not in lexicographical order.
Azure Storage uses a range partitioning scheme in the background, so even if the partition keys you picked are unique, if they are sequential they will fall into the same range and potentially be served by a single partition server, which hampers the ability of the Azure storage service to load balance and scale out your storage requests.
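A rough sketch of that prefix idea in plain Python (the sensor/series id and bucket scheme are made up - adapt it to whatever your rows are keyed on):

import hashlib

def partition_key(series_id: str, day: str, bucket_count: int = 16) -> str:
    # Hash a stable attribute into a small bucket number and prepend it, so the
    # daily partition keys are no longer written in lexicographical (append-only)
    # order and one day's writes are spread over several smaller partitions.
    bucket = int(hashlib.sha1(series_id.encode()).hexdigest(), 16) % bucket_count
    return f"{bucket:02d}-{day}-{series_id}"

# e.g. partition_key("sensor-42", "2016-03-14") -> "<bucket>-2016-03-14-sensor-42"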
The other aspect you should think about is how you read the entities back. The best option is a point query with both partition key and row key; the worst is a full table scan with no PartitionKey or RowKey; in between sits the partition scan, which in your case will also perform poorly because of your partition size.
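For comparison, this is roughly what the two access patterns look like with the Python azure-data-tables SDK (table name, keys, and connection string are placeholders):

from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="timeseries")

# Point query: PartitionKey + RowKey, a direct lookup served by one partition server.
entity = table.get_entity(partition_key="03-2016-03-14-sensor-42", row_key="000123")

# Partition scan: every entity in one partition - acceptable for small partitions,
# increasingly slow as the partition grows.
for e in table.query_entities(query_filter="PartitionKey eq @pk",
                              parameters={"pk": "03-2016-03-14-sensor-42"}):
    pass  # process e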
One of the challenges with time series data is that you can end up writing all your data to a single partition, which prevents Table Storage from allocating additional resources to help you scale. Similarly, for read operations you are constrained by potentially having all your data in a single partition, which means you are limited to 2,000 entities per second, whereas if you spread your data across multiple partitions you can parallelize the query and get far greater scale.
Do you have Storage Analytics enabled? I would be interested to know whether you are getting throttled at all, or what other potential issues might be going on. Take a look at the Storage Monitoring, Diagnosing and Troubleshooting guide for more information.
If you still can't find the information you want, please email AzTableFeedback@microsoft.com and we would be happy to follow up with you.
The Azure Storage Table Design Guide covers general scalability guidance as well as patterns / anti-patterns (see the Append Only anti-pattern for a good overview) and is worth looking at.
Each index batch is limited to between 1 and 1,000 documents. When I call it from my local machine or an Azure VM, I get 800 ms to 3,000 ms per 1,000-document batch. If I submit multiple batches asynchronously, the time spent is roughly the same. That means it would take 15-20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service, and although there are a lot of things that can impact how fast data can be ingested, I would expect to see ingestion into a single-partition search service at a rate of about 700 docs/second for an average index. So I think your numbers are not far off from what I would expect, although please note that these are purely rough estimates and you may see different results based on any number of factors (such as the number of fields, quantity of facets, etc.).
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure; it would likely be faster if you did this directly from Azure, but if this is just a one-time upload that is probably not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have, and an S2 search service will also ingest data faster, although both of these come at a cost.
By the way, if you have 50M documents, please make sure that you allocate enough partitions, since a single S1 partition can handle 15M documents or 25 GB, so you will definitely need extra partitions for this dataset.
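To put rough numbers on those figures (a quick back-of-the-envelope check in Python):

docs = 50_000_000
per_partition_docs = 15_000_000         # S1 document limit per partition
print(-(-docs // per_partition_docs))   # 4 -> at least four partitions

docs_per_second = 700                   # rough single-partition ingestion rate
print(docs / docs_per_second / 3600)    # ~19.8 hours, in line with the 15-20 hour estimate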
Also, as another side note, when you are uploading your content (and especially if you choose to do parallelized uploads), keep an eye on the HTTP responses, because if the search service exceeds the resources available you could get HTTP 207 (indicating one or more items failed to apply) or 503 (indicating the whole batch failed due to throttling). If throttling occurs, you will want to back off a bit to let the service catch up.
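As a rough sketch of that back-off idea with the Python azure-search-documents SDK (batch contents, wait times, and the "id" key field are assumptions):

import time
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError
from azure.search.documents import SearchClient

client = SearchClient(endpoint="https://my-service.search.windows.net",
                      index_name="my-index",
                      credential=AzureKeyCredential("<admin-key>"))

def upload_with_backoff(batch, max_attempts=5):
    # Retry only the documents that did not succeed, doubling the wait each time.
    wait = 2
    for _ in range(max_attempts):
        if not batch:
            return []
        try:
            results = client.upload_documents(documents=batch)
        except HttpResponseError:
            # Whole batch rejected (e.g. throttled) - back off and retry everything.
            time.sleep(wait)
            wait *= 2
            continue
        failed = {r.key for r in results if not r.succeeded}
        batch = [d for d in batch if d["id"] in failed]
        if batch:
            time.sleep(wait)
            wait *= 2
    return batch  # anything left here never made it into the index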
I think you're reaching the request capacity:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (S1, S2). If you still face the same problem, try getting in touch with the support team.
Another option:
Instead of pushing data, try adding your data to Blob storage, DocumentDB, or SQL Database, and then use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/
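If you go the pull route, the setup is roughly this (a sketch using the Python azure-search-documents SDK; the data source, container, index, and indexer names are placeholders):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

indexer_client = SearchIndexerClient(endpoint="https://my-service.search.windows.net",
                                     credential=AzureKeyCredential("<admin-key>"))

# Point the search service at the blob container that holds the documents.
data_source = SearchIndexerDataSourceConnection(
    name="my-blob-datasource",
    type="azureblob",
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="my-container"),
)
indexer_client.create_data_source_connection(data_source)

# The indexer pulls from the data source into an existing index, on demand or on a schedule.
indexer = SearchIndexer(name="my-blob-indexer",
                        data_source_name="my-blob-datasource",
                        target_index_name="my-index")
indexer_client.create_indexer(indexer)
indexer_client.run_indexer("my-blob-indexer")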
I have created a search project based on Lucene 4.5.1.
There are about 1 million documents, each a few KB, and I index them with the fields docname (stored), lastmodified, and content. The overall size of the index folder is about 1.7 GB.
I used one document (the original one) as a sample and queried the content of that document against the index. The problem now is that each query comes back slowly. After some tests, I found that my queries are too large even though I removed stopwords, but I have no idea how to reduce the query string size. Plus, the smaller the query string is, the less accurate the results become.
This is not limited to a specific file; I also tested with other original files, and search performance is consistently slow (often 1-8 seconds).
Also, I tried copying the entire index directory to a RAMDirectory for searching; that didn't help.
In addition, I share a single IndexSearcher across multiple threads, but for the benchmark I only used one thread. The expected response time should be a few ms.
So, how can I improve search performance in this case?
Hint: I'm retrieving the top 1000 hits.
If the number of fields is large, a nice solution is not to store them individually but to serialize the whole object into a single binary field.
The plus is that, when projecting the object back out after a query, you read a single field rather than many. getField(name) iterates over the entire field set, so on average it is O(n/2) before you even get the values and populate your object's fields; with one field you just read it and deserialize.
Second, it might be worth looking at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700
I have data stored in Table Storage. When I try to retrieve the data I do this using the partition key and row key. I have been doing some timings to retrieve data of around 8000 bytes.
I'm getting times ranging from 500-700ms and YES my host and storage are in the same data center.
Is Table Storage really so slow or am I doing something very wrong. I was expecting access times to be more like 50ms. Bear in mind that all of my tables added together probably only hold 200 rows.
Your performance numbers certainly sound very poor - and much worse than I've seen.
There are some useful reference numbers - and some good advice - on the storage team blog - see http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
For your specific problem, I suggest writing some very simple test code to measure your numbers again - if you are still seeing the same problems, then post the code here and - if your code really is trivial - then contact MS support.
Are you trying to retrieve multiple entities at once? If so, there is a known bug in the query parser of Table Storage: indexes do not get used when multiple entities are queried directly by their RowKey; instead the request generates a linear scan of the table, which can indeed take 500 to 700 ms per roundtrip.
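If that is the case, one workaround is to issue separate point queries (one PartitionKey + RowKey pair each) rather than a single filter that ORs several RowKeys together. A rough sketch with the Python azure-data-tables SDK (table name, keys, and connection string are placeholders):

from concurrent.futures import ThreadPoolExecutor
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="mytable")
row_keys = ["row-001", "row-002", "row-003"]  # hypothetical keys

# Instead of "PartitionKey eq 'p1' and (RowKey eq 'row-001' or RowKey eq 'row-002' ...)",
# do one direct lookup per entity so nothing falls back to a table scan.
with ThreadPoolExecutor(max_workers=8) as pool:
    entities = list(pool.map(lambda rk: table.get_entity(partition_key="p1", row_key=rk),
                             row_keys))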