What if I have a lot of facets in Azure Search?

In a commercial application, it is not uncommon to have hundreds of facets. Of course, not all products are flagged with all of them.
But when searching, I need to add a facet query string parameter that lists all the facets I want to get back. As I don't know the list of relevant ones in advance, I have to pass all of them in the query.
This is not practical with more than a few facets.
Is there a way to solve this issue or is it a limitation of the product?
The Azure Search doc:
https://msdn.microsoft.com/fr-fr/library/azure/dn798927.aspx

You are correct that this is a current limitation of Azure Search, in that you need to pass all the facets in the query string. Please know that we are aware of this; in fact, it can be an even bigger issue for customers who have so many parameters or facets in their query string that it exceeds the maximum URL size. For this reason, we are investigating what can be done to accommodate this.
I apologize that I do not yet have a date for when this will be available, other than to say it is on our short-term roadmap.
Liam

It looks like Azure Search now supports both GET and POST methods, and recommends using POST when the length of the URL would exceed the maximum of 2048 characters (1024 for the query string alone).
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents
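With POST, the facet list moves into the request body, so its size no longer counts against the URL limit. A minimal sketch using Python's requests library (the service name, index name, API version, and facet field names are illustrative placeholders, not values from this thread):

```python
import requests

# Hypothetical service/index names and key; substitute your own.
SERVICE = "my-service"
INDEX = "products"
API_KEY = "<query-key>"

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs/search?api-version=2020-06-30"
body = {
    "search": "*",
    # With POST, the facet list lives in the JSON body rather than the
    # query string, so it is not subject to URL length limits.
    "facets": ["category", "brand", "color"],  # ...potentially hundreds more
}
response = requests.post(url, json=body, headers={"api-key": API_KEY})
response.raise_for_status()
for field, buckets in response.json().get("@search.facets", {}).items():
    print(field, buckets)
```

Each entry in the facets array can also carry options, e.g. "category,count:50" to return up to 50 buckets for that field.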

Related

Why can't ContinuationToken be used for paging in Azure Search API?

Reading the documentation for the Azure Search .NET SDK, I see that the ContinuationToken property is not supposed to be used for pagination (this is the same as the @odata.nextLink and @search.nextPageParameters properties in the REST API).
Note that this property is not meant to help you implement paging of search results. You can implement paging using the Top and Skip search parameters.
Source
Why can't I use it for pagination? I have a situation where I want to run a query and then step through a static copy of the results page by page. However, I don't want those query results to change beneath my feet as I navigate through them, while new documents are added to the underlying database. In my case, there could be hundreds or thousands of results added in the minute or two between submitting the initial query and navigating to another page. How could I accomplish this?
Your question can be addressed in two parts:
Why is it not recommended to use ContinuationToken to implement pagination?
How can pagination be implemented such that results remain completely stable from page to page?
These are actually unrelated questions, since nothing about ContinuationToken guarantees the stability of the search results. Azure Search makes no consistency guarantees around paging, whether you use $top and $skip or ContinuationToken.
For question #1, the reason ContinuationToken is not recommended for paging is that Azure Search controls when the token is returned, not your application code. If you make assumptions about how and when Azure Search decides to return you a token, there's a chance those assumptions may break with a future service update. The intent of ContinuationToken is to prevent requests for too many documents from overwhelming the service, so you should assume that it is entirely at the service's discretion whether it will return a token.
For question #2, since Azure Search doesn't provide consistency guarantees, you can't completely avoid issues like the same document showing up in multiple pages, missing documents, or documents that are deleted by the time they are seen in results. Even if you wanted to build your own snapshot of the results and page over them in your application code, building a consistent snapshot isn't possible in the first place. However, if your only concern is to avoid showing new documents in the results, you can include a created timestamp field in your index and filter on that in every search request.
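For example, a hedged sketch of that timestamp-filter approach against the REST API, combined with $top/$skip paging (the service, index, and "created" field names are assumptions for illustration):

```python
import requests

# Hypothetical service/index; assumes the index has a filterable "created" field.
URL = ("https://my-service.search.windows.net"
       "/indexes/products/docs/search?api-version=2020-06-30")
HEADERS = {"api-key": "<query-key>"}

# Capture the moment the user first searched; later pages exclude anything
# created after it, so newly indexed documents never appear mid-session.
snapshot = "2024-06-01T00:00:00Z"

def fetch_page(page, size=50):
    body = {
        "search": "laptop",
        "filter": f"created lt {snapshot}",  # OData datetime literals are unquoted
        "top": size,
        "skip": page * size,
    }
    r = requests.post(URL, json=body, headers=HEADERS)
    r.raise_for_status()
    return r.json()["value"]
```

Note this only keeps new documents out; it does not protect against updates or deletions between page fetches.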
Frankly, unless you're trying to export the entire contents of your index, I would question the need for such strong consistency guarantees around paging. Google and Bing make no such guarantees, so arguably user expectations are already set around this. If you are trying to export your data, this is unfortunately not easy with Azure Search today. In that case, please vote on this User Voice item to help the team prioritize this scenario.

How to support pagination for external change log searches against OpenDJ LDAP?

I want to search the change log under "cn=changelog". I can search the results normally when there are not many entries, but if the result contains a lot of entries, there is not enough memory. So I want to page the results. How can I define the size limit?
I also referred to https://bugster.forgerock.org/jira/si/jira.issueviews:issue-html/OPENDJ-1218/OPENDJ-1218.html. However, I wonder how to define a filter that supports "changeNumber". Also, my results do not contain the "changeNumber" attribute. Why?
Please help me; how should I do this?
BTW, I am using OpenDJ 3.0.
The size limit is an option of the client call. You can always specify the maximum number of entries you want returned (the server has its own limit and will enforce the smaller of the two).
How to define the size limit depends on what you are using as a client, and you did not mention it.
Can you provide details on what you are using to search (tool, library...) and what filter and options you are currently using? It's difficult to provide help and suggestions for improvement when there is no detail.
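For illustration only, if the client happened to be Python's ldap3 library, both a hard size limit and true paging would look roughly like this (the host, credentials, and requested attributes are placeholder assumptions):

```python
from ldap3 import Server, Connection, SUBTREE

# Placeholder host/credentials; OpenDJ exposes the changelog under cn=changelog.
conn = Connection(Server("ldap://opendj.example.com:1389"),
                  user="cn=Directory Manager", password="password",
                  auto_bind=True)

# Option 1: cap the number of entries returned by a single search.
conn.search("cn=changelog", "(objectClass=*)", search_scope=SUBTREE,
            attributes=["changeNumber", "targetDN"], size_limit=100)

# Option 2: stream results with the Simple Paged Results control, so
# memory use stays bounded regardless of the total result size.
for entry in conn.extend.standard.paged_search(
        "cn=changelog", "(objectClass=*)", search_scope=SUBTREE,
        attributes=["changeNumber", "targetDN"], paged_size=100):
    print(entry.get("dn"))
```

The paged approach addresses the memory problem directly, since the client never holds more than one page of entries at a time.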

Date function and Selecting top N queries in DocumentDB

I have the following questions regarding Azure DocumentDB:
According to this article, multiple functions have been added to DocumentDB. Is there any way to get Date functions working? How can I get queries of the form "greater than some date" working?
Is there any way to select the top N results, like 'Select top 10 * from users'?
According to the Document playground, Order By will be supported in the future. Is there any other workaround for now?
The application that I am developing requires a certain number of recently inserted results to be displayed. I need these functionalities within a stored procedure. The documents that I am storing in DocumentDB have a DateTime property. I require the above-mentioned functionalities for my application to work. I have searched the documentation and samples. Please help if you know of any workaround.
Some thoughts/suggestions below:
Please take a look at this idea on how to store and query dates in DocumentDB (as epoch timestamps). http://azure.microsoft.com/blog/2014/11/19/working-with-dates-in-azure-documentdb-4/
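The gist of that post, sketched in Python (the property names are illustrative): store the date as a numeric epoch value, and "greater than some date" becomes a plain range comparison that DocumentDB can index efficiently.

```python
from datetime import datetime, timezone

def to_epoch(dt: datetime) -> int:
    """Convert a UTC datetime to Unix epoch seconds for storage."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

# Store both a readable string and the query-friendly epoch number.
doc = {
    "id": "order-1",
    "createdDate": "2015-03-01T10:00:00Z",
    "createdEpoch": to_epoch(datetime(2015, 3, 1, 10, 0)),
}

# "Greater than some date" becomes a numeric range query:
cutoff = to_epoch(datetime(2015, 1, 1))
query = f"SELECT * FROM c WHERE c.createdEpoch > {cutoff}"
```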
To get top N results, set FeedOptions.MaxItemCount and read only one page, i.e., call ExecuteNextAsync() once. See https://msdn.microsoft.com/en-US/library/microsoft.azure.documents.linq.documentqueryable.asdocumentquery.aspx for an example. We're planning to add TOP to the grammar to make this easier in the future.
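That answer describes the older .NET SDK; for illustration only, the same read-one-page trick in today's azure-cosmos Python package looks roughly like this (the account, database, and container names are placeholders):

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint/key/database/container names.
client = CosmosClient("https://my-account.documents.azure.com:443/", "<key>")
container = client.get_database_client("mydb").get_container_client("users")

# Equivalent of FeedOptions.MaxItemCount plus a single ExecuteNextAsync():
# cap the page size at 10 and read only the first page.
pages = container.query_items(
    query="SELECT * FROM c",
    enable_cross_partition_query=True,
    max_item_count=10,
).by_page()
top_ten = list(next(pages))  # first page only = top N
```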
You can email me at arramac at microsoft dot com to get early access to Order By right away. This is planned for broad release shortly.
Please note that stored procedures are best used when you have write operations. You'll get better throughput on reads when you query directly.

Is there any way to skip rows when I retrieve from Azure table storage?

I believe in the past the answer to this question was no. However, has anything changed with the recent releases, or does anyone know of a way I can do this? I am using DataTables and would love to be able to do something like: skip 50, retrieve 50 rows; skip 100, retrieve 50 rows; etc.
It is still not possible to skip rows. The only navigation construct supported is top. The Table Service REST API is the definitive way to access Windows Azure Storage, so its documentation is the go-to location for what is or is not possible.
What you're asking here is possible using continuation tokens. Scott Densmore blogged about this a while ago to explain how you can use continuation tokens for paging when you're displaying a table (like what you're asking here with DataTables): Paging with Windows Azure Table Storage. The blog post shows how to display pages of three items while using continuation tokens to move forward and back between pages.
Besides that, there's also Steve's post that describes the same concept: Paging Over Data in Windows Azure Tables
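Those posts predate the current SDKs, but the pattern maps directly onto today's azure-data-tables package. A hedged Python sketch (the connection string, table name, and page size are illustrative):

```python
from azure.data.tables import TableClient

# Placeholder connection string and table name.
table = TableClient.from_connection_string("<connection-string>", "mytable")

# Page 1: ask for 3 entities per page and capture the continuation token.
pager = table.list_entities(results_per_page=3).by_page()
page_one = list(next(pager))
token = pager.continuation_token  # None when there are no more pages

# Page 2: resume exactly where page 1 left off.
pager = table.list_entities(results_per_page=3).by_page(continuation_token=token)
page_two = list(next(pager))
```

The token is opaque and only moves forward; to support "back" buttons you have to remember the tokens for pages you have already visited, which is essentially what the blog posts describe.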
Yes (kinda) and no. No, in the sense that the Skip operation is not directly supported at the REST head. You could of course do it in memory, but that would defeat the purpose.
However, you can actually implement this pattern if you structure your data correctly. We do something like this ourselves. We align our partition key to the datetime and use the RowKey as a discriminator. This means we can always pinpoint the partition range we are interested in and then Take() some amount of data. So, for example, we can easily Take() the first 20 rows per hour by specifying a unique query (skipping over data we don't want). The partition key is simply aligned per hour, and then we optionally discriminate further using the RowKey; finally, we just take the data. When executed in parallel, this works just dandy.
Again, the more technically correct answer is NO. However, you can approximate it cleverly using the PK and RK.
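A rough sketch of that PK/RK layout using the azure-data-tables package (the entity layout and hour format are assumptions for illustration, not the answerer's actual code):

```python
from datetime import datetime, timezone
from itertools import islice
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", "events")

def first_n_for_hour(hour: datetime, n: int = 20):
    # PartitionKey is aligned to the hour, e.g. "2024-06-01T13";
    # RowKey acts as a discriminator within that hour.
    pk = hour.strftime("%Y-%m-%dT%H")
    entities = table.query_entities(f"PartitionKey eq '{pk}'")
    return list(islice(entities, n))  # Take() the first n rows of the hour

rows = first_n_for_hour(datetime(2024, 6, 1, 13, tzinfo=timezone.utc))
```

Because each hour is its own partition, "skipping" reduces to simply not querying the partitions you don't want, and per-hour queries can run in parallel.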

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene, and other content such as Word documents would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these three objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to handle Solr) to manage security.
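As a concrete illustration of the single-index approach with the per-user filtering from the question, here is a hedged sketch using PyLucene (the ownerId field name is an assumption; the same BooleanQuery wiring applies line-for-line in Java):

```python
import lucene
lucene.initVM()  # start the JVM before touching any Lucene classes

from org.apache.lucene.index import Term
from org.apache.lucene.search import BooleanClause, BooleanQuery, TermQuery

def secure_query(user_query, user_id):
    # Wrap whatever the user typed so that a required ownerId term
    # restricts results to that user's documents; the FILTER clause
    # matches without affecting relevance scoring.
    builder = BooleanQuery.Builder()
    builder.add(user_query, BooleanClause.Occur.MUST)
    builder.add(TermQuery(Term("ownerId", user_id)), BooleanClause.Occur.FILTER)
    return builder.build()
```

The same wrapper covers the "all notes for user X" case: replace the user's query with a query on a document-type field and keep the ownerId filter.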
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you use several machines for indexing, I guess it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
If your machine has several IO devices that could be utilized in parallel. In this case, if you are IO bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed using different groups of fields. Suppose we have two search query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (I guess name updates are far rarer than price updates), updating only part of the index would require fewer IO resources. This would give more overall throughput to the system.
