Azure Search: replicating the result of a nested SQL query

I have a database with the following schema, depicting the linkage between individuals who connect with multiple Advisors; these Advisors in turn have affiliations with multiple organizations:
Individuals--> Advisors (m:n relationship)
Advisors --> Enterprises (m:n relationship)
The business need is to enable search on all these concepts and to organize results around AdvisorIds. For example, a search result could be displayed as follows:
a) Advisor1 -> connected to Individuals A, B, C; and linked to Enterprises X, Y
b) Advisor2 -> connected to Individuals A, E; and linked to Enterprises M, X, Z
To support this, we created a flattened table over these concepts and the relationships between them, so the same AdvisorId appears in multiple rows.
When I search for a string, I want ALL records for an AdvisorId to be returned together, irrespective of the search scores of the individual records.
One approach could be
a) first run an Azure Search query and get a result set of AdvisorIds, ordered by the search score of each record. This will repeat AdvisorIds.
b) take a distinct set of AdvisorIds (across pages) via standard SQL
c) for each AdvisorId, pick all the related records via standard SQL
Two questions:
With this approach, a lot of the processing in (b) and (c) happens outside Azure, which adds latency. Also, if I paginate (a), I can never be sure how many AdvisorIds I will end up with after the distinct operation.
Is there a way to implement the nested search in Azure, so that (a), (b) and (c) happen in a single API call?
If I were to use facets to handle (a) and (b) together, how do I ensure that the ordering is based on the best-scoring document within each facet?

There isn't a way to achieve what you want in a single request unless you model your data differently. Instead of denormalizing the Individuals-Advisors and Advisors-Enterprises relationships, it may be possible to have one document per Advisor and use collections to store information about the related Individuals and Enterprises. This may or may not work for you, depending on whether you need to support correlated filtering on the Individuals and Enterprises related to an Advisor. There is a whitepaper here that should help you evaluate whether this approach would work for you.
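As a rough illustration of that Advisor-per-document shape, here is a minimal sketch using the Python SDK (azure-search-documents); the index and field names are assumptions, not something taken from the question:

# Sketch: one document per Advisor, with collections for the related records.
# Index and field names below are illustrative assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchFieldDataType
)

index = SearchIndex(
    name="advisors",
    fields=[
        SimpleField(name="advisorId", type=SearchFieldDataType.String, key=True),
        SearchableField(name="advisorName", type=SearchFieldDataType.String),
        # Collections carry everything that must be returned together.
        SearchableField(name="individualNames", type=SearchFieldDataType.String,
                        collection=True),
        SearchableField(name="enterpriseNames", type=SearchFieldDataType.String,
                        collection=True),
    ],
)

client = SearchIndexClient("https://<service>.search.windows.net",
                           AzureKeyCredential("<admin-key>"))
client.create_index(index)

With this model, one query returns each matching Advisor exactly once, ranked by its best score, with the related Individuals and Enterprises carried along in the collection fields.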
Another option might be to model Individuals, Advisors, and Enterprises as separate indexes, issue three queries, and do a client-side join. However, this is limited by the number of Advisor IDs you'd need to send in the queries on Individuals and Enterprises. Azure Search has limits on the size of filters that can make this impractical unless your queries have low recall.
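And a sketch of the three-index, client-side join under the same assumptions (hypothetical index and field names); the search.in filter below is the part that runs into the size limits mentioned above, and advisorId would need to be a filterable field:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

cred = AzureKeyCredential("<query-key>")
advisors = SearchClient("https://<service>.search.windows.net", "advisors", cred)
individuals = SearchClient("https://<service>.search.windows.net", "individuals", cred)

# Step 1: find the matching Advisors.
advisor_ids = [doc["advisorId"] for doc in advisors.search(search_text="hello")]

# Step 2: fetch all related Individual records in one go. The filter string
# grows with the number of IDs, which is where the size limits bite.
related = individuals.search(
    search_text="*",
    filter="search.in(advisorId, '{}', ',')".format(",".join(advisor_ids)),
)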
We are working on making Azure Search better for scenarios like yours. For example, we're currently working on adding support for complex types. Please vote on User Voice and feel free to suggest other features that would help.

Related

How to use Azure Search Service with heterogeneous data sources

I have worked with the Azure Search service previously, where I created an indexer directly on a SQL DB in the Azure Portal.
Now I have a use case where I want to ingest from multiple data sources, each having a different schema. Assume these data sources are 3 search APIs belonging to teams X, Y and Z. All of them take a search term and give back results in their own schema. I want my Azure Search service to be a proxy for these, so that there is one search API a user can call to get results from multiple sources, ordered correctly.
How should I go about doing it? I assume I'll have to create a common schema, and whenever a user searches something, call these 3 APIs, map the results to the common schema, and index the mapped data into an Azure Search index. Finally, call the Azure Search API to give the results back to the caller.
I would appreciate any help! A pointer to better documentation for this kind of work would be great as well.
Your assumption is correct. You can work with 3 different indexes and fire queries against them, or you can try to combine all of them in the same index. The benefit of the second approach is a simpler way to implement ordering and paging, as all the information is stored in the same index.
It really depends on what you mean by ordered correctly. Should team X be able to see results from teams Y and Z? The only way you can get ranked results like this is to maintain a single index with a common schema containing data from all teams.
One potential pitfall with this approach is conflicts in the schema, for example if one team requires a field to be of a specific datatype or to use a specific analyzer while another team has different requirements. We do this in our indexes, but with some carefully selected common fields and then dedicated fields prefixed according to our own naming convention to avoid conflicts.
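A minimal sketch of that mapping step, assuming a hypothetical common schema with team-prefixed extras (the stand-in call_team_x_api function and all field names are invented for illustration, and the prefixed fields must already exist in the shared index definition):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient("https://<service>.search.windows.net", "combined",
                      AzureKeyCredential("<admin-key>"))

def call_team_x_api(term):
    # Hypothetical stand-in for team X's search API.
    return [{"id": "1", "title": "Widget", "description": "A widget",
             "extras": {"price": 9.99}}]

def to_common_schema(raw, team):
    # Shared fields every team maps into, plus prefixed team-specific fields.
    doc = {
        "id": "{}-{}".format(team, raw["id"]),
        "team": team,
        "title": raw.get("title", ""),
        "body": raw.get("description", ""),
    }
    for key, value in raw.get("extras", {}).items():
        doc["{}_{}".format(team, key)] = value   # e.g. x_price
    return doc

docs = [to_common_schema(r, "x") for r in call_team_x_api("search term")]
client.upload_documents(documents=docs)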
One thing to consider is the need to reset the index. If you need to change or remove fields, you have to delete the index and create it again with the new schema (adding new fields is less disruptive, but any breaking change forces a rebuild). If you have a common index and team X needs a breaking schema change, you would need to reset (delete and recreate) the common index, which affects all teams.
So, creating separate indexes per team has its benefits: each team can have its own schema without risk of conflicts, and each can reset its index without affecting the other teams.

Look-ahead search on document fields in Azure DocumentDB

We are interested in using DocumentDb as a data store for a number of data sources and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the capabilities we are keen to provide is look-ahead search on certain fields. This is traditionally done with the SQL LIKE syntax, which does not appear to be supported at present.
Searching online, I have seen people talk about integrating Azure Search, but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDFs, but this appears to require an entire collection scan, which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document was inserted/updated/deleted.
DocumentDB supports STARTSWITH, together with range indexes, to enable prefix/look-ahead searching.
You can progressively issue queries like the following as your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
Note that you must configure the collection, or the path/property you're using for these queries, with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, store the lower-case form of the search property and use that for querying.
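A sketch of that case-insensitive variant using the current Python SDK (azure-cosmos, which postdates this answer); the container layout and the pre-lowercased name_lower property are assumptions:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com", "<key>")
container = client.get_database_client("db").get_container_client("hotels")

prefix = "Hil"  # whatever the user has typed so far
results = container.query_items(
    # Each document is assumed to store name_lower, a lowercased copy of name.
    query="SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name_lower, @prefix)",
    parameters=[{"name": "@prefix", "value": prefix.lower()}],
    enable_cross_partition_query=True,
)
for doc in results:
    print(doc["name"])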
I faced a similar situation, where a fast lookup was required, as a user typed search terms.
My scenario was that potentially thousands of simultaneous users would be performing such lookups. When testing this under load, we found that, to avoid saturation and throttling, we would have to increase the DocumentDB Request Unit (RU) throughput to a level that was not financially viable for us in our specific circumstances.
We decided that DocumentDB was best used as the persistent store and for 'full' data retrieval (a role it performs exceptionally well), while a small Elasticsearch cluster performed the role it was designed for: text search, faceted search, weighted search, stemming and, most relevant to your question, autocomplete analyzers and completion suggesters.
The subjects of type-ahead queries, index creation, autocomplete analyzers and query-time 'search as you type' in Elasticsearch are covered here, here and here.
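For reference, a minimal completion-suggester sketch with the modern Python Elasticsearch client; the index and field names are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Map a completion field once, when the index is created.
es.indices.create(index="hotels", mappings={
    "properties": {"name_suggest": {"type": "completion"}}
})
es.index(index="hotels", document={"name_suggest": "Hilton Amsterdam"})
es.indices.refresh(index="hotels")

# Query as the user types; matches appear under suggest/hotel-suggest/options.
resp = es.search(index="hotels", suggest={
    "hotel-suggest": {"prefix": "hil", "completion": {"field": "name_suggest"}}
})
for option in resp["suggest"]["hotel-suggest"][0]["options"]:
    print(option["text"])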
The fact that you plan to have several data sources would also make the Elasticsearch cluster approach more attractive, as a place to aggregate search data.
I used the Bitnami template available in the Azure marketplace to create relatively small instances and, most importantly, this allowed me to place the cluster on the same virtual network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses Elasticsearch under the hood).

How to perform intersection operation on two datasets in Key-Value store?

Let's say I have 2 datasets, one for rules, and the other for values.
I need to filter the values based on rules.
I am using a key-value store (Couchbase, Cassandra, etc.). I can use multi-get to retrieve all the values from one table and all the rules from the other, and perform the validation in a loop.
However, I find this very inefficient: I move a massive volume of data (the values) over the network, and the client stays busy filtering it.
What is the common pattern for finding the intersection of two tables in a key-value store?
The idea behind the NoSQL data model is to write data in a denormalized way, so that each table can answer one precise query. As an example, imagine you have reviews made by customers on shops. You need to know the reviews made by a user on shops, and also the reviews received by a shop. This would be modeled using two tables:
ShopReviews
UserReviews
In the first table you query by shop id; in the second, by user id. The data are written twice, but each row is accessed directly using just a key lookup.
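A minimal sketch of this double-write pattern in Cassandra via the Python driver; the keyspace is assumed to exist, and all table and column names are illustrative:

import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("reviews_ks")  # assumed keyspace

# Same review stored twice, once per access path, each keyed for one query.
session.execute("""CREATE TABLE IF NOT EXISTS shop_reviews (
    shop_id text, review_id timeuuid, user_id text, body text,
    PRIMARY KEY (shop_id, review_id))""")
session.execute("""CREATE TABLE IF NOT EXISTS user_reviews (
    user_id text, review_id timeuuid, shop_id text, body text,
    PRIMARY KEY (user_id, review_id))""")

def add_review(shop_id, user_id, review_id, body):
    # Double write: each table answers its query directly by primary key.
    session.execute(
        "INSERT INTO shop_reviews (shop_id, review_id, user_id, body) "
        "VALUES (%s, %s, %s, %s)", (shop_id, review_id, user_id, body))
    session.execute(
        "INSERT INTO user_reviews (user_id, review_id, shop_id, body) "
        "VALUES (%s, %s, %s, %s)", (user_id, review_id, shop_id, body))

add_review("shop-7", "user-42", uuid.uuid1(), "Great espresso")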
In the same way, you should organize values by rules (I can't be more precise without knowing the relation between them), and so on. One more consideration: newer versions of NoSQL databases support collections, which might help to model one-to-many relations.
HTH, Carlo

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene; other content, such as Word documents, would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these three objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to work with Solr) to manage security.
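For the filter-parameter approach from the question, the usual single-index pattern is for the server to AND a required owner clause into every query it composes; a tiny sketch, where the ownerId field name is an assumption:

user_id = "42"
user_query = "quarterly report"  # whatever the user typed
# The required owner clause is prepended server-side, so documents belonging
# to other users can never match, whatever the raw query contains.
lucene_query = "+ownerId:{} +({})".format(user_id, user_query)

The same composed-query trick also covers the predefined lookups ("all notes for user X" is just the owner clause plus a type clause).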
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you use several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be, so unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
if your machine has several IO devices that can be utilized in parallel. In this case, if you are IO-bound, splitting indexes is a good idea.
splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed over different groups of fields. Suppose we have two query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (presumably, name updates are far rarer than price updates), updating only part of the index requires less IO, which gives the system more overall throughput.

Azure Table Storage Design for Web Application

I am evaluating the use of Azure Table Storage for an application I am building, and I would like to get some advice on...
whether or not this is a good idea for the application, or
if I should stick with SQL, and
if I do go with ATS, what would be a good approach to the design of the storage.
The application is a task-management web application, targeted to individual users. It is really a very simple application. It has the following entities...
Account (each user has an account.)
Task (users create tasks, obviously.)
TaskList (users can organize their tasks into lists.)
Folder (users can organize their lists into folders.)
Tag (users can assign tags to tasks.)
There are a few features / requirements that we will also be building which I need to account for...
We eventually will provide features for different accounts to share lists with each other.
Users need to be able to filter their tasks in a variety of ways. For example...
Tasks for a specific list
Tasks for a specific list which are tagged with "A" and "B"
Tasks that are due tomorrow.
Tasks that are tagged "A" across all lists.
Tasks that I have shared.
Tasks that contain "hello" in the note for the task.
Etc.
Our application is AJAX-heavy, with updates occurring for very small changes to a task. So there are a lot of small requests and updates going on. For example...
Inline editing
Click to complete
Change due date
Etc...
Because of the heavy CRUD work, and the fact that we really just have lists of simple entities, it seems feasible to go with ATS. But I am concerned about the transaction cost of the updates, and also about whether the querying/filtering I described can be supported effectively.
We imagine numbers starting small (~hundreds of accounts, ~hundreds or thousands of tasks per account), but we obviously hope to grow our accounts.
If we do go with ATS, would it be better to have...
One table per entity (Accounts, Tasks, TaskLists, etc.)
Sets of tables per customer (JohnDoe_Tasks, JohnDoe_TaskLists, etc.)
Other thoughts?
I know this is a long post, but if anyone has any thoughts or ideas on the direction, I would greatly appreciate it!
Azure Table Storage is well suited to a task application. As long as you set up your partition keys and row keys well, you can expect fast and consistent performance with a huge number of simultaneous users.
For task sharing, ATS provides optimistic concurrency to support multiple users accessing the same data in parallel. You can use optimistic concurrency to warn users when more than one account is editing the same data at the same time, and prevent them from accidentally overwriting each other's changes.
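A sketch of that optimistic-concurrency check with the current Python SDK (azure-data-tables, which postdates this answer); the table and property names are assumptions:

from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.data.tables import TableClient, UpdateMode

table = TableClient.from_connection_string("<connection-string>", "Tasks")

entity = table.get_entity(partition_key="account1", row_key="task42")
entity["Title"] = "Buy milk (2 liters)"
try:
    # If-Match on the ETag: the update fails if someone else changed the
    # entity after we read it, instead of silently overwriting their edit.
    table.update_entity(entity, mode=UpdateMode.REPLACE,
                        etag=entity.metadata["etag"],
                        match_condition=MatchConditions.IfNotModified)
except ResourceModifiedError:
    pass  # warn the user and reload the latest version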
As to the costs, you can estimate your transaction costs based on the number of accounts and how active you expect those accounts to be. So, if you expect 300 accounts, and each account makes 100 edits a day, you'll have 30K transactions a day, which (at $0.01 per 10K transactions) will cost about $0.03 a day, or a little less than $1 a month. Even if this estimate is off by 10X, the transaction cost per month is still less than a hamburger at a decent restaurant.
For the design, the main aspect to think about is how to key your tables. Before designing your application for ATS, I'd recommend reading the ATS white paper, particularly the section on partitioning. One reasonable design for the application would be to use one table per entity type (Accounts, Tasks, etc.), partition by the account name, and use some unique feature of the tasks for the row key. For both keys, be sure to consider the implications for future queries. For example, by grouping entities that are likely to be updated together into the same partition, you can use Entity Group Transactions to update up to 100 entities in a single transaction; this not only increases speed, but saves on transaction costs as well. As another implication of your keys, if users will tend to look at a single folder at a time, you could use the row key to store the folder (e.g. rowkey="folder;unique task id") and have very efficient queries over one folder at a time.
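A sketch of that keying scheme with the current azure-data-tables SDK, showing an entity group transaction and the folder-prefix query; all names are illustrative:

from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", "Tasks")

# PartitionKey = account, RowKey = "folder;unique task id", as suggested above.
ops = [
    ("upsert", {"PartitionKey": "johndoe", "RowKey": "inbox;task-001",
                "Title": "Pay rent", "Done": False}),
    ("upsert", {"PartitionKey": "johndoe", "RowKey": "inbox;task-002",
                "Title": "Call plumber", "Done": False}),
]
# One entity group transaction: atomic, same partition, billed as one
# transaction (up to 100 entities).
table.submit_transaction(ops)

# Efficient one-folder-at-a-time query via a RowKey prefix range scan
# ('<' is the character right after ';', bounding the prefix).
tasks = table.query_entities(
    "PartitionKey eq 'johndoe' and RowKey ge 'inbox;' and RowKey lt 'inbox<'"
)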
Overall, ATS will support your task application well, and allow it to scale to a huge number of users. I think the main question is: do you need cloud-scale growth? If you do, ATS is a great solution; if you don't, you may find that adjusting to a new paradigm costs more time in design and implementation than the benefits you receive.
What you are asking is a rather big question, so forgive me if I don't give you an exact answer. The short answer would be: sure, go ahead with ATS :)
Your biggest concern in this scenario would be speed. As you've pointed out, you are expecting a lot of CRUD operations. Out of the box, ATS doesn't support transactions across partitions, but you can architect your way around such a challenge by using the CQRS pattern.
The big difference in moving from SQL to ATS is the lack of relations and general query possibilities, since ATS is a "NoSQL" approach. This means you have to structure your tables in a way that supports your query operations, which is not a simple task.
If you are aware of this, I don't see any trouble doing what you're describing.
Would love to see the end result!
