How to combine multiple indexes in Azure Cognitive Search?

I have 3 different indexes, and I want to combine them into one. Is there any way we can combine multiple indexes into a single index, so that I can search that single index?

Yes, but how you do it depends on what data sources you are using and how you set them up.
You can define a data source, an indexer, and an index via the portal. You can also have multiple data sources and indexers feeding a single index, but you need to use the REST API to set this up as explained here; the Azure Portal user interface does not support it.
See a previous post from 2016 about multiple indexers per index: Azure search: use a single index on multiple data sources
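For reference, here is a minimal sketch of what those REST calls could look like when two data sources and two indexers feed the same index. The service URL, API version, connection strings, and all resource names below are placeholders, not values from the question.

```python
# Hedged sketch: two data sources and two indexers that both target ONE index.
# All names, connection strings, and the target index are illustrative placeholders.
import requests

SERVICE = "https://<your-service>.search.windows.net"
API_VERSION = "2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

def create(resource_type, definition):
    """POST a data source or indexer definition to the search service."""
    url = f"{SERVICE}/{resource_type}?api-version={API_VERSION}"
    resp = requests.post(url, headers=HEADERS, json=definition)
    resp.raise_for_status()
    return resp.json()

# Two data sources, e.g. a SQL table and a blob container.
create("datasources", {
    "name": "sql-source",
    "type": "azuresql",
    "credentials": {"connectionString": "<sql-connection-string>"},
    "container": {"name": "Companies"},
})
create("datasources", {
    "name": "blob-source",
    "type": "azureblob",
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "company-docs"},
})

# Two indexers writing into the SAME target index. Both data sources must agree
# on the document key so that records do not collide unintentionally.
for name, source in [("sql-indexer", "sql-source"), ("blob-indexer", "blob-source")]:
    create("indexers", {
        "name": name,
        "dataSourceName": source,
        "targetIndexName": "combined-index",
    })
```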

Related

Join feature from two different indexes

We need data from two different Azure Search indexes. Since we were not able to find any option to join indexes, we are currently replicating data from the different indexes into a new index. Because of this, we are facing issues keeping the redundant data in sync across multiple indexes, as well as the cost of maintaining the data in the new index.
Is there any better option than our current solution for our use case?

Monitor indexes with Azure Cognitive Search

We use Azure Cognitive Search for the search functionality of one of our applications. We would like to monitor it to ensure that the indexes are being updated.
The problem is that we don't use indexers to update the indexes. We have custom APIs that update the indexes which are called from the application that uses the Azure Cognitive Search.
Is there a way to monitor the indexes so that we know they are being updated? For example, by keeping track of the Document Count for the indexes? If the index Document Counts are fluctuating, then clearly the indexes are being updated.
I'd like to add this metric to an Azure dashboard so would prefer a solution that uses the in-built functionality if possible.
Or any other suggestions are welcome too.
This page gives a great description of the metrics that are available to monitor. I see "DocumentsProcessedCount" on that list which seems to be what you are looking for, and the documentation notes that it works for both indexers and pushing documents directly (the latter is your scenario). Also check out https://learn.microsoft.com/azure/search/monitor-azure-cognitive-search for more information on monitoring. Hope this helps!
I would say that monitoring is easier when you don't use one of the built-in indexers. Since you use custom APIs to push content, it's up to you to define how to determine if the content is updated.
Do not rely on document counts: if you add one item and remove another, the count stays the same even though the index changed.
Instead, add a DateTime property called LastProcessed to the items in your index. Make sure you populate this property at the time you submit items to one of your indexes. You can then verify that the index is updated by querying the index sorted by LastProcessed descending; the first item in the response will be the latest item in that index.
That item can be one week old or a few seconds old. You have to implement the logic that decides whether this means the index is up to date.
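As a rough illustration of that approach, a query sorted by LastProcessed could look something like this. The index name, the field name, and the API version are assumptions for the sketch.

```python
# Minimal sketch of the LastProcessed check described above, using the documents
# search REST API. LastProcessed must be marked sortable in the index schema.
import requests

SERVICE = "https://<your-service>.search.windows.net"
API_VERSION = "2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<query-api-key>"}

def latest_processed_timestamp(index_name):
    """Return the LastProcessed value of the most recently submitted document."""
    url = f"{SERVICE}/indexes/{index_name}/docs/search?api-version={API_VERSION}"
    body = {
        "search": "*",
        "orderby": "LastProcessed desc",   # newest document first
        "top": 1,
        "select": "LastProcessed",
    }
    resp = requests.post(url, headers=HEADERS, json=body)
    resp.raise_for_status()
    hits = resp.json()["value"]
    return hits[0]["LastProcessed"] if hits else None

# It is up to you to decide what "updated" means, e.g. alert if this timestamp
# is older than your expected update interval.
print(latest_processed_timestamp("products-index"))
```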

How to use Azure Search Service with heterogenous data sources

I have worked on Azure Search service previously where I created an indexer directly on a SQL DB in the Azure Portal.
Now I have a use case where I want to ingest from multiple data sources, each having a different data schema. Assume these data sources are the search APIs of teams X, Y, and Z. All of them take a search term and return results in their own schema. I want my Azure Search service to be a proxy for these, so that I have one search API a user can call to get results from multiple sources, ordered correctly.
How should I go about doing this? I assume I might have to create a common schema and, whenever a user searches for something, call these 3 APIs, map the results to the common schema, and index the data into an Azure Search index. Finally, I would call this Azure Search API to return the results to the caller.
I would appreciate any help! If I can get hold of better documentation for doing this, that would be great as well.
Your assumption is correct. You can work with 3 different indexes and fire queries against them, or you can try to combine all of them in the same index. The benefit of the second approach is that ordering and paging are easier to implement, since all the information is stored in a single index.
It really depends on what you mean by ordered correctly. Should team X be able to see results from teams Y and Z? The only way you can get ranked results like this is to maintain a single index with a common schema containing data from all teams.
One potential pitfall with this approach is conflicts in the schema, for example if one team requires a field to be of a specific datatype or to use a specific analyzer while another team has different requirements. We do this in our indexes, but with some carefully selected common fields plus dedicated fields prefixed according to our own naming convention to avoid conflicts.
One thing to consider is the need to reset the index. You can add new fields to an existing index, but if you need to change or remove existing fields you will have to delete the index and create it again with the new schema. If you have a common index and team X needs to change one of its properties, you would need to reset (delete and recreate) the common index, which affects all teams.
So, creating separate indexes per team has its benefits. Each team can have their own schema without risk of conflicts and they can reset their index without affecting the other teams.
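To make the trade-off concrete, here is a hedged sketch of what such a common index with prefixed team fields could look like via the REST API. Every field name and attribute here is illustrative, not a prescribed schema.

```python
# Sketch of the "common index" approach: a few shared fields plus team-prefixed
# fields to avoid datatype/analyzer conflicts. All names are assumptions.
import requests

SERVICE = "https://<your-service>.search.windows.net"
API_VERSION = "2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

index_definition = {
    "name": "combined-index",
    "fields": [
        # Common fields every team maps its results into.
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "sourceTeam", "type": "Edm.String", "filterable": True, "facetable": True},
        # Team-specific fields, prefixed to avoid schema conflicts.
        {"name": "teamX_category", "type": "Edm.String", "filterable": True},
        {"name": "teamY_price", "type": "Edm.Double", "sortable": True},
        {"name": "teamZ_publishedAt", "type": "Edm.DateTimeOffset", "sortable": True},
    ],
}

resp = requests.put(
    f"{SERVICE}/indexes/{index_definition['name']}?api-version={API_VERSION}",
    headers=HEADERS,
    json=index_definition,
)
resp.raise_for_status()
```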

Azure Search stored=false

Solr (among others) permits fields to be indexed, but not stored. Unless I’ve missed something in the documentation, Azure Search doesn’t appear to support this option.
It does have an attribute called retrievable, but it states
Currently, selecting this attribute does not cause a measurable increase in index storage requirements.
This suggests to me that Azure Search is storing everything anyway, and perhaps enabling toggling of this behaviour in-place?
My question is, how can I define a field in an equivalent way to stored=false in Azure Search?
As MatsLindh said, in Azure Search an index is a persistent store of documents and other constructs used for filtered and full-text search on an Azure Search service, so you cannot define a field as stored=false.
Given that you have a large index, one of the simplest mechanisms for reducing overhead is to submit multiple documents or records in a single request.
Note: To keep document size down, remember to exclude non-queryable data from the request. Images and other binary data are not directly searchable and shouldn't be stored in the index. To integrate non-queryable data into search results, you should define a non-searchable field that stores a URL reference to the resource.
For more details about indexing large data sets in Azure Search, you could refer to this article.
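As a rough sketch of that advice, the code below pushes documents in batches and stores a URL reference to binary content instead of the content itself. The index name, field names, and batch size are assumptions for illustration.

```python
# Hedged sketch: batch uploads via the documents index REST API, with a URL
# reference field instead of embedding binary data in the index.
import requests

SERVICE = "https://<your-service>.search.windows.net"
API_VERSION = "2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

def upload_batch(index_name, docs):
    url = f"{SERVICE}/indexes/{index_name}/docs/index?api-version={API_VERSION}"
    body = {"value": [{"@search.action": "mergeOrUpload", **doc} for doc in docs]}
    resp = requests.post(url, headers=HEADERS, json=body)
    resp.raise_for_status()
    return resp.json()

documents = [
    {
        "id": str(i),
        "title": f"Report {i}",
        "content": "searchable text only",                  # queryable text
        "imageUrl": f"https://<storage>/images/{i}.png",     # reference, not the binary
    }
    for i in range(10_000)
]

BATCH_SIZE = 1000  # the service accepts at most 1000 documents per request
for start in range(0, len(documents), BATCH_SIZE):
    upload_batch("reports-index", documents[start:start + BATCH_SIZE])
```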

Merge blob in Azure Search

Is it possible to merge multiple blobs into a single Azure Search record?
Complete scenario: we have a list of companies stored as JSON in Cosmos DB and their related documents (.docx/.pdf) in Blob Storage. A company can have multiple documents, each up to 20 MB, and there is no upper limit on the number of documents. How can we merge the content of all documents and push it into the 'content' field of an Azure Search index, so that we can perform full-text search on company data coming from Cosmos DB and Blob Storage?
I've looked into https://www.lytzen.name/2017/01/30/combine-documents-with-other-data-in.html - the scenario discussed in that tutorial has a one-to-one relationship between candidate data and CV. In our case there is a one-to-many relationship between a company and its documents.
Any help / direction would be appreciated.
Thanks
Azure Search Blob Indexer maps each blob to a document in the search index 1:1. At the moment, there isn't a way to merge the content of multiple blobs into a single document automatically. However, you can always write a client application that does this and pushes the aggregated content to the Azure Search index using our SDK or REST API.
I'm curious to learn more about your scenario. With a single document in the index per company, you won't be able to search for individual documents from blob storage. Is that what you want?
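As a rough sketch of such a client application, the code below gathers the blobs for one company, concatenates their extracted text, and pushes a single document. The container, index, and field names as well as extract_text() are hypothetical placeholders.

```python
# Sketch: aggregate all blobs for a company into ONE search document, then push it.
# Assumes blobs are stored under a per-company prefix, e.g. "42/contract.pdf".
import requests
from azure.storage.blob import ContainerClient

SERVICE = "https://<your-service>.search.windows.net"
API_VERSION = "2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

def extract_text(blob_bytes):
    # Placeholder: plug in a real .docx/.pdf text extractor here.
    return blob_bytes.decode("utf-8", errors="ignore")

container = ContainerClient.from_connection_string(
    "<storage-connection-string>", "company-docs"
)

def build_company_document(company_id, company_metadata):
    parts = []
    for blob in container.list_blobs(name_starts_with=f"{company_id}/"):
        data = container.download_blob(blob.name).readall()
        parts.append(extract_text(data))
    return {
        "@search.action": "mergeOrUpload",
        "id": str(company_id),
        **company_metadata,            # fields coming from Cosmos DB
        "content": "\n".join(parts),   # aggregated text from all blobs
    }

doc = build_company_document(42, {"companyName": "Contoso"})
resp = requests.post(
    f"{SERVICE}/indexes/companies-index/docs/index?api-version={API_VERSION}",
    headers=HEADERS,
    json={"value": [doc]},
)
resp.raise_for_status()
```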
It is possible to merge data from different data sources into a single document in a search index, as long as you're trying to "assemble" a document from multiple fields and not merge everything into a single field.
Note that all the data sources must agree on what the document key is. By default, the key is the blob path. Since paths are unique across blobs, agreeing on keys means you need to set a metadata property on your "secondary" blobs that correlates them with the "primary" blob.
You can't use indexers to merge multiple source documents into a single index field such as content. Likely, this is not what you need anyway for JSON metadata stored in Cosmos DB, since you probably want to capture that metadata into its own set of fields. For merging into the content field, you would need to write your own merging logic as noted in the previous response.
It seems that the fundamental primitive that would make your scenario "just work" is collection merge - you would model content not as a string, but as a collection of strings, where each element is extracted from one of your blobs. Please feel free to add a suggestion for collection merge functionality to our UserVoice.
One solution that I found is to compress the documents into a ZIP archive and pass the ZIP file to the Azure Search indexer. The only problem with this solution is that I have to add another processing step for ZIP creation, plus the additional storage cost of keeping the ZIP.
