Azure Search on searching articles with images

I have 500+ articles with images, and I want to retrieve those articles with their images and show them in a chatbot using the Microsoft Bot Framework and Azure Search. But Azure Search isn't able to index images. In this scenario, where do I need to store these images, and how do I map each image to the appropriate article?

You can either populate the documents in the index via code or use an indexer that can create your documents from one of the supported data sources (for example Azure SQL Database, Cosmos DB, Blob Storage, or Table Storage).
You could put the information about the image and a path to the image in any of these data sources and have an indexer pick it up and create your documents within the index.
If you don't want to use a formal database, you could also upload your images into blob storage and decorate each blob with custom metadata. When you create your index with an indexer, it will find the custom metadata, which can become fields on the documents within your index.
There are lots of options, but my advice is to keep the documents within your index as small as possible to control your costs. That's usually done by having as few fields as possible and having fields that reference where the content is located. The documents within an index are for discovering where things are located, NOT for storing the actual data. When your index starts occupying lots of space, your cost goes up a LOT.
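For illustration, here is a minimal sketch of that pattern using the azure-search-documents Python SDK; the service name, index name, field names and blob URL are all placeholders, not anything from your setup:

    # pip install azure-search-documents
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        SearchIndex, SimpleField, SearchableField, SearchFieldDataType,
    )

    endpoint = "https://<your-service>.search.windows.net"  # placeholder
    credential = AzureKeyCredential("<admin-api-key>")       # placeholder

    # Keep the index slim: searchable text plus a URL pointing at the image in blob storage.
    index = SearchIndex(
        name="articles",
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            SearchableField(name="title"),
            SearchableField(name="body"),
            SimpleField(name="imageUrl", type=SearchFieldDataType.String),
        ],
    )
    SearchIndexClient(endpoint, credential).create_index(index)

    # Each document stores only a reference to its image, never the image bytes themselves.
    SearchClient(endpoint, "articles", credential).upload_documents(documents=[{
        "id": "1",
        "title": "My first article",
        "body": "Full article text goes here...",
        "imageUrl": "https://<storage-account>.blob.core.windows.net/article-images/article1.png",
    }])

The bot then renders the image by fetching imageUrl from blob storage after the search result comes back, so the index itself stays small.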

Related

Monitor indexes with Azure Cognitive Search

We use Azure Cognitive Search for the search functionality for one of our applications. We would like to monitor to ensure that the indexes are being updated.
The problem is that we don't use indexers to update the indexes. We have custom APIs that update the indexes which are called from the application that uses the Azure Cognitive Search.
Is there a way to monitor the indexes so that we know they are being updated? For example keep track of the Document Count for the indexes? If the index Document Counts are fluctuating then clearly the indexes are being updated.
I'd like to add this metric to an Azure dashboard so would prefer a solution that uses the in-built functionality if possible.
Or any other suggestions are welcome too.
The metrics documentation gives a great description of the metrics that are available to monitor. I see "DocumentsProcessedCount" on that list, which seems to be what you are looking for, and the documentation notes that it works both for indexers and for pushing documents directly (the latter is your scenario). Also check out https://learn.microsoft.com/azure/search/monitor-azure-cognitive-search for more information on monitoring. Hope this helps!
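If you want to pull that metric programmatically for a dashboard or an alert, a rough sketch with the azure-monitor-query package could look like the following; the subscription, resource group and service names are placeholders, and the metric name is the one mentioned above:

    # pip install azure-monitor-query azure-identity
    from datetime import timedelta
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient

    # Resource ID of the search service (placeholder values).
    resource_id = (
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Search/searchServices/<service-name>"
    )

    client = MetricsQueryClient(DefaultAzureCredential())
    response = client.query_resource(
        resource_id,
        metric_names=["DocumentsProcessedCount"],  # the metric discussed above
        timespan=timedelta(hours=24),
        aggregations=["Total"],
    )

    for metric in response.metrics:
        for series in metric.timeseries:
            for point in series.data:
                print(point.timestamp, point.total)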
I would say that monitoring is easier when you don't use one of the built-in indexers. Since you use custom APIs to push content, it's up to you to define how to determine if the content is updated.
Do not rely on document counts. If you add and remove one item, the count stays the same.
Instead, add a DateTime property called LastProcessed to the items in your index. Make sure you populate this property at the time you submit items to one of your indexes. You can then verify that the index is updated by querying the index, sorting by LastProcessed. The first item in the response will be the latest item you have in that index.
That item can be one week old or a few seconds old. You have to implement the logic that determines if this means that the index is updated.
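A quick sketch of that freshness check with the azure-search-documents Python SDK; the service, index and key values are placeholders, and LastProcessed has to be marked sortable in the index definition:

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    client = SearchClient(
        "https://<your-service>.search.windows.net",  # placeholder
        "my-index",                                    # placeholder
        AzureKeyCredential("<query-key>"),
    )

    # Sort by LastProcessed descending and take the newest document.
    results = client.search(search_text="*", order_by=["LastProcessed desc"], top=1)
    newest = next(iter(results), None)
    if newest is not None:
        print("Index last updated:", newest["LastProcessed"])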

Can Azure Cognitive Search Indexer set field values?

I have an Azure Cognitive Search index which indexes data from multiple data sources. Each data source is indexed with a near identical indexer. Each indexer calls the same skillset configuration.
Within the index definition I have a field labeled "datasource" which is intended to identify the data source for a particular document. I would like to have the indexer, or a modular skill such as a conditional skill, set the value of this field based on the data source. I understand it is possible to use a conditional skill to set the value of a field if a value is not found, but I want to avoid having to create a new skillset for every indexer. My data sources are documents of multiple types in blob containers.
Using only the indexer definition, is it possible to assign the value of a field to a string manually in the definition, by somehow extracting the name of the data source, or by using a modular skill in the skillset definition?
An avenue I have been pursuing is setting user-specified blob metadata at the container level. However, I have not been able to successfully retrieve this information with either the indexer or skillset. I do not want to set this user-specified blob metadata on every single blob in a container.
Unfortunately it is not possible to configure a blob data source in a way that will pass unique information to the skillset. Having a separate skillset per datasource may be the cleanest option. Alternatively, you could pass metadata_storage_path to a custom skill and parse the container path to return a value by convention or mapping.
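As a sketch of that last idea, the core of such a custom skill is just path parsing against the custom Web API skill request/response shape; the function name and the container-to-datasource mapping below are made up for illustration:

    # Minimal custom skill logic: derive a "datasource" value from metadata_storage_path.
    # Host this behind an HTTP endpoint (e.g. an Azure Function) and call it from the skillset,
    # mapping metadata_storage_path to the "path" input.

    CONTAINER_TO_DATASOURCE = {          # illustrative mapping
        "contracts-container": "contracts",
        "reports-container": "reports",
    }

    def run_skill(body: dict) -> dict:
        out = {"values": []}
        for record in body.get("values", []):
            path = record["data"].get("path", "")
            # Path looks like https://<account>.blob.core.windows.net/<container>/<blob-name>
            parts = path.split("/")
            container = parts[3] if len(parts) > 3 else ""
            out["values"].append({
                "recordId": record["recordId"],
                "data": {"datasource": CONTAINER_TO_DATASOURCE.get(container, container)},
                "errors": None,
                "warnings": None,
            })
        return out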

Azure Search stored=false

Solr (among others) permits fields to be indexed, but not stored. Unless I’ve missed something in the documentation, Azure Search doesn’t appear to support this option.
It does have an attribute called retrievable, but it states
Currently, selecting this attribute does not cause a measurable increase in index storage requirements.
This suggests to me that Azure Search is storing everything anyway, and perhaps enabling toggling of this behaviour in-place?
My question is, how can I define a field in an equivalent way to stored=false in Azure Search?
As MatsLindh said, in Azure Search an index is a persistent store of documents and other constructs used for filtered and full-text search on an Azure Search service, so you cannot define a field with stored=false.
Given that you have a large index, one of the simplest mechanisms for keeping indexing efficient is to submit multiple documents or records in a single request.
Note: To keep document size down, remember to exclude non-queryable data from the request. Images and other binary data are not directly searchable and shouldn't be stored in the index. To integrate non-queryable data into search results, you should define a non-searchable field that stores a URL reference to the resource.
For more details, you could refer to the Azure Search documentation on indexing large data sets.
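As a sketch, the closest you can get to stored=false today is a searchable field that is hidden from results, combined with a reference field and batched uploads; all names below are placeholders, and note that (per the quote above) hiding a field does not measurably reduce index storage:

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        SearchIndex, SimpleField, SearchableField, SearchFieldDataType,
    )

    endpoint = "https://<your-service>.search.windows.net"  # placeholder
    credential = AzureKeyCredential("<admin-api-key>")       # placeholder

    index = SearchIndex(
        name="large-index",
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            # Searchable but not returned in results: the closest analogue to stored=false.
            SearchableField(name="body", hidden=True),
            # Reference to where the full content or binary data actually lives.
            SimpleField(name="sourceUrl", type=SearchFieldDataType.String),
        ],
    )
    SearchIndexClient(endpoint, credential).create_index(index)

    # Submit documents in batches (up to 1000 per request) rather than one at a time.
    client = SearchClient(endpoint, "large-index", credential)
    docs = [{"id": str(i), "body": f"text {i}", "sourceUrl": f"https://example.com/{i}"}
            for i in range(5000)]
    for start in range(0, len(docs), 1000):
        client.upload_documents(documents=docs[start:start + 1000])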

Merge blob in Azure Search

Is it possible to merge multiple blob into a single Azure Search record?
Complete scenario: we have a list of companies stored as JSON in Cosmos DB and their related documents (.docx/.pdf) in blob storage. A company can have multiple documents of varying sizes up to 20 MB, and there is no upper limit on the number of documents. How can we merge the content of all documents and push it into the 'content' field of an Azure Search index, so that we can perform full-text search over the company data coming from Cosmos and blob storage?
I've looked into https://www.lytzen.name/2017/01/30/combine-documents-with-other-data-in.html - the scenario discussed in that tutorial has a one-to-one relationship between candidate data and CV. In our case there is a one-to-many relationship between a company and its documents.
Any help / direction would be appreciated.
Thanks
The Azure Search blob indexer maps each blob to a document in the search index 1:1. At the moment, there isn't a way to merge the content of multiple blobs into a single document automatically. However, you can always write a client application that does this and pushes the aggregated content to the Azure Search index using our SDK or REST API.
I'm curious to learn more about the scenario. With a single document in the index per company, you won't be able to search for individual documents from blob storage. Is that what you want?
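A very rough sketch of such a client application, assuming the per-company documents have already been converted to plain-text blobs stored under a virtual folder named after the company; the connection string, container, index and field names are all assumptions:

    # pip install azure-storage-blob azure-search-documents
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(
        "<storage-connection-string>", "company-docs")            # placeholders
    search = SearchClient(
        "https://<your-service>.search.windows.net", "companies",
        AzureKeyCredential("<admin-api-key>"))                    # placeholders

    def merged_content(company_id: str) -> str:
        # Concatenate the text of every blob stored under <company_id>/ .
        parts = []
        for blob in container.list_blobs(name_starts_with=f"{company_id}/"):
            parts.append(container.download_blob(blob.name).readall().decode("utf-8"))
        return "\n".join(parts)

    company = {"id": "42", "name": "Contoso"}   # this part would come from Cosmos DB
    company["content"] = merged_content(company["id"])
    search.merge_or_upload_documents(documents=[company])

For .docx/.pdf blobs you would still need a text-extraction step before the concatenation, since this sketch assumes the blobs are already plain text.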
It is possible to merge data from different datasources into a single document in a search index, as long as you're trying to "assemble" a document from multiple fields and not merging into a single field.
Note that all the data sources must agree on what the document key is. By default, the key is the blob path. Since paths are unique across blobs, agreeing on keys means that you need to set a metadata property on your "secondary" blobs that correlates them with the "primary" blob.
You can't use indexers to merge multiple source documents into a single index field such as content. This is likely not what you need anyway for the JSON metadata stored in Cosmos DB, since you probably want to capture that metadata in its own set of fields. For merging into the content field, you would need to write your own merging logic, as noted in the previous response.
It seems that the fundamental primitive that would make your scenario "just work" is collection merge - you would model content not as a string, but as a collection of strings, where each element is extracted from one of your blobs. Please feel free to add a suggestion for collection merge functionality to our UserVoice.
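For what it's worth, the index-side modelling for that idea would look roughly like this with the Python SDK; today you would still have to populate the collection yourself from your own merging code:

    from azure.search.documents.indexes.models import SearchField, SearchFieldDataType

    # "content" as a collection of strings, one element per source blob.
    content_field = SearchField(
        name="content",
        type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        searchable=True,
    )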
One solution that I found is to compress the documents into a ZIP and pass the ZIP file to the Azure Search indexer. The only problem with this solution is that I have to add another processing step for ZIP creation, plus additional storage cost for keeping the ZIPs.

SphinxSearch or a spider - which one to choose?

We own SiteA and SiteB and they share the same server and database where we have full control.
SiteC, SiteD and SiteE are some of the sites we own as well, but they reside on different web hosts.
The goal is to create a unified search functionality for all of the sites mentioned above. That is, if somebody searches for a term on SiteA, the results will automatically include results from SiteB, SiteC, SiteD and SiteE too. The search results should be shown under the website they were found in.
All of these websites' content is stored in their own databases.
If I use SphinxSearch to index the above sites, I would then require those sites that we don't have complete control over to set up a web service where I can download a database dump or CSV file for indexing.
I'm not quite sure how a spider would come into play here, so I need your opinion.
Sphinx or a spider?
Thanks!
If you can ask the owners of the other websites to give you their content for free, then there is no need for a spider. Just use Sphinx to index the content.
If you can't get the content directly from them, a spider is the only choice for you. There is little else to consider on this point.
Sphinx is a full-text search engine, while a spider is for fetching content from the internet. They are not replacements for each other. Even if you use a spider, you still have to use some full-text search engine software, for example Sphinx or Lucene/Solr.
So you have to make a decision first: do I want to use Sphinx for searching? If the answer is yes, then there is only one thing left: how can I index the content for searching?
Sphinx supports using a database or XML as a data source. A database as a data source is more popular because preparing and updating XML documents in a specific format is very tedious (compared to maintaining a database table). So I guess you will eventually have to store all of the data in a database. As you described, all of the data is already in databases, but some of those databases are out of your control. For your own database, there is no problem. For the databases that are out of your control, I suggest that you use distributed Sphinx searching: http://sphinxsearch.com/docs/2.0.6/distributed.html
The key idea is to horizontally partition (HP) searched data across search nodes and then process it in parallel. Partitioning is done manually. You should set up several instances of Sphinx programs (indexer and searchd) on different servers; make the instances index (and search) different parts of data; configure a special distributed index on some of the searchd instances; and query this index. This index only contains references to other local and remote indexes, so it cannot be directly reindexed; instead, you should reindex the indexes that it references.
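As an illustrative sketch, a distributed setup along those lines in sphinx.conf could look like the following; the index names, host names and paths are made up:

    # sphinx.conf (fragment) on the aggregating searchd instance
    index site_a                 # local index built from SiteA's database
    {
        source = site_a_src
        path   = /var/lib/sphinx/site_a
    }

    index all_sites              # the distributed index you actually query
    {
        type  = distributed
        local = site_a
        local = site_b
        # remote searchd instances that index the other sites' data
        agent = sitec-host:9312:site_c
        agent = sited-host:9312:site_d
        agent = sitee-host:9312:site_e
    }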
