I need to query JSON data stored in Azure Blob storage, with filtering (on text, date, and int fields) and paging (i.e. functionality similar to skip and take).
The problem with my JSON structure is that there is no fixed format for the data (key/value pairs); it is dynamic, so the key/value pairs of one JSON result can differ from those of another.
Can Azure Search help build indexes on such dynamic JSON data so that it can be queried, or is there another preferred way?
Take a look at https://learn.microsoft.com/en-us/azure/search/search-howto-index-json-blobs; it may help you.
Another option is to export the JSON from blob storage into Azure SQL Database or DocumentDB (maybe not everything; if you can, export just the part of the data you need) and query it there.
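For the filtering and paging part, once the blobs are indexed you can query with an OData filter plus skip/top. A minimal sketch using the azure-search-documents Python SDK, with hypothetical index and field names (status, createdDate, quantity):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder service, key, and index names.
client = SearchClient(endpoint="https://<service>.search.windows.net",
                      index_name="json-blobs-index",
                      credential=AzureKeyCredential("<query-api-key>"))

# Filter on text, date, and int fields; skip/top give skip/take-style paging.
results = client.search(
    search_text="*",
    filter="status eq 'active' and createdDate ge 2023-01-01T00:00:00Z and quantity ge 10",
    skip=20,
    top=10,
    order_by=["createdDate desc"],
)
for doc in results:
    print(doc)
```

Note that the index schema itself is fixed: any field used in filter or order_by has to be declared (and marked filterable/sortable) up front, which is the main constraint with fully dynamic JSON.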
If you only need filtering like exact matches and numerical comparisons, then a document database such as DocumentDB may be a better choice than Azure Search.
Azure Search excels at linguistically aware full-text search (including things like dealing with inflected word forms, misspellings, fuzzy matching, etc.).
As Jovan pointed out, the options are not mutually exclusive: you can use DocumentDB as the primary store and Azure Search for full-text search scenarios (pulling data from DocumentDB with the DocumentDB indexer if necessary).
I have an Azure Cognitive Search index which indexes data from multiple data sources. Each data source is indexed with a near identical indexer. Each indexer calls the same skillset configuration.
Within the index definition I have a field labeled "datasource" which is intended to identify the data source for a particular document. I would like the indexer, or a modular skill such as a conditional skill, to set the value of this field based on the data source. I understand it is possible to use a conditional skill to set the value of a field if a value is not found, but I want to avoid having to create a new skillset for every indexer. My data sources are documents of multiple types in blob containers.
Using only the indexer definition, is it possible to assign the value of a field to a string manually in the definition, by somehow extracting the name of the data source, or by using a modular skill in the skillset definition?
An avenue I have been pursuing is setting user-specified blob metadata at the container level. However, I have not been able to successfully retrieve this information with either the indexer or skillset. I do not want to set this user-specified blob metadata on every single blob in a container.
Unfortunately it is not possible to configure a blob data source in a way that will pass unique information to the skillset. Having a separate skillset per datasource may be the cleanest option. Alternatively, you could pass metadata_storage_path to a custom skill and parse the container path to return a value by convention or mapping.
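To illustrate the metadata_storage_path option, a custom Web API skill receives records in the standard enrichment request shape ({"values": [...]}) and returns one value per record. A hedged sketch of a handler that pulls the container name out of the path (the "datasource" output name and the parsing convention are assumptions):

```python
from urllib.parse import urlparse

def run_skill(request_body: dict) -> dict:
    """Custom skill handler: derive a datasource label from metadata_storage_path."""
    results = []
    for record in request_body.get("values", []):
        path = record["data"].get("metadata_storage_path", "")
        # Container name is the first path segment of the blob URL, e.g.
        # https://<account>.blob.core.windows.net/<container>/<folder>/<file>.pdf
        container = urlparse(path).path.lstrip("/").split("/", 1)[0]
        results.append({
            "recordId": record["recordId"],
            "data": {"datasource": container},
            "errors": [],
            "warnings": []
        })
    return {"values": results}
```

The skill's output would then be mapped to the index's datasource field via an output field mapping in each indexer.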
Solr (among others) permits fields to be indexed, but not stored. Unless I’ve missed something in the documentation, Azure Search doesn’t appear to support this option.
It does have an attribute called retrievable, but the documentation states:
Currently, selecting this attribute does not cause a measurable increase in index storage requirements.
This suggests to me that Azure Search is storing everything anyway, perhaps to allow toggling this behaviour in place?
My question is, how can I define a field in an equivalent way to stored=false in Azure Search?
As MatsLindh said, in Azure Search an index is a persistent store of documents and other constructs used for filtered and full-text search on an Azure Search service. So you cannot define a field as stored=false.
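The closest equivalent is retrievable: false, which keeps a field searchable and filterable but excludes it from results; the content is still persisted in the index. A sketch of the index definition via the REST API (service name, key, index, and field names are placeholders):

```python
import requests

SERVICE = "https://<service>.search.windows.net"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

index_definition = {
    "name": "my-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        # Searchable but not returned in results: the closest Azure Search offers
        # to Solr's stored=false, though the content is still kept in the index.
        {"name": "body", "type": "Edm.String", "searchable": True, "retrievable": False}
    ]
}

resp = requests.put(f"{SERVICE}/indexes/my-index?api-version=2020-06-30",
                    headers=HEADERS, json=index_definition)
resp.raise_for_status()
```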
Given that you have a large index, one of the simplest mechanisms is to submit multiple documents or records in a single request.
Note: To keep document size down, remember to exclude non-queryable data from the request. Images and other binary data are not directly searchable and shouldn't be stored in the index. To integrate non-queryable data into search results, you should define a non-searchable field that stores a URL reference to the resource.
For more details about indexing large data sets in Azure Search, you could refer to this article.
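As an illustration of batching, a sketch with the azure-search-documents Python SDK (service, key, and index names are placeholders); the service accepts up to 1000 documents per indexing request, so a large set is split into batches and failures are collected for retry:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(endpoint="https://<service>.search.windows.net",
                      index_name="large-index",
                      credential=AzureKeyCredential("<admin-api-key>"))

def upload_in_batches(docs, batch_size=1000):
    # Each call sends one batch of documents in a single indexing request.
    for i in range(0, len(docs), batch_size):
        results = client.upload_documents(documents=docs[i:i + batch_size])
        failed = [r.key for r in results if not r.succeeded]
        if failed:
            print("Retry needed for keys:", failed)
```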
Is it possible to merge multiple blobs into a single Azure Search record?
Complete scenario: we have a list of companies stored as JSON in Cosmos DB, and their related documents (.docx/.pdf) in blob storage. A company can have multiple documents of varying size (up to 20 MB each) and there is no upper limit on the number of documents. How can we merge the content of all documents and push it into the 'content' field of an Azure Search index, so that we can perform full-text search over company data coming from Cosmos and blob storage?
I've looked into https://www.lytzen.name/2017/01/30/combine-documents-with-other-data-in.html - the scenario discussed in that post has a one-to-one relationship between candidate data and CV. In our case there is a one-to-many relationship between a company and its documents.
Any help / direction would be appreciated.
Thanks
The Azure Search blob indexer maps each blob to a document in the search index 1:1. At the moment, there isn't a way to merge the content of multiple blobs into a single document automatically. However, you can always write a client application that does this and pushes the aggregated content to the Azure Search index using our SDK or REST API.
I'm curious to learn more about the scenario. With a single document in the index per company, you won't be able to search for individual documents from blob storage. Is that what you want?
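A rough sketch of such a client application, assuming one blob prefix per company and an index with id and content fields (all names hypothetical); text extraction for .docx/.pdf is omitted here and would need a library or service of its own:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.storage.blob import ContainerClient

search = SearchClient(endpoint="https://<service>.search.windows.net",
                      index_name="companies",
                      credential=AzureKeyCredential("<admin-api-key>"))
blobs = ContainerClient.from_connection_string("<storage-connection-string>",
                                               container_name="company-docs")

def push_company(company_id: str):
    # Concatenate the (already extracted) text of every document under this
    # company's prefix, then merge it into that company's search document.
    parts = []
    for blob in blobs.list_blobs(name_starts_with=f"{company_id}/"):
        parts.append(blobs.download_blob(blob.name).readall().decode("utf-8", "ignore"))
    search.merge_or_upload_documents(documents=[{
        "id": company_id,
        "content": "\n".join(parts)
    }])
```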
It is possible to merge data from different datasources into a single document in a search index, as long as you're trying to "assemble" a document from multiple fields and not merging into a single field.
Note that:
All the datasources must agree on what the document key is. By default, the key is the blob path. Since paths are unique across blobs, agreeing on keys means you need to set a metadata property on your "secondary" blobs that correlates them with the "primary" blob (see the sketch after these notes).
You can't use indexers to merge multiple source documents into a single index field such as content. Likely, this is not what you need anyway for JSON metadata stored in Cosmos DB, since you probably want to capture that metadata into its own set of fields. For merging into the content field, you would need to write your own merging logic as noted in the previous response.
It seems that the fundamental primitive that would make your scenario "just work" is collection merge - you would model content not as a string, but as a collection of strings, where each element is extracted from one of your blobs. Please feel free to add a suggestion for collection merge functionality to our UserVoice.
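For the key-agreement note above, a hypothetical sketch of the "secondary" indexer: a user-set blob metadata property (primary_key) is mapped onto the index key so both indexers write into the same document. Names are placeholders, and if the primary indexer base64-encodes its key (the default for blob paths) the mapping would need the matching encoding:

```python
import requests

SERVICE = "https://<service>.search.windows.net"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

# Hypothetical indexer: "primary_key" is blob metadata set on each secondary
# blob; mapping it to the key field ("id") of the "companies" index makes the
# secondary content land in the same search document as the primary source.
secondary_indexer = {
    "name": "secondary-blob-indexer",
    "dataSourceName": "secondary-blobs",
    "targetIndexName": "companies",
    "fieldMappings": [
        {"sourceFieldName": "primary_key", "targetFieldName": "id"}
    ]
}

resp = requests.put(f"{SERVICE}/indexers/secondary-blob-indexer?api-version=2020-06-30",
                    headers=HEADERS, json=secondary_indexer)
resp.raise_for_status()
```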
One solution that I found is to compress the documents into a ZIP and pass the ZIP file to the Azure Search indexer. The only problem with this solution is that I have to add another processing step for ZIP creation, plus the additional storage cost of keeping the ZIP.
I am trying to filter rows against a string-type column. Basically I want to filter on part of a string, very similar to the LIKE operation in MySQL.
I have gone through this document https://learn.microsoft.com/en-us/rest/api/storageservices/querying-tables-and-entities
However, I couldn't find relevant information for my requirement. Any suggestions would be helpful.
Basically I want to filter on part of a string, very similar to the LIKE operation in MySQL.
Azure Tables have limited querying support and unfortunately LIKE is unsupported. What you would need to do is fetch the entities to the client side (narrowing the scan server-side with the supported comparison operators where you can) and then apply the substring filter there.
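A sketch of that pattern with the azure-data-tables Python package (table, partition, and column names are made up): narrow the scan with a supported filter, then do the LIKE-style match in client code:

```python
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<storage-connection-string>",
                                           table_name="MyTable")

# Server-side: restrict the scan (here by PartitionKey).
# Client-side: substring match that the Table service cannot do.
needle = "part-of-string"
matches = [
    entity for entity in table.query_entities("PartitionKey eq 'orders'")
    if needle in entity.get("Description", "")  # "Description" is a hypothetical column
]
print(len(matches), "matching entities")
```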
I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results of Whoosh are actual table rows.
When using Solr or Elasticsearch, should I put the row id into the document that gets searched, and after I have my result use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an id like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it. Or am I wrong?
thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB; that gives you more reliability and flexibility for reindexing. Mind that ES indexes non-relational data: you can have your data stored in a relational manner and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full-text search is quite different from standard relational database storage, so your data going into the search engine can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or, conversely, denormalize it so the parent/related record content shows up in the records you actually want returned as part of the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right (relevant) IDs out, and then merge them in the client code with the details from the original database records.