Using PDF files and their metadata for Azure Cognitive Search

I'm uploading hundreds of PDF files into blob storage to be used in Azure Cognitive Search.
I would like the user to be able to see the title and author of these PDF files alongside their search results.
I'm not sure how the metadata for these PDF files (e.g., 'author', 'date', 'title') can be added (e.g., as a JSON file) to the blob storage.
Any advice would be appreciated.
Thanks

I'm from the Microsoft for Founders Hub team. Azure Blob Storage has blob properties and metadata built in! You can view and add metadata through various tools, including the Azure Portal, the CLI, PowerShell, or the REST API. To learn more, here are some good places to get started:
View Blob Properties and Metadata using Azure Tools
Add Blob Metadata using Azure tools and code
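For example, here is a minimal sketch of the Set Blob Metadata REST operation (the account, container, blob, and metadata values are placeholders; user-defined metadata travels in x-ms-meta-* headers):

PUT https://myaccount.blob.core.windows.net/mycontainer/mydoc.pdf?comp=metadata HTTP/1.1
x-ms-version: 2020-10-02
x-ms-date: <current UTC time>
x-ms-meta-title: Quarterly Report
x-ms-meta-author: Jane Doe
Authorization: SharedKey myaccount:<signature>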

If you would like the title, author, and date to be returned in your search results, you can add them to the index. That is, create fields called author, title, and date in your index. Then, in the indexer, you can map the PDF-specific metadata, as mentioned here, like this:
indexer = {
    "name": ...,
    "dataSourceName": ...,
    "targetIndexName": ...,
    "skillsetName": ...,
    "fieldMappings": [
        {
            ....
        },
        {
            "sourceFieldName": "metadata_title",
            "targetFieldName": "title"
        },
        {
            "sourceFieldName": "metadata_creation_date",
            "targetFieldName": "date"
        },
        {
            "sourceFieldName": "metadata_author",
            "targetFieldName": "author"
        }
    ],
    "outputFieldMappings": [
        ...
    ]
    ...
}
Where each "..." is a placeholder for your own values.
Of course, the PDFs must actually contain the metadata; otherwise the field will come back empty ([]).
You can then access these fields just as you would the content field, for example.
NOTE: if you happen to set a null mappingFunction for title, date, or author, you might also get []. If you don't need a mapping function, it's best to remove it.
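For reference, the matching index fields could look like this (a minimal sketch; the field types and attributes are assumptions you should adjust to your scenario):

"fields": [
    ...,
    { "name": "title", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "author", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "date", "type": "Edm.DateTimeOffset", "filterable": true, "retrievable": true }
]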

Related

How to map blob file "content" to an existing "content" field in an index based on blob metadata_storage_path property?

I am trying to create an index using Azure SQL and an Azure Blob data source. The blob container contains files in formats such as Word, PDF, PPTX, and TXT.
Click here for Index Structure
"ItemId" is the key field in the index; its data is pulled from the item table in the DB.
"DocumentList" is a collection which holds file metadata, including the file storage path.
"DocumentList" is derived from a SQL JSON array column; the JSON column holds the files' metadata for each item.
Files are stored in blob storage, and each blob path is stored in the above JSON column under the "DocumentLocation" property.
Note: each row in the DB can reference multiple files in blob storage.
Questions:
How do I map the blob "content" to the "Content" field under the "DocumentList" field in the index, using "DocumentLocation" as the basis for joining?
Can we define a field mapping or output field mapping for the above scenario? If so, how?
Is there another approach to this scenario?
Any suggestions are much appreciated.
Using a snippet like the one below, make the name changes according to your requirement.
"fieldMappings": [
    {
        "sourceFieldName": "metadata_storage_path",
        "targetFieldName": "metadata_storage_path"
    },
    {
        "sourceFieldName": "metadata_storage_path",
        "targetFieldName": "index_key",
        "mappingFunction": {
            "name": "base64Encode"
        }
    }
]
For a better understanding, follow the procedure described in the link below:
https://learn.microsoft.com/en-us/azure/search/search-indexer-field-mappings

Azure Search with DocumentDB does not find data

It's the first time I'm using Azure Search. I followed the example with the generated dataset. Now I want to implement Azure Search on my own database.
This is an example of an item of the collection I want to index.
{
    "_id": "Watch",
    "name": "Watch",
    "cloudProvider": "azure",
    "channel": "C6SELFQMD",
    "services": [
        "azure-backup",
        "azure-data-lake-analytics",
        "backup",
        "blobstorage",
        "site-recovery",
        "storage"
    ],
    "__v": 0
}
Azure Search itself doesn't even detect the fields. If I add the fields manually, it returns useless data. Does somebody know why this is happening? I only have 2 items in my collection at the moment, but I don't think that's the problem.
UPDATE:
So the problem is that I have an underscore before my id ("_id"). Now I'm trying to use fieldMappings to solve this issue, but the API's response is:
{
    "error": {
        "code": "",
        "message": "Data source does not contain column '_id', which is required because it maps to the document key field 'id' in the index 'index'. Ensure that the '_id' column is present in the data source, or add a field mapping that maps one of the existing column names to 'id'."
    }
}
Azure Search currently does not support Cosmos DB Table API accounts, as seems to be the case here.
If you want to see the Table API supported by Azure Search, please vote for "Azure Search should be able to index Cosmos DB Table API collections" to help us prioritize that work.

How To Retrieve Custom Columns For DriveItems in MS Graph

I'm trying to use the Graph API to retrieve a hierarchy of files in a SharePoint document library. Since document libraries are stored in "drives" (is it technically correct to call it OneDrive?), I'm using the /drives endpoint to fetch a list of files, like this:
https://graph.microsoft.com/beta/drives/{driveid}/root/children
I would like to get information from some of the custom columns that exist when viewing these items through SharePoint. Using ?expand=fields doesn't work, because fields only exists on the listItem object of the /sites endpoint, not on the driveItem object of the /drives endpoint. If I try obtaining the listItem from a single driveItem (traversing the graph from OneDrive to SharePoint) and then expanding the fields, like
https://graph.microsoft.com/beta/drives/{driveid}/items/{driveItemId}/listItem?expand=fields
this retrieves built-in columns (Author, DocIcon, and some others) but doesn't seem to retrieve the custom columns.
I've also tried getting the list of files from the /sites endpoint, and using ?expand=fields there does return the custom columns, but it returns every file from every subfolder rather than just the current folder path. But I feel that deserves its own SO question.
Is it possible to retrieve custom column information from driveItems?
I spent a lot of time digging around with the different syntax possibilities and was finally able to get custom library properties using this query format. This is the only one that has produced my custom/user-defined fields for a document library.
https://graph.microsoft.com/v1.0/drives/insert_drive_id_here/root/children?expand=listItem
Shortened result:
{
    "@odata.context": "...",
    "value": [
        {
            "@microsoft.graph.downloadUrl": "...",
            "listItem@odata.context": "...",
            "listItem": {
                "@odata.etag": "...",
                "fields@odata.context": "...",
                "fields": {
                    "@odata.etag": "...",
                    "Title": "...",
                    "Other_Custom_Property": "..."
                }
            }
        }
    ]
}
I did some testing. What SHOULD work is:
https://graph.microsoft.com/beta/drives/{driveid}/root/children?$select=id,MyCustomColumnName
However, when I did that, it just returned the id field. In my opinion, that is a bug in Graph, because this same type of query does work in the SharePoint REST API.
If this helps, you can accomplish this by using the SharePoint REST API. Your endpoint query would be something like:
https://{yoursite}.sharepoint.com/sites/{sitename}/_api/web/lists(guid'{DocumentLibraryID}')/items?$select=id,MyCustomColumnName
There are other ways to do the same query.
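For instance, if you know the library's display name rather than its ID, a GetByTitle variant should also work (the library title 'Documents' is a placeholder here):
https://{yoursite}.sharepoint.com/sites/{sitename}/_api/web/lists/GetByTitle('Documents')/items?$select=id,MyCustomColumnName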
Try the list endpoint, then expand driveItem and fields. You then have both the custom column fields and the driveItem fields.
/beta/sites/[site-id]/lists/[list-id]/items?expand=driveitem,fields&filter=(fields/customColumn eq 'someValue')
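One caveat to be aware of (general SharePoint/Graph behavior, not specific to this answer): if the custom column is not indexed in SharePoint, Graph may reject the filter unless you send the Prefer header that permits non-indexed queries. A sketch:
GET https://graph.microsoft.com/beta/sites/{site-id}/lists/{list-id}/items?expand=driveitem,fields&filter=(fields/customColumn eq 'someValue')
Prefer: HonorNonIndexedQueriesWarningMayFailRandomly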

Azure Search - Match value from comma-separated values string

How do you structure an Azure Search POST REST call to match a value in a comma-separated string?
For example:
I want to search for "GWLAS" or "SAMGV" within the Azure Search field "ProductCategory".
The "ProductCategory" field in the documents holds a comma-separated string such as "GWLAS, EXDEB, SAMGV, AMLKYC".
Any ideas?
If you use the default analyzer for your ProductCategory field (assuming it is searchable), it should word-break on commas by default. This means all you should need to do is search for the terms you're interested in and limit it to the right field:
POST /indexes/yourindex/docs/search?api-version=2016-09-01
{
    "search": "GWLAS SAMGV",
    "searchFields": "ProductCategory"
}
There are other ways to do this, but this is the simplest. If you already scope parts of your search query to other fields, here is how you can scope just the desired terms to ProductCategory:
POST /indexes/yourindex/docs/search?api-version=2016-09-01
{
    "search": "(Name:\"Anderson John\"~3 OR Text:\"Anderson John\"~3) AND ProductCategory:(GWLAS OR SAMGV)",
    "queryType": "full"
}
Please consult the Azure Search REST API documentation for details on other options you can set in the Search request. Also, this article will help you understand how Azure Search executes queries. You can find the reference for the full Lucene query syntax here.
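If you can change the index schema, another option (an alternative not mentioned in the answer above) is to model ProductCategory as Collection(Edm.String), so each comma-separated value becomes a separate element, and then match exact values with an OData filter:

POST /indexes/yourindex/docs/search?api-version=2016-09-01
{
    "search": "*",
    "filter": "ProductCategory/any(c: c eq 'GWLAS') or ProductCategory/any(c: c eq 'SAMGV')"
}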

MongoDB C# Logging Search Results

I am working on a mobile site that lets you search for tags in a MongoDB collection of articles.
Basically, each article object has a tags property, which stores an array of tag strings.
The search works fine, but I also want to add logging to the searches.
The reason is that I want to see what visitors are searching for and what results they are getting in order to optimize the tags.
For example, if the user enters the tag grocery, then I want to save the query results.
Hope my question is clear. Thank you!
You can't optimize something without measuring it. You'll need to be able to compare new results with old results, so you'll have to save a snapshot of all the information crucial to a search query. This obviously includes the search terms themselves, but also an accurate snapshot of the results.
You could create snapshots of entire articles, but it's probably more efficient to save only the information involved in determining the search results. In your case that's the article tags, but perhaps also the article description, if your search engine uses it.
After each search query, you'll have to build a document similar to the following and save it in a searchLog collection in MongoDB.
{
    query: "search terms",
    timestamp: new Date(), // time of the search
    results: [ // array of articles in the search result
        {
            articleId: 123, // _id of the original article
            name: "Lettuce", // name of the article, for easier analysis
            tags: [ "grocery", "lettuce" ] // snapshot of the article tags
            // snapshots of other article properties, if relevant
        },
        {
            articleId: 456,
            name: "Bananas",
            tags: [ "fruit", "banana", "yellow" ]
        }
    ]
}
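Once these log documents accumulate, simple queries surface optimization candidates. For example, this filter document (a sketch assuming the field names above) finds searches that returned no results at all, i.e. the terms your current tags fail to match:

{ "results": { "$size": 0 } }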
