Issue with Azure Blob Indexer - azure

I have come across a scenario where I want to index all the files that are present in the blob storage.
But if a file uploaded to the blob storage is password protected, the indexer fails and is then unable to index the remaining files.
[
{
"key": null,
"errorMessage": "Error processing blob 'url' with content type ''. Status:422, error: "
}
]
Is there a way to ignore the password protected files, or a way to continue with the indexing process even if there is an error in some file?

See the "Dealing with unsupported content types" section in "Controlling which blobs are indexed". Use the failOnUnsupportedContentType configuration setting:
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2016-09-01
Content-Type: application/json
api-key: [admin key]
{
... other parts of indexer definition
"parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}

Is there a way to ignore the password protected files or a way to continue with the indexing process even if there is an error in some file?
One possible way to do it is to define a piece of metadata on the blob named AzureSearch_Skip and set its value to true. In this case, Azure Search will ignore this blob and move on to the next blob in the list.
You can read more about this here: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage#controlling-which-parts-of-the-blob-are-indexed.
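For illustration, here is a rough sketch of setting that metadata through the Blob Storage Set Blob Metadata REST operation (storage account, container, blob name, API version, and authorization are placeholders; the same metadata can also be set from Storage Explorer or any storage SDK):
PUT https://[storage account].blob.core.windows.net/[container]/[blob name]?comp=metadata
x-ms-version: 2019-07-07
x-ms-meta-AzureSearch_Skip: true
Authorization: [Shared Key or SAS]
Once the blob carries AzureSearch_Skip = true, the indexer should leave it out on the next run.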

Why can't Azure Search import JSON blobs?

When importing data using the configuration found below, Azure Cognitive Search returns the following error:
Error detecting index schema from data source: ""
Is this configured incorrectly? The files are stored in the container "example1" and in the blob folder "json". When creating the same index with the same data in the past there were no errors, so I am not sure why it is different now.
Import data:
Data Source: Azure Blob Storage
Name: test-example
Data to extract: Content and metadata
Parsing mode: JSON
Connection string:
DefaultEndpointsProtocol=https;AccountName=EXAMPLESTORAGEACCOUNT;AccountKey=EXAMPLEACCOUNTKEY;
Container name: example1
Blob folder: json
.json file structure.
{
"string1": "vaule1",
"string2": "vaule2",
"string3": "vaule3",
"string4": "vaule4",
"string5": "vaule5",
"string6": "vaule6",
"string7": "vaule7",
"string8": "vaule8",
"list1": [
{
"nested1": "value1",
"nested2": "value2",
"nested3": "value3",
"nested4": "value4"
}
],
"FileLocation": null
}
Here is an image of the screen with the error shown when clicking the "Next: Add cognitive skills (Optional)" button:
To clarify, there are two problems:
1) There is a bug in the portal where the actual error message is not showing up for errors, hence we are observing the unhelpful empty string "" as an error message. A fix is on the way and should be rolled out early next week.
2) There is an error when the portal attempts to detect index schema from your data source. It's hard to say what the problem is when the error message is just "". I've tried your sample data and it works fine with importing.
I'll update the post once the fix for displaying the error message is out. In the meantime (again we're flying blind here without the specific error string) here are a few things to check:
1) Make sure your firewall rules allow the portal to read from your blob storage
2) Make sure there are no extra characters inside your JSON files. Check that the whitespace characters are actually plain whitespace (you should be able to open the file in VS Code and check).
Update: The portal fix for the missing error messages has been deployed. You should be able to see a more specific error message should an error occur during import.
It seems to me that this is a problem related to the list1 data type. Make sure you're selecting "Collection(Edm.String)" for it during index creation.
For more info, please check step 5 of the following link: https://learn.microsoft.com/en-us/azure/search/search-howto-index-json-blobs
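In case it helps, a field declared that way would look roughly like this inside the index definition's "fields" array (the attribute values here are just an assumption; adjust them to your needs):
{
"name": "list1",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": false,
"retrievable": true
}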
I have been in contact with Microsoft, and this is a bug in the Azure Portal. The issue is that the connection string wizard does not append the EndpointSuffix correctly. They have recommended manually pasting the connection string, but this still does not work for me. So this is the answer suggested by Microsoft, but I don't believe it is completely correct, because the portal outputs the same error message:
Error detecting index schema from data source: ""

Azure : How to write path to get a file from a time series partitioned folder using the Azure logic apps

I am trying to retrieve a CSV file from Azure Blob Storage using Logic Apps.
I set the Azure Storage Explorer path in the Parameters, and in the Get blob content action I am using that parameter.
In the Parameters I have set the value as:
concat('Directory1/','Year=',string(int(substring(utcNow(),0,4))),'/Month=',string(int(substring(utcnow(),5,2))),'/Day=',string(int(substring(utcnow(),8,2))),'/myfile.csv')
At run time this path should resolve to:
Directory1/Year=2019/Month=12/Day=30/myfile.csv
but during execution the action fails with the following error message:
{
"status": 400,
"message": "The specifed resource name contains invalid characters.\r\nclientRequestId: 1e2791be-8efd-413d-831e-7e2cd89278ba",
"error": {
"message": "The specifed resource name contains invalid characters."
},
"source": "azureblob-we.azconn-we-01.p.azurewebsites.net"
}
So my question is: how do I write the path to get data from the time-series-partitioned folder?
Joy Wang's answer was partially correct.
Parameters in Logic Apps treat values as plain strings only and will not evaluate functions such as concat().
The correct way to use the concat function is in an expression.
And my solution to the problem is:
concat('container1/','Directory1/','Year=',string(int(substring(utcNow(),0,4))),'/Month=',string(int(substring(utcnow(),5,2))),'/Day=',string(int(substring(utcnow(),8,2))),'/myfile.csv')
You should not use that in the Parameters. When you put the line concat('Directory1/','Year=',string(int(substring(utcNow(),0,4))),'/Month=',string(int(substring(utcnow(),5,2))),'/Day=',string(int(substring(utcnow(),8,2))),'/myfile.csv') in the Parameters, its type is String, so the Logic App treats it as a literal string and the functions will not take effect.
You also need to include the container name in the concat(), and there is no need to use string(int()), because utcNow() and substring() both already return a String.
To fix the issue, use the line below directly in the Blob option (my container name is container1):
concat('container1/','Directory1/','Year=',substring(utcNow(),0,4),'/Month=',substring(utcnow(),5,2),'/Day=',substring(utcnow(),8,2),'/myfile.csv')
Update:
As mentioned in Stark's answer, if you want to drop the leading zero (e.g. turn "01" into "1"), you can convert the substring from string to int and then back to string:
concat('container1/','Directory1/','Year=',string(int(substring(utcNow(),0,4))),'/Month=',string(int(substring(utcnow(),5,2))),'/Day=',string(int(substring(utcnow(),8,2))),'/myfile.csv')

Azure Search Service REST API Delete Error: "Document key cannot be missing or empty."

I am seeing some intermittent and odd behavior when trying to use the Azure Search Service REST API to delete a blob storage blob/document. It works sometimes, and other times I get this:
The request is invalid. Details: actions : 0: Document key cannot be
missing or empty.
Once I start getting this error, I get the same result when trying to delete any of the documents/blobs stored in that index. I do have 'metadata_storage_path' listed as my index key (see below).
I have not been able to get the query to succeed again, or I would examine the differences in Fiddler.
I have also tried the following with no luck:
Resetting and re-running the associated search indexer.
Creating a new indexer & index against the same container and deleting from that.
Creating a new container, indexer, & index and deleting from that.
Any additional suggestions or thoughts?
Copy/paste error: "metadata_storage_name" should be "metadata_storage_path".
[Insert head-banging-on-wall emoji here.]
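For anyone comparing requests, a delete call keyed on metadata_storage_path looks roughly like this (service name, index name, and the key value are placeholders; when metadata_storage_path is the key, the blob indexer normally stores it base64-encoded, so the same encoded value has to be used here):
POST https://[service name].search.windows.net/indexes/[index name]/docs/index?api-version=2016-09-01
Content-Type: application/json
api-key: [admin key]
{
"value": [
{
"@search.action": "delete",
"metadata_storage_path": "[base64-encoded blob URL]"
}
]
}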
For those who are still searching for the solution...
Instead of id,
{
"value": [
{
"#search.action": "delete",
"id":"TDVRT0FPQXcxZGtTQUFBQUFBQUFBQT090fdf"
}
]
}
use the rid of your document to delete:
{
"value": [
{
"#search.action": "delete",
"rid":"TDVRT0FPQXcxZGtTQUFBQUFBQUFBQT090fdf"
}
]
}
This is because, while creating the search index, you might have selected rid as your unique id (key) column.
Note: a document can be deleted only via its unique id (key) column.
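If you are not sure which field actually is the key, one way to check is to fetch the index definition and look for the field marked with "key": true, for example:
GET https://[service name].search.windows.net/indexes/[index name]?api-version=2016-09-01
api-key: [admin key]
Whichever field is marked as the key is the one the delete action has to reference.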

Azure Search decode base64 file contents for Index

I am attempting to use Azure search on a blob container that contains a ton of .htm files. Each one of these files is entirely encoded in base64 with padding. One of these files may be "example.htm", and if you opened it you would see:
//This decodes to html
PCEtLSBBIHNlZ21lbnQgb2YgYSBzd2VldCBib2R5IC0tPg0KPGRpdiBjbGFzcz0iYS1uaWNlLWNsYXNzIiBpZD0iaW1tYS1pZCI+DQoJPHA+Q2F0J3MgYXJlIGhhcmQgdG8gZGVjb2RlPC9wPg0KPC9kaXY+
I have tried to add a field mapping to decode this in my indexer. If I set "useHttpServerUtilityUrlTokenDecode": true then I get “Error applying mapping function ‘base64Decode’ to field ‘NAME’: Array cannot be null.\r\nParameter name: bytes”, and if I set it to false then no files are indexed even though it says "success".
{
"name":"demoindexer",
"dataSourceName" : "demodata",
"targetIndexName" : "demoindex",
"fieldMappings" : [
{
"sourceFieldName" : "content",
"targetFieldName" : "content",
"mappingFunction" :
{ "name" : "base64Decode", "parameters" : {
"useHttpServerUtilityUrlTokenDecode" : false } }
}
],
"parameters":
{
"maxFailedItems":-1,
"maxFailedItemsPerBatch":-1
}
}
It seems that a clue may be a note on the field mappings page for Azure where it says that for "Base64 encoding with padding, use URL-safe characters and remove padding through additional processing after library encoding". I am not sure if this can be done through the Azure Search API, and if so how to go about it, or if it is really just saying to encode the files differently before uploading them into Azure Storage.
How would I go about decoding the contents of these files for my index so that search results will not return base64 strings?
If I understand your situation correctly, your blobs contain only base64-encoded text. If so, you should use text parsing mode to preserve your text as-is so it can be decoded. See Indexing plain text.
After you modify your indexer to use text parsing mode, don't forget to reset the indexer so that it will start indexing your blobs from scratch. This is necessary because previously the indexer skipped over the blobs, since you set "maxFailedItems": -1.
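For reference, text parsing mode is just a configuration setting on the indexer; a minimal sketch, assuming the rest of the indexer definition from the question stays unchanged:
"parameters":
{
"maxFailedItems": -1,
"maxFailedItemsPerBatch": -1,
"configuration": { "parsingMode": "text" }
}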

Azure Logic Apps - Get Blob Content - Setting Content type

The Azure Logic Apps action "Get Blob Content" doesn't allow us to set the return content-type.
By default, it returns the blob as binary (octet-stream), which is useless in most cases. In general it would be useful to have text (e.g. json, xml, csv, etc.).
I know the action is in beta. Is that on the short term roadmap?
A workaround I found is to use the Logic App expression base64ToString.
For instance, create an action of type "Compose" (Data Operations group) with the following code:
"ComposeToString": {
"inputs": "#base64ToString(body('Get_blob_content').$content)",
"runAfter": {
"Get_blob_content": [
"Succeeded"
]
},
"type": "Compose"
}
The output will be the text representation of the blob.
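If the decoded text happens to be JSON, the Compose output could then be fed into a Parse JSON action; a rough sketch (the action name is hypothetical and the schema is left empty here):
"ParseBlobJson": {
"inputs": {
"content": "@outputs('ComposeToString')",
"schema": {}
},
"runAfter": {
"ComposeToString": [
"Succeeded"
]
},
"type": "ParseJson"
}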
So I had a blob sitting in Azure Storage with JSON in it.
Fetching the blob got me an octet-stream back that was pretty useless, as I was unable to parse it.
BadRequest. The property 'content' must be of type JSON in the
'ParseJson' action inputs, but was of type 'application/octet-stream'.
So I set up an "Initialize variable" action, content type of String, pointing to GetBlobContent -> File Content. The base64 conversion occurs under the hood, and I am now able to access my JSON via the variable.
No code required.
[Screenshots: the JSON output and the no-code flow]
Enjoy! Healy in Tampa...
After much fiddling with Logic Apps, I finally understood what was going on.
The JSON output from the HTTP request is the JSON representation of an XML payload:
{
"$content-type": "application/xml",
"$content": "77u/PD94bWwgdm..."
}
So we can decode it, but on its own that is not very useful: for the Logic App this is an XML object, and we can apply XML functions to it, such as xpath.
You would need to know the content-type.
Use @{body('Get_blob_content')['$content']} to get the content part alone.
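For example, if you know the payload is XML, an expression along these lines could pull a value out of it (the element names are purely hypothetical):
xpath(xml(body('Get_blob_content')), 'string(/order/customer/name)')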
It is enough to use "Initialize Variable" and take the output of Get Blob Content as type "String". This will automatically parse the content.
