Azure Search decode base64 file contents for Index

I am attempting to use Azure Search on a blob container that contains a ton of .htm files. Each of these files is entirely encoded in base64 with padding. One of these files might be "example.htm", and if you opened it you would see:
//This decodes to html
PCEtLSBBIHNlZ21lbnQgb2YgYSBzd2VldCBib2R5IC0tPg0KPGRpdiBjbGFzcz0iYS1uaWNlLWNsYXNzIiBpZD0iaW1tYS1pZCI+DQoJPHA+Q2F0J3MgYXJlIGhhcmQgdG8gZGVjb2RlPC9wPg0KPC9kaXY+
I have tried to add a field mapping to decode this in my indexer. If I set "useHttpServerUtilityUrlTokenDecode": true, I get “Error applying mapping function ‘base64Decode’ to field ‘NAME’: Array cannot be null.\r\nParameter name: bytes”, and if I set it to false, no files are indexed even though the indexer reports "success".
{
    "name" : "demoindexer",
    "dataSourceName" : "demodata",
    "targetIndexName" : "demoindex",
    "fieldMappings" : [
        {
            "sourceFieldName" : "content",
            "targetFieldName" : "content",
            "mappingFunction" : {
                "name" : "base64Decode",
                "parameters" : { "useHttpServerUtilityUrlTokenDecode" : false }
            }
        }
    ],
    "parameters" : {
        "maxFailedItems" : -1,
        "maxFailedItemsPerBatch" : -1
    }
}
It seems that a clue may be a note on the Azure field mappings page, which says that for "Base64 encoding with padding, use URL-safe characters and remove padding through additional processing after library encoding". I am not sure whether this can be done through the Azure Search API, and if so how to go about it, or whether it is really saying that the files should be encoded differently before being uploaded to Azure Storage.
How would I go about decoding the contents of these files for my index so that search results will not return base64 strings?

If I understand your situation correctly, your blobs contain only base64-encoded text. If so, you should use text parsing mode to preserve your text as-is so that it can be decoded. See Indexing plain text.
After you modify your indexer to use text parsing mode, don't forget to reset the indexer so that it starts indexing your blobs from scratch. This is necessary because the indexer previously skipped over those blobs, since you set "maxFailedItems" : -1.
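To illustrate (a rough sketch only: it keeps your field mapping exactly as you posted it and just adds the parsing mode to the indexer configuration; depending on your API version and how the blobs were encoded, the mapping parameters may not be needed), the indexer could look roughly like this:
{
    "name" : "demoindexer",
    "dataSourceName" : "demodata",
    "targetIndexName" : "demoindex",
    "fieldMappings" : [
        {
            "sourceFieldName" : "content",
            "targetFieldName" : "content",
            "mappingFunction" : {
                "name" : "base64Decode",
                "parameters" : { "useHttpServerUtilityUrlTokenDecode" : false }
            }
        }
    ],
    "parameters" : {
        "maxFailedItems" : -1,
        "maxFailedItemsPerBatch" : -1,
        "configuration" : { "parsingMode" : "text" }
    }
}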

Related

IBM API Connect "XML to JSON" policy appends $ char and does not give expected JSON

Hi, I am using the "XML to JSON" policy to convert my XML into JSON, but it is adding an extra "$" character. I am not sure what the benefit of having it is, or how to get rid of it.
Currently:
<a>hello</a> becomes { "a": { "$" : "hello" } }
Expecting it to return { "a": "hello" }
Can anyone please help here?
That is because the policy uses the BadgerFish convention for the transformation. To work around it, you can use the Mapper policy to map your attributes when transforming from XML to JSON.
The best way to do XML-to-JSON and JSON-to-XML transforms is with the mapping node. The "automatic" nodes are slower than mapping nodes and can also behave oddly, such as adding "$" (or, in my experience, removing it again when the XML has attributes), so I moved to mapping nodes and was done.

Issue with Azure Blob Indexer

I have come across a scenario where I want to index all the files present in blob storage.
However, if a file uploaded to the blob container is password protected, the indexer fails and is then unable to index the remaining files.
[
    {
        "key": null,
        "errorMessage": "Error processing blob 'url' with content type ''. Status:422, error: "
    }
]
Is there a way to ignore the password-protected files, or a way to continue with the indexing process even if there is an error in some file?
See the Dealing with unsupported content types section in Controlling which blobs are indexed, and use the failOnUnsupportedContentType configuration setting:
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2016-09-01
Content-Type: application/json
api-key: [admin key]
{
    ... other parts of indexer definition
    "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}
Is there a way to ignore the password protected files or a way to continue with the indexing process even if there is an error in some file.
One possible way to do it is to define metadata on the blob named AzureSearch_Skip and set its value to true. In that case, Azure Search will ignore this blob and move on to the next blob in the list.
You can read more about this here: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage#controlling-which-parts-of-the-blob-are-indexed.
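For example, purely as a sketch with placeholder account, container, and blob names, the metadata can be set with the Blob service's Set Blob Metadata operation:
PUT https://[storage account].blob.core.windows.net/[container]/[blob name]?comp=metadata
x-ms-version: 2017-04-17
x-ms-meta-AzureSearch_Skip: true
Authorization: [shared key header, or append a SAS token to the URL instead]
The same metadata can also be set through the Azure portal or Azure Storage Explorer.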

U-SQL: How to skip files from analysis based on content

I have a lot of files each containing a set of json objects like this:
{ "Id": "1", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "2", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "3", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
etc.
Each file contains information about a specific type of session. In this case they are sessions from a web app, but they could also be sessions of a desktop app. In that case the value for Origin is "DesktopClient" instead of "WebClient".
For analysis purposes say I am only interested in DesktopClient sessions.
All files representing a session are stored in Azure Blob Storage like this:
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef77.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef78.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef79.json
Is it possible to skip files whose first line already makes it clear that they are not DesktopClient session files, as in my example? I think it would save a lot of query resources if files that I know do not contain the right session type could be skipped, since they can be quite big.
At the moment my query reads the data like this:
@RawExtract = EXTRACT [RawString] string
FROM @"wasb://plancare-events-blobs@centrallogging/2017/07/20/{*}.json"
USING Extractors.Text(delimiter:'\b', quoting : false);
@ParsedJSONLines = SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple([RawString]) AS JSONLine
FROM @RawExtract;
...
Or should I create my own version of Extractors.Text, and if so, how should I do that?
To answer some questions that popped up in the comments to the question first:
At this point we do not provide access to the Blob Store meta data. That means that you need to express any meta data either as part of the data in the file or as part of the file name (or path).
Depending on the cost of extraction and the sizes of the files, you can either extract all the rows and then filter out the rows whose beginning does not match your criteria (see the sketch after the next paragraph). That will extract all rows from all files, but it does not need a custom extractor.
Alternatively, write a custom extractor that checks for only the files that are appropriate (that may be useful if the first solution does not give you the performance you need and you can determine the conditions efficiently inside the extractor). Several example extractors can be found at http://usql.io in the examples directory (including an example JSON extractor).
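A minimal sketch of the first option, reusing the script from the question (the exact filter string is an assumption based on the sample rows and may need adjusting to your actual formatting):
@RawExtract =
    EXTRACT [RawString] string
    FROM @"wasb://plancare-events-blobs@centrallogging/2017/07/20/{*}.json"
    USING Extractors.Text(delimiter:'\b', quoting : false);

// Keep only rows that look like DesktopClient sessions before the JSON parsing step.
@DesktopRows =
    SELECT [RawString]
    FROM @RawExtract
    WHERE [RawString].Contains("\"Origin\": \"DesktopClient\"");

@ParsedJSONLines =
    SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple([RawString]) AS JSONLine
    FROM @DesktopRows;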

Azure Logic Apps - Get Blob Content - Setting Content type

The Azure Logic Apps action "Get Blob Content" doesn't allow us to set the return content-type.
By default, it returns the blob as binary (octet-stream), which is useless in most cases. In general it would be more useful to get text (e.g. JSON, XML, CSV, etc.).
I know the action is in beta. Is that on the short term roadmap?
A workaround I found is to use the Logic Apps expression base64ToString.
For instance, create an action of type "Compose" (Data Operations group) with the following code:
"ComposeToString": {
"inputs": "#base64ToString(body('Get_blob_content').$content)",
"runAfter": {
"Get_blob_content": [
"Succeeded"
]
},
"type": "Compose"
}
The output will be the text representation of the blob.
So I had a blob sitting in Azure Storage with JSON in it.
Fetching the blob got me an octet-stream back that was pretty useless, as I was unable to parse it:
BadRequest. The property 'content' must be of type JSON in the 'ParseJson' action inputs, but was of type 'application/octet-stream'.
So I set up an "Initialize variable" action with a content type of String, pointing at Get Blob Content -> File Content. The base64 conversion happens under the hood, and I am now able to access my JSON via the variable.
No code required.
(Screenshots of the JSON output and of the code-free flow omitted.)
Enjoy! Healy in Tampa...
After much fiddling with Logic Apps, I finally understood what was going on.
The JSON output from the HTTP request is the JSON representation of an XML payload:
{
    "$content-type": "application/xml",
    "$content": "77u/PD94bWwgdm..."
}
So we can decode it, but it is not very useful on its own: to Logic Apps it is an XML object, and we can apply XML functions to it, such as xpath.
You would need to know the content type.
Use @{body('Get_blob_content')['$content']} to get the content part alone.
It is enough to use "Initialize variable" and take the output of Get Blob Content as type "String". This will automatically parse the content.

How can I get elastic search to return results inside angle brackets?

I'm new to Elasticsearch. I'm trying to fix our search so that it allows users to search on content within HTML tags. Currently, we're using a whitespace tokenizer because we need it to return results on hyphenated names. Consequently, aname123-suffix project is indexed as ["aname123-suffix", "project"] and a user search for "aname123-*" returns the correct results.
My problem arises because we also want to be able to search on content within html tags. So, for example for a project called <aname123>-suffix project, we'd like to be able to enter the search term <aname123>-* and get back the correct results.
The index has the correct tokens for a whitespace tokenizer, namely ["<aname123>-suffix", "project"], but when my search string is "\<aname123\>\-suffix" or "\\<aname123\\>\\-suffix", Elasticsearch returns no results.
I think the solution lies either in
modifying the search string so that Elasticsearch returns <aname123>-suffix when I ask for it; or
being able to index the content within the tag separately from the whitespace tokens, i.e. ["<aname123>-suffix", "project", "aname123", "suffix"]
So far I've been approaching it by changing the indexing, but I have not yet succeeded. A standard tokenizer will allow search results for content within tags, but it fails to return search results for aname123-*. Currently my analyzer settings look like this:
{ "analysis":
{ "analyzer":
{ "my_whitespace_analyzer" :
{"type": "custom"
{"tokenizer": "whitespace},
{"filter": ["standard", "lowercase", "stop"]}
}
},
{ "my_tag_analyzer":
{"type": "custom"
{"tokenizer": "standard"},
{"filter": ["standard", "lowercase", "stop"]}
}
}
}
}
I can create a custom char filter that strips out the < and the >, so my index contains aname123; but for some reason Elasticsearch still does not return correct results when searching on <aname123>*. However, when I instead use a standard analyzer, the index contains aname123 and it returns the expected results for <aname123>* ... What is so special about angle brackets in Elasticsearch?
You may want to take a look at the html_strip character filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
An example from one of the Elasticsearch developers is here:
https://gist.github.com/clintongormley/780895
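For instance (a minimal sketch only; the analyzer name is made up, and the filter list is borrowed from the question's settings), html_strip can be combined with your existing whitespace tokenizer so that tag content is indexed without the angle brackets:
{
    "analysis": {
        "analyzer": {
            "my_html_whitespace_analyzer": {
                "type": "custom",
                "char_filter": ["html_strip"],
                "tokenizer": "whitespace",
                "filter": ["standard", "lowercase", "stop"]
            }
        }
    }
}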
