Escape characters in .csv for Azure Synapse seem to disappear? - azure

I have a .csv file that looks like this:
"ID", "Name", "Extra Info"
"1", "John", "{\"Event\": \"Click\", \"Button Name\": \"Accept\"}
"2", "Adam", "{\"Event\": \"Click\", \"Button Name\": \"Accept\"}
I'm trying to load this file using this code in Synapse:
SELECT
    TOP 2 *
FROM
    OPENROWSET(
        BULK 'https://[MY STORAGE ACCOUNT].dfs.core.windows.net/[FILE PATH]/[...]/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]
Expecting this result:
ID | Name | Extra Info
1 | John | {"Event": "Click", "Button Name": "Accept"}
2 | Adam | {"Event": "Click", "Button Name": "Accept"}
But I keep getting this error:
Error handling external file: 'Unexpected token 'Event\' at [byte: XXX].
Expecting tokens ',', ' ', or '"'. '.
File/External table name: 'https://[MY STORAGE ACCOUNT].dfs.core.windows.net/[FILE PATH]/[...]/[SPECIFIC FILE NAME].csv'.
It looks like it's ignoring the first quote (") and the escape character in the Extra Info column, leading it to think that \Event\ is some special token?
I just don't understand why this happens or what I can do to fix it.

I think I found the answer based on this post and some of the Azure documentation:
How Field Quote works: Is my understanding of how FIELDQUOTE works correct?
Escaping Quotes in the Azure Documentation: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file#escape-quoting-characters
It seems that the only valid way to escape a quote is to double it (""), not to prefix it with a backslash.
This means my .csv should be formatted like this:
"ID", "Name", "Extra Info"
"1", "John", "{""Event"": ""Click"", ""Button Name"": ""Accept""}
"2", "Adam", "{""Event"": ""Click"", ""Button Name"": ""Accept""}
Instead of the original (which uses \"):
"ID", "Name", "Extra Info"
"1", "John", "{\"Event\": \"Click\", \"Button Name\": \"Accept\"}
"2", "Adam", "{\"Event\": \"Click\", \"Button Name\": \"Accept\"}
Unfortunately, I don't see a way around this other than bulk editing all my .csv files...
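If bulk editing is the route taken, it could be scripted. The following is a minimal Python sketch, not a tested migration tool. It assumes that backslashes in these files appear only as quote escapes (\"), so a plain text replacement is safe. The function names and the convert_file helper are hypothetical.

```python
from pathlib import Path


def double_escaped_quotes(text: str) -> str:
    # Replace every backslash-quote pair (\") with a doubled quote ("").
    # Assumption: backslashes occur only as quote escapes in these files.
    return text.replace('\\"', '""')


def convert_file(src: Path, dst: Path) -> None:
    # Hypothetical helper: write a converted copy and leave the original untouched.
    converted = double_escaped_quotes(src.read_text(encoding="utf-8"))
    dst.write_text(converted, encoding="utf-8")


# One field in the backslash-escaped style from the question.
field = '"{\\"Event\\": \\"Click\\"}"'
print(double_escaped_quotes(field))
```

Writing to a separate destination file keeps the originals available in case the assumption about backslashes turns out to be wrong for some rows.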

Related

Azure full phrase match cognitive search query returning results which don't fully match

I am getting unexpected results when using a phrase search query. According to the Microsoft docs (https://learn.microsoft.com/en-us/azure/search/query-simple-syntax), phrases enclosed in quotation marks (" ") should only return matches for the full phrase. However, I am getting results back that I shouldn't, as they don't fully match.
Query string: "building"&parameterName=propertyName&queryType=Full
Results:
"value": [
{
"@search.score": 3.236124,
"id": "PROP127",
"propertyName": "SILVER BUILDING",
"address": "test address",
"fullAddress": "test full address",
"division": "commercial",
"transaction": "lettings",
"selectedCount": null
},
{
"@search.score": 3.2345672,
"id": "PROP323",
"propertyName": "SJW BUILDING",
"address": "test address",
"fullAddress": "test full address",
"division": "commercial",
"transaction": "lettings",
"selectedCount": null
},
The results are returning property names with the word building but this should only appear when typing in "Silver building" for example.
Is there something wrong with the query string?
Any help would be much appreciated!
The document with the property name "silver building" is being returned because Azure Cognitive Search tokenizes the phrase into individual terms. Therefore, searching for any of the following will return the document:
building
silver
silver building
"silver building"
The quotes are used to make sure that a specific phrase is found within a document but it does not mean exact match. For example, a document with the phrase "The quick brown fox" will be found if you search for "brown fox" or "quick brown".
If you do not want the document to be tokenized (broken up into words) then you can use the keyword analyzer which will emit the entire field as a single token. This means a document with "Silver Building" will only match if you search for the specific text.
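The difference between the two behaviours can be sketched with a toy model in Python. This is only an illustration and not how Azure Cognitive Search is implemented: the tokenizer here simply lowercases and splits on whitespace, and the keyword-style match is assumed to compare the whole field text exactly.

```python
def tokenize(text: str) -> list[str]:
    # Toy analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def phrase_match(field: str, phrase: str) -> bool:
    # True when the phrase's terms occur consecutively, in order, in the field.
    terms, wanted = tokenize(field), tokenize(phrase)
    n = len(wanted)
    return n > 0 and any(
        terms[i:i + n] == wanted for i in range(len(terms) - n + 1)
    )


def keyword_match(field: str, query: str) -> bool:
    # Keyword-style analysis: the entire field is a single token,
    # so only the full text matches.
    return field == query


print(phrase_match("SILVER BUILDING", "building"))
print(phrase_match("The quick brown fox", "brown fox"))
print(keyword_match("Silver Building", "building"))
```

In this model a quoted single word such as "building" still matches any field containing that term, which is the behaviour seen in the question.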

How do you return a value within a 2D array in Azure Data Flows Open Expression Builder?

Background: I have a CSV file with a column that has a list of tags for a given row. The tag list is not in any specific order and varies for each cell in the tags column. I am looking for the value in each row that is associated with the key "Owner". When pulling in the CSV file, each cell is read as a single string. An example cell in this column looks like the following:
"Organization": "Microsoft", "Owner": "Eric Holmes", "DateCreated": "07/09/2021"
Goal: I would like to find a way in Azure Data Flows or Azure Data Factory to make a new column with a value for a specific key in a list.
Example:
Current Column
Tags
"Department": "Business", "Owner": "Karen Singh", "DateCreated": "09/20/2019"
"Owner": "Henry Francis", "AppName": "physics-engine", "Department": "GeospatialServices"
"Department": "Fashion", "DateCreated": "01/10/2015", "Owner": "Xiuxiang Long"
Desired Column
Owner
"Karen Singh"
"Henry Francis"
"Xiuxiang Long"
Work So Far: I have taken each string in the tags column and split it into an array at the commas (,). Then I have split each string at each index at the colon (:). This makes the values look like:
Tags
[["Department", "Business"], ["Owner", "Karen Singh"], ["DateCreated", "09/20/2019"]]
[["Owner", "Henry Francis"], ["AppName", "physics-engine"], ["Department", "GeospatialServices"]]
[["Department", "Fashion"], ["DateCreated", "01/10/2015"], ["Owner", "Xiuxiang Long"]]
To split the strings, I've used this open expression
mapIndex(split(replace(Tags, '"', ''), ','), split(#item, ':'))
Problems
I am new to Open Expressions and Azure Data Factory and Data Flows. Does anyone know how I would:
Search for the desired tag like "Owner"
And return the value associated to it
Sorry I know this question sounds very simple but using only open expression functions makes this more convoluted than necessary. Additionally, if there is a better way to go about this problem I'd appreciate any input! I've been banging my head against the wall and any leads help. Thank you!
I tried to reproduce this and was able to achieve it with a Derived Column transformation and split().
Use the expression below in the Derived Column transformation:
split(split(tags,'"Owner":')[2],'"')[2]
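To see why the nested split works, here is a rough Python equivalent. It is a sketch for illustration only, and the function name extract_tag is hypothetical. Data flow expressions index arrays from 1, whereas Python indexes from 0, so each [2] in the expression above corresponds to [1] here.

```python
def extract_tag(tags: str, key: str) -> str:
    # Everything after the key, e.g. ' "Karen Singh", "DateCreated": ...'
    after_key = tags.split(f'"{key}":')[1]
    # Splitting on the quote character leaves the value as the second piece.
    return after_key.split('"')[1]


rows = [
    '"Department": "Business", "Owner": "Karen Singh", "DateCreated": "09/20/2019"',
    '"Owner": "Henry Francis", "AppName": "physics-engine", "Department": "GeospatialServices"',
    '"Department": "Fashion", "DateCreated": "01/10/2015", "Owner": "Xiuxiang Long"',
]
for row in rows:
    print(extract_tag(row, "Owner"))
```

This approach relies on the key being present in every row. In this Python version a row without "Owner" would raise an IndexError.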

How can I search the special characters in Solr

I'm using Solr 6.6.2.
I need to search for special characters and highlight them in Solr,
but it does not work.
My data:
[
{
"id" : "test1",
"title" : "test1# title C# ",
"dynamic_s": 5
},
{
"id" : "test2",
"title" : "test2 title C#",
"dynamic_s": 10
},
{
"id" : "test3",
"title" : "test3 title",
"dynamic_s": 0
}
]
When I search for "C#",
the response looks like this: "test1# title C# ".
Only the "C" is highlighted; the "#" is neither searched nor highlighted.
How can I make the search and highlight work for special characters?
The StandardTokenizer splits tokens on special characters, meaning that # will split the content into separate tokens - the first token will be C - and that's what's being highlighted. You'll probably get the exact same result if you just search for C.
After tokenization, your tokens end up being test2, title and C.
Using a field type with a WhitespaceTokenizer that only splits on whitespace will probably be a better choice for this exact use case, but it's impossible to say whether that'll be a good match for your regular search behavior (i.e. if you actually want to match 'C' to 'C-99' etc., splitting on those characters can be needed). But you can use a specific field for highlighting, and that field's analysis chain will be used to determine what to highlight. You can also ask for both the original and the more specific field to be highlighted, and then use the best result in your frontend application.
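The effect of the two tokenization strategies can be illustrated with a small Python sketch. This is not Solr code: the regular expression is only a rough stand-in for a tokenizer that drops characters such as #, and the function names are hypothetical.

```python
import re


def standard_like_tokens(text: str) -> list[str]:
    # Rough stand-in (assumption): keep runs of word characters,
    # dropping punctuation such as '#'.
    return re.findall(r"\w+", text)


def whitespace_tokens(text: str) -> list[str]:
    # Split on whitespace only, so 'C#' survives as a single token.
    return text.split()


title = "test2 title C#"
print(standard_like_tokens(title))
print(whitespace_tokens(title))
```

With the first function a query for C# and a query for C reduce to the same token, which is why only C gets highlighted.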

SELECT string expression recognizes single not double quotes?

I'm creating an Azure Stream Analytics query and I needed to output a constant column header/value, and I noticed that if I include a string expression as a SELECT item, it needs to be enclosed in single (') not double (") quotes, otherwise I get a NULL.
SELECT 'foo' as "bar" INTO ... FROM ... >>> outputs foo as a value of bar.
SELECT "foo" as "bar" INTO ... FROM ... >>> outputs null as a value of bar.
Why does a string literal require single quotes? And if I use double quotes, what is it interpreting that literal as?
Thanks
-John
Based on my test, both single (') and double (") quotes are accepted, but they produce different results.
My test JSON is below:
{"plantId": "Plant A", "machineId" : "M001", "sensorId": "S001", "unit": "kg", "time": "2017-09-05T22:00:14.9410000Z", "value": 1234.56}
{"plantId": "Plant A", "machineId" : "M001", "sensorId": "S001", "unit": "kg", "time": "2017-09-05T22:00:19.5410000Z", "value": 1334.76}
If you use single quotes, the value is treated as a string literal, so the output is the literal itself. The query does not treat the value as a column name.
If you use double quotes, the query treats the value as a column name. Because the input has no column with that name, nothing can be selected and the result is null.
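The distinction can be mimicked with a toy evaluator in Python. This is only a model of the behaviour described above, not Stream Analytics' actual parser, and eval_select_item is a hypothetical name. The event is one of the test records above.

```python
def eval_select_item(item: str, event: dict):
    # Single quotes: a string literal, returned as-is.
    if len(item) >= 2 and item[0] == item[-1] == "'":
        return item[1:-1]
    # Double quotes: a quoted identifier, looked up as a column name.
    if len(item) >= 2 and item[0] == item[-1] == '"':
        return event.get(item[1:-1])
    # Unquoted: also a column name.
    return event.get(item)


event = {"plantId": "Plant A", "machineId": "M001", "sensorId": "S001",
         "unit": "kg", "time": "2017-09-05T22:00:14.9410000Z", "value": 1234.56}

print(eval_select_item("'foo'", event))   # literal
print(eval_select_item('"foo"', event))   # no such column
print(eval_select_item('"unit"', event))  # existing column
```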

Searching for terms with underscore doesn't return expected results

How can I find a document named "Hola-Mundo_Army.jpg" by searching for Army* (I always need the asterisk at the end)? If I search using Army*, I get zero results. I think the problem is the underscore before the word Army.
But if I search for Mundo_Army*, one result is found, correctly.
docs?api-version=2016-09-01&search=Mundo_Army* <--- 1 result OK
docs?api-version=2016-09-01&search=Army* <--- 0 results and it should find 1 result like the previous search. I always need to use the asterisk at the end.
Thank you!
This is the blob information that I have to search and find:
{
"@search.score": 1,
"content": "{\"azure_cdn\":\"http:\\/\\/dev-dr-documents.azureedge.net\\/localhost-hugo-docs-not-indexed\\/Hola-Mundo_Army.jpg\"}\n",
"source": "dr",
"title": "Hola-Mundo_Army.jpg",
"file_name": "Hola-Mundo_Army.jpg",
"file_type": "Image",
"year_created": "2017",
"client": "LALALA",
"brand": "LELELE",
"description": "HUGO_DEV-TUCUMAN",
"categories": "Clothing and Accessories",
"media": "Online media",
"tags": null,
"channel": "Case Study",
"azuresearch_skipcontent": "1",
"id": "1683",
"metadata_storage_content_type": "application/octet-stream",
"metadata_storage_size": 109,
"metadata_storage_last_modified": "2017-04-26T18:30:35Z",
"metadata_storage_content_md5": "o2yZWelvS/EAukoOhCuuKg==",
"metadata_storage_name": "Hola-Mundo_Army.json",
"metadata_content_encoding": "ISO-8859-1",
"metadata_content_type": "text/plain; charset=ISO-8859-1",
"metadata_language": "en"
}
The best way to troubleshoot cases like this is by using the Analyze API. It will help you understand how your documents and query terms are processed by the search engine. In your case, assuming you are not setting the analyzer property on the field you are searching against, the text Hola-Mundo_Army.jpg is broken down by the default analyzer into the following two terms: hola, mundo_army.jpg. These are the terms that are in your index. That's why, when you are searching for the prefix mundo_army*, the term mundo_army.jpg is matched. Prefix army* doesn't match anything in your index.
You can learn more about the default behavior of the search engine and how to customize it in this article: How full text search works in Azure Search
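As a toy illustration of the explanation above, prefix matching can be modelled in Python as a comparison against the start of each indexed term. The two terms come from the answer. Lowercasing the query prefix is an assumption of this sketch, and prefix_search is a hypothetical name, not an Azure Search API.

```python
# Terms the default analyzer produces for "Hola-Mundo_Army.jpg", per the answer.
indexed_terms = ["hola", "mundo_army.jpg"]


def prefix_search(prefix_query: str, terms: list[str]) -> list[str]:
    # Strip the trailing '*' and lowercase the prefix (assumption),
    # then keep only the terms that start with it.
    prefix = prefix_query.rstrip("*").lower()
    return [term for term in terms if term.startswith(prefix)]


print(prefix_search("Mundo_Army*", indexed_terms))
print(prefix_search("Army*", indexed_terms))
```

Because army sits in the middle of the term mundo_army.jpg rather than at its start, the second query finds nothing in this model.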
