Ingest base64 encoded avro messages in druid

I want to ingest base64 encoded avro messages in druid. I am getting the following error -
Avro's unnecessary EOFException, detail: https://issues.apache.org/jira/browse/AVRO-813
Going through the code (line 88) https://github.com/apache/druid/blob/master/extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/InlineSchemaAvroBytesDecoder.java , it does not seem to decode the messages with a base64 decoder. Am I missing something? How can we configure Druid to parse base64 encoded Avro messages?
Spec used -
"inputFormat": {
"type": "avro_stream",
"avroBytesDecoder": {
"type": "schema_inline",
"schema": {
"namespace": "org.apache.druid.data",
"name": "User",
"type": "record",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "price",
"type": "int"
}
]
}
},
"flattenSpec": {
"useFieldDiscovery": true,
"fields": [
{
"type": "path",
"name": "someRecord_subInt",
"expr": "$.someRecord.subInt"
}
]
},
"binaryAsString": false
}
Thanks:)

I haven't used Avro ingestion, but from the Apache Druid docs here it seems like you need to set "binaryAsString" to true.
The extension returns bytes and fixed Avro types as base64 encoded strings by default. To decode these types as UTF-8 strings, enable the binaryAsString option on the Avro parser.
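If that is the issue, here is a minimal sketch of the same inputFormat with only that flag flipped (everything else copied unchanged from the spec in the question):
"inputFormat": {
    "type": "avro_stream",
    "avroBytesDecoder": {
        "type": "schema_inline",
        "schema": {
            "namespace": "org.apache.druid.data",
            "name": "User",
            "type": "record",
            "fields": [
                { "name": "id", "type": "string" },
                { "name": "price", "type": "int" }
            ]
        }
    },
    "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
            { "type": "path", "name": "someRecord_subInt", "expr": "$.someRecord.subInt" }
        ]
    },
    "binaryAsString": true
}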

Related

python3 Unicode to chinese

I have a coding problem: I have a JSON file and need to convert the content field to readable traditional Chinese, which may contain emoji and the like. I would like to do this with Python 3. An example of the JSON file is as follows:
"messages": [
{
"sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
"timestamp_ms": 1610288228221,
"content": "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2",
"type": "Generic",
"is_unsent": false
},
{
"sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
"timestamp_ms": 1610288227699,
"share": {
"link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
"share_text": "//\nMemorabilia\u00f0\u009f\u0087\u00b0\u00f0\u009f\u0087\u00b7\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a6\n\u00f0\u009f\u0098\u0086\u00f0\u009f\u00a4\u00a3\u00f0\u009f\u00a4\u00ac\u00f0\u009f\u0098\u008c\u00f0\u009f\u0098\u00b4\u00f0\u009f\u00a4\u00a9\u00f0\u009f\u00a4\u0093\n#191214\n#191221",
"original_content_owner": "_ki.zeng"
},
"type": "Share",
"is_unsent": false
},
{
"sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
"timestamp_ms": 1607742844729,
"content": "\u00e6\u0089\u00ae\u00e7\u009e\u0093\u00e5\u00b0\u00b1\u00e5\u00a5\u00bd",
"type": "Generic",
"is_unsent": false
}]
The data posted isn't valid JSON (it's at least missing a set of outer curly braces) and was encoded incorrectly: UTF-8 bytes were written as Unicode code points. Ideally, fix the code that produced it, but the following will repair the mess you have now, if "input.json" is the original data with the outer curly braces added:
import json

# Read the raw bytes of the data file
with open('input.json', 'rb') as f:
    raw = f.read()

# There are some newline escapes that shouldn't be converted,
# so double-escape them so the result leaves them escaped.
raw = raw.replace(rb'\n', rb'\\n')

# Convert all the escape codes to Unicode characters
raw = raw.decode('unicode_escape')

# The characters are really UTF-8 byte values.
# The "latin1" codec translates Unicode code points 1:1 to byte values,
# resulting in a byte string again.
raw = raw.encode('latin1')

# Decode correctly as UTF-8
raw = raw.decode('utf8')

# Now that the JSON is fixed, load it into a Python object
data = json.loads(raw)

# Re-write the JSON correctly.
with open('output.json', 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
Result:
{
  "messages": [
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1610288228221,
      "content": "我隔離",
      "type": "Generic",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1610288227699,
      "share": {
        "link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
        "share_text": "//\nMemorabilia🇰🇷👩‍👩‍👧‍👧👨‍👨‍👦\n😆🤣🤬😌😴🤩🤓\n#191214\n#191221",
        "original_content_owner": "_ki.zeng"
      },
      "type": "Share",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1607742844729,
      "content": "扮瞓就好",
      "type": "Generic",
      "is_unsent": false
    }
  ]
}
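The key step is the latin1 round trip: each escape like \u00e6 is really a UTF-8 byte value stored as a code point. A minimal sketch checking this on a single content value from the data above:
# The escapes below are UTF-8 byte values masquerading as code points.
s = '\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2'
# latin1 maps code points 0-255 back to bytes 1:1; UTF-8 then decodes them properly.
print(s.encode('latin1').decode('utf8'))  # prints 我隔離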

How to set friendlyName for an Azure Data Catalog Asset via REST API?

I am registering assets to Azure Data Catalog via the REST API. I can register my assets without any problem, but when I want to add a "friendlyName" to my assets, I get an error. I am using the exact syntax shown here. Here is the JSON that I am sending:
"annotations": {
"schema": {
"properties": {
"fromSourceSystem": false,
"columns": [{"name": "com.xxx.xx.claim ", "type": " VARCHAR"}, {"name": "com.xx.xx.requirement ", "type": " VARCHAR"}]
}
},
"tableDataProfiles": [{"properties": {"dataModifiedTime": "2020-05-12 17:26:37.706521", "schemaModifiedTime": "2020-05-12 17:26:37.706537", "fromSourceSystem": false, "key": "tableDataProfiles"}}],
"columnsDataProfiles": [{"properties": {"columns": [{"columnName": "com.xx.xx.claim ", "type": " VARCHAR"}, {"columnName": "com.xx.xx.requirement ", "type": " VARCHAR"},],
"tags": [{"properties": {"tag": "uploadedByScript", "key": "tag", "fromSourceSystem": false}}],
"experts": [{"properties": {"expert": {"upn": "Berkan#xx.de"}, "fromSourceSystem": false, "key": "expert"}}],
"friendlyName": {"properties": {"friendlyName": "Requirements", "fromSourceSystem": false}}
}
I have cut the irrelevant parts of the JSON to make it readable. Notice that the "friendlyName" annotation is under "annotations", as described in the sample code. Can someone point out what is wrong with my JSON?
Apparently the syntax was correct. I realized that some of my input strings were either null or not trimmed; that was the problem. I can confirm the syntax above is correct.
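For reference, a minimal payload with just the friendlyName annotation, using the same shape as above with trimmed, non-null values, would look like this:
"annotations": {
    "friendlyName": {
        "properties": {
            "friendlyName": "Requirements",
            "fromSourceSystem": false
        }
    }
}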

How to insert data to bigquery table with custom fields with NodeJS?

I'm using the npm BigQuery module for inserting data into BigQuery. I have a custom field, say params, which is of type RECORD and accepts any int, float, or string value as a key-value pair. How can I insert into such fields?
I looked into this, but could not find anything useful:
[https://cloud.google.com/nodejs/docs/reference/bigquery/1.3.x/Table#insert]
If I understand correctly, you are asking for a map whose values can be of any type, which is not supported in BigQuery.
You can instead model a map whose values carry type information, using a repeated RECORD like the schema below. Your insert code then needs to pick the correct *_value field to set; see the example row after the schema.
{
    "name": "map_field",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
        {
            "name": "key",
            "type": "STRING"
        },
        {
            "name": "int_value",
            "type": "INTEGER"
        },
        {
            "name": "string_value",
            "type": "STRING"
        },
        {
            "name": "float_value",
            "type": "FLOAT"
        }
    ]
}
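For example, a row for this schema could carry one entry per typed value, with only the matching *_value field set (the keys and values here are illustrative, not taken from the question):
{
    "map_field": [
        { "key": "retries", "int_value": 3 },
        { "key": "source", "string_value": "mobile" },
        { "key": "score", "float_value": 0.87 }
    ]
}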

Custom date in azure blob folder path

I have looked at some posts and documentation on how to specify custom folder paths while creating an Azure blob (using Azure Data Factory).
Official documentation:
https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-azure-blob-connector#using-partitionedBy-property
Forums posts:
https://dba.stackexchange.com/questions/180487/datafactory-tutorial-blob-does-not-exist
I can successfully write into date-indexed folders; what I am not able to do is write into incremented/decremented date folders.
I tried using $$Text.Format (like below), but it gives a compile error --> Text.Format is not a valid blob path.
"folderPath": "$$Text.Format('MyRoot/{0:yyyy/MM/dd}/', Date.AddDays(SliceEnd,-2))",
I tried using the PartitionedBy section (like below) but it too gives a compile error --> Only SliceStart and SliceEnd are valid options for "date"
{
    "name": "MyBlob",
    "properties": {
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "MyLinkedService",
        "typeProperties": {
            "fileName": "MyTsv.tsv",
            "folderPath": "MyRoot/{Year}/{Month}/{Day}/",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "\t",
                "nullValue": ""
            },
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "Date.AddDays(SliceEnd,-2)",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "Date.AddDays(SliceEnd,-2)",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "Date.AddDays(SliceEnd,-2)",
                        "format": "dd"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": false,
        "policy": {}
    }
}
Any pointers are appreciated!
EDIT in response to Adam:
I also tried putting the folder structure directly into fileName, as Adam suggested and as described in the forum post below:
Windows Azure: How to create sub directory in a blob container
I used it as in the sample below.
"typeProperties": {
"fileName": "$$Text.Format('{0:yyyy/MM/dd}/MyBlob.tsv', Date.AddDays(SliceEnd,-2))",
"folderPath": "MyRoot/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t",
"nullValue": ""
},
It gives no compile error and no error during deployment, but it throws an error during execution!
Runtime Error is ---> Error in Activity: ScopeJobManager:PrepareScopeScript, Unsupported unstructured stream format '.adddays(sliceend,-2))', can't convert to unstructured stream.
I think the problem is that fileName can be used to create folders, but only with static folder names, not dynamic ones.
You should create a blob using the following convention: "foldername/myfile.txt", so you can also append additional blobs under that folder name. I'd recommend checking this thread: Windows Azure: How to create sub directory in a blob container. It may help you resolve this case.
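Blob storage has no real directories; a "/" inside the blob name acts as a virtual folder. A minimal sketch of that convention applied to the dataset above (the date segment is static and purely illustrative; it does not by itself solve the dynamic-date requirement):
"typeProperties": {
    "fileName": "2018/01/15/MyTsv.tsv",
    "folderPath": "MyRoot/"
}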

NULLS in File output are \N and I want them to be empty

I have a Data Factory pipeline that reads from a table and stores the output as a CSV in Blob Storage.
I have noticed that instead of leaving a NULL field blank, it inserts the NULL marker \N. The external system that is ingesting this can't handle \N.
Is there any way in my dataset to say "leave nulls blank"?
Below are my dataset properties:
"typeProperties": {
"fileName": "MasterFile-{fileDateNameVariable}.csv",
"folderPath": "master-file-landing",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"firstRowAsHeader": true
},
"partitionedBy": [
{
"name": "fileDateNameVariable",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyyMMdd"
}
}
]
},
Thanks in advance.
You could set the Null value to "" when you set your dataset. Please refer to my test.
(The original answer included screenshots of the source table data, the output dataset with the null value set to "", and the generated CSV file.)
Hope it helps you.
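In terms of the dataset JSON above, that means adding a nullValue entry to the TextFormat section, a minimal sketch based on the format properties in the question:
"format": {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "firstRowAsHeader": true,
    "nullValue": ""
}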
