Work with decimal values after avro deserialization

I take Avro bytes from Kafka and deserialize them.
But I get strange output because of a decimal value, and I cannot work with it afterwards (for example, to turn it into JSON or insert it into a DB):
import io
import avro.schema, json
from avro.io import DatumReader, BinaryDecoder

# only the needed part of schemaDict
schemaDict = {
    "name": "ApplicationEvent",
    "type": "record",
    "fields": [
        {
            "name": "desiredCreditLimit",
            "type": [
                "null",
                {
                    "type": "bytes",
                    "logicalType": "decimal",
                    "precision": 14,
                    "scale": 2
                }
            ],
            "default": None
        }
    ]
}

schema_avro = avro.schema.parse(json.dumps(schemaDict))
reader = DatumReader(schema_avro)
decoder = BinaryDecoder(io.BytesIO(data))  # data: the binary message from Kafka
event_dict = reader.read(decoder)
print(event_dict)
# {'desiredCreditLimit': Decimal('100000.00')}
print(json.dumps(event_dict))
# TypeError: Object of type Decimal is not JSON serializable
I tried to use avro_json_serializer, but got the error: "AttributeError: 'decimal.Decimal' object has no attribute 'decode'".
And because of this Decimal in the dictionary I cannot insert the values into the DB either.
I also tried the fastavro library, but I could not deserialize the message, as I understand because the serialization was not done with fastavro.
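For reference, one way past the TypeError above is to give json.dumps a converter for Decimal values. This is a minimal sketch, not the only option; whether str or float is appropriate depends on what consumes the JSON downstream:

import json
from decimal import Decimal

event_dict = {'desiredCreditLimit': Decimal('100000.00')}  # shape returned by DatumReader above

# default= is called for objects json cannot serialize natively;
# str() keeps the full precision of the Decimal.
print(json.dumps(event_dict, default=str))
# {"desiredCreditLimit": "100000.00"}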

Related

Ingest base64 encoded avro messages in druid

I want to ingest base64-encoded Avro messages in Druid. I am getting the following error:
Avro's unnecessary EOFException, detail: https://issues.apache.org/jira/browse/AVRO-813
Going through the code (line 88) https://github.com/apache/druid/blob/master/extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/InlineSchemaAvroBytesDecoder.java, it does not seem to decode the messages with a base64 decoder. Am I missing something? How can we configure Druid to parse base64-encoded Avro messages?
Spec used:
"inputFormat": {
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_inline",
    "schema": {
      "namespace": "org.apache.druid.data",
      "name": "User",
      "type": "record",
      "fields": [
        {
          "name": "id",
          "type": "string"
        },
        {
          "name": "price",
          "type": "int"
        }
      ]
    }
  },
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "someRecord_subInt",
        "expr": "$.someRecord.subInt"
      }
    ]
  },
  "binaryAsString": false
}
Thanks:)
I haven't used Avro ingestion, but from the Apache Druid docs here it seems like you need to set "binaryAsString" to true.
The extension returns bytes and fixed Avro types as base64 encoded strings by default. To decode these types as UTF-8 strings, enable the binaryAsString option on the Avro parser.
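If that is the case, the only change to the inputFormat posted above should be flipping that flag; the elided parts stay exactly as in the question:

"inputFormat": {
  "type": "avro_stream",
  "avroBytesDecoder": { ... },
  "flattenSpec": { ... },
  "binaryAsString": true
}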

Json Schema minimum validator but for string

using:
from jsonschema import validate
import jsonschema
I am trying to validate a minimum of 4096 with no upper bound on a string value using JSON Schema. I see regex pattern matching may be an option, but I am unsure how to do this with no upper bound.
json_data = {"value": "4096"}

# what I would like to do
json_schema = {"type": "string", "minimum": 4096}
try:
    validate(json_data, schema=json_schema)
    return True
except jsonschema.ValidationError:
    return False
Really appreciate any input. Please comment if other info is needed. Thank you.
First, you should enter your value as an int instead of a string.
Then you check the schema of your whole JSON object, of which value is a key/property:
from jsonschema import validate
import jsonschema

json_data = {"value": 4096}

json_schema = {
    "type": "object",
    "properties": {
        "value": {
            "type": "number",
            "minimum": 4096
        }
    }
}

try:
    validate(json_data, schema=json_schema)
    print(True)
except jsonschema.ValidationError:
    print(False)
Try the above code with values < 4096 and >= 4096
EDIT: if changing the type of value is not an option, you can still use regex (in my opinion far uglier):
json_schema = {
    "type": "object",
    "properties": {
        "value": {
            "type": "string",
            "pattern": "^(\\d{5,}|[5-9]\\d{3}|4[1-9]\\d\\d|409[6-9])$"
        }
    }
}
This supposes your string has no leading zeroes and no separators (such as , or .). The alternation is anchored as a whole, and we match the following (a quick test of the pattern follows the list below):
\d{5,}: 5 or more digits (>= 10000)
[5-9]\d{3}: 5 to 9 followed by 3 digits (5000-9999)
4[1-9]\d\d: 4100 to 4999
409[6-9]: 4096 to 4099
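A quick way to sanity-check the pattern-based schema; the variable name string_schema and the sample values below are just for illustration:

from jsonschema import validate
import jsonschema

string_schema = {
    "type": "object",
    "properties": {
        "value": {
            "type": "string",
            "pattern": "^(\\d{5,}|[5-9]\\d{3}|4[1-9]\\d\\d|409[6-9])$"
        }
    }
}

# Everything >= 4096 should validate, everything below should not.
for v in ["123", "4095", "4096", "4099", "5000", "10000"]:
    try:
        validate({"value": v}, schema=string_schema)
        print(v, True)
    except jsonschema.ValidationError:
        print(v, False)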

python3 Unicode to chinese

I have a coding problem: I have a JSON file, and I need to convert the content field to traditional Chinese, which may contain emoji and the like. I hope to do it with python3. The JSON file example is as follows:
"messages": [
  {
    "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
    "timestamp_ms": 1610288228221,
    "content": "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2",
    "type": "Generic",
    "is_unsent": false
  },
  {
    "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
    "timestamp_ms": 1610288227699,
    "share": {
      "link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
      "share_text": "//\nMemorabilia\u00f0\u009f\u0087\u00b0\u00f0\u009f\u0087\u00b7\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a6\n\u00f0\u009f\u0098\u0086\u00f0\u009f\u00a4\u00a3\u00f0\u009f\u00a4\u00ac\u00f0\u009f\u0098\u008c\u00f0\u009f\u0098\u00b4\u00f0\u009f\u00a4\u00a9\u00f0\u009f\u00a4\u0093\n#191214\n#191221",
      "original_content_owner": "_ki.zeng"
    },
    "type": "Share",
    "is_unsent": false
  },
  {
    "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
    "timestamp_ms": 1607742844729,
    "content": "\u00e6\u0089\u00ae\u00e7\u009e\u0093\u00e5\u00b0\u00b1\u00e5\u00a5\u00bd",
    "type": "Generic",
    "is_unsent": false
  }
]
The data posted isn't valid JSON (at least missing a set of outer curly braces) and was encoded incorrectly. UTF-8 bytes were written as Unicode code points. Ideally correct the original code, but the following will fix the mess you have now, if "input.json" is the original data with the outer curly braces added:
import json
# Read the raw bytes of the data file
with open('input.json','rb') as f:
    raw = f.read()
# There are some newline escapes that shouldn't be converted,
# so double-escape them so the result leaves them escaped.
raw = raw.replace(rb'\n',rb'\\n')
# Convert all the escape codes to Unicode characters
raw = raw.decode('unicode_escape')
# The characters are really UTF-8 byte values.
# The "latin1" codec translates Unicode code points 1:1 to byte values,
# resulting in a byte string again.
raw = raw.encode('latin1')
# Decode correctly as UTF-8
raw = raw.decode('utf8')
# Now that the JSON is fixed, load it into a Python object
data = json.loads(raw)
# Re-write the JSON correctly.
with open('output.json','w',encoding='utf8') as f:
    json.dump(data,f,ensure_ascii=False,indent=2)
Result:
{
"messages": [
{
"sender_name": "#20KAREL’s 🎈💛",
"timestamp_ms": 1610288228221,
"content": "我隔離",
"type": "Generic",
"is_unsent": false
},
{
"sender_name": "#20KAREL’s 🎈💛",
"timestamp_ms": 1610288227699,
"share": {
"link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
"share_text": "//\nMemorabilia🇰🇷👩‍👩‍👧‍👧👨‍👨‍👦\n😆🤣🤬😌😴🤩🤓\n#191214\n#191221",
"original_content_owner": "_ki.zeng"
},
"type": "Share",
"is_unsent": false
},
{
"sender_name": "#20KAREL’s 🎈💛",
"timestamp_ms": 1607742844729,
"content": "扮瞓就好",
"type": "Generic",
"is_unsent": false
}
]
}
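To see the core of the round trip in isolation: once the escape codes have been decoded to characters, those characters are really UTF-8 bytes. A Python string literal with the same escapes shows the effect (purely illustrative; the real fix should go through the whole file as shown above):

# The escaped "characters" are UTF-8 byte values in disguise.
s = '\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2'  # the first "content" field
print(s.encode('latin1').decode('utf8'))  # 我隔離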

How to convert string to json in Angular?

Starting from the following kind of string:
const json = '{"list":"[{"additionalInformation": {"source": "5f645d7d94-c6ktd"}, "alarmName": "data", "description": "Validation Error. Fetching info has been skipped.", "eventTime": "2020-01-27T14:42:44.143200 UTC", "expires": 2784, "faultyResource": "Data", "name": "prisco", "severity": "Major"}]"}'
How can I manage this as JSON? The following approach doesn't work:
const obj = JSON.parse(json);
It gives an unexpected result.
How can I parse it correctly?
In conclusion, I need to extract the part corresponding to the "list" item and then parse the JSON that it contains.
Your JSON is invalid. The following is the valid version of your JSON:
const json = {
  "list": [
    {
      "additionalInformation": {
        "source": "5f645d7d94-c6ktd"
      },
      "alarmName": "data",
      "description": "Validation Error. Fetching info has been skipped.",
      "eventTime": "2020-01-27T14:42:44.143200 UTC",
      "expires": 2784,
      "faultyResource": "Data",
      "name": "prisco",
      "severity": "Major"
    }
  ]
}
The above is already a JSON object, and parsing it as JSON again throws an error.
JSON.parse() parses a string and turns it into a JavaScript object. The string must be in valid JSON format or it will throw an error.
Update:
Create a function to clean your string and prepare it for JSON.parse():
cleanString(str) {
    str = str.replace('"[', '[');
    str = str.replace(']"', ']');
    return str;
}
And use it like:
json = this.cleanString(json);
console.log(JSON.parse(json));
Demo:
let json = '{"list":"[{"additionalInformation": {"source": "5f645d7d94-c6ktd"}, "alarmName": "data", "description": "Validation Error. Fetching info has been skipped.", "eventTime": "2020-01-27T14:42:44.143200 UTC", "expires": 2784, "faultyResource": "Data", "name": "prisco", "severity": "Major"}]"}';
json = cleanString(json);
console.log(JSON.parse(json));
function cleanString(str) {
    str = str.replace('"[', '[');
    str = str.replace(']"', ']');
    return str;
}
Remove the double quotes from around the array brackets to make the json valid:
const json = '{"list":[{"additionalInformation": {"source": "5f645d7d94-c6ktd"}, "alarmName": "data", "description": "Validation Error. Fetching info has been skipped.", "eventTime": "2020-01-27T14:42:44.143200 UTC", "expires": 2784, "faultyResource": "Data", "name": "prisco", "severity": "Major"}]}'

Azure ADF sliceIdentifierColumnName is not populating correctly

I've set up an ADF pipeline using sliceIdentifierColumnName, which worked well as it populated the field with a GUID as expected. Recently, however, this field stopped being populated: the refresh would work, but the sliceIdentifierColumnName field would have a value of null, or occasionally the load would fail as it attempted to populate this field with a value of 1, which causes the slice load to fail.
This change occurred at a point in time: before it, the load worked perfectly; after it, it repeatedly failed to populate the field correctly. I'm sure no changes were made to the pipeline that could have caused this to suddenly fail. Any pointers on where I should be looking?
Here is an extract of the pipeline source; I'm reading from a table in Amazon Redshift and writing to an Azure SQL table.
"activities": [
  {
    "type": "Copy",
    "typeProperties": {
      "source": {
        "type": "RelationalSource",
        "query": "$$Text.Format('select * from mytable where eventtime >= \\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and eventtime < \\'{1:yyyy-MM-ddTHH:mm:ssZ}\\' ' , SliceStart, SliceEnd)"
      },
      "sink": {
        "type": "SqlSink",
        "sliceIdentifierColumnName": "ColumnForADFuseOnly",
        "writeBatchSize": 0,
        "writeBatchTimeout": "00:00:00"
      }
    },
    "inputs": [
      {
        "name": "AmazonRedshiftSomeName"
      }
    ],
    "outputs": [
      {
        "name": "AzureSQLDatasetSomeName"
      }
    ],
    "policy": {
      "timeout": "1.00:00:00",
      "concurrency": 10,
      "style": "StartOfInterval",
      "longRetry": 0,
      "longRetryInterval": "00:00:00"
    },
    "scheduler": {
      "frequency": "Hour",
      "interval": 2
    },
    "name": "Activity-somename2Hour"
  }
],
Also, here is the error output text
Copy activity encountered a user error at Sink:.database.windows.net side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'ColumnForADFuseOnly' contains an invalid value '1'.,Source=Microsoft.DataTransfer.Common,''Type=System.ArgumentException,Message=Type of value has a mismatch with column typeCouldn't store <1> in ColumnForADFuseOnly Column.
Expected type is Byte[].,Source=System.Data,''Type=System.ArgumentException,Message=Type of value has a mismatch with column type,Source=System.Data,'.
Here is part of the source dataset; it's a table with all datatypes as Strings.
{
  "name": "AmazonRedshiftsomename_2hourly",
  "properties": {
    "structure": [
      {
        "name": "eventid",
        "type": "String"
      },
      {
        "name": "visitorid",
        "type": "String"
      },
      {
        "name": "eventtime",
        "type": "Datetime"
      }
    ]
  }
}
Finally, the target table is identical to the source table, mapping each column name to its counterpart in Azure, with the exception of the additional column in Azure named
[ColumnForADFuseOnly] binary NULL,
It is this column which is now either being populated with NULLs or 1.
thanks,
You need to define [ColumnForADFuseOnly] as binary(32); binary with no length modifier defaults to a length of 1 and is thus truncating your sliceIdentifier...
When n is not specified in a data definition or variable declaration statement, the default length is 1. When n is not specified with the CAST function, the default length is 30. See here
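Assuming the target table can be altered, the column definition quoted in the question would then become:

[ColumnForADFuseOnly] binary(32) NULL,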

Resources