I have a coding problem. I have a JSON file and I need to convert the content field back to readable traditional Chinese; it may also contain emoji and the like. I would like to do this with Python 3. An example of the JSON file is as follows:
"messages": [
{
"sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
"timestamp_ms": 1610288228221,
"content": "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2",
"type": "Generic",
"is_unsent": false
},
{
"sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
"timestamp_ms": 1610288227699,
"share": {
"link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
"share_text": "//\nMemorabilia\u00f0\u009f\u0087\u00b0\u00f0\u009f\u0087\u00b7\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a6\n\u00f0\u009f\u0098\u0086\u00f0\u009f\u00a4\u00a3\u00f0\u009f\u00a4\u00ac\u00f0\u009f\u0098\u008c\u00f0\u009f\u0098\u00b4\u00f0\u009f\u00a4\u00a9\u00f0\u009f\u00a4\u0093\n#191214\n#191221",
"original_content_owner": "_ki.zeng"
},
"type": "Share",
"is_unsent": false
},
{
"sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
"timestamp_ms": 1607742844729,
"content": "\u00e6\u0089\u00ae\u00e7\u009e\u0093\u00e5\u00b0\u00b1\u00e5\u00a5\u00bd",
"type": "Generic",
"is_unsent": false
}]
The data you posted isn't valid JSON (at the very least it is missing a set of outer curly braces), and it was encoded incorrectly: UTF-8 bytes were written out as Unicode code points. Ideally you would fix whatever code produced the file, but the following will repair the data you have now, assuming "input.json" is the original data with the outer curly braces added:
import json
# Read the raw bytes of the data file
with open('input.json','rb') as f:
    raw = f.read()
# There are some newline escapes that shouldn't be converted,
# so double-escape them so the result leaves them escaped.
raw = raw.replace(rb'\n',rb'\\n')
# Convert all the escape codes to Unicode characters
raw = raw.decode('unicode_escape')
# The characters are really UTF-8 byte values.
# The "latin1" codec translates Unicode code points 1:1 to byte values,
# resulting in a byte string again.
raw = raw.encode('latin1')
# Decode correctly as UTF-8
raw = raw.decode('utf8')
# Now that the JSON is fixed, load it into a Python object
data = json.loads(raw)
# Re-write the JSON correctly.
with open('output.json','w',encoding='utf8') as f:
    json.dump(data,f,ensure_ascii=False,indent=2)
Result:
{
"messages": [
{
"sender_name": "#20KAREL’s 🎈💛",
"timestamp_ms": 1610288228221,
"content": "我隔離",
"type": "Generic",
"is_unsent": false
},
{
"sender_name": "#20KAREL’s 🎈💛",
"timestamp_ms": 1610288227699,
"share": {
"link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
"share_text": "//\nMemorabilia🇰🇷👩👩👧👧👨👨👦\n😆🤣🤬😌😴🤩🤓\n#191214\n#191221",
"original_content_owner": "_ki.zeng"
},
"type": "Share",
"is_unsent": false
},
{
"sender_name": "#20KAREL’s 🎈💛",
"timestamp_ms": 1607742844729,
"content": "扮瞓就好",
"type": "Generic",
"is_unsent": false
}
]
}
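As a quick aside, the core of that fix (treat the mis-decoded code points as latin1 bytes, then decode those bytes as UTF-8) can be checked on a single value, for example the first "content" field from the question:

s = "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2"  # Python decodes the \uXXXX escapes at parse time
print(s.encode('latin1').decode('utf8'))  # prints: 我隔離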
I want to ingest base64-encoded Avro messages in Druid. I am getting the following error:
Avro's unnecessary EOFException, detail: https://issues.apache.org/jira/browse/AVRO-813
Going through the code (line 88 of https://github.com/apache/druid/blob/master/extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/InlineSchemaAvroBytesDecoder.java), it does not seem to decode the messages using a base64 decoder. Am I missing something? How can we configure Druid to parse base64-encoded Avro messages?
Spec used -
"inputFormat": {
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_inline",
    "schema": {
      "namespace": "org.apache.druid.data",
      "name": "User",
      "type": "record",
      "fields": [
        {
          "name": "id",
          "type": "string"
        },
        {
          "name": "price",
          "type": "int"
        }
      ]
    }
  },
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "someRecord_subInt",
        "expr": "$.someRecord.subInt"
      }
    ]
  },
  "binaryAsString": false
}
Thanks:)
I haven't used Avro ingestion, but from the Apache Druid docs here it seems like you need to set "binaryAsString" to true.
The extension returns bytes and fixed Avro types as base64 encoded strings by default. To decode these types as UTF-8 strings, enable the binaryAsString option on the Avro parser.
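I haven't run this myself, but based on that note the change would just be flipping the "binaryAsString": false already present at the bottom of your inputFormat to true. A sketch, written as a Python dict so it can be dumped straight into the ingestion spec (the flattenSpec is left out for brevity):

import json

# Untested sketch: the question's inputFormat with binaryAsString flipped to true.
input_format = {
    "type": "avro_stream",
    "avroBytesDecoder": {
        "type": "schema_inline",
        "schema": {
            "namespace": "org.apache.druid.data",
            "name": "User",
            "type": "record",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "price", "type": "int"},
            ],
        },
    },
    "binaryAsString": True,  # decode bytes/fixed Avro types as UTF-8 strings rather than base64
}
print(json.dumps(input_format, indent=2))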
I take Avro bytes from Kafka and deserialize them.
But I get awkward output because of a decimal value, and I cannot work with it afterwards (for example, turn it into JSON or insert it into a DB):
import avro.schema, json
from avro.io import DatumReader, BinaryDecoder

# only the needed part of schemaDict
schemaDict = {
    "name": "ApplicationEvent",
    "type": "record",
    "fields": [
        {
            "name": "desiredCreditLimit",
            "type": [
                "null",
                {
                    "type": "bytes",
                    "logicalType": "decimal",
                    "precision": 14,
                    "scale": 2
                }
            ],
            "default": None
        }
    ]
}

schema_avro = avro.schema.parse(json.dumps(schemaDict))
reader = DatumReader(schema_avro)
decoder = BinaryDecoder(data)  # data - binary data from Kafka
event_dict = reader.read(decoder)
print(event_dict)
# {'desiredCreditLimit': Decimal('100000.00')}
print(json.dumps(event_dict))
# TypeError: Object of type Decimal is not JSON serializable
I tried to use avro_json_serializer, but got the error "AttributeError: 'decimal.Decimal' object has no attribute 'decode'".
Because of this Decimal in the dictionary I cannot insert the values into the DB either.
I also tried the fastavro library, but I could not deserialize the message, as I understand it because the serialization was not done with fastavro.
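Not part of the original post, but one common way around the final TypeError is to tell json.dumps how to handle Decimal through its default hook; a minimal sketch:

import json
from decimal import Decimal

event_dict = {"desiredCreditLimit": Decimal("100000.00")}  # same shape as the printed result above

# default= is only consulted for objects json cannot serialize natively;
# str() preserves the exact scale, float() would also work but can lose precision.
print(json.dumps(event_dict, default=str))
# {"desiredCreditLimit": "100000.00"}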
I am correlating a JSON response, but in the captured JSON I need to replace a value with some different text.
For example, the captured response is as below and is saved to the "corr_json" variable:
{
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "JSON:API paints my bikeshed!",
      "body": "The shortest article. Ever.",
      "created": "2015-05-22T14:56:29.000Z",
      "updated": "2015-05-22T14:56:28.000Z"
    },
    "relationships": {
      "author": {
        "data": {"id": "42", "type": "people"}
      }
    }
From this I need to replace the title string
"JSON:API paints my bikeshed!"
with the text "Performance Testing" and pass it to the next request, so the JSON to be passed is as below:
{
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "Performance Testing",
      "body": "The shortest article. Ever.",
      "created": "2015-05-22T14:56:29.000Z",
      "updated": "2015-05-22T14:56:28.000Z"
    },
    "relationships": {
      "author": {
        "data": {"id": "42", "type": "people"}
      }
    }
Is there a way to do this in LoadRunner?
Check the lr_json_replace API provided by LoadRunner.
Depending upon the protocol in use, the language of LoadRunner is either C, C#, C++, VB, or JavaScript. Use the string processing capabilities of your language, combined with your programming skills, to reformat the text in question to include your tag.
Hint: you might consider two correlations on the returned data, one which begins with the first curly brace and ends with '"title":', and a second which begins after '"title":' and ends with the '\t\t}\r\t}' structure (if I am reading the text right). Then you could simply sprintf() in C to pack a few strings (corr1 + your tag + corr2 + end structure) together to hit your mark.
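To make the two-correlation idea concrete, here is a conceptual sketch in Python rather than LoadRunner C (the variable names are invented; in the real script you would capture corr1 and corr2 with your correlation functions and pack them together with sprintf()):

# Conceptual only: the two captures are plain strings, so rebuilding the
# payload is simple concatenation/formatting.
corr1 = '{"data": [{"type": "articles", "id": "1", "attributes": {"title": '
corr2 = ', "body": "The shortest article. Ever."}}]}'
new_payload = corr1 + '"Performance Testing"' + corr2
print(new_payload)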
I have a pipeline that retrieves an FTP-hosted CSV file. It is comma delimited with double quote text qualifiers. The issue arises where a string is encapsulated in double quotes, but the string itself contains double quotes.
String example: "Spring Sale" this year.
How it looks in the csv (with a null column before and after it):
"","""Spring Sale"" this year",""
SSIS handles this fine, but Data Factory wants to turn it into an extra column that isn't separated by a comma. When I removed the extra quotes on this line it worked fine.
Is there a way around this besides altering the source?
I got this to work using the escape character set as the quote character (") with the Azure Data Factory Copy Task.
This was based on a file as per your spec:
"","""Spring Sale"" this year",""
and it also worked as an insert into an Azure SQL Database table. The sample dataset JSON:
{
  "name": "DelimitedText1",
  "properties": {
    "linkedServiceName": {
      "referenceName": "linkedService2",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "fileName": "quotes.txt",
        "container": "someContainer"
      },
      "columnDelimiter": ",",
      "escapeChar": "\"",
      "quoteChar": "\""
    },
    "schema": [
      {
        "name": "Prop_0",
        "type": "String"
      },
      {
        "name": "Prop_1",
        "type": "String"
      },
      {
        "name": "Prop_2",
        "type": "String"
      }
    ]
  }
}
Maybe the example file is too simple, but it did work for me in this configuration.
Alternatively, just use SSIS and host it in Data Factory.
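As an aside, the doubled quotes in that sample line are standard CSV escaping, so the intended parse is easy to confirm locally, for example with Python's csv module:

import csv
import io

line = '"","""Spring Sale"" this year",""'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['', '"Spring Sale" this year', ''] - the doubled quotes collapse back to one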
I have a Data Factory pipeline that reads from a table and stores the output as a CSV in Blob Storage.
I have noticed that instead of leaving a NULL field blank it inserts the null marker \N. The external system that is ingesting this can't handle \N.
Is there any way in my dataset to say that nulls should be left blank?
Below are my dataset properties:
"typeProperties": {
  "fileName": "MasterFile-{fileDateNameVariable}.csv",
  "folderPath": "master-file-landing",
  "format": {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "firstRowAsHeader": true
  },
  "partitionedBy": [
    {
      "name": "fileDateNameVariable",
      "value": {
        "type": "DateTime",
        "date": "SliceStart",
        "format": "yyyyMMdd"
      }
    }
  ]
},
Thanks in advance.
You could set the Null value to "" when you set up your dataset. Please refer to my test.
(Screenshots: table data, output dataset settings, generated csv file.)
Hope it helps you.
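If you would rather make the change in the dataset JSON than through the UI, TextFormat also has (as far as I recall) a nullValue property that serves the same purpose; an untested sketch of the question's format block with it set to an empty string:

import json

# Untested sketch: the question's "format" block with an explicit empty nullValue,
# so NULL columns are written out as blanks instead of \N.
text_format = {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "firstRowAsHeader": True,
    "nullValue": "",  # assumption: property name as I recall it from the TextFormat dataset docs
}
print(json.dumps(text_format, indent=2))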