I have a DataFrame that I need to write to Kafka.
I have the Avro schema defined, similar to this:
{
  "namespace": "my.name.space",
  "type": "record",
  "name": "MyClass",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "parameter1", "type": "string"},
    {"name": "parameter2", "type": "string"},
    ...
  ]
}
and it is auto-generated into a Java bean, something similar to this:
public class MyClass extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  String id;
  String parameter1;
  String parameter2;
  ...
}
I found that, to write in Avro format, there is only the to_avro method, which takes a column.
So my question is: is there a way to force writing to Kafka in Avro format with this defined schema?
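For context, this is roughly the usage I found: pack the columns into a single struct column, serialize it with to_avro, and write that as the Kafka value. A sketch only; the DataFrame df, the topic name, and the broker address are placeholders, and the spark-avro and spark-sql-kafka packages are assumed to be available:

from pyspark.sql.functions import col, struct
from pyspark.sql.avro.functions import to_avro

# Pack the row into one struct column and serialize it as Avro bytes.
avro_df = df.select(
    to_avro(struct(col("id"), col("parameter1"), col("parameter2"))).alias("value")
)

avro_df.write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "my-topic") \
    .save()

The catch is that this produces an Avro payload whose schema is derived from the DataFrame's own column types, not from the defined MyClass schema, which is exactly what I want to control.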
You can only do this when using the Confluent Schema Registry. See https://aseigneurin.github.io/2018/08/02/kafka-tutorial-4-avro-and-schema-registry.html
I want to ingest base64-encoded Avro messages in Druid. I am getting the following error:
Avro's unnecessary EOFException, detail: https://issues.apache.org/jira/browse/AVRO-813
Going through the code (line 88 of https://github.com/apache/druid/blob/master/extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/InlineSchemaAvroBytesDecoder.java), it does not seem to decode the messages with a base64 decoder. Am I missing something? How can we configure Druid to parse base64-encoded Avro messages?
Spec used:
"inputFormat": {
"type": "avro_stream",
"avroBytesDecoder": {
"type": "schema_inline",
"schema": {
"namespace": "org.apache.druid.data",
"name": "User",
"type": "record",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "price",
"type": "int"
}
]
}
},
"flattenSpec": {
"useFieldDiscovery": true,
"fields": [
{
"type": "path",
"name": "someRecord_subInt",
"expr": "$.someRecord.subInt"
}
]
},
"binaryAsString": false
}
Thanks:)
I haven't used Avro ingestion, but from the Apache Druid docs it seems like you need to set "binaryAsString" to true (your spec above currently sets it to false):
The extension returns bytes and fixed Avro types as base64 encoded strings by default. To decode these types as UTF-8 strings, enable the binaryAsString option on the Avro parser.
We are streaming data from a Kafka-enabled Event Hub. The records may have a nested structure. The schema is inferred dynamically from the data, and the Delta table is created with the schema of the first incoming batch of data.
Note: the data read from the Kafka topic is a whole JSON string. Hence:
When we apply a schema and convert to a DataFrame, we lose the values of fields with a mismatched datatype, as well as newly added fields.
When we do spark.read.json, all the field values are converted to String.
We encounter situations where the source data has schema changes. Some of the scenarios we faced are:
1. The datatype changes at the parent level
2. The datatype changes at a nested level
3. There are duplicate keys in a different case
4. New fields are added
Sample source data with the actual schema:
{
  "Id": "101",
  "Name": "John",
  "Department": {
    "Id": "Dept101",
    "Name": "Technology",
    "EmpId": "10001"
  },
  "Experience": 2,
  "Organization": [
    {
      "Id": "Org101",
      "Name": "Google"
    },
    {
      "Id": "Org102",
      "Name": "Microsoft"
    }
  ]
}
Sample source data illustrating the 4 points mentioned above:
{
  "Id": "102",
  "name": "Rambo",          --- Point 3
  "Department": {
    "Id": "Dept101",
    "Name": "Technology",
    "EmpId": 10001          --- Point 2
  },
  "Experience": "2",        --- Point 1
  "Organization": [
    {
      "Id": "Org101",
      "Name": "Google",
      "Experience": "2"     --- Point 4
    },
    {
      "Id": "Org102",
      "Name": "Microsoft",
      "Experience": "2"
    }
  ]
}
We need a solution to overcome the above issues. Though it is difficult to merge the new schema into the existing Delta table, we should at least be able to separate the records with schema changes without losing the original data.
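For illustration, a rough sketch of one way to separate such records without losing the raw payload. This assumes the raw JSON arrives in a string column named value, expected_schema is the schema captured from the first batch, and the Delta paths are placeholders:

from pyspark.sql.functions import col, from_json

# Parse the raw JSON with the expected schema while keeping the original string.
parsed = df.withColumn("parsed", from_json(col("value"), expected_schema))

# Rows where parsing failed entirely, or where a field we rely on came back null,
# are treated as schema-drift candidates; their untouched raw JSON is preserved.
drifted = parsed.where(col("parsed").isNull() | col("parsed.Id").isNull())
clean = parsed.where(col("parsed").isNotNull() & col("parsed.Id").isNotNull())

clean.select("parsed.*").write.format("delta").mode("append").save("/delta/employees")
drifted.select("value").write.format("delta").mode("append").save("/delta/employees_quarantine")

This does not by itself handle the duplicate-key or new-field cases, but it keeps the original records available for reprocessing once the schema change is understood.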
I'm using the npm BigQuery module to insert data into BigQuery. I have a custom field, say params, which is of type RECORD and accepts any int, float, or string value as a key-value pair. How can I insert into such fields?
I looked into this, but could not find anything useful: https://cloud.google.com/nodejs/docs/reference/bigquery/1.3.x/Table#insert
If I understand correctly, you are asking for a map with a value of ANY TYPE, which is not supported in BigQuery.
You can model a map that carries value-type information with a record like the schema below. Your insert code then needs to pick the correct *_value field to set for each entry, e.g. a row such as {"map_field": [{"key": "a", "int_value": 5}, {"key": "b", "string_value": "x"}]}.
{
  "name": "map_field",
  "type": "RECORD",
  "mode": "REPEATED",
  "fields": [
    {
      "name": "key",
      "type": "STRING"
    },
    {
      "name": "int_value",
      "type": "INTEGER"
    },
    {
      "name": "string_value",
      "type": "STRING"
    },
    {
      "name": "float_value",
      "type": "FLOAT"
    }
  ]
}
Spark Dataframe Schema:
StructType([
    StructField("a", StringType(), False),
    StructField("b", StringType(), True),
    StructField("c", BinaryType(), False),
    StructField("d", ArrayType(StringType(), False), True),
    StructField("e", TimestampType(), True)
])
When I write the DataFrame to Parquet and load it into BigQuery, it interprets the schema differently. It is a simple load from JSON and a write to Parquet using a Spark DataFrame.
BigQuery Schema:
[
  {
    "type": "STRING",
    "name": "a",
    "mode": "REQUIRED"
  },
  {
    "type": "STRING",
    "name": "b",
    "mode": "NULLABLE"
  },
  {
    "type": "BYTES",
    "name": "c",
    "mode": "REQUIRED"
  },
  {
    "fields": [
      {
        "fields": [
          {
            "type": "STRING",
            "name": "element",
            "mode": "NULLABLE"
          }
        ],
        "type": "RECORD",
        "name": "list",
        "mode": "REPEATED"
      }
    ],
    "type": "RECORD",
    "name": "d",
    "mode": "NULLABLE"
  },
  {
    "type": "TIMESTAMP",
    "name": "e",
    "mode": "NULLABLE"
  }
]
Is this something to do with the way Spark writes Parquet, or the way BigQuery reads it? Any idea how I can fix this?
This is due to the intermediate file format (parquet by default) that the spark-bigquery connector uses.
The connector first writes the data to parquet files, then loads them to BigQuery using BigQuery Insert API.
If you check the intermediate Parquet schema using parquet-tools, you will find something like this for the field d (ArrayType(StringType) in Spark):
optional group d (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
Now, if you were loading this Parquet file into BigQuery yourself, using bq load (with the --parquet_enable_list_inference flag) or the BigQuery load API directly, you would be able to tell BQ to ignore the intermediate list/element fields by enabling parquet_enable_list_inference.
Unfortunately, I don't see how to enable this option when using the spark-bigquery connector!
As a workaround, you can try to use orc as the intermediate format.
df
  .write
  .format("bigquery")
  .option("intermediateFormat", "orc")
  .save("dataset.table")  // placeholder: your target table
avro.schema.Parse works fine, and the values from the data are read correctly, but it looks like the data and the schema don't line up.
I get the error:
The datum {datafromfile{whole bunch fields from data}, DATE = value, file_name = value}
is not an example of the schema from the .avsc file.
My schema is a nested Avro schema. The last field, of type 'info', contains some fields of type string, but not all of the fields present in the incoming data.
Is this mismatch between fields causing the issue?
Below is my .avsc file.
{
  "type": "record",
  "name": "header_data",
  "doc": "A list of strings.",
  "fields": [
    {"name": "DATE", "type": "string"},
    {"name": "file_name", "type": "string"},
    {"name": "info", "type": "info"}
  ]
}
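For reference, a minimal sketch of the write path that produces this kind of error with the Python avro package. File names and field values are placeholders, and the real .avsc is assumed to define the nested 'info' record type:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Parsing the nested schema succeeds, as described above.
schema = avro.schema.Parse(open("header_data.avsc").read())

# Appending a record raises "The datum ... is not an example of the schema"
# when the dict does not contain every field the schema declares, with
# matching types, including all of the fields of the nested 'info' record.
writer = DataFileWriter(open("out.avro", "wb"), DatumWriter(), schema)
writer.append({
    "DATE": "2021-01-01",
    "file_name": "input.txt",
    "info": {"some_field": "value"},  # hypothetical nested record content
})
writer.close()

The message comes from the datum-vs-schema validation in the DatumWriter, so any declared field that is missing or has a mismatched type in the appended dict will trigger it.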