avro.schema.Parse works fine and the values from the data are read correctly, but it looks like the data and the schema don't match up.
I get the error:
The datum {datafromfile{whole bunch of fields from the data}, DATE = value, file_name = value}
is not an example of the schema from the .avsc file.
My schema is a nested Avro schema. The last field, of type 'info', contains some fields of type string, but not all of the fields present in the incoming data.
Is this mismatch between the fields causing the issue?
Below is my .avsc file.
{
  "type": "record",
  "name": "header_data",
  "doc": "A list of strings.",
  "fields": [
    {"name": "DATE", "type": "string"},
    {"name": "file_name", "type": "string"},
    {"name": "info", "type": "info"}
  ]
}
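For reference, a minimal sketch of how I understand the validation to work with the Python avro library. The inline definition of the info record and its field names below are made-up placeholders (my full .avsc defines info elsewhere), so this is only meant to illustrate where the AvroTypeException comes from:
import io
import avro.schema
import avro.io

# Placeholder schema: the nested "info" record is defined inline here so the
# sketch is self-contained; its field names are invented for illustration.
schema = avro.schema.Parse("""
{
  "type": "record",
  "name": "header_data",
  "fields": [
    {"name": "DATE", "type": "string"},
    {"name": "file_name", "type": "string"},
    {"name": "info", "type": {
      "type": "record",
      "name": "info",
      "fields": [
        {"name": "field_a", "type": "string"},
        {"name": "field_b", "type": "string"}
      ]
    }}
  ]
}
""")

datum = {
    "DATE": "2020-01-01",
    "file_name": "sample.dat",
    "info": {"field_a": "x", "field_b": "y"},
}

# If the datum is missing a field that the schema declares (or a value has the
# wrong type), DatumWriter raises avro.io.AvroTypeException:
# "The datum ... is not an example of the schema ..."
buf = io.BytesIO()
avro.io.DatumWriter(schema).write(datum, avro.io.BinaryEncoder(buf))
print("encoded", len(buf.getvalue()), "bytes")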
We are streaming data from a Kafka Event Hub. The records may have a nested structure. The schema is inferred dynamically from the data, and the Delta table is created with the schema of the first incoming batch of data.
Note: the data read from the Kafka topic is a whole JSON string. Hence:
When we apply a schema and convert to a DataFrame, we lose the values of fields with a mismatched datatype or of newly added fields.
When we use spark.read.json, all field values are converted to String.
We encounter situations where the source data has schema changes. Some of the scenarios we faced are:
The datatype changes at the parent level
The datatype changes at the nested level
There are duplicate keys in a different case
New fields are added
Sample source data with the actual schema:
{
  "Id": "101",
  "Name": "John",
  "Department": {
    "Id": "Dept101",
    "Name": "Technology",
    "EmpId": "10001"
  },
  "Experience": 2,
  "Organization": [
    {
      "Id": "Org101",
      "Name": "Google"
    },
    {
      "Id": "Org102",
      "Name": "Microsoft"
    }
  ]
}
Sample source data illustrating the four points mentioned above:
{
  "Id": "102",
  "name": "Rambo",               --- Point 3
  "Department": {
    "Id": "Dept101",
    "Name": "Technology",
    "EmpId": 10001               --- Point 2
  },
  "Experience": "2",             --- Point 1
  "Organization": [
    {
      "Id": "Org101",
      "Name": "Google",
      "Experience": "2"          --- Point 4
    },
    {
      "Id": "Org102",
      "Name": "Microsoft",
      "Experience": "2"
    }
  ]
}
We need a solution to overcome the above issues. Even though it is difficult to merge the new schema into the existing Delta table, we should at least be able to separate the records with schema changes without losing the original data.
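One possible direction, sketched below in PySpark. This is only a sketch, not a full solution: the expected schema is trimmed to two fields and the column names are assumptions based on the samples above. The idea is to keep the raw JSON string next to the parsed struct and flag rows where a field exists in the raw JSON but parsed to null, so records with schema changes can be separated without losing the original data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Expected schema inferred from the first batch (trimmed for brevity).
expected_schema = StructType([
    StructField("Id", StringType()),
    StructField("Experience", LongType()),
])

raw_df = spark.createDataFrame(
    [('{"Id": "101", "Experience": 2}',),      # matches the expected schema
     ('{"Id": "102", "Experience": "2"}',)],   # datatype changed at the parent level
    ["raw"],
)

parsed = raw_df.withColumn("parsed", from_json(col("raw"), expected_schema))

# A field that is present in the raw JSON but null after parsing indicates a
# datatype mismatch (or a parse failure) for that field.
flagged = parsed.withColumn(
    "schema_drift",
    get_json_object(col("raw"), "$.Experience").isNotNull()
    & col("parsed.Experience").isNull(),
)

clean = flagged.where(~col("schema_drift"))
drifted = flagged.where(col("schema_drift"))

# "drifted" still carries the untouched raw JSON, so nothing is lost; it can be
# written to a quarantine Delta table and reprocessed once the schema is merged.
clean.show(truncate=False)
drifted.show(truncate=False)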
I have a DataFrame that I need to write to Kafka.
I have the Avro schema defined, similar to this:
{
  "namespace": "my.name.space",
  "type": "record",
  "name": "MyClass",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "parameter1", "type": "string"},
    {"name": "parameter2", "type": "string"},
    ...
  ]
}
and it is auto-generated into a Java bean, something similar to this:
public class MyClass extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  String id;
  String parameter1;
  String parameter2;
  ...
}
I found that to write in Avro format there is only the to_avro method, which takes a column.
So my question is: is there a way to force writing to Kafka in Avro format with this defined schema?
You can only do this when using the Confluent Schema Registry. See https://aseigneurin.github.io/2018/08/02/kafka-tutorial-4-avro-and-schema-registry.html
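For reference, a sketch of what the plain to_avro route looks like (it needs the org.apache.spark:spark-avro package; df, the topic, and the broker address are placeholders). Note that this writes raw Avro bytes without the Confluent Schema Registry wire-format header, so registry-aware consumers will not be able to decode it:
from pyspark.sql.functions import struct
from pyspark.sql.avro.functions import to_avro

# The Avro schema from the question, trimmed to the three fields shown.
avro_schema = """
{
  "namespace": "my.name.space",
  "type": "record",
  "name": "MyClass",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "parameter1", "type": "string"},
    {"name": "parameter2", "type": "string"}
  ]
}
"""

# Pack the columns into one struct, serialize it with the declared schema,
# and write the resulting binary column as the Kafka message value.
(df.select(to_avro(struct("id", "parameter1", "parameter2"), avro_schema).alias("value"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "broker:9092")
   .option("topic", "my-topic")
   .save())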
I'm using the npm BigQuery module for inserting data into BigQuery. I have a custom field, say params, which is of type RECORD and accepts any int, float, or string value as a key-value pair. How can I insert into such fields?
I looked into this, but could not find anything useful:
https://cloud.google.com/nodejs/docs/reference/bigquery/1.3.x/Table#insert
If I understand correctly, you are asking for a map with values of any type, which is not supported in BigQuery.
You can instead model a map whose values carry type information with a record like the schema below.
Your insert code needs to pick the correct *_value field to set.
{
  "name": "map_field",
  "type": "RECORD",
  "mode": "REPEATED",
  "fields": [
    {
      "name": "key",
      "type": "STRING"
    },
    {
      "name": "int_value",
      "type": "INTEGER"
    },
    {
      "name": "string_value",
      "type": "STRING"
    },
    {
      "name": "float_value",
      "type": "FLOAT"
    }
  ]
}
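For example, a params map such as {"a": 1, "b": "x", "c": 2.5} (keys and values invented for illustration) would be inserted as one repeated record per key, with only the matching *_value field populated:
{
  "map_field": [
    {"key": "a", "int_value": 1},
    {"key": "b", "string_value": "x"},
    {"key": "c", "float_value": 2.5}
  ]
}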
Reading https://avro.apache.org/docs/current/spec.html it says a schema must be one of:
A JSON string, naming a defined type.
A JSON object, of the form {"type": "typeName" ...attributes...}, where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
A JSON array, representing a union of embedded types.
I want a schema that describes a tree, using the recursive definition that a tree is either:
A node with a value (say, integer) and a list of trees (the children)
A leaf with a value
My initial attempt looked like:
{
  "name": "Tree",
  "type": [
    {
      "name": "Node",
      "type": "record",
      "fields": [
        {
          "name": "value",
          "type": "long"
        },
        {
          "name": "children",
          "type": { "type": "array", "items": "Tree" }
        }
      ]
    },
    {
      "name": "Leaf",
      "type": "record",
      "fields": [
        {
          "name": "value",
          "type": "long"
        }
      ]
    }
  ]
}
But the Avro compiler rejects this, complaining there is nothing of type {"name":"Tree","type":[{"name":"Node".... It seems Avro doesn't like a union type at the top level. I'm guessing this falls under the aforementioned rule "a schema must be one of .. a JSON object .. where typeName is either a primitive or derived type name." I am not sure what a "derived type name" is, though. At first I thought it was the same as a "complex type", but that includes union types.
Anyway, changing it to the more convoluted definition:
{
  "name": "Tree",
  "type": "record",
  "fields": [{
    "name": "ctors",
    "type": [
      {
        "name": "Node",
        "type": "record",
        "fields": [
          {
            "name": "value",
            "type": "long"
          },
          {
            "name": "children",
            "type": { "type": "array", "items": "Tree" }
          }
        ]
      },
      {
        "name": "Leaf",
        "type": "record",
        "fields": [
          {
            "name": "value",
            "type": "long"
          }
        ]
      }
    ]
  }]
}
works, but now I have this weird record with just a single field whose sole purpose is to let me define the top-level union type I want.
Is this the only way to get what I want in Avro or is there a better way?
Thanks!
While this is not an answer to the actual question about representing a recursive named union (which isn't possible as of late 2022), it is possible to work around this for a tree-like data structure.
If you represent a Tree as a node, and a Leaf as a node with an empty list of children, then one recursive type is sufficient:
{
  "type": "record",
  "name": "TreeNode",
  "fields": [
    {
      "name": "value",
      "type": "long"
    },
    {
      "name": "children",
      "type": { "type": "array", "items": "TreeNode" }
    }
  ]
}
Now, your three types Tree, Node, and Leaf are unified into one type TreeNode, and there is no union of Node and Leaf necessary.
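A quick round-trip with the Python avro library (a sketch; the values are arbitrary) shows the idea, a leaf being just a node with an empty children list:
import io
import avro.schema
import avro.io

tree_schema = avro.schema.Parse("""
{
  "type": "record",
  "name": "TreeNode",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "children", "type": {"type": "array", "items": "TreeNode"}}
  ]
}
""")

# A root with two leaves; leaves simply have no children.
tree = {
    "value": 1,
    "children": [
        {"value": 2, "children": []},
        {"value": 3, "children": []},
    ],
}

# Serialize and read back to show that the recursive schema is accepted.
buf = io.BytesIO()
avro.io.DatumWriter(tree_schema).write(tree, avro.io.BinaryEncoder(buf))

decoder = avro.io.BinaryDecoder(io.BytesIO(buf.getvalue()))
print(avro.io.DatumReader(tree_schema).read(decoder))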
I just stumbled upon the same problem, wanting to define a recursive union. I'm quite pessimistic about a cleaner solution than your convoluted one, because there is currently no way to name a union, and hence no way to recursively refer to it while constructing it; see this open ticket.
I want to do geographic search in CloudSearch. I do indexing like this when uploading a document:
[{"type": "add", "id": "kdhrlfh1304532987654321987654321", "fields":{"name": "user1", "latlon":[12.628611, 120.694152] , "phoneverifiedon": "2015-05-04T15:39:03Z", "fbnumfriends": 172}},
{"type": "add", "id": "kdhrlfh1304532987654321987654322", "fields": {"name": "user2", "latlon":[12.628645,20.694178] , "phoneverifiedon": "2015-05-04T15:39:03Z", "fbnumfriends": 172}}]
I got the below error:
Status: error
Adds: 0
Deletes: 0
Errors:
{ ["Field "latlon" must have array type to have multiple values (near operation with index 1; document_id kdhrlfh1304532987654321987654321)","Validation error for field 'latlon': Invalid latlon value 12.628611"] }
I tried multiple formats for the "latlon" field.
Please suggest the correct format for the lat/long in CloudSearch.
The correct syntax for document submission is a single string with the two values comma-separated, e.g. "latlon": "12.628611, 120.694152".
[
  {
    "type": "add",
    "id": "kdhrlfh1304532987654321987654321",
    "fields": {
      "name": "user1",
      "latlon": "12.628611, 120.694152",
      "phoneverifiedon": "2015-05-04T15:39:03Z",
      "fbnumfriends": 172
    }
  }
]
It is definitely confusing that the submission syntax doesn't match the query syntax, which uses an array to represent lat-lon.
https://forums.aws.amazon.com/thread.jspa?threadID=151633