I am trying to generate a test Avro file from a collection of objects represented by the generated classes (TestAggregate.java, TestTuple.java). I used avro-tools-1.10.2.jar to generate those classes from this Avro schema (dataset.avsc):
{
  "type" : "record",
  "name" : "TestAggregate",
  "namespace" : "com....",
  "fields" : [ {
    "name" : "uuid",
    "type" : "string"
  }, {
    "name" : "bag",
    "type" : {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "TestTuple",
        "fields" : [ {
          "name" : "s",
          "type" : "int"
        }, {
          "name" : "n",
          "type" : "int"
        }, {
          "name" : "c",
          "type" : "int"
        }, {
          "name" : "f",
          "type" : "int"
        } ]
      }
    },
    "aliases" : [ "bag" ]
  } ]
}
When I try to create an Encoder using

Encoder<TestAggregate> datasetEncoder = Encoders.bean(TestAggregate.class);

it throws an exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema...
There is no circular reference in those generated files (or schema) as far as I can tell.
I am using Spark release 3.2.1.
Any ideas on how to resolve it?
I'm not sure you need an encoder (or the compiled classes) at all.
Take the AVSC text itself, and you can get a Schema like so:
SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
Then the Avro schema JSON string can be given to the spark-sql from_avro function.
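A hedged sketch of that flow (assuming Spark 3.x with the external spark-avro module on the classpath, and a DataFrame df holding the Avro payload in a binary column named "value"; the file path is an example). Note that from_avro takes the Avro schema as a JSON string, while SchemaConverters.toSqlType yields the equivalent Spark SQL type if you need it elsewhere:

import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// read the AVSC file as plain text
val avroSchema = new String(
  java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("dataset.avsc")))

// the equivalent Spark SQL type, if needed
val sqlType = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

// decode the Avro binary column using the schema JSON string
val parsed = df.select(from_avro(col("value"), avroSchema).as("data"))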
I have a Spark-Kafka structured streaming pipeline listening to a topic whose records may be JSON of varying schemas.
I want to resolve the schema based on the key (x_y), where the key's 'y' part indicates the schema type, and then apply that schema to the value portion to parse the JSON record.
I tried to get the schema string from a UDF and then pass it to the from_json() function, but it fails with this exception:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of `schema`
Code used:
df.withColumn("data_type", element_at(split(col("key").cast("string"),"_"),1))
.withColumn("schema", schemaUdf($"data_type"))
.select(from_json(col("value").cast("string"), col("schema")).as("data"))
Schema demo:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "name",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "firstname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
UDF used:
import java.io.File
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DataType

// mapper is a Jackson ObjectMapper
lazy val fetchSchema = (fileName: String) => {
  DataType.fromJson(mapper.readTree(new File(fileName)).toString)
}
val schemaUdf = udf[DataType, String](fetchSchema)
Note: I am not using the Confluent schema registry.
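One workaround consistent with what the exception demands (from_json needs a literal schema, not a column) might be to resolve each schema on the driver and parse each subset of the stream separately. A hedged sketch, assuming a known, finite set of data_type values (here "x" and "y", each naming its schema file) and reusing fetchSchema from above:

import org.apache.spark.sql.functions._

val types = Seq("x", "y")  // hypothetical set of known data_type values

val parsed = types.map { t =>
  df.filter(element_at(split(col("key").cast("string"), "_"), 1) === t)
    // fetchSchema runs on the driver here, so from_json receives a literal DataType
    .select(from_json(col("value").cast("string"), fetchSchema(t)).as("data"))
}.reduce(_ unionByName _)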
I am new to MongoDB and NodeJS. When I try to create a JSON schema with the data types string, integer, date, and bool, the schema is created, but it always throws a document validation error while inserting data. So I changed the bsonType of one field to number, and then records started being inserted, but I observed that the value is stored as a Double. I read on Stack Overflow that MongoDB stores numbers that way by default, but my question is: why this behavior? Why is the error not thrown at the time the JSON schema is created, but only at the time of data insertion?
Also, if we have nested objects, say a Customer object with Address as a nested object, the main object's int/number values are stored as Double, whereas the nested Address object's pincode is stored as Int32. This is also very confusing: what is the difference between these objects, given that the structure of the schema is the same?
What are the other ways to implement a properly validated schema for MongoDB?
> db.getCollectionInfos({name:"companysInt1s1"})
[
  {
    "name" : "companysInt1s1",
    "type" : "collection",
    "options" : {
      "validator" : {
        "$jsonSchema" : {
          "bsonType" : "object",
          "required" : [
            "tin"
          ],
          "properties" : {
            "tin" : {
              "bsonType" : "int",
              "minLength" : 2,
              "maxLength" : 11,
              "description" : "must be a string and is not required, should be 11 characters length"
            }
          }
        }
      }
    },
    "info" : {
      "readOnly" : false,
      "uuid" : UUID("27cba650-7bd3-4930-8d3e-7e6cbbf517db")
    },
    "idIndex" : {
      "v" : 2,
      "key" : {
        "_id" : 1
      },
      "name" : "_id_",
      "ns" : "invoice.companysInt1s1"
    }
  }
]
> db.companysInt1s1.insertOne({tin:22222})
2019-02-14T15:04:28.712+0530 E QUERY [js] WriteError: Document failed validation :
WriteError({
  "index" : 0,
  "code" : 121,
  "errmsg" : "Document failed validation",
  "op" : {
    "_id" : ObjectId("5c653624e382c2ec16c16893"),
    "tin" : 22222
  }
})
WriteError#src/mongo/shell/bulk_api.js:461:48
Bulk/mergeBatchResults#src/mongo/shell/bulk_api.js:841:49
Bulk/executeBatch#src/mongo/shell/bulk_api.js:906:13
Bulk/this.execute#src/mongo/shell/bulk_api.js:1150:21
DBCollection.prototype.insertOne#src/mongo/shell/crud_api.js:252:9
#(shell):1:1
Am I missing something, or is there other documentation I should be following? I appreciate your guidance.
You need to insert the value as NumberInt. When you run

db.companysInt1s1.insertOne({tin:22222})

you are actually inserting tin as a double: the mongo shell treats every bare number as a 64-bit float, so the bsonType: "int" validator rejects it. The correct way to do it is

db.companysInt1s1.insertOne({tin: NumberInt(22222) })
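As for why the error only appears on insert: the validator is evaluated against each document at write time, never against the schema definition itself, so an unsatisfiable type constraint only surfaces when data arrives. A minimal shell sketch against the validator above:

// The mongo shell is JavaScript, so a bare 22222 becomes a BSON double,
// which fails the bsonType: "int" validator; wrap the value explicitly.
db.companysInt1s1.insertOne({ tin: NumberInt(22222) })

// Confirm the stored BSON type:
db.companysInt1s1.find({ tin: { $type: "int" } })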
I have documents that contain an object whose attributes are editable (add/delete/edit) at runtime.
{
  "testIndex" : {
    "mappings" : {
      "documentTest" : {
        "properties" : {
          "typeTestId" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "createdDate" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "designation" : {
            "type" : "string",
            "fields" : {
              "raw" : {
                "type" : "string",
                "index" : "not_analyzed"
              }
            }
          },
          "id" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "modifiedDate" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "stuff" : {
            "type" : "string"
          },
          "suggest" : {
            "type" : "completion",
            "analyzer" : "simple",
            "payloads" : true,
            "preserve_separators" : true,
            "preserve_position_increments" : true,
            "max_input_length" : 50,
            "context" : {
              "typeTestId" : {
                "type" : "category",
                "path" : "typeTestId",
                "default" : [ ]
              }
            }
          },
          "values" : {
            "properties" : {
              "Att1" : {
                "type" : "string"
              },
              "att2" : {
                "type" : "string"
              },
              "att400" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              }
            }
          }
        }
      }
    }
  }
}
The field values is an object that can be edited through typeTest, so if I change something in typeTest it should be reflected here. If I create a new field there is no problem, but it should also be possible to edit or delete existing fields in typeTest. For example, if I delete values.att1, all documentTest documents should lose it, and the mapping should be updated as well.
From what I have seen, this cannot be done without reindexing. So for now my solution is to remove the fields in Elasticsearch, just as mentioned in this question, and have a worker do the reindexing from time to time if needed.
This does not seem like a real "solution" to me. Is there a better way to store documents of this type in Elasticsearch, with this flexibility, without having to reindex from time to time?
You can use the Update API to delete, add or modify a field.
The issue is that documents are immutable in Elasticsearch, so when you make changes with the Update API, the old document is marked as deleted and a new one is added with the updates applied.
The deletion and the creation of the new document are transparent to you, so you do not have to reindex or do anything else. The downside is that if you plan to modify a very large number of documents (say, an update query touching 5 million documents), it will be very I/O intensive for the nodes.
BTW, this also applies to deletions.
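For a single document, removing one of the dynamic attributes could look like this hedged sketch (index, type, and field names are taken from the question's mapping; the document id is an example, and on the 1.x-era cluster implied by the string types, dynamic scripting may need to be enabled):

curl -XPOST 'localhost:9200/testIndex/documentTest/1/_update' -d '
{
  "script" : "ctx._source.values.remove(\"Att1\")"
}'

Removing the attribute from every document would mean running such an update for each affected id, which is where the I/O cost mentioned above comes in.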
I am trying to remove the lowest homework score.
I tried this,
var a = db.students.find({"scores.type":"homework"}, {"scores.$":1}).sort({"scores.score":1})
but how can I remove this set of data?
I have 200 documents similar to the one below.
{
  "_id" : 148,
  "name" : "Carli Belvins",
  "scores" : [
    {
      "type" : "exam",
      "score" : 84.4361816750119
    },
    {
      "type" : "quiz",
      "score" : 1.702113040528119
    },
    {
      "type" : "homework",
      "score" : 22.47397850465176
    },
    {
      "type" : "homework",
      "score" : 88.48032660881387
    }
  ]
}
You are trying to remove an element, but the statement you provided only finds it.
Use db.students.remove(<query>) instead. Full documentation is here.
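A minimal shell sketch of that suggestion (the _id value is an example):

// remove() deletes every document matching the query
db.students.remove({ _id: 148 })

Note that remove() deletes whole documents; dropping only the lowest element out of the scores array would instead be an update using the $pull operator.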
Given this Person collection:
{
  "_id" : ObjectId("4f8e95a718bcv9c74da1e6511a"),
  "name" : "John",
  "hobbies" : [{
    "id" : 001,
    "name" : "reading",
    "location" : "home"
  }, {
    "id" : 002,
    "name" : "sport",
    "location" : "outside"
  }]
}
and these new/edited Hobby objects:
{
  "name" : "walking",
  "location" : "outside"
}

and

{
  "id" : 001,
  "name" : "reading",
  "location" : "outside"
}
If I know the Person that I want to manage, what is the best way to upsert the embedded objects?
Currently my approach is to find the Person object, make the required modifications to it in my code, and then save it back to the DB. This works, but I would like to simplify it and reduce the number of round trips to the database.
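One common pattern (a hedged sketch, not necessarily the best way; personId stands for the known Person's _id, and the collection name is assumed) is to edit an existing hobby in place with the positional operator and append a new one with $push, each in a single round trip:

// Edit the embedded hobby whose id is 1 ($ refers to the matched array element)
db.person.updateOne(
  { _id: personId, "hobbies.id": 1 },
  { $set: { "hobbies.$.location": "outside" } }
)

// Append the new hobby (id generation is left to the application)
db.person.updateOne(
  { _id: personId },
  { $push: { hobbies: { id: 3, name: "walking", location: "outside" } } }
)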