Does Pub/Sub topic schema validation support nested JSON?

I'm having trouble creating an Avro schema with a nested message. For example, this JSON message:
{"metadata": {"key1": "value1", "key2": "value2"}, "payload": {"key1": "value1", "key2": "value2"}}
From the Apache Avro documentation I think this schema definition should work, but it doesn't seem to:
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "metadata",
      "type": "record",
      "fields": [
        { "type": "string", "name": "key1" },
        { "type": "string", "name": "key2" }
      ]
    },
    {
      "name": "payload",
      "type": "record",
      "fields": [
        { "type": "string", "name": "key1" },
        { "type": "string", "name": "key2" }
      ]
    }
  ]
}
Am I doing something wrong or is nesting just not supported?

The Avro schema definition you have provided isn't actually valid. The way to specify this schema would be:
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "MetadataRecord",
        "fields": [
          { "type": "string", "name": "key1" },
          { "type": "string", "name": "key2" }
        ]
      }
    },
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "PayloadRecord",
        "fields": [
          { "type": "string", "name": "key1" },
          { "type": "string", "name": "key2" }
        ]
      }
    }
  ]
}
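The key difference is that a nested record's type must be a full schema object rather than the bare string "record", and because records are named types in Avro, each nested record needs its own name (here MetadataRecord and PayloadRecord).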
There is still more work being done to ensure that the error messages returned on schema creation provide more detail while the feature is in public preview.
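For reference, a minimal sketch of creating the schema programmatically, assuming the google-cloud-pubsub Python client (project_id, schema_id, and avsc_source below are placeholders):

from google.cloud.pubsub import SchemaServiceClient
from google.pubsub_v1.types import Schema

# Placeholders -- substitute your own values.
project_id = "my-project"
schema_id = "my-avro-schema"
avsc_source = "..."  # the corrected Avro schema JSON shown above

schema_client = SchemaServiceClient()
project_path = f"projects/{project_id}"
schema_path = schema_client.schema_path(project_id, schema_id)

# Type.AVRO tells Pub/Sub to validate the definition as an Avro schema.
result = schema_client.create_schema(
    request={
        "parent": project_path,
        "schema": Schema(name=schema_path, type_=Schema.Type.AVRO, definition=avsc_source),
        "schema_id": schema_id,
    }
)
print(f"Created schema: {result.name}")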
You can see more details on the error if you run it through the avro parser in, say, Python:
import avro.schema

schema = """
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "metadata",
      "type": "record",
      "fields": [
        { "type": "string", "name": "key1" },
        { "type": "string", "name": "key2" }
      ]
    },
    {
      "name": "payload",
      "type": "record",
      "fields": [
        { "type": "string", "name": "key1" },
        { "type": "string", "name": "key2" }
      ]
    }
  ]
}
"""

parsed_schema = avro.schema.parse(schema)
Running this script will yield the error:
avro.schema.SchemaParseException: Type property "record" not a valid Avro schema: Could not make an Avro Schema object from record.
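For comparison, the corrected schema parses cleanly with the same library; a quick sketch:
import avro.schema

fixed_schema = """
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "MetadataRecord",
        "fields": [
          { "type": "string", "name": "key1" },
          { "type": "string", "name": "key2" }
        ]
      }
    },
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "PayloadRecord",
        "fields": [
          { "type": "string", "name": "key1" },
          { "type": "string", "name": "key2" }
        ]
      }
    }
  ]
}
"""

# No SchemaParseException this time; the parsed object is a named record schema.
parsed_schema = avro.schema.parse(fixed_schema)
print(parsed_schema.name)  # Avro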

Related

handle empty date in elasticsearch dynamically

I have the dynamic mapping template below.
PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "objects": {
          "match_mapping_type": "object",
          "mapping": {
            "type": "nested"
          }
        }
      }
    ],
    "dynamic_date_formats": ["yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss"]
  }
}
The only problem is that when a date is empty it throws an error. I just want to ignore empty dates. My data has multiple date fields, so I don't want to define a mapping for each date field.
Below is the error I am getting:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: mapper [pb_bureau.applications.accounts.dateclosed] of different type, current_type [text], merged_type [date]
{"index":{"_id":"02ade9b5-1ca5-4006-ab06-9c96439e7d02"}}
Below are the dates we are inserting; a blank field is a null date value:
select date1, date2 from cbl_application_credit_report_account ;
2014-11-14
2018-03-31
2012-07-27 2012-07-23
2015-11-30
2017-08-04 2016-05-13
Below is the mapping I am applying:
PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "objects": {
          "match_mapping_type": "object",
          "mapping": {
            "type": "nested"
          }
        }
      },
      {
        "dates_ignore_malformed": {
          "path_match": "*",
          "match_mapping_type": "date",
          "mapping": {
            "format": "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss",
            "ignore_malformed": true
          }
        }
      }
    ],
    "dynamic_date_formats": ["yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss"]
  }
}
Is there any way in the dynamic mapping to ignore empty dates?
Mapping:
PUT my_index4
{
  "mappings": {
    "dynamic_templates": [
      {
        "objects": {
          "match_mapping_type": "object",
          "mapping": {
            "type": "nested"
          }
        }
      },
      {
        "dates_ignore_malformed": {
          "path_match": "*",
          "match_mapping_type": "date",
          "mapping": {
            "format": "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss",  ---> date formats this applies to
            "ignore_malformed": true                      ---> ignore the field if it is malformed
          }
        }
      }
    ],
    "dynamic_date_formats": [
      "yyyy-MM-dd",
      "yyyy-MM-dd HH:mm:ss"
    ]
  }
}
Data:
POST my_index4/_doc
{
  "date": "2019-01-01 04:30:22",
  "Id": 1
}
POST my_index4/_doc
{
  "name": 2,
  "date": "2019-01-01"
}
POST my_index4/_doc
{
  "name": 2,
  "date": ""
}
Query:
GET my_index4/_search
Result:
"hits" : [
{
"_index" : "my_index4",
"_type" : "_doc",
"_id" : "NT5XSG0BbzgYofLxTDZ_",
"_score" : 1.0,
"_source" : {
"date" : "2019-01-01 04:30:22",
"Id" : 1
}
},
{
"_index" : "my_index4",
"_type" : "_doc",
"_id" : "Nj5XSG0BbzgYofLxUTaT",
"_score" : 1.0,
"_source" : {
"name" : 2,
"date" : "2019-01-01"
}
},
{
"_index" : "my_index4",
"_type" : "_doc",
"_id" : "Nz5XSG0BbzgYofLxWDYi",
"_score" : 1.0,
"_ignored" : [
"date"
],
"_source" : {
"name" : 2,
"date" : ""
}
}
]
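Note the third document: because ignore_malformed is set on the dynamic date mapping, the empty date string is skipped at index time instead of failing the whole document, and the skipped field is reported in the hit's _ignored metadata.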

elastic search query to select most relevant data for given keyword

I have a search query to get data from an Elasticsearch database. The query is below:
GET /v_entity_master/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "(*gupta*)",
            "fields": ["id", "name", "mobile"]
          }
        },
        {
          "query_string": {
            "query": "*7542*",
            "fields": ["id", "name", "mobile"]
          }
        }
      ]
    }
  }
}
This query returns:
{
  "id": 34501,
  "name": "sree gupta",
  "mobile": "98775421"
},
{
  "id": 12302,
  "name": "gupta",
  "mobile": "98775422"
}
But what I require is that an exact match of the given search keyword should come first. Expected output:
{
  "id": 12302,
  "name": "gupta",
  "mobile": "98775422"
},
{
  "id": 34501,
  "name": "sree gupta",
  "mobile": "98775421"
}
Please share your suggestions and ideas to solve this issue. Thanks in advance.
So first of all, why would you search for "(gupta)" in the id and mobile (phone?) fields? Based on the two results you shared, those are numeric fields, so what's your intention with that?
Same issue with the second must clause: I've never encountered a real name of a human being that includes numeric values...
I also don't get why you use wildcards in the first must clause. I assume you want to do a full-text search, so you can simply use the match query.
Now to your actual question:
I created an index in my test cluster and indexed the two responses you showed as documents. This is the response when I execute your query:
{
  ...
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.0,
    "hits" : [
      {
        "_index" : "gupta",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.0,
        "_source" : {
          "id" : 12302,
          "name" : "gupta",
          "mobile" : "98775422"
        }
      },
      {
        "_index" : "gupta",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.0,
        "_source" : {
          "id" : 34501,
          "name" : "sree gupta",
          "mobile" : "98775421"
        }
      }
    ]
  }
}
Notice that both documents have the same score. That's because you specified wildcards in your search query.
Now let's modify your query:
GET gupta/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "gupta"
          }
        },
        {
          "query_string": {
            "query": "*7542*",
            "fields": ["mobile"]
          }
        }
      ]
    }
  }
}
The main difference is that this query uses a match query to do the full-text search. You don't need to specify any wildcards, since your text fields are analyzed.
This will return the following:
{
  ...
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.2111092,
    "hits" : [
      {
        "_index" : "gupta",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2111092,
        "_source" : {
          "id" : 12302,
          "name" : "gupta",
          "mobile" : "98775422"
        }
      },
      {
        "_index" : "gupta",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.160443,
        "_source" : {
          "id" : 34501,
          "name" : "sree gupta",
          "mobile" : "98775421"
        }
      }
    ]
  }
}
Now the two documents have different scores due to field-length normalization. As stated in this article about Elasticsearch scoring, a term match found in a field with a low number of total terms is going to be more important than a match found in a field with a large number of terms.
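(With the default BM25 similarity in recent Elasticsearch versions, this effect comes from the field-length component of the scoring formula: for the same match, a shorter field scores higher.)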
I hope I could help you.

How to $lookup and filter an array of objects in MongoDB?

I have a question about looking up and filtering an array of objects in MongoDB.
I have this structure (Person):
{
  "_id": "5cc3366c22c3767a2b114c6b",
  "flags": [
    "5cc30210fada5d7820d03aaf",
    "5cc2c3924a94a575adbdc56a"
  ],
  "key": "Animal",
  "name": "name1",
  "description": "description1",
  "endpoints": [
    {
      "isEnabled": true,
      "publishUrl": "megaUrl",
      "env": "5cc1a8911b19026fd193506b"
    },
    {
      "isEnabled": true,
      "publishUrl": "megaUrl",
      "env": "5ccaeef3312acb103730d4c5"
    }
  ]
}
envs collection
{
  "_id" : "5cc1a8911b19026fd193506b",
  "name" : "name2",
  "key" : "PROD",
  "publishUrl" : "url1",
  "__v" : 0
}
{
  "_id" : "5ccaeef3312acb103730d4c5",
  "name" : "name2",
  "key" : "PROD",
  "publishUrl" : "url1",
  "__v" : 0
}
I need to filter the document by endpoints.$.env.
So, given accessKeys = ["PROD", "UAT"], I should see a result with only the endpoints where env.key === "PROD" || env.key === "UAT".
Expected result:
{
  "_id": "5cc3366c22c3767a2b114c6b",
  "flags": [
    "5cc30210fada5d7820d03aaf",
    "5cc2c3924a94a575adbdc56a"
  ],
  "key": "Animal",
  "name": "name1",
  "description": "description1",
  "endpoints": [
    {
      "isEnabled": true,
      "publishUrl": "megaUrl",
      "env": {
        "_id" : "5cc1a8911b19026fd193506b",
        "name" : "name2",
        "key" : "PROD",
        "publishUrl" : "url1",
        "__v" : 0
      }
    }
  ]
}
Please help me: how can I do that? I know about aggregate, but I can't get it to work. :(
Try this:
db.persons.aggregate([
  {
    $unwind : "$endpoints"
  },
  {
    $lookup : {
      from : "envs",
      localField : "endpoints.env",
      foreignField : "_id",
      as : "endpoints.env"
    }
  },
  {
    $unwind : "$endpoints.env"
  },
  {
    $match : {
      "endpoints.env.key" : { $in : accessKeys }
    }
  },
  {
    $group : {
      _id : "$_id",
      flags : { $first : "$flags" },
      key : { $first : "$key" },
      name : { $first : "$name" },
      description : { $first : "$description" },
      endpoints : { $push : "$endpoints" }
    }
  }
])
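The idea: $unwind flattens endpoints so each entry can be joined individually, $lookup resolves each env reference against the envs collection, $match keeps only the endpoints whose env.key is in accessKeys (the array from your question, e.g. var accessKeys = ["PROD", "UAT"];), and the final $group reassembles each person document with only the surviving endpoints.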

Converting a MongoDB aggregate into an ArangoDB COLLECT

I'm migrating data from Mongo to Arango and I need to reproduce a $group aggregation. I have successfully reproduced the results, but I'm concerned that my approach may be sub-optimal. Can the AQL be improved?
I have a collection of data that looks like this:
{
  "_id" : ObjectId("5b17f9d85b2c1998598f054e"),
  "department" : [ "Sales", "Marketing" ],
  "region" : [ "US", "UK" ]
}
{
  "_id" : ObjectId("5b1808145b2c1998598f054f"),
  "department" : [ "Sales", "Marketing" ],
  "region" : [ "US", "UK" ]
}
{
  "_id" : ObjectId("5b18083c5b2c1998598f0550"),
  "department" : "Development",
  "region" : "Europe"
}
{
  "_id" : ObjectId("5b1809a75b2c1998598f0551"),
  "department" : "Sales"
}
Note that each value can be a string, an array, or not present.
In Mongo I'm using the following code to aggregate the data:
db.test.aggregate([
  {
    $unwind : {
      path : "$department",
      preserveNullAndEmptyArrays : true
    }
  },
  {
    $unwind : {
      path : "$region",
      preserveNullAndEmptyArrays : true
    }
  },
  {
    $group : {
      _id : {
        department : { $ifNull : [ "$department", "null" ] },
        region : { $ifNull : [ "$region", "null" ] }
      },
      count : { $sum : 1 }
    }
  }
])
In Arango I'm using the following AQL:
FOR i IN test
  LET FIELD1 = (FOR a IN APPEND([], NOT_NULL(i.department, "null")) RETURN a)
  LET FIELD2 = (FOR a IN APPEND([], NOT_NULL(i.region, "null")) RETURN a)
  FOR f1 IN FIELD1
    FOR f2 IN FIELD2
      COLLECT id = { department: f1, region: f2 } WITH COUNT INTO counter
      RETURN { _id: id, count: counter }
Edit:
APPEND() is used to convert string values into an array.
Both produce results that look like this:
{
  "_id" : { "department" : "Marketing", "region" : "US" },
  "count" : 2.0
}
{
  "_id" : { "department" : "Development", "region" : "Europe" },
  "count" : 1.0
}
{
  "_id" : { "department" : "Sales", "region" : "null" },
  "count" : 1.0
}
{
  "_id" : { "department" : "Marketing", "region" : "UK" },
  "count" : 2.0
}
{
  "_id" : { "department" : "Sales", "region" : "UK" },
  "count" : 2.0
}
{
  "_id" : { "department" : "Sales", "region" : "US" },
  "count" : 2.0
}
Your approach seems alright. I would suggest using TO_ARRAY() instead of APPEND() to make it easier to understand, though.
Both functions skip null values, so it is unavoidable to either provide some placeholder or test for null explicitly and return an array containing a null value (whatever works best for you):
FOR doc IN test
  FOR field1 IN doc.department == null ? [ null ] : TO_ARRAY(doc.department)
    FOR field2 IN doc.region == null ? [ null ] : TO_ARRAY(doc.region)
      COLLECT department = field1, region = field2
      WITH COUNT INTO count
      RETURN { _id: { department, region }, count }
Collection test:
[
  {
    "_key": "5b17f9d85b2c1998598f054e",
    "department": [ "Sales", "Marketing" ],
    "region": [ "US", "UK" ]
  },
  {
    "_key": "5b18083c5b2c1998598f0550",
    "department": "Development",
    "region": "Europe"
  },
  {
    "_key": "5b1808145b2c1998598f054f",
    "department": [ "Sales", "Marketing" ],
    "region": [ "US", "UK" ]
  },
  {
    "_key": "5b1809a75b2c1998598f0551",
    "department": "Sales"
  }
]
Result:
[
  { "_id": { "department": "Development", "region": "Europe" }, "count": 1 },
  { "_id": { "department": "Marketing", "region": "UK" }, "count": 2 },
  { "_id": { "department": "Marketing", "region": "US" }, "count": 2 },
  { "_id": { "department": "Sales", "region": null }, "count": 1 },
  { "_id": { "department": "Sales", "region": "UK" }, "count": 2 },
  { "_id": { "department": "Sales", "region": "US" }, "count": 2 }
]

How to find multiple exact values on a particular field in elasticsearch query?

My sample data is shown below:
{"listings": {"title": "testing 1", "address": {"line1": "3rd cross", "line2": "6th main", "line3": "", "landmark": "", "location": "k r puram", "pincode": "", "city": "Bangalore"}, "purpose": "rent", "published": true, "inActive": false}},
{"listings": {"title": "testing 2", "address": {"line1": "3rd cross", "line2": "6th main", "line3": "", "landmark": "", "location": "banaswadi", "pincode": "", "city": "Bangalore"}, "purpose": "sale", "published": true, "inActive": false}},
{"listings": {"title": "testing 3", "address": {"line1": "3rd cross", "line2": "6th main", "line3": "", "landmark": "", "location": "tin factory", "pincode": "", "city": "Bangalore"}, "purpose": "rent", "published": true, "inActive": false}}
My index mapping is shown below:
curl -X PUT localhost:9200/testing/listings/_mapping -d '{
  "listings" : {
    "properties" : {
      "address" : {
        "properties" : {
          "location" : {
            "type" : "string",
            "index" : "not_analyzed"
          }
        }
      },
      "suggest" : {
        "type" : "completion",
        "index_analyzer" : "simple",
        "search_analyzer" : "simple",
        "payloads" : true
      }
    }
  }
}'
I access the listings objects based on the purpose property value, like rent or sale. I am able to fetch the objects for rent and sale individually. How can I fetch listings objects for both rent and sale values? I have used the query below to fetch both rent and sale listings objects:
{"query":{
"filtered": {
"filter": {
"terms": {
"purpose" : ["rent", "sale"]
}
}
},
"bool":{
"must":[
{"match":{"published":true}},
{"match":{"inActive":false}},
{"match":{"address.city": "bangalore"}}
]
}
}
}
Please suggest the changes as needed. Thanks in advance.
There are a couple of things:
address should be declared as a nested object to prevent wrong search results in the future when searching the address fields. You can refer here for more info about the problems with inner objects vs. nested objects: http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
When using a nested object, the mapping will look like this (address has type nested):
"listings" : {
"properties" : {
"address" : {
"type" : "nested",
"properties": {
"location": {
"type" : "string",
"index" : "not_analyzed"
}
}
},
"suggest" : {
"type" : "completion",
"index_analyzer" : "simple",
"search_analyzer" : "simple",
"payloads" : true
}
}
}
And the query will change a bit: use a terms filter with execution=and, and use a nested query for the address fields:
"query":{
"filtered": {
"filter": {
"terms": {
"execution" : "and",
"purpose" : ["rent", "sale"]
}
},
"query" : {
"bool":{
"must":[
{"term":{"published":true}},
{"term":{"inActive":false}},
{ "nested": {
"path": "address",
"query":
{"match":{"address.city": "bangalore"}}
}
}
]
}
}
}
}
You should refer to the Elasticsearch documentation for the syntax of each kind of query.
Hope it helps.
