Editable document fields in Elasticsearch - node.js

I have documents that contain an object whose attributes are editable (add/delete/edit) at runtime.
{
  "testIndex" : {
    "mappings" : {
      "documentTest" : {
        "properties" : {
          "typeTestId" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "createdDate" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "designation" : {
            "type" : "string",
            "fields" : {
              "raw" : {
                "type" : "string",
                "index" : "not_analyzed"
              }
            }
          },
          "id" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "modifiedDate" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "stuff" : {
            "type" : "string"
          },
          "suggest" : {
            "type" : "completion",
            "analyzer" : "simple",
            "payloads" : true,
            "preserve_separators" : true,
            "preserve_position_increments" : true,
            "max_input_length" : 50,
            "context" : {
              "typeTestId" : {
                "type" : "category",
                "path" : "typeTestId",
                "default" : [ ]
              }
            }
          },
          "values" : {
            "properties" : {
              "Att1" : {
                "type" : "string"
              },
              "att2" : {
                "type" : "string"
              },
              "att400" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              }
            }
          }
        }
      }
    }
  }
}
The field values is an object that can be edited through typeTest, so if I change something in typeTest it should be reflected here. If I create a new field there is no problem, but it should also be possible to edit or delete existing fields in typeTest. For example, if I delete values.att1, all documentTest documents should lose that field, and the mapping should be updated as well.
From what I have seen, we cannot do this without reindexing. So for now my solution is to remove the fields in Elasticsearch, just as mentioned in this question, and have a worker do the reindexing from time to time if needed.
This does not seem like a "solution" to me. Is there a better way to have documents of this type in Elasticsearch, with this flexibility, without having to reindex from time to time?

You can use the Update API to delete, add or modify a field.
The catch is that documents are immutable in Elasticsearch, so when you make changes with the Update API, the old document is marked as deleted and a new one is indexed with the updates applied.
The deletion and the creation of the new document are transparent to you, so you do not have to reindex or do anything else. The downside is that if you plan to modify a very large number of documents (say, an update query touching 5 million documents), it will be very I/O intensive for the nodes.
By the way, this also applies to deletions.
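To make that concrete, here is a minimal sketch of removing a field such as values.att1 across all matching documents with the Update By Query API from node.js. The index name comes from the question, but the client package, its version, and the Painless script syntax are assumptions and may need adjusting for older Elasticsearch releases:
// Sketch: strip the "att1" attribute from the "values" object of every
// document that still has it, without manually reindexing.
const { Client } = require('@elastic/elasticsearch'); // assumed client package

const client = new Client({ node: 'http://localhost:9200' });

async function dropAtt1() {
  await client.updateByQuery({
    index: 'testIndex',
    body: {
      // Only touch documents that actually contain the field.
      query: { exists: { field: 'values.att1' } },
      // Painless script that removes the attribute from the "values" object.
      script: { source: "ctx._source.values.remove('att1')" }
    }
  });
}

dropAtt1().catch(console.error);
Note that this only changes each document's _source; the stale field definition still lingers in the mapping until the index is recreated, which is the reindexing caveat raised in the question.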

Related

Apache Spark circular reference Exception when creating Encoders

I am trying to generate a test AVRO file from a collection of objects represented by generated classes (TestAggregate.java, TestTuple.java). I used avro-tools-1.10.2.jar to generate those classes from this AVRO schema (dataset.avsc):
{
  "type" : "record",
  "name" : "TestAggregate",
  "namespace" : "com....",
  "fields" : [ {
    "name" : "uuid",
    "type" : "string"
  }, {
    "name" : "bag",
    "type" : {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "TestTuple",
        "fields" : [ {
          "name" : "s",
          "type" : "int"
        }, {
          "name" : "n",
          "type" : "int"
        }, {
          "name" : "c",
          "type" : "int"
        }, {
          "name" : "f",
          "type" : "int"
        } ]
      }
    },
    "aliases" : [ "bag" ]
  } ]
}
When I try to create an Encoder using
Encoder<TestAggregate> datasetEncoder = Encoders.bean(TestAggregate.class);, it throws an exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema...
There is no circular reference in those generated files (or schema) as far as I can tell.
I am using Spark release 3.2.1.
Any ideas on how to resolve it?
I'm not sure you need an encoder (or the compiled class).
Take the AVSC text itself, and you can get a Schema like so:
SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
Then this can be given to the Spark SQL from_avro function.

$merge, $match and $update in one aggregate query

I have data in a collection, e.g. "jobs". I am trying to copy specific data from "jobs" every 2 hours to a new collection (which may not exist initially) and also add a new key to the copied data.
I have been trying with this query to copy the data:
db.getCollection("jobs").aggregate([{ $match: { "job_name": "UploadFile", "created_datetime" : {"$gte":"2021-08-18 12:00:00"} } },{"$merge":{into: {coll : "reports"}}}])
But after this, the count in the "reports" collection is 0. Also, how can I update the documents (with an extra key "report_name") without using an extra updateMany() query?
The data in the jobs collection is as shown:
{
  "_id" : ObjectId("60fa8e8283dc22799134dc6f"),
  "job_id" : "408a5654-9a89-4c15-82b4-b0dc894b19d7",
  "job_name" : "UploadFile",
  "data" : {
    "path" : "share://LOCALNAS/Screenshot from 2021-07-23 10-34-34.png",
    "file_name" : "Screenshot from 2021-07-23 10-34-34.png",
    "parent_path" : "share://LOCALNAS",
    "size" : 97710,
    "md5sum" : "",
    "file_uid" : "c4411f10-a745-48d0-a55d-164707b7d6c2",
    "version_id" : "c3dfd31a-80ba-4de0-9115-2d9b778bcf02",
    "session_id" : "c4411f10-a745-48d0-a55d-164707b7d6c2",
    "resource_name" : "Screenshot from 2021-07-23 10-34-34.png",
    "metadata" : {
      "metadata" : {
        "description" : "",
        "tag_ids" : [ ]
      },
      "category_id" : "60eed9ea33c690a0dfc89b41",
      "custom_metadata" : [ ]
    },
    "upload_token" : "upload_token_c5043927484e",
    "upload_url" : "/mnt/share_LOCALNAS",
    "vfs_action_handler_id" : "91be4282a9ad5067642cdadb75278230",
    "element_type" : "file"
  },
  "user_id" : "60f6c507d4ba6ee28aee5723",
  "node_id" : "syeda",
  "state" : "COMPLETED",
  "priority" : 2,
  "resource_name" : "Screenshot from 2021-07-23 10-34-34.png",
  "group_id" : "upload_group_0babf8b7ce0b",
  "status_info" : {
    "progress" : 100,
    "status_msg" : "Upload Completed."
  },
  "error_code" : "",
  "error_message" : "",
  "created_datetime" : ISODate("2021-07-23T15:10:18.506Z"),
  "modified_datetime" : ISODate("2021-07-23T15:10:18.506Z"),
  "schema_version" : "1.0.0"
}
Your $match stage contains a condition that compares created_datetime against a string, while in your sample data it is an ISODate. Such a condition won't return any documents; try:
{
  $match: {
    "job_name": "UploadFile",
    "created_datetime": {
      "$gte": ISODate("2021-07-01T12:00:00.000Z")
    }
  }
}
Mongo Playground
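To also add the extra "report_name" key in the same pipeline, as asked above, a $set stage can be placed before $merge. This is only a sketch: the report_name value and the date threshold are placeholders, and it assumes MongoDB 4.2+ (required for $merge and $set):
db.getCollection("jobs").aggregate([
  {
    $match: {
      "job_name": "UploadFile",
      "created_datetime": { "$gte": ISODate("2021-07-01T12:00:00.000Z") }
    }
  },
  // Add the new key to every copied document (placeholder value).
  { $set: { "report_name": "upload_report" } },
  // $merge creates "reports" if it does not exist; because it matches on _id by
  // default, re-running the job updates previously copied documents instead of
  // duplicating them.
  { $merge: { into: "reports" } }
])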

How to combine Elasticsearch highlighting and query types?

If I do a search like:
{
  "query" : {
    "bool" : {
      "should" : [
        { "match" : { "FullName" : "MiddleName1" } },
        { "match_phrase" : { "FullName" : "FirstName2 LastName2" } }
      ]
    }
  }
}
I would get documents like:
{
  ...
  "_id": "1",
  ...
  "FullName" : "FirstName1 MiddleName1 LastName1",
  ...
}
{
  ...
  "_id": "2",
  ...
  "FullName" : "FirstName2 LastName2",
  ...
}
I know highlighting shows what parts of the documents caused retrievals, but not what query type.
I can also do multiple queries and then merge the results.
Is there a way to do a single query and find out which query type, fields, and tokens caused certain documents to be retrieved?
Ideal result:
{
  "id" : "1",
  "clause" : "match",
  "field" : "FullName",
  "tokens" : ["MiddleName1"]
}
{
  "id" : "2",
  "clause" : "match_phrase",
  "field" : "FullName",
  "tokens" : ["FirstName2", "LastName2"]
}

MongoDB remove the lowest score, node.js

I am trying to remove the lowest homework score.
I tried this,
var a = db.students.find({"scores.type":"homework"}, {"scores.$":1}).sort({"scores.score":1})
but how can I remove this set of data?
I have 200 similar documents like the one below.
{
  "_id" : 148,
  "name" : "Carli Belvins",
  "scores" : [
    {
      "type" : "exam",
      "score" : 84.4361816750119
    },
    {
      "type" : "quiz",
      "score" : 1.702113040528119
    },
    {
      "type" : "homework",
      "score" : 22.47397850465176
    },
    {
      "type" : "homework",
      "score" : 88.48032660881387
    }
  ]
}
You are trying to remove an element, but the statement you provided only finds it.
Use db.students.remove(<query>) instead. Full documentation is here.
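As a minimal sketch of that suggestion (legacy mongo shell assumed, collection and sample values taken from the question): remove() deletes whole matching documents, so if the intent is only to drop the lowest homework score from the scores array, a $pull update, which is a different technique from the one suggested above, is what targets the array element.
// Deletes every document that has at least one homework score below 60
// (the threshold is only an illustrative assumption).
db.students.remove({ "scores": { $elemMatch: { "type": "homework", "score": { $lt: 60 } } } })

// Alternative: remove just the lowest homework score of one student by pulling
// that array element (values taken from the sample document above).
db.students.updateOne(
  { "_id": 148 },
  { $pull: { "scores": { "type": "homework", "score": 22.47397850465176 } } }
)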

How to geo_distance filter against multiple location fields in Elasticsearch

I have an arbitrary # of location data points per document (anywhere up to 80). I want to perform a geo_distance filter against these locations. The Elasticsearch docs claim that:
The geo_distance filter can work with multiple locations / points per document.
Once a single location / point matches the filter, the document will be included in the filter.
It's never made clear how to achieve this. I assume that you have to define the # of locations ahead of time, such that your indexed document contains these nested fields:
{
  "pin" : {
    "location" : {
      "lat" : 40.12,
      "lon" : -71.34
    }
  }
}
{
  "alt_pin" : {
    "location" : {
      "lat" : 41.12,
      "lon" : -72.34
    }
  }
}
I assume that you would then filter against pin.location and alt_pin.location somehow.
What if I had an arbitrary number of locations (pin1, pin2, pin3, ...)? Can I do something like this:
"pin" : {
"locations" : [{
"lat" : 41.12,
"lon" : -72.34
}, {
"lat" : 41.12,
"lon" : -72.34
}]
}
}
Would some variation on that work? Maybe using geo_hashes instead of lat/lng coordinates?
Multiple location values can be represented as an array of location fields. Try this:
{
  "pin": [
    {
      "location" : {
        "lat": 40.12,
        "lon": -71.34
      }
    },
    {
      "location" : {
        "lat": 41.12,
        "lon": -72.34
      }
    }
  ]
}
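For completeness, here is a sketch of how such an array could then be filtered, assuming pin.location is mapped as geo_point and using the official node.js client; the index name, distance, and origin point are made-up values. geo_distance matches a document as soon as any point in the array falls inside the radius:
const { Client } = require('@elastic/elasticsearch'); // assumed client package
const client = new Client({ node: 'http://localhost:9200' });

async function findNearby() {
  // Returns documents where at least one pin.location is within 10km of the origin.
  const result = await client.search({
    index: 'places', // assumed index name
    body: {
      query: {
        bool: {
          filter: {
            geo_distance: {
              distance: '10km',
              'pin.location': { lat: 40.12, lon: -71.34 }
            }
          }
        }
      }
    }
  });
  console.log(result);
}

findNearby().catch(console.error);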
