Elasticsearch Analyze API Oddity

My question is fairly simple. Say that I have a type mapping in an index that looks like this:
"mappings" : {
"post" : {
"analyzer" : "my_custom_analyzer",
"properties" : {
"body" : {
"type" : "string",
"store" : true
}
}
}
}
Note that I specified my_custom_analyzer as the analyzer for the type. When I search the body field without specifying an analyzer in the query, I expect my_custom_analyzer to be used. However, when I use the Analyze API to query the field:
curl "http://localhost:9200/myindex/_analyze?field=post.body&text=test"
It returns the standard analyzer's results for the string field. When I specify the analyzer explicitly, it works:
curl "http://localhost:9200/myindex/_analyze?analyzer=my_custom_analyzer&text=test"
My question is: why doesn't the Analyze API use the default type analyzer when I specify a field?

An analyzer is applied per string field.
You can't apply it to an object or nested object and expect all the fields under that object to inherit it.
The right approach is as follows -
"mappings" : {
"post" : {
"properties" : {
"body" : {
"type" : "string",
"analyzer" : "my_custom_analyzer",
"store" : true
}
}
}
}
The reason the analyzer worked with the Analyze API is that you declared it at the index level, so it can be referenced by name.
If you want to define an analyzer for all the string fields under a particular object, you need to declare that in a dynamic template. You can get more information about that here - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-root-object-type.html#_dynamic_templates
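The lookup behavior above can be sketched in a few lines of Python. This is not Elasticsearch's actual code, just an illustration of why `?field=post.body` fell back to the standard analyzer: the Analyze API resolves the analyzer from the field's own mapping, and a type-level "analyzer" key is not part of a field mapping.

```python
# Sketch (not Elasticsearch internals): how ?field=... resolution behaves.
# A type-level "analyzer" key is ignored for field lookup, so the broken
# mapping falls back to the default ("standard") analyzer.

def resolve_field_analyzer(mappings, type_name, field_name, default="standard"):
    """Return the analyzer name the Analyze API would pick for a field."""
    field = (
        mappings.get(type_name, {})
        .get("properties", {})
        .get(field_name, {})
    )
    return field.get("analyzer", default)

# Analyzer declared at the type level (the question's mapping):
broken = {"post": {"analyzer": "my_custom_analyzer",
                   "properties": {"body": {"type": "string"}}}}
# Analyzer declared on the field itself (the answer's mapping):
fixed = {"post": {"properties": {"body": {"type": "string",
                                          "analyzer": "my_custom_analyzer"}}}}

print(resolve_field_analyzer(broken, "post", "body"))  # standard
print(resolve_field_analyzer(fixed, "post", "body"))   # my_custom_analyzer
```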

Related

MongoDB with Editor.js expandable data

I am currently designing a blog-type application with a MongoDB backend. The blog application will use editor.js to allow the creation and editing of 'blogs'.
https://editorjs.io/
Editor.js is very friendly, and returns data like this:
{
  "time" : 1610826755415,
  "blocks" : [
    {
      "type" : "header",
      "data" : {
        "text" : "Editor.js",
        "level" : 2
      }
    },
    {
      "type" : "paragraph",
      "data" : {
        "text" : "Hey. Meet the new Editor. On this page you can see it in action — try to edit this text."
      }
    },
    {
      "type" : "header",
      "data" : {
        "text" : "Key features",
        "level" : 3
      }
    }
  ]
}
My concern is that even though this is very MongoDB friendly, depending on the size of the blog, a document may approach the 16MB document limit (which we shouldn't get close to). Is there a sensible way to split this up without running into Mongo's limits? Perhaps by taking the different block types and dividing it up that way?
Thank you.
You can split out content that can get big by referring to the _id of the document the data lives in. If you decide that a paragraph can exceed a reasonable size, store only a summary inline together with a reference to the full content, e.g.:
{
  "time" : 1610826755415,
  "docId": ObjectId("00000001"),
  "blocks" : [
    {
      "type" : "paragraph",
      "summary": "Hey. Meet the new Editor. On this page you...",
      "dataId" : ObjectId("123456789")
    },
    {/* other blocks... */}
  ]
}
Then have a collection per type where you do your lazy loading. A MongoDB collection named paragraph would have records like this:
"ObjectId(123456789)" : {
"docId": ObjectId("00000001"),
"text" : "Hey. Meet the new Editor. On this page you can see it in action — try to edit this text."
}
This has the additional benefit of faster loading and smaller data transfers when you don't need the entire object: queries from the browser can include an option like "lazy": true, which loads only the paragraph summaries and defers the big content to a later phase. The docId lets you load all dependencies of a document at once.
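The split described above can be sketched in Python. Plain dicts stand in for BSON documents, the summary length is an arbitrary choice, and the id generator is a placeholder; nothing here is a MongoDB API.

```python
# Sketch of the split: large paragraph payloads move to a per-type
# collection and only a short summary plus a dataId reference stays inline.

SUMMARY_LEN = 40  # arbitrary cutoff for this illustration

def split_document(doc, next_id):
    """Return (slim_doc, side_docs): big block data moved out-of-line."""
    slim_blocks, side_docs = [], []
    for block in doc["blocks"]:
        text = block["data"].get("text", "")
        if block["type"] == "paragraph" and len(text) > SUMMARY_LEN:
            data_id = next_id()
            slim_blocks.append({
                "type": block["type"],
                "summary": text[:SUMMARY_LEN] + "...",
                "dataId": data_id,
            })
            side_docs.append({"_id": data_id, "text": text})
        else:
            slim_blocks.append(block)  # small blocks stay inline
    return {**doc, "blocks": slim_blocks}, side_docs

counter = iter(range(1, 1000))
doc = {"time": 1610826755415, "blocks": [
    {"type": "header", "data": {"text": "Editor.js", "level": 2}},
    {"type": "paragraph", "data": {"text": "x" * 100}},
]}
slim, side = split_document(doc, lambda: next(counter))
```

The slim document stays far from the 16MB cap, while `side` holds what would be inserted into the per-type collection for lazy loading.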

How to prevent Morphia from storing empty string in MongoDb

I am trying to save an object to MongoDB using Morphia, and the object contains fields whose value is an empty string. I don't want those empty strings to be saved in MongoDB.
For example (JSON below), I don't want fields like "addressLine2" and "postalCd2" to be saved in Mongo:
{
  "_id" : ObjectId("5cf8d100fe85543cdc1e3183"),
  "accountNbr" : "test Acct",
  "effectiveDt" : "2019-02-19",
  "entryDt" : "2019-06-06",
  "expirationDt" : "2020-02-19",
  "insuredMailAddress" : {
    "stateCd" : "TestCd",
    "cityNm" : "testCity",
    "addressLine1" : "Test address Line1",
    "addressLine2" : "",
    "postalCd2" : ""
  },
  "streamLineRenewInd" : {
    "code" : " "
  }
}
Is there a way to achieve this?
Morphia does not currently support such a feature. You can, however, filter out the nulls: you'd just need to make sure your application stores a null instead of "".
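Morphia is a Java library, but the pre-save cleanup the answer suggests is plain data massaging and can be sketched in any language. A minimal recursive filter, with Python dicts standing in for the object graph (this is not a Morphia API; whether whitespace-only strings like "code": " " should also be dropped is an assumption here):

```python
# Sketch of the pre-save cleanup: recursively drop keys whose value is an
# empty (or whitespace-only) string so they never reach the database.
# Mirrors nulling such fields before handing the object to Morphia.

def strip_empty_strings(value):
    if isinstance(value, dict):
        cleaned = {k: strip_empty_strings(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items()
                if not (isinstance(v, str) and v.strip() == "")}
    if isinstance(value, list):
        return [strip_empty_strings(v) for v in value]
    return value

doc = {"accountNbr": "test Acct",
       "insuredMailAddress": {"addressLine1": "Test address Line1",
                              "addressLine2": "", "postalCd2": ""},
       "streamLineRenewInd": {"code": " "}}
clean = strip_empty_strings(doc)
```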

Auto Increment a field value every time a doc is inserted in elastic search

I have a requirement to generate a unique number (ARN) in this format:
DD/MM/YYYY/1, DD/MM/YYYY/2
and insert these into an Elasticsearch index.
The approach I am thinking of is to keep an auto-increment field in the doc, use it to generate a new entry, and use the newly generated number to build the ARN and update the doc.
Doc structure I am planning to use:
{ "id": 1, "arn": "17/03/2018/01" }
something like this.
How can I get an auto-increment field in Elasticsearch?
It can't be done in a single step. First you have to insert the record into the database, and then update the ARN with its id.
There is no auto-increment equivalent to, for example, a Hibernate id generator. You could use the Bulk API (if you have to save multiple documents at a time) and increase the _id and the trailing number of your ARN value programmatically.
Note: if you want to treat your id as a number, you should implement it yourself. In this example I added a new field "my_id", because the _id of a document is treated as a string.
POST /_bulk
{ "index" : { "_index" : "your_index", "_type" : "your_type", "_id" : "1" } }
{ "arn" : "2018/03/17/1", "my_id" : 1 }
{ "index" : { "_index" : "your_index", "_type" : "your_type", "_id" : "2" } }
{ "arn" : "2018/03/17/2", "my_id" : 2 }
Then, the next time you want to save new documents, query for the maximum id, something like:
POST /my_index/my_type/_search?size=1
{
  "_source": ["my_id"],
  "sort": [
    { "my_id": { "order": "desc" } }
  ]
}
If your only requirement is that the ARN be unique, you could also let Elasticsearch generate the _id by simply not setting it. Then you could rely on some unique token generator (UUID.randomUUID().toString() if you work with Java). Pseudo code follows:
String uuid = generateUUID(); // depends on the programming language
String payload = "{ \"arn\" : \"" + uuid + "\" }"; // concatenate the payload
String url = "http://localhost:9200/my_index"; // your target index
executePost(url, payload); // implement the call with some http client library
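The DD/MM/YYYY/N scheme itself is simple to generate once you have a per-day counter. A minimal sketch, with an in-memory dict standing in for wherever the counter is actually persisted (an Elasticsearch doc, a database row); the function name and storage are assumptions, not part of any ES API:

```python
# Sketch of the ARN scheme from the question: a per-day counter appended
# to the date. The _counters dict is a stand-in for durable storage.

from datetime import date

_counters = {}  # day string -> last issued number

def next_arn(today=None):
    day = (today or date.today()).strftime("%d/%m/%Y")
    _counters[day] = _counters.get(day, 0) + 1
    return f"{day}/{_counters[day]}"

print(next_arn(date(2018, 3, 17)))  # 17/03/2018/1
print(next_arn(date(2018, 3, 17)))  # 17/03/2018/2
```

Note that an in-memory counter is not safe across processes; in production the increment would need an atomic operation in whatever store holds it.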

How do I sort the search results according to the number of items in ElasticSearch?

Let's say that I store documents like this in ElasticSearch:
{
  'name': 'user name',
  'age': 43,
  'location': 'CA, USA',
  'bio': 'into java, scala, python ..etc.',
  'tags': ['java', 'scala', 'python', 'django', 'lift']
}
And let's say that I search using location=CA; how can I sort the results according to the number of items in 'tags'?
I would like to list the people with the most tags on the first page.
You can do it by indexing an additional field that contains the number of tags, on which you can then easily sort your results. Otherwise, if you are willing to pay a small performance cost at query time, there's a nice solution that doesn't require reindexing your data: you can sort based on a script like this:
{
  "query" : {
    "match_all" : {}
  },
  "sort" : {
    "_script" : {
      "script" : "doc['tags'].values.length",
      "type" : "number",
      "order" : "desc"
    }
  }
}
As you can read from the script based sorting section:
Note, it is recommended, for single custom based script based sorting,
to use custom_score query instead as sorting based on score is faster.
That means that it'd be better to use a custom score query to influence your score, and then sort by score, like this:
{
  "query" : {
    "custom_score" : {
      "query" : {
        "match_all" : {}
      },
      "script" : "_score * doc['tags'].values.length"
    }
  }
}
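What the script sort computes can be illustrated locally: order the matching documents by the length of their tags array, most tags first. A toy data set (made up for this sketch):

```python
# Local illustration of the script sort: rank documents by len(tags),
# descending, which is what "doc['tags'].values.length" with order "desc"
# does server-side.

docs = [
    {"name": "a", "location": "CA, USA", "tags": ["java", "scala"]},
    {"name": "b", "location": "CA, USA",
     "tags": ["java", "scala", "python", "django", "lift"]},
    {"name": "c", "location": "CA, USA", "tags": ["python"]},
]

ranked = sorted(docs, key=lambda d: len(d["tags"]), reverse=True)
print([d["name"] for d in ranked])  # ['b', 'a', 'c']
```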

Best way to do one-to-many "JOIN" in CouchDB

I am looking for a CouchDB equivalent to "SQL joins".
In my example there are CouchDB documents that are list elements:
{ "type" : "el", "id" : "1", "content" : "first" }
{ "type" : "el", "id" : "2", "content" : "second" }
{ "type" : "el", "id" : "3", "content" : "third" }
There is one document that defines the list:
{ "type" : "list", "elements" : ["2","1"] , "id" : "abc123" }
As you can see the third element was deleted, it is no longer part of the list. So it must not be part of the result. Now I want a view that returns the content elements including the right order.
The result could be:
{ "content" : ["second", "first"] }
In this case the order of the elements is already as it should be. Another possible result:
{ "content" : [{"content" : "first", "order" : 2},{"content" : "second", "order" : 1}] }
I started writing the map function:
map = function (doc) {
  if (doc.type === 'el') {
    emit(doc.id, {"content" : doc.content}); // emit the id and the content
    return;
  }
  if (doc.type === 'list') {
    for (var i = 0, l = doc.elements.length; i < l; ++i) {
      emit(doc.elements[i], { "order" : i }); // emit the id and the order
    }
  }
}
This is as far as I can get. Can you correct my mistakes and write a reduce function? Remember that the third document must not be part of the result.
Of course you can also write a different map function. But the structure of the documents (one defining list document and an entry document for each entry) cannot be changed.
EDIT: Do not miss JasonSmith's comment on his answer, where he describes how to do this more concisely.
Thank you! This is a great example to show off CouchDB 0.11's new
features!
You must use the fetch-related-data feature to reference documents
in the view. Optionally, for more convenient JSON, use a _list function to
clean up the results. See Couchio's writeup on "JOIN"s for details.
Here is the plan:
Firstly, you have a uniqueness contstraint on your el documents. If two of
them have id=2, that's a problem. It is necessary to use
the _id field instead if id. CouchDB will guarantee uniqueness, but also,
the rest of this plan requires _id in order to fetch documents by ID.
{ "type" : "el", "_id" : "1", "content" : "first" }
{ "type" : "el", "_id" : "2", "content" : "second" }
{ "type" : "el", "_id" : "3", "content" : "third" }
If changing the documents to use _id is absolutely impossible, you can
create a simple view to emit(doc.id, doc) and then re-insert that into a
temporary database. This converts id to _id but adds some complexity.
The view emits {"_id": content_id} data keyed on
[list_id, sort_number], to "clump" the lists with their content.
function(doc) {
  if(doc.type == 'list') {
    for (var i in doc.elements) {
      // Link to the el document's id.
      var id = doc.elements[i];
      emit([doc.id, i], {'_id': id});
    }
  }
}
Now there is a simple list of el documents, in the correct order. You can
use startkey and endkey if you want to see only a particular list.
curl localhost:5984/x/_design/myapp/_view/els
{"total_rows":2,"offset":0,"rows":[
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","0"],"value":{"_id":"2"}},
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","1"],"value":{"_id":"1"}}
]}
To get the el content, query with include_docs=true. Through the magic of
_id, the el documents will load.
curl localhost:5984/x/_design/myapp/_view/els?include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","0"],"value":{"_id":"2"},"doc":{"_id":"2","_rev":"1-4530dc6946d78f1e97f56568de5a85d9","type":"el","content":"second"}},
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","1"],"value":{"_id":"1"},"doc":{"_id":"1","_rev":"1-852badd683f22ad4705ed9fcdea5b814","type":"el","content":"first"}}
]}
Notice, this is already all the information you need. If your client is
flexible, you can parse the information out of this JSON. The next optional
step simply reformats it to match what you need.
Use a _list function, which simply reformats the view output. People use them to output XML or HTML; here we will just make the JSON more convenient.
function(head, req) {
  var headers = {'Content-Type': 'application/json'};
  var result;
  if(req.query.include_docs != 'true') {
    start({'code': 400, headers: headers});
    result = {'error': 'I require include_docs=true'};
  } else {
    start({'headers': headers});
    result = {'content': []};
    while(row = getRow()) {
      result.content.push(row.doc.content);
    }
  }
  send(JSON.stringify(result));
}
The results match. Of course in production you will need startkey and endkey to specify the list you want.
curl -g 'localhost:5984/x/_design/myapp/_list/pretty/els?include_docs=true&startkey=["abc123",""]&endkey=["abc123",{}]'
{"content":["second","first"]}
