Elasticsearch exact phrase matching

I am new to ES. I am having trouble finding exact phrase matches.
Let's assume my index has a field called movie_name.
Let's assume I have 3 documents with the following values
movie_name = Mad Max
movie_name = mad max
movie_name = mad max 3d
If my search query is Mad Max, I want the first 2 documents to be returned but not the 3rd.
If I use the "not_analyzed" solution, I get only document 1 but not document 2.
What am I missing?

I was able to do it with the following commands: create a custom analyzer that uses the keyword tokenizer to prevent tokenization, plus a lowercase filter. Then use that analyzer in the "mappings" for the desired field, in this case "movie_name".
PUT /movie
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "keylower": {
            "tokenizer": "keyword",
            "filter": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "search": {
      "properties": {
        "movie_name": { "type": "string", "analyzer": "keylower" }
      }
    }
  }
}

Then use phrase matching like this:
{
  "query": {
    "match_phrase": {
      "movie_name": "mad max"
    }
  }
}
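For completeness, here's a quick way to sanity-check the mapping (a sketch; the index and type names follow the mapping above, only the document IDs are arbitrary). Because the keyword tokenizer emits the whole field as a single lowercased token, "Mad Max" and "mad max" match, while "mad max 3d" does not:

PUT /movie/search/1
{ "movie_name": "Mad Max" }

PUT /movie/search/2
{ "movie_name": "mad max" }

PUT /movie/search/3
{ "movie_name": "mad max 3d" }

GET /movie/search/_search
{
  "query": {
    "match_phrase": {
      "movie_name": "Mad Max"
    }
  }
}

Only documents 1 and 2 should come back.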

Related

Auto-increment a field value every time a doc is inserted in Elasticsearch

I have a requirement to generate a unique number (ARN) in this format
DD/MM/YYYY/1, DD/MM/YYYY/2
and insert these into an Elasticsearch index.
The approach I am thinking of is to create an auto-increment field in the doc, use it to generate a new entry, and use the newly generated number to create the ARN and update the doc.
The doc structure I am planning to use:
{ id: 1, arn: "17/03/2018/01" }
something like this.
How can I get an auto-increment field in Elasticsearch?
It can't be done in a single step. First you have to insert the record into the database, and then update the ARN with its id.
There is no auto-increment equivalent to, for example, Hibernate's id generator. You could use the Bulk API (if you have to save multiple documents at a time) and increase the _id and the ending of your ARN value programmatically.
Note: if you want to treat your id as a number, you should implement it yourself (in this example, I added a new field "my_id", because the _id of a document is treated as a string).
POST /_bulk
{ "index" : { "_index" : "your_index", "_type" : "your_type", "_id" : "1" } }
{ "arn" : "2018/03/17/1", "my_id" : 1 }
{ "index" : { "_index" : "your_index", "_type" : "your_type", "_id" : "2" } }
{ "arn" : "2018/03/17/2", "my_id" : 2 }
Then, the next time you want to save new documents, you query for the maximum id with something like this (note that fields and sort go at the top level of the search body, not inside query):
POST /my_index/my_type/_search?size=1
{
  "fields": ["my_id"],
  "sort": [
    { "my_id": { "order": "desc" } }
  ]
}
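A minimal client-side sketch of that read-increment-write flow, assuming the index, type, and my_id field from the examples above (the HTTP call uses plain fetch; adapt it to your client library). Note that two concurrent writers can read the same maximum, so this is not safe without some external coordination:

// Sketch: read the current maximum my_id, then build the next ARN from it.
async function nextArn() {
  const res = await fetch('http://localhost:9200/my_index/my_type/_search?size=1', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      fields: ['my_id'],
      sort: [{ my_id: { order: 'desc' } }]
    })
  });
  const body = await res.json();
  const hits = body.hits.hits;
  const maxId = hits.length > 0 ? hits[0].fields.my_id[0] : 0; // 0 if the index is empty
  const nextId = maxId + 1;
  const now = new Date();
  const dd = String(now.getDate()).padStart(2, '0');
  const mm = String(now.getMonth() + 1).padStart(2, '0');
  return { my_id: nextId, arn: dd + '/' + mm + '/' + now.getFullYear() + '/' + nextId };
}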
If your only requirement is that the ARN should be unique, you could also let Elasticsearch generate the _id by simply not setting it. Then you could rely on some unique token generator (UUID.randomUUID().toString() if you work with Java). Pseudo code follows:
String uuid = generateUUID();                            // depends on the programming language
String payload = "{ \"arn\": \"" + uuid + "\" }";        // concatenate the payload
String url = "http://localhost:9200/my_index/my_type";   // your target index and type
executePost(url, payload);                               // implement the call with some http client library
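For example, POSTing to the type endpoint without an _id lets Elasticsearch assign one (index and type names as above; the arn value is a placeholder for whatever your generator produces):

POST /my_index/my_type
{ "arn" : "<uuid-from-your-generator>" }

The response contains the auto-generated _id of the new document.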

Remove duplicate array objects in MongoDB

I have an array containing items that are duplicated in BOTH fields. Is there a way to remove one of the duplicate array items?
{
  userName: "abc",
  _id: 10239201141,
  rounds: [
    {
      "roundId": "foo",
      "money": "123"
    }, // Keep one of these
    { // Keep one of these
      "roundId": "foo",
      "money": "123"
    },
    {
      "roundId": "foo",
      "money": "321" // Not a duplicate.
    }
  ]
}
I'd like to remove one of the first two, and keep the third because the id and money are not duplicated in the array.
Thank you in advance!
Edit: I found
db.users.ensureIndex({'rounds.roundId':1, 'rounds.money':1}, {unique:true, dropDups:true})
This doesn't help me. Can someone help me? I spent hours trying to figure this out.
The thing is, I ran my node.js website on two machines, so it was pushing the same data twice. Knowing this, the duplicate data should be one index away. I made a simple for loop that can detect duplicate data in my situation; how could I implement this with MongoDB so that it removes the array object at that array index?
for (var i in data) {
  var tempRounds = data[i]['rounds'];
  for (var ii in data[i]['rounds']) {
    // Compare each item to the one before it.
    var currentArrayItem = data[i]['rounds'][ii];
    if (tempRounds[ii - 1]) {
      if (currentArrayItem.roundId == tempRounds[ii - 1].roundId &&
          currentArrayItem.money == tempRounds[ii - 1].money) {
        console.log("Found a match");
      }
    }
  }
}
Use the aggregation framework to compute a deduplicated version of each document (the examples below use a field called stats; substitute rounds for your documents):
db.test.aggregate([
  { "$unwind" : "$stats" },
  { "$group" : { "_id" : "$_id", "stats" : { "$addToSet" : "$stats" } } }, // use $first to add in other document fields here
  { "$out" : "some_other_collection_name" }
])
Use $out to put the results in another collection, since the aggregation cannot update documents in place. You can then use db.collection.renameCollection with dropTarget to replace the old collection with the new, deduplicated one. Be sure you're doing the right thing before you scrap the old data, though.
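A minimal sketch of that swap, using the collection names from the pipeline above:

// Replace the original collection with the deduplicated output.
// The second argument is dropTarget: "test" is dropped before the rename.
db.some_other_collection_name.renameCollection("test", true)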
Warnings:
1: This does not preserve the order of elements in the stats array. If you need to preserve order, you will have to retrieve each document from the database, deduplicate the array manually client-side, then update the document in the database.
2: The following two objects won't be considered duplicates of each other:
{ "id" : "foo", "price" : 123 }
{ "price" : 123, "id" : foo" }
If you think you have mixed key orders, use a $project to enforce a key order between the $unwind stage and the $group stage:
{ "$project" : { "stats" : { "id_" : "$stats.id", "price_" : "$stats.price" } } }
Make sure to change id -> id_ and price -> price_ in the rest of the pipeline, and rename them back to id and price at the end (or rename them in another $project after the swap). I discovered that if you do not give the fields different names in the $project, it doesn't reorder them, even though key order is meaningful in an object in MongoDB:
> db.test.drop()
> db.test.insert({ "a" : { "x" : 1, "y" : 2 } })
> db.test.aggregate([
{ "$project" : { "_id" : 0, "a" : { "y" : "$a.y", "x" : "$a.x" } } }
])
{ "a" : { "x" : 1, "y" : 2 } }
> db.test.aggregate([
{ "$project" : { "_id" : 0, "a" : { "y_" : "$a.y", "x_" : "$a.x" } } }
])
{ "a" : { "y_" : 2, "x_" : 1 } }
Since the key order is meaningful, I'd consider this a bug, but it's easy to work around.

Elasticsearch: having "not_analyzed" and "analyzed" together

I'm new to Elasticsearch. My business requires partial matching on searchable fields, so I ended up with wildcard queries. My query is like this:
{
  "query" : {
    "wildcard" : { "title" : "*search_text_here*" }
  }
}
Suppose I'm searching for Red Flowers. Before the above query I was using an analyzed match query, which returned results for Red and for Flowers alone; but now my query only works when both Red and Flowers are present together.
Use a match_phrase query as shown below; for more information, refer to the ES docs:
GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "red flowers"
    }
  }
}
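As for the "not_analyzed and analyzed together" part of the question title: one common approach (a sketch, not part of the answer above; index and field names are illustrative) is a multi-field mapping that keeps an analyzed and a not_analyzed copy of the same string, so you can run match queries against title and exact filters against title.raw:

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}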

Searching for a term in subdocuments with Elasticsearch

I have an index document structure like the one below:
{
  "term": "some term",
  "inlang": "some lang",
  "translations": [
    {
      "translation": "some translation",
      "outlang": "some lang",
      "translations": [
        {
          "translation": "some translation 1",
          "outlang": "some lang 1",
          "translations": [...]
        }
      ]
    },
    ...
  ]
}
I want to find a translation in such documents. However, this translation can exist at any level of the document. Is it possible to search the term dynamically with Elasticsearch?
For example:
{
  "query": {
    "*.translation": "searchterm"
  }
}
Thanks in advance
I have managed to do that with the following query:
{
  "query": {
    "query_string": {
      "query": "someterm",
      "fields": ["*.translation"]
    }
  }
}
or
{
  "query": {
    "multi_match": {
      "query": "someterm",
      "fields": ["*.translation"]
    }
  }
}
You can see the Elasticsearch Google group conversation here.
No, I do not believe this functionality is built into ElasticSearch at the moment. This answer suggests you could build the functionality with a script, but it would be super slow.
In general, ES doesn't play nicely with nested data. It supports nested fields, but much of the more advanced search functionality isn't capable of operating on complex nested data. My suggestion is to denormalize your data so that every translation is represented by a single document in the index, and link between them with ID numbers, as sketched below.
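A sketch of what that denormalization could look like (the parent_id field is an illustrative assumption, not an ES feature): each translation becomes its own document, and the former nesting chain is preserved via IDs.

{ "id": 1, "parent_id": null, "term": "some term", "inlang": "some lang" }
{ "id": 2, "parent_id": 1, "translation": "some translation", "outlang": "some lang" }
{ "id": 3, "parent_id": 2, "translation": "some translation 1", "outlang": "some lang 1" }

A plain match query on translation then finds hits at any former nesting depth, and parent_id lets you walk back up the chain.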

How do I sort the search results according to the number of items in ElasticSearch?

Let's say that I store documents like this in ElasticSearch:
{
  "name": "user name",
  "age": 43,
  "location": "CA, USA",
  "bio": "into java, scala, python ..etc.",
  "tags": ["java", "scala", "python", "django", "lift"]
}
And let's say that I search using location=CA; how can I sort the results according to the number of items in "tags"?
I would like to list the people with the most tags on the first page.
You can do it by indexing an additional field that contains the number of tags, on which you can then easily sort your results (see the sketch at the end of this answer). Otherwise, if you are willing to pay a small performance cost at query time, there's a nice solution that doesn't require reindexing your data: you can sort based on a script like this:
{
  "query" : {
    "match_all" : {}
  },
  "sort" : {
    "_script" : {
      "script" : "doc['tags'].values.length",
      "type" : "number",
      "order" : "desc"
    }
  }
}
As you can read in the script-based sorting section of the docs:
Note, it is recommended, for single custom based script based sorting,
to use custom_score query instead as sorting based on score is faster.
That means it'd be better to use a custom_score query to influence your score, and then sort by score, like this:
{
  "query" : {
    "custom_score" : {
      "query" : {
        "match_all" : {}
      },
      "script" : "_score * doc['tags'].values.length"
    }
  }
}
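And for the first option mentioned at the top of this answer, a minimal sketch of indexing a precomputed count and sorting on it (the index, type, and tags_count field names are illustrative):

PUT /users/user/1
{
  "name": "user name",
  "location": "CA, USA",
  "tags": ["java", "scala", "python", "django", "lift"],
  "tags_count": 5
}

POST /users/user/_search
{
  "query": { "match": { "location": "CA" } },
  "sort": [ { "tags_count": { "order": "desc" } } ]
}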
