We have a catalog of products stored in ElasticSearch.
Each document looks like this:
{
  "family": "products family",
  "category": "products category",
  "name": "product name",
  "description": "product description"
}
We are trying to build a query that fuzzy-matches a search term and scores the results by the following order of fields:
family
category
name
description
Is there a way to do it?
A simple approach would be to use a multi_match query, giving each field an appropriate boost.
{
  "query": {
    "multi_match": {
      "query": "produce",
      "fields": ["family^4", "category^3", "name^2", "description"],
      "fuzziness": "AUTO",
      "rewrite": "constant_score_auto"
    }
  }
}
With constant_score_auto, all documents that match on the same field get the same score.
You can change this behavior by tweaking the rewrite parameter.
This article gives further insight.
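For example, switching the rewrite method to "scoring_boolean" lets each fuzzy-expanded term keep its own relevance score instead of a constant one. A sketch (depending on your Elasticsearch version, the knob for the fuzzy expansion may be spelled "fuzzy_rewrite" instead):
{
  "query": {
    "multi_match": {
      "query": "produce",
      "fields": ["family^4", "category^3", "name^2", "description"],
      "fuzziness": "AUTO",
      "rewrite": "scoring_boolean"
    }
  }
}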
Related
I am trying to create a CouchDB Mango query with an index, in the hope that the query runs faster. At the moment I have the following Mango query, which returns what I am looking for but is slow. Therefore, I assume, I need to create an index to make it faster. I need help figuring out how to create that index.
selector: {
  categoryIds: {
    $in: categoryIds,
  },
},
sort: [{ publicationDate: 'desc' }],
You can assume that my documents are, let's say, news articles from different categories. Each document therefore has a field containing one or more categories that the news article belongs to, stored as an array of categoryIds. My query needs to be optimized for queries like "give me all news that have categoryId1 in their array of categoryIds, sorted by publicationDate". What I don't know is:
1. How to define an index
2. What that index should be
3. How to use that index in the "use_index" field of the Mango query
Any help is appreciated.
Update after Alexis Côté's answer:
If I define the index like this:
{
  "_id": "_design/0f11ca4ef1ea06de05b31e6bd8265916c1bbe821",
  "_rev": "6-adce50034e870aa02dc7e1e075c78361",
  "language": "query",
  "views": {
    "categoryIds-json-index": {
      "map": {
        "fields": {
          "categoryIds": "asc"
        },
        "partial_filter_selector": {}
      },
      "reduce": "_count",
      "options": {
        "def": {
          "fields": [
            "categoryIds"
          ]
        }
      }
    }
  }
}
And run the Mango Query like this:
{
  "selector": {
    "categoryIds": {
      "$in": [
        "e0bd5f97ac35bdf6893351337d269230"
      ]
    }
  },
  "use_index": "categoryIds-json-index"
}
It still returns the results, but they are not sorted by publicationDate in the order I want. So I am not clear on what you are suggesting the solution is.
You can create an index as documented here.
In your case, you will need an index on the "categoryIds" field.
You can specify the index using "use_index": "_design/<name>".
Note: the query planner should automatically pick this index if it is compatible.
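For example, such an index can be created through CouchDB's _index endpoint (a sketch; the database name news and the index name are placeholders). Since your query also sorts by publicationDate, including that field in the index is likely what you want:
POST /news/_index
{
  "index": {
    "fields": ["categoryIds", "publicationDate"]
  },
  "name": "categoryIds-publicationDate-json-index",
  "type": "json"
}
The query can then reference it with "use_index": "categoryIds-publicationDate-json-index".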
I'm rather new to Elasticsearch, so I'm coming here hoping to find advice.
I have two indices in elastic from two different csv files.
The index_1 has this mapping:
{
  "settings": {
    "number_of_shards": 3
  },
  "mappings": {
    "properties": {
      "place": { "type": "keyword" },
      "address": { "type": "keyword" }
    }
  }
}
The file contains about 400,000 documents.
index_2, built from a much smaller file (about 50 documents), has this mapping:
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "place": { "type": "text" },
      "address": { "type": "keyword" }
    }
  }
}
The field "place" in index_2 is all of the unique values from the field "place" in index_1.
In both indices the "address" fields are postcodes of datatype keyword with a structure: 0000AZ.
Based on the "place" field keyword in index_1 I want to assign the term of field "address" from index_2.
I have tried using the pandas library but the index_1 file is too large. I have also to tried creating modules based off pandas and elasticsearch, quite unsuccessfully. Although I believe this is a promising direction. A good solution would be to stay into the elasticsearch library as much as possible as these indices will be later be used for further analysis.
If I understand correctly, it sounds like you want to use updateByQuery.
The request body should look a little like this:
{
  "query": { "term": { "place": "placeToMatch" } },
  "script": { "source": "ctx._source.address = 'updatedZipCode'" }
}
This will update the address field of all documents with the matched place.
EDIT:
So what we want to do is use updateByQuery while iterating over all the documents in index_2.
First step: get all the documents from index_2. We'll just do this using the basic search feature:
{
  index: 'index2',
  size: 100, // fetch all documents; once size is over 10,000 you'll have to paginate
  body: { query: { match_all: {} } }
}
Now we iterate over all the results and send an updateByQuery request for each one:
// pseudocode
doc = response[i]
// update-by-query request for this doc
{
  index: 'index1',
  body: {
    query: { term: { place: doc._source.place } },
    script: `ctx._source.address = "${doc._source.address}"`
  }
}
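Putting the whole loop together in Python (a sketch, assuming the official elasticsearch client and the index and field names above; pagination and error handling are omitted):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch every (place, address) pair from the small index.
resp = es.search(index="index_2", size=100, body={"query": {"match_all": {}}})

# For each pair, rewrite the address on all matching index_1 documents.
for hit in resp["hits"]["hits"]:
    place = hit["_source"]["place"]
    address = hit["_source"]["address"]
    es.update_by_query(
        index="index_1",
        body={
            "query": {"term": {"place": place}},
            "script": {
                "source": "ctx._source.address = params.address",
                "params": {"address": address},
            },
        },
    )
Passing the new value through params keeps the Painless script constant instead of rebuilding it per document, which also avoids quoting problems.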
In ArangoDB I have a lookup table as per below:
{
  "49DD3A82-2B49-44F5-A0B2-BD88A32EDB13": "Human readable value 1",
  "B015E210-27BE-4AA7-83EE-9F754F8E469A": "Human readable value 2",
  "BC54CF8A-BB18-4E2C-B333-EA7086764819": "Human readable value 3",
  "8DE15947-E49B-4FDC-89EE-235A330B7FEB": "Human readable value n"
}
I have documents in a separate collection, such as this one, which have non-human-readable attribute and value pairs, as per "details" below:
{
  "ptype": {
    "name": "BC54CF8A-BB18-4E2C-B333-EA7086764819",
    "accuracy": 9.6,
    "details": {
      "49DD3A82-2B49-44F5-A0B2-BD88A32EDB13": "B015E210-27BE-4AA7-83EE-9F754F8E469A",
      "8DE15947-E49B-4FDC-89EE-235A330B7FEB": true
    }
  }
}
I need to update the above document by looking up the human-readable values in the lookup table, and I also need to replace the non-human-readable attribute names with the readable names, also found in the lookup table.
The result should look like this:
{
  "ptype": {
    "name": "Human readable value 3",
    "accuracy": 9.6,
    "details": {
      "Human readable value 1": "Human readable value 2",
      "Human readable value n": true
    }
  }
}
so ptype.name and ptype.details are updated with values from the lookup table.
This query should help you see how a LUT (Look-Up Table) can be used.
One cool feature of AQL is that you can run a LUT query and assign its result to a variable with the LET command, and then access the contents of that LUT later.
See if this example helps:
LET lut = {
  'aaa': 'Apples',
  'bbb': 'Bananas',
  'ccc': 'Carrots'
}

LET garden = [
  {
    'size': 'Large',
    'plant_code': 'aaa'
  },
  {
    'size': 'Medium',
    'plant_code': 'bbb'
  },
  {
    'size': 'Small',
    'plant_code': 'ccc'
  }
]

FOR doc IN garden
  RETURN {
    'size': doc.size,
    'vegetable': lut[doc.plant_code]
  }
The result of this query is:
[
  {
    "size": "Large",
    "vegetable": "Apples"
  },
  {
    "size": "Medium",
    "vegetable": "Bananas"
  },
  {
    "size": "Small",
    "vegetable": "Carrots"
  }
]
You'll notice that the bottom query, the one that actually returns data, refers to the LUT using doc.plant_code as the look-up key.
This is much more performant than running subqueries there, because if you had 100,000 garden documents you wouldn't want to run a supporting query 100,000 times just to work out the name behind each plant_code.
If you wanted to confirm that you could find a value in the LUT, you could optionally have your final query in this format:
FOR doc IN garden
  RETURN {
    'size': doc.size,
    'vegetable': (lut[doc.plant_code] ? lut[doc.plant_code] : 'Unknown')
  }
This variant uses an inline if/then/else: if the value is not found in the LUT, it returns 'Unknown' instead.
Hope this helps you with your particular use case.
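Applying the same idea to your documents, an AQL update could translate both the attribute names and the values in one pass. A sketch, assuming your documents live in a hypothetical collection named readings (untested against your data):
LET lut = {
  '49DD3A82-2B49-44F5-A0B2-BD88A32EDB13': 'Human readable value 1',
  'B015E210-27BE-4AA7-83EE-9F754F8E469A': 'Human readable value 2',
  'BC54CF8A-BB18-4E2C-B333-EA7086764819': 'Human readable value 3',
  '8DE15947-E49B-4FDC-89EE-235A330B7FEB': 'Human readable value n'
}
FOR doc IN readings
  /* rebuild details, translating each key and each string value via the LUT */
  LET newDetails = MERGE(
    FOR key IN ATTRIBUTES(doc.ptype.details)
      LET val = doc.ptype.details[key]
      RETURN {
        [ lut[key] != null ? lut[key] : key ]:
          (IS_STRING(val) AND lut[val] != null) ? lut[val] : val
      }
  )
  UPDATE doc WITH {
    ptype: MERGE(doc.ptype, {
      name: lut[doc.ptype.name] != null ? lut[doc.ptype.name] : doc.ptype.name,
      details: newDetails
    })
  } IN readings OPTIONS { mergeObjects: false }
OPTIONS { mergeObjects: false } makes the rebuilt details object replace the old one rather than being merged with it, which is why accuracy is carried over explicitly with MERGE(doc.ptype, ...).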
I have following type of documents:
{
  "_id": "0710b1dd6cc2cdc9c2ffa099c8000f7b",
  "_rev": "1-93687d40f54ff6ca72e66ca7fc99caff",
  "date": "2018-06-04T07:46:08.848Z",
  "topic": "some topic"
}
The collection is not very large, only 20k documents.
However, the following query is very slow; it takes about 5 seconds!
{
  selector: {
    topic: 'some topic'
  },
  sort: ['date'],
}
I tried various indexes, e.g.
index: {
  fields: ['topic', 'date']
}
but nothing really worked well.
What am I missing here?
When sorting in a Mango query, you need to ensure that the sort order you are asking for matches the index that you are using.
If you index the data set in topic,date order, then you can use the following query on "topic" to get the data out in date order using the index:
{
  "selector": {
    "topic": "some topic"
  },
  "sort": [
    "topic",
    "date"
  ]
}
Because the sort matches the form of the data in the index, the index is used to answer the query, which should speed up your query time considerably.
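For reference, a matching index can be created up front via CouchDB's _index endpoint (a sketch; the database name mydb and the index name are placeholders):
POST /mydb/_index
{
  "index": {
    "fields": ["topic", "date"]
  },
  "name": "topic-date-index",
  "type": "json"
}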
I use Elasticsearch for news article search. If I search for "Vladimir Putin", it works, because he is in the news a lot and neither "Vladimir" nor "Putin" is common on its own. But if I search for "Raja Ram", it does not work: I have a few articles on "Raja Ram", but also some on "Raja Mohanty" and "Ram Srivastava", and those articles rank higher than the ones quoting "Raja Ram". Is there something wrong in my tokenizer or search functions?
es.indices.create(
    index="article-index",
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_ngram_analyzer": {
                        "tokenizer": "my_ngram_tokenizer"
                    }
                },
                "tokenizer": {
                    "my_ngram_tokenizer": {
                        "type": "nGram",
                        "min_gram": "1",
                        "max_gram": "50"
                    }
                }
            }
        }
    },
    # ignore an already existing index
    ignore=400
)
res = es.search(index="article-index", fields="url", body={"query": {"query_string": {"query": keywordstr, "fields": ["text", "title", "tags", "domain"]}}})
You can use the match_phrase query in Elasticsearch.
You can't specify multiple fields in match_phrase, though; use the _all field instead.
Your query would be:
res = es.search(index="article-index", fields="url", body={"query": {"match_phrase": {"_all": keywordstr}}})
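If the _all field is disabled in your Elasticsearch version, or you want to keep searching specific fields, a multi_match query with "type": "phrase" may work as well. A sketch, assuming the same client, index, and field names as above:
res = es.search(
    index="article-index",
    body={
        "query": {
            "multi_match": {
                "query": keywordstr,
                "type": "phrase",
                "fields": ["text", "title", "tags", "domain"],
            }
        }
    },
)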