Searching a number in a string field with query_string on Elasticsearch - string

Among other text fields, I've got this string field in my Elasticsearch index:
"user": { "type": "string", "analyzer": "simple", "norms": { "enabled": False } }
It gets filled with a typical username, e.g. "simon".
Using query_string I can limit my search results for "other search terms" to this particular user:
'query': { 'query_string': { 'query': 'user:simon other search terms' } }
Default operator is set to "AND". However, in case a username only consists of a number (saved and indexed as string), Elasticsearch appears to ignore the "user:..." statement. For example:
'query': { 'query_string': { 'query': 'user:111 other search terms' } }
yields the same results as
'query': { 'query_string': { 'query': 'other search terms' } }
Any idea what might be the cause or how to fix it?

You are using the simple analyzer. As the documentation says:
An analyzer of type simple that is built using a Lower Case Tokenizer.
The lower case tokenizer behaves like the letter tokenizer combined with the lower case token filter. The problem with your specific test data is that the letter tokenizer splits the text at every non-letter character and discards those characters, and digits are non-letters. A username such as "111" therefore produces no tokens at all, which would explain why the user:111 clause appears to be ignored. Java's Character.isLetter defines exactly what counts as a letter; in contrast, Character.isDigit defines what counts as a digit.
You may want to look at the standard analyzer (built on the standard tokenizer) instead, since it keeps digits as tokens.
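To make the effect concrete, here is a minimal Python sketch that approximates the letter tokenizer followed by lowercasing. It is not Lucene's implementation: the regex `[^\W\d_]+` is only an assumed stand-in for "runs of letters", and `simple_analyze` is a hypothetical helper name.

```python
import re

# Assumed approximation of "letter tokenizer + lowercase":
# [^\W\d_]+ matches runs of Unicode letters and nothing else.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text):
    """Return the lowercase letter-only tokens found in text."""
    return [token.lower() for token in LETTER_RUN.findall(text)]


print(simple_analyze("simon"))       # a normal username yields one token
print(simple_analyze("111"))         # a digits-only username yields no tokens
print(simple_analyze("User42Name"))  # digits act as separators
```

With no tokens indexed for "111", a term lookup on the user field has nothing to match against.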

Related

MongoDB: Searching a text field using mathematical operators

I have documents in a MongoDB as below -
[
{
"_id": "17tegruebfjt73efdci342132",
"name": "Test User1",
"obj": "health=8,type=warrior",
},
{
"_id": "wefewfefh32j3h42kvci342132",
"name": "Test User2",
"obj": "health=6,type=magician",
}
.
.
]
I want to run a query such as health>6 that returns the "Test User1" entry. The obj key is indexed as a text field, so I can run {$text:{$search:"health=8"}} to get an exact match, but I am trying to incorporate mathematical operators into the search.
I am aware of the $gt and $lt operators; however, they cannot be used in this case because health is not a key of the document. The easiest way out would be to make health a key of the document, but I cannot change the document structure due to certain constraints.
Is there any way this can be achieved? I am aware that MongoDB supports running JavaScript code, but I am not sure whether that can help in this case.
I don't think this is possible with a $text search index, but you can transform the conditions in obj into an array of objects using an aggregation query:
$split splits obj by "," and returns an array
$map iterates over the resulting array
$split splits the current condition by "=" and returns an array
$let declares the variable cond to hold the result of that split
$first returns the first element of cond as k, the key of the condition
$last returns the last element of cond as v, the value of the condition
This yields an array of objects holding the conditions as strings:
"objTransform": [
{ "k": "health", "v": "9" },
{ "k": "type", "v": "warrior" }
]
$match matches the key and the value within the same array element using $elemMatch; because v is a string, $gt compares it lexicographically rather than numerically
$unset removes the objTransform array, because it is no longer needed
db.collection.aggregate([
{
$addFields: {
objTransform: {
$map: {
input: { $split: ["$obj", ","] },
in: {
$let: {
vars: {
cond: { $split: ["$$this", "="] }
},
in: {
k: { $first: "$$cond" },
v: { $last: "$$cond" }
}
}
}
}
}
}
},
{
$match: {
objTransform: {
$elemMatch: {
k: "health",
v: { $gt: "8" }
}
}
}
},
{ $unset: "objTransform" }
])
Playground
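One caveat with the pipeline above is that v stays a string, so "10" would not count as greater than "8". The following Python sketch mirrors the split/map transformation outside MongoDB and contrasts numeric with string comparison. The third document and the helper names `parse_obj` and `health_above` are made up for illustration.

```python
def parse_obj(obj):
    """Turn "health=8,type=warrior" into {"health": "8", "type": "warrior"}."""
    pairs = (condition.split("=", 1) for condition in obj.split(","))
    return {key: value for key, value in pairs}


docs = [
    {"name": "Test User1", "obj": "health=8,type=warrior"},
    {"name": "Test User2", "obj": "health=6,type=magician"},
    # Hypothetical extra document with a two-digit health value.
    {"name": "Test User3", "obj": "health=10,type=archer"},
]


def health_above(doc, threshold):
    """Compare health numerically by converting the string value to int."""
    return int(parse_obj(doc["obj"])["health"]) > threshold


numeric = [d["name"] for d in docs if health_above(d, 6)]
# String comparison, as in the aggregation's {$gt: "..."} on a string field.
lexicographic = [d["name"] for d in docs if parse_obj(d["obj"])["health"] > "6"]

print(numeric)
print(lexicographic)
```

The numeric list includes the two-digit value while the string comparison drops it, so a numeric conversion step would be needed if health can exceed 9.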
A second, leaner version of the aggregation query does less work in the transformation step, which helps if you can handle the resulting shape on the client side:
$split splits obj by "," and returns an array
$map iterates over the resulting array
$split splits the current condition by "=" and returns an array
This yields a nested array of string conditions:
"objTransform": [
["type", "warrior"],
["health", "9"]
]
$match matches the key and the value within each inner array using $elemMatch: "0" matches the first position of the inner array and "1" matches the second; as before, the value is compared as a string
$unset removes the objTransform array, because it is no longer needed
db.collection.aggregate([
{
$addFields: {
objTransform: {
$map: {
input: { $split: ["$obj", ","] },
in: { $split: ["$$this", "="] }
}
}
}
},
{
$match: {
objTransform: {
$elemMatch: {
"0": "health",
"1": { $gt: "8" }
}
}
}
},
{ $unset: "objTransform" }
])
Playground
Using JavaScript is one way of doing what you want. Below is a find that uses the index on obj by matching documents that contain the text health= followed by an integer (if you want, you can anchor that with ^ in the regex).
It then uses a JavaScript function to extract the integer: it takes the substring that follows health=, calls parseInt on the leading digits, and applies the comparison operator and value you mentioned in the question.
db.collection.find({
// use the index on obj to potentially speed up the query
"obj":/health=\d+/,
// now apply a function to narrow down and do the math
$where: function() {
var i = this.obj.indexOf("health=") + 7;
var s = this.obj.substring(i);
var m = s.match(/\d+/);
if (m)
return parseInt(m[0]) > 6;
return false;
}
})
You can of course tweak it to your heart's content to use other operators.
NOTE: I'm using the JavaScript regex capability, which may not be supported by your MongoDB environment. I used Mongo-Shell r4.2.6, where it is supported. If it is not supported in your case, you will have to extract the integer in a different way inside the JavaScript function.
I provided a Mongo Playground where you can try it out and tweak it, but you'll get
Invalid query:
Line 3: Javascript regex are not supported. Use "$regex" instead
until you change it to account for the regex issue noted above. Still, if you're using the latest and greatest, this shouldn't be a limitation.
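The remark about swapping in other operators can be sketched in Python as well. This only illustrates the parsing and comparison logic that the $where function would need; it is not MongoDB code, and the condition syntax (`health>=6` and so on), `OPS`, and `matches` are assumptions made for the example.

```python
import operator
import re

# Supported comparison operators (an assumed, minimal set).
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}

# Two-character operators are listed first so ">=" is not read as ">".
CONDITION = re.compile(r"^(\w+)(>=|<=|>|<|=)(\d+)$")


def matches(obj, condition):
    """Check a condition such as "health>6" against "health=8,type=warrior"."""
    parsed = CONDITION.match(condition)
    if parsed is None:
        raise ValueError("unsupported condition: " + condition)
    key, op, value = parsed.group(1), parsed.group(2), int(parsed.group(3))
    # Find key=<digits> at the start of obj or right after a comma.
    found = re.search(r"(?:^|,)" + re.escape(key) + r"=(\d+)", obj)
    if found is None:
        return False
    return OPS[op](int(found.group(1)), value)


print(matches("health=8,type=warrior", "health>6"))
print(matches("health=6,type=magician", "health>6"))
```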
Performance
Disclaimer: This analysis is not rigorous.
I ran two queries against a small collection (a bigger one could possibly have resulted in different results) with Explain Plan in MongoDB Compass. The first query is the one above; the second is the same query, but with the obj filter removed.
[Explain Plan screenshots for the two queries]
As you can see the plans are different. The number of documents examined is fewer for the first query, and the first query uses the index.
The execution times are meaningless because the collection is small. The results do seem to square with the documentation, but the documentation seems a little at odds with itself. Here are two excerpts:
Use the $where operator to pass either a string containing a JavaScript expression or a full JavaScript function to the query system. The $where provides greater flexibility, but requires that the database processes the JavaScript expression or function for each document in the collection.
and
Using normal non-$where query statements provides the following performance advantages:
MongoDB will evaluate non-$where components of query before $where statements. If the non-$where statements match no documents, MongoDB will not perform any query evaluation using $where.
The non-$where query statements may use an index.
I'm not totally sure what to make of this, TBH. As a general solution it might be useful because it seems you could generate queries that can handle all of your operators.

Does EdgeNGram autocomplete_filter make sense with prefix search?

I have an Elasticsearch index with around 1 million records.
I want to do a multi-field prefix search against 2 fields in the index, Name and ID (there are around 10 fields in total).
Does creating an EdgeNGram autocomplete filter make sense at all?
Or am I missing the point of EdgeNGram?
Here is the code I have for creating the index:
client.indices.create({
index: 'testing',
// type: 'text',
body: {
settings: {
analysis: {
filter: {
autocomplete_filter: {
type: 'edge_ngram',
min_gram: 3,
max_gram: 20
}
},
analyzer: {
autocomplete: {
type: 'custom',
tokenizer: 'standard',
filter: [
'lowercase',
'autocomplete_filter'
]
}
}
}
}
}
},function(err,resp,status) {
if(err) {
console.log(err);
}
else {
console.log("create",resp);
}
});
Code for searching
client.search({
index: 'testing',
type: 'article',
body: {
query: {
multi_match : {
query: "87041",
fields: [ "name", "id" ],
type: "phrase_prefix"
}
}
}
},function (error, response,status) {
if (error){
console.log("search error: "+error)
}
else {
console.log("--- Response ---");
console.log(response);
console.log("--- Hits ---");
response.hits.hits.forEach(function(hit){
console.log(hit);
})
}
});
The search returns the correct results, so my question is: does creating the edge n-gram filter and analyzer make sense in this case?
Or is this prefix functionality provided out of the box?
Thanks a lot for your info.
It depends on your use case. Let me explain.
You can use an edge n-gram filter for this feature. Say your data is london bridge: with a min gram of 1 and a max gram of 20, each word is expanded into its prefixes, i.e. l, lo, lon, etc.
The advantage is that a search for bridge, or for any other token among the generated n-grams, will also match.
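As a rough illustration of what ends up in the index, the sketch below imitates a whitespace split plus lowercasing plus edge n-grams in plain Python. It is an approximation for intuition only, not Elasticsearch's analysis chain, and the function names are invented for this example.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token with lengths from min_gram to max_gram."""
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


def autocomplete_tokens(text, min_gram=1, max_gram=20):
    """Lowercase, split on whitespace, and expand every word into prefixes."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word, min_gram, max_gram))
    return tokens


print(autocomplete_tokens("london bridge"))
# With min_gram=3, as in the question's settings, short prefixes are dropped.
print(autocomplete_tokens("london bridge", min_gram=3))
```

Because every prefix is stored as its own token, a plain match query on a field analyzed this way can behave like a prefix search, at the cost of a larger index.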
There is also an out-of-the-box feature, the completion suggester. It uses an FST model to store the suggestions. As the documentation says, it is faster to search but costlier to build. The catch is that it is a prefix suggester, meaning that searching for bridge will not bring back london bridge by default. There are ways to make this work, though: one workaround is to index an array of tokens, here london bridge and bridge.
There is one more, called the context suggester. If you know that you are going to search on name or id, it is a better fit than the completion suggester: the completion suggester works over everything in the index, whereas the context suggester narrows the suggestions down based on the context.
Since you say it is a prefix search, you can go for the completion suggester. You also mentioned that there are 10 such fields; if you know up front which field should be suggested on, you can go for the context suggester.
One nice answer about edge ngram and completion
completion suggester for middle of the words - I used this solution, and it works like a charm.
You can refer to the documentation for the other default options available within suggesters.

ArangoDB - How to support case-insensitive n-gram index

I have successfully created an n-gram analyzer linked to an ArangoSearch view. The document field being indexed contains mixed-case string content, but I would like users to be able to run case-insensitive queries against it. There is no option for case in the n-gram analyzer properties, so I'm wondering how to do this. An example query I'm running is as follows:
"for doc in myview search analyzer(doc.field in tokens('some input text','myanalyzer'), 'myanalyzer') sort BM25(doc) desc return doc"
This does not (fully) match fields containing "Some Input Text" due to case. Does anyone have recommendations to accomplish this? Thanks!
This is possible since v3.8.0.
A new type of analyzer, pipeline, was introduced.
With it you can chain the effects of multiple analyzers together.
arangosh> var analyzers = require("@arangodb/analyzers");
arangosh> var a = analyzers.save("ngram_upper", "pipeline", { pipeline: [
........> { type: "norm", properties: { locale: "en.utf-8", case: "upper" } },
........> { type: "ngram", properties: { min: 2, max: 2, preserveOriginal: false, streamType: "utf8" } }
........> ] }, ["frequency", "norm", "position"]);
arangosh> db._query(`RETURN TOKENS("Quick brown foX", "ngram_upper")`).toArray();
Source: https://www.arangodb.com/docs/stable/analyzers.html#pipeline
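To see why chaining helps with case-insensitivity, here is a small Python imitation of the two stages: upper-case normalization followed by fixed-size n-grams. It is a conceptual sketch, not ArangoDB's implementation, and `pipeline_tokens` is a hypothetical name.

```python
def pipeline_tokens(text, size=2):
    """Imitate a norm (upper case) stage followed by an n-gram stage."""
    normalized = text.upper()  # stage 1: case normalization
    # stage 2: sliding window of `size` characters over the normalized text
    return [normalized[i:i + size] for i in range(len(normalized) - size + 1)]


mixed = pipeline_tokens("Quick brown foX")
lower = pipeline_tokens("quick brown fox")
print(mixed[:4])
print(mixed == lower)
```

Since both the indexed text and the search input pass through the same pipeline, differently cased inputs produce identical n-grams and therefore match.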

Elasticsearch dsl OR query formation

I have an index with multiple documents. The documents contain the following fields:
name
adhar_number
pan_number
acc_number
I want to create an Elasticsearch DSL query. Two inputs are available for this query, adhar_number and pan_number, and the query should combine them with an OR condition.
Example: if a document matches only the provided adhar_number, I want that document too.
I have a dictionary with the following contents (my_dict):
{
"adhar_number": "123456789012",
"pan_number": "BGPPG4315B"
}
I tried the following:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
s = Search(using=es, index="my_index")
for key, value in my_dict.items():
    s = s.query("match", **{key: value})
print(s.to_dict())
response = s.execute()
print(response.to_dict())
It creates the following query:
{
'query': {
'bool': {
'must': [
{
'match': {
'adhar_number': '123456789012'
}
},
{
'match': {
'pan_number': 'BGPPG4315B'
}
}
]
}
}
}
The code above gives me results for an AND condition instead of an OR condition.
Please suggest a good way to express the OR condition.
To fix the ES query itself, all you need to do is use 'should' instead of 'must':
{
'query': {
'bool': {
'should': [
{
'match': {
'adhar_number': '123456789012'
}
},
{
'match': {
'pan_number': 'BGPPG4315B'
}
}
]
}
}
}
To achieve this in Python, see the following example from the docs. The default logic is AND, but you can override it to OR as shown below.
Query combination
Query objects can be combined using logical operators:
Q("match", title='python') | Q("match", title='django')
# {"bool": {"should": [...]}}
Q("match", title='python') & Q("match", title='django')
# {"bool": {"must": [...]}}
~Q("match", title="python")
# {"bool": {"must_not": [...]}}
When you call the .query() method multiple times, the & operator will be used internally:
s = s.query().query()
print(s.to_dict())
# {"query": {"bool": {...}}}
If you want to have precise control over the query form, use the Q shortcut to directly construct the combined query:
q = Q('bool',
    must=[Q('match', title='python')],
    should=[Q(...), Q(...)],
    minimum_should_match=1
)
s = Search().query(q)
So you want something like the following, with one match clause per entry in my_dict:
q = Q('bool', should=[Q('match', **{key: value}) for key, value in my_dict.items()])
s = s.query(q)
You can use should, as also mentioned by @ifo20. Note that you most likely want to define the minimum_should_match parameter as well:
You can use the minimum_should_match parameter to specify the number or percentage of should clauses returned documents must match.
If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.
{
'query': {
'bool': {
'should': [
{
'match': {
'adhar_number': '123456789012'
}
},
{
'match': {
'pan_number': 'BGPPG4315B'
}
}
],
"minimum_should_match" : 1
}
}
}
Note also that should clauses contribute to the final score. I don't know how to avoid this, but you may not want scoring to be part of plain OR logic.
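For reference, the request body shown above can be generated from my_dict with a few lines of plain Python, without assuming any client library. The helper name `build_or_query` is hypothetical; the resulting dict would still have to be passed to whichever client you use.

```python
def build_or_query(terms, minimum_should_match=1):
    """Build a bool/should query with one match clause per field."""
    should = [{"match": {field: value}} for field, value in terms.items()]
    return {
        "query": {
            "bool": {
                "should": should,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


my_dict = {"adhar_number": "123456789012", "pan_number": "BGPPG4315B"}
body = build_or_query(my_dict)
print(body)
```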

Global Search in Elastic Search

I am working with Elasticsearch and my use case is very straightforward. When a user types in a search box, I want to search my whole data set irrespective of field, column or any other condition (search all data and return all occurrences of the searched word in the documents).
This might be covered in the documentation, but I am not able to understand it. Can somebody explain this?
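The easiest way to search across all fields in an index is to use the _all field. Note that _all only exists in older Elasticsearch versions: it was deprecated in 6.0 and removed in 7.0.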
The _all field is a catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored.
For example:
PUT my_index/user/1
{
"first_name": "John",
"last_name": "Smith",
"date_of_birth": "1970-10-24"
}
GET my_index/_search
{
"query": {
"match": {
"_all": "john smith 1970"
}
}
}
Highlighting is supported so matching occurrences can be returned in your search results.
Drawbacks
There are two main drawbacks to this approach:
Additional disk space and memory are needed to store the _all field
You lose flexibility in how the data and search terms are analysed
A better approach is to disable the _all field and instead list out the fields you are interested in:
GET /_search
{
"query": {
"query_string" : {
"query" : "this AND that OR thus",
"fields":[
"name",
"addressline1",
"dob",
"telephone",
"country",
"zipcode"
]
}
}
}
query_string can do this job for you.
It supports partial search effectively; here is my analysis: https://stackoverflow.com/a/43321606/2357869
query_string is more powerful than the match, term and wildcard queries.
Scenario 1 - Suppose you want to search for "Hello":
Then go with:
{
"query": {
"query_string": {"query": "*Hello*" }
}
}
It will match words like ABCHello, HelloABC, ABCHelloABC.
By default it will search for hello in all fields (_all).
Scenario 2 - Suppose you want to search for "Hello" or "World":
Then go with:
{
"query": {
"query_string": {"query": "*Hello* *World*" }
}
}
It will match words like ABCHello, HelloABC, ABCHelloABC, ABCWorldABC, ABChello, ABCworldABC etc.
It searches for Hello OR World, so it returns any document with a word that contains Hello or World.
By default query_string uses OR as its default operator; you can change that.
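As a sketch of how the operator and field list could be set when building the request body in Python: `default_operator` and `fields` are query_string parameters, while the helper name `build_query_string` and the field names used below are only illustrative.

```python
def build_query_string(query, fields=None, default_operator="OR"):
    """Build a query_string request body as a plain dict."""
    if default_operator not in ("OR", "AND"):
        raise ValueError("default_operator must be 'OR' or 'AND'")
    clause = {"query": query, "default_operator": default_operator}
    if fields:
        # Restrict the search to the listed fields instead of all fields.
        clause["fields"] = list(fields)
    return {"query": {"query_string": clause}}


either = build_query_string("*Hello* *World*")
both = build_query_string(
    "*Hello* *World*",
    fields=["name", "country"],  # hypothetical field names
    default_operator="AND",
)
print(either)
print(both)
```

With AND, a document would need to satisfy both wildcard terms rather than either one.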
