MongoDB: multi-lingual (accent insensitive), case insensitive search, with partial words? - node.js

For the application we are developing we need to allow our searches to support accents, be case insensitive and search for partial words. For example, given the product name "La Niña" in our collection, the following searches should be expected to return the entry:
La Niña
niña
nina
nin
La nin
Currently I have tried two approaches, each with their apparent limitations, based on testing and some research:
Regex
supports case insensitive and partial searches
does not support accents, such that niña != nina
Text Search
supports case insensitivity, accents, and partial phrases
does not support partial words
Example regex search, as we have used:
function escapeRegExp(text) {
return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
const escapedStr = escapeRegExp(searchTerm);
await Product.find({ name: new RegExp(escapedStr, 'i') });
Example text search, as we have used:
// On the schema
storeSchema.index({ name: 'text' });
// Searching:
await Product.find({ $text: { $search: searchTerm } })
.collation({locale: 'en', strength: 1});
BTW We have set the schemas in question to use collation strength level 1.
Some approaches I am considering, if MongoDB doesn't provide a solution:
shadow name field (not sure of the right term?), with the accents removed (see the sketch at the end of this question)
a separate full text search engine
Can anyone help here?
Note, we are leveraging mongoose 5.9.5, with node 12.16.2 and mongodb 4.3.8 running in mongo cloud.
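For reference, a minimal sketch of the "shadow field" idea mentioned above, assuming accents are stripped via Unicode normalization (the helper and field names here are illustrative):
function removeDiacritics(text) {
  // decompose accented characters and strip the combining marks: "La Niña" -> "la nina"
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}
// kept alongside the original field whenever a product is saved:
// product.nameNormalized = removeDiacritics(product.name);
// searched with the same normalization applied to the user's input:
// await Product.find({ nameNormalized: new RegExp(escapeRegExp(removeDiacritics(searchTerm))) });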

I believe Text Search is what you need. There are two other features of Text Search that help fulfill the requirement of partial word matching you described in the question.
Stop Words: Given a language option, MongoDB Text Search is capable of identifying words that shouldn't influence search results. The frequency of usage of these words is such that they appear in almost every sentence, for example, in English, words like "the", "a", "of", are all stop words. These words are stripped off the search phrase before the actual search takes place.
Word Stemming: Given a language option, MongoDB Text Search is capable of identifying the root version of a word, for example, in English, the stem version of "identifying" would be "identify", so they both would match in a text search.
I was able to figure with Google Translate that the "La Niña" example you gave is in Spanish.
If I insert the following into a sample product collection:
db.products.insertMany([
{ "term" : "La Niña" },
{ "term" : "niña" },
{ "term" : "nina" },
{ "term" : "nin" },
{ "term" : "La nin" },
])
By specifying a language option of "spanish" on my Text Search query:
db.products.find({ $text: { $search: "La Niña", $language: "spanish" } })
MongoDB would effectively match that with all the products that were previously inserted. You can get a list of the supported language options for MongoDB here.
I'm not 100% sure of how the accent matching works though.
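For what it's worth, version 3 text indexes (MongoDB 3.2+) are diacritic-insensitive by default, and that behaviour can be set explicitly with the $diacriticSensitive flag. A minimal sketch against the sample collection above:
db.products.createIndex({ term: "text" })
// with diacritics folded, a search for "nina" should also match "niña"
db.products.find({ $text: { $search: "nina", $diacriticSensitive: false } })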

Related

Weighted text search with MongoDB

I have a MongoDB Atlas cluster that I reach from a node.js server.
I implemented a text search where I take an input from the user, let's say "rainbow", and create a string like that: "rai rain rainb rainbo rainbow".
Then use that string to do a text search on indexed fields and sort the results by score.
That's the code I use to search over the database:
await myCollection.createIndex({ Description: 'text' });
const searchResult = await myCollection.find(
{ $text: { $search: rainbowString } },
{ projection: { Description: 1, Price: 1, score: { $meta: 'textScore' } } }
).sort({ score: { $meta: 'textScore' } }).limit(20).toArray();
// where rainbowString is the string I spoke about earlier
Now I would like to make some improvements. For example, a "rainbows" string in my database is not going to be found. (In general, misspelled words or abbreviations are not going to find a match: by writing "pat" you will find neither "path" nor "pet".)
I could make the algorithm add an "s" at the end of every word typed by the user (or any letter in any place), ending up with this string "rai rain rainb rainbo rainbow rainbows". However in this way "rainbows" would score higher than "rainbow" (which is what the user originally typed).
I guess I could add an extra copy of every word typed by the user to make my own weighted search: "rainbow rai rain rainb rainbo rainbow rainbows". However, searching for the same word twice is a waste of resources. Imagine if you wanted to use, say, five different weights (with each weight assigned to a group of words).
So my question is: is there a way I can tell mongo that I want to look for "rainbow" with a weight of say 4 and for each of the words in the string: "rai rain rainb rainbo rainbows" with a weight of 1?
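For context, the prefix string described above can be generated with a small helper like the following (illustrative; the minimum prefix length of 3 is an assumption):
function buildPrefixString(term, minLength = 3) {
  // "rainbow" -> "rai rain rainb rainbo rainbow"
  const prefixes = [];
  for (let i = minLength; i <= term.length; i++) {
    prefixes.push(term.slice(0, i));
  }
  return prefixes.join(' ');
}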

Find and update case insensitive data in MongoDB [duplicate]

Example:
> db.stuff.save({"foo":"bar"});
> db.stuff.find({"foo":"bar"}).count();
1
> db.stuff.find({"foo":"BAR"}).count();
0
You could use a regex.
In your example that would be:
db.stuff.find( { foo: /^bar$/i } );
I must say, though, maybe you could just downcase (or upcase) the value on the way in rather than incurring the extra cost every time you find it. Obviously this won't work for people's names and such, but it may for use cases like tags.
UPDATE:
The original answer is now obsolete. MongoDB now supports advanced full-text searching, with many features.
ORIGINAL ANSWER:
It should be noted that searching with regex's case insensitive /i means that mongodb cannot search by index, so queries against large datasets can take a long time.
Even with small datasets, it's not very efficient. You take a far bigger cpu hit than your query warrants, which could become an issue if you are trying to achieve scale.
As an alternative, you can store an uppercase copy and search against that. For instance, I have a User table that has a username which is mixed case, but the id is an uppercase copy of the username. This ensures case-sensitive duplication is impossible (having both "Foo" and "foo" will not be allowed), and I can search by id = username.toUpperCase() to get a case-insensitive search for username.
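A minimal sketch of that uppercase-copy idea in the shell (collection and field names are illustrative):
// the _id holds the uppercase copy, so "Foo" and "foo" cannot coexist
db.users.insertOne({ _id: "FOO", username: "Foo" })
// case-insensitive lookup by normalising the input the same way
db.users.findOne({ _id: "foo".toUpperCase() }) // finds the "Foo" document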
If your field is large, such as a message body, duplicating data is probably not a good option. I believe using an extraneous indexer like Apache Lucene is the best option in that case.
Starting with MongoDB 3.4, the recommended way to perform fast case-insensitive searches is to use a Case Insensitive Index.
I personally emailed one of the founders to please get this working, and he made it happen! It was an issue on JIRA since 2009, and many have requested the feature. Here's how it works:
A case-insensitive index is made by specifying a collation with a strength of either 1 or 2. You can create a case-insensitive index like this:
db.cities.createIndex(
{ city: 1 },
{
collation: {
locale: 'en',
strength: 2
}
}
);
You can also specify a default collation per collection when you create them:
db.createCollection('cities', { collation: { locale: 'en', strength: 2 } } );
In either case, in order to use the case-insensitive index, you need to specify the same collation in the find operation that was used when creating the index or the collection:
db.cities.find(
{ city: 'new york' }
).collation(
{ locale: 'en', strength: 2 }
);
This will return "New York", "new york", "New york" etc.
Other notes
The answers suggesting to use full-text search are wrong in this case (and potentially dangerous). The question was about making a case-insensitive query, e.g. username: 'bill' matching BILL or Bill, not a full-text search query, which would also match stemmed words of bill, such as Bills, billed etc.
The answers suggesting to use regular expressions are slow, because even with indexes, the documentation states:
"Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes."
$regex answers also run the risk of user input injection.
If you need to create the regexp from a variable, this is a much better way to do it: https://stackoverflow.com/a/10728069/309514
You can then do something like:
var string = "SomeStringToFind";
var regex = new RegExp(["^", string, "$"].join(""), "i");
// Creates a regex of: /^SomeStringToFind$/i
db.stuff.find( { foo: regex } );
This has the benefit of being more programmatic, and you can get a performance boost by compiling the regex ahead of time if you're reusing it a lot.
Keep in mind that the previous example:
db.stuff.find( { foo: /bar/i } );
will cause every entry containing bar to match the query (bar1, barxyz, openbar), which could be very dangerous for a username search in an auth function...
You may need to make it match only the search term by using the appropriate regexp syntax, such as:
db.stuff.find( { foo: /^bar$/i } );
See http://www.regular-expressions.info/ for syntax help on regular expressions
db.company_profile.find({ "companyName" : { "$regex" : "Nilesh" , "$options" : "i"}});
db.zipcodes.find({city : "NEW YORK"}); // Case-sensitive
db.zipcodes.find({city : /NEW york/i}); // Note the 'i' flag for case-insensitivity
TL;DR
The correct way to do this in Mongo:
Do not use RegExp.
Go natural and use MongoDB's inbuilt indexing and text search.
Step 1 :
db.articles.insert(
[
{ _id: 1, subject: "coffee", author: "xyz", views: 50 },
{ _id: 2, subject: "Coffee Shopping", author: "efg", views: 5 },
{ _id: 3, subject: "Baking a cake", author: "abc", views: 90 },
{ _id: 4, subject: "baking", author: "xyz", views: 100 },
{ _id: 5, subject: "Café Con Leche", author: "abc", views: 200 },
{ _id: 6, subject: "Сырники", author: "jkl", views: 80 },
{ _id: 7, subject: "coffee and cream", author: "efg", views: 10 },
{ _id: 8, subject: "Cafe con Leche", author: "xyz", views: 10 }
]
)
Step 2 :
You need to create an index on whichever text field you want to search; without an index the query will be extremely slow.
db.articles.createIndex( { subject: "text" } )
Step 3 :
db.articles.find( { $text: { $search: "coffee", $caseSensitive: true } } )  // case-sensitive
db.articles.find( { $text: { $search: "coffee", $caseSensitive: false } } ) // case-insensitive
One very important thing to keep in mind when using a regex-based query: when you are doing this for a login system, escape every single character you are searching for, and don't forget the ^ and $ anchors. Lodash has a nice function for this, should you be using it already:
db.stuff.find({ foo: new RegExp('^' + _.escapeRegExp(bar) + '$', 'i') })
Why? Imagine a user entering .* as his username. That would match all usernames, enabling a login by just guessing any user's password.
Suppose you want to search for "column" in "Table" with a case-insensitive search. The best and most efficient way is:
//create empty JSON Object
mycolumn = {};
//check if column has valid value
if(column) {
mycolumn.column = {$regex: new RegExp(column), $options: "i"};
}
Table.find(mycolumn);
It simply adds your search value as a RegExp and searches case-insensitively via the "i" option.
Mongo (current version 2.0.0) doesn't allow case-insensitive searches against indexed fields - see their documentation. For non-indexed fields, the regexes listed in the other answers should be fine.
For searching a variable and escaping it:
const escapeStringRegexp = require('escape-string-regexp')
const name = 'foo'
db.stuff.find({name: new RegExp('^' + escapeStringRegexp(name) + '$', 'i')})
Escaping the variable protects the query against attacks with '.*' or other regex.
escape-string-regexp
The best method is, in your language of choice, when creating a model wrapper for your objects, to have your save() method iterate through the set of fields you will be searching on (which are also indexed); that set of fields should have lowercase counterparts that are then used for searching.
Every time the object is saved again, the lowercase properties are then checked and updated with any changes to the main properties. This will make it so you can search efficiently, but hide the extra work needed to update the lc fields each time.
The lowercase fields could be a key:value object store or just the field name with an lc_ prefix. I use the second option to simplify querying (deep object querying can be confusing at times).
Note: you want to index the lc_ fields, not the main fields they are based off of.
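A minimal sketch of that pattern with Mongoose (the schema, hook, and lc_ field name here are illustrative):
const mongoose = require('mongoose');

const userSchema = new mongoose.Schema({
  username: String,
  lc_username: { type: String, index: true } // index the lowercase copy, not the main field
});

// keep the lowercase counterpart in sync on every save
userSchema.pre('save', function (next) {
  if (this.isModified('username')) {
    this.lc_username = this.username.toLowerCase();
  }
  next();
});

const User = mongoose.model('User', userSchema);

// search against the lowercase field:
// const users = await User.find({ lc_username: searchTerm.toLowerCase() });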
Using Mongoose this worked for me:
var find = function(username, next){
User.find({'username': {$regex: new RegExp('^' + username, 'i')}}, function(err, res){
if(err) throw err;
next(null, res);
});
}
If you're using MongoDB Compass:
Go to the collection and, in the filter box, type: { Fieldname: /string/i }
For Node.js using Mongoose:
Model.find({FieldName: {$regex: "stringToSearch", $options: "i"}})
The aggregation framework was introduced in MongoDB 2.2. You can use the string operator $strcasecmp to make a case-insensitive comparison between strings. It's recommended over regex and easier to use.
Here's the official document on the aggregation command operator: https://docs.mongodb.com/manual/reference/operator/aggregation/strcasecmp/#exp._S_strcasecmp .
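A minimal sketch of $strcasecmp in an aggregation (collection and field names are illustrative; $expr inside $match requires MongoDB 3.6+, and $strcasecmp only folds case for ASCII characters):
db.users.aggregate([
  // $strcasecmp returns 0 when the two strings are equal, ignoring case
  { $match: { $expr: { $eq: [{ $strcasecmp: ['$username', 'bill'] }, 0] } } }
])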
You can use Case Insensitive Indexes:
The following example creates a collection with no default collation, then adds an index on the name field with a case-insensitive collation (collation uses the International Components for Unicode, ICU).
/* strength: CollationStrength.Secondary
 * Secondary level of comparison. Collation performs comparisons up to secondary
 * differences, such as diacritics. That is, collation performs comparisons of
 * base characters (primary differences) and diacritics (secondary differences).
 * Differences between base characters take precedence over secondary differences.
 */
db.users.createIndex( { name: 1 }, { collation: { locale: 'tr', strength: 2 } } )
To use the index, queries must specify the same collation.
db.users.insert( [ { name: "Oğuz" },
{ name: "oğuz" },
{ name: "OĞUZ" } ] )
// does not use index, finds one result
db.users.find( { name: "oğuz" } )
// uses the index, finds three results
db.users.find( { name: "oğuz" } ).collation( { locale: 'tr', strength: 2 } )
// does not use the index, finds three results (different strength)
db.users.find( { name: "oğuz" } ).collation( { locale: 'tr', strength: 1 } )
or you can create a collection with default collation:
db.createCollection("users", { collation: { locale: 'tr', strength: 2 } } )
db.users.createIndex( { name : 1 } ) // inherits the default collation
I'm surprised nobody has warned about the risk of regex injection when using /^bar$/i if bar is a password or an account id search (e.g. bar => .*@myhackeddomain.com). So here comes my bet: use the \Q \E regex special characters provided in Perl:
db.stuff.find( { foo: /^\Qbar\E$/i } );
You should escape any \ chars in the bar variable with \\ to avoid the \E exploit again, e.g. when bar = '\E.*@myhackeddomain.com\Q'
Another option is to use a regex escape char strategy like the one described here Javascript equivalent of Perl's \Q ... \E or quotemeta()
Use RegExp,
In case the other options do not work for you, RegExp is a good option. It makes the search case insensitive.
var username = new RegExp("^" + "John" + "$", "i");
Use username in your queries, and then it's done.
I hope it will work for you too. All the Best.
If there are special characters in the query, a simple regex will not work. You will need to escape those special characters.
The following helper function can help without installing any third-party library:
const escapeSpecialChars = (str) => {
return str.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
}
And your query will be like this:
db.collection.find({ field: { $regex: escapeSpecialChars(query), $options: "i" }})
Hope it will help!
Using a filter works for me in C#.
string s = "searchTerm";
var filter = Builders<Model>.Filter.Where(p => p.Title.ToLower().Contains(s.ToLower()));
var listSorted = collection.Find(filter).ToList();
var list = collection.Find(filter).ToList();
It may even use the index because I believe the methods are called after the return happens but I haven't tested this out yet.
This also avoids a problem of
var filter = Builders<Model>.Filter.Eq(p => p.Title.ToLower(), s.ToLower());
where MongoDB thinks p.Title.ToLower() is a property and won't map it properly.
I had faced a similar issue and this is what worked for me:
const flavorExists = await Flavors.findOne({
'flavor.name': { $regex: flavorName, $options: 'i' },
});
Yes, it is possible.
You can use $expr like this:
$expr: {
$eq: [
{ $toLower: '$STRING_KEY' },
{ $toLower: 'VALUE' }
]
}
Please do not use regex here, because it can cause a lot of problems, especially if the string comes from the end user.
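A complete query using that expression might look like this (collection and field names are illustrative):
db.users.find({
  $expr: {
    $eq: [
      { $toLower: '$username' }, // note: wrapping the field in $toLower prevents use of a plain index on username
      { $toLower: 'Bill' }
    ]
  }
})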
I've created a simple Func for the case insensitive regex, which I use in my filter.
private Func<string, BsonRegularExpression> CaseInsensitiveCompare = (field) =>
BsonRegularExpression.Create(new Regex(field, RegexOptions.IgnoreCase));
Then you simply filter on a field as follows.
db.stuff.find({"foo": CaseInsensitiveCompare("bar")}).count();
These have been tested for string searches
{'_id': /.*CM.*/}          // find where _id contains "CM"
{'_id': /^CM/}             // find where _id starts with "CM"
{'_id': /CM$/}             // find where _id ends with "CM"
{'_id': /.*UcM075237.*/i}  // find where _id contains "UcM075237", ignoring case
{'_id': /^UcM075237/i}     // find where _id starts with "UcM075237", ignoring case
{'_id': /UcM075237$/i}     // find where _id ends with "UcM075237", ignoring case
For anyone using Golang who wishes to have case-insensitive search with MongoDB and the globalsign mgo library:
collation := &mgo.Collation{
Locale: "en",
Strength: 2,
}
err := collection.Find(query).Collation(collation).All(&results) // decode the matches into results
As you can see in the MongoDB docs, since version 3.2 the $text index is case-insensitive by default: https://docs.mongodb.com/manual/core/index-text/#text-index-case-insensitivity
Create a text index and use $text operator in your query.

Query subdocuments without knowing the keys

I have a collection like this:
{
"_id" : ObjectId("5a7c49b02d2bbb28a4b2e6a2"),
"phone" : "Pinheiro",
"email" : "Pinheiro",
"variableParameters" : {
"loremIpsum" : "Do you see a little Asian child with a blank expression on his face sitting outside on a mechanical helicopter that shakes when you put quarters in it?",
"uf" : "Rio de Janeiro",
"city" : "Rio de Janeiro",
"end" : "RUA JARDIM BOTÂNICO 1060",
"tel" : "5521999999999",
"eml" : "teste#gmail.com",
"nome" : "Usuario de Teste"
}
}
And I want to query the "variableParameters" object, but as the name says, these properties are variable. So in some cases it will have "uf", but in other cases it won't.
I'm actually doing a query that only matches the constant field from a mongoose schema:
{ 'phone': { $regex: filter, $options: 'i' } }
Is there any way that I can query "variableParameters" without knowing its child properties?
If you are unsure about the keys (since they are variable), then try using $text search.
To use text search we need to index the variableParameters.
Case-sensitive text search can also be performed, but it comes with a performance impact.
Please read https://docs.mongodb.com/manual/reference/operator/query/text/ for more information on text search
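If the keys under variableParameters are not known in advance, one way to set this up is a wildcard text index over all string fields (a sketch; the contacts collection name is illustrative, and note it indexes every string field in the document, not only those under variableParameters):
db.contacts.createIndex({ "$**": "text" })
db.contacts.find({ $text: { $search: "Rio de Janeiro" } })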
[SOLVED]
Thanks @Clement Amarnath for the help.
The solution is something like this:
_Events.find({ $text: { $search: 'searchText' } }, (err, events) => {
if (err) return Exceptions.HandleApiException(err, res);
res.send(events);
});
The $text parameter can have these properties:
{
$text: {
$search: <string>,
$language: <string>,
$caseSensitive: <boolean>,
$diacriticSensitive: <boolean>
}
}
$search A string of terms that MongoDB parses and uses to query the text index. MongoDB performs a logical OR search of the terms unless specified as a phrase. See Behavior for more information on the field.
$language Optional. The language that determines the list of stop words for the search and the rules for the stemmer and tokenizer. If not specified, the search uses the default language of the index. For supported languages, see Text Search Languages.
If you specify a language value of "none", then the text search uses simple tokenization with no list of stop words and no stemming.
$caseSensitive Optional. A boolean flag to enable or disable case sensitive search. Defaults to false; i.e. the search defers to the case insensitivity of the text index.
$diacriticSensitive Optional. A boolean flag to enable or disable diacritic sensitive search against version 3 text indexes. Defaults to false; i.e. the search defers to the diacritic insensitivity of the text index.
For more information, see MongoDB Documentation.
You can use $where to build your own matching function.
An example matching a full field value would be:
db.col.find({$where: function(){
return Object.values(this.variableParameters).includes("Rio de Janeiro")
}})

How can I create an autocomplete with MongoDB full text search

I want to create an autocomplete input box that shows word suggestions as users type.
Basically, my problem is that when I use the $text operator for searching strings in a document, the queries will only match on complete stemmed words. This is for the same reason that if a document field contains the word blueberry, a search on the term blue will not match the document. However, a search on either blueberry or blueberries would match.
find = {$text: { $search: 'blue' } };
^ (doesn't match blueberry or bluebird on a document.)
I want to be able to do this. I want to match 'blueberry' or 'bluebird' with 'blue', and initially I thought this was possible by using a 'starts with' (^) regular expression, but it seems like $text and $search only accepts a string; not a regexp.
I would like to know if there is a way to do this that is not excessively complex to implement/maintain. So far, I've only seen people trying to accomplish this by creating a new collection with the results of running a map/reduce across the collection with the text index.
I do not want to use ElasticSearch or Solr because I think it is overkill for what I am trying to do, and although I sometimes think that eventually I will have no other choice, I still cannot believe that there is not a simpler way to accomplish this.
MongoDB full text search matches whole words only, so it is inherently not suitable for auto complete.
The $text operator can search for words and phrases. The query matches on the complete stemmed words. For example, if a document field contains the word blueberry, a search on the term blue will not match the document. However, a search on either blueberry or blueberries will match.
(Source: http://docs.mongodb.org/manual/core/index-text/)
You can now use Atlas Search natively in MongoDB Atlas to achieve this. You will have to first add the autocomplete field mapping in your index definition before you can use the autocomplete operator to your query. This can be accomplished through the Visual Editor or the JSON editor - there's a tutorial which walks you through how to implement it.
Here's the index definition template from the docs:
{
"mappings": {
"dynamic": true|false,
"fields": {
"<field-name>": [
{
"type": "autocomplete",
"analyzer": "lucene.standard",
"tokenization": "edgeGram|rightEdgeGram|nGram",
"minGrams": <2>,
"maxGrams": <15>,
"foldDiacritics": true|false
}
]
}
}
}
And the query, where you can also specify support for typo-tolerance via the fuzzy parameter:
{
$search: {
"index": "<index name>", // optional, defaults to "default"
"autocomplete": {
"query": "<search-string>",
"path": "<field-to-search>",
"tokenOrder": "any|sequential",
"fuzzy": <options>,
"score": <options>
}
}
}
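A filled-in sketch of the template above, assuming a products collection whose title field is mapped as an autocomplete field in an Atlas Search index named "default":
db.products.aggregate([
  {
    $search: {
      autocomplete: {
        query: "blue",   // would match "blueberry", "bluebird", ...
        path: "title"
      }
    }
  },
  { $limit: 10 },
  { $project: { title: 1 } }
])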

elasticsearch prefix query for multiple words to solve the autocomplete use case

How do I get elastic search to work to solve a simple autocomplete use case that has multiple words?
Let's say I have a document with the following title - Elastic search is a great search tool built on top of lucene.
So if I use the prefix query and construct it with the form -
{
"prefix" : { "title" : "Elas" }
}
It will return that document in the result set.
However if I do a prefix search for
{
"prefix" : { "title" : "Elastic sea" }
}
I get no results.
What sort of query do I need to construct so as to present to the user that result for a simple autocomplete use case.
A prefix query made on Elastic sea would match a term like Elastic search in the index, but that doesn't appear in your index if you tokenize on whitespaces. What you have is elastic and search as two different tokens. Have a look at the analyze api to find out how you are actually indexing your text.
Using a boolean query like the one in your answer, you wouldn't take into account the position of the terms. You would get as a result the following document, for example:
Elastic model is a framework to store your Moose object and search
through them.
For auto-complete purposes you might want to make a phrase query and use the last term as a prefix. That's available out of the box using the match_phrase_prefix type in a match query, which was made available exactly for your use case:
{
"match" : {
"message" : {
"query" : "elastic sea",
"type" : "phrase_prefix"
}
}
}
With this query your example document would match but mine wouldn't since elastic is not close to search there.
To achieve that result, you will need to use a Boolean query. The partial word needs to be in a prefix query and the complete word or phrase needs to be in a match clause. There are other tweaks available to the query, like must, should, etc., that can be applied as needed.
{
"query": {
"bool": {
"must": [
{
"prefix": {
"name": "sea"
}
},
{
"match": {
"name": "elastic"
}
}
]
}
}
}
