ArangoDB - How to support case-insensitive n-gram index

I have successfully created an n-gram analyzer linked to an ArangoSearch view. The document field being indexed contains mixed-case string content, but I would like users to be able to run case-insensitive queries against it. There is no option for case in the n-gram analyzer properties, so I'm wondering how to do this. An example query I'm running is as follows:
"for doc in myview search analyzer(doc.field in tokens('some input text','myanalyzer'), 'myanalyzer') sort BM25(doc) desc return doc"
This does not (fully) match fields containing "Some Input Text" due to case. Does anyone have recommendations to accomplish this? Thanks!

This is possible since v3.8.0, which introduced a new type of analyzer: pipeline.
With it you can chain the effects of multiple analyzers together.
arangosh> var analyzers = require("#arangodb/analyzers");
arangosh> var a = analyzers.save("ngram_upper", "pipeline", { pipeline: [
........>   { type: "norm", properties: { locale: "en.utf-8", case: "upper" } },
........>   { type: "ngram", properties: { min: 2, max: 2, preserveOriginal: false, streamType: "utf8" } }
........> ] }, ["frequency", "norm", "position"]);
arangosh> db._query(`RETURN TOKENS("Quick brown foX", "ngram_upper")`).toArray();
Source: https://www.arangodb.com/docs/stable/analyzers.html#pipeline
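With the pipeline analyzer in place, the query from the question only needs to reference it. A minimal sketch, assuming the ArangoSearch view links doc.field with the ngram_upper analyzer:
arangosh> db._query(`
........>   FOR doc IN myview
........>     SEARCH ANALYZER(doc.field IN TOKENS('some input text', 'ngram_upper'), 'ngram_upper')
........>     SORT BM25(doc) DESC
........>     RETURN doc
........> `).toArray();
Because both the indexed values and the query tokens pass through the norm step before the n-grams are generated, "Some Input Text" and "some input text" produce identical uppercase n-grams and match case-insensitively.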

Related

Does EdgeNGram autocomplete_filter make sense with prefix search?

I have an Elasticsearch index with around 1 million records.
I want to do a multi-field prefix search against 2 of the fields in the index, Name and ID (there are around 10 fields in total).
Does creating an EdgeNGram autocomplete filter make sense at all, or am I missing the point of the EdgeNGram?
Here is the code I have for creating the index:
client.indices.create({
  index: 'testing',
  // type: 'text',
  body: {
    settings: {
      analysis: {
        filter: {
          autocomplete_filter: {
            type: 'edge_ngram',
            min_gram: 3,
            max_gram: 20
          }
        },
        analyzer: {
          autocomplete: {
            type: 'custom',
            tokenizer: 'standard',
            filter: [
              'lowercase',
              'autocomplete_filter'
            ]
          }
        }
      }
    }
  }
}, function (err, resp, status) {
  if (err) {
    console.log(err);
  } else {
    console.log("create", resp);
  }
});
Code for searching
client.search({
  index: 'testing',
  type: 'article',
  body: {
    query: {
      multi_match: {
        query: "87041",
        fields: ["name", "id"],
        type: "phrase_prefix"
      }
    }
  }
}, function (error, response, status) {
  if (error) {
    console.log("search error: " + error);
  } else {
    console.log("--- Response ---");
    console.log(response);
    console.log("--- Hits ---");
    response.hits.hits.forEach(function (hit) {
      console.log(hit);
    });
  }
});
The search returns the correct results, so my question is: does creating the edge_ngram filter and analyzer make sense in this case, or would this prefix functionality be available out of the box?
Thanks a lot for your info.
It depends on your use case. Let me explain.
You can use ngram for this feature. Let's say your data is london bridge; if your min gram is 1 and max gram is 20, it will be tokenized as l, lo, lon, etc.
The advantage here is that even if you search for bridge or any other token that is part of the generated n-grams, it will be matched.
There is one out-of-the-box feature, the completion suggester. It uses an FST model to store the suggestions. The documentation says it is fast to search but costlier to build. The catch is that it is a prefix suggester, meaning searching for bridge will not bring up london bridge by default. There are ways to make this work, though: the workaround is to index an array of tokens, where london bridge and bridge are both tokens (see the sketch after the links below).
There is one more called the context suggester. If you know that you are going to search on name or id, it is preferable to the completion suggester: whereas the completion suggester works across the whole index, the context suggester works on a subset of the index based on the context.
Since, as you say, it is a prefix search, you can go for completion. You mentioned that there are around 10 such fields; if you know up front which field to suggest on, you can go for the context suggester.
One nice answer about edge ngram and completion:
completion suggester for middle of the words - I used this solution and it works like a charm.
You can refer to the documentation for other default options available within suggesters.
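For illustration, the array-of-tokens workaround with the completion suggester might look roughly like this, in the same JavaScript client style as the question (a sketch only; the suggest_demo index, the name_suggest field and the exact mapping/body shape are assumptions and vary with the Elasticsearch version):
client.indices.create({
  index: 'suggest_demo',
  body: {
    mappings: {
      properties: {
        // Dedicated completion field used only for suggestions.
        name_suggest: { type: 'completion' }
      }
    }
  }
}, function (err, resp) { if (err) console.log(err); });

// Index the full phrase and the inner word as separate inputs so a prefix
// of either one can surface the suggestion.
client.index({
  index: 'suggest_demo',
  body: {
    name_suggest: { input: ['london bridge', 'bridge'] }
  }
}, function (err, resp) { if (err) console.log(err); });

client.search({
  index: 'suggest_demo',
  body: {
    suggest: {
      'name-suggestion': {
        prefix: 'brid', // matches via the extra "bridge" input
        completion: { field: 'name_suggest' }
      }
    }
  }
}, function (err, resp) {
  if (err) console.log(err);
  else console.log(resp.suggest['name-suggestion'][0].options);
});
The context suggester variant adds a contexts section to the completion field mapping and to the query, so suggestions can be restricted to, for example, a particular category.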

How to filter Subscribers based on array of tags in Loopback

I have two models - subscribers and tags.
Sample data:
{
  subscribers: [
    {
      name: "User 1",
      tags: ["a", "b"]
    },
    {
      name: "User 2",
      tags: ["c", "d"]
    }
  ]
}
I want to filter subscribers based on their tags.
If I give the tags a and b, User 1 should be listed.
If I give the tags a and c, both User 1 and User 2 should be listed.
Here is what I tried:
Method 1:
tags is a column in the subscribers model with an array data type.
/subscribers/?filter={"where":{"tags":{"inq":["a","b"]}}} // doesn't work
Method 2:
Created a separate tags table and set up subscribers to have many tags.
/subscribers/?filter={"where":{"tags":{"inq":["a","b"]}}} // doesn't work
How can I achieve this in LoopBack without writing custom methods?
I'm using PostgreSQL as the connector.
UPDATE
As mentioned in the LoopBack docs, you should use inq, not In.
The inq operator checks whether the value of the specified property matches any of the values provided in an array. The general syntax is:
{where: { property: { inq: [val1, val2, ...]}}}
From this:
/subscribers/?filter={"where":{"tags":{"In":["a","b"]}}}
To this:
/subscribers/?filter={"where":{"tags":{"inq":["a","b"]}}}
Finally found a hack using regex! It's not a performant solution, but it works:
{ "where": { "tags": { "regexp": "a|b" } } }

Data validation in AVRO

I am new to Avro, so please excuse me if this is a simple question.
I have a use case where I am using an Avro schema for record calls.
Let's say I have the Avro schema:
{
  "name": "abc",
  "namespace": "xyz",
  "type": "record",
  "fields": [
    {"name": "CustId", "type": "string"},
    {"name": "SessionId", "type": "string"}
  ]
}
Now if the input is like:
{
  "CustId": "abc1234",
  "SessionId": "000-0000-00000"
}
I want to use some regex validation for these fields and accept the input only if it comes in the particular format shown above. Is there any way to specify a regular expression in the Avro schema?
Are there any other data serialization formats which support something like this?
You should be able to use a custom logical type for this. You would then include the regular expressions directly in the schema.
For example, here's how you would implement one in JavaScript:
var avro = require('avsc'),
    util = require('util');

/**
 * Sample logical type that validates strings using a regular expression.
 */
function ValidatedString(attrs, opts) {
  avro.types.LogicalType.call(this, attrs, opts);
  this._pattern = new RegExp(attrs.pattern);
}
util.inherits(ValidatedString, avro.types.LogicalType);

ValidatedString.prototype._fromValue = function (val) {
  if (!this._pattern.test(val)) {
    throw new Error('invalid string: ' + val);
  }
  return val;
};
ValidatedString.prototype._toValue = ValidatedString.prototype._fromValue;
And how you would use it:
var type = avro.parse({
  name: 'Example',
  type: 'record',
  fields: [
    {
      name: 'custId',
      type: 'string' // Normal (free-form) string.
    },
    {
      name: 'sessionId',
      type: {
        type: 'string',
        logicalType: 'validated-string',
        pattern: '^\\d{3}-\\d{4}-\\d{5}$' // Validation pattern.
      }
    }
  ]
}, {logicalTypes: {'validated-string': ValidatedString}});
type.isValid({custId: 'abc', sessionId: '123-1234-12345'}); // true
type.isValid({custId: 'abc', sessionId: 'foobar'}); // false
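The same check should also kick in when actually encoding records, since avsc routes values through _toValue and _fromValue. A small sketch, assuming the type defined above:
// Encoding a valid record works; an invalid sessionId makes _toValue throw.
var ok = type.toBuffer({custId: 'abc', sessionId: '123-1234-12345'});
try {
  type.toBuffer({custId: 'abc', sessionId: 'foobar'});
} catch (err) {
  console.log(err.message); // invalid string: foobar
}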
You can read more about implementing and using logical types here.
Edit: For the Java implementation, I believe you will want to look at the following classes:
LogicalType, the base you'll need to extend.
Conversion, to perform the conversion (or validation in your case) of the data.
LogicalTypes and Conversions, a few examples of existing implementations.
TestGenericLogicalTypes, relevant tests which could provide a helpful starting point.

mongoose updating a specific field in a nested document at a 3rd level

Mongoose schema:
var restsSchema = new Schema({
  name: String,
  menu: mongoose.Schema.Types.Mixed
});
Simplified document:
{
  name: "Dominos Pizza",
  menu: {
    "1": {
      id: 1,
      name: "Plain Pizza",
      soldCounter: 0
    },
    "2": {
      id: 2,
      name: "Pizza with vegetables",
      soldCounter: 0
    }
  }
}
I'm trying to update the soldCounter when given a single "menu item" or an array of them (such as the "1" or "2" objects in the document above), as follows:
function(course, rest) {
  rest.markModified("menu.1");
  db.model('rests').update({_id: rest._id}, {$inc: {"menu.1.soldCounter": 1}});
}
Once this works I will obviously want to make it more generic, something like the following (this syntax is not working, but it demonstrates my need):
function(course, rest) {
  rest.markModified("menu." + course.id);
  db.model('rests').update({_id: rest._id}, {$inc: {"menu.+"course.id"+.soldCounter": 1}});
}
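(For what it's worth, the dynamic path in the snippet above can be written with plain JavaScript bracket notation; this is only a sketch of that syntax, assuming course.id holds the menu key:)
function incrementSoldCounter(course, rest) {
  rest.markModified("menu." + course.id);
  // Build the computed update path as an ordinary string key.
  var inc = {};
  inc["menu." + course.id + ".soldCounter"] = 1;
  db.model('rests').update({_id: rest._id}, {$inc: inc});
}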
Can anyone help with this one?
I looked for an answer but couldn't find anything regarding the 3rd level.
UPDATE:
Added id to the document's sub-documents.
I think you want to add all the ids into the sub-documents; one way you can do that is as follows.
Rest.findOne({_id: rest._id}, function(err, o) {
  // add all ids into the sub-documents...
  Object.keys(o.menu).forEach(function(key) {
    o.menu[key].id = key;
  });
  // menu is a Mixed type, so mark it as modified before saving.
  o.markModified('menu');
  o.save(function(err){ ... });
});
It seems you want to operate on the key in the query; I am afraid you cannot do it that way.
Please refer to the following questions.
Mongodb - regex match of keys for subobjects
MongoDB Query Help - query on values of any key in a sub-object

Searching a number in a string field with query_string on Elasticsearch

Among other text fields, I've got this string field in my Elasticsearch index:
"user": { "type": "string", "analyzer": "simple", "norms": { "enabled": False } }
It gets filled with a typical username, e.g. "simon".
Using query_string I can limit my search results for "other search terms" to this particular user:
'query': { 'query_string': { 'query': 'user:simon other search terms' } }
Default operator is set to "AND". However, when a username consists only of a number (saved and indexed as a string), Elasticsearch appears to ignore the "user:..." clause. For example:
'query': { 'query_string': { 'query': 'user:111 other search terms' } }
yields the same results as
'query': { 'query_string': { 'query': 'other search terms' } }
Any idea what might be the cause or how to fix it?
You are using the simple analyzer. As the documentation says:
An analyzer of type simple that is built using a Lower Case Tokenizer.
The lower case tokenizer uses the letter tokenizer and the lower case token filter. The problem with your specific test data is that the letter tokenizer divides the text at non-letters, and digits are non-letters. This method from the Java API defines what exactly counts as a letter; in contrast, this method from the Java API defines what exactly counts as a digit.
You may want to look at the standard analyzer instead.
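To see the difference, you can compare the two analyzers directly with the _analyze API. A rough sketch in the same JavaScript client style as the Elasticsearch example above (the exact parameter shape varies with the client version):
// The simple analyzer splits at non-letters and drops the digits,
// so "111" never becomes a token.
client.indices.analyze({
  body: { analyzer: 'simple', text: 'simon 111' }
}, function (err, resp) {
  console.log(resp.tokens); // only "simon"
});

// The standard analyzer keeps both words and numbers as tokens.
client.indices.analyze({
  body: { analyzer: 'standard', text: 'simon 111' }
}, function (err, resp) {
  console.log(resp.tokens); // "simon" and "111"
});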
