Lucene Hierarchical Taxonomy Search

I have a set of documents annotated with hierarchical taxonomy tags, e.g.:
[
  {
    "id": 1,
    "title": "a funny book",
    "authors": ["Jean Bon", "Alex Terieur"],
    "book_category": "/novel/comedy/new"
  },
  {
    "id": 2,
    "title": "a dramatic book",
    "authors": ["Alex Terieur"],
    "book_category": "/novel/drama"
  },
  {
    "id": 3,
    "title": "A hilarious book",
    "authors": ["Marc Assin", "Harry Covert"],
    "book_category": "/novel/comedy"
  },
  {
    "id": 4,
    "title": "A sad story",
    "authors": ["Gerard Menvusa", "Alex Terieur"],
    "book_category": "/novel/drama"
  },
  {
    "id": 5,
    "title": "A very sad story",
    "authors": ["Gerard Menvusa", "Alain Terieur"],
    "book_category": "/novel"
  }
]
I need to search books by "book_category". The search must return books whose category matches the query category exactly or partially (within a defined depth threshold), and score them according to the degree of the match.
E.g.: the query "book_category=/novel/comedy" with "depth_threshold=1" must return books with book_category=/novel/comedy (score=100%), as well as /novel and /novel/comedy/new (score < 100%).
I tried TopScoreDocCollector in the search, but it returns every book whose book_category at least contains the query category, and gives them all the same score.
How can I obtain a search function that also returns the more general category and gives different match scores to the results?
P.S.: I don't need a faceted search.
Thanks

There is no built-in query that supports this requirement, but you can use a DisjunctionMaxQuery with multiple ConstantScoreQuerys. The exact category and the more general category can be searched with simple TermQuerys. For the sub-categories, you can use a MultiTermQuery such as RegexpQuery to match all sub-categories if you don't know them upfront. For example:
// the exact category
Query directQuery = new TermQuery(new Term("book_category", "/novel/comedy"));
// regex that matches one level deeper than the exact category
Query narrowerQuery = new RegexpQuery(new Term("book_category", "/novel/comedy/[^/]+"));
// the more general category
Query broaderQuery = new TermQuery(new Term("book_category", "/novel"));
directQuery = new ConstantScoreQuery(directQuery);
narrowerQuery = new ConstantScoreQuery(narrowerQuery);
broaderQuery = new ConstantScoreQuery(broaderQuery);
// 100% for the exact category
directQuery.setBoost(1.0F);
// 80% for the more specific category
narrowerQuery.setBoost(0.8F);
// 50% for the more general category
broaderQuery.setBoost(0.5F);
DisjunctionMaxQuery query = new DisjunctionMaxQuery(0.0F);
query.add(directQuery);
query.add(narrowerQuery);
query.add(broaderQuery);
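Note that Lucene's RegexpQuery matches against the entire indexed term, not a substring of it. As a quick sanity check outside Lucene, the same anchoring can be reproduced with Python's re.fullmatch (illustrative only, not Lucene itself):

```python
import re

# Lucene's RegexpQuery matches the whole indexed term, which
# re.fullmatch emulates here (re.match would only anchor the start).
pattern = re.compile(r"/novel/comedy/[^/]+")

for category in ["/novel/comedy/new", "/novel/comedy", "/novel/comedy/new/extra"]:
    print(category, bool(pattern.fullmatch(category)))
```

Only the direct children of /novel/comedy match; the exact category and deeper descendants do not, which is why they are handled by the separate TermQuerys above.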
This would give a result like:
id=3 title=a hilarious book book_category=/novel/comedy score=1.000000
id=1 title=a funny book book_category=/novel/comedy/new score=0.800000
id=5 title=A very sad story book_category=/novel score=0.500000
For a complete test case, see this gist: https://gist.github.com/knutwalker/7959819

This could be a solution. But I have more than one hierarchical field to query, and I want to use the CategoryPath indexed in the taxonomy.
I'm using a DrillDownQuery:
DrillDownQuery luceneQuery = new DrillDownQuery(searchParams.indexingParams);
luceneQuery.add(new CategoryPath("book_category/novel/comedy", '/'));
luceneQuery.add(new CategoryPath("subject/sub1/sub2", '/'));
This way the search returns the books that match the two category paths and their descendants.
To also retrieve the ancestors, I can start the drill-down from an ancestor of the requested CategoryPath (retrieved from the taxonomy).
The problem is that all the results get the same score.
I want to override the similarity/score function in order to compute a score based on CategoryPath length, comparing the query CategoryPath with the CategoryPath (book_category) of each returned document.
E.g.:
if (queryCategoryPath.compareTo(bookCategoryPath) == 0) {
    document.score = 1;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 1) {
    document.score = 0.9;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 2) {
    document.score = 0.8;
}
and so on.
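The intended depth-based scoring can be prototyped outside Lucene first; a minimal sketch in plain Python, assuming a 0.1 penalty per level of depth difference (the function name and the penalty are illustrative, not part of any Lucene API):

```python
def depth_score(query_path: str, doc_path: str) -> float:
    """Score a document category path against a query path by depth distance.

    An exact match scores 1.0; each level of difference along the same
    branch subtracts 0.1 (illustrative); unrelated branches score 0.0.
    """
    q = [p for p in query_path.split("/") if p]
    d = [p for p in doc_path.split("/") if p]
    shorter, longer = (q, d) if len(q) <= len(d) else (d, q)
    # One path must be a prefix of the other to count as a partial match.
    if longer[: len(shorter)] != shorter:
        return 0.0
    return max(0.0, 1.0 - 0.1 * (len(longer) - len(shorter)))

print(depth_score("/novel/comedy", "/novel/comedy"))      # exact match
print(depth_score("/novel/comedy", "/novel/comedy/new"))  # one level deeper
print(depth_score("/novel/comedy", "/novel"))             # one level broader
```

A function like this could then back a custom scoring implementation once the desired behaviour is pinned down.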

Related

ArangoDB - How to support case-insensitive n-gram index

I have successfully created an n-gram analyzer linked to an ArangoSearch view. The document field being indexed contains mixed-case string content, but I would like users to be able to run case-insensitive queries against it. There is no option for case in the n-gram analyzer properties, so I'm wondering how to do this. An example query I'm running is as follows:
"for doc in myview search analyzer(doc.field in tokens('some input text','myanalyzer'), 'myanalyzer') sort BM25(doc) desc return doc"
This does not (fully) match fields containing "Some Input Text" due to case. Does anyone have recommendations to accomplish this? Thanks!
This is possible since v3.8.0, which introduced a new type of analyzer: pipeline.
With it you can chain the effects of multiple analyzers together.
arangosh> var analyzers = require("@arangodb/analyzers");
arangosh> var a = analyzers.save("ngram_upper", "pipeline", { pipeline: [
........>   { type: "norm", properties: { locale: "en.utf-8", case: "upper" } },
........>   { type: "ngram", properties: { min: 2, max: 2, preserveOriginal: false, streamType: "utf8" } }
........> ] }, ["frequency", "norm", "position"]);
arangosh> db._query(`RETURN TOKENS("Quick brown foX", "ngram_upper")`).toArray();
Source: https://www.arangodb.com/docs/stable/analyzers.html#pipeline
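With the chained analyzer in place, the original AQL query can simply reference it; a sketch, assuming the view and field names from the question:

```
FOR doc IN myview
  SEARCH ANALYZER(doc.field IN TOKENS("some input text", "ngram_upper"), "ngram_upper")
  SORT BM25(doc) DESC
  RETURN doc
```

Because the norm stage upper-cases both the indexed tokens and the query tokens, mixed-case input like "Some Input Text" now matches.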

MongoDB schema: store id as FK or whole document

I am designing a MongoDB structure (the model structure in a NodeJS app, actually). I will have players and matches collections.
Is it better to store only the ids of the matches the player joined inside each player's object (like a FK in an RDBMS), or to store the whole match object inside the player object?
In the application, one of the actions will be to show the details of a match, and on this view the user will see the players that joined this particular match (their names, country, etc.). That makes me think that storing the whole Match document inside the Player document is better.
Any tips?
Storing the whole Match document inside the Player document is not a good option, I think.
Your player document would need to be updated every time the player plays in a match.
You have 2 main alternatives:
1-) Using child referencing (referencing players in the match).
So if we want to implement this using mongoose models:
Player model:
const mongoose = require("mongoose");

const playerSchema = mongoose.Schema({
  name: String,
  country: String
});

const Player = mongoose.model("Player", playerSchema);
module.exports = Player;
Match model:
const mongoose = require("mongoose");

const matchSchema = mongoose.Schema({
  date: {
    type: Date,
    // pass the function itself, not Date.now(), so the default
    // is evaluated per document rather than once at schema creation
    default: Date.now
  },
  players: [
    {
      type: mongoose.Schema.Types.ObjectId,
      ref: "Player"
    }
  ]
});

const Match = mongoose.model("Match", matchSchema);
module.exports = Match;
With these models, our match document will look like this (referencing player ids):
{
  "_id" : ObjectId("5dc419eff6ba790f4404fd07"),
  "date" : ISODate("2019-11-07T16:19:39.691+03:00"),
  "players" : [
    ObjectId("5dc41836985aaa22c0c4d423"),
    ObjectId("5dc41847985aaa22c0c4d424"),
    ObjectId("5dc4184e985aaa22c0c4d425")
  ],
  "__v" : 0
}
And we can use this route to get match info with all players info:
const Match = require("../models/match");

router.get("/match/:id", async (req, res) => {
  const match = await Match.findById(req.params.id).populate("players");
  res.send(match);
});
And the result will be like this:
[
  {
    "date": "2019-11-07T13:19:39.691Z",
    "players": [
      {
        "_id": "5dc41836985aaa22c0c4d423",
        "name": "player 1",
        "country": "country 1",
        "__v": 0
      },
      {
        "_id": "5dc41847985aaa22c0c4d424",
        "name": "player 2",
        "country": "country 1",
        "__v": 0
      },
      {
        "_id": "5dc4184e985aaa22c0c4d425",
        "name": "player 3",
        "country": "country 2",
        "__v": 0
      }
    ],
    "_id": "5dc419eff6ba790f4404fd07",
    "__v": 0
  }
]
2-) Embedding players inside the match, while still keeping an independent players collection.
But this will need more space than the first option.
So a match will look like this in the matches collection:
{
  "date": "2019-11-07T13:19:39.691Z",
  "players": [
    {
      "_id": "5dc41836985aaa22c0c4d423",
      "name": "player 1",
      "country": "country 1",
      "__v": 0
    },
    {
      "_id": "5dc41847985aaa22c0c4d424",
      "name": "player 2",
      "country": "country 1",
      "__v": 0
    },
    {
      "_id": "5dc4184e985aaa22c0c4d425",
      "name": "player 3",
      "country": "country 2",
      "__v": 0
    }
  ],
  "_id": "5dc419eff6ba790f4404fd07",
  "__v": 0
}
But this may be a little faster when getting match info, since there is no need to populate the players' info.
const Match = require("../models/match");

router.get("/match/:id", async (req, res) => {
  const match = await Match.findById(req.params.id);
  res.send(match);
});
The way I see it, the matches collection here is a collection of documents that exist independently and are then connected with the players that participate in the matches. With that said, I would go with an array of match keys.
I would suggest going for a nested document structure only if the nested document can be considered "owned" by the parent document. For example, a todo nested document inside a todoList document.
This is a case of a many-to-many relationship.
I am guessing that there will be about 100 players and 100 matches, initially.
The design options are embedding or referencing.
(1) Embedding:
The most queried side will have the less queried side embedded.
Based on your requirement (show the details of the match, and on this view the user will see the players that joined this particular match and their details), the match side will have the player data embedded.
The result is two collections. The main one is matches. The secondary one is players; this holds all the source data for a player (id, name, dob, country, and other details).
Only a few players' data is stored per match, and only a subset of a player's data is stored in the matches collection.
This results in duplication of player data. That is fine; it is mostly static info that is duplicated, things like name and country. But some of it may need updates over time, and the application needs to take care of this.
The player data is stored as an array of embedded documents in the matches collection. This design is a possible solution.
matches:
_id
matchId
date
place
players [ { playerId 1, name1, country1 }, { playerId 2, ... }, ... ]
outcome
players:
_id
name
dob
country
ranking
(2) Referencing:
This also has two collections: players and matches.
Referencing can happen on either side: the matches can reference the players, or vice versa.
Based on the requirement, the most queried side holds references to the less queried side; matches will have the player id references, as an array of player ids.
matches:
_id
matchId
date
place
players [ playerId 1, playerId 2, ... ]
The players collection will have the same data as in earlier case.
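The read path of the referencing option can be sketched language-agnostically; a toy in plain Python, where in-memory dicts stand in for the two collections (all names and data are illustrative):

```python
# Toy in-memory stand-ins for the two collections (illustrative data).
players = {
    "p1": {"_id": "p1", "name": "player 1", "country": "country 1"},
    "p2": {"_id": "p2", "name": "player 2", "country": "country 2"},
}
matches = {
    "m1": {"_id": "m1", "players": ["p1", "p2"]},  # child references
}

def populate(match_id: str) -> dict:
    """Resolve player id references into full player documents,
    mimicking what mongoose's populate() does with an extra query."""
    match = dict(matches[match_id])  # copy so the stored doc stays id-only
    match["players"] = [players[pid] for pid in match["players"]]
    return match

resolved = populate("m1")
print([p["name"] for p in resolved["players"]])
```

The stored match stays small and duplication-free; the cost is the extra lookup at read time, which is exactly the trade-off between the two options above.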

Azure Search. How to get result counts when having pagination

Let's say I have the following schema in an Azure Search collection, with hundreds of records.
{
  "id": "1",
  "Status": "Available",
  "name": "demo 1"
},
{
  "id": "2",
  "Status": "Available",
  "name": "demo 1"
},
{
  "id": "3",
  "Status": "Removed",
  "name": "demo 1"
},
{
  "id": "4",
  "Status": "Booked",
  "name": "demo 4"
}
My Status field can have three different values:
"Booked", "Available", "Removed".
Now I am fetching the data with pagination, using Skip and Top, from Azure Search.
However, like the aggregation functions in Elasticsearch, is there a way in Azure Search to get the total number of sites having status Booked, Available, not Removed, etc.?
I can't do the count on the client side, because I will only have a limited number of records, not all records from Azure Search.
If you are using a search index client object (Microsoft.Azure.Search.SearchIndexClient) to search for your documents, you can provide a parameter object (Microsoft.Azure.Search.Models.SearchParameters) that has a property IncludeTotalResultCount. Once you set this property to true and call Documents.Search(...) with the parameter object, your response object (Microsoft.Azure.Search.DocumentSearchResult) will contain in its Count property the total count for the filter, regardless of how many items the pagination selects.
SearchParameters searchParameter = new SearchParameters
{
    Filter = "organizationId eq '1'",
    Skip = 0,
    Top = 20,
    IncludeTotalResultCount = true
};

using (var client = new SearchIndexClient(...))
{
    DocumentSearchResult response = client.Documents.Search("*", searchParameter);
    // pagination items
    var collection = response.Results.ToArray();
    // total items
    var counter = response.Count;
}
When you create the index, make the status field facetable and filterable. Then issue a facets query on status, with $filter=status ne 'Booked'. This will give you the total counts of documents in each status category other than Booked, irrespective of pagination.
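A sketch of such a request against the Search REST API (the index name and api-version are placeholders; the body shape follows the Search Documents contract, with "count" playing the role of IncludeTotalResultCount):

```
POST /indexes/myindex/docs/search?api-version=2020-06-30
{
  "search": "*",
  "filter": "Status ne 'Booked'",
  "facets": [ "Status" ],
  "top": 20,
  "skip": 0,
  "count": true
}
```

The response then carries both the paginated hits and, in its facets section, the per-status totals across the whole filtered set.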

Global Search in Elastic Search

Working with Elasticsearch, my use case is very straightforward. When a user types into a search box, I want to search my entire data set irrespective of field or column or any condition (search all data and return all occurrences of the searched word in the documents).
This might be covered in the documentation, but I'm not able to understand it. Can somebody explain?
The easiest way to search across all fields in an index is to use the _all field.
The _all field is a catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored.
For example:
PUT my_index/user/1
{
  "first_name": "John",
  "last_name": "Smith",
  "date_of_birth": "1970-10-24"
}

GET my_index/_search
{
  "query": {
    "match": {
      "_all": "john smith 1970"
    }
  }
}
Highlighting is supported so matching occurrences can be returned in your search results.
Drawbacks
There are two main drawbacks to this approach:
Additional disk space and memory are needed to store the _all field
You lose flexibility in how the data and search terms are analysed
A better approach is to disable the _all field and instead list out the fields you are interested in:
GET /_search
{
  "query": {
    "query_string": {
      "query": "this AND that OR thus",
      "fields": [
        "name",
        "addressline1",
        "dob",
        "telephone",
        "country",
        "zipcode"
      ]
    }
  }
}
query_string (link) can do this job for you.
It supports partial search effectively; here is my analysis: https://stackoverflow.com/a/43321606/2357869.
query_string is more powerful than the match, term and wildcard queries.
Scenario 1 - Suppose you want to search for "Hello":
Then go with:
{
  "query": {
    "query_string": { "query": "*Hello*" }
  }
}
It will match words like ABCHello, HelloABC, ABCHelloABC.
By default it searches for hello in all fields (_all).
Scenario 2 - Suppose you want to search for "Hello" or "World":
Then go with:
{
  "query": {
    "query_string": { "query": "*Hello* *World*" }
  }
}
It will match words like ABCHello, HelloABC, ABCHelloABC, ABCWorldABC, ABChello, ABCworldABC, etc.
It searches Hello OR World, so any document containing Hello or World is returned.
By default query_string (link) uses the OR operator; you can change that.
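The wildcard semantics described above can be sanity-checked in plain Python; fnmatchcase stands in for the pattern matching, and lowercasing both sides mimics the analyzed, case-insensitive behaviour (an illustration, not how Elasticsearch is implemented):

```python
from fnmatch import fnmatchcase

def wildcard_match(term: str, pattern: str) -> bool:
    """Case-insensitive wildcard match, mimicking how an analyzed
    field lowercases both the indexed term and the wildcard query."""
    return fnmatchcase(term.lower(), pattern.lower())

terms = ["ABCHello", "HelloABC", "ABCHelloABC", "ABCWorldABC", "plain"]
# OR semantics: a term matching either pattern is a hit.
hits = [t for t in terms
        if wildcard_match(t, "*Hello*") or wildcard_match(t, "*World*")]
print(hits)
```

Every term containing hello or world (in any case) is a hit, which mirrors the default OR behaviour of query_string.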

Elastic search having "not_analyzed" and "analyzed" together

I'm new to Elasticsearch. My business need is that I should also do partial matching on searchable fields, so I ended up with wildcard queries. My query is like this:
{
  "query" : {
    "wildcard" : "*search_text_here*"
  }
}
Suppose I'm searching for Red Flowers. With the analyzed match query I used before the above query, I got results for Red and Flowers individually; but now my query only works when both Red and Flowers are present together.
Use a match phrase query as shown below; for more information refer to the ES docs:
GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "red flowers"
    }
  }
}
