Elastic opposite query - search

I'm using Elasticsearch for classic queries like "search all documents with G4 in name and LG in manufacturer". This works fine. But what if I have a lot of documents and a database with a lot of search terms, and I need to know which documents match some specific multi-column terms? For example:
Documents:
[
{
"id": 5787,
"name": "Smartphone G4",
"manufacturer": "LG",
"description": "The revolutionary LG G4 design can only be described as forward thinking—with a classic touch."
},
{
"id": 68779,
"name": "Smartphone S6",
"manufacturer": "Samsung",
"description": "The Samsung Galaxy S6 is powerful to use and beautiful to behold."
}
]
...
Terms:
[
{
"id": "587",
"name": "G4",
"manufacturer": "LG",
"description": "classic touch"
},
{
"id": "364",
"manufacturer": "Samsung",
"description": "galaxy s6"
}
]
...
Result:
{
"587": [5787],
"364": [68779]
}
OR:
{
"5787": [587],
"68779": [364]
}
I need a list of documents and the list of terms that correspond to them (or the opposite). With a small number of terms, it would be possible to apply all rules one by one and save the matching documents. But I have millions of documents and thousands of terms, so it is not possible to apply them one by one. Is there another way to do this?

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html is exactly what I wanted. It can store your queries and execute them against documents.
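As a rough sketch of how this could look with the percolator field type and the percolate query found in recent Elasticsearch versions: each term set is stored as a query, and a document is then sent in a percolate query to find the stored queries that match it. The index name `term-queries`, the field name `query`, the helper names, and the choice of `match` / `match_phrase` clauses are assumptions for illustration. The code only builds request bodies and does not talk to a cluster.

```javascript
// Mapping for a hypothetical "term-queries" index. The document fields are
// mapped as well so that the stored queries can be parsed against them.
const percolatorMapping = {
  mappings: {
    properties: {
      query: { type: "percolator" },
      name: { type: "text" },
      manufacturer: { type: "text" },
      description: { type: "text" },
    },
  },
};

// Turn one term set into a stored query document
// (body for something like PUT term-queries/_doc/<term id>).
function termToStoredQuery(term) {
  const must = [];
  if (term.name) must.push({ match: { name: term.name } });
  if (term.manufacturer) must.push({ match: { manufacturer: term.manufacturer } });
  if (term.description) must.push({ match_phrase: { description: term.description } });
  return { query: { bool: { must } } };
}

// Build the search body that asks which stored queries match a document
// (body for something like GET term-queries/_search).
function percolateRequest(doc) {
  const { id, ...fields } = doc; // the id is not part of the searchable fields
  return { query: { percolate: { field: "query", document: fields } } };
}

const exampleStored = termToStoredQuery({
  id: "587",
  name: "G4",
  manufacturer: "LG",
  description: "classic touch",
});
console.log(JSON.stringify(exampleStored, null, 2));
```

The hits returned for a percolate request are the stored term queries, so their `_id` values would give the "document -> terms" direction of the result.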

Related

Searching for sub-objects with a date range containing the queried date value

Let's say we're handling the advertising of various job openings across several channels (newspapers, job boards, etc.). For each channel, we can buy a "publication period" which will mean the channel will advertise our job openings during that period. How can we find the jobs for a given channel that have a publication period valid for today (i.e. starting on or before today, and ending on or after today)? The intent is to be able to generate a feed of "active" job openings that (e.g.) a job board can consume periodically to determine which jobs should be displayed to its users.
Another wrinkle is that each job opening is associated with a given tenant id: the feeds will have to be generated scoped to tenant and channel.
Let's say we have the following simplified documents (if you think the data should be modeled differently, please let me know also):
{
"_id": "A",
"tenant_id": "foo",
"name": "Job A",
"publication_periods": [
{
"channel": "linkedin",
"start": "2021-03-10T00:00:0.0Z",
"end": "2021-03-17T00:00:0.0Z"
},
{
"channel": "linkedin",
"start": "2021-04-10T00:00:0.0Z",
"end": "2021-04-17T00:00:0.0Z"
},
{
"channel": "monster.com",
"start": "2021-03-10T00:00:0.0Z",
"end": "2021-03-17T00:00:0.0Z"
}
]
}
{
"_id": "B",
"tenant_id": "foo",
"name": "Job B",
"publication_periods": [
{
"channel": "linkedin",
"start": "2021-04-10T00:00:0.0Z",
"end": "2021-04-17T00:00:0.0Z"
},
{
"channel": "monster.com",
"start": "2021-03-15T00:00:0.0Z",
"end": "2021-03-20T00:00:0.0Z"
}
]
}
{
"_id": "C",
"tenant_id": "foo",
"name": "Job C",
"publication_periods": [
{
"channel": "monster.com",
"start": "2021-05-15T00:00:0.0Z",
"end": "2021-05-20T00:00:0.0Z"
}
]
}
{
"_id": "D",
"tenant_id": "bar",
"name": "Job D",
"publication_periods": [
...
]
}
How can I query the jobs linked to tenant "foo" that have an active publication period for "monster.com" for the date of 17.03.2021? (I.e. this query should return both jobs A and B.)
Note that the DB will contain documents of other (irrelevant) types.
Since I essentially need to "find all job openings containing an object in the publication_periods array having: CHAN as the channel value, "start" <= DATE, "end" >= DATE" it appears I'd require a Mango query to achieve this, as standard view queries don't provide comparison operators (if this is mistaken, please correct me).
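Stated as code, the condition is roughly the following. This is only an in-memory sketch of the predicate, not a CouchDB query; the helper names `hasActivePeriod` and `activeJobs` are made up for illustration, and the timestamps are compared as strings, which works here only because they all share the same format.

```javascript
// Sample jobs from above, reduced to the fields the predicate needs.
const jobs = [
  { _id: "A", tenant_id: "foo", publication_periods: [
    { channel: "linkedin", start: "2021-03-10T00:00:0.0Z", end: "2021-03-17T00:00:0.0Z" },
    { channel: "linkedin", start: "2021-04-10T00:00:0.0Z", end: "2021-04-17T00:00:0.0Z" },
    { channel: "monster.com", start: "2021-03-10T00:00:0.0Z", end: "2021-03-17T00:00:0.0Z" },
  ] },
  { _id: "B", tenant_id: "foo", publication_periods: [
    { channel: "linkedin", start: "2021-04-10T00:00:0.0Z", end: "2021-04-17T00:00:0.0Z" },
    { channel: "monster.com", start: "2021-03-15T00:00:0.0Z", end: "2021-03-20T00:00:0.0Z" },
  ] },
  { _id: "C", tenant_id: "foo", publication_periods: [
    { channel: "monster.com", start: "2021-05-15T00:00:0.0Z", end: "2021-05-20T00:00:0.0Z" },
  ] },
];

// True if a single period object satisfies all three conditions at once.
function hasActivePeriod(job, channel, date) {
  return (job.publication_periods || []).some(
    (p) => p.channel === channel && p.start <= date && p.end >= date
  );
}

// Scope by tenant first, then apply the per-period condition.
function activeJobs(allJobs, tenantId, channel, date) {
  return allJobs
    .filter((job) => job.tenant_id === tenantId && hasActivePeriod(job, channel, date))
    .map((job) => job._id);
}

console.log(activeJobs(jobs, "foo", "monster.com", "2021-03-17T00:00:0.0Z")); // [ 'A', 'B' ]
```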
Naturally, I want the Mango query to be executed only on relevant data (i.e. exclude documents that aren't job openings), but I can't find references on how to do this (whether in the docs or elsewhere): all resources I found simply seem to define the Mango index on the entire set of documents, relying on the fact that documents where the indexed field is absent won't be indexed.
How can I achieve what I'm after?
Initially, I was thinking of creating a view that would emit the publication period information along with a {'_id': id} object in order to "JOIN" the job opening document to the matching periods at query time (per Best way to do one-to-many "JOIN" in CouchDB). However, I realized that I wouldn't be able to query this view as needed (i.e. "start" value before today, "end" value after today) since I wouldn't have a definite start/end key to use... And I have no idea how to properly leverage a Mango index/query for this. Presumably I'd have to create a partial index based on document type and the presence of publication periods, but how can I even index the multiple publication periods that can be located within a single document? Can a Mango index be defined against a specific view as opposed to all documents in the DB?
I stumbled upon this answer Mango search in Arrays indicating that I should be able to index the data with
{
"index": {
"fields": [
"tenant_id",
"publication_periods.[].channel",
"publication_periods.[].start",
"publication_periods.[].end"
]
},
"ddoc": "job-openings-periods-index",
"type": "json"
}
And then query them with
{
"selector": {
"tenant_id": "foo",
"publication_periods": {
"$elemMatch": {
"$and": [
{
"channel": "monster.com"
},
{
"start": {
"$lte": "2021-03-17T00:00:0.0Z"
}
},
{
"end": {
"$gte": "2021-03-17T00:00:0.0Z"
}
}
]
}
}
},
"use_index": "job-openings-periods-index",
"execution_stats": true
}
Sadly, I'm informed that the index "was not used because it does not contain a valid index for this query", and I get terrible performance, which I will leave for another question.
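On the side question of restricting an index to job-opening documents: CouchDB's `_index` endpoint accepts a `partial_filter_selector` inside the index definition. The sketch below only builds such a request body. It assumes the documents carry a `type` field with the value `job_opening`, which the sample documents above do not have, and the ddoc name is hypothetical. It does not address the array-field problem from the error message.

```javascript
// Build the body for POST /{db}/_index with a partial filter.
// "type" and "job_opening" are assumed; the sample docs have no such field.
function buildPartialIndex(ddocName, docType, fields) {
  return {
    index: {
      // Only documents matching this selector are added to the index.
      partial_filter_selector: { type: docType },
      fields: fields,
    },
    ddoc: ddocName,
    type: "json",
  };
}

const partialIndex = buildPartialIndex(
  "job-openings-partial-index", // hypothetical design document name
  "job_opening",
  ["tenant_id"]
);
console.log(JSON.stringify(partialIndex, null, 2));
```

As far as I know, a partial index is only picked up when the query names it explicitly through `use_index`, so the query would need to reference the ddoc name used here.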

Is it possible to have varying data structures in an Azure search index?

Below is some of the data I'm putting into an Azure search index:
I could go with this rigid structure but it needs to support different data types. I could keep adding fields - i.e. Field4, Field5, ... but I wondered if I could have something like a JSON field? So the index could be modelled like below:
[
{
"entityId":"dba656d3-f044-4cc0-9930-b5e77e664a8f",
"entityName":"character",
"data":{
"name":"Luke Skywalker",
"role":"Jedi"
}
},
{
"entityId":"b37bf987-0978-4fc4-9a51-b02b4a5eed53",
"entityName":"character",
"data":{
"name":"C-3PO",
"role":"Droid"
}
},
{
"entityId":"b161b9dc-552b-4744-b2d7-4584a9673669",
"entityName":"film",
"data":{
"name":"A new hope"
}
},
{
"entityId":"e59acdaf-5bcd-4536-a8e9-4f3502cc7d85",
"entityName":"film",
"data":{
"name":"The Empire Strikes Back"
}
},
{
"entityId":"00501b4a-5279-41e9-899d-a914ddcc562e",
"entityName":"vehicle",
"data":{
"name":"Sand Crawler",
"model":"Digger Crawler",
"manufacturer":"Corellia Mining Corporation"
}
},
{
"entityId":"fe815cb6-b03c-401e-a871-396f2cd3eaba",
"entityName":"vehicle",
"data":{
"name":"TIE/LN starfighter",
"model":"Twin Ion Engine/Ln Starfighter",
"manufacturer":"Sienar Fleet Systems"
}
}
]
I know that I can put JSON in a string field, but that would negatively impact the search matching and also filtering.
Is this possible in Azure search or is there a different way to achieve this kind of requirement?
See the article How to model complex data types. The hotel example data translates nicely to your use-case I believe. If your different entities have different sets of properties you can create a "complex type" similar to the Address or Amenities example below.
Structural updates
You can add new sub-fields to a complex field at any time without the
need for an index rebuild. For example, adding "ZipCode" to Address or
"Amenities" to Rooms is allowed, just like adding a top-level field to
an index.
{
"HotelId": "1",
"HotelName": "Secret Point Motel",
"Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
"Tags": ["Free wifi", "on-site parking", "indoor pool", "continental breakfast"],
"Address": {
"StreetAddress": "677 5th Ave",
"City": "New York",
"StateProvince": "NY"
},
"Rooms": [
{
"Description": "Budget Room, 1 Queen Bed (Cityside)",
"RoomNumber": 1105,
"BaseRate": 96.99
},
{
"Description": "Deluxe Room, 2 Double Beds (City View)",
"Type": "Deluxe Room",
"BaseRate": 150.99
}
. . .
]
}
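Applied to the entity data from the question, an index definition with a complex `data` field might look roughly like the sketch below. The index name `entities` and the attribute choices (searchable, filterable, facetable) are assumptions; the sub-fields are the union of the properties that appear in the sample entities. The small helper lists the slash-separated paths under which sub-fields are addressed in queries and filters.

```javascript
// Hypothetical index definition using a complex type for "data".
const entityIndex = {
  name: "entities",
  fields: [
    { name: "entityId", type: "Edm.String", key: true, filterable: true },
    { name: "entityName", type: "Edm.String", filterable: true, facetable: true },
    {
      name: "data",
      type: "Edm.ComplexType",
      fields: [
        { name: "name", type: "Edm.String", searchable: true },
        { name: "role", type: "Edm.String", searchable: true, filterable: true },
        { name: "model", type: "Edm.String", searchable: true },
        { name: "manufacturer", type: "Edm.String", searchable: true, filterable: true },
      ],
    },
  ],
};

// List the field paths, e.g. "data/role", by walking nested complex types.
function fieldPaths(fields, prefix = "") {
  return fields.flatMap((f) =>
    f.type === "Edm.ComplexType"
      ? fieldPaths(f.fields, `${prefix}${f.name}/`)
      : [`${prefix}${f.name}`]
  );
}

console.log(fieldPaths(entityIndex.fields));
```

With this shape, entities that lack a property (a film has no `role`) simply leave that sub-field empty, and a filter such as `data/role eq 'Jedi'` would target the sub-field directly.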

Azure Search match against two properties of the same object

I would like to run a query that matches against two properties of the same item in a sub-collection.
Example:
[
{
"name": "Person 1",
"contacts": [
{ "type": "email", "value": "person.1@xpto.org" },
{ "type": "phone", "value": "555-12345" }
]
}
]
I would like to be able to search for emails that contain xpto.org but,
doing something like the following doesn't work:
search.ismatchscoring('email','contacts/type,','full','all') and search.ismatchscoring('/.*xpto.org/','contacts/value,','full','all')
instead, it will consider the condition in the context of the main object and objects like the following will also match:
[
{
"name": "Person 1",
"contacts": [
{ "type": "email", "value": "555-12345" },
{ "type": "phone", "value": "person.1@xpto.org" }
]
}
]
Is there any way around this without having an additional field that concatenates type and value?
Just saw the official doc. At this moment, there's no support for correlated search:
This happens because each clause applies to all values of its field in
the entire document, so there's no concept of a "current sub-document"
https://learn.microsoft.com/en-us/azure/search/search-howto-complex-data-types
and https://learn.microsoft.com/en-us/azure/search/search-query-understand-collection-filters
The solution I've implemented was creating different collections per contact type.
This way I'm able to search directly in, let's say, the email collection without the need for correlated search. It might not be the solution for all cases but it works well in this case.
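A minimal sketch of that reshaping step, run before documents are uploaded to the index. The target field names `emails` and `phones` and the helper name `splitContacts` are assumptions; the answer above does not say how the collections were named.

```javascript
// Map each contact type to its own collection field (names are assumed).
const FIELD_BY_TYPE = { email: "emails", phone: "phones" };

// Replace the mixed "contacts" array with one string collection per type.
function splitContacts(person) {
  const doc = { name: person.name, emails: [], phones: [] };
  for (const contact of person.contacts || []) {
    const field = FIELD_BY_TYPE[contact.type];
    if (field) doc[field].push(contact.value); // unknown types are skipped
  }
  return doc;
}

const reshaped = splitContacts({
  name: "Person 1",
  contacts: [
    { type: "email", value: "person.1@xpto.org" },
    { type: "phone", value: "555-12345" },
  ],
});
console.log(reshaped);
```

A search restricted to the `emails` field then cannot be satisfied by a phone entry, because type and value no longer live in separate fields of the same sub-collection.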

Azure Search - phonetic search implementation

I was trying out phonetic search using Azure Search without much luck. My objective is to work out an index configuration that can handle typos and accommodate phonetic search for end users.
With the configuration and sample data below, I was trying to search for intentionally misspelled words like 'softvare' or 'alek'. I got results for 'alek' thanks to the phonetic analyzer, but didn't get any results for 'softvare'.
It looks like phonetic search will not do the trick for this requirement.
The only option I found was to use a synonym map. The major pitfall is that I'm unable to use the phonetic / custom analyzer along with synonyms :(
What are the various strategies that you would recommend for taking care of typos?
search query used
?api-version=2017-11-11&search=alec
?api-version=2017-11-11&search=softvare
Here is the index configuration
"name": "phonetichotels",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard"},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard"},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
Analyzer (part of the index creation)
"analyzers":[
{
"name":"my_standard",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
Analyze API Input and Output for 'software'
{
"analyzer":"my_standard",
"text": "software"
}
{
"@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTW",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Analyze API Input and Output for 'softvare'
{
"analyzer":"my_standard",
"text": "softvare"
}
{
"@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTF",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Sample data that I loaded
{
"@search.action": "upload",
"hotelId": "5",
"baseRate": 199.0,
"description": "Best hotel in town for software people",
"hotelName": "Fancy Stay",
"category": "Luxury",
"tags": ["pool", "view", "wifi", "concierge"],
"parkingIncluded": false,
"smokingAllowed": false,
"lastRenovationDate": "2010-06-27T00:00:00Z",
"rating": 5,
"location": { "type": "Point", "coordinates": [-122.131577, 47.678581] }
},
{
"@search.action": "upload",
"hotelId": "6",
"baseRate": 79.99,
"description": "Cheapest hotel in town ",
"hotelName": " Alec Baldwin Motel",
"category": "Budget",
"tags": ["motel", "budget"],
"parkingIncluded": true,
"smokingAllowed": true,
"lastRenovationDate": "1982-04-28T00:00:00Z",
"rating": 1,
"location": { "type": "Point", "coordinates": [-122.131577, 49.678581] }
},
With the right configuration, I should have got results even with the misspelled words.
I work on Azure Search. Before I suggest approaches to handle misspelled words, it would be helpful to look at your custom analyzer (my_standard) configuration. It might tell us why it's not able to handle the case for 'softvare'. As a DIY, you can use the Analyze API to see the tokens created using your custom analyzer and it should contain 'software' to actually match the docs.
Now then, here are a few ways that can be used independently or in conjunction to handle misspelled words. The best approach varies depending on the use-case and I strongly suggest you experiment with these to figure out the best one in your case.
You are already familiar with phonetic filters, which are a common approach to handling similarly pronounced terms. If you haven't already, try different encoders for the filter to evaluate which configuration gives you the best results. Check out the list of encoders here.
Use fuzzy queries, which are supported as part of the Lucene query syntax in Azure Search and return terms that are near the original query term based on a distance metric. The limitation here is that fuzzy matching works on a single term. Check the docs for more details. A sample query would look like search=softvare~1. You can also use term boosting to give the original term more weight in cases where the original term is also a valid term.
You also alluded to synonyms, which can also be used to handle misspelled query terms. This approach gives you the most control over the process of handling typos but also requires you to have prior knowledge of the different typos for each term. You can use these docs if you want to experiment with synonyms.
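To make the "distance metric" behind the fuzzy option concrete, here is a small sketch of plain Levenshtein edit distance. Lucene's fuzzy matching is based on Damerau-Levenshtein, which additionally counts a transposition of adjacent characters as one edit, so this is a simplified stand-in and not the engine's implementation. It also compares raw strings and ignores whatever analyzer is configured on the field.

```javascript
// Plain Levenshtein distance: minimum number of single-character
// insertions, deletions and substitutions needed to turn a into b.
function editDistance(a, b) {
  // prev[j] holds the distance between the first i-1 chars of a and first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep when the chars are equal
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(editDistance("softvare", "software")); // 1
console.log(editDistance("alek", "alec"));         // 1
```

Both misspellings are a single substitution away from the intended word, which is why a query term with `~1` is enough to reach them.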
As you can read in my post, my objective was to handle typos.
The only easy option is to use the built-in Lucene functionality: fuzzy search. I have yet to check the response times, as the querytype has to be set to 'full' to use fuzzy search. Otherwise, the results were satisfactory.
Example:
search=softvare~&fuzzy=true&querytype=full
returns all documents containing 'software'.
For further reading, please see the documentation.

Optimal way to model documents hierarchy in CouchDB

I'm trying to model a document hierarchy in CouchDB to use in my system, which is conceptually similar to a blog. Each blog post belongs to at least one category and each category can have many posts. Categories are hierarchical, meaning that if a post belongs to CatB in the hierarchy "CatA->CatB" ("CatB is in CatA"), it also belongs to CatA.
Users must be able to quickly find all posts in a category (and all its children).
Solution 1
Each document of the post type contains a "category" array representing its position in the hierarchy (see 2).
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category":["OO","Programming","C++"]
}
Solution 2
Each document of the post type contains the "category" string representing its path in the hierarchy (see 4).
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category": "OO/Programming/C++"
}
Solution 3
Each document of the post type contains its parent "category" id representing its path in the hierarchy (see 3). A hierarchical category structure is built through linked "category" document types.
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category_id": "3"
}
{
"_id": "1",
"type": "category",
"name": "OO"
}
{
"_id": "2",
"type": "category",
"name": "Programming",
"parent": "1"
}
{
"_id": "3",
"type": "category",
"name": "C++",
"parent": "2"
}
Question
What's the best way to store this kind of relationship in CouchDB? What's the most efficient solution in terms of disk space, scalability and retrieval speed?
Can such a relation be modelled to take into account localised category names?
Disclaimer
I know this question has been asked a few times already here on SO, but it seems there's no definitive answer to it nor an answer which deals with the pros and cons of each solution. Sorry for the length of the question :)
Read so far
CouchDB - The Definitive Guide
Storing Hierarchical Data in CouchDB
Retrieving Hierarchical/Nested Data From CouchDB
Using CouchDB group_level for hierarchical data
There's no right answer to this question, hence the lack of a definitive answer. It mostly depends on what kind of usage you want to optimize for.
You state that retrieval speed of documents that belong to a certain category (and their children) is most important. The first two solutions allow you to create a view that emits a blog post multiple times, once for each category in the chain from the leaf to the root. Thus selecting all documents can be done using a single (and thus fast) query. The only difference between the second and the first solution is that you move the parsing of the category "path" into components from the code that inserts the document to the map function of the view. I would prefer the first solution as it's simpler to implement the map function and a bit more flexible (e.g. it allows a category's name to contain a slash character).
In your scenario you probably also want to create a reduced view which counts the number of blog posts for each category. This is very simple with either of these solutions. With a fitting reduce function, the number of posts in every category can be retrieved using a single request.
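A minimal sketch of such a view for Solution 1. It emits one row per prefix of the post's category array, so a post filed under C++ is also found under Programming and OO. Emitting path prefixes as keys (instead of bare category names) is my own choice here, and the `emit` function below is only a local stand-in so the snippet runs outside CouchDB; inside a design document only the map function would be stored.

```javascript
// Local stand-in for CouchDB's emit(), collecting the rows a view would contain.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function for Solution 1: one row per level of the category chain.
function map(doc) {
  if (doc.type === 'post' && Array.isArray(doc.category)) {
    for (var i = 1; i <= doc.category.length; i++) {
      // ["OO"], ["OO","Programming"], ["OO","Programming","C++"]
      emit(doc.category.slice(0, i), null);
    }
  }
}

map({
  _id: "8e7a440862347a22f4a1b2ca7f000e83",
  type: "post",
  author: "dexter",
  title: "Hello",
  category: ["OO", "Programming", "C++"]
});
console.log(JSON.stringify(rows.map(function (r) { return r.key; })));
```

Querying such a view with key=["OO","Programming"] would then return every post in Programming or any category below it.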
A downside of the first two solutions is that renaming or moving a category from one parent to another requires every document to be updated. The third solution allows that without touching the documents. But from the description of your scenario I assume that retrieval by category is very frequent and category renaming/moving is very rare.
Solution 4
I propose a fourth solution where blog post documents hold references to category documents but still reference all the ancestors of the post's category. This allows categories to be renamed without touching the blog posts and allows you to store additional metadata with a category (e.g. translations of the category name or a description):
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category_ids": ["3", "2", "1"]
}
{
"_id": "1",
"type": "category",
"name": "OO"
}
{
"_id": "2",
"type": "category",
"name": "Programming",
"parent": "1"
}
{
"_id": "3",
"type": "category",
"name": "C++",
"parent": "2"
}
You will still have to store the parents of categories with the categories, which is duplicating data in the posts, to allow categories to be traversed (e.g. for displaying a tree of categories for navigation).
You can extend this solution or any of your solutions to allow a post to be categorized under multiple categories, or a category to have multiple parents. When a post is categorized in multiple categories, you will need to store the union of the ancestors of each category in the post's document while preserving the categories selected by the author to allow them to be displayed with the post or edited later.
Let's assume that there is an additional category named "Ajax" with ancestors "JavaScript", "Programming" and "OO". To simplify the following example, I've chosen the document IDs of the categories to equal the category's name.
{
"_id": "8e7a440862347a22f4a1b2ca7f000e83",
"type": "post",
"author": "dexter",
"title": "Hello",
"category_ids": ["C++", "Ajax"],
"category_anchestor_ids": ["C++", "Programming", "OO", "Ajax", "JavaScript"]
}
To allow a category to have multiple parents, just store multiple parent IDs with a category. You will need to eliminate duplicates while finding all the ancestors of a category.
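A rough sketch of that ancestor lookup with duplicate elimination, run by the application when a post is saved. The `parents` array field, the extra "Web" category, and the helper name `collectAncestors` are hypothetical and only serve to show a category with two parents. Like the `category_anchestor_ids` example above, the result includes the category itself.

```javascript
// Hypothetical categories keyed by ID; "Ajax" has two parents.
const categoriesById = {
  "OO": { parents: [] },
  "Programming": { parents: ["OO"] },
  "JavaScript": { parents: ["Programming"] },
  "Web": { parents: ["Programming"] },
  "Ajax": { parents: ["JavaScript", "Web"] },
};

// Walk up the parent links and collect every category reached exactly once.
function collectAncestors(categories, categoryId) {
  const seen = new Set();
  const stack = [categoryId];
  while (stack.length > 0) {
    const id = stack.pop();
    if (seen.has(id)) continue; // already reached through another parent
    seen.add(id);
    const category = categories[id];
    const parents = (category && category.parents) || [];
    for (const parentId of parents) stack.push(parentId);
  }
  return Array.from(seen);
}

console.log(collectAncestors(categoriesById, "Ajax"));
```

"Programming" and "OO" are reachable through both parents of "Ajax" but appear only once in the result; the set also keeps the walk from looping if the data ever contains a cycle.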
View for Solution 4
Suppose you want to get all the blog posts for a specific category. We will use a database with the following sample data:
{ "_id": "100", "type": "category", "name": "OO" }
{ "_id": "101", "type": "category", "name": "Programming", "parent_id": "100" }
{ "_id": "102", "type": "category", "name": "C++", "parent_id": "101" }
{ "_id": "103", "type": "category", "name": "JavaScript", "parent_id": "101" }
{ "_id": "104", "type": "category", "name": "AJAX", "parent_id": "103" }
{ "_id": "200", "type": "post", "title": "OO Post", "category_id": "100", "category_anchestor_ids": ["100"] }
{ "_id": "201", "type": "post", "title": "Programming Post", "category_id": "101", "category_anchestor_ids": ["101", "100"] }
{ "_id": "202", "type": "post", "title": "C++ Post", "category_id": "102", "category_anchestor_ids": ["102", "101", "100"] }
{ "_id": "203", "type": "post", "title": "AJAX Post", "category_id": "104", "category_anchestor_ids": ["104", "103", "101", "100"] }
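In addition to that, we use a view called posts_by_category in a design document called _design/blog with the following map function:
function (doc) {
  // Only posts are indexed; other document types are skipped.
  if (doc.type === 'post' && doc.category_anchestor_ids) {
    // Emit one row per ancestor category so the post is found under each of them.
    for (var i = 0; i < doc.category_anchestor_ids.length; i++) {
      emit([doc.category_anchestor_ids[i]], doc)
    }
  }
}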
Then we can get all the posts in the Programming category (which has ID "101") or one of its subcategories using a GET request to the following URL:
http://localhost:5984/so/_design/blog/_view/posts_by_category?reduce=false&key=["101"]
This will return a view result with the keys set to the category ID and the values set to the post documents. The same view can also be used to get a summary list of all categories and the number of posts in that category and its children. We add the following reduce function to the view:
function (keys, values, rereduce) {
if (rereduce) {
return sum(values)
} else {
return values.length
}
}
And then we use the following URL:
http://localhost:5984/so/_design/blog/_view/posts_by_category?group_level=1
This will return a reduced view result with the keys again set to the category ID and the values set to the number of posts in each category. In this example, the category names would have to be fetched separately, but it is possible to create a view where each row in the reduced view result already contains the category name.
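To check the expected results without a running CouchDB, the map step and the grouped count can be simulated locally. This is only a sketch: `emit` is a stand-in for the function CouchDB provides, the helper names are made up, and the grouping loop imitates what the reduce with group_level=1 would return for the sample posts.

```javascript
// Sample posts from above, reduced to the fields the view uses.
const posts = [
  { _id: "200", type: "post", category_anchestor_ids: ["100"] },
  { _id: "201", type: "post", category_anchestor_ids: ["101", "100"] },
  { _id: "202", type: "post", category_anchestor_ids: ["102", "101", "100"] },
  { _id: "203", type: "post", category_anchestor_ids: ["104", "103", "101", "100"] },
];

// Local stand-in for CouchDB's emit().
const viewRows = [];
function emit(key, value) {
  viewRows.push({ key: key, value: value });
}

// Same logic as the posts_by_category map function.
function mapPost(doc) {
  if (doc.type === 'post' && doc.category_anchestor_ids) {
    for (var i = 0; i < doc.category_anchestor_ids.length; i++) {
      emit([doc.category_anchestor_ids[i]], doc);
    }
  }
}
posts.forEach(mapPost);

// Equivalent of ?reduce=false&key=["101"]: the posts in Programming and below.
function postsForCategory(categoryId) {
  return viewRows
    .filter(function (row) { return row.key[0] === categoryId; })
    .map(function (row) { return row.value._id; });
}

// Equivalent of ?group_level=1: number of posts per category ID.
const postCounts = {};
viewRows.forEach(function (row) {
  postCounts[row.key[0]] = (postCounts[row.key[0]] || 0) + 1;
});

console.log(postsForCategory("101")); // [ '201', '202', '203' ]
console.log(postCounts);
```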

Resources