How to build search with faceting over an unknown/unspecified set of attributes/properties? - search

I'm working on a product search engine over a large, constantly growing set of products. Each product has different attributes, and at the moment they're saved as an array of string key-value pairs like this:
"attributes": [
{
"key": "Producttype",
"value": "Headphones - 3.5 mm plug"
},
{
"key": "Weight",
"value": "280 g"
},
{
"key": "Soundmode",
"value": "Stereo"
},
....
]
Each product also has a category. I'm using Elasticsearch 2.4.x via spring-data-elasticsearch to persist the data I want to search on. Upgrading to the newest Elasticsearch version is possible if needed.
As you can see, the attributes are really generic. Nested objects are also needed to be able to search on these attributes. I'm also thinking about preprocessing the attributes into a standardized format; for example, the "Weight" key might appear in different forms like "Productweight" or "Weight of product". Because there are a lot of attributes and I wouldn't like to create a custom property/field for each one, I thought about mapping only the important ones (like weight) to their own custom fields and mapping the other attributes as described above.
Now if someone searches for, say, "iphone", I would like to show some facets on the left of the search result page, and the facets should differ if someone searches for "Adidas shoes". Is this possible with the given format above using nested objects? Is it possible to build the facets dynamically based on the result set Elasticsearch returns, e.g. using the most common properties that the result products share? Or do I have to persist predefined filters/facets for each category? I think that would be too much work, and it also wouldn't work for search results where products belong to different categories. What's the best practice for building a search feature with faceting on entities with n different properties that can grow in the future?
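For reference, here is a minimal sketch of what this could look like on Elasticsearch 2.4: a nested mapping for the generic key/value attributes, plus an aggregation (the 2.x replacement for facets) that derives the facet list from whatever the query matches. The index name "products", the type "product", and the "name" field used in the query are assumptions, not taken from the post.
curl -XPUT 'http://localhost:9200/products' -d '{
  "mappings": {
    "product": {
      "properties": {
        "attributes": {
          "type": "nested",
          "properties": {
            "key":   { "type": "string", "index": "not_analyzed" },
            "value": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}'
curl -XPOST 'http://localhost:9200/products/product/_search' -d '{
  "query": { "match": { "name": "iphone" } },
  "aggs": {
    "attributes": {
      "nested": { "path": "attributes" },
      "aggs": {
        "keys": {
          "terms": { "field": "attributes.key", "size": 10 },
          "aggs": {
            "values": { "terms": { "field": "attributes.value", "size": 10 } }
          }
        }
      }
    }
  }
}'
The nested terms aggregation on attributes.key, with a sub-aggregation on attributes.value, returns one bucket per attribute actually present in the matching products, which is essentially the "facets derived from the result set" behaviour asked about above.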

Related

CouchDB searching linked documents

I'm very new to CouchDB and I'm hoping someone can help me with a solution to this problem.
Say I have an address document that contains various keys, but importantly a singleLineAddress and a persons array:
{
  "_id": "002cb726bfe69a79ed9b897931000ec6",
  "_rev": "2-6af6d8896703e9db6f5ba97abb1ca5d7",
  "type": "address",
  ...
  "singleLineAddress": "28 CLEVEDON ROAD, WESTON-SUPER-MARE, BS23 1DG",
  ...
  "persons": ["d506d09a1c46e32f6632e6d99a0062bd", "002cb726bfe69a79ed9b897931001c80"]
}
Then I have a person document with a number of keys, crucially with firstName & lastName:
{
  "_id": "d506d09a1c46e32f6632e6d99a0062bd",
  "_rev": "4-98fae966a92d5c6c359cb8ddfaa487e1",
  "type": "person",
  ...
  "firstName": "Joe",
  "lastName": "Bloggs"
  ...
}
I understand I can create a linked-document view and emit all the person ids linked to an address, then use include_docs=true to see all the person data. But from what I'm reading, using include_docs=true is not advised as it can be expensive.
Ultimately, I'd like to use couchdb-lucene to run a full-text search against person # address using the name & address. Is that even possible using linked documents?
Using ?include_docs=true is more expensive than not using it - for every row of the index returned, the database has to fetch the related document body. But sometimes needs must :) You can avoid ?include_docs=true by "projecting" more data into the index, which is then returned to you at query time. See https://blog.cloudant.com/2021/11/12/Projection.html
As for Lucene full-text searching, you can certainly search across document types in the same collection, but your search results would consist of a mixture of address and person documents - full-text searching can't do the "join" between an address and its occupants; you'd have to do that yourself afterwards.
If you desperately need to return address and people objects together, then consider combining the two: your address document would contain an array of people objects that reside there. There is a trade-off between combining objects so that data that belongs together is stored together, and keeping every micro object separate for ease of updating.
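As a minimal sketch of the projection idea (the database name "mydb", design document "addresses" and view "by_address" are made up, not from the post): emit the data you need as the row value, so the rows carry it back without ?include_docs=true.
curl -X PUT 'http://localhost:5984/mydb/_design/addresses' -d '{
  "views": {
    "by_address": {
      "map": "function (doc) { if (doc.type === \"address\") { emit(doc.singleLineAddress, { persons: doc.persons }); } }"
    }
  }
}'
Querying /mydb/_design/addresses/_view/by_address then returns rows whose values already contain the persons array. To get the occupants' names as well, you would still need either the linked-document trick with ?include_docs=true or a couchdb-lucene index that includes the name fields.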

Suggestions for my data structure/schema with PouchDB - CouchDB

Good morning!
I want to use CouchDB/PouchDB for a PWA I'm currently working on.
In my project I want to store "Projects"; in a "Project" I want to store the project title and "Chapters"; in a "Chapter" I want to store the chapter title and "Scenes"; and a "Scene" contains text.
What schema would make the most sense and perform best?
Right now I'm thinking about a structure like this:
Project 1
  title: string
  Chapter 1
    Scene 1
      text: string
    Scene 2
      text: string
    Scene 3
      text: string
  Chapter 2
    ...
Project 2
  title: string
  Chapter 1
    Scene 1
      ...
Since I only have SQL experience and have never used document-based databases before, I don't really know how to come up with a structure that makes sense.
Do I store documents inside documents to get a schema that looks exactly like the above, or do I create a database for each component (Projects, Chapters, Scenes)?
You have several options.
1. Each project is a document with a list of chapters, each with a list of scenes.
2. Projects, chapters and scenes are three different kinds of documents in the same database.
Which one is best depends on the likely total size and on how each of these components changes. CouchDB works best with small documents (kilobytes). As you can only update whole documents, changing bits inside lists or inside objects in larger documents quickly becomes inefficient and can generate update conflicts.
The second suggestion above will scale better, but (currently; see the link below) lacks the convenience of being able to pull out everything about a project with a single API call. You can use the _id field to great effect:
{
  "_id": "project1:toplevel",
  "type": "project",
  "title": "Project 1"
}
{
  "_id": "project1:chapter1",
  "type": "chapter",
  "title": "Project 1, chapter 1"
}
{
  "_id": "project1:chapter1#scene1",
  "type": "scene",
  "title": "Project 1, chapter 1, scene 1"
}
In a "landing soon" version of CouchDB this id format can be used to leverage so-called partitioned databases that would be a great fit here. You can read blog posts about it here:
https://blog.cloudant.com/2019/03/05/Partition-Databases-Introduction.html
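As a small usage sketch (the database name "projects" is an assumption, and depending on your client the quote characters may need URL-encoding as %22): even without partitioned databases, that _id scheme lets you pull everything belonging to one project with a single range query against _all_docs.
curl 'http://localhost:5984/projects/_all_docs?include_docs=true&startkey="project1:"&endkey="project1:\ufff0"'
This returns the project document, its chapters and its scenes in one response, because they all share the "project1:" id prefix.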

Azure Search Lucene Query Incorrect Result

I am currently using Azure Search to bring back images stored in blob storage, based on filters passed in by the user. Below is my Azure Search query, which I thought would filter on all of the content specified in the tags field with an AND:
search=foreignId:d0c41422-acfa-4e4b-a9db-8c06b6860f3f, tags:SiteRef +\""TY0033"\" + BlockRef + \""00"\" + Disipline + \""FABRIC"\"&searchMode=all&queryType=full
and here is what it brings back (which is wrong, as you can see from the BlockRef, though if I pass CN0001 it brings back the correct values):
"foreignId": "d0c41422-acfa-4e4b-a9db-8c06b6860f3f",
"description": "Health & Safety Eire - Site Photo - TY0033-01-
FABRIC-005",
"fileName": "TY0033-01-FABRIC-005",
"fileExtension": ".jpg",
"createdAt": "26/11/2018 02:00:24",
"tags": "[{\"TagName\":\"SiteRef\",\"Value\":\"TY0033\"},{\"TagName\":\"BlockRef\",\"Value\":\"01\"},{\"TagName\":\"Disipline\",\"Value\":\"FABRIC\"},{\"TagName\":\"PhotoNumber\",\"Value\":\"005\"}]",
"longitude": 0,
"latitude": 0
95% of the time this works perfectly; the other 5% of the time the images come back incorrect, because Azure Search has returned the wrong details.
I have checked, and it seems to be because it is not respecting the multiplicity of the search terms. I am new to Azure Search, so I am wondering if I am doing this correctly?
Any help would be greatly appreciated.
Index Definition:
[index definition screenshot]
Edit: updated the post with the index definition.
In your query you check whether foreignId is equal to d0c41422-acfa-4e4b-a9db-8c06b6860f3f, whether the tags field contains SiteRef, and whether any searchable field contains TY0033, BlockRef, 00, Disipline and FABRIC. In your case all fields are searchable. Thus:
foreignId matches
tags contains SiteRef
TY0033, BlockRef, Disipline and FABRIC are in the tags field
00 is in the createdAt field, as the standard Lucene analyzer tokenizes "26/11/2018 02:00:24" into 26, 11, 2018, 02, 00, 24
In order to search within the tags field, you should rewrite your query as follows:
search=foreignId:d0c41422-acfa-4e4b-a9db-8c06b6860f3f AND tags:(SiteRef AND \""TY0033"\" AND BlockRef AND \""00"\" AND Disipline AND \""FABRIC"\")&searchMode=all&queryType=full
It might also be worthwhile to use proximity search to make sure you correlate occurrences of field/value pairs, e.g. BlockRef and 00 as "BlockRef 00"~1; see the sketch below.
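Putting the rewrite and the proximity suggestion together, a query along these lines would tie each tag name to its value (a sketch, not a tested query; the quotes are shown unescaped here, so apply whatever escaping your client code requires, as in the examples above):
search=foreignId:d0c41422-acfa-4e4b-a9db-8c06b6860f3f AND tags:("SiteRef TY0033"~1 AND "BlockRef 00"~1 AND "Disipline FABRIC"~1)&searchMode=all&queryType=full
With searchMode=all and queryType=full, each proximity phrase only matches when the tag name and its value occur next to each other inside the tags field (the slop of 1 allows for the intervening "Value" token).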

couchdb match multiple inconsistent keys

Consider the following two documents:
{
  "_id": "a6b8d3d7e2d61c97f4285220c103abca",
  "_rev": "7-ad8c3eaaab2d4abfa01abe36a74da171",
  "File": "/store/document/scan_bgd123.jpg",
  "Commend": "Describes a person",
  "DateAdded": "2014-07-17T14:13:00Z",
  "Name": "Joe",
  "LastName": "Soap",
  "Height": "192cm",
  "Age": "25"
}
{
  "_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
  "_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
  "File": "/store/document/scan_adf123.jpg",
  "Comment": "Describes a car",
  "Make": "Ford",
  "Year": "2011",
  "Model": "Focus",
  "Color": "Blue"
}
How would I find a document based on multiple criteria, for example "Make"="Ford" and "Color"="Blue"? I realize I need a view for this, but I don't know what the key is going to be, and as you can see from the two documents, the key/value pairs aren't consistent. The only consistent item will be the "File" key.
I'm attempting to create a CouchDB database that will store the locations of files, tagged with key/value pairs.
EDIT:
Perhaps I should reconsider my data structure and modify it slightly?
{
  "_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
  "_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
  "File": "/store/document/scan_adf123.jpg",
  "Tags": {
    "Comment": "Describes a car",
    "Make": "Ford",
    "Year": "2011",
    "Model": "Focus",
    "Color": "Blue"
  }
}
So I need to find documents by a single key/value pair in the tags, or by any number of key/value pairs, to filter down to the documents I want. The problem is that I want to tag objects with arbitrary key/value pairs, and these tags can differ completely from one document to the next.
CouchDB supports flexible schemas; there is no need for the documents to be consistent for them to be queryable. The view for your scenario is pretty straightforward. Here is a map function that should do the trick:
function (doc) {
  if (doc.Make && doc.Color) {
    emit([doc.Make, doc.Color], null);
  }
}
This gives you a view which you can then query like:
/view-name?key=["Ford","Blue"]&include_docs=true
This should give you the desired result.
Edit based on comment
For that you will need two separate views. Every view in CouchDB is designed to fulfil a specific query need, which means you have to think about the access strategy for your data. It is more work on your part initially, but in return you are rewarded with data that is indexed and has very fast access times.
So, to answer your question directly: create two views, one for Make as we have already done, and another for Name like this:
function (doc) {
  if (doc.Name && doc.LastName) {
    emit([doc.Name, doc.LastName], null);
  }
}
Now the Name view will index only those documents that have a name in them, whereas the Make view will index only those documents that have a make in them.
What happens when a requirement comes up in the future for which you don't have a view?
You can try a few things.
1. This is probably the easiest solution: use couchdb-lucene for your dynamic queries. In this case your architecture becomes CouchDB views for the queries you know your application will need, and a Lucene index for the queries you don't yet know you might need. For instance, you have indexed name and last name in a CouchDB view, but a requirement arises to query by age; simply dump the age field into Lucene and it will take care of the rest.
2. Another approach is the PPP technique, where you exploit the fact that creating a view is a one-time cost: you can build views during less active hours and deploy them to the production service once they are built.
3. Combine steps 1 and 2! Lucene handles ad-hoc requests while you are building views using the PPP technique.
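If you adopt the restructured Tags object from the question's edit, one more option (a sketch, not part of the answer above; the database name "files" and the design document name are made up) is a single view that emits every tag as a [key, value] pair:
curl -X PUT 'http://localhost:5984/files/_design/tags' -d '{
  "views": {
    "by_tag": {
      "map": "function (doc) { if (doc.Tags) { for (var k in doc.Tags) { emit([k, doc.Tags[k]], null); } } }"
    }
  }
}'
curl -g 'http://localhost:5984/files/_design/tags/_view/by_tag?key=["Make","Ford"]&include_docs=true'
A single lookup matches one key/value pair at a time, so filtering on several pairs still means intersecting the results client-side, or falling back to couchdb-lucene as suggested above.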

Retrieve analyzed tokens from ElasticSearch documents

I'm trying to access the analyzed/tokenized text in my ElasticSearch documents.
I know you can use the Analyze API to analyze arbitrary text according to your analysis modules, so I could copy and paste data from my documents into the Analyze API to see how it was tokenized.
This seems unnecessarily time-consuming, though. Is there any way to instruct ElasticSearch to return the tokenized text in search results? I've looked through the docs and haven't found anything.
This question is a little old, but I think an additional answer is necessary.
With ElasticSearch 1.0.0, the Term Vector API was added, which gives you direct access to the tokens ElasticSearch stores under the hood on a per-document basis. The API docs are not very clear on this (it's only mentioned in the example), but in order to use the API you first have to indicate in your mapping definition that you want to store term vectors, via the term_vector property on each field.
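A minimal sketch of such a request (the index "your_index", type "your_type" and document id 1 are placeholders, and the "text" field is assumed to have term_vector enabled in its mapping):
curl 'http://localhost:9200/your_index/your_type/1/_termvector?fields=text&pretty=true'
The response lists, per field, every stored token together with its frequency and, if enabled in the mapping, its positions and offsets.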
Have a look at this other answer: elasticsearch - Return the tokens of a field. Unfortunately it requires re-analyzing the content of your field on the fly using the script provided.
It should be possible to write a plugin to expose this feature. The idea would be to add two endpoints:
one to read the Lucene TermsEnum, like the Solr TermsComponent does, which is useful for auto-suggestions too. Note that it wouldn't be per document, just every term in the index with its term frequency and document frequency (potentially expensive with a lot of unique terms);
one to read the term vectors, if enabled, like the Solr TermVectorComponent does. This would be per document, but it requires storing the term vectors (you can configure that in your mapping) and also allows retrieving positions and offsets if enabled.
You may want to use scripting; however, your server needs to have scripting enabled.
curl 'http://localhost:9200/your_index/your_type/_search?pretty=true' -d '{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
      "script": "doc[field].values",
      "params": {
        "field": "field_x.field_y"
      }
    }
  }
}'
The default settings for allowing scripts depend on the Elasticsearch version, so please check the official documentation.
