Elastic Search input analysis - search

Can Elasticsearch split an input string into categorized words? E.g., if the input is
4star wi-fi 99$
and we are searching hotels with ES, is it possible to analyze/tokenize this string as
4star - hotel level, wi-fi - hotel amenities, 99$ - price?
Yep, it's a noob question :)

Yes and no.
By default, query_string searches will work against the automatically created _all field. The contents of the _all field come from literally and naively combining all fields into a single analyzed string.
As such, if you have a "4star" rating, a "wi-fi" amenity, and a "99$" price, then all of those values would be inside of the _all field and you should get relevant hits against it. For example:
{
  "level" : "4star",
  "amenity" : ["pool", "wi-fi"],
  "price" : 99.99
}
The problem is that, without client-side effort, you will not know which field(s) matched when searching against _all. It won't tell you the breakdown of where each value came from; rather, it will simply report a score that determines the overall relevance.
If you have some way of knowing which field each term (or terms) is meant to search against, then you can easily do this yourself (quotes aren't required, but they're good to have to avoid mistakes with spaces). This would be the input that you might provide to the query_string query mentioned above:
level:"4star" amenity:"wi-fi" price:(* TO 100)
You could further complicate this by using a spelled out query:
{
  "query" : {
    "bool" : {
      "must" : [
        { "match" : { "level" : "4star" } },
        { "match" : { "amenity" : "wi-fi" } },
        {
          "range" : {
            "price" : {
              "lt" : 100
            }
          }
        }
      ]
    }
  }
}
Naturally, the last two requests would require advance knowledge about what each search term references. You could certainly use the $ in "99$" as a tipoff for price, but not for the others. Chances are you wouldn't have users typing in "4 stars" anyway; more likely you'd offer checkboxes or other form-based selections, so this should be quite realistic.
Technically, you could create a custom analyzer that recognized each term based on its position, but that's not really a good or useful idea.

Related

Optimise conditional queries in Azure cognitive search

We ran into a unique scenario while using Azure Search on one of our projects. Our client wanted to respect users' privacy, so we have a feature where a user can restrict search on any PII data. If a user has opted for privacy, we can only search for them by UserId; otherwise we can search using Name, Phone, City, UserId, etc.
JSON where Privacy is opted:
{
  "Id": "<Any GUID>",
  "Name": "John Smith",      // searchable
  "Phone": "9987887856",     // searchable
  "OtherInfo": "some info",  // non-searchable
  "Address": {},             // searchable
  "Privacy": "yes",          // searchable
  "UserId": "XXX1234",       // searchable
  ...
}
JSON where Privacy is not opted:
{
  "Id": "<Any GUID>",
  "Name": "Tom Smith",       // searchable
  "Phone": "7997887856",     // searchable
  "OtherInfo": "some info",  // non-searchable
  "Address": {},             // searchable
  "Privacy": "no",           // searchable
  "UserId": "XXX1234",       // searchable
  ...
}
Now we provide a search service that takes any searchText as input and fetches all data that matches it (across all searchable fields).
With the above scenario:
We need to exclude results that have "Privacy" set to "yes" unless searchText matches UserId.
If searchText matches UserId, we include the record in the result.
If "Privacy" is set to "no" and searchText matches any searchable field, the record is included in the result.
So we have gone with the full "Lucene" query syntax to check this while querying, resulting in a very long query, as shown below. Assume searchText = "abc":
((Name: abc OR Phone: abc OR UserId: abc ...) AND Privacy: no) OR
((UserId: abc ) AND Privacy: yes)
We do this because we show paginated results, i.e. bringing data in batches like 1-10, 11-20 and so on; hence we fetch the top 10 records in each query along with the total result count.
Is there any more optimised approach to do this?
Or does the Azure search service provide any internal mechanism for conditional queries?
If I understand your requirement correctly, it can be solved quite easily. You determine which properties should be searchable in your data model. You don't need to construct a complicated query that repeats the end user's input for every property. And you don't need to do any batching or processing of results.
If searchText is your user's input, you can use this:
(*searchText* AND Privacy:no)
This will search all searchable fields, but it will only return records that allow search across PII data.
You also have a requirement that lets users search by UserId in all records, regardless of each record's PII setting. To support this, extend the query to:
(*searchText* AND Privacy:no) OR (UserId:*searchText*)
This allows users to search all fields in records where Privacy is "no", and for all other records it allows search on UserId only. This query pattern solves all of your requirements with one optimized query.
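For illustration, a rough sketch of that request against the Azure Cognitive Search REST API, keeping the answer's wildcard pattern as-is and adding the paging parameters from the question (the service name, index name, and api-version here are placeholders):
POST https://<service>.search.windows.net/indexes/<index>/docs/search?api-version=2020-06-30
{
  "search": "(*abc* AND Privacy:no) OR (UserId:*abc*)",
  "queryType": "full",
  "searchMode": "any",
  "top": 10,
  "skip": 0,
  "count": true
}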
From the client side you could dynamically add the "SearchFields" parameter as part of the query; that way, if the user has the Privacy flag set, only UserId is included in the available search fields.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.search.models.searchparameters.searchfields?view=azure-dotnet
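A minimal sketch of that idea with the Microsoft.Azure.Search SDK from the link above (the indexClient variable, the privacyOptedIn flag, and the field list are assumptions, not part of the original answer):
// Hypothetical: indexClient is an ISearchIndexClient for the users index.
var parameters = new SearchParameters
{
    SearchFields = privacyOptedIn
        ? new List<string> { "UserId" }
        : new List<string> { "Name", "Phone", "Address", "UserId" },
    Top = 10,
    IncludeTotalResultCount = true
};
var results = indexClient.Documents.Search("abc", parameters);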

Is there a way to search in Firebase Firestore without saving another field in lowercase for case-insensitive search? [duplicate]

This question already has answers here:
Cloud Firestore Case Insensitive Sorting Using Query
(3 answers)
Are Cloud Firestore queries still case sensitive?
(1 answer)
Closed 1 year ago.
To support case-insensitive search (or any other canonicalization), do we need to write a separate field that contains the canonicalized version and query against that?
For example:
db.collection("users").where("name", "==", "Dan")
db.collection("users").where("name_lowercase", "==", "dan")
What I would do:
Before querying (maybe client-side): convert the query term into two or more variations (10 variations is the maximum). For example, the search term "dan" (a string) becomes the array ["dan", "DAN", "Dan"].
Then I would do an "in" query, searching for all of those variations in the same name field.
The "in" query type supports up to 10 equality (==) clauses combined with a logical "OR" operator (see the documentation).
This way, you can keep only one field "name" and query with possible variations on it.
It would look like this:
let query_variations = ["dan", "DAN", "Dan"]; // TODO: write a function that converts the query string into this kind of Array
let search = await db.collection("users").where("name", "in", query_variations).get();
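A minimal sketch of such a helper (hypothetical; it generates a few common casings and stays well under the 10-value limit of "in" queries):
// Hypothetical helper: produce lowercase, UPPERCASE, and Capitalized
// variations of a term, deduplicated via a Set.
function caseVariations(term) {
  const lower = term.toLowerCase();
  const capitalized = lower.charAt(0).toUpperCase() + lower.slice(1);
  return [...new Set([lower, term.toUpperCase(), capitalized])];
}

let query_variations = caseVariations("dan"); // ["dan", "DAN", "Dan"]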
In short, yes.
This is because Cloud Firestore (and the Firebase Realtime Database, when indexing is enabled) are indexed databases based on the values of each property in a document.
Rather than searching through hundreds (if not thousands and thousands) of documents for matches, the database queries the index of the relevant property for matching document IDs.
Consider the following "database" and its index based on the name in the documents:
const documents = {
  "docId1": {
    name: "dan"
  },
  "docId2": {
    name: "dan"
  },
  "docId3": {
    name: "Dan"
  },
  "docId4": {
    name: "Dan"
  }
}
const nameIndex = {
  "dan": ["docId1", "docId2"],
  "Dan": ["docId3", "docId4"]
}
Instead of calling Object.entries(documents).filter(([id, data]) => data.name === "dan") on the entire list of documents, you can just ask the index using nameIndex["dan"], yielding the final result ["docId1", "docId2"] near-instantly, ready to be retrieved.
Continuing that same example, calling nameIndex["daniel"] gives undefined (no documents with that name), which can quickly be used to determine that the data doesn't exist in the database.
Firestore introduced composite indexes, which let you index across multiple properties such as "name" and "age", so you can also quickly and efficiently search for documents where the name is "Dan" and the age is 42.
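For example, a sketch of such a query (it assumes a numeric age field and, where required, a matching composite index):
db.collection("users")
  .where("name", "==", "Dan")
  .where("age", "==", 42)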
Further reading: the Firebase documentation covers one solution for text-based search.

How to use UNWIND to execute a block composed of a MATCH and two FOREACHs?

I'm running neo4j queries from node.js using the neo4j-driver. A lot of things were simplified to cut irrelevant information, but what is needed is here.
I have been trying to make a query to ingest a data set with some quirks, defined as follows:
Curriculum: A list of Publications
Publication: Contains data about a publication and a field that is a list of Authors
Author: Relevant fields are externalId and normalizedFullName.
externalId is an id that comes from the data's origin system. It is not guaranteed to be present, but if it is, it will uniquely identify a node
normalizedFullName will always be present and it's ok to assume the same author will always have the same name wherever it appears; it is also acceptable that full name may not be unique and that at some point two different persons may be stored as the same node
It is possible for an author to be part of one publication with only their normalizedFullName and part of another with normalizedFullName AND externalId. As you can see, it is not very consistent data, but this is not a problem for the ends I need it for.
It will look like this: (don't mind any syntax error)
"curriculum": [
{
"data": {
"fieldA": "a",
"fieldB": "b"
},
"authors": [
{
"externalId": "",
"normalizedFullName": "namea namea"
},
{
"externalId": "123456",
"normalizedFullName": "nameb nameb"
}
]
},
{
"data": {
"fieldA": "d",
"fieldB": "e"
},
"authors": [
{
"externalId": "123321",
"normalizedFullName": "namea namea"
},
{
"externalId": "123456",
"normalizedFullName": "nameb nameb"
}
]
}
]
Merging everything
Merging the publication part is trivial, but things get complicated when it comes to the authors since I have to follow this logic (simplified here) to merge an author:
IF the author doesn't have an externalId OR no node has been created with that externalId yet THEN
  merge by normalizedFullName
ELSE IF there is already a node with this externalId THEN
  merge by externalId
So, acknowledging that I would need some kind of conditional merge, and finding that it could be achieved by "the FOREACH trick", I was able to come up with this little monster (comments added to clarify):
// For each publication, merge it
UNWIND {publications} as publication
MERGE (p:Publication { fieldA: publication.data.fieldA, fieldB: publication.data.fieldB })
ON CREATE SET p = publication.data
WITH p, publication.authors AS authors
// Then, for each author in this publication
UNWIND authors AS author
// IF author don't have externalId OR isn't already a node created with his externalId THEN
MATCH (a:Author) WHERE a.externalId = author.data.externalId AND a.externalId <> '' WITH count(a) as found, author, p
// Merge by name
FOREACH(ignoreMe IN CASE WHEN found = 0 THEN [1] ELSE [] END |
MERGE (aa:Author { normalizedFullName: author.data.normalizedFullName })
ON CREATE SET aa = author.data
MERGE (aa)-[:CONTRIBUTED]->(p)
)
// Else, merge by externalId
FOREACH(ignoreMe IN CASE WHEN found > 0 THEN [1] ELSE [] END |
MERGE (aa:Author { externalId: autor.dadta.externalId })
ON CREATE SET aa = author.data
MERGE (aa)-[:CONTRIBUTED]->(p)
)
Note: this is not the real query I'm using; it just shows the exact structure.
The Problem
It doesn't work. It only creates the publications (correctly) and never the authors. It seems the MATCH, the FOREACH, or a combination of both is messing up the loop I expected the UNWIND to produce.
I'm at a point where I can't find a way to do it properly. I also can't find what is wrong, even after checking the available documentation.
So, what do I do?
(let me know if any more information is needed)
Thanks in advance for any insight!
First of all: author.data.externalId does not exist. The right property path is author.externalId (without data). The same goes for author.data.normalizedFullName.
I simulated your scenario by putting your data set into a parameter in the Neo4j browser interface and running your query. As expected, the authors are never created.
I corrected your query with these steps:
Changed author.data.externalId to author.externalId and author.data.normalizedFullName to author.normalizedFullName.
Changed MATCH (a:Author) to OPTIONAL MATCH (a:Author) to ensure that the query continues even when no results are found.
Removed count(a) as found (not necessary) and changed the tests from found = 0 to a IS NULL and from found > 0 to a IS NOT NULL.
Your corrected query:
UNWIND {publications} AS publication
MERGE (p:Publication { fieldA: publication.data.fieldA, fieldB: publication.data.fieldB })
  ON CREATE SET p = publication.data
WITH p, publication.authors AS authors
UNWIND authors AS author
OPTIONAL MATCH (a:Author) WHERE a.externalId = author.externalId AND a.externalId <> ''
WITH a, author, p
FOREACH(ignoreMe IN CASE WHEN a IS NULL THEN [1] ELSE [] END |
  MERGE (aa:Author { normalizedFullName: author.normalizedFullName })
    ON CREATE SET aa = author
  MERGE (aa)-[:CONTRIBUTED]->(p)
)
FOREACH(ignoreMe IN CASE WHEN a IS NOT NULL THEN [1] ELSE [] END |
  MERGE (aa:Author { externalId: author.externalId })
    ON CREATE SET aa = author
  MERGE (aa)-[:CONTRIBUTED]->(p)
)
The data set created after I ran this query: (screenshot omitted)
I think the problem (or at least one problem) is that if your author MATCH fails, the entire row for that author will be wiped out, and the rest of the query will not execute for that author.
Try using OPTIONAL MATCH instead; that will preserve the row and allow the query to finish for those rows.
As for additional options on how to do conditional cypher operations, we actually just released new versions of APOC Procedures with conditional cypher execution, so take a look at apoc.do.when() when you get the chance.
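For illustration, a rough sketch of how the two FOREACH blocks might be replaced with apoc.do.when() (assuming APOC is installed; the inner queries receive the map entries as parameters, and parameter syntax differs across Neo4j/APOC versions):
// ...same UNWINDs and publication MERGE as above...
OPTIONAL MATCH (a:Author) WHERE a.externalId = author.externalId AND a.externalId <> ''
CALL apoc.do.when(
  a IS NOT NULL,
  'MERGE (aa:Author { externalId: $author.externalId }) ON CREATE SET aa = $author RETURN aa',
  'MERGE (aa:Author { normalizedFullName: $author.normalizedFullName }) ON CREATE SET aa = $author RETURN aa',
  { author: author }
) YIELD value
WITH p, value.aa AS aa
MERGE (aa)-[:CONTRIBUTED]->(p)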

ElasticSearch default scoring mechanism

What I am looking for is a plain, clear explanation of how the default scoring mechanism of Elasticsearch (Lucene) really works. I mean, does it use Lucene scoring, or does it use scoring of its own?
For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Let's consider this type of query:
IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
    s.From(0)
     .Size(300)
     .Explain()
     .Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);
which is translated to such JSON query:
{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}
There are about 1.1 million documents that the search is performed on. What I get in return is (this is only part of the result, formatted on my own):
650 "ExampleName" 7,313398
651 "ExampleName" 7,313398
652 "ExampleName" 7,313398
653 "ExampleName" 7,239194
654 "ExampleName" 7,239194
860 "ExampleName of Something" 4,5708737
where the first field is just an Id, the second is the Name field on which Elasticsearch performed its search, and the third is the score.
As you can see, there are many duplicates in the ES index. Since some of the found documents have different scores despite being exactly the same (differing only in Id), I concluded that different shards performed the search on different parts of the whole dataset, which leads me to believe that the score is somehow based on the overall data in a given shard, not exclusively on the document actually considered by the search engine.
The question is: how exactly does this scoring work? I mean, could you tell me/show me/point me to the exact formula used to calculate the score for each document found by ES? And finally, how can this scoring mechanism be changed?
The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented in the Lucene Similarity javadocs. You can customize scoring by configuring your own Similarity, or by using something like a custom_score query.
The odd score variation in the first five results shown seems small enough that it doesn't concern me much, as far as the validity of the query results and their ordering, but if you want to understand the cause of it, the explain api can show you exactly what is going on there.
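For reference, a custom_score query from the Elasticsearch versions of that era looked roughly like this (a sketch; the numeric popularity field is an assumption, and newer versions replaced custom_score with function_score):
{
  "query": {
    "custom_score": {
      "query": { "match": { "Name": "ExampleName" } },
      "script": "_score * doc['popularity'].value"
    }
  }
}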
The score variation is based on the data in a given shard (as you suspected). By default, ES uses a search type called query_then_fetch, which sends the query to each shard and finds all the matching documents, scoring them with local TF/IDF statistics (these vary with the data on a given shard; here's your problem).
You can change this by using the dfs_query_then_fetch search type, which first queries each shard for term and document frequencies and then sends the actual query to each shard, etc.
You can set it in the URL:
$ curl -XGET 'localhost:9200/index/type/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}'
There is a great explanation in the Elasticsearch documentation:
What is relevance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Theory behind relevance scoring:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

Storing and Grouping IP Addresses in CouchDB

I have a CouchDB database that contains a stack of IP address documents like this:
{
  "_id": "09eea172ea6537ad0bf58c92e5002199",
  "_rev": "1-67ad27f5ab008ad9644ce8ae003b1ec5",
  "1stOctet": "10",
  "2ndOctet": "1",
  "3rdOctet": "3",
  "4thOctet": "55"
}
The documents represent multiple IPs that are part of different subnet ranges.
I need a way to reduce/group these documents based on the 1st, 2nd, 3rd and 4th octets in order to produce a reduced list of subnets.
Has anybody done anything like this before?
Best Regards,
Carlskii
I'm not sure if this is exactly what you're looking for; if you can provide more of an example of your desired output, I can likely be of more help.
First, I would have your document structure look like this: (if you can't change that structure, it's not a big deal)
{
  "ip": "10.1.3.55"
}
Your map function would look like:
function (doc) {
  emit(doc.ip.split("."));
}
You'll need a reduce function; in my testing I just used the built-in:
_count
Then I would use the group_level view query parameter to group based on each octet:
1 = group based on the 1st octet
2 = group based on the 1st and 2nd octets
3 = group based on the 1st through 3rd octets
4 = group based on the entire address
group=true is functionally the same in this case as group_level=4.
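Putting it together, a sketch of a design document holding that view, plus a grouped query (the database name mydb and design-document name ips are placeholders):
{
  "_id": "_design/ips",
  "views": {
    "by_octet": {
      "map": "function (doc) { emit(doc.ip.split('.')); }",
      "reduce": "_count"
    }
  }
}

# group on the first three octets, counting addresses per prefix
curl 'http://localhost:5984/mydb/_design/ips/_view/by_octet?group_level=3'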
