Searching for a particular phrase in _all fields returns fewer records than the same search on a small number of fields


I want to search for a particular phrase with Elasticsearch, once against the _all field and once against only two fields. The phrases are taken from a file listing more than 10,000 keywords. Here is the code:
from elasticsearch import Elasticsearch
import json

es = Elasticsearch(['localhost:9200/'])

# Read one keyword per line, stripping whitespace and surrounding single quotes.
keyword_array = []
with open('localDrive\\extract_keywords\\t2.txt') as my_keywordfile:
    for keyword in my_keywordfile:
        keyword_array.append(keyword.strip().strip("'"))

# Run one phrase query per keyword and write each response as a JSON line.
# The "with" block closes the file, so no explicit f.close() is needed.
with open('LocalFile\\_description_Results2.txt', 'w', encoding="utf-8") as f:
    for x in keyword_array:
        doc = {
            "query": {
                "multi_match": {
                    "query": x,
                    "type": "phrase",
                    "fields": ["title", "description"],
                }
            }
        }
        res = es.search(index='xxx_062617', body=doc)
        json.dump(res, f, ensure_ascii=False)
        f.write("\n")
The query that matches against _all is:
    "multi_match": {
        "query": x,
        "type": "phrase",
        "fields": "_all",
    }
With the query on title and description only, I get 101 records back. With _all, I get only 100. And when I combine the IDs of all records from both queries and remove the duplicates, I see that only 86 records are duplicates.
My questions are:
Does type: phrase work differently when I use _all?
Shouldn't I get more records when I use _all?
If _all includes every field, including title and description, why doesn't the _all query cover all the records returned by querying title and description?
Thanks,
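The comparison of the two result sets described above can be sketched as follows. This is a minimal illustration, assuming each line of a results file is one JSON search response (as written by the script in the question). The helper names `collect_ids` and `compare_id_sets` and the sample responses are hypothetical.

```python
import json


def collect_ids(result_lines):
    """Collect the _id of every hit from an iterable of JSON response lines."""
    ids = set()
    for line in result_lines:
        line = line.strip()
        if not line:
            continue
        response = json.loads(line)
        for hit in response["hits"]["hits"]:
            ids.add(hit["_id"])
    return ids


def compare_id_sets(two_field_ids, all_field_ids):
    """Split the IDs into those found by both queries and by only one of them."""
    return {
        "in_both": two_field_ids & all_field_ids,
        "only_two_fields": two_field_ids - all_field_ids,
        "only_all": all_field_ids - two_field_ids,
    }


# Hypothetical sample responses standing in for the two results files.
two_field_lines = [json.dumps({"hits": {"hits": [{"_id": "a"}, {"_id": "b"}]}})]
all_field_lines = [json.dumps({"hits": {"hits": [{"_id": "b"}, {"_id": "c"}]}})]

report = compare_id_sets(collect_ids(two_field_lines), collect_ids(all_field_lines))
print({name: sorted(ids) for name, ids in report.items()})
```

Looking at the IDs under `only_two_fields` shows exactly which documents the _all query missed, which is usually a more useful starting point than the raw counts.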

Related

Search Items by multiple Tags DynamoDB NodeJS

I need to search my DynamoDB table for items that match multiple values within a single item.
This is the type of item I am storing:
{
    "id": "<product id>",
    "name": "Product Name",
    "price": 1.23,
    "tags": [
        "tag1",
        "tag2",
        "tag3"
    ]
}
I need to return an array of items whose tags match all of the tags in a comma-separated list.
For example: I am looking only for items that contain the tags "tag1" and "tag2".
My first approach was to get all the items from the DynamoDB table, iterate over them to check whether each one meets this condition, and add each matching item to an object of objects.
This approach is definitely not cost-effective. Any suggestions with Node.js?

There is no way to optimize this generic case (an arbitrary number of tags stored and searched) with an index in DynamoDB.
You can optimize retrieval for one tag by adding extra items to the table where the tag is the partition key, and then running a query (with a filter for the other tags) starting there.
Or you can duplicate the data to OpenSearch, which is designed for this type of query.
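The "filter for the other tags" mentioned above could be built along these lines. This is only a sketch in Python (the question asks about Node.js, but the expression strings are the same in any SDK). It relies on DynamoDB's `contains` function in filter expressions. The helper name `build_tag_filter` and the `:t0`, `:t1` placeholder names are hypothetical. Note that a filter is applied after items are read, so it reduces the data returned, not the read cost.

```python
def build_tag_filter(tags, attribute="tags"):
    """Build a filter expression requiring every tag to be present.

    Returns the expression string and the placeholder-to-value map.
    Values are plain strings here; a low-level client would need typed
    values such as {"S": "tag1"} instead.
    """
    if not tags:
        raise ValueError("at least one tag is required")
    conditions = []
    values = {}
    for position, tag in enumerate(tags):
        placeholder = f":t{position}"
        conditions.append(f"contains({attribute}, {placeholder})")
        values[placeholder] = tag
    return " AND ".join(conditions), values


# Tags arrive as a comma-separated list, as in the question.
requested = [tag.strip() for tag in "tag1, tag2".split(",")]
expression, expression_values = build_tag_filter(requested)
print(expression)
print(expression_values)
```

The two return values would be passed as the filter expression and the expression attribute values of a query or scan request.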

Elastic Search Bulk partial update for timestamped index with Kibana

I am using Elasticsearch and Kibana with a Python client. My data is stored in Elasticsearch, and I use Kibana for data analysis and visualisation. In Kibana, I created a new index pattern with the timestamp field.
When I run the bulk partial update code, the documents disappear.
Then I removed the index pattern and re-created it without the timestamp field. Now only the fields (data_partial) provided in "_source" can be seen in Kibana's Discover panel.
So I am wondering whether the partial update ('doc_as_upsert': True) only works for an index pattern without a timestamp field.
Otherwise, I do not know what I am missing.
from elasticsearch import helpers
from elasticsearch.helpers import scan

def add_data_partial_to_bulk(es, index):
    d_body = []
    qu = {'query': {'match_all': {}}}
    # Build one bulk action per existing document in the index.
    for hit in scan(es, index=index, query=qu):
        body = {
            "_index": index,
            "_id": hit["_id"],
            "doc_as_upsert": True,  # << this partial update only works for an index pattern w/o a timestamp field
            "_source": {
                "data_partial": "hello world"}
        }
        d_body.append(body)
    return d_body

doc_body = add_data_partial_to_bulk(es, es_index_name)
helpers.bulk(es, doc_body)
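One possible explanation, offered as a sketch and not a confirmed diagnosis: `helpers.bulk` treats an action without `"_op_type"` as an index operation, which replaces the whole document with the contents of `"_source"`. That would drop the timestamp field, so the documents would no longer appear under a time-based index pattern. A partial update is expressed with `"_op_type": "update"` and a `"doc"` body instead. The helper name `build_partial_update_actions` and the sample hits below are hypothetical, and the function only builds the action dictionaries so that it can run without a cluster.

```python
def build_partial_update_actions(hits, index, partial_doc):
    """Build bulk actions that merge partial_doc into each existing document."""
    actions = []
    for hit in hits:
        actions.append({
            "_op_type": "update",      # update instead of the default "index"
            "_index": index,
            "_id": hit["_id"],
            "doc": dict(partial_doc),  # fields to merge into the stored document
            "doc_as_upsert": True,     # create the document if it does not exist
        })
    return actions


# Hypothetical hits; in the question these come from helpers.scan(...).
sample_hits = [{"_id": "1"}, {"_id": "2"}]
actions = build_partial_update_actions(
    sample_hits, "my-index", {"data_partial": "hello world"}
)
print(actions[0])

# With a real client, the actions would be sent with:
#     helpers.bulk(es, actions)
```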

How to obtain nested fields within JSON

Background:
I wish to update a nested field within my JSON document. I want to query for all documents where "state" equals "new".
{
    "id": "123",
    "feedback": {
        "Features": [
            {
                "state": "new"
            }
        ]
    }
}
This is what I have tried:
Since this is a nested document, my query looks like this:
SELECT * FROM c WHERE c.feedback.Features.state = "new"
However, I keep getting zero results even though I know matching documents exist in the database. What am I doing wrong? Am I getting zero results because Features is an array?
Any help is appreciated.

For arrays, you'll need to use ARRAY_CONTAINS(). For example, in your case:
SELECT *
FROM c
WHERE ARRAY_CONTAINS(c.feedback.Features, {'state': 'new'}, true)
The third parameter (true) tells ARRAY_CONTAINS to match partially against objects in the array, so an element matches when it contains the given properties, and an exact match on the whole element is not required.
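The effect of the third argument to ARRAY_CONTAINS can be illustrated in memory. The following Python function is not part of any Cosmos DB SDK. It is a simplified, hypothetical emulation that handles only top-level properties, written to show the difference between an exact and a partial match.

```python
def array_contains(array, expected, partial=False):
    """Emulate exact and partial matching of an element within an array."""
    for element in array:
        if partial and isinstance(element, dict) and isinstance(expected, dict):
            # Partial match: every expected property must be present and equal.
            if all(key in element and element[key] == value
                   for key, value in expected.items()):
                return True
        elif element == expected:
            # Exact match: the whole element must equal the expected value.
            return True
    return False


# Hypothetical element with an extra "name" property next to "state".
features = [{"state": "new", "name": "search"}]

exact_match = array_contains(features, {"state": "new"})
partial_match = array_contains(features, {"state": "new"}, partial=True)
print(exact_match, partial_match)
```

With an exact match, an element carrying any extra property is rejected, which is why the partial form is the safer choice when the array elements may hold more than just "state".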

ElasticSearch default scoring mechanism

What I am looking for is a plain, clear explanation of how the default scoring mechanism of ElasticSearch (Lucene) really works. Does it use Lucene scoring, or does it use scoring of its own?
For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Let's consider this type of query:
IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
    s.From(0)
     .Size(300)
     .Explain()
     .Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);
which is translated to the following JSON query:
{
    "from": 0,
    "size": 300,
    "explain": true,
    "query": {
        "match": {
            "Name": {
                "query": "ExampleName"
            }
        }
    }
}
The search is performed on about 1.1 million documents. What I get in return is (this is only part of the result, formatted on my own):
650  "ExampleName"               7,313398
651  "ExampleName"               7,313398
652  "ExampleName"               7,313398
653  "ExampleName"               7,239194
654  "ExampleName"               7,239194
860  "ExampleName of Something"  4,5708737
where the first field is just an Id, the second is the Name field on which ElasticSearch performed its search, and the third is the score.
As you can see, there are many duplicates in the ES index. Since some of the documents found have different scores despite being exactly the same (apart from the Id), I concluded that different shards performed the search on different parts of the whole dataset. This leads me to suspect that the score is somehow based on the overall data in a given shard, not exclusively on the document actually being considered by the search engine.
The question is, how exactly does this scoring work? Could you point me to the exact formula used to calculate the score for each document found by ES? And finally, how can this scoring mechanism be changed?

The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented here. You can customize scoring by configuring your own Similarity, or by using something like a custom_score query.
The odd score variation in the first five results seems small enough that it doesn't concern me much as far as the validity of the query results and their ordering goes, but if you want to understand its cause, the explain API can show you exactly what is going on there.

The score variation is based on the data in a given shard (as you suspected). By default ES uses a search type called 'query then fetch', which sends the query to each shard and finds all the matching documents, scoring them with local TF-IDF statistics (these vary with the data on a given shard, and this is your problem).
You can change this by using the 'dfs query then fetch' search type, which first pre-queries each shard for its term and document frequencies and then sends the query to each shard with those global statistics.
You can set it in the URL:
$ curl -XGET '/index/type/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
    "from": 0,
    "size": 300,
    "explain": true,
    "query": {
        "match": {
            "Name": {
                "query": "ExampleName"
            }
        }
    }
}'
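To see why shard-local statistics change the score, consider the inverse document frequency term of Lucene's classic TF-IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The sketch below evaluates it for two shards and for the combined index. The shard names and all the counts are made-up numbers chosen for illustration, not figures from the question.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as defined by Lucene's classic similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard statistics for the term being searched.
shard_stats = {
    "shard_0": {"doc_freq": 40, "num_docs": 220_000},
    "shard_1": {"doc_freq": 45, "num_docs": 220_000},
}

# 'query then fetch': each shard scores with its own statistics.
local_idf = {
    name: classic_idf(stats["doc_freq"], stats["num_docs"])
    for name, stats in shard_stats.items()
}

# 'dfs query then fetch': statistics are summed across shards first.
global_idf = classic_idf(
    sum(stats["doc_freq"] for stats in shard_stats.values()),
    sum(stats["num_docs"] for stats in shard_stats.values()),
)

for name, value in local_idf.items():
    print(name, round(value, 4))
print("global", round(global_idf, 4))
```

Identical documents on different shards get slightly different idf values under local statistics, while the global value is the same for every shard.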

There is a great explanation in the ElasticSearch documentation:
What is relevance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Theory behind relevance scoring:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

Searching required data in couchdb

I have documents like:
{
    _id: 1,
    name: "john"
}
{
    _id: 2,
    name: "john boss"
}
{
    _id: 3,
    name: "jim"
}
I need to find every document in which "john" is stored. For example, if I search for "john", I should get the data for _id:1 and _id:2. Please guide me to get this result.
I would appreciate it if anyone could provide a solution.

I suggest a CouchDB view to show you all "words" from the "name" field.
function(doc) {
    // map function: _design/example/_view/names
    if(!doc.name) // Optionally do more testing for doc type, etc. here.
        return
    // Emit one row per word in the name field (first name, last name, etc.).
    var words = doc.name.split(/\s+/)
    for(var i = 0; i < words.length; i++)
        emit(words[i].toLowerCase(), doc._id)
}
Now if you query /db/_design/example/_view/names?key="john", you will get two rows: one for doc id 1, and another for id 2. I also added a conversion to lower case, so searching for "john" will match people named "John".
Duplicates are possible: the same doc ID listed multiple times, e.g. for {"name":"John John"}; however you are guaranteed that all duplicate rows will be adjacent.
You can also add ?include_docs=true to your request to get the full document for each row.
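What the view produces for the three documents in the question can be traced with a small in-memory emulation. This Python sketch is not CouchDB code. The functions `map_names`, `build_view` and `query_view` are hypothetical stand-ins for the map function, the view index and a `?key=` lookup.

```python
def map_names(doc):
    """Emit one (word, doc id) row per word in the name field."""
    name = doc.get("name")
    if not name:
        return []
    return [(word.lower(), doc["_id"]) for word in name.split()]


def build_view(docs):
    """Collect the rows of all documents, sorted by key like view rows."""
    rows = [row for doc in docs for row in map_names(doc)]
    return sorted(rows)


def query_view(rows, key):
    """Return the doc ids of all rows whose key equals the given key."""
    return [doc_id for row_key, doc_id in rows if row_key == key]


docs = [
    {"_id": 1, "name": "john"},
    {"_id": 2, "name": "john boss"},
    {"_id": 3, "name": "jim"},
]

rows = build_view(docs)
matches = query_view(rows, "john")
print(rows)
print(matches)
```

Because the rows are sorted by key, rows for the same word sit next to each other, which is also why duplicate rows for a name like "John John" end up adjacent.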
