I have this data in my index:
https://gist.github.com/bitgandtter/6794d9b48ae914a3ac7c
If you look at the mapping, I'm using ngrams from 3 to 20 characters.
When I execute this query:
GET /my_index/user/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "query": "F",
          "fields": ["username", "firstname", "middlename", "lastname"],
          "analyzer": "custom_search_analyzer"
        }
      }
    }
  }
}
I should get the 8 documents I have indexed, but I get only 6, leaving out the two whose names are Franz and Francis. I expect those two as well, because the "F" is included in the data; for some reason it's not working.
When I execute:
GET /my_index/user/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "query": "Fran",
          "fields": ["username", "firstname", "middlename", "lastname"],
          "analyzer": "custom_search_analyzer"
        }
      }
    }
  }
}
I get those two documents.
If I lower the ngram to start at 1, I get all the documents, but I think this will affect the performance of the query.
What am I missing here? Thanks in advance.
NOTE: all the examples are written using Sense.
This is expected: since min_gram is specified as 3, the minimum length of a token produced by the custom analyzer is 3 codepoints.
Hence the first token for "Franz Silva" would be "Fra".
Hence the token "F" would not be a match on this document.
You can test the tokens produced by the analyzer using:
curl -XGET "http://<server>/index_name/_analyze?analyzer=custom_analyzer&text=Franz%20Silva"
Also note that since the "custom_analyzer" specified above does not specify "token_chars", the tokens can contain spaces.
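As a rough illustration (a sketch, not verbatim output; the exact tokens and their order depend on the analyzer definition in the gist), an ngram of min_gram 3 and max_gram 20 over "Franz Silva" would emit something along these lines, trimmed here to just the token values:
{
  "tokens": [
    { "token": "Fra" },
    { "token": "Fran" },
    { "token": "Franz" },
    { "token": "ran" },
    { "token": "anz" },
    { "token": "z S" },
    ...
  ]
}
No token shorter than three characters appears, which is why the one-character query "F" can never match.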
I have several million docs that I need to move into a new index, but there is a condition on which docs should flow into the index. Say I have a field named offsets that needs to be queried against. The values I need to query for are [1,7,99,32, ....., 10000432] (a very large list) in the offsets field.
Does anyone have thoughts on how I can move the specific docs with those values in the list into a new Elasticsearch index? My first thought was reindexing with a query, but there is no pattern to the offsets list.
Would it be a Python loop appending each doc to a new index? Looking for any guidance.
Thanks
Are the documents really large, or can you add them into a jsonl file for bulk ingestion?
In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?
I'd do it in Pandas, but here is an idea in ES parlance.
Whatever you do, use the _bulk API, or the job will never finish.
You can run a query based upon a file as per
GET my_index/_search?_file="myquery_file"
You can put all the ids into a file, myquery_file, as below:
{
  "query": {
    "ids": {
      "values": ["1", "4", "100"]
    }
  },
  "format": "jsonl"
}
and output as jsonl to ingest.
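If your Elasticsearch version has no such file parameter, curl can send the file's contents as the request body instead (a sketch; host and index names are placeholders):
curl -XGET "http://localhost:9200/my_index/_search" -H "Content-Type: application/json" -d @myquery_file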
You can do the same with the reindex API:
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
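For the use case in the question, a terms query against the offsets field may fit better than enumerating ids (a sketch; the index names are placeholders, and a very large list would need to be split into chunks below index.max_terms_count, which defaults to 65,536 values):
POST _reindex
{
  "source": {
    "index": "source_index",
    "query": {
      "terms": {
        "offsets": [1, 7, 99, 32]
      }
    }
  },
  "dest": {
    "index": "dest_index"
  }
}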
Unfortunately, I was facing a time crunch and had to throw in a personalized loop to query a very specific subset of indices.
# Assumes an elasticsearch-py client and a local cluster; the names in
# angle brackets are placeholders from the original post.
import pandas as pd
from elasticsearch import Elasticsearch

client = Elasticsearch()

df = pd.read_csv('C://code//part_1_final.csv')
# Offsets are the "unique" values I need to identify the docs by.
# There is no pattern in these values, so I must go one by one.
offsets = df['OFFSET'].tolist()

missedDocs = []
for i in offsets:
    print(i)
    try:
        client.reindex(body={
            "source": {
                "index": "<source_index>",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"<index_field_1>": "1"}},
                            {"match": {"<index_field_that_needs_values_to_match>": i}}
                        ]
                    }
                }
            },
            "dest": {
                "index": "<dest_index>"
            }
        })
    except KeyError:
        print('error')
        # missedDocs.append(i)
        print('DOC ERROR')
Let's take this document for example:
{
  "id": 1,
  "planet": "earth-616",
  "data": [
    ["wolverine", "mutant"],
    ["Storm", "mutant"],
    ["Mark Zuckerberg", "human"]
  ]
}
I created a search index to index the name and type; for example, if I searched for name:wolverine or type:mutant, I'd get the document that contains it. But as per my requirement I don't want the whole document, I only want ["wolverine","mutant"]. I've created a view that outputs this:
{
  "id": 1,
  "key": "earth-616",
  "value": ["earth-616", "wolverine", "mutant"]
}
Then I found out I can query only with keys. (Is it possible to create search indexes on views? I couldn't find anything in the documentation.)
Or should I create views alongside the one above, like this:
{
  "id": 1,
  "key": "wolverine",
  "value": ["earth-616", "wolverine", "mutant"]
}
And
{
  "id": 1,
  "key": "mutant",
  "value": ["earth-616", "wolverine", "mutant"]
}
This way I can query with the keys that I want, but I can't seem to partial-match keys. (Am I missing something?)
If you need the output to be exactly as described, then I believe you have to use views, and to support wildcard searches I believe you will have to index every substring of a key.
One alternative is to use Cloudant Query, although admittedly you cannot get the exact output you are looking for. If you issue a query like this:
{
  "selector": {
    "_id": {
      "$gt": 0
    },
    "data": {
      "$elemMatch": {
        "$elemMatch": {
          "$regex": "(?i)zuck"
        }
      }
    }
  },
  "fields": [
    "data"
  ]
}
The result will be the entire data array:
{
  "data": [
    ["wolverine", "mutant"],
    ["Storm", "mutant"],
    ["Mark Zuckerberg", "human"]
  ]
}
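For reference, a sketch of how the selector above could be posted to Cloudant's _find endpoint (the account and database names are placeholders):
curl -X POST "https://<account>.cloudant.com/<database>/_find" \
  -H "Content-Type: application/json" \
  -d '{ "selector": { "data": { "$elemMatch": { "$elemMatch": { "$regex": "(?i)zuck" } } } }, "fields": ["data"] }'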
I am investigating the possibility of switching to Elasticsearch from SphinxSearch.
What is good about SphinxSearch: full-text search just works out of the box at a pretty good level. Making it work in Elasticsearch turned out not to be as easy as I expected.
In my project I have a search box with typeahead, meaning I type Clint E and see a dropdown with results including Clint Eastwood in first place. I type robert down and see Robert Downey Jr. in first place. All this I achieved with SphinxSearch out of the box, just providing it my DB credentials and an SQL query to pull the necessary fields.
On the other hand, with Elasticsearch I can't get satisfying results even after a day of reading about the Fuzzy Like This query, matching, partial matching and the rest. There is a lot of information, but it does not make the task easier. I feel like I need a PhD in search just to make it work at the simplest level.
So far I have ended up with this configuration:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "dynamic": true,
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "stem"
        }
      }
    }
  }
}
The query looks like this:
{
  "query": {
    "query_string": {
      "query": "clint eastw",
      "default_field": "title"
    }
  }
}
But the quality of search in this case is not satisfying at all; back to my example, it cannot find the Clint Eastwood profile until I type his name completely.
Then I tried to use:
{
  "query": {
    "fuzzy_like_this": {
      "fields": [
        "title"
      ],
      "like_text": "clint eastw",
      "max_query_terms": 25,
      "fuzziness": 0.5
    }
  }
}
It helps, but not much: now I can find what I need with the shorter request clint eastwo, and after some manipulation of the parameters with clint eastw, but it's still not encouraging.
So I wonder, is there a simple recipe for cooking up full-text search with Elasticsearch and getting decent-quality results? I spent a day reading but didn't find the solution.
A couple of images to demonstrate what I am talking about:
Elastic: the name is almost complete, but no expected result; note that there is no better match either.
One letter later, Elastic found it!
At the same moment, Sphinx is shining :)
Elasticsearch ships with a completion suggester.
You don't need to build this into the query functionality; queries work at the token level, not at the partial-token level.
Go for the completion suggester; it also has support for fuzzy matching.
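A minimal sketch of what that might look like (the index and field names are my own; this uses a current-style mapping, while on the 1.x-era versions in this thread the completion field would sit under the type's properties):
PUT movies
{
  "mappings": {
    "properties": {
      "title_suggest": {
        "type": "completion"
      }
    }
  }
}

POST movies/_search
{
  "suggest": {
    "title_suggestions": {
      "prefix": "clint eastw",
      "completion": {
        "field": "title_suggest",
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}
With fuzzy enabled, a prefix like clint eastw can still surface "Clint Eastwood" even with a small typo in the input.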
I'm building a leaderboard with Elasticsearch. I'd like to query all documents that have points greater than a given amount, using the following query:
{
  "constant_score": {
    "filter": {
      "range": {
        "totalPoints": {
          "gt": 242
        }
      }
    }
  }
}
This works perfectly: Elasticsearch appropriately returns all documents with points greater than 242. However, all I really need is the count of elements matching this query. Since I'm sending the result over the network, it would be helpful if the query simply returned the count, as opposed to all of the documents matching the filter.
How do I get Elasticsearch to report only the count of documents matching the filter?
EDIT: I've learned that what I'm looking for is setting search_type to count. However, I'm not sure how to do this with elastic.js. Any noders willing to pitch in their advice?
You can use the search type count for exactly that purpose:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#count
This is an example that should help you:
GET /mymusic/itunes/_search?search_type=count
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "range": {
          "year": {
            "gt": 2000
          }
        }
      }
    }
  }
}
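An alternative worth mentioning is the _count API, which returns only the count and accepts a query body on 1.x and later (a sketch; the leaderboard index name is an assumption, and the query is the one from the question):
GET /leaderboard/_count
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "totalPoints": {
            "gt": 242
          }
        }
      }
    }
  }
}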
My Elasticsearch index has 10 types in it. When searching for the term "test" I want to get all the documents that match that query, plus a list of all the types that have at least one match for that query.
I know I can get this list by going over all the results, but I guess there's a better way.
Thanks!
Since facets have been deprecated (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-facets.html) and replaced with aggregations, here is the solution for aggregations:
{
  "query": {
    ...
  },
  "aggs": {
    "your_aggregation_name": {
      "terms": {
        "field": "_type"
      }
    }
  }
}
Link to documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
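For example, a full request for the "test" search might look like this (a sketch; the index name and the use of _all are assumptions, and size: 0 skips the hits, so drop it if you also want the matching documents themselves):
GET /my_index/_search
{
  "size": 0,
  "query": {
    "match": {
      "_all": "test"
    }
  },
  "aggs": {
    "matched_types": {
      "terms": {
        "field": "_type"
      }
    }
  }
}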
I just managed to do that with Elasticsearch facets, as described here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets.html#_facet_filter
In short, you add this to your query:
"facets": { "facet_name": { "terms": { "field": "_type" } } }
Hope this helps someone.