Elasticsearch with CouchDB river plugin - Can't find any documents

I recently started using Elasticsearch and CouchDB and I have the following problem. I have a CouchDB database with a bunch of documents. I added a CouchDB river to Elasticsearch and expected those documents to be indexed and searchable, but when I search for anything through ES I don't get any results. The command flow is as follows:
The command below verifies that there are 4 documents in the CouchDB instance:
curl -H "Content-Type: application/json" -X GET http://localhost:5984/my_db
result:
{
"db_name": "my_db",
"doc_count": 4,
"doc_del_count": 0,
"update_seq": 4,
"purge_seq": 0,
"compact_running": false,
"disk_size": 16482,
"data_size": 646,
"instance_start_time": "1370204643908592",
"disk_format_version": 6,
"committed_update_seq": 4
}
The _changes output:
curl -H "Content-Type: application/json" -X GET http://localhost:5984/my_db/_changes
{
"results": [
{
"seq": 1,
"id": "1",
"changes": [
{
"rev": "1-40d928a959dd52d183ab7c413fabca92"
}
]
},
{
"seq": 2,
"id": "2",
"changes": [
{
"rev": "1-42212757a56b240f5205266b1969e890"
}
]
},
{
"seq": 3,
"id": "3",
"changes": [
{
"rev": "1-f59c2ae7acacb68d9414be05d56ed33a"
}
]
},
{
"seq": 4,
"id": "4",
"changes": [
{
"rev": "1-e86cf1c287c16906e81d901365b9bf98"
}
]
}
],
"last_seq": 4
}
Below, I'm creating my index in ES:
curl -XPUT 'http://localhost:9200/_river/my_db/_meta' -d '{
"type": "couchdb",
"couchdb": {
"host": "localhost",
"port": 5984,
"db": "my_db",
"filter": null
}
}'
{
"ok": true,
"_index": "_river",
"_type": "my_db",
"_id": "_meta",
"_version": 1
}
But when I search, I don't get anything back:
curl -XGET "http://localhost:9200/my_db/my_db/_search?pretty=true"
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : []
}
}
Is there anything I'm missing?

You're missing the ElasticSearch index settings from your river metadata. From here:
{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"db" : "my_db",
"filter" : null
},
"index" : {
"index" : "my_db",
"type" : "my_db",
"bulk_size" : "100",
"bulk_timeout" : "10ms"
}
}
I haven't seen any documentation that suggests the "index" member can be inferred.

Related

How to match related data when a keyword is typed incorrectly in Elasticsearch

I have a document whose title is "Hard work & Success", and I need to search for it. If I type "Hardwork" (without a space), nothing is returned, but if I type "hard work", the document is returned.
This is the query I used:
const search = qObject.search;
const payload = {
from: skip,
size: limit,
_source: [
"id",
"title",
"thumbnailUrl",
"youtubeUrl",
"speaker",
"standards",
"topics",
"schoolDetails",
"uploadTime",
"schoolName",
"description",
"studentDetails",
"studentId"
],
query: {
bool: {
must: {
multi_match: {
fields: [
"title^2",
"standards.standard^2",
"speaker^2",
"schoolDetails.schoolName^2",
"hashtags^2",
"topics.topic^2",
"studentDetails.studentName^2",
],
query: search,
fuzziness: "AUTO",
},
},
},
},
};
If I search for the title "hard work" (with a space), it returns data like this:
"searchResults": [
{
"_id": "92",
"_score": 19.04531,
"_source": {
"standards": {
"standard": "3",
"categoryType": "STANDARD",
"categoryId": "S3"
},
"schoolDetails": {
"categoryType": "SCHOOL",
"schoolId": "TPS123",
"schoolType": "PUBLIC",
"logo": "91748922mn8bo9krcx71.png",
"schoolName": "Carmel CMI Public School"
},
"studentDetails": {
"studentId": 270,
"studentDp": "164646972124244.jpg",
"studentName": "Nelvin",
"about": "good student"
},
"topics": {
"categoryType": "TOPIC",
"topic": "Motivation",
"categoryId": "MY"
},
"youtubeUrl": "https://www.youtube.com/watch?v=wermQ",
"speaker": "Anna Maria Siby",
"description": "How hardwork leads to success - motivational talk by Anna",
"id": 92,
"uploadTime": "2022-03-17T10:59:59.400Z",
"title": "Hard work & Success",
}
},
]
If I search for the keyword "Hardwork" (without a space), this document is not found. I need the search to either handle the missing space or match related data for the search keyword. Is there a solution for this? Can you please help me out?
I made an example using a shingle analyzer.
Mapping:
{
"settings": {
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"max_shingle_size": 4,
"min_shingle_size": 2,
"output_unigrams": "true",
"token_separator": ""
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "shingle_analyzer"
}
}
}
}
I then tested it with your term. Note that the token "hardwork" was generated, but other shingles such as "hardworksuccess" and "worksuccess" were generated as well, which may be a problem for you.
GET idx-separator-words/_analyze
{
"analyzer": "shingle_analyzer",
"text": ["Hard work & Success"]
}
Results:
{
"tokens" : [
{
"token" : "hard",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hardwork",
"start_offset" : 0,
"end_offset" : 9,
"type" : "shingle",
"position" : 0,
"positionLength" : 2
},
{
"token" : "hardworksuccess",
"start_offset" : 0,
"end_offset" : 19,
"type" : "shingle",
"position" : 0,
"positionLength" : 3
},
{
"token" : "work",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "worksuccess",
"start_offset" : 5,
"end_offset" : 19,
"type" : "shingle",
"position" : 1,
"positionLength" : 2
},
{
"token" : "success",
"start_offset" : 12,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
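To see why the shingle filter produces "hardwork", here is a minimal Python sketch that imitates the analyzer chain above (split into words, lowercase, then shingles joined with an empty separator). It is only an approximation for illustration: the function name `shingle_tokens` is made up here, and the word splitting uses a simple regex rather than Elasticsearch's standard tokenizer.

```python
import re


def shingle_tokens(text, min_size=2, max_size=4, separator=""):
    """Approximate a lowercase + shingle filter chain with output_unigrams enabled."""
    # Rough stand-in for the standard tokenizer: keep runs of word characters,
    # so the "&" in the title is dropped.
    words = [w.lower() for w in re.findall(r"\w+", text)]
    tokens = []
    for start in range(len(words)):
        # Emit the unigram first, then every shingle starting at this word.
        tokens.append(words[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                tokens.append(separator.join(words[start:start + size]))
    return tokens


tokens = shingle_tokens("Hard work & Success")
print(tokens)
# ['hard', 'hardwork', 'hardworksuccess', 'work', 'worksuccess', 'success']
```

Because "hardwork" is one of the indexed tokens, a query for "Hardwork" can match the title directly. The extra shingles are the cost: they enlarge the index and can produce matches you did not intend.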

Elastic Search multi match query can't ignore special characters

I have a name field with the value "abc_name". When I search for "abc_" I get the proper results, but when I search for "abc_##£&-#&" I still get the same results. I want my query to ignore these special characters that don't match my query.
My query has:
Multi_match
type as cross_fields
operator AND
I am using search_analyzer standard for my fields.
I want to keep this structure as it is, otherwise it will affect my other search behaviour:
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
Please see the sample below, where I've created a custom analyzer that should fit your use case:
Sample Mapping:
PUT some_test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "custom_tokenizer",
"filter": ["lowercase", "3_5_edge_ngram"]
}
},
"tokenizer": {
"custom_tokenizer": {
"type": "pattern",
"pattern": "\\w+_+[^a-zA-Z\\d\\s_]+|\\s+"    <---- Note this pattern
}
},
"filter": {
"3_5_edge_ngram": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 5
}
}
}
},
"mappings": {
"properties": {
"my_field":{
"type": "text",
"analyzer": "my_custom_analyzer"
}
}
}
}
The pattern above treats any string of the form abc_$%^^## (word characters, underscores, then special characters) as a separator, just like whitespace. As a result, such strings never become tokens and are not indexed.
Note that the way the analyzer works is:
First executes tokenizer
Then applies the edge_ngram filter on the tokens generated.
You can verify this by temporarily removing the edge_ngram filter from the mapping above and calling the Analyze API, which shows the tokens the tokenizer produces on its own:
POST some_test_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "abc_name asda efg_!##!## 1213_adav"
}
Tokens generated:
{
"tokens" : [
{
"token" : "abc_name",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 0
},
{
"token" : "asda",
"start_offset" : 9,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "1213_adav",
"start_offset" : 25,
"end_offset" : 34,
"type" : "word",
"position" : 2
}
]
}
Note that the token efg_!##!## has been removed.
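The same behaviour can be reproduced outside Elasticsearch. The sketch below is an illustration only: it applies the pattern from the mapping with Python's `re.split` (the pattern tokenizer splits on matches in a similar way) and then imitates the 3 to 5 character edge_ngram filter. The helper names `pattern_tokenize` and `edge_ngrams` are made up for this example, and Python's `\w` is Unicode-aware, so results may differ from Elasticsearch for non-ASCII input.

```python
import re

# Same pattern as in the mapping, written as a Python raw string.
PATTERN = r"\w+_+[^a-zA-Z\d\s_]+|\s+"


def pattern_tokenize(text):
    """Split on every match of PATTERN and drop the empty pieces."""
    return [piece for piece in re.split(PATTERN, text) if piece]


def edge_ngrams(token, min_gram=3, max_gram=5):
    """Prefixes of the token between min_gram and max_gram characters long."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


indexed = pattern_tokenize("abc_name asda efg_!##!## 1213_adav")
print(indexed)                         # ['abc_name', 'asda', '1213_adav']
print(edge_ngrams("abc_name"))         # ['abc', 'abc_', 'abc_n']
print(pattern_tokenize("efg_!##!##"))  # []
```

The last line shows why a query consisting only of such a string finds nothing: the whole input is consumed as a separator, so no search tokens are left.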
I've added the edge_ngram filter so that a search for abc_ succeeds when the tokenizer produces the token abc_name.
Sample Document:
POST some_test_index/_doc/1
{
"my_field": "abc_name asda efg_!##!## 1213_adav"
}
Query Request:
Use-case-1:
POST some_test_index/_search
{
"query": {
"match": {
"my_field": "abc_"
}
}
}
Use-case-2:
POST some_test_index/_search
{
"query": {
"match": {
"my_field": "efg_!##!##"
}
}
}
Responses:
Response for use-case-1:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.47992462,
"hits" : [
{
"_index" : "some_test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.47992462,
"_source" : {
"my_field" : "abc_name asda efg_!##!## 1213_adav"
}
}
]
}
}
Response for use-case-2:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Updated Answer:
Create your mapping as follows, based on the index I've created, and let me know if that works:
PUT some_test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "punctuation",
"filter": ["lowercase"]
}
},
"tokenizer": {
"punctuation": {
"type": "pattern",
"pattern": "\\w+_+[^a-zA-Z\\d\\s_]+|\\s+"
}
}
}
},
"mappings": {
"properties": {
"my_field":{
"type": "text",
"analyzer": "autocomplete",    <----- Assuming you already have this in your settings
"search_analyzer": "my_custom_analyzer"    <----- Note this
}
}
}
}
Please try and let me know if this works for all your use-cases.

NodeJs-ElasticSearch Bulk API error handling

I can't find any documentation on what happens if the Elasticsearch Bulk API fails on one or more of the actions. For example, in the following request, let's say there is already a document with id "3", so "create" should fail. Does this fail all of the other actions?
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }
I'm using the Node.js elasticsearch module.
No, a failure in one action does not affect the others.
From the documentation of the Elasticsearch Bulk API:
The response to a bulk action is a large JSON structure with the
individual results of each action that was performed. The failure of a
single action does not affect the remaining actions.
In the response from the Elasticsearch client, each action has its own status, which tells you whether that action failed.
Example:
client.bulk({
body: [
// action description
{ index: { _index: 'test', _type: 'test', _id: 1 } },
// the document to index
{ title: 'foo' },
// action description
{ update: { _index: 'test', _type: 'test', _id: 332 } },
// the document to update
{ doc: { title: 'foo' } },
// action description
{ delete: { _index: 'test', _type: 'test', _id: 33 } },
// no document needed for this delete
]
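}, function (err, resp) {
// err is set when the request itself fails; per-action failures are reported in resp.items
if (err) {
console.error(err);
return;
}
if (resp.errors) {
console.log(JSON.stringify(resp, null, '\t'));
}
});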
Response:
{
"took": 13,
"errors": true,
"items": [
{
"index": {
"_index": "test",
"_type": "test",
"_id": "1",
"_version": 20,
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"status": 200
}
},
{
"update": {
"_index": "test",
"_type": "test",
"_id": "332",
"status": 404,
"error": {
"type": "document_missing_exception",
"reason": "[test][332]: document missing",
"shard": "-1",
"index": "test"
}
}
},
{
"delete": {
"_index": "test",
"_type": "test",
"_id": "33",
"_version": 2,
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"status": 404,
"found": false
}
}
]
}
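Since every item carries its own result, the failed actions can be picked out of the response by looking for an "error" entry. The Python sketch below works on an already parsed response shaped like the one above. The function name `failed_actions` is made up for this example, and the exact shape of the "error" value is an assumption: it is an object in the response shown here, but it may be a plain string in other versions.

```python
def failed_actions(bulk_response):
    """Return (action, _id, status, error type) for every item that carries an error."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action: index, create, update or delete.
        for action, result in item.items():
            error = result.get("error")
            if error is not None:
                # Assumption: the error is either an object with a "type" or a plain string.
                error_type = error.get("type") if isinstance(error, dict) else error
                failures.append((action, result.get("_id"), result.get("status"), error_type))
    return failures


# Trimmed copy of the response shown above.
response = {
    "took": 13,
    "errors": True,
    "items": [
        {"index": {"_index": "test", "_id": "1", "status": 200}},
        {"update": {"_index": "test", "_id": "332", "status": 404,
                    "error": {"type": "document_missing_exception",
                              "reason": "[test][332]: document missing"}}},
        {"delete": {"_index": "test", "_id": "33", "status": 404, "found": False}},
    ],
}
print(failed_actions(response))
# [('update', '332', 404, 'document_missing_exception')]
```

Note that the delete of a missing document returns status 404 but no "error" entry, so this sketch does not count it as a failure. Check the status as well if you want to treat it as one.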

Elasticsearch term suggester returns stemmed results

Why are the Elasticsearch term suggester results stemmed?
When I run this query:
curl -XPOST 'localhost:9200/posts/_suggest' -d '{
"my-suggestion" : {
"text" : "manger",
"term" : {
"field" : "body"
}
}
}'
the expected result should be "manager" but I get back "manag":
{
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"my-suggest-1":[
{
"text":"mang",
"offset":0,
"length":6,
"options":[
{
"text":"manag",
"score":0.75,
"freq":180
},
{
"text":"mani",
"score":0.75,
"freq":6
}
]
}
]
}
EDIT
I found a solution to my problem: I added the standard analyzer to my query.
curl -XPOST 'localhost:9200/posts/_suggest' -d '{
"my-suggestion" : {
"text" : "manger",
"term" : {
"analyzer" : "standard",
"field" : "body"
}
}
}'
Now the results are good:
{
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"my-suggest":[
{
"text":"mang",
"offset":0,
"length":6,
"options":[
{
"text":"manager",
"score":0.75,
"freq":180
},
{
"text":"manuel",
"score":0.75,
"freq":6
}
]
}
]
}
But I've run into a similar problem with aggregations:
{
"aggs" : {
"cities" : {
"terms" : { "field" : "location" }
}
}
}
The results I get are trimmed:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 473,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"cities": {
"buckets": [{
"key": "londr",
"doc_count": 244
}, {
"key": "pari",
"doc_count": 244
}, {
"key": "tang",
"doc_count": 12
}, {
"key": "agad",
"doc_count": 8
}]
}
}
}
The terms aggregation works on the terms produced from the original text by tokenization and stemming. To disable tokenization and stemming, mark the field as "not_analyzed" in your index mappings.
I have never used suggesters, but I think you need to disable stemming for that field while keeping tokenization. You can keep two versions of the field in the index: one for search (tokenized and stemmed) and one for suggesters (tokenized but not stemmed).
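One way to keep several versions of a field is a multi-field mapping. The Python sketch below only builds the request bodies as dictionaries. It assumes an older Elasticsearch version where "string" fields and "index": "not_analyzed" are still supported. The sub-field names "raw" and "unstemmed", the analyzer name "my_stemming_analyzer" and the helper `two_version_mapping` are all hypothetical and should be replaced with whatever your index actually uses.

```python
import json


def two_version_mapping(field, stemming_analyzer):
    """Mapping with a stemmed main field plus unanalyzed and unstemmed sub-fields."""
    return {
        "properties": {
            field: {
                "type": "string",
                "analyzer": stemming_analyzer,  # existing analyzer used for search
                "fields": {
                    # Whole value kept as a single term, for terms aggregations.
                    "raw": {"type": "string", "index": "not_analyzed"},
                    # Tokenized but not stemmed, for suggesters.
                    "unstemmed": {"type": "string", "analyzer": "standard"},
                },
            }
        }
    }


mapping = two_version_mapping("location", "my_stemming_analyzer")
# The aggregation then targets the unanalyzed sub-field instead of the stemmed field.
aggregation = {"aggs": {"cities": {"terms": {"field": "location.raw"}}}}

print(json.dumps(mapping, indent=2))
print(json.dumps(aggregation, indent=2))
```

Existing documents have to be reindexed before the new sub-fields contain any data.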

ElasticSearch CouchDB river - explicitly specify field type

I am using ElasticSearch river to index a CouchDB database of tweets.
The "created_at" field doesn't conform to the "date" type and gets indexed as a String.
How would I start a river while explicitly specifying that "created_at" is a date, so that I could do range queries on it?
I tried the following river request, but it didn't work and the field was still indexed as a String:
curl -XPUT 'localhost:9200/_river/my_db/_meta' -d '{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"db" : "testtweets",
"filter" : null
},
"index" : {
"index" : "my_testing",
"type" : "my_datetesting",
"properties" : {"created_at": {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss"
}
},
"bulk_size" : "100",
"bulk_timeout" : "10ms"
}
}'
My data looks like this:
{
"_id": "262856000481136640",
"_rev": "1-0ed7c0fe655974e236814184bef5ff16",
"contributors": null,
"truncated": false,
"text": "RT #edoswald: Ocean City MD first to show that #Sandy is no joke. Pier badly damaged, sea nearly topping the seawall http://t.co/D0Wwok4 ...",
"author_name": "Casey Strader",
"author_created_at": "2011-04-21 20:00:32",
"author_description": "",
"author_location": "",
"author_geo_enabled": false,
"source": "Twitter for iPhone",
"retweeted": false,
"coordinates": null,
"author_verified": false,
"entities": {
"user_mentions": [
{
"indices": [
3,
12
],
"id_str": "10433822",
"id": 10433822,
"name": "Ed Oswald",
"screen_name": "edoswald"
}
],
"hashtags": [
{
"indices": [
47,
53
],
"text": "Sandy"
}
],
"urls": [
{
"indices": [
117,
136
],
"url": "http://t.co/D0Wwok4",
"expanded_url": "http://t.co/D0Wwok4",
"display_url": "t.co/D0Wwok4"
}
]
},
"in_reply_to_screen_name": null,
"author_id_str": "285792303",
"retweet_count": 98,
"id_str": "262856000481136640",
"favorited": false,
"source_url": "http://twitter.com/download/iphone",
"author_screen_name": "Casey_Rae22",
"geo": null,
"in_reply_to_user_id_str": null,
"author_time_zone": "Eastern Time (US & Canada)",
"created_at": "2012-10-29 09:58:48",
"in_reply_to_status_id_str": null,
"place": null
}
Thanks!
