Stormcrawler with ES - content not stored

Content / text not appearing in ES even after the indexer is configured to store content.
StormCrawler 1.14, ES 7.0. I followed the online tutorial and the configuration change to ES_IndexInit described in: Stormcrawler not indexing content with Elasticsearch.
Here are the changes to the content properties in ES_IndexInit.sh:
"mappings": {
"_source": {
"enabled": true
},
"properties": {
"content": {
"type": "text",
"index": "true",
"store": true
},
....
The crawl in local mode runs successfully and the status and metrics indices are populated with data. But the content index still remains empty:
curl -H 'Content-Type: application/json' -XGET <my-es-host>:<my-es-host-port>/content/_search?pretty
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
The crawl logs do not indicate any failures, and apart from the missing content the results are as expected. It looks like a configuration issue, but after eliminating the usual suspects the problem remains.
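One frequent culprit worth double-checking (an assumption here, since the crawler configuration isn't shown): StormCrawler's indexer bolt only sends the parsed text when the text field is enabled in the crawler configuration. A minimal sketch of the relevant keys in crawler-conf.yaml, per the StormCrawler 1.x defaults:
# crawler-conf.yaml (sketch; verify against your own config)
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
If indexer.text.fieldname is commented out or missing, documents are indexed with metadata but no content field, which matches the symptom above.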

Related

ElasticSearch can't get multiple suggester values from the same document

Can you help me, please? I have a problem with the Completion Suggester in Elasticsearch.
Example: I have this mapping:
PUT music
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion"
      },
      "title": {
        "type": "keyword"
      }
    }
  }
}
and index multiple suggestions for a document as follows:
PUT music/_doc/1?refresh
{
  "suggest": [
    {
      "input": "Nirvana test",
      "weight": 10
    },
    {
      "input": "Nirvana best",
      "weight": 3
    }
  ]
}
Querying: you can run this request in Kibana:
POST music/_search?pretty
{
  "suggest": {
    "song-suggest": {
      "prefix": "nirv",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
The result contains only the first value, not both.
I ran the same test in the Kibana dev tools, and this is the result:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "song-suggest" : [
      {
        "text" : "nirv",
        "offset" : 0,
        "length" : 4,
        "options" : [
          {
            "text" : "Nirvana test",
            "_index" : "music",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 10.0,
            "_source" : {
              "suggest" : [
                {
                  "input" : "Nirvana test",
                  "weight" : 10
                },
                {
                  "input" : "Nirvana best",
                  "weight" : 3
                }
              ]
            }
          }
        ]
      }
    ]
  }
}
Expected result:
"suggest" : {
  "song-suggest" : [
    {
      "text" : "nirvana",
      "offset" : 0,
      "length" : 7,
      "options" : [
        {
          "text" : "Nirvana test",
          "_index" : "music",
          "_type" : "_doc",
          "_id" : "1",
          "_score" : 10.0,
          "_source" : {
            "suggest" : [
              {
                "input" : "Nirvana test",
                "weight" : 10
              },
              {
                "input" : "Nirvana best",
                "weight" : 3
              }
            ]
          }
        }
      ]
    },
    {
      "text" : "nirvana b",
      "offset" : 0,
      "length" : 9,
      "options" : [
        {
          "text" : "Nirvana best",
          "_index" : "music",
          "_type" : "_doc",
          "_id" : "1",
          "_score" : 3.0,
          "_source" : {
            "suggest" : [
              {
                "input" : "Nirvana test",
                "weight" : 10
              },
              {
                "input" : "Nirvana best",
                "weight" : 3
              }
            ]
          }
        }
      ]
    }
  ]
}
This is the default behavior of the current implementation. You can check #31738. Below is one of the comments there, explaining why only one document/suggestion is returned:
The completion suggester is document-based by design so we cannot
return one entry per matching suggestion. It is documented that it
returns documents not suggestions and a single input can be indexed in
multiple suggestions (if you have synonyms in your analyzer for
instance) so it is not trivial to differentiate a match from its
variations. Also the completion suggester does not visit all
suggestions to select the top N, it has a special structure (a
weighted FST) that can visit suggestions in the order of their scores
and early terminates the query once enough documents have been found.
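Given that document-based design, a common workaround (a sketch, not taken from the linked issue) is to index each suggestion as its own document, so every input can surface as a separate option:
PUT music/_doc/1?refresh
{
  "suggest": { "input": "Nirvana test", "weight": 10 }
}
PUT music/_doc/2?refresh
{
  "suggest": { "input": "Nirvana best", "weight": 3 }
}
With one suggestion per document, the same prefix query returns one option per matching document, which is the shape of the expected result above.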

Elasticsearch multi_match query can't ignore special characters

I have a name field with the value "abc_name". When I search for "abc_" I get the proper results, but when I search for "abc_##£&-#&" I still get the same results. I want my query to ignore special characters that don't match.
My query uses multi_match with type cross_fields and operator AND, and the standard search_analyzer on my fields (a sketch follows below).
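For reference, a minimal sketch of the query described above (the index name and field list are assumptions, not taken from the question):
POST my_index/_search
{
  "query": {
    "multi_match": {
      "query": "abc_",
      "fields": ["name"],
      "type": "cross_fields",
      "operator": "and"
    }
  }
}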
I want to keep this mapping structure as it is, since changing it would affect my other search behaviour:
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
Please see the sample below, where I've created a custom analyzer that fits your use case.
Sample Mapping:
PUT some_test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "custom_tokenizer",
          "filter": ["lowercase", "3_5_edge_ngram"]
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "pattern",
          "pattern": "\\w+_+[^a-zA-Z\\d\\s_]+|\\s+"   <---- Note this pattern
        }
      },
      "filter": {
        "3_5_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
The above pattern simply ignores tokens of the form abc_$%^^##; as a result, such tokens are not indexed.
Note the way the analyzer works:
First it executes the tokenizer.
Then it applies the edge_ngram filter to the tokens generated.
You can verify this by removing the edge_ngram filter from the above mapping and checking what tokens are generated via the Analyze API, which would be as below:
POST some_test_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "abc_name asda efg_!##!## 1213_adav"
}
Tokens generated:
{
  "tokens" : [
    {
      "token" : "abc_name",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "asda",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "1213_adav",
      "start_offset" : 25,
      "end_offset" : 34,
      "type" : "word",
      "position" : 2
    }
  ]
}
Note that the token efg_!##!## has been removed.
I've added the edge_ngram filter because you want the search for abc_ to succeed when the token generated by the tokenizer is abc_name.
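To see why abc_ matches, you can run the full analyzer (tokenizer plus edge_ngram filter) through the Analyze API; for the token abc_name, edge n-grams of length 3 to 5 should come out as abc, abc_ and abc_n:
POST some_test_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "abc_name"
}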
Sample Document:
POST some_test_index/_doc/1
{
  "my_field": "abc_name asda efg_!##!## 1213_adav"
}
Query Request:
Use-case 1:
POST some_test_index/_search
{
  "query": {
    "match": {
      "my_field": "abc_"
    }
  }
}
Use-case 2:
POST some_test_index/_search
{
  "query": {
    "match": {
      "my_field": "efg_!##!##"
    }
  }
}
Responses:
Response for use-case 1:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.47992462,
    "hits" : [
      {
        "_index" : "some_test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.47992462,
        "_source" : {
          "my_field" : "abc_name asda efg_!##!## 1213_adav"
        }
      }
    ]
  }
}
Response for use-case 2:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
Updated Answer:
Create your mapping as follows based on the index I've created and let me know if that works:
PUT some_test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "punctuation",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "\\w+_+[^a-zA-Z\\d\\s_]+|\\s+"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "autocomplete",               <----- Assuming you already have this in your settings
        "search_analyzer": "my_custom_analyzer"   <----- Note this
      }
    }
  }
}
Please try and let me know if this works for all your use-cases.
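You can also sanity-check the search analyzer on its own via the Analyze API; since the pattern swallows the whole token, a query string of word characters followed by special characters should produce no tokens at all, and therefore no match:
POST some_test_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "abc_##£&-#&"
}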

Is there any solution for searching both exact words and containing words in Elasticsearch?

index: process.env.elasticSearchIndexName,
body: {
  query: {
    bool: {
      must: [
        {
          match_phrase: {
            title: `${searchKey}`,
          },
        },
      ],
    },
  },
},
from: (page || constants.pager.page),
size: (limit || constants.pager.limit),
I am using the above method, but the problem is that it only finds exact word matches in the whole text; it can't find containing words. For example, if title = "sweatshirt" and I type the word "shirt", the result should come back, but currently I get no result with the above method.
The standard analyzer (the default if none is specified) breaks text into tokens.
For the sentence "this is a test" the tokens generated are [this, is, a, test].
A match_phrase query breaks the text into tokens using the same analyzer as the indexing analyzer and returns documents that 1. contain all the tokens and 2. have the tokens appear in the same order.
Since your text is "sweatshirt", there is a single token "sweatshirt" in the inverted index for it, which will match neither "sweat" nor "shirt".
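You can confirm this with the Analyze API; the standard analyzer emits the whole word as a single token:
POST _analyze
{
  "analyzer": "standard",
  "text": "sweatshirt"
}
This returns the single token "sweatshirt", so neither "sweat" nor "shirt" can match it.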
NGram tokenizer
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Query:
{
  "query": {
    "match": {
      "text": "shirt"
    }
  }
}
If you run the _analyze query:
GET my_index/_analyze
{
  "text": ["sweatshirt"],
  "analyzer": "my_analyzer"
}
you will see that the tokens below are generated for the text "sweatshirt". The token size can be adjusted using min_gram and max_gram:
{
  "tokens" : [
    {
      "token" : "swe",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wea",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "eat",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "ats",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "tsh",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "shi",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "hir",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "irt",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 7
    }
  ]
}
Warning: n-grams increase the size of the inverted index, so use appropriate values of min_gram and max_gram.
Another option is to use a wildcard query. Wildcard queries have low performance: every document has to be scanned to check whether the text matches the pattern. Run the wildcard against a not_analyzed field, e.g. a keyword subfield such as text.keyword, in case you want whitespace included:
{
  "query": {
    "wildcard": {
      "text.keyword": {
        "value": "*shirt*"
      }
    }
  }
}

ElasticSearch CouchDB river - explicitly specify field type

I am using ElasticSearch river to index a CouchDB database of tweets.
The "created_at" field doesn't conform to the "date" type and gets indexed as a String.
How would I start a river with explicitly specifying that "created_at" is a Date, so that I could do range queries on it?
I tried the following river request, but it didn't work and the field was still indexed as a String:
curl -XPUT 'localhost:9200/_river/my_db/_meta' -d '{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "testtweets",
    "filter" : null
  },
  "index" : {
    "index" : "my_testing",
    "type" : "my_datetesting",
    "properties" : {
      "created_at" : {
        "type" : "date",
        "format" : "yyyy-MM-dd HH:mm:ss"
      }
    },
    "bulk_size" : "100",
    "bulk_timeout" : "10ms"
  }
}'
My data looks like this:
{
  "_id": "262856000481136640",
  "_rev": "1-0ed7c0fe655974e236814184bef5ff16",
  "contributors": null,
  "truncated": false,
  "text": "RT #edoswald: Ocean City MD first to show that #Sandy is no joke. Pier badly damaged, sea nearly topping the seawall http://t.co/D0Wwok4 ...",
  "author_name": "Casey Strader",
  "author_created_at": "2011-04-21 20:00:32",
  "author_description": "",
  "author_location": "",
  "author_geo_enabled": false,
  "source": "Twitter for iPhone",
  "retweeted": false,
  "coordinates": null,
  "author_verified": false,
  "entities": {
    "user_mentions": [
      {
        "indices": [
          3,
          12
        ],
        "id_str": "10433822",
        "id": 10433822,
        "name": "Ed Oswald",
        "screen_name": "edoswald"
      }
    ],
    "hashtags": [
      {
        "indices": [
          47,
          53
        ],
        "text": "Sandy"
      }
    ],
    "urls": [
      {
        "indices": [
          117,
          136
        ],
        "url": "http://t.co/D0Wwok4",
        "expanded_url": "http://t.co/D0Wwok4",
        "display_url": "t.co/D0Wwok4"
      }
    ]
  },
  "in_reply_to_screen_name": null,
  "author_id_str": "285792303",
  "retweet_count": 98,
  "id_str": "262856000481136640",
  "favorited": false,
  "source_url": "http://twitter.com/download/iphone",
  "author_screen_name": "Casey_Rae22",
  "geo": null,
  "in_reply_to_user_id_str": null,
  "author_time_zone": "Eastern Time (US & Canada)",
  "created_at": "2012-10-29 09:58:48",
  "in_reply_to_status_id_str": null,
  "place": null
}
Thanks!
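A likely fix, though not confirmed in this thread: the river's "index" block configures bulk indexing, not mappings, so the mapping has to be created explicitly before the river starts. A sketch using the names from the request above, with the mapping API of the river-era ES versions:
curl -XPUT 'localhost:9200/my_testing'
curl -XPUT 'localhost:9200/my_testing/my_datetesting/_mapping' -d '{
  "my_datetesting" : {
    "properties" : {
      "created_at" : {
        "type" : "date",
        "format" : "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}'
Once the mapping exists, start the river without the "properties" section and the incoming "created_at" values should be indexed as dates.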

Elasticsearch with CouchDB river plugin - Can't find any documents

I recently started using Elasticsearch and CouchDB and I have the following problem. I have a Couch database with a bunch of documents. I added a CouchDB river index on Elasticsearch and expected those documents to be indexed and searchable. But when I search for anything through ES, I don't get any results. The command flow is as follows.
The command below verifies that there are 4 documents in the CouchDB instance:
curl -H "Content-Type: application/json" -X GET http://localhost:5984/my_db
result:
{
  "db_name": "my_db",
  "doc_count": 4,
  "doc_del_count": 0,
  "update_seq": 4,
  "purge_seq": 0,
  "compact_running": false,
  "disk_size": 16482,
  "data_size": 646,
  "instance_start_time": "1370204643908592",
  "disk_format_version": 6,
  "committed_update_seq": 4
}
The _changes output:
curl -H "Content-Type: application/json" -X GET http://localhost:5984/my_db/_changes
{
  "results": [
    {
      "seq": 1,
      "id": "1",
      "changes": [
        {
          "rev": "1-40d928a959dd52d183ab7c413fabca92"
        }
      ]
    },
    {
      "seq": 2,
      "id": "2",
      "changes": [
        {
          "rev": "1-42212757a56b240f5205266b1969e890"
        }
      ]
    },
    {
      "seq": 3,
      "id": "3",
      "changes": [
        {
          "rev": "1-f59c2ae7acacb68d9414be05d56ed33a"
        }
      ]
    },
    {
      "seq": 4,
      "id": "4",
      "changes": [
        {
          "rev": "1-e86cf1c287c16906e81d901365b9bf98"
        }
      ]
    }
  ],
  "last_seq": 4
}
Now I create my river in ES:
curl -XPUT 'http://localhost:9200/_river/my_db/_meta' -d '{
  "type": "couchdb",
  "couchdb": {
    "host": "localhost",
    "port": 5984,
    "db": "my_db",
    "filter": null
  }
}'
{
  "ok": true,
  "_index": "_river",
  "_type": "my_db",
  "_id": "_meta",
  "_version": 1
}
But I don't get anything back.
curl -XGET "http://localhost:9200/my_db/my_db/_search?pretty=true"
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : []
  }
}
Is there anything I'm missing?
You're missing the ElasticSearch index settings from your river metadata. From here:
{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "my_db",
    "filter" : null
  },
  "index" : {
    "index" : "my_db",
    "type" : "my_db",
    "bulk_size" : "100",
    "bulk_timeout" : "10ms"
  }
}
I haven't seen any documentation that suggests the "index" member can be inferred.
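If the river was already registered without the "index" block, it may also need to be dropped and re-created so the metadata is re-read (a sketch; endpoints as used earlier in the thread):
curl -XDELETE 'http://localhost:9200/_river/my_db'
curl -XPUT 'http://localhost:9200/_river/my_db/_meta' -d '{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "my_db",
    "filter" : null
  },
  "index" : {
    "index" : "my_db",
    "type" : "my_db",
    "bulk_size" : "100",
    "bulk_timeout" : "10ms"
  }
}'
curl -XGET "http://localhost:9200/my_db/my_db/_search?pretty=true"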
