Error indexing Nutch crawl data into Elasticsearch

I'm using Nutch 1.14 and trying to index a small web crawl into ES v5.3.0 and I keep getting this error:
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Error running:
/home/david/tutorials/nutch/apache-nutch-1.14-src/runtime/local/bin/nutch index -Delastic.server.url=http://localhost:9300/search-index/ searchcrawl//crawldb -linkdb searchcrawl//linkdb searchcrawl//segments/20180824175802
Failed with exit value 255.
I've already done this and I still get the error...
UPDATE - OK, I've made progress. Indexing seems to work now - no more errors. However, when I use _stats via Kibana to check the document count, I get 0, while Nutch is telling me this:
Segment dir is complete: crawl/segments/20180830115119.
Indexer: starting at 2018-08-30 12:19:31
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticRestIndexWriter
elastic.rest.host : hostname
elastic.rest.port : port
elastic.rest.index : elastic index command
elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 9 indexed (add/update)
Indexer: finished at 2018-08-30 12:19:45, elapsed: 00:00:14
I'm assuming that means ES was sent 9 documents for indexing?
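(For reference, the document count can also be checked directly against Elasticsearch, e.g. from the Kibana Dev Tools console. The index name search-index below is taken from the indexing command above and is an assumption about where the documents went; adjust host/port and index name as needed.)

# Document count for the index
GET search-index/_count

# Index listing including docs.count
GET _cat/indices/search-index?v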

I've used Elasticsearch 6.0 with Nutch 1.14 and it worked like a charm. I was using the indexer-elastic-rest plugin with port 9200. I am attaching my nutch-site.xml for reference (a sketch of the relevant properties is shown below).
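Since the actual file is not shown here, this is only a minimal sketch of what the relevant nutch-site.xml entries for the indexer-elastic-rest plugin can look like. Host, port, and index name are placeholders, and the plugin.includes value mirrors the Nutch default with the indexer swapped; the elastic.rest.* property names are the ones listed in the indexer output above.

<configuration>
  <!-- Make sure the REST-based indexer plugin is on the plugin list -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic-rest|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- Elasticsearch REST endpoint (placeholders) -->
  <property>
    <name>elastic.rest.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.rest.port</name>
    <value>9200</value>
  </property>
  <!-- Target index for the crawled documents -->
  <property>
    <name>elastic.rest.index</name>
    <value>search-index</value>
  </property>
</configuration>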

Related

Insert Events with Changed Status Only using Logstash

I'm inserting data into Elasticsearch (Index A) every minute for the healthcheck of some endpoints. I want to read Index A every minute for the last events it received, and if the state of any endpoint changes (from healthy to unhealthy or unhealthy to healthy), insert that event into Index B.
How would I achieve that, and if possible can someone provide sample code? I tried the elasticsearch filter plugin but couldn't get the desired result.
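A rough sketch of one way this can be wired up with the elasticsearch filter plugin. The host, index names, and the endpoint/state field names are assumptions, not taken from the question; treat this as a starting point, not a working answer.

filter {
  # Look up the most recent previously indexed event for this endpoint in Index A.
  # "endpoint" and "state" are assumed field names.
  elasticsearch {
    hosts  => ["localhost:9200"]
    index  => "index-a"
    query  => "endpoint:%{[endpoint]}"
    fields => { "state" => "previous_state" }
  }
}

output {
  # Keep writing every healthcheck event to Index A as before
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "index-a"
  }
  # Additionally write the event to Index B only when the state changed
  if [state] != [previous_state] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "index-b"
    }
  }
}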

CosmosDB Mongo v.4.0 throws Query exceeded the maximum allowed memory usage of 40 MB

I am asking for help because I still face the 40 MB issue on the MongoDB server in Cosmos DB despite having upgraded the version to 4.0 (the 40 MB issue was fixed in 3.6).
I have a simple query that I am building using the IMongoQueryable interface.
protected Task<List<TEntity>> GetAllAsync(IMongoQueryable<TEntity> query)
{
    return query.ToListAsync();
}
The above repository method is later awaited in a service. The translated query looks like this:
{aggregate([{ "$match" : { "Foo" : "Bar", "IsDeleted" : false } }])}
I have about 20k documents matching "Bar" that I want to extract with the query I am building:
var result = await GetAllAsync(DbQueryableCollection
    .Where(x => x.Foo == "Bar" && x.IsDeleted == isDeleted));
When testing on my local machine it works fine. When published to Azure App Service, I receive an error:
"Command aggregate failed: Query exceeded the maximum allowed memory usage of 40 MB. Please consider adding more filters to reduce the query response size.."
When I execute the same query using IMongoCollection.FindAsync() with a filter as an argument instead of IMongoQueryable, it works fine on App Service.
Below works fine.
var result = await DbCollection.FindAsync(x => x.Model == model && x.IsDeleted == isDeleted);
I am using MongoDB.Driver for .NET v2.12.3 (latest stable).
I have created a wildcard index on the collection.
Why do I still see the 40 MB issue on App Service when the Mongo server is upgraded to 4.0, and why does it work locally?
Why does the query constructed with IMongoQueryable not work on App Service, while the one built with IMongoCollection.FindAsync() works and returns the proper result?
Updating the endpoint from documents.azure.com to mongo.cosmos.azure fixed this issue.
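A minimal sketch of what that endpoint change can look like when constructing the driver. The account name, key, database name, and query-string parameters are placeholders/assumptions; only the host part of the connection string is the point here.

using MongoDB.Driver;

// Legacy Cosmos DB Mongo endpoint (placeholder account name and key):
//   mongodb://myaccount:<key>@myaccount.documents.azure.com:10255/?ssl=true
// Assumed endpoint form for accounts running the Mongo 4.0 server version:
var connectionString =
    "mongodb://myaccount:<key>@myaccount.mongo.cosmos.azure.com:10255/?ssl=true&retrywrites=false";

var client = new MongoClient(connectionString);
var database = client.GetDatabase("mydb"); // placeholder database name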

Logstash 6.2 - full persistent queue (wrong mapping?)

My queue is almost full and I see this errors in my log file:
[2018-05-16T00:01:33,334][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"2018.05.15-el-mg_papi-prod", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x608d85c1>], :response=>{"index"=>{"_index"=>"2018.05.15-el-mg_papi-prod", "_type"=>"doc", "_id"=>"mHvSZWMB8oeeM9BTo0V2", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [papi_request_json.query.disableFacets]", "caused_by"=>{"type"=>"i_o_exception", "reason"=>"Current token (VALUE_TRUE) not numeric, can not use numeric value accessors\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper#56b8442f; line: 1, column: 555]"}}}}}
[2018-05-16T00:01:37,145][INFO ][org.logstash.beats.BeatsHandler] [local: 0:0:0:0:0:0:0:1:5000, remote: 0:0:0:0:0:0:0:1:50222] Handling exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 69
[2018-05-16T00:01:37,147][INFO ][org.logstash.beats.BeatsHandler] [local: 0:0:0:0:0:0:0:1:5000, remote: 0:0:0:0:0:0:0:1:50222] Handling exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 84
...
[2018-05-16T15:28:09,981][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2018-05-16T15:28:09,982][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2018-05-16T15:28:09,982][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
If I understand the first warning correctly, the problem is with the mapping. I have a lot of files in my Logstash queue folder. My questions are:
How do I empty my queue? Can I just delete all the files from the Logstash queue folder (all logs will be lost) and then resend all the data to Logstash into the proper index?
How can I determine where exactly the mapping problem is, or which servers are sending data of the wrong type?
I have a pipeline on port 5000 named testing-pipeline just for checking from Nagios whether Logstash is active. What are those [INFO ][org.logstash.beats.BeatsHandler] logs?
If I understand correctly, [INFO ][logstash.outputs.elasticsearch] is just logging about retrying to process the Logstash queue?
All servers run Filebeat 6.2.2. Thank you for your help.
All pages in the queue could be deleted, but that is not the proper solution. In my case, the queue was full because there were events whose mapping differed from the index's. In Elasticsearch 6, you cannot send documents with a different mapping to the same index, so the logs stacked up in the queue (even if there is only one wrong event, none of the others will be processed). So how do you process all the data you can and skip the wrong events? The solution is to configure a DLQ (dead letter queue): every event that gets a 400 or 404 response code is moved to the DLQ so the others can be processed. The data from the DLQ can be processed later with a separate pipeline, as sketched below.
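A minimal sketch of that setup: enable the DLQ in logstash.yml, then read it back with the dead_letter_queue input plugin. The path and output index below are assumptions.

# logstash.yml
dead_letter_queue.enable: true

# Separate pipeline that re-processes dead-lettered events
input {
  dead_letter_queue {
    path           => "/var/lib/logstash/dead_letter_queue"  # defaults to path.data/dead_letter_queue
    pipeline_id    => "main"
    commit_offsets => true
  }
}
output {
  # Route the corrected events wherever appropriate, e.g. back to Elasticsearch
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "reprocessed-%{+YYYY.MM.dd}"
  }
}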
The wrong mapping can be spotted from the error log: "error"=>{"type"=>"mapper_parsing_exception", ..... }. To pin down the exact place where the mapping is wrong, you have to compare the mapping of the events with the mapping of the indices.
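For example, the stored mapping of the index that rejected the event can be inspected directly and compared with what the events actually contain (index name taken from the log lines above):

# Show the stored mapping for the rejecting index
GET 2018.05.15-el-mg_papi-prod/_mapping

# The error above suggests papi_request_json.query.disableFacets is mapped as a
# numeric type while the incoming event sent a boolean value (VALUE_TRUE).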
[INFO ][org.logstash.beats.BeatsHandler] was caused by the Nagios server. The check did not consist of a valid request, which is why the "Handling exception" messages appear. The check should only test whether the Logstash service is active. Now I check the Logstash service on localhost:9600; for more info see here.
[INFO ][logstash.outputs.elasticsearch] means that Logstash is trying to process the queue but the index is locked ([FORBIDDEN/12/index read-only / allow delete (api)]) because the indices were set to a read-only state. When there is not enough disk space on the server, Elasticsearch automatically switches indices to read-only. This can be changed via cluster.routing.allocation.disk.watermark.low; for more info see here.
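A hedged sketch of clearing that state once disk space has been freed; the watermark values shown are examples only.

# Remove the read-only block from all indices after freeing disk space
PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

# Optionally adjust the disk watermarks mentioned above
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}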

Elasticsearch Global timeout setting not reflected in search response's took parameter

We have set up a global timeout for an Elasticsearch (5.3.3) cluster in elasticsearch.yml:
search.default_search_timeout: 1nanos
But the responses we get for this, in most cases, have "timed_out": true. Sometimes, however, we do get the expected response from Elasticsearch with "timed_out": false, and in those cases we see "took" returning values > 30, meaning Elasticsearch spent around 30 ms, which is far more than 1 nanosecond. Ideally the query should have timed out at 1 nanosecond. Is this a bug?

errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or time out?"

I am trying to run an aggregation pipeline using Node.js and the MongoDB native driver on a sharded MongoDB cluster with 2 shards. The MongoDB version is 2.6.1. The operation runs for about 50 minutes and throws the error "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?". On googling I came across this link. It looks like the issue is not resolved yet. BTW, the size of the collection is about 140 million documents.
Is there a fix/workaround for this issue?
Here is the pipeline that I am trying to run. I don't know at what stage it breaks; it runs for about 50 minutes and then the error happens. The same happens with any aggregation pipeline that I try to run.
db.collection01.aggregate([
{$match:{"state_cd":"CA"}},
{$group : {"_id": "$pat_id" , count : {$sum : 1}}}
],
{out: "distinct_patid_count", allowDiskUse: true }
)
My guess is you could try to lower the batch size to make the cursor more "active".
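A sketch of what that could look like for the pipeline above in the mongo shell; the batch size of 100 is an arbitrary example, and the "out" option from the original call is left out here so the results actually stream back through the cursor. A similar cursor option can reportedly be passed in the Node.js native driver's aggregate() options as well.

// Same aggregation, but requesting small server-side batches so the client
// touches the cursor often while iterating the results.
db.collection01.aggregate(
    [
        { $match: { "state_cd": "CA" } },
        { $group: { "_id": "$pat_id", count: { $sum: 1 } } }
    ],
    { allowDiskUse: true, cursor: { batchSize: 100 } }
)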
I came across this error after our server had been running for more than 2.5 months. Mongo started dropping cursors even before the timeout (I guess some sort of memory issue); a restart of Mongo solved our problem.
