Documentation for bin/nutch elasticindex - search

There is a lot of documentation and many examples for the bin/nutch solrindex command, but the bin/nutch elasticindex command has very little coverage. I am struggling to integrate an instance of Nutch 2.2.1 with Elasticsearch 0.90.2. I've tried to use this plugin to disguise Elasticsearch as a Solr instance, but any bin/crawl jobs crash with an internal server error. What I am looking for is an example of bin/crawl modified to use Elasticsearch, or a detailed description of the bin/nutch elasticindex command (the Nutch wiki doesn't have a page for it). Can I simply replace every occurrence of solrindex with elasticindex?

I've modified bin/crawl to remove the bin/nutch solrdedup command, and replaced all mentions of solrindex with elasticindex.

I don't think it's possible to make Nutch 2.2.x work with Elasticsearch. But I don't see the added benefit of 2.2.x over 1.8 either. The only difference is that Nutch 2.2.x uses Gora to save the crawled pages in a database of your choice. Since you are using Elasticsearch to index the results, I assume you don't need the database.
I got Nutch 1.8 working with Elasticsearch 0.90.11, and you can find the bundle on my GitHub account:
https://github.com/andreivisan/NutchElasticsearch

Related

Monitor/Log slow running queries in Apache Cassandra 2.2.X

How can I monitor/log slow-running queries in Apache Cassandra 2.2.x without using any external monitoring tools? Is there a parameter that we can set in the YAML to log slow-running queries, or is there any other approach?
Also, in CASSANDRA-12403 I see they added the parameter "slow_query_log_timeout_in_ms: 500" for this purpose. Can we add this parameter to the cassandra.yaml file in Cassandra 2.2.x, or do we need to apply this patch to 2.2.x in order to make it work?
It's a feature in a newer version; you can upgrade, or apply the patch and run a custom build. In 2.2.x there's no built-in support for it.
It's a bit of a long shot, but you might be able to get https://github.com/smartcat-labs/cassandra-diagnostics with https://github.com/smartcat-labs/cassandra-diagnostics/blob/dev/cassandra-diagnostics-core/COREMODULES.md#slow-query-module to work. It only supports 2.1 and 3.0 though; I don't see 2.2 there.

Unable to find custom indexer class 'com.stratio.cassandra.lucene.Index'

I am using apache-cassandra-3.0.10. I have placed the Cassandra Lucene jar, version 3.0.10.3, in the Cassandra lib folder. When I try to create a Lucene index, I get the message Unable to find custom indexer class 'com.stratio.cassandra.lucene.Index'. According to the Lucene plugin documentation, the 3.0.10 jar is compatible with Cassandra 3.0.10, so why is this error occurring? Can anyone help me with this, please?
Put the Stratio Lucene jar into the lib folder of every Cassandra node and restart all the nodes.
The Cassandra Lucene 3.0.10 jar was downloaded from the Maven repository and it was broken. I built my own jar file from their GitHub repository and it worked fine.
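If you suspect the same broken-jar problem, one quick check is to open the jar as a zip archive and look for the class Cassandra complains about. The following is only a minimal sketch in Python: the function name jar_contains_class is made up for illustration, and it assumes the plugin class is stored in the jar under the usual com/stratio/cassandra/lucene/Index.class path.

```python
import zipfile

# Class name taken from the error message in the question.
INDEX_CLASS = "com.stratio.cassandra.lucene.Index"


def jar_contains_class(jar_path, class_name=INDEX_CLASS):
    """Return True if jar_path is a readable archive containing class_name.

    Hypothetical helper for illustration; it is not part of Cassandra
    or of the Stratio plugin.
    """
    # A class a.b.C is stored in a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    try:
        with zipfile.ZipFile(jar_path) as jar:
            # testzip() returns the name of the first corrupt member,
            # or None when every member passes its CRC check.
            if jar.testzip() is not None:
                return False
            return entry in jar.namelist()
    except (zipfile.BadZipFile, OSError):
        # Not a zip archive at all, truncated download, or unreadable file.
        return False
```

Calling jar_contains_class on the jar in each node's lib folder should return True; if it returns False, the download is probably damaged or is not the plugin jar, and rebuilding it from the GitHub repository as described above is worth trying.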

How to dump Nutch 2.3 data into WARC file?

I need to dump data from Nutch 2.3 into a WARC file. However, I couldn't find the necessary module. Nutch 1.x had this capability. I would like to know the proper way to do it.
As you said, the WARC exporter module has not yet been ported to the 2.x branch of Nutch. Nevertheless, porting the https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/tools/warc/WARCExporter.java module shouldn't be that hard. As a general rule, the 1.x branch of Nutch is still more widely used and better equipped than the 2.x branch (at least for now).

Cassandra (Datastax v3.5) using Stratio Lucene Index plugin - Windows

I'm trying to use the Stratio Lucene index plugin with a Windows installation of Cassandra (Datastax v3.5), but I can't get Cassandra to recognize it.
I'm aware that the plugin version must correspond to the Cassandra version, and I have tried 3.0.5 and 3.5, both with the same result. The service is stopped, the index .jar file is copied to the lib directory, and then the service is restarted. Using CQLSH, I can then create the relevant keyspace and table (as described in the Stratio documentation), but attempting to create the index fails with the following message:
Query invalid because of configuration issue: message="Unable to find custom indexer class 'com.stratio.cassandra.lucene.Index'"
https://github.com/Stratio/cassandra-lucene-index/tree/branch-3.5
Does anyone have any idea how to get this working?
Is there a central forum or a point of contact for Stratio Lucene index support?
This resource https://github.com/Stratio/cassandra-lucene-index/issues/118#issuecomment-211796434 suggests that only open source Apache Cassandra is officially supported by this plugin. It might work with DSE, or it might not. I checked that version 3.5.0 works on Linux with Apache Cassandra but does not work on Windows with DSE :( According to the Datastax docs, DSE should support custom secondary indexes. So it might be that the plugin does not run on Windows?

Not able to use webgraph command in Nutch 1.2

I am quite new to Nutch. I have crawled a site successfully using Nutch 1.2, and I am now working with the crawldb and segments under Cygwin. The problem is that when I use the webgraphdb command, it shows "Error: Could not find or load main class WebGraph". Please suggest what I need to do to use this command properly.
