Getting started with StormCrawler for document crawling - stormcrawler

I am having difficulty getting started with StormCrawler using the StormCrawler+ElasticSearch archetype. On the StormCrawler website, I see two versions, namely 1.x and 2.x. Similarly, Apache Storm comes in versions 1 and 2.
1. Should I install StormCrawler version 1.x or 2.x?
2. What version of the JDK does StormCrawler require? Is the Oracle JDK needed, or can OpenJDK be used as well?
3. I want to use StormCrawler to identify and process images and documents. Where in the topology are these tasks best added?
Update: According to another thread (Storm Crawler with Java 11), StormCrawler 2 is advised. Which StormCrawler+ElasticSearch archetype should be used with StormCrawler 2?

SC 1.x is the current, stable version. 2.x is less tested but will become the main version at some point.
The thread mentioned in the question does not advise you to use SC 2 as such; it says that you should use it if you need Java 11. If you are on Java 8, you can use whichever version you want.
SC works fine with OpenJDK.
As for question #3, it depends what you want to do. Can you please elaborate?
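Since the choice between 1.x and 2.x hinges on the Java version, it can help to check which JDK the crawler would actually run on. The following is only a minimal sketch using standard system properties; the class name JavaVersionCheck and the printed advice are illustrative and simply restate the rule of thumb above, they are not part of StormCrawler.

```java
class JavaVersionCheck {

    /**
     * Extracts the major version from a java.specification.version value.
     * Up to Java 8 the value looks like "1.8"; from Java 9 on it is "9", "11", ...
     */
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        if (parts[0].equals("1") && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return Integer.parseInt(parts[0]);
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        System.out.println("Java major version: " + major);
        // The vendor and VM name show whether this is an Oracle JDK or an OpenJDK build.
        System.out.println("Vendor: " + System.getProperty("java.vendor"));
        System.out.println("VM: " + System.getProperty("java.vm.name"));
        // Assumption: this only restates the advice given in the answer above.
        if (major >= 11) {
            System.out.println("Java 11 or later: SC 2.x is the version to use.");
        } else {
            System.out.println("Java 8: either SC 1.x or SC 2.x should work.");
        }
    }
}
```

Running it with the same java binary that launches the topology shows the version and vendor that matter for the crawler.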

Related

Monitor/Log slow running queries in Apache Cassandra 2.2.X

How can I monitor or log slow-running queries in Apache Cassandra 2.2.x without using any external monitoring tools? Is there a parameter that we can set in the YAML file to log slow-running queries, or is there another approach?
Also, in CASSANDRA-12403, I see that the parameter "slow_query_log_timeout_in_ms: 500" was added for this purpose. Can we add this parameter to the cassandra.yaml file in Cassandra 2.2.x, or do we need to apply the patch to 2.2.x to make it work?
It's a feature of a newer version; you can either upgrade or apply the patch and run a custom build. 2.2.x has no built-in support for it.
It's a bit of a long shot, but you might be able to get https://github.com/smartcat-labs/cassandra-diagnostics with https://github.com/smartcat-labs/cassandra-diagnostics/blob/dev/cassandra-diagnostics-core/COREMODULES.md#slow-query-module to work. However, it only supports 2.1 and 3.0; I don't see 2.2 listed there.

How to dump Nutch 2.3 data into WARC file?

I need to dump data from Nutch 2.3 into a WARC file. However, I couldn't find the necessary module. Nutch 1.x had this capability. I would like to know the proper way to do it.
As you said, the WARC exporter module has not yet been ported to the 2.x branch of Nutch. Nevertheless, porting the https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/tools/warc/WARCExporter.java module shouldn't be that hard. As a general rule, the 1.x branch of Nutch is still more widely used and better equipped than the 2.x branch (at least for now).

Solr upgrade from 4.7 to 5.3

I need to upgrade my Solr search from version 4.7 to 5.3.1.
I am working on a Linux platform.
Can you please provide the steps that I need to follow?
Thank you!
I do not think there is a definitive step-by-step guide for the upgrade you are looking for.
I have a Solr 4.3.x instance running in a production environment and I am contemplating the leap to 5.x. However, it's clear that a lot has changed and that my upgrade is not going to be straightforward.
Other priorities in my project have also kept me from doing the upgrade.
So the rest of this answer is more a thought process than actual upgrade experience.
When I last researched this, I found the links below useful:
https://support.lucidworks.com/hc/en-us/articles/203776523-How-to-upgrade-between-major-Solr-Versions
https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5
From the Major Changes link you will notice that a lot has changed.
Most notably, the index format has changed, deprecated APIs have been removed from SolrJ, and Solr is now deployed as a standalone server instead of a WAR file.
So I would suggest that you ask yourself the following questions.
Is it possible to recreate the index from scratch? How long does it take to create your complete index? If your index can be recreated quickly, I would suggest that you do that with the 5.x engine on a separate machine while your production environment is still served by your existing server. Then complete the upgrade from 4.x to 5.x by simply pointing your production instance to the new Solr engine. This approach gives you a clean slate and a brand new index (but with the existing data).
If you have a very large index (e.g. it takes several days to recreate it from scratch), then you may want to upgrade the live index. In that case I suggest that you consider the following.
The Solr upgrade guide mentions 4.10 as a 4.x version (so I assume it is easy to upgrade from any 4.x to 4.10) that has some features built in to help with the move to 5.x. So first upgrade to 4.10 and ensure that your index continues to work properly. Then use the guides mentioned above to upgrade to 5.x.

solr 1.4.1 solrj client with solr 3.6.2 server?

I've been trying to work through an issue that developed while attempting to upgrade our testing environment from Ubuntu 12.04 to 14.04 on AWS. Prior to this, the package repository version of Solr was 1.4.1, which matched the 1.4.1 SolrJ client integrated with our application.
Changing the base AMI to the 14.04 latest and running our default deploy caused solr 3.6.2 server to be installed. It appears it was accepting our configs without issue, however when our client tried to connect we received different errors:
The first was an unknown custom field, which we traced back to our deployment scripts not moving our schema.xml and solrconfig.xml to /etc/solr/conf/ but keeping it in the base directory.
We corrected this issue, and then ran into the following:
'exception: Invalid version or the data in not in 'javabin' format'
This was generated by a wrapper on top of SolrJ, but I'll be honest and say that I know nothing about Solr and that this may be on our end. I've asked our dev team to look at two options:
1) enabling: 'server.setParser(new XMLResponseParser());'
Which is the recommendation on the backwards compatibility for an older client.
2) updating our client in the application to 3.6.2
I know less about the requirements for this.
My fallback is to revert to 1.4.1, but it appears that version hasn't been touched since 2011, which makes me hesitant.
Any thoughts / suggestions would be appreciated!
Thanks!
I think the best option is to keep Solr and SolrJ on the same version.
I used Solr 1.4.1 for a long time and, while most of it works with newer versions without any problem, as you said, a lot of things have actually changed since 1.4.*.
I did the same port last year (from 1.4.1 to 3.6.1), and I can confirm that the second option is the right one: the changes you must make in your client code are just "formal" and very quick.
Any workaround that lets SolrJ and Solr communicate across different versions is just that, a workaround, and it could lead to unexpected (hidden) side effects later.

How can I use an updated version of JavaMail in XPages?

I have an XPages application where I use JavaMail in one of my managed beans. Currently I have added the jar file C:\Programme\IBM\Notes\framework\shared\eclipse\plugins\com.ibm.designer.lib.javamail_9.0.0.20130301-1431\lib\mail.jar to the build path of the managed bean. This works well. But now I want to use a newer version of JavaMail, as the Domino server uses version 1.3 but I need version 1.4.x.
I have downloaded the new JavaMail jar-files from Oracle. In Domino Designer (version 9) I add this jar-file to the new design element "Code / Jars" and remove the old jar-files from the build path.
My managed bean still compiles and runs as desired, but when I check which version the bean is using, it still reports version 1.3. To check the version number I use the debug property of JavaMail, which reports version 1.3 to the Domino server console.
Is there a way to tell the Domino server to use the jar files in the application (i.e. the NSF) and not its own? Is there another approach to updating the JavaMail version?
The reason I want to use a newer version of JavaMail is as follows: I want to read mails from an IMAP server over SSL. To avoid the problem of importing SSL certificates, I simply want to trust all hosts. This can be done via MailSSLSocketFactory, but this is only available since version 1.4.2. Therefore I want to use a newer version of JavaMail.
Another reason I want to use a newer version is as follows: the method "getSortedMessages" of "IMAPFolder" is only available since version 1.4.4. (and so are some other features of JavaMail).
This may be a little too late for you... I think the right approach may be to include the jar file as an OSGi plugin.
I have spent some time to figure out how to do that - and recently succeeded :-) I have described the steps to perform to make this work in two articles. The first is about wrapping a JAR into a plug-in: http://www.dalsgaard-data.eu/blog/wrap-an-existing-jar-file-into-a-plug-in/ - the second is about deployment (and there is a link in the first one).
/John
You can solve the problem by creating an OSGi plug-in that supersedes the one that ships the JavaMail library: com.ibm.designer.lib.javamail.
To do that, proceed as follows:
1. Create an OSGi plug-in whose id is com.ibm.designer.lib.javamail (see Dalsgaard's tutorial on how to do it).
2. Set its version to a higher number than the one the Domino server ships with (to find the shipped version, type tell http osgi ss com.ibm.designer.lib.javamail on the server console). As of now, using 9.0.1.qualifier should be fine.
3. Deploy the plug-in either through an update site or by copying it directly into the domino\workspace\applications\eclipse\plugins folder.
4. Restart the HTTP service. The higher version, the one you created, will now be used.
I had the same problem and found a solution. Be warned, this is not the best answer, but it will work. Simply download the latest JavaMail jar here and rename the jar file to 'mail.jar'. Then replace the current file IBM\Notes\framework\shared\eclipse\plugins\com.ibm.designer.lib.javamail_9.0.0.20130301-1431\lib\mail.jar with this file. Quit the HTTP task and restart it. The code will now work with the latest version.
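Whichever approach is used, it may be worth confirming which copy of JavaMail the server actually loaded. Below is a minimal sketch that relies only on standard Java reflection; the class name ClassOrigin and the method describe are made up for this illustration. It assumes the jar manifest declares an Implementation-Version, and that the server's Java security policy allows the getProtectionDomain call, which may not be the case inside XPages.

```java
import java.security.CodeSource;

class ClassOrigin {

    /** Describes where a class was loaded from and which version its package declares. */
    static String describe(String className) {
        final Class<?> cls;
        try {
            cls = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return className + ": not found on the classpath";
        }
        // Classes loaded by the bootstrap loader have no code source.
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(bootstrap or unknown)"
                : source.getLocation().toString();
        // The version comes from the jar manifest and may be missing.
        Package pkg = cls.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return className + ": loaded from " + location
                + ", implementation version "
                + (version == null ? "(not declared)" : version);
    }

    public static void main(String[] args) {
        // javax.mail.Session is the JavaMail entry point; it is looked up by name
        // so that this sketch compiles without JavaMail on the classpath.
        System.out.println(describe("javax.mail.Session"));
    }
}
```

Calling ClassOrigin.describe("javax.mail.Session") from the managed bean and logging the result shows whether the class comes from the jar in the NSF, from a custom plug-in, or from the server's own com.ibm.designer.lib.javamail plug-in.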
