Injecting URLs taking too long - Nutch

I have configured HBase 0.94.14 and Nutch 2.3 through this tutorial and made a seed directory which contains a text file with the URLs. When I try to inject these URLs using this command:
$NUTCH_ROOT/runtime/local/bin/nutch inject /seed
I get the following output:
InjectorJob: starting at 2015-07-23 14:00:24
InjectorJob: Injecting urlDir: /seed
and it stays in this state forever.
Can anybody help me with this problem?

Your Nutch version is 2.3. You should not run the command from $NUTCH_ROOT/runtime/local/bin/nutch; run it from $NUTCH_ROOT/runtime/deploy/bin/nutch instead.
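For example (assuming the same seed directory as in the question):
$NUTCH_ROOT/runtime/deploy/bin/nutch inject /seed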
Hope this helps,
Le Quoc Do

Related

Spark jar package dependency file

I want to do some IP-to-location computation on Spark. After exploring the net, I found IPLocator https://github.com/miraclesu/IPLocator;
the IP-to-location lookup needs a file which contains the mapping information.
After packaging the jar, I can run it with plain local Java; the package just runs with IPLocator.jar and qqwry.dat in the same directory.
But I want to use this jar with Spark. I tried to use --jars IPLocator.jar qqwry.dat when starting spark-shell, but when launching, the functions still cannot find the file.
The file-reading code is like:
QQWryFile.class.getClassLoader().getResource("qqwry.dat")
I also tried to package the qqwry.dat file into the jar, and it did not work.
You need to use --files and then SparkFiles.get inside your program.
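A minimal sketch of that approach, assuming the file is still named qqwry.dat and is shipped with --files (the variable name is only illustrative):
spark-shell --jars IPLocator.jar --files qqwry.dat
// inside the shell / program:
import org.apache.spark.SparkFiles
val qqwryPath = SparkFiles.get("qqwry.dat")  // absolute local path of the distributed copy
// hand qqwryPath to the IPLocator API instead of loading the file from the classpath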
Try using a comma delimiter and check whether IPLocator.jar and qqwry.dat are distributed to the Spark staging folder (.sparkStaging/application_xxx).
--jars IPLocator.jar,qqwry.dat

Error while running Zeppelin paragraphs in Spark on Linux cluster in Azure HDInsight

I have been following this tutorial in order to set up Zeppelin on a Spark cluster (version 1.5.2) in HDInsight, on Linux. Everything worked fine, I have managed to successfully connect to the Zeppelin notebook through the SSH tunnel. However, when I try to run any kind of paragraph, the first time I get the following error:
java.io.IOException: No FileSystem for scheme: wasb
After getting this error, if I try to rerun the paragraph, I get another error:
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
These errors occur regardless of the code I enter, even if there is no reference to HDFS. What I'm saying is that I get the "No FileSystem" error even for a trivial Scala expression, such as parallelize.
Is there a missing configuration step?
I am downloading the tarball from the script you pointed to as I type, but what I am guessing is that your Zeppelin install and Spark install are not set up to work with wasb. In order to get Spark to work with wasb you need to add some jars to the classpath. To do this you need to add something like this to your spark-defaults.conf (the paths might be different in HDInsight; this is from HDP on IaaS):
spark.driver.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
spark.executor.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
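Once those jars are on the classpath, a quick sanity check from spark-shell might look like this (the wasb path is only an example; point it at a file that actually exists in your storage account):
sc.textFile("wasb:///example/data/sample.log").count()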
Once you have Spark working with wasb, the next step is to get those same jars onto the Zeppelin classpath. A good way to test your setup is to make a notebook that prints your environment variables and classpath:
sys.env.foreach(println(_))
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
Also, looking at the install script, it is trying to pull the Zeppelin jar from wasb; you might want to change that config to point somewhere else while you try some of these changes out (zeppelin.sh):
export SPARK_YARN_JAR=wasb:///apps/zeppelin/zeppelin-spark-0.5.5-SNAPSHOT.jar
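For example, pointing it at a local copy instead (the path below is hypothetical; use wherever the jar actually lives on your node):
export SPARK_YARN_JAR=file:///opt/zeppelin/zeppelin-spark-0.5.5-SNAPSHOT.jar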
I hope this helps. If you still have problems I have some other ideas, but I would start with these first.

Cassandra dead but pid file exists

I am new to Cassandra and tried my hand at installing cassandra-2.1.2 on CentOS 7.0.
After completing the installation I executed the cqlsh command and created a few keyspaces and column families,
which at first glance seemed to be working perfectly.
But later on I realized the issues below:
1- When I execute the "service cassandra status" command, I get the error below:
Output: Cassandra dead but pid file exists.
I googled the above issue and found some links
http://www.datastax.com/support-forums/topic/dse-dead-but-pid-file-exists
https://baioradba.wordpress.com/2014/06/13/how-to-install-cassandra-on-centos-6-5/
and found that I had the same configuration mentioned in those links, but the error still persists.
Please tell me the root cause and how to resolve it.
2- The second issue is in the cassandra.log file.
When I analysed the cassandra.log file there was an exception:
Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting.
Below is the complete log:
12:01:40.816 [main] ERROR o.a.c.config.DatabaseDescriptor - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting.
at org.apache.cassandra.config.YamlConfigurationLoader.getStorageConfigURL(YamlConfigurationLoader.java:73) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:84) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(DatabaseDescriptor.java:158) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:133) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:110) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:465) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:554) [apache-cassandra-2.1.3.jar:2.1.3]
Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting.
Fatal configuration error; unable to start. See log for stacktrace.
I again searched for the same issue on Google, but the links were not that useful as they contained the Java class code for cassandra.config.
Again, please tell me the root cause and how to resolve it.
Thanks in advance.
rm /var/run/cassandra.pid
Run ps -ef | grep cassandra to find the running process.
Kill the pid of the Cassandra process.
Start Cassandra again.
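Putting those steps together, a rough shell sequence might look like this (the pid file path and service name follow the question's setup; <pid> is whatever the grep reports):
sudo rm /var/run/cassandra.pid       # remove the stale pid file
ps -ef | grep cassandra              # find the running Cassandra process, if any
sudo kill <pid>                      # kill it using the pid from the previous command
sudo service cassandra start         # start Cassandra again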
To fix this issue, edit cassandra-env.sh:
sudo vi /etc/cassandra/conf/cassandra-env.sh
and increase the heap size for Cassandra; this should resolve your issue.
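For instance, the heap can be pinned explicitly in cassandra-env.sh (the values below are only illustrative; size them to your machine):
MAX_HEAP_SIZE="1G"
HEAP_NEWSIZE="256M"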
Check whether you have enough memory to start the Cassandra service with this command:
cat /proc/meminfo
I was running Hortonworks VM with Virtualbox, and I had a lot of Hadoop components started which needed a lot of memory, so for me the solution was to stop unnecessary Hadoop components and add some extra memory to the virtual machine.
From https://github.com/apache/cassandra/blob/cassandra-2.1/examples/client_only/README.txt#L43-L49 :
cassandra.yaml can be on the classpath as is done here, can be specified (by modifying the script) in a location within the classpath like this:
java -Xmx1G -Dcassandra.config=/path/in/classpath/to/cassandra.yaml ...
or can be retrieved from a location outside the classpath like this:
-Dcassandra.config=file:///path/to/cassandra.yaml ...
or
-Dcassandra.config=http://awesomesauce.com/cassandra.yaml ...
So you probably had a misconfigured startup option.
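If you do need to set it explicitly, something like the following line in cassandra-env.sh should satisfy the URI requirement (the yaml path is just an example; use your actual config location):
JVM_OPTS="$JVM_OPTS -Dcassandra.config=file:///etc/cassandra/conf/cassandra.yaml"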
Remove the pid file. Try
rm /var/run/cassandra.pid

Nutch 2.1 cannot setup in Mac

Trying to set up the new Nutch 2.1 in a local environment, with a fresh download followed by "ant build", following the document from the wiki http://wiki.apache.org/nutch/Nutch2Tutorial; however, it seems I have no luck.
I got the following errors:
java[1815:1903] Unable to load realm info from SCDynamicStore
InjectorJob: org.apache.gora.util.GoraException: org.apache.hadoop.hbase.MasterNotRunningException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:228)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:248)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:258)
Caused by: org.apache.hadoop.hbase.MasterNotRunningException
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:394)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Your help is highly appreciated. Thanks.
Caused by: org.apache.hadoop.hbase.MasterNotRunningException
This indicates that the cluster setup is not done correctly. The Nutch tutorial page mentions this:
Install and configure HBase. You can get it here (N.B. Gora 0.2 uses
HBase 0.90.4, however the setup is known to work with more recent
versions of the HBase 0.90.x branch)
Have you performed this step correctly?
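As a basic check, make sure HBase is actually running and the master is up before re-running the inject job (the commands assume a standalone HBase install; paths depend on where you unpacked it):
$HBASE_HOME/bin/start-hbase.sh
jps    # should list an HMaster process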

Problem creating RichFaces page theme

I'm trying to create a RichFaces page theme using the instructions here. I know NOTHING about Maven, so I've followed the instructions as best I can, but I've run into an error and don't know what I'm doing wrong. I followed the instructions on the page, and then ran this command:
mvn archetype:create -DarchetypeGroupId=org.richfaces.cdk -DarchetypeArtifactId=maven-archetype-theme -DarchetypeVersion=3.3.3.Final -DartifactId=test -DgroupId=org.richfaces.docs -Dversion=1.0
However, when I run the command I get the following error message:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-archetype-plugin:2.0:create (default-cli) on project standalone-pom: Error creating from archetype: org.apache.maven.archetype.downloader.DownloadException: Error downloading org.richfaces.cdk:maven-archetype-theme:jar:3.3.3. Could not transfer artifact org.richfaces.cdk:maven-archetype-theme:jar:3.3.3 from/to repository.jboss.com (http://repository.jboss.com/maven2/): Access denied to: http://repository.jboss.com/maven2/org/richfaces/cdk/maven-archetype-theme/3.3.3/maven-archetype-theme-3.3.3.jar
[ERROR] org.richfaces.cdk:maven-archetype-theme:jar:3.3.3
I tried browsing to http://repository.jboss.com/maven2/, but I get an "Access Denied" error, just as stated in the error message. My question is, how do I rectify this? Is there a different URL that I should be using? If so, do I edit the Maven settings.xml file and use the new URL? I'd REALLY appreciate anyone that can give me some direction on this.
The link for the JBoss Maven repository specified in jboss_profile.txt seems to be outdated.
You can try replacing the <url> of every JBoss Repository for Maven entry with https://repository.jboss.org/nexus/content/groups/public-jboss/ in your Maven settings.xml.
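A minimal sketch of what such an entry might look like in settings.xml (the id and name are placeholders; only the <url> matters here):
<repository>
  <id>jboss-public-repository-group</id>
  <name>JBoss Public Repository Group</name>
  <url>https://repository.jboss.org/nexus/content/groups/public-jboss/</url>
</repository>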
