Using Nutch to crawl authenticated sites - nutch

We need to crawl data from a URL that is protected by username/password authentication.
1) We have configured httpclient-auth.xml with the following credentials:
<credentials username="xxxx" password="xxxxxx">
<default/>
</credentials>
2) We have configured nutch-site.xml with the following properties:
<property>
<name>http.agent.name</name>
<value>Nutch Crawl</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
</property>
When we try to fetch the data, we only get the one URL that is present in the seed.txt file. We don't see any errors, but still only that single page is crawled.
What are we missing here?
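For comparison, the sample httpclient-auth.xml shipped with the protocol-httpclient plugin wraps the credentials in an <auth-configuration> root element and usually scopes them to the protected host; a minimal sketch, where the host, port and realm values are placeholders rather than anything taken from the question:
<auth-configuration>
  <credentials username="xxxx" password="xxxxxx">
    <authscope host="example.com" port="80" realm="login"/>
    <default/>
  </credentials>
</auth-configuration>
It may be worth confirming that your file follows the same structure.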

Related

Why I can delete others' files on HDFS with only read permission

I am not sure what is going on with our HDFS configuration, but I can delete others' files even though the file permissions look fine and I only have read (r) access. What is the possible problem here?
See the Permissions Guide in the HDFS documentation.
Check whether permission checking is enabled by looking at the dfs.permissions.enabled parameter in hdfs-site.xml, which determines whether HDFS enforces permissions at all:
<property>
<name>dfs.permissions.enabled</name>
<value>true</value>
</property>
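To double-check the value the NameNode actually resolves (including defaults), you can query it with hdfs getconf; a small sketch, run on the NameNode host:
# prints the effective value of the permission-checking switch
hdfs getconf -confKey dfs.permissions.enabled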

How to retain completed applications after yarn server restart in spark web-ui

I am using the YARN resource manager for Spark. After a restart of the YARN server, all completed jobs disappeared from the Spark web UI.
The two properties below are set in yarn-site.xml. Can someone explain what the reason could be, and whether there is a property to control this?
<property>
<name>yarn.log-aggregation-enable</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>86400</value>
</property>
Thanks.
You can persist application history on restarts if you set yarn.resourcemanager.recovery.enabled to true in your yarn-site.xml and set yarn.resourcemanager.store.class.
See ResourceManager Restart for further details.
Your other entries refer to logging and define how long you want completed logs to stay before they get cleaned out. You can read more about them in yarn-default.xml.
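A minimal yarn-site.xml sketch of that recovery setup; the FileSystemRMStateStore class and the hdfs:///rmstore path below are illustrative choices, not something taken from the question (a ZooKeeper-backed store is the other common option):
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
<name>yarn.resourcemanager.fs.state-store.uri</name>
<value>hdfs:///rmstore</value>
</property>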

Setting up Nutch with Solr on Centos

I'm attempting to set up Apache Nutch and Apache Solr so our site can have internal site search. I have followed so many guides, and while they are very useful, they don't cover what to do when an error occurs, and most seem outdated at this point.
I'm using JDK 131, Nutch 2.3.1, and Solr 6.5.1.
This is the sequence of my actions as a non-root user:
sudo wget [java url] to /opt
sudo tar xvf java.tar.gz
export JAVA_HOME=/opt/java/
export JAVA_JRE=/opt/java/jre
export PATH=$PATH:/opt/java/bin:/opt/java/jre/bin
cd solr6.5.1/
sudo start runtime -e cloud -noprompt
sudo wget [solr url] to /root
sudo tar xvf solr.tar.gz
sudo wget [nutch url] to /opt
sudo tar xvf nutch.tar.gz
cd /opt/apache-nutch-2.3.1
sudo vi nutch-site.xml
add:
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
<description> At the very least, I needed to add the parse-html, urlfilter-regex, and the indexer-solr.
</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.</description>
</property>
</configuration>
cd /opt/apache-nutch-2.3.1
mkdir urls
cd urls
sudo vi seed.txt
add [our site url]
[ESC]
:w
:q
cd ../conf
sudo vi regex-urlfilter.xml
add:
+^http://([a-zA-Z0-9]*\.)*[domain of our site].com/
[ESC]
:w
:q
cd ..
sudo ant runtime
sudo -E runtime/local/bin/nutch inject urls -crawlId 3
Then I get this:
InjectorJob: Injecting urlDir: urls
InjectorJob: java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:93)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:77)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
My questions are: why am I getting this error, and how do I resolve it? I saw in a lot of places that I should modify schema.xml in the Solr directory, but there is no schema.xml file anywhere in the Solr directory.
As you're using the SQL store as the Nutch back-end, did you edit ivy/ivy.xml and uncomment this line?
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
If not, uncomment this line and clean & build again. If it's still not working, let me know your complete approach or the tutorial you followed.
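A quick sketch of that rebuild, assuming the same install path used above:
cd /opt/apache-nutch-2.3.1
sudo ant clean
sudo ant runtime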
Edit
Since you said you are using HBase as the store, your nutch-site.xml property should be this:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
Please follow the link you mentioned carefully.
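In that case the HBase Gora module usually has to be enabled as well; a sketch of the typical companion edits (the rev shown is only an example and depends on your Nutch release):
In ivy/ivy.xml, uncomment:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
In conf/gora.properties, set:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Then run ant runtime again so the change is picked up.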

Unknown host error when listing with hadoop fs -ls /

I am new to Hadoop and am trying to install it on a multi-node cluster running Ubuntu 14.04 Server in VMs. All goes well until I try to list the files within HDFS using hadoop fs -ls /.
I keep getting an error:
ls: unknown host: Hadoop-Master.
Initially I thought I had made a mistake in assigning the hostname, but I cross-checked /etc/hosts and /etc/hostname, and the hostname is listed correctly as Hadoop-Master. I also tried removing the hostname altogether, leaving only the IP address.
Another post here suggested adding two lines to .bashrc:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
I tried doing that but still getting the same error.
Please find the relevant steps below, along with edits based on the information requested.
Check IP address of the master with ifconfig
Add the host name to /etc/hosts and edit /etc/hostname accordingly.
Add the relevant details to masters and slaves.
.bashrc File
export HADOOP_INSTALL=/usr/local/hadoop
export PIG_HOME=/usr/local/pig
export HIVE_HOME=/usr/local/Hive
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Java path
export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs:Hadoop-Master:9001</value>
</property>
</configuration>
hadoop-env.sh
export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
Edited mapred-site.xml to include the hostname and changed the value to the number of nodes present.
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>2</value>
</property>
</configuration>
Edited hdfs-site.xml, changing the value to the number of data nodes present.
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>
</configuration>
whoami
simplilearn
/etc/hosts
localhost 127.0.0.1
Hadoop-Master 192.168.207.132
Hadoop-Slave 192.168.207.140
/etc/hostname
Hadoop-Master
Changes to be made:
1. /etc/hosts file:
Change Hadoop-Master to HadoopMaster
2. /etc/hostname file:
Change Hadoop-Master to HadoopMaster
3. core-site.xml:
Change this
hdfs:Hadoop-Master:9001
to this
hdfs://HadoopMaster:9001
NOTE: Change Hadoop-Master to HadoopMaster on every node that points to your IP. Change the slaves and masters files too.
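Putting that together, a sketch of how the corrected files could look, reusing the IP addresses from the question (the slave entry is renamed the same way):
/etc/hosts (on every node):
127.0.0.1        localhost
192.168.207.132  HadoopMaster
192.168.207.140  HadoopSlave
core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9001</value>
</property>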

Getting "mount.nfs: mount system call failed" on mounting HDFS

I am trying to mount the HDFS file system as per the instructions at http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
But at the final mount step, I get mount.nfs: mount system call failed.
I got that output when executing the command below:
mount -t nfs -o vers=3,proto=tcp,nolock,noacl <HDFS server name>:/ <existing local directory>
I am running Hadoop in pseudo-distributed mode.
If, for example, you use root to mount NFS on the client, you need to add the following configuration to core-site.xml under $HADOOP_HOME/etc/hadoop/:
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
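After that change, HDFS and the NFS gateway need to be restarted before retrying the mount; a rough sketch, assuming a standard sbin layout and reusing the mount command from the question:
# restart HDFS so the new proxyuser settings take effect
$HADOOP_HOME/sbin/stop-dfs.sh && $HADOOP_HOME/sbin/start-dfs.sh
# start the gateway daemons described in the HdfsNfsGateway guide
# (each runs in the foreground; use separate terminals or the hadoop-daemon.sh equivalents)
hdfs portmap    # run as root, or use the system rpcbind as the guide describes
hdfs nfs3       # no root needed; start as the proxy user from the guide
# then retry the mount from the question
mount -t nfs -o vers=3,proto=tcp,nolock,noacl <HDFS server name>:/ <existing local directory>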
