Setting up Nutch with Solr on CentOS

I'm attempting to set up Apache Nutch and Apache Solr so our site can have internal site search. I have followed many guides, and while they are very useful, they don't cover what to do when an error occurs, and most seem outdated at this point.
I'm using JDK 131, Nutch 2.3.1, and Solr 6.5.1
This is the sequence of my actions as a non-root user:
sudo wget [java url] to /opt
sudo tar xvf java.tar.gz
export JAVA_HOME=/opt/java/
export JAVA_JRE=/opt/java/jre
export PATH=$PATH:/opt/java/bin:/opt/java/jre/bin
sudo wget [solr url] to /root
sudo tar xvf solr.tar.gz
cd solr-6.5.1/
sudo bin/solr start -e cloud -noprompt
sudo wget [nutch url] to /opt
sudo tar xvf nutch.tar.gz
cd /opt/apache-nutch-2.3.1
sudo vi conf/nutch-site.xml
add:
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
<description> At the very least, I needed to add the parse-html, urlfilter-regex, and the indexer-solr.
</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.</description>
</property>
</configuration>
cd /opt/apache-nutch-2.3.1
mkdir urls
cd urls
sudo vi seed.txt
add [our site url]
[ESC]
:w
:q
cd ../conf
sudo vi regex-urlfilter.txt
add:
+^http://([a-zA-Z0-9]*\.)*[domain of our site].com/
[ESC]
:w
:q
cd ..
sudo ant runtime
sudo -E runtime/local/bin/nutch inject urls -crawlId 3
Then I get this:
InjectorJob: Injecting urlDir: urls
InjectorJob: java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:93)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:77)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
My questions are: why am I getting this error, and how do I resolve it? I saw in a lot of places that I should modify schema.xml in the Solr directory, but there is no schema.xml file anywhere in the Solr directory.

As you're using the SQL store (gora-sql) as your Nutch back-end, did you edit ivy/ivy.xml and uncomment this line?
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
If not, uncomment this line and clean & build again. If it's still not working, let me know your complete approach or the tutorial you followed.
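For reference, a minimal sketch of that edit-and-rebuild cycle, assuming the layout from the question:
cd /opt/apache-nutch-2.3.1
sudo vi ivy/ivy.xml    # uncomment the gora-sql dependency line shown above
sudo ant clean
sudo ant runtime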
Edit
Since you said you are using HBase as the store, your nutch-site.xml property should be this:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
Please follow the link you mentioned carefully.
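For completeness, a rough sketch of the usual Nutch 2.x HBase wiring (the dependency revision below is an assumption and may differ in your ivy.xml):
# in ivy/ivy.xml, uncomment the gora-hbase dependency instead of gora-sql:
#   <dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
# in conf/gora.properties, make HBase the default datastore:
#   gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
cd /opt/apache-nutch-2.3.1
sudo ant clean runtime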

Related

Elasticsearch: Change permissions of old folder index to work with yum-installed elasticsearch

I used a specific library that used an embedded version of elasticsearch. Now as we are growing, I want to start elasticsearch as a service.
I followed this guide to install it using yum on a linux machine. I pointed ES to the new directory using
path:
  logs: /home/ec2-user/.searchindex/logs
  data: /home/ec2-user/.searchindex/data
When I start the service
sudo service elasticsearch start
I get a permission denied error:
java.io.FileNotFoundException: /home/ec2-user/.searchindex/logs/elasticsearch_index_search_slowlog.log (Permission denied)
at java.io.FileOutputStream.open0(Native Method)
....
I guess this has to do with folder permissions, so I changed the folder permissions using:
sudo chown elasticsearch:elasticsearch -R .searchindex
But that didn't help.
Any help?
Your elasticsearch user can't write to the logging folder: /home/ec2-user/.searchindex/logs
Check the permissions with ls -l
Set write permissions with the chmod command, e.g.: sudo chmod -R u+wx .searchindex
The issue occurred because .searchindex is located in the ec2-user home directory, which is inaccessible to the elasticsearch user created to manage the elasticsearch service.
Moving the folder to /var/lib/elasticsearch did the trick.
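A rough sketch of that move (the service name and target path assume a standard yum install; adjust to your setup):
sudo service elasticsearch stop
sudo mv /home/ec2-user/.searchindex /var/lib/elasticsearch/searchindex
sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch/searchindex
# then point path.logs and path.data in /etc/elasticsearch/elasticsearch.yml at the new location
sudo service elasticsearch start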

YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register

I am new to the Hadoop ecosystem.
I recently tried Hadoop (2.7.1) on a single-node cluster without any problems and decided to move on to a multi-node cluster with 1 namenode and 2 datanodes.
However, I am facing a weird issue. Whatever jobs I try to run get stuck with the following message:
on the web interface:
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register
and in the cli:
16/01/05 17:52:53 INFO mapreduce.Job: Running job: job_1451083949804_0001
They don't even start, and at this point I am not sure what changes I need to make to get them working.
Here's what I have tried in order to resolve it:
disabling firewall on all nodes
setting lower resource limits
configuring under different machines, routers and distros
I would really appreciate any help (even a minute hint) in the correct direction.
I have followed these instructions (configuration):
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
How To Setup Multi Node Hadoop 2 (YARN) Cluster
I finally got this solved. Posting detailed steps for future reference. (only for test environment)
Hadoop (2.7.1) Multi-Node cluster configuration
Make sure that you have a reliable network without host isolation. Static IP assignment is preferable, or at least have an extremely long DHCP lease. Additionally, all nodes (Namenode/master & Datanodes/slaves) should have a common user account with the same password; if they don't, create such a user account on all nodes. Having the same username and password on all nodes makes things a bit less complicated.
[on all machines] First configure all nodes for single-node cluster. You can use my script that I have posted over here.
execute these commands in a new terminal
[on all machines] ↴
stop-dfs.sh;stop-yarn.sh;jps
rm -rf /tmp/hadoop-$USER
[on Namenode/master only] ↴
rm -rf ~/hadoop_store/hdfs/datanode
[on Datanodes/slaves only] ↴
rm -rf ~/hadoop_store/hdfs/namenode
[on all machines] Add IP addresses and corresponding Host names for all nodes in the cluster.
sudo nano /etc/hosts
hosts
xxx.xxx.xxx.xxx master
xxx.xxx.xxx.xxy slave1
xxx.xxx.xxx.xxz slave2
# Additionally you may need to remove lines like "xxx.xxx.xxx.xxx localhost", "xxx.xxx.xxx.xxy localhost", "xxx.xxx.xxx.xxz localhost" etc if they exist.
# However it's okay to keep lines like "127.0.0.1 localhost" and others.
[on all machines] Configure iptables
Allow default or custom ports that you plan to use for various Hadoop daemons through the firewall
OR
much easier, disable iptables
on RedHat like distros (Fedora, CentOS)
sudo systemctl disable firewalld
sudo systemctl stop firewalld
on Debian like distros (Ubuntu)
sudo ufw disable
[on Namenode/master only] Gain ssh access from the Namenode (master) to all Datanodes (slaves).
ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@slave2
confirm things by running ping slave1, ssh slave1, ping slave2, ssh slave2 etc. You should have a proper response. (Remember to exit each of your ssh sessions by typing exit or closing the terminal. To be on the safer side I also made sure that all nodes were able to access each other and not just the Namenode/master.)
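(If ~/.ssh/id_rsa.pub does not exist on the Namenode/master yet, generate a passwordless key first; the ssh-copy-id commands above assume one is present.)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa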
[on all machines] edit core-site.xml file
nano /usr/local/hadoop/etc/hadoop/core-site.xml
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
<description>NameNode URI</description>
</property>
</configuration>
[on all machines] edit yarn-site.xml file
nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
[on all machines] modify the slaves file: remove the text "localhost" and add the slave hostnames
nano /usr/local/hadoop/etc/hadoop/slaves
slaves
slave1
slave2
(I guess having this only on the Namenode/master would also work, but I did it on all machines anyway. Also note that in this configuration the master acts only as the resource manager; this is how I intend it to be.)
[on all machines] modify the hdfs-site.xml file to change the value of the dfs.replication property to something > 1 (at least the number of slaves in the cluster; here I have two slaves, so I would set it to 2)
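For example, a sketch of the relevant block with two slaves (only the replication value needs to change):
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>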
[on Namenode/master only] (re)format the HDFS through namenode
hdfs namenode -format
[optional]
remove the dfs.datanode.data.dir property from the master's hdfs-site.xml file.
remove the dfs.namenode.name.dir property from all the slaves' hdfs-site.xml files.
TESTING (execute only on Namenode/master)
start-dfs.sh;start-yarn.sh
echo "hello world hello Hello" > ~/Downloads/test.txt
hadoop fs -mkdir /input
hadoop fs -put ~/Downloads/test.txt /input
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
wait for a few seconds and the mapper and reducer should begin.
These links helped me with the issue:
https://stackoverflow.com/a/24544207/2534513
Hadoop YARN Installation: The definitive guide#Cluster Installation
I met the same problem when I ran
"hadoop jar hadoop-mapreduce-examples-2.6.4.jar wordcount /calculateCount/ /output"
this command stopped there,
I tracked the job and found "there are 15 missing blocks, and they are all corrupted".
then I did the following:
1) ran "hdfs fsck / "
2) ran "hdfs fsck / -delete "
3) added "-A INPUT -p tcp -j ACCEPT" to /etc/sysconfig/iptables on the two datanodes
4) ran "stop-all.sh and start-all.sh"
everything goes well
I think the firewall is the key point.
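A sketch of step 3 on a RedHat-style system (rule placement is an assumption; the ACCEPT rule must come before any REJECT line):
sudo vi /etc/sysconfig/iptables    # add: -A INPUT -p tcp -j ACCEPT
sudo service iptables restart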

Unknown host error when listing with hadoop fs -ls /

I am new to Hadoop and trying to install Hadoop on a multi-node cluster on Ubuntu 14.04 Server in a VM. All goes well until I try to list the files within HDFS using hadoop fs -ls /
I keep getting an error:
ls: unknown host: Hadoop-Master.
Initially I thought I had made some mistake in assigning the hostname, but I cross-checked with /etc/hosts and /etc/hostname. The hostname is listed correctly as Hadoop-Master. I also tried removing the hostname altogether, leaving only the IP address.
Another post here suggested to add two lines to .bashrc:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
I tried doing that but still getting the same error.
Please find the relevant steps below, along with edits based on the information asked for.
Check IP address of the master with ifconfig
Add it to /etc/hosts and edit /etc/hostname to add the hostname.
Add the relevant details to masters and slaves.
.bashrc File
export HADOOP_INSTALL=/usr/local/hadoop
export PIG_HOME=/usr/local/pig
export HIVE_HOME=/usr/local/Hive
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Java path
export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs:Hadoop-Master:9001</value>
</property>
</configuration>
hadoop-env.sh
export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
Edit mapred-site.xml to include the hostname and change the value to no. of nodes present.
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>2</value>
</property>
</configuration>
Edit hdfs-site.xml, changing the value to the no. of data nodes present.
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>
</configuration>
whoami
simplilearn
/etc/hosts
localhost 127.0.0.1
Hadoop-Master 192.168.207.132
Hadoop-Slave 192.168.207.140
/etc/hostname
Hadoop-Master
Changes to be made:
1. /etc/hosts file:
Change Hadoop-Master to HadoopMaster
2. /etc/hostname file:
Change Hadoop-Master to HadoopMaster
3. core-site.xml:
Change this
hdfs:Hadoop-Master:9001
to this
hdfs://HadoopMaster:9001
NOTE: Change Hadoop-Master to HadoopMaster on all nodes wherever it points to your IP. Change the slaves and masters files too.
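For reference, the corrected core-site.xml block would then look like this:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9001</value>
</property>
</configuration>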

Getting "mount.nfs: mount system call failed" on mounting HDFS

I am trying to mount the HDFS file system as per the content in the URL http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
But at the final mount statement, I get mount.nfs: mount system call failed
I got that error when executing the command below:
mount -t nfs -o vers=3,proto=tcp,nolock,noacl <HDFS server name>:/ <existing local directory>
I am running the hadoop in a Pseudo Distributed mode.
If you use root to mount NFS on the client, for example, you need to add the following configuration to core-site.xml under $HADOOP_HOME/etc/hadoop/:
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
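A minimal sketch of applying this change (the mount point and paths are assumptions; the NFS gateway itself is set up as in the HdfsNfsGateway doc linked in the question):
# restart HDFS so the new proxyuser settings take effect, then restart the NFS gateway
$HADOOP_HOME/sbin/stop-dfs.sh && $HADOOP_HOME/sbin/start-dfs.sh
# retry the mount against a local mount point
sudo mkdir -p /mnt/hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl <HDFS server name>:/ /mnt/hdfs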

How to set up a Git server with HTTP access on Linux

I need to create a Git repository on a Linux machine and then make it accessible via HTTP. I also need full access for one user and read-only access for anonymous users.
I've created local repositories before, but I don't know how to create this one (e.g. inside /var/www or /opt/git/...).
I tried doing this:
- clone a GitHub repository (with sudo) into /var/www/repos/repo.git
- cd /var/www/repos/repo.git
- sudo git --bare update-server-info
- sudo mv hooks/post-update.sample hooks/post-update
- sudo service apache2 restart
Then I tried to access this repository from another machine:
- With browser: http://192.168.1.49/repo.git <-- WORKS
- With terminal: git clone --bare http://192.168.1.49/repo.git <-- DOESN'T WORK
The terminal says:
Cloning into bare repository repo.git...
fatal: http://192.168.1.49/repo.git/info/refs?service=git-upload-pack not found: did you run git update-server-info on the server?
I think maybe it's a permissions problem. How do I need to manage permissions inside /var/www?
EDIT: Already fixed, I just needed to:
- put the repository into /var/www/repos/, named repo.git
- change the permissions of the www folder with sudo chown -R www-data:www-data /var/www
- enable WebDAV with sudo a2enmod dav_fs
- create a config file in /etc/apache2/conf.d called git.conf
- create the file with users with sudo htpasswd -c /etc/apache2/passwd.git user
- rename the post-update file and make it executable with sudo mv /var/www/repos/repo.git/hooks/post-update.sample /var/www/repos/repo.git/hooks/post-update && sudo chmod a+x /var/www/repos/repo.git/hooks/post-update
- update the server info and restart Apache with sudo git update-server-info && sudo service apache2 restart
And, to fix the problem with pushing:
Edit the .git/config file in your repository folder (on the client machine) and put the username and password in the URL:
url = http://user:password@url/repos/repo.git
So now the only thing I still need is to set read-only access for anonymous users.
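A sketch of what such a git.conf could contain (the Alias, realm, and method list here are assumptions):
Alias /repos /var/www/repos
<Location /repos>
    DAV on
    AuthType Basic
    AuthName "Git repositories"
    AuthUserFile /etc/apache2/passwd.git
    # anonymous users may read (clone/fetch); writing over DAV requires a valid user
    <LimitExcept GET PROPFIND OPTIONS REPORT>
        Require valid-user
    </LimitExcept>
</Location>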
