I have tested running Nutch in server mode by starting it locally with the bin/nutch startserver command. Now I wonder whether I can start Nutch in server mode on top of a Hadoop cluster (in a distributed environment) and submit crawl requests to the server using the Nutch REST API?
Please help.
From further research I've got the Nutch server working in distributed mode.
Steps:
1. Assume Hadoop is configured on all slave nodes, then set up Nutch on all nodes. This can help: http://wiki.apache.org/nutch/NutchHadoopTutorial
2. On your namenode, cd $NUTCH_HOME/runtime/deploy
3. bin/nutch startserver -port <port> -host <host>
Note: port and host are optional.
4. Then you can submit requests to Nutch over REST (a minimal example is sketched below). The requests you submit will be accepted by the Nutch server started in step 3.
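A minimal sketch of what such a request can look like, assuming the server is listening on <host>:<port> and that your Nutch release exposes the /admin and /job/create endpoints (the paths and JSON fields below are based on my reading of the Nutch REST API docs and may differ between versions, so treat this as a sketch rather than the exact call):
curl http://<host>:<port>/admin
curl -X POST -H "Content-Type: application/json" -d '{"type":"INJECT","confId":"default","args":{"url_dir":"/urls"}}' http://<host>:<port>/job/create
The first call just checks that the server is up; the second asks it to start an INJECT job against a seed directory.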
Happy crawling :)
Related
Q1
I installed Jenkins on Linux using the Jenkins repo. It's up and running fine. I thought it was running on nginx or Apache so I could change the hostname and install certificates, but I read somewhere that it's most likely using a small Java servlet container called Jetty? I'm a DevOps student and want to go about this the right way for future production workloads. Is there a way to access the Jetty server to make production-ready network and security updates? Should I instead redo the server and install Jenkins on Tomcat so I can make these changes? Or should I install nginx alongside whatever is running Jenkins? TIA.
Q2
I tried systemctl status nginx, httpd, tomcat, tc, http, apache2, jetty. How do you find which server is running Jenkins? I assume there may be a Java command that could tell me where the jenkins.war is being served from.
1. Jenkins is a Java application, so it runs inside a Java servlet container. By default it uses an embedded Jetty, but you can also deploy it on Tomcat. If you want to use nginx, you'll need to configure it as a reverse proxy in front of the servlet container.
2. You can find the process running Jenkins with ps -ef | grep jenkins. This lists every process on your system whose command line contains the word "jenkins". The second column is the process ID, and the last column is the full command that started the process.
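A quick way to check both, assuming Jenkins came from the official package repo and was registered as a systemd unit named jenkins (the unit name is an assumption about a default package install):
ps -ef | grep [j]enkins   # the CMD column shows the java invocation and the path to jenkins.war
systemctl status jenkins  # shows the unit, its main PID, and the same command line
If the CMD column shows a plain java -jar ... jenkins.war process rather than Tomcat, Jenkins is running its embedded servlet container.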
I may have a dumb question. I am running Spark on a remote EC2 instance and I would like to use the UI it offers. According to the official doc https://spark.apache.org/docs/latest/spark-standalone.html
I need to open http://localhost:8080 in my local browser. But when I do that, my Airflow UI opens instead. How do I get it to show Spark? Any help is appreciated.
Also, according to this doc https://spark.apache.org/docs/latest/monitoring.html, I tried http://localhost:18080 but it did not work (I did all the settings needed to see the history server).
edit:
I have also tried the command sc.uiWebUrl in Spark, which gives a private DNS name, 'http://ip-***-**-**-***.ap-northeast-1.compute.internal:4040', but I am not sure how to use it.
I assume you SSH-ed into your EC2 instance using a command like this:
ssh -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name
To connect to the spark UI, you can add port forwarding option in ssh:
ssh -L 8080:localhost:8080 -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name
and then you can just open a browser on your local machine and go to http://localhost:8080.
If you need to forward multiple ports, you can chain the -L arguments, e.g.
ssh -L 8080:localhost:8080 -L 8081:localhost:8081 -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name
Note: check that the Spark port number is correct; sometimes it's 4040 and sometimes 8080, depending on how you deployed Spark.
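Since something else (Airflow in the question) already answers on local port 8080, and sc.uiWebUrl reports the application UI on port 4040, a variant worth trying is to forward the remote ports onto different local ones; a sketch assuming the same key pair and user name as above (local ports 9090 and 4040 are just arbitrary free ports):
ssh -L 9090:localhost:8080 -L 4040:localhost:4040 -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name
Then http://localhost:9090 should show the standalone master UI and http://localhost:4040 the running application's UI.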
Hadoop, Hive, HDFS, and Spark were running fine on my namenode and were able to connect to the datanode as well. But for some reason the server was shut down, and now when I try to access the Hadoop filesystem via commands like hadoop fs -ls /, or if I try to access Hive, the connection is refused on port 8020.
I can see that the cloudera-scm-server and cloudera-scm-agent services are running fine. I tried to check the status of the hiveserver2, hivemetastore, hadoop-hdfs etc. services, but the service status command gives an error message saying these services do not exist.
Also, I tried to look for start-all.sh but could not find it. I ran the find / -name start-all.sh command and only the path for start-all.sh in the Cloudera parcel directory for Spark came up.
I checked the logs in the /var/log directory; for hiveserver2 it is pretty clear that the service was shut down. Other logs are not that clear, but I am guessing all the services went down when the server powered off.
Please advise on how to bring the whole system up again. I am unable to access Cloudera Manager or Ambari or anything on the web pages either. Those pages are down too, and I am not sure if I even have access to those because I've never tried them before; I've only been working on the Linux command line.
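As context for the "service does not exist" messages: on a Cloudera Manager managed cluster the roles (HiveServer2, NameNode, DataNode, etc.) are normally launched by the cloudera-scm-agent rather than installed as init/systemd services, so checking them that way is expected to fail. A hedged way to see what the agent is actually supervising and which daemons are still up, assuming a default Cloudera layout (the paths are assumptions about a standard install):
ls /var/run/cloudera-scm-agent/process/                                   # one directory per role the agent has launched
ps -ef | grep -i -e namenode -e datanode -e hiveserver2 | grep -v grep    # any Hadoop/Hive JVMs currently running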
I have a Spark cluster where the master node is also the worker node. I can't reach the master from the driver-code node, and I get the error:
14:07:10 WARN client.AppClient$ClientEndpoint: Failed to connect to master master-machine:7077
The SparkContext in driver-code node is configured as:
SparkConf conf = new SparkConf(true).setMaster(spark:master-machine//:7077);
I can successfully ping master-machine, but telnet master-machine 7077 fails. Meaning the machine is reachable but the port is not.
What could be the issue? I have disabled Ubuntu's ufw firewall for both master node and node where driver code runs (client).
Your syntax is a bit off, you have:
setMaster(spark:master-machine//:7077)
You want:
setMaster("spark://master-machine:7077")
From the Spark docs:
Once started, the master will print out a spark://HOST:PORT URL for
itself, which you can use to connect workers to it, or pass as the
“master” argument to SparkContext. You can also find this URL on the
master’s web UI, which is http://localhost:8080 by default.
You can use an IP address in there too. I have run into issues with Debian-based installs where I always have to use the IP address, but that's a separate issue. An example:
spark.master spark://5.6.7.8:7077
From a configuration page in Spark docs
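Once the syntax is fixed, you can confirm the exact URL and that the port is reachable from the driver node; a minimal sketch assuming a standalone master with the default log location under $SPARK_HOME/logs (the log file name and exact message text can vary by version):
grep "Starting Spark master at" $SPARK_HOME/logs/*Master*.out   # the master logs the spark://HOST:PORT URL it binds to
nc -vz master-machine 7077                                      # verify the port is open from the driver node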
I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is current and processing data, but I am trying to find which port has been assigned to the WebUI. I've tried port forwarding both 4040 and 8080 with no connection. I'm forwarding like so:
ssh -i ~/KEY.pem -L 8080:localhost:8080 hadoop@EMR_DNS
1) How do I find out what the Spark WebUI's assigned port is?
2) How do I verify the Spark WebUI is running?
Spark on EMR is configured for YARN, so the Spark UI is available via the application URL provided by the YARN Resource Manager (http://spark.apache.org/docs/latest/monitoring.html). The easiest way to get to it is to set up your browser with SOCKS using a port opened by SSH, then from the EMR console open the Resource Manager and click the Application Master URL shown to the right of the running application. The Spark History Server is available at the default port 18080.
An example of SOCKS with EMR is at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
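The tunnel itself is a single ssh command; a sketch using the key and DNS placeholder from the question (8157 is just the local port the AWS guide happens to use, any free port works):
ssh -i ~/KEY.pem -N -D 8157 hadoop@EMR_DNS   # -D opens a local SOCKS proxy on port 8157
Then point your browser's SOCKS proxy (or a proxy-switching extension, as the linked AWS page describes) at localhost:8157 and open the Resource Manager and Application Master links from the EMR console.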
Here is an alternative if you don't want to deal with the browser setup for SOCKS as suggested in the EMR docs.
Open an SSH tunnel to the master node with port forwarding to the machine running the Spark UI:
ssh -i path/to/aws.pem -L 4040:SPARK_UI_NODE_URL:4040 hadoop@MASTER_URL
MASTER_URL (EMR_DNS in the question) is the URL of the master node, which you can get from the EMR Management Console page for the cluster
SPARK_UI_NODE_URL can be seen near the top of the stderr log. The log line will look something like:
16/04/28 21:24:46 INFO SparkUI: Started SparkUI at http://10.2.5.197:4040
Point your browser to localhost:4040
Tried this on EMR 4.6 running Spark 1.6.1
Glad to announce that this feature is finally available on AWS. You won't need to run any special commands (or configure an SSH tunnel):
By clicking the link to the Spark history server UI, you'll be able to see old application logs or access the running Spark job's UI:
For more details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html
I hope it helps!
Just run the following command:
ssh -i /your-path/aws.pem -N -L 20888:ip-172-31-42-70.your-region.compute.internal:20888 hadoop@ec2-xxx.compute.amazonaws.com.cn
There are 3 places you need to change:
your .pem file
your internal master node IP
your public DNS domain.
Finally, on the YARN UI you can click your Spark Application Tracking URL, then just replace the URL:
"http://your-internal-ip:20888/proxy/application_1558059200084_0002/"
->
"http://localhost:20888/proxy/application_1558059200084_0002/"
It worked for EMR 5.x
Simply use an SSH tunnel
On your local machine do:
ssh -i /path/to/pem -L 3000:ec2-xxxxcompute-1.amazonaws.com:8088 hadoop@ec2-xxxxcompute-1.amazonaws.com
In your local machine's browser, hit:
localhost:3000