I have been experimenting with some crawl cycles in Nutch and would like to set up a distributed crawl environment. But I wonder how I can trigger Nutch for incoming crawl requests in a production system. I have read about the Nutch REST API. Is that the only real option I have? Or can I run Nutch as a continuously running distributed server some other way?
My preferred Nutch version is 1.12.
As sujen stated, there are two options for this:
Use the REST API if you want to submit crawl requests to Nutch remotely.
The steps to get this running are described here:
How to run nutch server on distributed environment
Otherwise, you can run the bin/crawl script from runtime/deploy to launch crawl jobs on your Hadoop cluster.
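Rough sketches of both, assuming a default Nutch 1.x setup (the seed paths, crawl id, and host are placeholders for your environment):

# Option 1: start the Nutch REST server (port 8081 by default) and
# submit an inject job over HTTP; url_dir points at your seed list:
bin/nutch startserver -port 8081
curl -X POST -H "Content-Type: application/json" \
  -d '{"type": "INJECT", "confId": "default", "crawlId": "crawl01", "args": {"url_dir": "/data/seeds"}}' \
  http://localhost:8081/job/create

# Option 2: from runtime/deploy, bin/crawl submits the generated jobs to
# whatever cluster the hadoop command on your PATH is configured for;
# here urls/ and crawl/ are HDFS paths and 2 is the number of rounds:
bin/crawl urls/ crawl/ 2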
I am getting started with Apache Livy. Following the online documentation, I was able to submit a Spark job through curl (I have posted another question on converting curl to a REST call). My plan was to try it out with curl and then convert that curl call into a REST API call from Scala. However, after spending an entire day trying to figure out how to convert the Livy curl call to REST, I feel like my understanding is wrong.
I am checking this example from Cloudera, and I see we have to create a LivyClient instance and upload the application code to the Spark context with it. Is that the correct way? I have a use case where I need to trigger my Spark job from the cloud; do I need to put my dependencies in the cloud and add them with uploadJar as in the Cloudera example?
There are two options for interacting with the Livy server:
Use the LivyClient programmatic API, which makes it easier to interact with the Livy server from JVM code.
Use the exposed REST API, which can be called from any language (see the curl sketch below).
Please check the link below:
https://livy.incubator.apache.org/docs/latest/rest-api.html
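For the REST option, a minimal sketch of submitting a job as a Livy batch; the host, jar path, and class name are placeholders for your setup:

# Submit a batch job (Livy listens on port 8998 by default):
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"file": "hdfs:///jobs/my-spark-job.jar", "className": "com.example.MySparkJob"}' \
  http://livy-host:8998/batches

# Poll its state using the id returned above:
curl -s http://livy-host:8998/batches/0

This avoids LivyClient and uploadJar entirely: you put the jar somewhere the cluster can read it (e.g. HDFS) and reference it in the batch request.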
I'm new to ETL development with PySpark, and I've been writing my scripts as paragraphs in Apache Zeppelin notebooks. I'm curious what the typical deployment flow is. How are you converting your code from a Zeppelin notebook into your ETL pipeline?
Thanks!
Well, that heavily depends on the sort of ETL you're doing.
If you want to keep the scripts in the notebooks and you just need to orchestrate their execution, then you have a couple of options:
Use Zeppelin's built-in scheduler
Use cron to launch your notebooks via curl commands and Zeppelin's REST API (see the sketch after this list)
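For the cron option, a sketch of a crontab entry; the host and note id are placeholders for your environment:

# Trigger all paragraphs of a note every night at 2am through
# Zeppelin's REST API:
0 2 * * * curl -s -X POST http://zeppelin-host:8080/api/notebook/job/2ABCDEFGH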
But if you already have an up-and-running workflow management tool like Apache Airflow, you can add new tasks that launch the aforementioned curl commands to trigger the notebooks (in Airflow, a BashOperator or PythonOperator works for this). Keep in mind, though, that you'll need some workarounds to get sequential execution of different notes.
One major tech company that's betting heavily on notebooks is Netflix (you can take a look at this), and they have developed a set of tools to improve the efficiency of notebook-based ETL pipelines, such as Commuter and Papermill. They're more into Jupyter, so Zeppelin compatibility is not provided yet, but the core concepts should be the same when working with Zeppelin.
For more on Netflix's notebook-based pipelines, you can refer to this article shared on their tech blog.
I am using Nutch 1.10 to crawl websites for my organization. I use a machine with 16 GB of RAM to do this. As of now, Nutch uses only 3-4 GB of RAM while crawling, and it takes almost 10 hours to finish. Is there some way I can configure Nutch to use more than 12 GB of RAM for the same task? All suggestions are most welcome!
Under the assumption that the script bin/nutch or bin/crawl is used for crawling in local mode (no Hadoop cluster): the environment variable NUTCH_HEAPSIZE defines the JVM heap size in MB.
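A minimal sketch, assuming a local-mode crawl (the seed and crawl directories and the round count are placeholders):

# Give the crawl a 12 GB heap; the value is interpreted as MB:
export NUTCH_HEAPSIZE=12000
bin/crawl urls/ crawl/ 2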
I am using Apache Spark in Bluemix.
I want to implement a scheduler for SparkSQL jobs. I saw this link to a blog that describes scheduling, but it's not clear how I should update the manifest. Maybe there is some other way to schedule my jobs.
The manifest file guides the deployment of Cloud Foundry (cf) apps. So in your case, it sounds like you want to deploy a cf app that acts as a SparkSQL scheduler, and use the manifest file to declare that your app doesn't need any of the web app routing stuff, or anything else for user-facing apps, because you just want to run a background scheduler. This is all well and good, and the cf docs will help you make that happen.
However, you cannot run a SparkSQL scheduler against the Bluemix Spark service today, because it only supports Jupyter notebooks through the Data-Analytics section of Bluemix; i.e., only a notebook UI. You would need a Spark API you could drive from your scheduler cf app, e.g. a spark-submit type thing where you can create your Spark context and then run programs, like the SparkSQL you mention. This API is supposed to be coming to the Apache Spark Bluemix service.
UPDATE: spark-submit was made available sometime around the end of 1Q16. It is a shell script, but internally it makes REST calls via curl. The REST API itself doesn't seem to be supported yet, so you can either call the script from your scheduler, or take the risk of calling the REST API directly and hope it doesn't change and break you.
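If you go the script route, a sketch of driving it from cron; the script path, class, jar, and flags here are placeholders, since the exact arguments depend on the version of the script Bluemix ships:

# Run the SparkSQL job at the top of every hour via the provided
# spark-submit shell script (hypothetical paths):
0 * * * * /opt/scheduler/spark-submit.sh --class com.example.SqlJob /opt/jobs/sql-job.jar >> /var/log/sql-job.log 2>&1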
OS: CentOS 6.4
ISSUE:
I installed gmond, gmetad and gweb on a server, and installed a Spark worker on the same server.
I configured $SPARK_HOME/conf/metrics.properties as below...
CONFIGURATION (metrics.properties in spark):
org.apache.spark.metrics.sink.GangliaSink
host localhost
port 8649
period 10
unit seconds
ttl 1
mode multicast
We are not able to see any metrics in the Ganglia web UI.
Any help would be appreciated.
In the first place, those lines are just descriptions of the GangliaSink's default settings; you should not simply uncomment them. Taken from the metrics section of the Spark documentation (spark page):
To install the GangliaSink you’ll need to perform a custom build of Spark. Note that by embedding this library you will include LGPL-licensed code in your Spark package. For sbt users, set the SPARK_GANGLIA_LGPL environment variable before building. For Maven users, enable the -Pspark-ganglia-lgpl profile. In addition to modifying the cluster’s Spark build user applications will need to link to the spark-ganglia-lgpl artifact.
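In practice, that means rebuilding Spark with the Ganglia profile and then configuring the sink with the *.sink.ganglia.* properties. A sketch, assuming a Maven build; the host and port must match your gmond configuration:

# Rebuild Spark with the LGPL Ganglia sink included:
./build/mvn -Pspark-ganglia-lgpl -DskipTests clean package

# Then, in $SPARK_HOME/conf/metrics.properties:
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=localhost
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
*.sink.ganglia.ttl=1
*.sink.ganglia.mode=multicast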