Running multiple Apache Nutch fetch map tasks on a Hadoop cluster

I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
I am using the bin/crawl script and made the following tweaks to trigger a fetch with multiple map tasks; however, it has not worked.
Added the maxNumSegments and numFetchers parameters to the generate phase:
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
Removed the topN parameter, and removed the noParsing parameter because I want parsing to happen at fetch time:
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
The generate phase is not producing more than one segment.
As a result the fetch phase is not creating multiple map tasks. I also believe that, the way the script is written, the fetch would not fetch multiple segments even if the generate phase were to produce them.
Can someone please let me know how they got the script to run on a distributed Hadoop cluster? Or is there a different version of the script that should be used?
Thanks.

Are you using Nutch 1.x for this? In that case, the Generator class looks for a property called "mapred.job.tracker" and checks whether it is set to local. This property has been deprecated in Hadoop 2, and its default value is local. You will have to override the value of the property to something other than local, and the Generator will then generate multiple partitions for the segments.

I've recently faced this problem and thought it'd be a good idea to build upon Keith's answer to provide a more thorough explanation about how to solve this issue.
I've tested this with Nutch 1.10 and Hadoop 2.4.0.
As Keith said, the if block on line 542 of Generator.java reads the mapred.job.tracker property and sets the variable numLists to 1 if the property is local. This variable seems to control the number of reduce tasks and influences the number of map tasks.
Overwriting the value of said property in mapred-site.xml fixes this:
<property>
    <name>mapred.job.tracker</name>
    <value>distributed</value>
</property>
(Or any other value you like, except local.)
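If you prefer not to edit mapred-site.xml, the same override can presumably be passed per invocation with Hadoop's generic -D option, since the Nutch tools accept it (a sketch based on the generate command above):

$bin/nutch generate -D mapred.job.tracker=distributed \
    $CRAWL_PATH/crawldb $CRAWL_PATH/segments \
    -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter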
The problem is that this wasn't enough in my case to generate more than one fetch map task. I also had to update the value of the numSlaves parameter in the runtime/deploy/bin/crawl script. I didn't find any mention of this parameter in the Nutch 1.x docs, so I stumbled upon it after a bit of trial and error:
#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################
# set the number of slaves nodes
numSlaves=3
# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`
...
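Further down, these variables flow into the generate step, roughly like this (paraphrased from the 1.10 script; the exact list of -D options may differ):

commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.map.tasks.speculative.execution=false"
# numSlaves becomes the number of fetch lists, and hence fetch map tasks
$bin/nutch generate $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments \
    -topN $sizeFetchlist -numFetchers $numSlaves -noFilter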

Related

Spark executor metrics don't reach Prometheus sink

Circumstances:
I have read through these:
https://spark.apache.org/docs/3.1.2/monitoring.html
https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/
Versions: Spark 3.1.2, K8s v19
I am submitting my application via
-c spark.ui.prometheus.enabled=true
-c spark.metrics.conf=/spark/conf/metric.properties
metric.properties:
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
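For context, the full submit looks something like this (master URL, image, and jar are placeholders):
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  -c spark.ui.prometheus.enabled=true \
  -c spark.metrics.conf=/spark/conf/metric.properties \
  local:///opt/spark/work-dir/app.jar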
Result:
Both of these endpoints have some metrics
<driver-ip>:4040/metrics/prometheus
<driver-ip>:4040/metrics/executors/prometheus
The first one (the driver one) has all the metrics.
The second one (the executor one) has all the metrics except the ones under the executor namespace
described here: https://spark.apache.org/docs/3.1.2/monitoring.html#component-instance--executor
So everything from bytesRead.count to threadpool.startedTasks is missing.
But these metrics are indeed reported by the executors, because under /api/v1/applications/app-id/stages/stage-id I can see them too.
I struggled with this for hours: moving the configs to the --conf flag, splitting the configs up by instance, enabling everything, etc. No result.
However if I change the sink from prometheus to ConsoleSink:
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
Then the metrics appear successfully.
So something is definitely wrong with the Spark-K8s-Prometheus integration.
Note:
One interesting thing is that if I split up the config by instance, like
driver.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
executor.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
driver.sink.prometheusServlet.path=/metrics/prometheus1
executor.sink.prometheusServlet.path=/metrics/executor/prometheus1
(note the trailing '1' at the end)
then the executor sink path is not taken into account: the driver metrics will be on
/metrics/prometheus1, but the executors will still be on /metrics/executor/prometheus.
The class config is indeed working, because if I change it to a nonexistent one, the executor throws an error as expected.
I have been looking into why custom user metrics are not sent to the driver while the regular Spark metrics are.
It looks like the PrometheusSink uses the ExecutorSummary class, which doesn't allow adding custom metrics.
For the moment, it seems the only working approach is to use the JMXExporter (with the Java agent to export to Prometheus), or just use the ConsoleSink with
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
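For reference, the JMX route would look roughly like this (the agent jar path, port, and config file are placeholders):

# metric.properties: expose all metric instances over JMX
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# spark-submit: attach the JMX->Prometheus exporter agent to each JVM
-c spark.driver.extraJavaOptions=-javaagent:/opt/jmx_prometheus_javaagent.jar=8090:/opt/jmx_config.yaml
-c spark.executor.extraJavaOptions=-javaagent:/opt/jmx_prometheus_javaagent.jar=8090:/opt/jmx_config.yaml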

Submitting multiple runs to the same node on AzureML

I want to perform hyperparameter search using AzureML. My models are small (around 1GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but then the run fails with an MPI error that reads as if the cluster is configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my training script to train multiple models in parallel, so that each run wraps multiple trainings. I would like to avoid this option though, because logging and checkpoint writing become increasingly messy, and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which will start child runs that are "local" to the parent run and don't need authentication.
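A minimal sketch of that approach, assuming the azureml-core SDK and calling it from inside train.py (the count and the logged values are illustrative):

from azureml.core import Run

# inside train.py: grab the submitted (parent) run and spawn child runs
parent_run = Run.get_context()
children = parent_run.create_children(count=4)

for i, child in enumerate(children):
    # train model i here, logging metrics against the child run
    child.log("model_index", i)
    child.complete()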
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
Alternatively, you can deploy a single service but load multiple model versions in init(); the score function then, depending on the request's parameters, uses a particular model version to score.
Or use the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
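A rough sketch of the multi-version scoring pattern (the model name, versions, request fields, and joblib serialization are hypothetical):

import json
import joblib
from azureml.core.model import Model

models = {}

def init():
    # load several registered versions of the same model once, at startup
    for version in (1, 2):
        path = Model.get_model_path("my_model", version=version)
        models[version] = joblib.load(path)

def run(raw_request):
    # pick the model version based on a request parameter
    request = json.loads(raw_request)
    version = request.get("model_version", 1)
    return models[version].predict(request["data"]).tolist()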

Getting the load for each job

Where can I find the load (used/claimed CPUs) per job? I know how to get it per host using sinfo, but that does not directly show which job causes a possible 'incorrect' load of anything unequal to 1.
(I want to get this for all jobs, i.e. logging in to the node and running top is not my objective.)
You can use
sacct --format='jobid,ReqCPUS,elapsed,AveCPU'
and compare Elapsed with AveCPU. The latter will only be available for job steps, not for the whole job.
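For example, for a job that requested 4 CPUs (the output below is illustrative):

$ sacct --format='jobid,ReqCPUS,elapsed,AveCPU'
       JobID  ReqCPUS    Elapsed     AveCPU
------------ -------- ---------- ----------
1234                4   01:00:00
1234.batch          4   01:00:00   03:55:00

Here AveCPU is close to ReqCPUS times Elapsed (4 x 01:00:00), so the step kept its claimed CPUs busy (load close to 1); a much lower AveCPU would point at an 'incorrect' load.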

How to get job status of crawl tasks in Nutch

In a crawl cycle, we have many tasks/phases: inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job.
Is there any way to get the status of a crawl task (whether it is running or has failed) other than by referring to the hadoop.log file?
To be more precise, can I track the status of the generate/fetch/parse phases? Any help would be appreciated.
You should always run Nutch with Hadoop in pseudo- or fully-distributed mode; this way you'll be able to use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
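Besides the web UI, the standard Hadoop/YARN command-line tools can report per-job status, which covers each Nutch phase since every phase runs as its own MapReduce job (a sketch, assuming Hadoop 2/YARN):

# list MapReduce jobs and their states (each Nutch phase is one job)
mapred job -list
# on YARN, list applications filtered by state
yarn application -list -appStates RUNNING,FAILED,FINISHED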

Requesting specific nodes with TORQUE qsub?

There's a cluster with TORQUE qsub installed. I want to send a job, but I want to make sure that it runs on one of a specific set of nodes.
Is it possible to request a list of possible nodes in qsub, so that the job is sent to one of the nodes in the requested set, never to a node outside the set?
Using just TORQUE, the way to do this is to add a feature (or property) to each of the nodes in the set and add the feature as part of the job request. For example:
#nodes file entry
node01 fast np=32
# line in job script to request 2 'fast' nodes with 16 execution slots on each
#PBS -l nodes=2:fast:ppn=16
Depending on which scheduler you're using there may be easier ways to accomplish this task.
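For example, with the nodes file entry above in place, the same request can be made directly on the qsub command line (myjob.sh is a placeholder for your job script):

qsub -l nodes=2:fast:ppn=16 myjob.sh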
