How to get job status of crawl tasks in Nutch

In a crawl cycle there are many tasks/phases: inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job.
Is there any way to get the status of a crawl task (whether it is running or has failed) other than reading the hadoop.log file?
To be more precise, can I track the status of the generate/fetch/parse phases? Any help would be appreciated.

You should always run Nutch with Hadoop in pseudo- or fully-distributed mode; this way you'll be able to use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
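For example, with Hadoop in pseudo-distributed mode, each crawl step (generate, fetch, parse, ...) runs as a MapReduce job whose state you can poll from the command line; a minimal sketch, where the job ID is a placeholder:
mapred job -list                            # list running jobs with their IDs
mapred job -status job_1700000000000_0042   # state, progress and counters of one job (placeholder ID)
The same information is also available in the ResourceManager web UI (port 8088 by default).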

Related

Rerunning timed-out SLURM array jobs efficiently

I am running a large number of Slurm array jobs. Some fraction of the jobs end up timing out. Is there an efficient way to identify those jobs and rerun them with an increased wall time? Currently, I am using sacct -j jobID to list all the jobs, manually identifying the failed ones, and then rerunning them after updating the wall time. But this procedure is rather cumbersome. Any suggestions to improve this method would be appreciated.
The atools suite of utilities (available on GitHub) aims to solve exactly that problem. It offers a set of commands you can use to easily track and re-submit jobs in a job array. It was designed originally for PBS, but is fully functional with Slurm.
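If you prefer to stay with plain Slurm tools, the same workflow can be scripted with sacct and sbatch; a rough sketch, where the job ID, task list, wall time and script name are all placeholders:
sacct -j 12345 --state=TIMEOUT --noheader --format=JobID   # list the timed-out array tasks
sbatch --time=08:00:00 --array=3,17,42 my_job.sh           # resubmit just those tasks with more wall time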

Getting the load for each job

Where can I find the load (used/claimed CPUs) per job? I know how to get it per host using sinfo, but that does not directly show which job is causing a load other than 1.
(I want to get this for all jobs, i.e. logging in to each node and running top is not my objective.)
You can use
sacct --format='jobid,ReqCPUS,elapsed,AveCPU'
and compare Elapsed with AveCPU. The latter will only be available for job steps, not for the whole job.
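For jobs that are still running, sstat reports the same per-step metric; for example (the job ID is a placeholder):
sstat -j 12345.batch --format=JobID,AveCPU,MaxRSS   # live CPU time and memory of the batch step
As a rough rule of thumb, the ratio AveCPU/Elapsed approximates how many CPUs each task actually kept busy, which you can compare against ReqCPUS.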

How to make Spark to fail fast with clarity

I'm learning Spark, and quite often I have some issue that causes tasks and stages to fail. With my default configuration, there are rounds of retries and a bunch of ERROR messages to that effect.
While I totally appreciate the idea of retrying tasks when I finally get to production, I'd love to know how to make my application fail at the first sign of trouble so that I can avoid all the extra noise in the logs and within the application history itself. For example, if I run it out of memory, I'd love to just see the OOM exception near the end of my log and have the whole app fail.
What's the best way to set up configs for this kind of workflow?
You can set spark.task.maxFailures to 1.
That setting controls the number of individual task failures tolerated before giving up on the job; its default value is 4.
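For example, on the command line at submit time (the application class and jar are placeholders):
spark-submit --conf spark.task.maxFailures=1 --class com.example.MyApp my-app.jar   # fail the job on the first task failure
The same line can go into conf/spark-defaults.conf instead. On YARN you may also want spark.yarn.maxAppAttempts=1, so the application as a whole is not re-attempted either.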

Duplicate jobs in Sun Grid Engine

When I run qacct with the job ID after it has finished, I get two results:
the one I ran and an older job with the same job ID.
How can I delete the history that qacct reads?
Does anyone know how to solve this?
Thanks,
Tsvi
Grid Engine (or SGE) has job IDs in the range 0..99999. These may roll over quickly on some clusters, and people may still be interested in the statistics of older jobs with the same ID, so the history is kept; you can identify your own job by also checking the approximate submit time.
That said, if you want to eliminate the duplicate job IDs from qacct, you can rotate the accounting file (<sge_root>/<cell>/common/accounting) using utilities like logchecker.sh.
Check the man page or this grid engine online documentation:
http://gridscheduler.sourceforge.net/howto/rotatelogs.html
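A lighter-weight workaround is to narrow the qacct query so the rolled-over ID no longer matches the old job; for example (the job ID and time window are placeholders):
qacct -j 12345 -d 7              # only jobs that finished in the last 7 days
qacct -j 12345 -b 202401010000   # only jobs begun after the given [[CC]YY]MMDDhhmm time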

How to know OPC job status using Syncsort or any other method?

My objective is to get the current timestamp using Syncsort when an existing OPC job has run fine in production. In my case I cannot add my new job after the existing OPC job. Is there any facility to check whether the existing job ran fine in production?
I mean, is there any reference table holding production job details with a status for each day?
Any help to move forward would be appreciated.
There are commercial packages that track jobs and job status; CA (Computer Associates) is one such vendor.
However, these packages cost a lot. A simple, home-grown solution is to have a dataset known to both jobs: job1 writes a one-line record into that dataset when it completes, and job2 reads the dataset to "know" whether job1 ran. If this is what you are trying to do, it is not exactly clear from your question, but any solution along these lines works, until management wants to cough up $50K (or whatever) for a commercial package.
