I created two jobs in Heritrix 3.2.0 and launched both after building them. Both started running, but after 15 to 20 seconds one job stopped while the other continued. When a job stops, the status in the jobs log is as follows:
2015-05-12T06:40:33.715Z INFO EMPTY 20150512063923
So I could not run the jobs in parallel. How can I fix this?
No, it just means that this job is done (its queue is empty). If no pages were downloaded, your decide rules are probably too strict and don't allow anything to be downloaded.
Sorry, but I can't find the configuration option I need. I schedule Spark applications, and sometimes they still have not succeeded after 1 hour. In that case I want to kill the task automatically (because I am sure it will never succeed, and another scheduled run may start).
I found a timeout configuration, but as I understand it, this is used to delay the start of a workflow.
So is there a kind of 'maximum running time' timeout?
Oozie cannot kill a workflow that it has triggered. However, you can ensure that only a single workflow instance runs at a time by setting Concurrency = 1 in the Coordinator.
You can also have a second Oozie workflow that monitors the status of the Spark job.
Anyway, you should investigate the root cause of the Spark job failing or being blocked.
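To illustrate the "second workflow that monitors the job" idea, here is a minimal Python sketch that polls the YARN ResourceManager REST API and kills any application that has been running for more than an hour. The ResourceManager address and the one-hour threshold are assumptions; adjust them for your cluster.

```python
import requests

# Assumption: the ResourceManager is reachable at this address (default HTTP port).
RM = "http://resourcemanager:8088"
MAX_ELAPSED_MS = 60 * 60 * 1000  # kill anything running longer than 1 hour

def kill_long_running_apps():
    # List all currently RUNNING YARN applications.
    resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])

    for app in apps:
        if app["elapsedTime"] > MAX_ELAPSED_MS:
            # Ask the ResourceManager to transition the application to KILLED.
            requests.put(
                f"{RM}/ws/v1/cluster/apps/{app['id']}/state",
                json={"state": "KILLED"},
            )
            print(f"Killed {app['id']} ({app['name']}) after {app['elapsedTime']} ms")

if __name__ == "__main__":
    kill_long_running_apps()
```

A script like this could be scheduled as its own Coordinator action or as a plain cron job.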
I'm experiencing strange behavior while streaming from Kafka using Spark 2.1.0/2.0.2 on AWS EMR.
"spark.streaming.concurrentJobs" was explicitly set to 1 for the streaming job, but after running for a while the Jobs tab showed more than one active job, and the number of such "active" jobs kept increasing.
Inside such jobs, some stages remain unexecuted forever (status is --). However, all the tasks under those jobs are shown as SUCCEED.
What could be wrong here? Stranger still, this behavior does not seem to occur unless I open the Spark UI page frequently to check the current status.
Jobs tab - http://ibb.co/j6XkXk
Stages - http://ibb.co/budg55
At the beginning there was only Job 12109. Things piled up after I switched tabs a couple of times.
Regards,
Alex
Can we kill one of the jobs (a time-consuming one) of a running Spark application and move on to the next job?
Let us say there are 50 jobs in a Spark application and one of them is taking more time than expected (maybe it requires more memory than we have configured). Can we kill that job and move on to the next one?
We could then run that job (the action that triggers it) later with a higher memory configuration.
If this is not possible, how should such situations be handled?
You can kill a running job by:
opening the Spark application UI.
going to the Jobs tab.
finding the job among the running jobs.
clicking the kill link and confirming.
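If you would rather do this from code than from the UI, SparkContext also exposes job-group cancellation; a minimal PySpark sketch (the group name is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="cancellation-demo")

# Tag subsequent actions with a named job group so they can be cancelled as a unit.
sc.setJobGroup("expensive-job", "job we may want to kill", interruptOnCancel=True)

# ... trigger the long-running action here, e.g. some_rdd.count() ...

# From another thread, a monitoring hook, or a timer you can then call:
sc.cancelJobGroup("expensive-job")   # cancels only the jobs in that group
# or
sc.cancelAllJobs()                   # cancels every running job in the application
```

The cancelled action raises an exception in the calling code, so you can catch it and re-submit the same action later with a larger memory configuration.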
I have been using Google Dataproc for a few weeks now, and since I started I have had a problem with canceling and stopping jobs.
It seems like there must be some server, other than those created during cluster setup, that keeps track of and supervises jobs.
I have never had a process that runs without error actually stop when I hit Stop in the dev console. The spinner just keeps spinning and spinning.
Restarting or stopping the cluster does nothing, even if it stays stopped for hours.
Only when the cluster is entirely deleted do the jobs disappear... (But wait, there's more!) If you create a new cluster with the same settings before the previous cluster's jobs have been deleted, the old jobs will start on the new cluster!
I have seen jobs that terminated on their own due to OOM errors restart themselves after a cluster restart (with no fault-tolerance coding of this sort on my side)!
How can I forcefully stop Dataproc jobs? (gcloud beta dataproc jobs kill does not work)
Does anyone know what is going on with these seemingly related issues?
Is there a special way to shutdown a Spark job to avoid these issues?
Jobs keep running
In some cases, errors have not been successfully reported to the Cloud Dataproc service. Thus, if a job fails, it appears to run forever even though it has (probably) failed on the back end. This should be fixed by a soon-to-be-released version of Dataproc in the next 1-2 weeks.
Job starts after restart
This would be unintended and undesirable. We have tried to replicate this issue and cannot. If anyone can replicate this reliably, we'd like to know so we can fix it! This may be (and probably is) related to the issue above, where a job has failed but appears to be running even after a cluster restart.
Best way to shutdown
Ideally, the best way to shut down a Cloud Dataproc cluster is to terminate the cluster and start a new one. If that is problematic, you can try a bulk restart of the Compute Engine VMs; it will be much easier to create a new cluster, however.
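To make those two options concrete, here is a small Python sketch that shells out to the gcloud CLI; the cluster name, region, and zone are placeholders, and a real cluster will usually need additional creation flags:

```python
import subprocess

# Assumptions: placeholder names; the gcloud CLI is installed and authenticated.
CLUSTER = "my-cluster"
REGION = "us-central1"
ZONE = "us-central1-a"

def recreate_cluster():
    # Preferred approach: delete the cluster and create a fresh one.
    subprocess.run(["gcloud", "dataproc", "clusters", "delete", CLUSTER,
                    "--region", REGION, "--quiet"], check=True)
    subprocess.run(["gcloud", "dataproc", "clusters", "create", CLUSTER,
                    "--region", REGION], check=True)

def bulk_reset_vms():
    # Fallback: hard-reset every Compute Engine VM whose name starts with the
    # cluster name (Dataproc names its VMs <cluster>-m / <cluster>-w-*).
    names = subprocess.run(
        ["gcloud", "compute", "instances", "list",
         "--filter", f"name ~ ^{CLUSTER}-", "--format", "value(name)"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if names:
        subprocess.run(["gcloud", "compute", "instances", "reset", *names,
                        "--zone", ZONE], check=True)
```

Deleting and recreating the cluster is the cleaner option; the bulk VM reset is a last resort for a cluster you cannot delete yet.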
Scenario: How do I gracefully tell a worker to stop accepting new jobs, and identify when it has finished processing its current jobs, so that I can shut it down as new workers come online?
Details (Feel free to correct any of my assumptions):
Here is a snippet of my current queue.
As you can see, I have 2 exchange queues for the workers (I believe these are the *.pidbox queues), 2 queues representing celeryev on each host (yes, I know I only need one), and one default celery queue. Clearly I have 90+ jobs in this queue.
(Side question) Where in the management console do you find which worker is consuming a job? I know I can look at djcelery and figure that out.
So... I know there are jobs running on each host. I can't shut Celery down on those machines, as that will kill the running jobs (and any pending ones?).
How do I stop any further processing of new jobs while allowing those still running to complete? I know that on each host I can stop Celery, but that will kill the currently running jobs as well. I want to tell the 22 jobs in the hopper to halt.
Thanks!!
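One way to approach this with Celery's remote control API is sketched below (the broker URL, worker name, and the default "celery" queue are assumptions for illustration): tell the specific worker to stop consuming from the queue, then poll until it has no active or reserved tasks left before shutting it down.

```python
import time
from celery import Celery

# Assumptions: broker URL and worker name are placeholders for your setup.
app = Celery(broker="amqp://guest@localhost//")
worker = "celery@worker1.example.com"

# Stop this worker from picking up new tasks from the default queue;
# tasks it is already executing keep running.
app.control.cancel_consumer("celery", destination=[worker])

# Wait until the worker has drained its active and reserved (prefetched) tasks.
inspect = app.control.inspect(destination=[worker])
while True:
    active = (inspect.active() or {}).get(worker, [])
    reserved = (inspect.reserved() or {}).get(worker, [])
    if not active and not reserved:
        break
    time.sleep(5)

print(f"{worker} is drained and can be shut down safely.")
# Optionally shut the drained worker down remotely:
# app.control.shutdown(destination=[worker])
```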