How do we know when Heritrix completes a crawl job?

In our application, Heritrix is used as the crawl engine, and once a crawl job is finished we manually kick off an endpoint to download the PDFs from a website. We would like to automate this PDF-download task as soon as the crawl job is complete. Does Heritrix provide any URI/web service method that returns the status of a job, or do we need to create a polling app to continuously monitor the status of the job?

I don't know if there is any option to do it without continuous monitoring, but you can use the Heritrix API to get the status of a job, with something like:
curl -v -d "action=" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
This gives you XML from which you can read the job status.
Another, maybe easier (yet not so 'professional') option is to check whether the job's warcs directory contains a file with a .open extension. If it does not, the job is finished.
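As a minimal sketch of that second approach (the warcs path and the PDF-download endpoint below are assumptions; adjust them to your job layout and your own application, and note it assumes the crawl has already started writing WARCs):
#!/bin/sh
# Poll the job's warcs directory; a *.open file means a WARC is still being written.
WARCS_DIR=/opt/heritrix/jobs/myjob/latest/warcs        # assumed location of the job's warcs
DOWNLOAD_ENDPOINT=http://localhost:9090/download-pdfs  # hypothetical endpoint in your app
while ls "$WARCS_DIR"/*.open >/dev/null 2>&1; do
  sleep 60
done
# No open WARCs left, so the crawl is done: kick off the PDF download.
curl -X POST "$DOWNLOAD_ENDPOINT"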

Related

Nutch jobs failing after second round that is in fetch stage?

My Nutch jobs fail after the second round, in the fetch stage. I am using an EMR cluster and it is not throwing any error. What could be the reasons it stops at the second round?
The reason was that I did not run the command with nohup. I was previously running it with sh filename.sh and it stopped after some crawls; now I am running it using nohup sh filename.sh &.
Thanks @Sebastian Nagel
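For reference, a hedged example of that nohup invocation with the output kept in a log file (the log filename is just an illustration):
# Keep the crawl script alive after the terminal session closes and capture its output
nohup sh filename.sh > crawl.log 2>&1 &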

Running a node script 'forever' in Heroku (or AWS) with no front end?

I have a requirement to run a node script (not an app or dapp) which has no front-end files (HTML, CSS). This script will send transactions (call a smart contract function) at regular intervals. The only constraint is that this script needs to run perpetually (forever) without stopping, unless a specific command is given by an admin. Please suggest how we could achieve this. Thanks.
PS: If you have a better platform suggestion than Heroku, that is welcome as well, with details.
Unix cron works fine for this kind of thing. Just add an entry to your crontab by editing it with:
crontab -e
Then set your schedule pattern and add a line with the command to launch; for example, this will run each day at 3:00 AM:
0 3 * * * /root/backup.sh
Then don't forget to reload your cron process:
sudo service cron restart
You can work out your pattern here: https://crontab.guru/
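Applied to the node script from the question, the crontab line could look like the sketch below; the paths and the 5-minute interval are assumptions, and flock is used so a slow run is not started twice in parallel:
# Send transactions every 5 minutes; skip the run if the previous one is still going
*/5 * * * * /usr/bin/flock -n /tmp/send-tx.lock /usr/bin/node /home/admin/send-transactions.js >> /var/log/send-transactions.log 2>&1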
You can also have a look at kue.
The good thing about kue is that it gives you a UI where an admin can view the running and failed jobs, etc. Also, you can configure it to stop a job programmatically.

How to show all jobs in SGE (Sun Grid Engine)

I used apt to install SGE on my Ubuntu 16.04 server.
However, when I use qsub to submit a job, qstat -j "*" only shows information for error jobs and running jobs.
Does SGE have a command or option to show information for all jobs (including successfully finished jobs)?
For terminated jobs, qacct should be used.
E.g. to show information about all jobs in the accounting file, use:
qacct -j "*"
The qacct man page gives further information:
http://gridscheduler.sourceforge.net/htmlman/htmlman1/qacct.html
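For example (a sketch; the job id 12345 is made up, and the exact options can vary between SGE builds, so check the man page above):
# Accounting summary for one finished job, including exit_status and resource usage
qacct -j 12345
# Finished jobs belonging to the current user
qacct -o $USER -j "*"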

How to wait until services are ready

I have been setting up a Jenkins pipeline using Docker images. Now I need to run various services like MySQL, Redis, Memcached, Beanstalkd and Elasticsearch. To make the job wait until MySQL is ready, I am using the following commands:
sh "while ! mysqladmin ping -u root -h mysqlhost ; do sleep 1; done"
sh 'echo MySQL server is up and running'
Here mysqlhost is the hostname I have given the container. Similarly, I need to check and wait for Redis, Memcached, Beanstalkd and Elasticsearch, but pinging those services does not work the way it does for MySQL. How can I implement this?
The Docker docs mention this script to manage container readiness checks: https://github.com/vishnubob/wait-for-it
I also use this one which is compatible with Alpine:
https://github.com/eficode/wait-for
You can curl these services to check whether they are alive or not.
For Redis you can also use PING: https://redis.io/commands/ping
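If you prefer to stay with simple shell loops like the mysqladmin one above, a sketch along these lines may work for the other services; the hostnames mirror the question and are assumptions, and redis-cli, nc and curl must be available where the checks run. Each line can be wrapped in an sh "..." step exactly like the MySQL check:
# Redis replies to PING with PONG once it is ready
while ! redis-cli -h redishost ping | grep -q PONG; do sleep 1; done
# Memcached and Beanstalkd: just wait until their TCP ports accept connections
while ! nc -z memcachedhost 11211; do sleep 1; done
while ! nc -z beanstalkdhost 11300; do sleep 1; done
# Elasticsearch: wait until the cluster health endpoint answers
while ! curl -sf http://eshost:9200/_cluster/health >/dev/null; do sleep 1; done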

How to back up & restore Spinnaker pipelines

I am new to Spinnaker and trying to use it for the client I am working with. I am somewhat familiar with the Spinnaker architecture.
I know the Front50 microservice is responsible for this task, but I am not sure how I can safely back up the pipeline data and restore it into a new instance.
I want to be able to continuously back up these pipelines as they are added, so that when I recreate the Spinnaker instance (i.e. destroy the infra and recreate it from scratch) I am able to restore them.
I am currently using Azure as the cloud provider and Azure Container Service.
I found this page: https://www.spinnaker.io/setup/install/backups/
but it does not indicate whether the pipelines will also be backed up.
Many thanks in advance
I am not sure about the standard method, but you can copy the configurations for pipelines and applications from Front50 manually.
For pipelines, just do a curl to http://<front50-IP>:8080/pipelines
curl http://<front50-IP>:8080/pipelines -o pipelines.json
For applications config:
curl http://<front50-IP>:8080/v2/applications -o applications.json
To push pipeline config to Spinnaker, you can do:
cat pipelines.json | curl -d @- -X POST \
--header "Content-Type: application/json" --header \
"Accept: */*" http://<Front50_URL>:8080/pipelines
P.S.: My Spinnaker version is 1.8.1, and both the v1 and v2 k8s providers are supported.
Update-2: If you are using AWS S3 or GCS, you can back up the buckets directly.
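To get the continuous backups asked about above, one option is to wrap those two curl calls in a small script and run it from cron. A sketch, where the Front50 address and backup directory are assumptions:
#!/bin/sh
# Dump Front50's pipeline and application configs to timestamped files.
FRONT50=http://<front50-IP>:8080        # replace with your Front50 address
BACKUP_DIR=/var/backups/spinnaker
STAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
curl -s "$FRONT50/pipelines" -o "$BACKUP_DIR/pipelines-$STAMP.json"
curl -s "$FRONT50/v2/applications" -o "$BACKUP_DIR/applications-$STAMP.json"
A crontab entry such as 0 * * * * /usr/local/bin/spinnaker-backup.sh would then keep hourly copies.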
