How to run multiple instances of Nutch?

I want to crawl multiple websites by simultaneously running multiple instances of Apache Nutch 1.6. Should I install multiple copies of Apache Nutch in different locations and create a single (master) .sh file that executes the nutch crawl command for every copy? Or is it possible to configure a single copy of Nutch to run multiple instances?

I used the bin/crawl script and ran it in two different terminals simultaneously. Both finished their execution without any bug (at least as far as I could judge). I had supplied a different seed directory and crawl directory to each simultaneous instance.
However, another thread here states that you must run the bin/nutch command with a different configuration file for each simultaneous instance, and supply a different /tmp/ path to each one. I didn't have to go through that trouble myself; the method above worked pretty well for me.
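For illustration, a minimal sketch of what the two simultaneous runs look like (the seed directories, crawl directories, Solr URL, and number of rounds are placeholders; check the usage line of your version's bin/crawl script for the exact arguments):

    # terminal 1
    bin/crawl urls/site1 crawl-site1 http://localhost:8983/solr/ 2
    # terminal 2, started at the same time with its own seed and crawl directories
    bin/crawl urls/site2 crawl-site2 http://localhost:8983/solr/ 2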

Related

Jupyter Lab on cluster with shared storage

I'm on a Slurm cluster with a bunch of nodes. I want to run two separate notebooks on two separate nodes. Unfortunately, when I run two JupyterLab instances, they wind up clobbering .ipynb_checkpoints and other hidden files that are essential for Jupyter, even if they use different ports. I think this is because they are sharing the same home directory. I'm wondering if there is some add-on that would allow me to select which node to use when initializing a kernel, but I can't find one.
The solution was to start the JupyterLab instances on different nodes from different directories (starting them in the same directory causes them to clobber each other). I don't know why this didn't work before. Watch out for editing the same file in two different instances, though.
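A rough sketch of that workaround (node names, ports, and directories are placeholders; on a Slurm cluster each command would typically run inside its own job allocation):

    # on node A
    mkdir -p ~/labs/nb-a && cd ~/labs/nb-a && jupyter lab --no-browser --port 8888
    # on node B, a different working directory so the two instances keep separate state
    mkdir -p ~/labs/nb-b && cd ~/labs/nb-b && jupyter lab --no-browser --port 8889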

Tool to manage multiple scripts running simultaneously

We have multiple scripts that run simultaneously on a server to perform some tasks. These scripts are scheduled to run at a set frequency via cron.
Using cron for this purpose has many drawbacks. With cron we are unable to check whether the previous run has completed and, if not, to wait for it before running again. We are also unable to capture the errors that occur while a script runs, or its output. This approach also increases CPU load.
So I need a tool in which we can set up these scripts so that each one executes only once the previous run has completed, and which tracks the output of those scripts as well. At first I tried ActiveMQ for this, but I don't think that tool is suitable for this purpose.
Can someone please suggest a tool for this requirement?
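For what it's worth, the "run only if the previous run has completed" part alone can be handled without a new tool by guarding each cron entry with flock and redirecting its output to a log; a minimal sketch (paths and schedule are illustrative):

    # crontab entry: flock -n skips this run if the previous one still holds the lock
    */5 * * * * flock -n /tmp/task1.lock /path/to/task1.sh >> /var/log/task1.log 2>&1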

How do I run a shell script on multiple GCE machines concurrently?

I have multiple Linux GCE machines and there's a bash script I want to run on some of them from time to time. Right now I'm connecting manually to each machine separately in order to run the script.
How do I run the script on several machines concurrently without connecting to each machine individually each time? How do I configure on which machines it should run?
You can use csshX to SSH into multiple servers at once. Once logged in to the servers, you can execute the script as needed. You can follow this link to install it on a Mac.
Another alternative is to schedule a cron job on all the servers so the script runs at a specific time. You can edit these multiple crontabs using csshX.
Try Fabric. You can write simple scripts in Python that you can run on multiple hosts (to which you will need SSH access).
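If you'd rather not install anything, a plain SSH loop gives a similar effect; a minimal sketch, assuming key-based SSH access is already set up (hostnames and the script path are placeholders):

    # run the script on each machine concurrently, then wait for all of them to finish
    for host in gce-host1 gce-host2 gce-host3; do
        ssh "$host" 'bash -s' < ./myscript.sh &
    done
    wait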

Can I run multiple node processes from 1 copy of my source code

I want to know if it's possible to run multiple node processes from the same directory, either with the same file or with different files. The processes I am running will basically execute batch jobs and will not be running a server, but I would like to know about the server case as well.
Let's say I have the following structure
src
  server.js
test
  file1.js
  file2.js
I have a two-part question.
Will I be able to open up two terminals in the src directory, execute PORT=3000 node server.js from the first terminal, and then, while being in the same directory, run PORT=3001 node server.js from the second terminal?
Secondly, I want to set up a cron job to run batch jobs, so I would like to call node src/file1.js and node src/file2.js from my cron job. Since these files will reside in the same directory, will I need separate copies of the source code in order to run two separate jobs, or can I do it from the same directory and keep only one copy of the source?
In general, for each separate node process we run from a directory, whether on the same file or on a different file, do I need to run it out of a different copy of my source code, or can I use one copy of my code and run multiple node processes from different terminals or cron jobs?
Will I be able to open up 2 terminals in the src directory and basically execute PORT=3000 node server.js from the first terminal, and then while being in the same directory in the second terminal run PORT=3001 node server.js?
Yes, that will work fine. Source code for a node.js program is loaded from disk, parsed, and turned into an internal structure in memory; the source files themselves are not used again while the program is running. You can load that source code as many times as you want for different instances of the app. You will have to make sure that each instance does not contend for the same resources (ports, files, etc.), but it looks like you're already aware of that, and it would be no different if different programs were running on the same computer.
Secondly, I want to set up a cron job to run batch jobs, so I would like to call node src/file1.js and node src/file2.js from my cron job. Since these files will reside in the same directory, will I need to have separate copies of the source code in order to run 2 separate jobs, or can I do it from the same directory and have only one copy of the source?
One copy of the source will be just fine.
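For example, both batch jobs can be scheduled from one checkout; a sketch with an illustrative install path, schedule, and log locations:

    # both entries share one copy of the source; only the entry point differs
    0 2 * * * cd /opt/myapp && /usr/bin/node src/file1.js >> /var/log/file1.log 2>&1
    0 3 * * * cd /opt/myapp && /usr/bin/node src/file2.js >> /var/log/file2.log 2>&1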
In general, for each separate node process we run from a directory, either on the same file or a different file, do I need to run it out of a different copy of my source code, or can I use one copy of my code and run multiple node processes from different terminals or cron jobs?
One copy of the source is fine. Just to give you an idea, the node.js cluster module does exactly this: it runs multiple processes using the same source code.
I'm not a node.js guru, but your scripts should run separately even if they are in the same directory, as long as they are not sharing resources. Concurrent processes writing to the same file, for example, may cause failures (reading isn't a problem). But system calls should work just fine.

How to run multiple Elasticsearch (2.2) nodes as processes on the same server

I was wondering if you could help me out here;
I am trying to run multiple Elasticsearch processes on the same (CentOS) server, but I have been unsuccessful so far.
I have not enabled the service wrapper, and Elasticsearch has been installed using the .rpm package.
The requirements are:
every instance belongs to a different cluster (cluster.name)
every instance uses a different port: 9201, 9202, 9203, etc.
every instance should be parameterised with a different ES_HEAP_SIZE
The elasticsearch.yml file, where all the parameters are described, is attached.
The questions are:
how to set a different configuration file per instance, when -Des.config seems to be deprecated in 2.2
how to set a custom ES_HEAP_SIZE (-Xmx24g -Xms24g), when:
# bin/elasticsearch -Des.config=config/IP-spotlight.RRv4/elasticsearch.yml
[2016-02-14 19:44:02,858][INFO ][bootstrap] es.config is no longer supported. elasticsearch.yml must be placed in the config directory and cannot be renamed.
Please help.
You have two solutions:
Download the Elasticsearch archive from the site and run it from different paths with different configs. You can monitor each running instance with a tool like supervisor. The main page for Elasticsearch downloads is here.
Run each instance inside a Docker container. This is the right way to do it, because it is easier to deploy and manage. You can find an Elasticsearch Docker image here.
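A rough sketch of both options (paths, ports, and names are illustrative; -Des.path.conf and the ES_HEAP_SIZE variable are the 2.x-era mechanisms, so verify them against the docs for your exact version):

    # option 1: archive install, one config directory per instance; each
    # elasticsearch.yml sets its own cluster.name, http.port (9201, 9202, ...) and path.data
    ES_HEAP_SIZE=24g bin/elasticsearch -d -Des.path.conf=/opt/es/node1/config
    ES_HEAP_SIZE=24g bin/elasticsearch -d -Des.path.conf=/opt/es/node2/config

    # option 2: one Docker container per instance, each mapped to its own host port
    docker run -d --name es1 -p 9201:9200 elasticsearch:2.2
    docker run -d --name es2 -p 9202:9200 elasticsearch:2.2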
