How to use start-all.sh to start standalone Worker that uses different SPARK_HOME (than Master)? - apache-spark

I have installed Spark 2.1.1 on 2 machines, but in different relative locations, i.e. on one machine it is installed on an NTFS drive and on the other on an ext4 drive. I am trying to start a cluster in standalone mode with 2 slaves and a master, having 1 master and 1 slave on one machine and 1 slave on the other machine.
When I try to start this cluster via the start-all.sh script on the master node, I get the following error:
192.168.1.154: bash: line 0: cd: /home/<somePath>/spark-2.1.1-bin-hadoop2.7: No such file or directory
I have set the proper SPARK_HOME in the respective .bashrc files. Below is my slaves file (on the 1 master + 1 slave machine):
localhost
192.168.1.154
I can remotely log in to the other slave machine via ssh, and I am able to run a Spark cluster individually on each machine.
My understanding is that when I try to remotely start a slave from my master machine via the start-all.sh script, it tries to go to the location where Spark is installed on the master node; since Spark is installed at a different location on the slave node, it fails. Can anyone please tell me how I can rectify this problem?

In start-all.sh you can find the following:
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
# Load the Spark configuration
. "${SPARK_HOME}/sbin/spark-config.sh"
# Start Master
"${SPARK_HOME}/sbin"/start-master.sh
# Start Workers
"${SPARK_HOME}/sbin"/start-slaves.sh
which has nothing to do with the Spark installation on the standalone master. start-all.sh simply takes whatever SPARK_HOME you've defined and uses it across all nodes in the cluster, for the standalone Master and workers alike.
In your case, I'd recommend writing a custom startup script that starts the standalone Master and the workers using their respective SPARK_HOME environment variables.
start-slaves.sh (see the Spark sources) simply does the following:
cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
So there is not much magic going on; it just ssh'es to every node and executes that command line.
I think I'd even use Ansible for this.
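For example, a minimal sketch of such a custom script (the worker addresses and SPARK_HOME paths below are placeholders for your own layout, and it assumes passwordless ssh to every worker):
#!/usr/bin/env bash
# Hypothetical start script: start the Master locally, then start every worker
# over ssh using that worker's own Spark installation path.
MASTER_HOST=192.168.1.10              # assumption: address of the master machine
MASTER_SPARK_HOME=/opt/spark          # assumption: Spark location on the master

"$MASTER_SPARK_HOME/sbin/start-master.sh"

# host:spark_home pairs, one per worker -- adjust to your installations
WORKERS="192.168.1.10:/opt/spark 192.168.1.154:/home/user/spark-2.1.1-bin-hadoop2.7"

for entry in $WORKERS; do
  host=${entry%%:*}
  spark_home=${entry#*:}
  ssh "$host" "$spark_home/sbin/start-slave.sh spark://$MASTER_HOST:7077"
done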

You should check your ~/.bashrc. You can see my .bashrc below:
export JAVA_HOME=/usr/local/java/jdk1.8.0_121
export JRE_HOME=$JAVA_HOME/jre
export SCALA_HOME=/usr/local/src/scala/scala-2.12.1
export SPARK_HOME=/usr/local/spark/2.1.0
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
Finally, you have to reload your .bashrc so the environment takes effect:
source ~/.bashrc

In my case I had 2 Macs and 1 PC/Linux machine as workers, with one of the Macs also acting as the master.
On the Macs, I had installed Spark under /Users/<user>/spark and set my $SPARK_HOME to this path.
On the Linux machine (Ubuntu), I had set up the Spark directory under /home/<user>/spark. Running start-all.sh on my Spark master machine (one of the Macs) would then cause an error on the Linux worker:
192.168.1.33: bash: line 1: cd: /Users/<user>/spark: No such file or directory
192.168.1.33: bash: line 1: /Users/<user>/spark/sbin/start-worker.sh: No such file or directory
To fix the path problem, I mimicked the Mac layout by creating a symbolic link on the Linux machine pointing a "/Users" directory to the "/home" directory. This tricked Spark into working on that Linux machine/worker just as it works on the Macs:
cd /
sudo ln -s home Users
This is probably not the most elegant solution but it meant I did not need to maintain my own version of start-all.sh and its associated subscripts.

Related

Unable to write over an SSHFS mounted folder with SLURM jobs

I have the following problem and I am not sure what is happening. I'll explain briefly.
I work on a cluster with several nodes which are managed via Slurm. All these nodes share the same disk space (I think it uses NFS4). My problem is that, since this disk space is shared by a lot of users, we have a limited amount of disk space per user.
I use Slurm to launch Python scripts that run some code and save the output to a CSV file and a folder.
Since I need more space than assigned, what I do is mount a remote folder via sshfs from a machine where I have plenty of disk. Then I configure the Python script to write to that folder via an environment variable named EXPERIMENT_PATH. The example script is the following:
Python script:
import os

root_experiment_dir = os.getenv('EXPERIMENT_PATH')
if root_experiment_dir is None:
    root_experiment_dir = os.path.expanduser("./")
print(root_experiment_dir)

experiment_dir = os.path.join(root_experiment_dir, 'exp_dir')

## create experiment directory
try:
    os.makedirs(experiment_dir)
except:
    pass

file_results_dir = os.path.join(root_experiment_dir, 'exp_dir', 'results.csv')
if os.path.isfile(file_results_dir):
    f_results = open(file_results_dir, 'a')
else:
    f_results = open(file_results_dir, 'w')
If I launch this Python script directly, I can see the created folder and file on my remote machine whose folder has been mounted via sshfs. However, if I use sbatch to launch this script via the following bash file:
export EXPERIMENT_PATH="/tmp/remote_mount_point/"
sbatch -A server -p queue2 --ntasks=1 --cpus-per-task=1 --time=5-0:0:0 --job-name="HOLA" --output='./prueba.txt' ./run_argv.sh "python foo.py"
where run_argv.sh is a simple bash script that takes its arguments and runs them, i.e. that file contains:
#!/bin/bash
$*
then I observe that nothing has been written on my remote machine. I can check the mounted folder in /tmp/remote_mount_point/ and nothing appears there either. Only when I unmount this remote folder using fusermount -u /tmp/remote_mount_point/ can I see that a folder named /tmp/remote_mount_point/ has been created on the machine running the job, with the file created inside it, but obviously nothing appears on the remote machine.
In other words, it seems like launching through Slurm bypasses the sshfs-mounted folder and creates a new one on the host machine, which is only visible once the remote folder is unmounted.
Does anyone know why this happens and how to fix it? I emphasize that this only happens if I launch everything through the Slurm manager; otherwise everything works.
I shall emphasize that all the nodes in the cluster share the same disk space so I guess that the mounted folder is visible from all machines.
Thanks in advance.
I shall emphasize that all the nodes in the cluster share the same disk space so I guess that the mounted folder is visible from all machines.
This is not how it works, unfortunately. To put it simply: you could say that mount points inside mount points (here SSHFS inside NFS) are "stored" in memory and not in the "parent" filesystem (here NFS), so the compute nodes have no idea there is an SSHFS mount on the login node.
For your setup to work, you should create the SSHFS mount point inside your submission script (which can create a whole lot of new problems, for instance regarding authentication, etc.).
But before you dive into that, you should probably inquire whether the cluster has another filesystem ("scratch", "work", etc.) where you could temporarily store larger data than what the quota allows in your home filesystem.
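If you do go the SSHFS route, a rough sketch of what such a submission script could look like (the remote host and path are placeholders, and it assumes key-based ssh from the compute nodes to the remote machine):
#!/bin/bash
#SBATCH -A server
#SBATCH -p queue2
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=./prueba.txt

# Mount the remote folder on the compute node itself, so the job sees it.
MOUNT_POINT=/tmp/remote_mount_point
REMOTE=user@remote-host:/path/with/space   # placeholder remote location

mkdir -p "$MOUNT_POINT"
sshfs "$REMOTE" "$MOUNT_POINT"

export EXPERIMENT_PATH="$MOUNT_POINT"
python foo.py

# Unmount so the mount does not linger on the compute node after the job.
fusermount -u "$MOUNT_POINT"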

How to transfer data from local file system (linux) to a Hadoop Cluster made on Google Cloud Platform

I am a beginner in Hadoop. I made a Hadoop cluster (one master and two slaves) on Google Cloud Platform.
I accessed the master of the cluster from the local file system (Linux) using: ssh -i key key@public_ip_of_master
Then I did sudo su - inside the cluster because the Hadoop functions only appear while being root.
Then I started HDFS using start-dfs.sh and start-all.sh.
Now the problem is that I want to transfer files from the local Linux file system to the Hadoop cluster and vice versa using the following command (running the command inside the cluster while being root):
root@master:~# hdfs dfs -put /home/abas1/Desktop/chromFa.tar.gz /Hadoop_File
The problem is that the local path /home/abas1/Desktop/chromFa.tar.gz is never recognized, and I cannot figure out what to do.
I am sure I am missing something trivial but I do not know what it is. I have to use either -copyFromLocal or -put.
local path is never recognized
That is not a Hadoop problem, then. You are on the master node (over SSH), as the root user. There is a /root folder with files, and probably no /home/abas1.
In other words, run ls -l /home, and you will see what local files are available.
To get files onto the master server so you can upload them from that terminal session, you will want to scp them there first from the other machine.
Exit the SSH session
scp -i key /home/abas1/Desktop/chromFa.tar.gz root@master-ip:/tmp
ssh -i key root@master-ip
Then you can do this
hdfs dfs -mkdir /Hadoop_File
ls -l /tmp | grep chromFa # for example, to check file
hdfs dfs -put /tmp/chromFa.tar.gz /Hadoop_File/
Hadoop functions only appear while being root.
Please do not use root for interacting with Hadoop services. Create unique user accounts for HDFS, YARN, Zookeeper, etc. with restricted permissions like you would for any other Unix process.
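For example, a minimal sketch of what that could look like on each node (the user and group names here are just common conventions, not something your cluster already has):
# run as root on every node
groupadd hadoop
useradd -g hadoop -m -s /bin/bash hdfs
useradd -g hadoop -m -s /bin/bash yarn
# then give each account ownership of only the directories its daemons need, e.g.
# chown -R hdfs:hadoop /path/to/dfs/name /path/to/dfs/data   # placeholder paths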
Using Dataproc will do this for you, and you can still SSH to it, so you should really consider using it instead of a manual GCE cluster.

Java HotSpot(TM) Server VM warning in Cassandra

I am getting the following error while running Cassandra.
$ sudo service cassandra start
$ cassandra
Java HotSpot(TM) Server VM warning: Cannot open file /var/log/cassandra/gc.log due to Permission denied.
I guess you have installed Cassandra from the repositories. Cassandra needs directories to store its data and, in your case, it cannot create those directories because of permission problems. You have three ways:
Become the root user using the command sudo su and run the command cassandra as the root user. You can issue the command sudo systemctl enable cassandra.service to run Cassandra automatically at startup.
Change the following settings in the cassandra.yaml file to point somewhere your user has permission, like your home directory (see the sketch below this list):
data_file_directories
commitlog_directory
saved_caches_directory
Add the line export CASSANDRA_HOME=path/to/cassandra to the user's .bashrc file, and after that run source .bashrc to load it. This lets Cassandra know its install directory, and it creates the necessary folders within it.
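For option 2, a rough sketch of what that could look like (the directories under your home are arbitrary examples; the gc.log error itself comes from /var/log/cassandra not being writable by your user):
# create writable directories for Cassandra under your home directory
mkdir -p ~/cassandra/data ~/cassandra/commitlog ~/cassandra/saved_caches ~/cassandra/logs
# then point cassandra.yaml at them, e.g.:
#   data_file_directories:
#       - /home/<you>/cassandra/data
#   commitlog_directory: /home/<you>/cassandra/commitlog
#   saved_caches_directory: /home/<you>/cassandra/saved_caches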

How can we set nodetool and cqlsh to be run from anywhere and by any user on linux server

I am trying to set up environment variables so that any user on a particular server can run commands like nodetool or cqlsh from anywhere in the Linux file system. The effort of traversing to the bin directory every time should be saved.
How can we achieve this? My DSE 4.8 is a tarball install.
nodetool is usually available to any user that has execution privileges on your Linux boxes.
For cqlsh, you can set any configuration inside the cqlshrc file (usually found in $HOME/.cassandra/cqlshrc); we have used it to enable client-node encryption, but it has more configurable options.
To set up the environment variables, just follow these steps as the root user:
# vi /etc/profile.d/cassandra.sh
Add the following lines to the cassandra.sh file:
export CASSANDRA_HOME=/opt/apache-cassandra-3.0.8
export CASSANDRA_CONF_DIR=/opt/apache-cassandra-3.0.8/conf
Here /opt/ is the directory where I've extracted my apache-cassandra-3.0.8-bin.tar.gz tarball.
After adding those lines to cassandra.sh, save and exit. Then:
# source /etc/profile.d/cassandra.sh
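Note that exporting CASSANDRA_HOME and CASSANDRA_CONF_DIR alone does not put nodetool or cqlsh on every user's PATH; to actually run them from anywhere, you would also add the bin directory in the same profile script (a small addition, assuming the tarball layout above):
# also in /etc/profile.d/cassandra.sh, after the two exports
export PATH=$PATH:$CASSANDRA_HOME/bin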

Starting Hadoop without ssh'ing to localhost

I have a very tricky situation on my hands. I'm installing Hadoop on a few nodes which run Ubuntu 12.04, and our IT guys have created a user "hadoop" for me to use on all the nodes. The issue with this user is that it does not allow ssh to localhost because of some security constraints. So, I'm not able to start the Hadoop daemons at all.
I can connect to the machine itself using "ssh hadoop@hadoops_address" but not using the loopback address. I also cannot make any changes to /etc/hosts. Is there a way I can tell Hadoop to ssh to itself using "ssh hadoop@hadoops_address" instead of "ssh hadoop@localhost"?
Hadoop reads the hostnames from the "masters" and "slaves" files which are present inside the conf dir;
edit the files and change the value from localhost to hadoops_address.
This should fix your problem.
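For example (a sketch; hadoops_address stands in for whatever hostname "ssh hadoop@hadoops_address" uses for you, and conf/ is where these files live in the Hadoop 1.x layout):
# in $HADOOP_HOME/conf
echo "hadoops_address" > masters
echo "hadoops_address" > slaves
# the start scripts will now ssh to hadoops_address instead of localhost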
