How to transfer data from local file system (linux) to a Hadoop Cluster made on Google Cloud Platform - linux

I am a beginner in Hadoop. I made a Hadoop cluster (one master and two slaves) on Google Cloud Platform.
I accessed the master of the cluster from my local Linux machine using: ssh -i key key@public_ip_of_master
Then I ran sudo su - on the master because Hadoop functions only appear while being root.
Then I started HDFS using start-dfs.sh and start-all.sh
Now the problem is that I want to transfer files from the local Linux file system to the Hadoop cluster and vice versa using the following command (entered on the master, as root):
root@master:~# hdfs dfs -put /home/abas1/Desktop/chromFa.tar.gz /Hadoop_File
The problem is that the local path which is: /home/abas1/Desktop/chromFa.tar.gz is never recognized and I can not seem to know what to do.
I am sure I am missing something trivial but I do not know what it is. I have to use either -copyFromLocal or -put.

local path is never recognized
That is not a Hadoop problem, then. You are on the master node (over SSH), as the root user. There is a /root folder with files, and probably no /home/abas1.
In other words, run ls -l /home on the master, and you will see which local files are actually available there.
To upload files from that terminal session, you first need to SCP them to the master from your own machine.
Exit the SSH session
scp -i key /home/abas1/Desktop/chromFa.tar.gz root@master-ip:/tmp
ssh -i key root@master-ip
Then you can do this
hdfs dfs -mkdir /Hadoop_File
ls -l /tmp | grep chromFa # for example, to check the file arrived
hdfs dfs -put /tmp/chromFa.tar.gz /Hadoop_File/
Hadoop functions only appear while being root.
Please do not use root for interacting with Hadoop services. Create unique user accounts for HDFS, YARN, Zookeeper, etc. with restricted permissions, like you would for any other Unix process.
Using Dataproc will do this for you. You can still SSH to it, so you should really consider using it instead of a manually built GCE cluster.
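The copy-then-upload sequence described in this answer can be sketched as a small Python helper. It only builds the command lines so they can be reviewed before anything runs. The function name build_upload_commands and the /tmp staging directory default are illustrative assumptions; the key, host, and paths in the usage example are the placeholders from the question.

```python
import posixpath
import shlex


def build_upload_commands(local_file, key, user, host, hdfs_dir, staging="/tmp"):
    """Return (scp, mkdir, put) argument lists; nothing is executed here."""
    # Path the file will have on the master node after the scp step.
    staged = posixpath.join(staging, posixpath.basename(local_file))
    # Step 1 (run on the local machine): copy the file to the master node.
    scp = ["scp", "-i", key, local_file, f"{user}@{host}:{staging}"]
    # Step 2 (run on the master): make sure the HDFS target directory exists.
    mkdir = ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]
    # Step 3 (run on the master): upload the staged copy into HDFS.
    put = ["hdfs", "dfs", "-put", staged, hdfs_dir]
    return scp, mkdir, put


if __name__ == "__main__":
    # Placeholder values taken from the question; adjust to your own setup.
    for cmd in build_upload_commands(
        "/home/abas1/Desktop/chromFa.tar.gz", "key", "root", "master-ip", "/Hadoop_File"
    ):
        print(shlex.join(cmd))
```

The first command is meant to be run from the local machine and the other two from the SSH session on the master, which is exactly the split the question was missing.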

Related

Unable to write over an SSHFS mounted folder with SLURM jobs

I have the following problem and I am not sure what is happening. I'll explain briefly.
I work on a cluster with several nodes which are managed via slurm. All these nodes share the same disk storage (I think it uses NFS4). Since this storage is shared by a lot of users, each user has a limited disk quota.
I use slurm to launch python scripts that run some code and save the output to a csv file and a folder.
Since I need more disk space than I am assigned, I mount a remote folder via sshfs from a machine where I have plenty of disk. Then I configure the python script to write to that folder via an environment variable named EXPERIMENT_PATH. The script example is the following:
Python script:
import os
root_experiment_dir = os.getenv('EXPERIMENT_PATH')
if root_experiment_dir is None:
    root_experiment_dir = os.path.expanduser("./")
print(root_experiment_dir)
experiment_dir = os.path.join(root_experiment_dir, 'exp_dir')
## create experiment directory
try:
    os.makedirs(experiment_dir)
except:
    pass
file_results_dir = os.path.join(root_experiment_dir, 'exp_dir', 'results.csv')
if os.path.isfile(file_results_dir):
    f_results = open(file_results_dir, 'a')
else:
    f_results = open(file_results_dir, 'w')
If I launch this python script directly, I can see the created folder and file on my remote machine, whose folder has been mounted via sshfs. However, if I use sbatch to launch this script via the following bash file:
export EXPERIMENT_PATH="/tmp/remote_mount_point/"
sbatch -A server -p queue2 --ntasks=1 --cpus-per-task=1 --time=5-0:0:0 --job-name="HOLA" --output='./prueba.txt' ./run_argv.sh "python foo.py"
where run_argv.sh is a simple bash script that takes its arguments and runs them as a command, i.e. the file contains:
#!/bin/bash
$*
then I observe that nothing has been written on my remote machine. I can check the mounted folder in /tmp/remote_mount_point/ and nothing appears there either. Only when I unmount this remote folder using fusermount -u /tmp/remote_mount_point/ can I see that, on the machine that ran the job, a folder named /tmp/remote_mount_point/ has been created with the file inside it, but obviously nothing appears on the remote machine.
In other words, it seems that when launching through slurm, the script bypasses the sshfs mounted folder and creates a new one on the host machine, which is only visible once the remote folder is unmounted.
Does anyone know why this happens and how to fix it? I emphasize that this only happens if I launch everything through the slurm manager. If not, everything works.
I shall emphasize that all the nodes in the cluster share the same disk space so I guess that the mounted folder is visible from all machines.
Thanks in advance.
I shall emphasize that all the nodes in the cluster share the same disk space so I guess that the mounted folder is visible from all machines.
This is not how it works, unfortunately. Put simply, a mount point nested inside another mount (here SSHFS inside NFS) is "stored" in the memory of the machine that mounted it and not in the "parent" filesystem (here NFS), so the compute nodes have no idea there is an SSHFS mount on the login node.
For your setup to work, you would have to create the SSHFS mount inside your submission script, which can create a whole lot of new problems (for instance regarding authentication).
But before you dive into that, you should probably enquire whether the cluster has another filesystem ("scratch", "work", etc.) where you could temporarily store more data than the quota on your home filesystem allows.
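Whichever route is taken, the job script can at least fail fast instead of silently writing to the compute node's local disk. The following is a minimal sketch, assuming the EXPERIMENT_PATH variable from the question; the helper name require_mount is hypothetical. It uses os.path.ismount, which reports whether a path is a mount point on the node where the code is running.

```python
import os
import sys


def require_mount(path):
    """Return path if it is a mount point on this node, otherwise raise."""
    if not os.path.ismount(path):
        raise RuntimeError(
            f"{path!r} is not a mount point on this node; "
            "refusing to write to the local disk"
        )
    return path


if __name__ == "__main__":
    target = os.getenv("EXPERIMENT_PATH")
    if target is not None:
        try:
            require_mount(target)
        except RuntimeError as err:
            # Abort the job with a clear message instead of writing locally.
            sys.exit(str(err))
    print("writing results under", target or "./")
```

Run through sbatch on a compute node that lacks the SSHFS mount, this check would stop the job with an explicit error rather than creating a plain local /tmp/remote_mount_point/ directory.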

HDFS + create symbolic link between HDFS folder and local filesystem folder

I searched on Google but did not find an answer.
Is it possible to create a link between an HDFS folder and a local folder?
Example:
we want to create a link between folder_1 in HDFS and the local folder /home/hdfs_mirror
HDFS folder:
su hdfs
$ hdfs dfs -ls /hdfs_home/folder_1
Linux local folder:
ls /home/hdfs_mirror
I do not think it is possible.
This is because we are talking about two different file systems (HDFS and the local file system).
If you want to keep a local data directory in sync with an HDFS directory, you need to use a tool such as Apache Flume.

Store file from Spark on Windows to HDFS

I have installed Hadoop/YARN in a linux VM on my local windows machine. On the same windows machine (not in VM) I have installed Spark. When running spark on windows, I can read files stored in HDFS (in linux VM).
val lines = sc.textFile("hdfs://MyIP:9000/Data/sample.txt")
While saving a file to HDFS using saveAsTextFile("hdfs://MyIP:9000/Data/Output"), I am getting the error below:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=LocalWindowsUser, access=WRITE,
inode="/Data":hadoop:supergroup:drwxr-xr-x.
I guess, it's because Windows and Linux users are different and windows user doesn't have permission to write files in linux.
What is the correct way to store files from windows to HDFS (linux VM) using spark?
Your problem is that the username you are using to access HDFS does not have write permission.
The directory /Data is owned by hadoop:supergroup and has the permissions rwxr-xr-x, which translates to mode 755. Your username is LocalWindowsUser, which is neither the owner nor (presumably) in the group, so it only gets read and execute permissions, not write.
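To make the permission reasoning concrete, here is a small sketch that decodes the permission string from the error message and checks which class a user falls into. The function names are hypothetical, and the check is deliberately simplified: it ignores the HDFS superuser and ACLs, and it assumes the caller supplies the user's group list.

```python
def mode_to_octal(perms):
    """Convert an rwx string such as 'drwxr-xr-x' to octal digits like '755'."""
    bits = perms[-9:]  # drop the leading type flag ('d' or '-') if present
    digits = []
    for i in range(0, 9, 3):
        r, w, x = bits[i:i + 3]
        value = (4 if r == "r" else 0) + (2 if w == "w" else 0) + (1 if x == "x" else 0)
        digits.append(str(value))
    return "".join(digits)


def can_write(user, user_groups, owner, group, perms):
    """Pick the owner, group, or other triple for this user and test its 'w' bit."""
    bits = perms[-9:]
    if user == owner:
        triple = bits[0:3]
    elif group in user_groups:
        triple = bits[3:6]
    else:
        triple = bits[6:9]
    return triple[1] == "w"


if __name__ == "__main__":
    # Values from the AccessControlException in the question.
    print(mode_to_octal("drwxr-xr-x"))
    print(can_write("LocalWindowsUser", [], "hadoop", "supergroup", "drwxr-xr-x"))
    print(can_write("hadoop", [], "hadoop", "supergroup", "drwxr-xr-x"))
```

With the values from the question, only the owner hadoop gets a write bit, which is why either loosening the mode or acting as hadoop resolves the error.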
Possible solutions:
Solution 1:
Since this is a local system under your full control, change the permissions to allow everyone access. Execute this command while inside the VM as the user hadoop:
hdfs dfs -chmod -R 777 /Data
Solution 2:
Create an environment variable in Windows and set the username:
set HADOOP_USER_NAME=hadoop
The username really should be the user hdfs. Try that also if necessary.

How to use start-all.sh to start standalone Worker that uses different SPARK_HOME (than Master)?

I have installed Spark 2.1.1 on 2 machines but in different locations, i.e. on one machine it is installed somewhere on an NTFS drive and on the other it is installed on an ext4 drive. I am trying to start a cluster in standalone mode with 1 master and 2 slaves: 1 master and 1 slave on one machine, and 1 slave on the other machine.
When I try to start this cluster via the start-all.sh script on the master node, I get the following error:
192.168.1.154: bash: line 0: cd: /home/<somePath>/spark-2.1.1-bin-hadoop2.7: No such file or directory
I have set proper SPARK_HOME in respective bashrc files. Below is my slave file (in the 1 master + 1 slave machine)
localhost
192.168.1.154
I can log in remotely to the slave machine via ssh. I am able to run a Spark cluster individually on each machine.
My understanding is that when I try to start a slave remotely from my master machine via the start-all.sh script, it tries to go to the location where Spark is installed on the master node, but since Spark is installed in a different location on the slave node, it fails. Can anyone please tell me how I can rectify this problem?
In start-all.sh you can find the following:
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
# Load the Spark configuration
. "${SPARK_HOME}/sbin/spark-config.sh"
# Start Master
"${SPARK_HOME}/sbin"/start-master.sh
# Start Workers
"${SPARK_HOME}/sbin"/start-slaves.sh
which has nothing to do with the Spark installation on the standalone master. start-all.sh simply uses whatever SPARK_HOME you've defined globally and uses it across all nodes in the cluster, for standalone master and workers.
In your case, I'd recommend writing a custom startup script that would start the standalone Master and workers per respective SPARK_HOME env vars.
start-slaves.sh simply runs the following on each worker:
cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
So there is not much magic going on: it just SSHes to every node and executes that command line.
I think I'd even use Ansible for this.
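A custom startup script along those lines could look like the following sketch, which builds one ssh command per worker using that worker's own Spark directory. The function name, the host-to-path mapping, the /opt/spark path, and the master URL are all hypothetical placeholders; only sbin/start-slave.sh taking the master URL as its argument is taken from the snippet above.

```python
import shlex


def worker_start_commands(master_url, spark_homes):
    """Build one ssh argument list per worker, each using that host's own SPARK_HOME."""
    commands = []
    for host, spark_home in spark_homes.items():
        script = spark_home + "/sbin/start-slave.sh"
        # Same shape as the line in start-slaves.sh, but with a per-host path.
        remote = (
            f"cd {shlex.quote(spark_home)} && "
            f"{shlex.quote(script)} {shlex.quote(master_url)}"
        )
        commands.append(["ssh", host, remote])
    return commands


if __name__ == "__main__":
    # Hypothetical values: replace with each machine's real install path.
    homes = {"192.168.1.154": "/opt/spark"}
    for cmd in worker_start_commands("spark://192.168.1.10:7077", homes):
        print(shlex.join(cmd))
```

Each printed command can be run by hand or passed to subprocess.run once the paths have been checked on every worker.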
You should check your ~/.bashrc. You can see my bashrc below:
export JAVA_HOME=/usr/local/java/jdk1.8.0_121
export JRE_HOME=$JAVA_HOME/jre
export SCALA_HOME=/usr/local/src/scala/scala-2.12.1
export SPARK_HOME=/usr/local/spark/2.1.0
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
Finally, you have to update your bashrc environment:
source ~/.bashrc
In my case I had 2 Macs and 1 PC/Linux machine as workers. 1 of the Macs acted as a master as well.
On the Macs, I had installed spark under /Users/<user>/spark and set my $SPARK_HOME to this path.
On the Linux machine (Ubuntu), I had set up the Spark directory under /home/<user>/spark. Running start-all.sh on my Spark master machine (1 of the Macs) caused an error on the Linux worker:
192.168.1.33: bash: line 1: cd: /Users/<user>/spark: No such file or directory
192.168.1.33: bash: line 1: /Users/<user>/spark/sbin/start-worker.sh: No such file or directory
In order to fix the path problem, I mimicked the Mac layout by creating a symbolic link on the Linux machine that points /Users at /home. This tricked Spark into working on that Linux worker just as it does on the Macs:
cd /
sudo ln -s home Users
This is probably not the most elegant solution but it meant I did not need to maintain my own version of start-all.sh and its associated subscripts.

Command to store File on HDFS

Introduction
A Hadoop NameNode and three DataNodes have been installed and are running. The next step is to provide a File to HDFS. The following commands have been executed:
hadoop fs -copyFromLocal ubuntu-14.04-desktop-amd64.iso
copyFromLocal: `.': No such file or directory
and
hadoop fs -put ubuntu-14.04-desktop-amd64.iso
put: `.': No such file or directory
without success.
Question
Which command needs to be issued in order to store a file on HDFS?
If no destination path is provided, hadoop will try to copy the file to your HDFS home directory. In other words, if you are logged in as utrecht, it will try to copy ubuntu-14.04-desktop-amd64.iso to /user/utrecht.
However, this folder does not exist by default (you can normally check the contents of HDFS via a web browser).
To make your command work, you have two choices:
copy it elsewhere (/ works, but putting everything there may lead to complications in the future)
create the directory you want with hdfs dfs -mkdir /yourFolderPath
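The resolution rule described above can be sketched in a few lines. This is an illustration, not Hadoop's own code: the function names are hypothetical, and it assumes the home directory is /user/<username> as in the answer. It builds the mkdir and put commands without running them.

```python
import posixpath


def hdfs_home(user):
    """HDFS home directory, assuming the /user/<username> layout from the answer."""
    return posixpath.join("/user", user)


def resolve_destination(user, dest=None):
    """Resolve a missing or relative destination against the HDFS home directory."""
    if not dest:
        return hdfs_home(user)
    # An absolute dest replaces the home prefix; a relative one is appended to it.
    return posixpath.normpath(posixpath.join(hdfs_home(user), dest))


def prepare_and_put(user, local_file, dest=None):
    """Return the mkdir and put argument lists needed to store local_file."""
    target = resolve_destination(user, dest)
    return [
        ["hdfs", "dfs", "-mkdir", "-p", target],
        ["hdfs", "dfs", "-put", local_file, target],
    ]


if __name__ == "__main__":
    for cmd in prepare_and_put("utrecht", "ubuntu-14.04-desktop-amd64.iso"):
        print(" ".join(cmd))
```

Creating the target with -mkdir -p first means the subsequent -put no longer fails with "No such file or directory" when the home directory is missing.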
