I have installed Hadoop/YARN in a Linux VM on my local Windows machine. On the same Windows machine (not in the VM) I have installed Spark. When running Spark on Windows, I can read files stored in HDFS (in the Linux VM).
val lines = sc.textFile("hdfs://MyIP:9000/Data/sample.txt")
While saving a file to HDFS using saveAsTextFile("hdfs://MyIP:9000/Data/Output"), I am getting the error below:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=LocalWindowsUser, access=WRITE,
inode="/Data":hadoop:supergroup:drwxr-xr-x.
I guess it's because the Windows and Linux users are different, and the Windows user doesn't have permission to write files in Linux.
What is the correct way to store files from Windows to HDFS (in the Linux VM) using Spark?
Your problem is that the user you are writing to HDFS as does not have write permission.
The directory /Data has the permissions rwxr-xr-x (mode 755) and is owned by hadoop:supergroup. Your user, LocalWindowsUser, is neither the owner nor a member of that group, so it falls under "other" and only gets read and execute permission, not write.
Possible solutions:
Solution 1:
Since this is a local system under your full control, change the permissions to allow everyone access. Execute this command while inside the VM as the user hadoop:
hdfs dfs -chmod -R 777 /Data
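If you would rather not open the directory to everyone, a narrower alternative (a sketch, assuming the Windows username is LocalWindowsUser as shown in the error, again run inside the VM as the user hadoop) is to hand ownership of the target directory to that user:
hdfs dfs -chown -R LocalWindowsUser /Data
hdfs dfs -ls /   # verify: the owner column for /Data should now show LocalWindowsUser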
Solution 2:
Create an environment variable in Windows and set the username:
set HADOOP_USER_NAME=hadoop
On some distributions the HDFS superuser is hdfs rather than hadoop; try that as well if necessary.
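For example, on the Windows side the whole flow might look like this (only a sketch, assuming you launch spark-shell or spark-submit from the same command prompt and the NameNode is reachable at MyIP:9000):
REM Windows command prompt
set HADOOP_USER_NAME=hadoop
spark-shell
Any saveAsTextFile("hdfs://MyIP:9000/Data/Output") call made in that session is then executed against HDFS as the user hadoop.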
Related
I am a beginner in Hadoop. I made a Hadoop cluster (one master and two slaves) on Google Cloud Platform.
I accessed the master of the cluster from the local (Linux) file system with: ssh -i key key@public_ip_of_master
Then I did sudo su - inside the cluster, because the Hadoop commands only appear while being root.
Then I started HDFS using start-dfs.sh and start-all.sh.
Now the problem is that I want to transfer files from the local Linux file system to the Hadoop cluster and vice versa using the following command (running the command inside the cluster as root):
root@master:~# hdfs dfs -put /home/abas1/Desktop/chromFa.tar.gz /Hadoop_File
The problem is that the local path /home/abas1/Desktop/chromFa.tar.gz is never recognized, and I cannot figure out what to do.
I am sure I am missing something trivial but I do not know what it is. I have to use either -copyFromLocal or -put.
local path is never recognized
That is not a Hadoop problem, then. You are on the master node (over SSH), as the root user. There is a /root folder with files, and probably no /home/abas1.
In other words, run ls -l /home and you will see which local files are actually available.
To have files on the master server that you can upload from that terminal session, you first need to scp them there from the machine that actually has them.
Exit the SSH session
scp -i key /home/abas1/Desktop/chromFa.tar.gz root@master-ip:/tmp
ssh -i key root@master-ip
Then you can do this
hdfs dfs -mkdir /Hadoop_File
ls -l /tmp | grep chromFa # for example, to check file
hdfs dfs -put /tmp/chromFa.tar.gz /Hadoop_File/
the Hadoop commands only appear while being root.
Please do not use root for interacting with Hadoop services. Create unique user accounts for HDFS, YARN, Zookeeper, etc. with restricted permissions like you would for any other Unix process.
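For example, a minimal sketch of a dedicated account on the master (the names hduser and the group hadoop are assumptions; adjust them to your install, and make sure the Hadoop binaries are on that user's PATH):
sudo useradd -m hduser            # dedicated account for running Hadoop jobs
sudo usermod -aG hadoop hduser    # assumes a 'hadoop' group owns the Hadoop directories
sudo su - hduser
hdfs dfs -put /tmp/chromFa.tar.gz /Hadoop_File/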
Using Dataproc will do this for you, and you can still SSH into it, so you should really consider using it instead of a manually managed GCE cluster.
I am trying to copy a CSV file from my local file system to Hadoop, but I am not able to do it successfully. I am not sure which permissions I need to change. As I understand it, the hdfs superuser does not have access to /home/naya/dataFiles/BlackFriday.csv.
hdfs dfs -put /home/naya/dataFiles/BlackFriday.csv /tmp
# Error: put: Permission denied: user=naya, access=WRITE, inode="/tmp":hdfs:supergroup:drwxr-xr-x
sudo -u hdfs hdfs dfs -put /home/naya/dataFiles/BlackFriday.csv /tmp
# Error: put: `/home/naya/dataFiles/BlackFriday.csv': No such file or directory
Any help is highly appreciated. I want to do it via the command-line utility. I can do it via Cloudera Manager from the Hadoop side, but I want to understand what's happening behind the commands.
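For what it's worth, the two errors come from two different filesystems: the first is HDFS refusing the write because naya has no write access to the HDFS /tmp directory (owned by hdfs:supergroup, mode 755); the second is the local Linux filesystem, where the hdfs user typically cannot read inside /home/naya. A hedged sketch of the usual fix is to open the HDFS /tmp as the superuser and then run the put as naya, who can read the local file:
sudo -u hdfs hdfs dfs -chmod 1777 /tmp    # sticky and world-writable, like a local /tmp
hdfs dfs -put /home/naya/dataFiles/BlackFriday.csv /tmp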
I have installed Hadoop on CentOS 7. A daemon service written in Python is trying to make a directory in HDFS, but it is getting the permission error below:
mkdir: Permission denied: user=root, access=WRITE, inode="/rep_data/store/data/":hadoop:supergroup:drwxr-xr-x
It looks like my service is running under the root account.
So I would like to know how to give the root user permission to make directories and write files.
If you are trying to create a directory directly under the HDFS root, i.e. /, you may face this type of issue. You can create directories under your HDFS home without any issues.
To create a directory under the root, execute a command like the following as the HDFS superuser:
sudo -u hdfs hdfs dfs -mkdir /directory/name
To create a directory in your HDFS home, execute the command below:
hdfs dfs -mkdir /user/user_home/directory/name
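For the error above, that usually means first creating root's HDFS home (or handing over the existing path) as the superuser; a sketch, assuming the superuser is hadoop, as the inode owner in the error suggests:
sudo -u hadoop hdfs dfs -mkdir -p /user/root
sudo -u hadoop hdfs dfs -chown root:root /user/root
# or, to keep the original path, hand it over to root instead:
sudo -u hadoop hdfs dfs -chown -R root /rep_data/store/data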
This is probably an issue because you are not the HDFS superuser.
A workaround is to enable Access Control Lists (ACLs) in HDFS and give permissions to your user.
To enable support for ACLs, set dfs.namenode.acls.enabled to true in the NameNode configuration.
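Once ACLs are enabled and the NameNode restarted, granting access looks something like this (a sketch, assuming the HDFS superuser is hadoop, the owner shown in the error):
sudo -u hadoop hdfs dfs -setfacl -m user:root:rwx /rep_data/store/data
sudo -u hadoop hdfs dfs -getfacl /rep_data/store/data   # verify the new ACL entry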
For more info check: link
I am running Linux (Lubuntu 12.10) on an older machine with a 20 GB hard drive. I have a 1 TB external hard drive with an NTFS partition on it. On that partition, there is a www directory that holds my web content. It is auto-mounted at startup as /media/t515/NTFS.
I would like to change the apache document directory from /var/www to /media/t515/NTFS/www.
I need to keep the partition as an NTFS partition, because I use the same hard drive on a different machine running WAMP.
I changed the file "default" in /etc/apache2/sites-available to the new location and restarted the server. When I tried to go to localhost, I got the error:
403 Forbidden
You don't have permission to access / on this server.
I then changed the automount options in fstab to include the option "umask=0000", and then to "umask=2200", both to no avail. I still get the same error message.
I can access the NTFS partition with no problem from other applications, and when logged in as any user. But Apache seems to be unable (or unwilling) to access the partition. How do I give apache permission to use a directory on an NTFS partition?
After many, many attempts, here is the only thing that worked for me: changing the Apache configuration so that it no longer runs as www-data (the Apache user) but as my own user instead.
It is very simple to do. In my version of Apache, the two lines to change are in the /etc/apache2/envvars file (it may be a different file in another version):
export APACHE_RUN_USER=www-data
export APACHE_RUN_GROUP=www-data
I replaced www-data with my username (here, toto):
export APACHE_RUN_USER=toto
export APACHE_RUN_GROUP=toto
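After saving envvars, restart Apache so the new user takes effect (a Debian/Ubuntu-style command, which the /etc/apache2/envvars path implies):
sudo service apache2 restart   # or: sudo systemctl restart apache2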
In my experience, I've always had to remount the drive with read/write permissions. I found this:
sudo mount -t ntfs -o rw,auto,user,fmask=0022,dmask=0000 /dev/whatever /mnt/whatever
or:
For NTFS partitions, use the permissions option in fstab.
First unmount the ntfs partition.
Then edit /etc/fstab
Graphical: gksu gedit /etc/fstab
Command line: sudo -e /etc/fstab
Identify your partition UUID with blkid
sudo blkid
And add or edit a line for the ntfs partition
# change the "UUID" to your partition UUID
UUID=12102C02102CEB83 /media/windows ntfs-3g auto,users,permissions 0 0
Make a mount point (if needed)
sudo mkdir /media/windows
Now mount the partition
mount /media/windows
The options I gave you: auto will automatically mount the partition when you boot, and users allows ordinary users to mount and unmount it.
You can then use chown and chmod on the ntfs partition.
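For example (a sketch reusing the /media/windows mount point from the fstab line above; adjust the path and owner to your setup):
sudo chown -R www-data:www-data /media/windows/www   # give the Apache user ownership of the web root
sudo chmod -R 755 /media/windows/www                 # readable and traversable by everyone, writable by the owner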
Both found here: https://askubuntu.com/questions/11840/how-to-chmod-on-an-ntfs-or-fat32-partition
None of the answers above solve the issue; in fact, the problem is related to Apache itself, not to the filesystem or its permissions.
The only thing you need to do is add:
<Directory "/www/mywebdirectoryinapartitioneddisk">
Require all granted
</Directory>
This will solve the issue.
Here is the post on my blog explaining everything in detail. It could work on NTFS as well:
http://www.tbogard.com/2014/09/12/making-apache-server-to-read-a-partitioned-disk-the-definitive-solution/
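Applied to the path from the question, the sketch below shows where that directive would live; the file name 000-default.conf and the Options line are assumptions for a stock Apache 2.4 setup:
# /etc/apache2/sites-available/000-default.conf (or your own site's config)
<Directory "/media/t515/NTFS/www">
    Options Indexes FollowSymLinks
    Require all granted
</Directory>
Then reload the configuration, e.g. with sudo systemctl reload apache2.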
It's actually quite simple:
1) Create a local user on the Windows host
2) Grant appropriate NTFS permissions to that user
3) Verify access (Windows only)
... THEN ...
4) Configure your NTFS mount on Linux to use the same Windows user and group (Linux user/group is irrelevant here)
5) Configure Apache to use that Linux group (Linux user/group is essential here)
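One loose sketch of steps 4 and 5 for a locally attached drive is an ntfs-3g mount that maps everything on the partition to the user and group Apache runs as (this is a variation on the answer, not the exact Windows-user mapping it describes; the UUID is a placeholder and uid/gid 33 is the conventional Debian/Ubuntu value for www-data):
UUID=XXXXXXXXXXXXXXXX /media/t515/NTFS ntfs-3g uid=33,gid=33,umask=0027 0 0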
I am learning Hadoop and would like to try the pseudo-distributed operation.
When I try to use start-all.sh to start the Hadoop daemons, should I use a non-root user like foo-user, or should I use root?
Using root works without problems; however, I am a little concerned about it.
Using the non-root user foo-user, it complains that it doesn't have permission for these files:
/var/run/hadoop/hadoop-foo-user-namenode.pid: permission denied
/var/run/hadoop/hadoop-foo-user-tasktracker-foohost.pid: permission denied
It was trying to create these two files in the directory /var/run/hadoop
I tried vim /var/run/hadoop/testfile and couldn't save, so it turns out that foo-user doesn't have permission to write in /var/run/hadoop.
I checked the permission of /var/run/hadoop
drwxrwxr-x root hadoop 4096 Feb 8 23:42 hadoop
foo-user is in the group hadoop, so it should have write permission to /var/run/hadoop. Indeed, several other pid files are created there, like the ...jobtracker.pid one.
So should I use root for start-all.sh, or is there something wrong with the permissions? (I am really confused.)
It's not recommended to start Hadoop as root; the following is quoted from Yahoo's Hadoop tutorial:
The user who owns the Hadoop instances will need to have read and
write access to each of these directories. It is not necessary for all
users to have access to these directories. Set permissions with chmod
as appropriate. In a large-scale environment, it is recommended that
you create a user named "hadoop" on each node for the express purpose
of owning and running Hadoop tasks. For a single individual's machine,
it is perfectly acceptable to run Hadoop under your own username. It
is not recommended that you run Hadoop as root.
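Following that advice, one common workaround for the PID errors above (a sketch, assuming your start scripts source hadoop-env.sh) is to point the PID files at a directory foo-user owns:
# in hadoop-env.sh
export HADOOP_PID_DIR=/home/foo-user/hadoop-pids
# create it once as foo-user:
mkdir -p /home/foo-user/hadoop-pids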
Even though foo-user is in the group hadoop in the Linux filesystem, you still need to make sure
that foo-user is also a group member in HDFS (by default the group is called supergroup). You'll see what the group is when you run hadoop fs -ls path_to_your_data.
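With the default shell-based group mapping, HDFS resolves groups from the operating system of the NameNode host, so membership can be granted there; a sketch (supergroup is only the default name, check the group column first):
hadoop fs -ls /                          # the group column shows the group to match
sudo groupadd supergroup                 # only if the group does not exist yet on the NameNode host
sudo usermod -aG supergroup foo-user     # log in again (or restart the daemons) for the change to take effect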
The group as well as the user needs to be hadoop. Here you have:
drwxrwxr-x root hadoop 4096 Feb 8 23:42 hadoop
so change the owner from root to hadoop (currently I don't have access to any Linux machine, so I can't give the exact commands), then make sure that the hadoop user is able to create files and directories within /var/run/hadoop. I strongly recommend running it as a non-root user.
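For completeness, the exact commands would look something like this (a sketch run as root on the node in question, assuming a hadoop user exists as described above):
chown -R hadoop:hadoop /var/run/hadoop   # make the hadoop user and group the owners, as suggested
chmod -R 775 /var/run/hadoop             # owner and group can write, others can read and traverse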