Not able to copy data into HDFS with hdfs dfs commands - Linux

I'm trying to copy data into HDFS, but none of the commands are working for me.
I followed an online tutorial to install a single-node cluster. It seems to be installed correctly, because the jps command shows all six daemons. But when I try to copy a file to HDFS, it shows me an error.
The command I'm running is:
hduser@naren-Vostro-3560:~$ hdfs dfs -copyFromLocal /home/nare/Desktop/data/first.txt /app/hadoop/tmp
Error:
14/12/30 02:18:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
copyFromLocal: '/app/hadoop/tmp': No such file or directory
I have given full permissions to the input file (first.txt):
naren@naren-Vostro-3560:~$ ls -al /home/naren/Desktop/data
total 3612
drwxrwxr-x 2 naren naren 4096 Dec 30 01:40 .
drwxr-xr-x 3 naren naren 4096 Dec 30 01:40 ..
-rwxrwxrwx 1 naren naren 674570 Dec 30 01:37 first.txt
-rwxrwxrwx 1 naren naren 1423803 Dec 30 01:39 second.txt
-rwxrwxrwx 1 naren naren 1573151 Dec 30 01:40 third.txt
The permissions on the HDFS folder also look right to me:
hduser@naren-Vostro-3560:~$ ls -l /app/hadoop
total 4
drwxr-x--- 5 hduser hadoop 4096 Dec 26 01:22 tmp
I'm new to Hadoop and Linux and am stuck here.
I also tried creating a new directory with:
hduser@naren-Vostro-3560:~$ hadoop fs -mkdir -p /user/hduser/sample
and it didn't create any directory for me.
Please let me know where I'm going wrong.
Thanks in advance!
Hadoop Version: Hadoop 2.5.2
OS: Ubuntu 14.04

You need to make sure /app/hadoop is created in HDFS, not on the local filesystem. In your check for the directory you used ls -l, which looks at the local filesystem, which is separate from the HDFS namespace. Try hadoop fs -ls /app/hadoop. If it's not there, then create it with hadoop fs -mkdir. – snkherv
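As a rough sketch of that suggestion (paths taken from the question; adjust to your layout), the sequence would look something like:

hdfs dfs -ls /app/hadoop             # check whether the directory exists in HDFS
hdfs dfs -mkdir -p /app/hadoop/tmp   # create it in HDFS if it does not
hdfs dfs -copyFromLocal /home/naren/Desktop/data/first.txt /app/hadoop/tmp
hdfs dfs -ls /app/hadoop/tmp         # verify the file landed in HDFS

hadoop fs and hdfs dfs are interchangeable here; both operate on the HDFS namespace rather than the local filesystem.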

Related

tar command with -zxvf not extracting contents as expected

(Ubuntu 18.04)
I'm attempting to extract an ODBC driver from a tarball, following these instructions, with the command:
tar --directory=/opt -zxvf /SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux.tar.gz
This results in the following output:
root@08ba33ec2cfb:/# tar --directory=/opt -zxvf SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux.tar.gz
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/GoogleBigQueryODBC.did
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/docs/
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/docs/release-notes.txt
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/docs/Simba Google BigQuery ODBC Connector Install and Configuration Guide.pdf
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/docs/OEM ODBC Driver Installation Instructions.pdf
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/setup/
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/setup/simba.googlebigqueryodbc.ini
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/setup/odbc.ini
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/setup/odbcinst.ini
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/SimbaODBCDriverforGoogleBigQuery32_2.4.6.1015.tar.gz
SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015.tar.gz
The guide linked to above says:
The Simba Google BigQuery ODBC Connector files are installed in the
/opt/simba/googlebigqueryodbc directory
Not for me, but I do see:
ls -l /opt/
total 8
drwxr-xr-x 1 1000 1001 4096 Apr 26 00:39 SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux
And:
ls -l /opt/SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux/
total 52324
-rwxr-xr-x 1 1000 1001 400 Apr 26 00:39 GoogleBigQueryODBC.did
-rw-rw-rw- 1 1000 1001 26688770 Apr 26 00:39 SimbaODBCDriverforGoogleBigQuery32_2.4.6.1015.tar.gz
-rw-rw-rw- 1 1000 1001 26876705 Apr 26 00:39 SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015.tar.gz
drwxr-xr-x 1 1000 1001 4096 Apr 26 00:39 docs
drwxr-xr-x 1 1000 1001 4096 Apr 26 00:39 setup
I was specifically looking for the .so driver file. All of the above is in a Docker container. I tried extracting the tarball locally on Ubuntu 18.04 (same as my Docker container), and when I use the Ubuntu desktop GUI to extract by double-clicking the tar.gz file and then clicking 'Extract', I do indeed see the expected files.
It seems my tar command (tar --directory=/opt -zxvf /SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux.tar.gz) is not extracting the tarball as expected.
How can I extract the contents of the tarball properly? The tarball in question is the Linux one at this link.
[edit]
Adding screenshots of the contents of the tarball, per the comments. I had to click down two levels of nesting to get to the actual files:
The instructions you linked to do not match the contents of the file I found from here. The first .tar.gz contains two other .tar.gz files. I looked into the 64-bit one and it contains:
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/ErrorMessages/
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/ErrorMessages/en-US/
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/ErrorMessages/en-US/SimbaBigQueryODBCMessages.xml
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/ErrorMessages/en-US/ODBCMessages.xml
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/ErrorMessages/en-US/SQLEngineMessages.xml
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/ErrorMessages/en-US/DSMessages.xml
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/ErrorMessages/en-US/DSCURLHTTPClientMessages.xml
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/third-party-licenses.txt
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/lib/
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/lib/libgooglebigqueryodbc_sb64.so
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/lib/cacerts.pem
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/lib/EULA.txt
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/Tools/
SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015/Tools/get_refresh_token.sh
Your .so is in the lib directory. Based on the instructions, it looks like you need to extract this inner tarball (or the 32-bit one if appropriate) and rename the resulting directory, in this case SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015, to simba/googlebigqueryodbc. The tar command is doing what it is told, but the instructions are way off.
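A rough sketch of those steps, assuming you want the 64-bit driver and the /opt/simba/googlebigqueryodbc layout the guide expects:

cd /opt/SimbaODBCDriverforGoogleBigQuery_2.4.6.1015-Linux
tar -zxvf SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015.tar.gz   # unpack the inner 64-bit tarball
mkdir -p /opt/simba
mv SimbaODBCDriverforGoogleBigQuery64_2.4.6.1015 /opt/simba/googlebigqueryodbc
ls /opt/simba/googlebigqueryodbc/lib/                            # libgooglebigqueryodbc_sb64.so should be here

After that, the odbc.ini/odbcinst.ini templates from the setup directory can be pointed at that lib path.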

Zip/tar files in Linux at a specific location

I want to zip a set of directories and files on my CentOS 8 VM.
There are 3 directories and 1 file, which I want to zip in such a way that after unzipping, only the env.conf file is moved to /etc/env.txt and the remaining directories are extracted at the current location.
Is there any way to achieve this?
drwxr-xr-x. 9 root root 114 Feb 25 12:40 config
-rw-r--r--. 1 root root 340 Feb 25 09:01 env.conf
drwxr-xr-x. 9 root root 4096 Feb 28 05:11 platform
drwxr-xr-x. 2 root root 135 Feb 28 07:49 install
I don't think this is possible; in fact, it would be considered a vulnerability if you could do that.
Imagine you download a zip file from some website, unzip it into a temp folder, and it registers itself as a service by writing a file somewhere in /etc and gets control over your PC.
Example: zip-slip
You could, however, create a one-liner that extracts the archive and then moves the file wherever you want, like this:
unzip <filename> && mv env.conf /etc/env.txt
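A slightly fuller sketch, assuming the archive is named bundle.zip (a hypothetical name) and is built from the listing above; moving the file into /etc will usually need root:

zip -r bundle.zip config env.conf platform install   # create the archive from the current directory
unzip bundle.zip && sudo mv env.conf /etc/env.txt    # extract in place, then move the conf file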

Minikube Mount: bad file descriptor

When I mount a directory in Minikube and list out the directory, I get the errors below:
ls: cannot access '/mnt/nilla/assets': Bad file descriptor
ls: cannot access '/mnt/nilla/lib': Bad file descriptor
ls: cannot access '/mnt/nilla/priv': Bad file descriptor
ls: cannot access '/mnt/nilla/config': Bad file descriptor
ls: cannot access '/mnt/nilla/README.md': Bad file descriptor
ls: cannot access '/mnt/nilla/mix.exs': Bad file descriptor
ls: cannot access '/mnt/nilla/test': Bad file descriptor
ls: cannot access '/mnt/nilla/testmount': Bad file descriptor
total 0
-????????? ? ? ? ? ? README.md
d????????? ? ? ? ? ? assets
d????????? ? ? ? ? ? config
d????????? ? ? ? ? ? lib
-????????? ? ? ? ? ? mix.exs
d????????? ? ? ? ? ? priv
d????????? ? ? ? ? ? test
-????????? ? ? ? ? ? testmount
This is a problem because when I mount this directory in my pod, the lsyncd service copies it to a distribution folder, and lsyncd does not know what to do with files that don't have proper descriptors.
I mount the volume after starting Minikube, like this:
nohup minikube mount ${HOME}/Development/nilla/:/mnt/nilla &> /dev/null &
How can I mount a directory and keep the normal file descriptors that appear when I list the directory on my local computer? This is what they look like:
$ ls -l nilla/
total 28
drwxr-xr-x 6 joes joes 4096 Apr 10 22:23 assets
drwxr-xr-x 2 joes joes 4096 Apr 10 22:23 config
drwxr-xr-x 4 joes joes 4096 Apr 10 22:23 lib
-rw-r--r-- 1 joes joes 1905 Apr 10 22:23 mix.exs
drwxr-xr-x 4 joes joes 4096 Apr 10 22:23 priv
-rw-r--r-- 1 joes joes 735 Apr 10 22:23 README.md
drwxr-xr-x 4 joes joes 4096 Apr 10 22:23 test
-rw-rw-r-- 1 joes joes 0 May 15 23:08 testmount
Additional notes: I'm using System76's Pop!_OS, which is a derivative of Ubuntu 20, and my Minikube VM is running Ubuntu 20 on VirtualBox.
Thanks.
@MikołajGłodziak in the comments pointed me in the right direction. The problem was the default driver for Minikube. I changed my minikube start command to specify one of the recommended drivers. For example:
minikube start --driver=docker --mount-string ${HOME}/project/:/mnt/project
NOTE: You may get errors when trying to start up the same Minikube VM with a different driver. If that's the case, minikube delete will delete your current VM, and a new one will be created the next time you run minikube start.
This seems to be an issue with minikube, which is currently being worked on.
See https://github.com/kubernetes/minikube/issues/12301
The current workaround is to use another driver.
An easy way to do this:
minikube start --driver=docker --mount-string="${HOME}/project/host" --mount
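If the VM was originally created with a different driver, a rough sequence (the ${HOME}/project path and /mnt/project target are just examples) would be:

minikube delete                                   # discard the VM created with the old driver
minikube start --driver=docker --mount --mount-string="${HOME}/project:/mnt/project"
minikube ssh -- ls -l /mnt/project                # the files should now show normal permissions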

Does Spark support multiple users?

I have a 3-node Spark 2.3.1 cluster running at the moment, and I'm also running a Zeppelin server as a normal user, ulab.
From Zeppelin, I ran the commands:
%spark
val file = sc.textFile("file:///mnt/glusterfs/test/testfile")
file.saveAsTextFile("/mnt/glusterfs/test/testfile2")
It reports a lot of error messages, something like:
WARN [2018-09-14 05:44:50,540] ({pool-2-thread-8} NotebookServer.java[afterStatusChange]:2302) - Job 20180907-130718_39068508 is finished, status: ERROR, exception: null, result: %text file: org.apache.spark.rdd.RDD[String] = file:///mnt/glusterfs/test/testfile MapPartitionsRDD[49] at textFile at <console>:51
org.apache.spark.SparkException: Job aborted.
...
... 64 elided
Caused by: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/glusterfs/test/testfile2/_temporary/0/task_20180914054253_0050_m_000018/part-00018; isDirectory=false; length=33554979; replication=1; blocksize=33554432; modification_time=1536903780000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/glusterfs/test/testfile2/part-00018
And I found that some of the temporary files are owned by user root, while some are owned by ulab, like the following:
bash-4.4# ls -l testfile2
total 32773
drwxr-xr-x 3 ulab ulab 4096 Sep 14 05:42 _temporary
-rw-r--r-- 1 ulab ulab 33554979 Sep 14 05:44 part-00018
bash-4.4# ls -l testfile2/_temporary/
total 4
drwxr-xr-x 210 ulab ulab 4096 Sep 14 05:44 0
bash-4.4# ls -l testfile2/_temporary/0
total 832
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000000
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000001
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000002
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000003
....
Is there any setting to have all these temporary files created by ulab, so that we can use multiple users in the Spark driver and isolate their privileges?
You can enable the 'User Impersonate' option for the Spark interpreter, which will start the Spark job as the logged-in user.
Refer to this link for more info.
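Roughly, the setup involves two pieces (details vary by Zeppelin version, so treat this as a sketch): in the Spark interpreter settings, set instantiation to 'Per User' in isolated mode and tick 'User Impersonate'; and, if Zeppelin should launch the interpreter process as the logged-in OS user, add something like the following to conf/zeppelin-env.sh:

# launch the interpreter as the logged-in user (the zeppelin user needs matching sudo rights)
export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '

With that in place, the Spark job is submitted as the impersonated user instead of the user running the Zeppelin daemon.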

kafka remove content from topic index files

I am testing Kafka integration with Spark as a consumer. For debugging, I have set log.retention.minutes=2 in server.properties, which cleans up the .log file every 2 minutes. But the .index file is not cleaned up:
[cloudera#quickstart airline1-1]$ ls -l
total 0
-rw-r--r-- 1 root root 10485760 Apr 29 15:08 00000000000000000101.index
-rw-r--r-- 1 root root 0 Apr 29 15:08 00000000000000000101.log
-rw-r--r-- 1 root root 10485756 Apr 29 15:08 00000000000000000101.timeindex
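For reference, the retention-related part of server.properties would look roughly like this (only log.retention.minutes was changed for the test; the other lines show typical defaults for illustration):

log.retention.minutes=2                  # delete log segments older than 2 minutes
log.retention.check.interval.ms=300000   # how often the broker checks for deletable segments (default 5 minutes)
log.segment.bytes=1073741824             # segment size; only rolled (closed) segments are eligible for deletion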
I'm wondering why the .index files are not cleaned up. Any insight would be helpful for understanding what's happening in the background.
Also, please share a recommended approach for cleaning up the log and index files during testing. I found many Google results suggesting: stop the Kafka server -> remove the topic partition files -> restart Kafka. But I'm not inclined towards this approach, as it could impact the offset state maintained in ZooKeeper.
Thanks very much!
