Error while decompressing a file from local Linux to HDFS - linux

This command works fine in local Linux:
gzip -d omega_data_path_2016-08-10.csv.gz
I would like to decompress a file with the extension .csv.gz to an HDFS location.
I tried the command below and I get this error:
[cloudera@client08 localinputfiles]$ gzip -d omega_data_path_2016-08-10.csv.gz | hadoop dfs -put /user/cloudera/inputfiles/
gzip: omega_data_path_2016-08-10.csv already exists; do you wish to overwrite (y or n)? DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
put: `/user/cloudera/inputfiles/': No such file or directory
Could someone help me to fix this?

To make gzip write its output to standard output, use the -c flag. So the command would be:
gzip -dc omega_data_path_2016-08-10.csv.gz | hdfs dfs -put - /user/cloudera/omega_data_path_2016-08-10.csv
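
If there are several such files, the same pipe can be wrapped in a loop. A minimal sketch, assuming the .csv.gz files sit in the current local directory and should land under /user/cloudera/inputfiles/ (paths are placeholders):
for f in *.csv.gz; do
  # strip the .gz suffix to get the target file name in HDFS
  gzip -dc "$f" | hdfs dfs -put - /user/cloudera/inputfiles/"${f%.gz}"
done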

Related

Is it possible to untar a tar.gz file on HDFS and put it in a different HDFS folder without bringing it to the local system

I have an employee_mumbai.tar.gz file; inside it I have name.json and salary.json.
The tar.gz is present in an HDFS location. Is it possible to untar/unzip the gzip file and put the JSON files in an HDFS folder without bringing them to a local file system?
N.B:
Please remember it is not a plain text file and both JSON files contain unique information.
Please also let me know if both files can be read separately into different data frames directly in Spark.
This worked for me:
hdfs dfs -cat /data/<data.gz> | gzip -d | hdfs dfs -put - /data/
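
For the tar.gz in this question, a similar pipe can extract a single member to stdout with tar -O and push it straight into HDFS, so nothing is written to the local disk. A sketch, assuming the archive sits at /data/employee_mumbai.tar.gz and the target directory /data/out already exists (both paths are placeholders); the archive is streamed once per member:
# extract name.json from the streamed archive and write it back to HDFS
hdfs dfs -cat /data/employee_mumbai.tar.gz | tar -xzOf - name.json | hdfs dfs -put - /data/out/name.json
# same for salary.json
hdfs dfs -cat /data/employee_mumbai.tar.gz | tar -xzOf - salary.json | hdfs dfs -put - /data/out/salary.json
Once the two files are in HDFS, spark.read.json can load each one into its own data frame.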

how to write text to file in HDFS without appending [duplicate]

I am using
hdfs dfs -put myfile mypath
and for some files I get
put: 'myfile': File Exists
does that mean there is a file with the same name or does that mean the same exact file (size, content) is already there?
how can I specify an -overwrite option here?
Thanks!
put: 'myfile': File Exists
This means the file named "myfile" already exists in HDFS. You cannot have multiple files with the same name in HDFS.
You can overwrite it using hadoop fs -put -f /path_to_local /path_to_hdfs
You can overwrite your file in HDFS using the -f flag. For example:
hadoop fs -put -f <localfile> <hdfsDir>
OR
hadoop fs -copyFromLocal -f <localfile> <hdfsDir>
It worked fine for me. However, the -f flag won't work with the get or copyToLocal commands; check this question.
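For get/copyToLocal, a common workaround (file names below are placeholders) is to stream the file through cat, so shell redirection overwrites the local copy, or to delete the stale local file first:
# overwrite the local copy by redirecting cat
hdfs dfs -cat /user/cloudera/inputfiles/myfile > myfile
# or remove the local copy first, then get as usual
rm -f myfile && hdfs dfs -get /user/cloudera/inputfiles/myfile .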
A file with the same name exists at the location you're trying to write to.
You can overwrite by specifying the -f flag.
Just an update to this answer: in Hadoop 3.x the command is a bit different
hdfs dfs -put -f /local/to/path hdfs://localhost:9870/users/XXX/folder/folder2

unzip file into same directory in linux

Example:
Here's the list of files in "/tmp/test_dir":
file1
zip -r Test_Files.zip *
When I unzip Test_Files.zip (current working directory "/tmp/test_dir") I get the following output:
/tmp/test_dir/file1
What I'm expecting when I unzip Test_Files.zip:
/tmp/test_dir/Test_Files/file1
Can anyone help me get the expected result shown above?
Use unzip. You can use -o to overwrite existing files and -q to make it quiet. In doubt? Just open a terminal and type unzip (or try /usr/bin/unzip) to see its help output.
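For the layout the question actually asks for (files under /tmp/test_dir/Test_Files), unzip's -d option extracts into a named directory and creates it if it does not exist. A small sketch using the names from the question:
cd /tmp/test_dir
unzip -o -q Test_Files.zip -d Test_Files
# result: /tmp/test_dir/Test_Files/file1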

tar very large files to FTP directly, split into smaller files

I need to back up a large server into FTP storage. I can tar all the files, I can upload using FTP, and I can split the tar file into many small files.
But the problem is I can't do these three steps in one step. I can tar to FTP directly, and I can tar with split, but I can't tar with FTP and split.
The OS is CentOS 6.2
The files total more than 800 GB
Thanks
To tar, split and FTP a directory with one command line you need the following:
The split command normally writes its chunks to files on disk, so you can't pass each chunk to another command like ftp as it is produced. To do that you need to patch split so it supports the --filter option, which hands each output chunk to a command "on the fly" without saving it to the hard disk; the chunk file name is exposed to that command in the $FILE environment variable (the file names would be x00, x01, x02, ...).
1) Here is the split patch: http://lists.gnu.org/archive/html/coreutils/2011-01/txt3j8asgk8WH.txt
After patching split, you will see the --filter option listed in its man page.
2) Install the ncftp FTP client, which lets you connect to an FTP server and put a file with a single command, without the interactive prompts of an ordinary FTP client. ncftp is handy to integrate with scripts and so on.
Here is the command that compresses the /home directory with tar, splits it into 100 MB chunks and transfers each chunk through FTP:
tar cvz -i /home | split -d -b 100m --filter 'ncftpput -r 10 -F -c -u ftpUsername -p ftpPassword ftpHost $FILE'
Note that we use ncftpput, which passes each $FILE chunk to FTP in a single command too.
Additional FTP options:
-r 10: lets you retry the connection up to 10 times after losing the connection to the FTP server.
-F: To use passive mode.
-c: takes the input from stdin.
To merge the split files (x00, x01, x02, x03, ...) so that you can extract the archive, use the following command
cat x* > originalFile.tar.gz
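Since the archive was created with tar cvz, the merged stream is gzip-compressed; it can also be piped straight into tar without writing the merged file to disk, for example:
cat x* | tar xzvf - -C /restore/target
The -C directory is a placeholder for wherever the backup should be unpacked.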
You can make a shell script and use
tar zcf - /usr/folder | split -b 30720m - /usr/archive.tgz
and then upload the pieces to FTP, because once you are tarring straight onto FTP there is no way to split.
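With a coreutils release that already ships split --filter (8.13 or later), no patch is needed, and the upload step could just as well use curl instead of ncftpput. A sketch with placeholder host, credentials and chunk prefix:
tar czf - /home | split -d -b 100M --filter 'curl -sS -T - --user ftpUser:ftpPass ftp://ftpHost/backup/$FILE' - home_backup.
split sets $FILE for each chunk (home_backup.00, home_backup.01, ...), and curl -T - uploads that chunk from its stdin.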

how to convert multi-line linux commands into one line

Can someone please explain how to use ">" and "|" in Linux commands, and convert these three lines into one line of code for me?
mysqldump --user=*** --password=*** $db --single-transaction -R > ${db}-$(date +%m-%d-%y).sql
tar -cf ${db}-$(date +%m-%d-%y).sql.tar ${db}-$(date +%m-%d-%y).sql
gzip ${db}-$(date +%m-%d-%y).sql.tar
rm ${db}-$(date +%m-%d-%y).sql (after conversion I guess this line will be useless)
The GNU tar program can itself do the compression normally done by gzip. You can use the -z flag to enable this. So the tar and gzip could be combined into:
tar -zcf ${db}-$(date +%m-%d-%y).sql.tar.gz ${db}-$(date +%m-%d-%y).sql
Getting tar to read from standard input for archiving is not a simple task but I would question its necessity in this particular case.
The intent of tar is to be able to package up a multitude of files into a single archive file but, since it's only one file you're processing (the output stream from mysqldump), you don't need to tar it up, you can just pipe it straight into gzip itself:
mysqldump blah blah | gzip > ${db}-$(date +%m-%d-%y).sql.gz
That's because gzip will compress standard input to standard output if you don't give it any file names.
This removes the need for any (possibly very large) temporary files during the compression process.
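As a usage note, the compressed dump can later be streamed straight back into mysql without decompressing it to disk first (credentials and file name below are placeholders):
gunzip -c mydump.sql.gz | mysql --user=*** --password=*** $db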
You can use the following script:
#!/bin/sh
USER="***"
PASS="***"
DB="***"
mysqldump --user=$USER --password=$PASS $DB --single-transaction -R | gzip > ${DB}-$(date +%m-%d-%y).sql.gz
You can learn more about "|" here: http://en.wikipedia.org/wiki/Pipeline_(Unix). This construction sends the output of the mysqldump command to the standard input of the gzip command; it is as if you connect the output of one command to the input of the other via a pipeline.
I don't see the point in using tar: you just have one file, and for compression you call gzip explicitly. tar is used to archive/pack multiple files into one.
Your command line should be (the dump command is shortened, but I guess you will get it):
mysqldump .... | gzip > filename.sql.gz
To chain the commands together on one line, I'd put && between them. That way, if one fails, the rest stop executing. You could also use a semicolon after each command, in which case each will run regardless of whether the prior command fails or not.
You should also know that tar will do the gzip for you with a "z" option, so you don't need the extra command.
Paxdiablo makes a good point that you can just pipe mysqldump directly into gzip.
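A minimal sketch of the difference between && and ; that the last answer describes, reusing the placeholder credentials from the question:
# with &&, gzip runs only if mysqldump succeeded
mysqldump --user=*** --password=*** $db --single-transaction -R > dump.sql && gzip dump.sql
# with ;, gzip runs even if mysqldump failed
mysqldump --user=*** --password=*** $db --single-transaction -R > dump.sql ; gzip dump.sql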
