How to delete HDFS folder having windows special characters (^M) in the name - linux

I wrote a shell script on Windows 7 to create HDFS folders and ran it on a Linux server. The HDFS folders were created, but with the special character ^M at the end of each name (probably a carriage return). It doesn't show up in Linux, but I can see it when the 'ls' output is redirected to a file.
I should have run dos2unix before running the script. Now I am not able to delete the folders with ^M in their names. Could someone assist with how to delete these folders?

Just a supplementary answer to @SachinJ's.
TL;DR
$ hdfs dfs -rm -r -f $(hdfs dfs -ls /path/to/dir | sed '<LINE_NUMBER>q;d' | awk '{print $<FILE_NAME_COLUMN_NUMBER>}')
Replace <LINE_NUMBER> with the line number of the file you want to delete in the output of hdfs dfs -ls /path/to/dir, and <FILE_NAME_COLUMN_NUMBER> with the number of the filename column.
Here is the example.
Details
Suppose your hdfs dir looks like this
$ hdfs dfs -ls /path/to/dir
Found 5 items
drwxr-xr-x - test supergroup 0 2019-08-22 10:41 /path/to/dir/dir1
drwxr-xr-x - test supergroup 0 2019-07-11 15:35 /path/to/dir/dir2
drwxr-xr-x - test supergroup 0 2019-07-05 17:53 /path/to/dir/dir3
drwxr-xr-x - test supergroup 0 2019-08-22 11:28 /path/to/dir/dirtodelete
drwxr-xr-x - test supergroup 0 2019-07-26 11:07 /path/to/dir/dir4
When you run ls on it, the screen output looks fine.
But you can't select the directory:
$ hdfs dfs -ls /path/to/dir/dirtodelete
ls: `/path/to/dir/dirtodelete': No such file or directory
$ hdfs dfs -ls /path/to/dir/dirtodelete*
ls: `/path/to/dir/dirtodelete*': No such file or directory
What's more, when you redirect the ls output to a file and open it in vim, it shows the following:
$ hdfs dfs -ls /path/to/dir > tmp
$ vim tmp
Found 5 items
drwxr-xr-x - test supergroup 0 2019-08-22 10:41 /path/to/dir/dir1
drwxr-xr-x - test supergroup 0 2019-07-11 15:35 /path/to/dir/dir2
drwxr-xr-x - test supergroup 0 2019-07-05 17:53 /path/to/dir/dir3
drwxr-xr-x - test supergroup 0 2019-08-22 11:28 /path/to/dir/dirtodelete^M^M
drwxr-xr-x - test supergroup 0 2019-07-26 11:07 /path/to/dir/dir4
What is "^M"? It's a CARRIAGE RETURN (CR). More info here:
a Linux line ending is \n (LF), while a Windows line ending is \r\n (CRLF).
This problem occurs when the same file is edited on both Windows and Linux.
So we just need to use the correct filename (with the trailing CR) and then we can delete it. But that name can't be copied from the screen.
Here sed command works!
ls output as following
$ hdfs dfs -ls /path/to/dir
Found 5 items
drwxr-xr-x - test supergroup 0 2019-08-22 10:41 /path/to/dir/dir1
drwxr-xr-x - test supergroup 0 2019-07-11 15:35 /path/to/dir/dir2
drwxr-xr-x - test supergroup 0 2019-07-05 17:53 /path/to/dir/dir3
drwxr-xr-x - test supergroup 0 2019-08-22 11:28 /path/to/dir/dirtodelete
drwxr-xr-x - test supergroup 0 2019-07-26 11:07 /path/to/dir/dir4
the filename is on line 5,
so hdfs dfs -ls /path/to/dir | sed '5q;d' will extract the line we need.
sed '5q;d' suppresses (d) every line it reads and quits (q) at line 5; since q prints the current pattern space before quitting, only the 5th line is output.
Then we can use awk to select the filename column; fields are indexed from 1, so the column number is 8.
so just write the command
$ hdfs dfs -ls /path/to/dir/ | sed '5q;d' | awk '{print $8}'
/path/to/dir/dirtodelete
Then we can delete it.
$ hdfs dfs -rm -r -f $(hdfs dfs -ls /path/to/dir/ | sed '5q;d' | awk '{print $8}')
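The same line-number trick can be reproduced on a plain local filesystem; a minimal sketch (the sandbox path and directory names below are made up for illustration):

```shell
# Create a sandbox with one normal directory and one whose name ends in a
# carriage return, then delete the bad one by selecting it from ls by line number
set -e
sandbox=$(mktemp -d)
mkdir "$sandbox/dir1" "$sandbox/dirtodelete"$'\r'
# "dirtodelete^M" sorts after "dir1", so it is line 2 of the ls output
target=$(ls "$sandbox" | sed '2q;d')
rm -r "$sandbox/$target"
ls "$sandbox"    # only dir1 remains
```

The key point is that the shell never has to type the CR character: sed hands the exact bytes of the name to rm.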

Sometimes a wildcard may not work (rm filename*); in that case, use the option below.
rm -r $(ls | sed '<LINE_NUMBER>q;d')
Replace <LINE_NUMBER> with the line number of the file in the output of the ls command.

Related

How to find out if ls command output is file or a directory Bash

The ls command outputs everything contained in the current directory. For example, ls -la will output something like this:
drwxr-xr-x 3 user user 4096 dec 19 17:53 .
drwxr-xr-x 15 user user 4096 dec 19 17:39 ..
drwxrwxr-x 2 user user 4096 dec 19 17:53 tess (directory)
-rw-r--r-- 1 user user 178 dec 18 21:52 file (file)
-rw-r--r-- 1 user user 30 dec 18 21:47 text (file)
And what if I want to know how much space all the files consume? For that I would have to sum $5 over all lines with ls -la | awk '{ sum+=$5 } END{print sum}'. So how can I sum only the sizes of files and leave directories out?
You can use the following :
find . -maxdepth 1 -type f -printf '%s\n' | awk '{s+=$1} END {print s}'
The find command selects all the files in the current directory and outputs their sizes. The awk command sums the integers and outputs the total.
Don't.
One of the most quoted pages on SO that I've seen is https://unix.stackexchange.com/questions/128985/why-not-parse-ls-and-what-do-to-instead.
That being said and as a hint for further development, ls -l | awk '/^-/{s+=$5} END {print s}' will probably do what you ask.
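A quick sanity check of that awk filter, using made-up file names in a scratch directory:

```shell
# Sum only regular-file sizes from ls -l output: lines for regular files
# start with "-", so the /^-/ pattern skips directories and the "total" line
set -e
d=$(mktemp -d)
printf 'abc' > "$d/a.txt"     # 3 bytes
printf 'hello' > "$d/b.txt"   # 5 bytes
mkdir "$d/subdir"             # excluded from the sum
total=$(cd "$d" && ls -l | awk '/^-/{s+=$5} END {print s}')
echo "$total"    # 8
```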

replacement on xargs variable returns empty string

I need to search for XML files inside a directory tree and create links to them in another directory (staging_ojs_pootle), naming these links with the file path (replacing slashes with dots).
The bash command is not working; I got stuck on the replacement part. It seems the variable from xargs, named 'file', is not accessible inside the replacement code (${file/\//.})
find directory/ -name '*.xml' | xargs -I 'file' echo "ln" file staging_ojs_pootle/${file/\//.}
The replacement inside ${} gives me an empty string.
I tried using sed, but my regular expressions were replacing either all of the slashes or just the last one :/
find directory/ -name '*.xml' | xargs -I 'file' echo "ln" file staging_ojs_pootle/file |sed -e '/^ln/s/\(staging_ojs_pootle.*\)[\/]\(.*\)/\1.\2/g'
regards
Try this:
$ find directory/ -name '*.xml' |sed -r 'h;s|/|.|g;G;s|([^\n]+)\n(.+)|ln \2 staging_ojs_pootle/\1|e'
For example:
$ mkdir -p /tmp/test
$ touch {1,2,3,4}.xml
# use /tmp/test as staging_ojs_pootle
$ find /tmp/test -name '*.xml' |sed -r 'h;s|/|.|g;G;s|([^\n]+)\n(.+)|ln \2 /tmp/test/\1|e'
$ ls -al /tmp/test
total 8
drwxr-xr-x. 2 root root 4096 Jun 15 13:09 .
drwxrwxrwt. 9 root root 4096 Jun 15 11:45 ..
-rw-r--r--. 2 root root 0 Jun 15 11:45 1.xml
-rw-r--r--. 2 root root 0 Jun 15 11:45 2.xml
-rw-r--r--. 2 root root 0 Jun 15 11:45 3.xml
-rw-r--r--. 2 root root 0 Jun 15 11:45 4.xml
-rw-r--r--. 2 root root 0 Jun 15 11:45 .tmp.test.1.xml
-rw-r--r--. 2 root root 0 Jun 15 11:45 .tmp.test.2.xml
-rw-r--r--. 2 root root 0 Jun 15 11:45 .tmp.test.3.xml
-rw-r--r--. 2 root root 0 Jun 15 11:45 .tmp.test.4.xml
# if we do NOT use the e modifier of the s command, we can see the final commands
$ find /tmp/test -name '*.xml' |sed -r 'h;s|/|.|g;G;s|([^\n]+)\n(.+)|ln \2 /tmp/test/\1|'
ln /tmp/test/1.xml /tmp/test/.tmp.test.1.xml
ln /tmp/test/2.xml /tmp/test/.tmp.test.2.xml
ln /tmp/test/3.xml /tmp/test/.tmp.test.3.xml
ln /tmp/test/4.xml /tmp/test/.tmp.test.4.xml
Explanation:
For each xml file, use h to keep the original filename in the hold space.
Then use s|/|.|g to substitute every / with . in the xml filename.
Use G to append the hold space to the pattern space, so the pattern space becomes CHANGED_FILENAME\nORIGIN_FILENAME.
Use s|([^\n]+)\n(.+)|ln \2 staging_ojs_pootle/\1|e to assemble the ln command from CHANGED_FILENAME and ORIGIN_FILENAME; the e modifier of the s command then executes the assembled command, which does the actual work.
Hope this helps!
If you can be sure that the names of your XML files do not contain any word-splitting characters, you can use something like:
find directory -name "*.xml" | sed 'p;s/\//./g' | xargs -n2 echo ln
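To see why the p;s pairing works, here is a self-contained sketch (all paths below are made up): each filename is printed twice, the second copy with slashes rewritten to dots, and xargs -n2 consumes them as (source, link-name) pairs:

```shell
set -e
d=$(mktemp -d)
mkdir -p "$d/src/sub" "$d/staging_ojs_pootle"
touch "$d/src/sub/a.xml"
cd "$d"
# sed prints each path, then a slash-to-dot copy; xargs takes them two at a time
find src -name '*.xml' | sed 'p;s/\//./g' \
    | xargs -n2 sh -c 'ln "$1" "staging_ojs_pootle/$2"' _
ls staging_ojs_pootle    # src.sub.a.xml
```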

Move all folders except one [duplicate]

This question already has answers here:
How to move files and directories excluding one specific directory to this directory
(3 answers)
Closed 5 years ago.
I have two directories dir1 and dir2. I need to move the content of folder dir1 to dir2 except one folder dir1/src.
I tried this
mv !(src) dir1/* dir2/
But it doesn't work; it displays this error:
bash: !: event not found
Maybe you are looking for something like this?
The answer to my question there states that what you are trying to do is achievable using the extglob bash shell option. You can turn it on by executing shopt -s extglob or by adding that command to your ~/.bashrc and relogging in. Afterwards you can use the pattern.
To use your example of moving everything from dir1 except dir1/src to dir2, this should work:
mv -vt dir2/ dir1/!(src)
Example output:
$ mkdir -pv dir1/{a,b,c,src} dir2
mkdir: created directory 'dir1'
mkdir: created directory 'dir1/a'
mkdir: created directory 'dir1/b'
mkdir: created directory 'dir1/c'
mkdir: created directory 'dir1/src'
mkdir: created directory 'dir2'
$ ls -l dir1/
total 16
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 a
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 b
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 c
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 src
$ ls -l dir2/
total 0
$ shopt -s extglob
$ mv -vt dir2/ dir1/!(src)
'dir1/a' -> 'dir2/a'
'dir1/b' -> 'dir2/b'
'dir1/c' -> 'dir2/c'
$ ls -l dir1/
total 4
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 src
$ ls -l dir2/
total 12
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 a
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 b
drwxrwxr-x 2 dw dw 4096 Apr 7 13:30 c
More information about extglob can be found here.

Linux combine sort files by date created and given file name

I need to combine these two commands in order to get a list of files with the specified "filename", sorted by creation date.
I know that sorting files by date can be achieved with:
ls -lrt
and finding a file by name with
find . -name "filename*"
I don't know how to combine these two. I tried with a pipeline but I don't get the right result.
[EDIT]
Not sorted
find . -name "filename" -printf '%TY:%Tm:%Td %TH:%TM %h/%f\n' | sort
Forget xargs. "Find" and "sort" are all the tools you need.
My best guess would be to use xargs:
find . -name 'filename*' -print0 | xargs -0 /bin/ls -ltr
There's an upper limit on the number of arguments, but it shouldn't be a problem unless they occupy more than 32kB (read more here), in which case you will get blocks of sorted files :)
find . -name "filename" -exec ls --full-time \{\} \; | cut -d' ' -f7- | sort
You might have to adjust the cut command depending on what your version of ls outputs.
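Parsing ls can be avoided entirely with find's -printf and an epoch timestamp as the sort key; a small sketch with made-up file names and dates:

```shell
# Print each match as "<epoch-mtime> <path>", sort numerically, strip the key
set -e
d=$(mktemp -d)
touch -d '2020-01-01' "$d/filename_a"
touch -d '2020-01-02' "$d/filename_b"
oldest=$(find "$d" -name 'filename*' -printf '%T@ %p\n' \
    | sort -n | head -n1 | cut -d' ' -f2-)
echo "$oldest"
```

%T@ and touch -d are GNU extensions, so this assumes GNU findutils and coreutils.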
Check the command below:
1) List files in a directory with last-modified date/time
To list files with the most recently modified ones at the top, we use the -lt options of the ls command.
$ ls -lt /run
output
total 24
-rw-rw-r--. 1 root utmp 2304 Sep 8 14:58 utmp
-rw-r--r--. 1 root root 4 Sep 8 12:41 dhclient-eth0.pid
drwxr-xr-x. 4 root root 100 Sep 8 03:31 lock
drwxr-xr-x. 3 root root 60 Sep 7 23:11 user
drwxr-xr-x. 7 root root 160 Aug 26 14:59 udev
drwxr-xr-x. 2 root root 60 Aug 21 13:18 tuned
https://linoxide.com/linux-how-to/how-sort-files-date-using-ls-command-linux/

Linux - Save only recent 10 folders and delete the rest

I have a folder that contains versions of my application. Each time I upload a new version, a new sub-folder is created for it; the sub-folder name is the current timestamp. Here is a printout of the main folder used (ls -l | grep ^d):
drwxrwxr-x 7 root root 4096 2011-03-31 16:18 20110331161649
drwxrwxr-x 7 root root 4096 2011-03-31 16:21 20110331161914
drwxrwxr-x 7 root root 4096 2011-03-31 16:53 20110331165035
drwxrwxr-x 7 root root 4096 2011-03-31 16:59 20110331165712
drwxrwxr-x 7 root root 4096 2011-04-03 20:18 20110403201607
drwxrwxr-x 7 root root 4096 2011-04-03 20:38 20110403203613
drwxrwxr-x 7 root root 4096 2011-04-04 14:39 20110405143725
drwxrwxr-x 7 root root 4096 2011-04-06 15:24 20110406151805
drwxrwxr-x 7 root root 4096 2011-04-06 15:36 20110406153157
drwxrwxr-x 7 root root 4096 2011-04-06 16:02 20110406155913
drwxrwxr-x 7 root root 4096 2011-04-10 21:10 20110410210928
drwxrwxr-x 7 root root 4096 2011-04-10 21:50 20110410214939
drwxrwxr-x 7 root root 4096 2011-04-10 22:15 20110410221414
drwxrwxr-x 7 root root 4096 2011-04-11 22:19 20110411221810
drwxrwxr-x 7 root root 4096 2011-05-01 21:30 20110501212953
drwxrwxr-x 7 root root 4096 2011-05-01 23:02 20110501230121
drwxrwxr-x 7 root root 4096 2011-05-03 21:57 20110503215252
drwxrwxr-x 7 root root 4096 2011-05-06 16:17 20110506161546
drwxrwxr-x 7 root root 4096 2011-05-11 10:00 20110511095709
drwxrwxr-x 7 root root 4096 2011-05-11 10:13 20110511100938
drwxrwxr-x 7 root root 4096 2011-05-12 14:34 20110512143143
drwxrwxr-x 7 root root 4096 2011-05-13 22:13 20110513220824
drwxrwxr-x 7 root root 4096 2011-05-14 22:26 20110514222548
drwxrwxr-x 7 root root 4096 2011-05-14 23:03 20110514230258
I'm looking for a command that will leave the last 10 versions (sub-folders) and deletes the rest.
Any thoughts?
There you go. (edited)
ls -dt */ | tail -n +11 | xargs rm -rf
First list directories by modification time (most recently modified first), then take all of them except the first 10, then send them to rm -rf.
ls -dt1 /path/to/folder/*/ | sed -n '11,$p' | xargs rm -r
This assumes those are the only directories and that no others are present in the working directory.
The /*/ glob matches only directories and expands to their full paths; the 1 ensures one line per listing, and t sorts by time with the newest at the top.
sed -n '11,$p' prints only the 11th line through the last; those lines are then passed to rm via xargs.
For testing, you may wish to drop | xargs rm -r first to check that the directories are listed correctly.
If the directories' names contain the date one can delete all but the last 10 directories with the default alphabetical sort
ls -d */ | head -n -10 | xargs rm -rf
ls -lt | grep ^d | sed -e '1,10d' | awk '{sub(/.* /, ""); print }' | xargs rm -rf
Explanation:
list all contents of current directory in chronological order (most recent files first)
filter out all the directories
ignore the 10 first lines / directories
use awk to extract the file names from the remaining 'ls -l' output
remove the files
EDIT:
find . -maxdepth 1 -type d ! -name \. | sort | tac | sed -e '1,10d' | xargs rm -rf
I suggest the following sequence. I use a similar approach on my Synology NAS to delete old backups. It doesn't rely on the folder names; instead, it uses the last-modified time to decide which folders to delete. It also uses zero-termination in order to correctly handle quotes, spaces, and newline characters in the folder names:
find /path/to/folder -maxdepth 1 -mindepth 1 -type d -printf '%Ts\t' -print0 \
| sort -rnz \
| tail -n +11 -z \
| cut -f2- -z \
| xargs -0 -r rm -rf
IMPORTANT: This will delete any matching folders! I strongly recommend doing a test run first by replacing the last command xargs -0 -r rm -rf with xargs -0 which will echo the matching folders instead of deleting them.
A short explanation of each step:
find /path/to/folder -maxdepth 1 -mindepth 1 -type d -printf '%Ts\t' -print0
Find all directories (-type d) directly inside the backup folder (-maxdepth 1) except the backup folder itself (-mindepth 1), print (-printf) the Unix time (%Ts) of the last modification followed by a tab character (\t, used in step 4) and the full file name followed by a null character (-print0).
sort -rnz
Sort the zero-terminated items (-z) from the previous step using a numerical comparison (-n) and reverse the order (-r). The result is a list of all folders sorted by their last modification time in descending order.
tail -n +11 -z
Print the last lines (tail) from the previous step starting from line 11 (-n +11) considering each line as zero-terminated (-z). This excludes the newest 10 folders (by modification time) from the remaining steps.
cut -f2- -z
Cut each line from the second field to the end (-f2-), treating each line as zero-terminated (-z), to obtain a list containing the full path to each folder except the newest 10.
xargs -r -0 rm -rf
Take the zero-terminated (-0) items from the previous step (xargs), and, if there are any (-r avoids running the command passed to xargs if there are no nonblank characters), force delete (rm -rf) them.
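The full pipeline can be exercised in a scratch directory (the path and version names below are made up): twelve folders are created with distinct modification times, and afterwards only the ten newest remain.

```shell
set -e
base=$(mktemp -d)
for i in $(seq -w 1 12); do
    mkdir "$base/v$i"
    touch -d "2021-01-$i" "$base/v$i"   # distinct mtime per folder
done
find "$base" -maxdepth 1 -mindepth 1 -type d -printf '%Ts\t' -print0 \
    | sort -rnz \
    | tail -n +11 -z \
    | cut -f2- -z \
    | xargs -0 -r rm -rf
ls "$base"    # v03 ... v12
```

Note that the -z flags of sort, tail, and cut require GNU coreutils.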
Your directory names are sorted in chronological order, which makes this easy. The list of directories in chronological order is just *, or [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] to be more precise. So you want to delete all but the last 10 of them.
set [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/
while [ $# -gt 10 ]; do
rm -rf "$1"
shift
done
(While there are more than 10 directories left, delete the oldest one.)
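A throwaway-directory sketch of this glob-and-shift approach (the 14-digit folder names are made-up timestamps):

```shell
set -e
d=$(mktemp -d)
cd "$d"
for i in $(seq -w 1 12); do mkdir "201101010101$i"; done
# positional parameters hold the sorted glob matches, oldest name first
set -- [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/
while [ $# -gt 10 ]; do
    rm -rf "$1"
    shift
done
ls    # the ten newest timestamps remain
```

Because the names are timestamps, lexical glob order is chronological order, so no sort step is needed.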
