recursively "normalize" filenames - linux

I mean getting rid of special chars in filenames, etc.
I have made a script that can recursively rename files [http://pastebin.com/raw.php?i=kXeHbDQw]:
e.g.: before:
THIS i.s my file (1).txt
after running the script:
This-i-s-my-file-1.txt
But when I wanted to test it "fully", with filenames like this:
¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÂÃÄÅÆÇÈÊËÌÎÏÐÑÒÔÕ×ØÙUÛUÝÞßàâãäåæçèêëìîïðñòôõ÷øùûýþÿ.txt
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&'()*+,:;<=>?#[\]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£.txt
it fails [http://pastebin.com/raw.php?i=iu8Pwrnr]:
$ sh renamer.sh directorythathasthefiles
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†....and so on
$
so "mv" can't handle special chars.. :\
i worked on it for many hours..
does anyone has a working one? [that can handle chars [filenames] in that 2 lines too?]

mv handles special characters just fine. Your script doesn't.
In no particular order:
You are using find to find all directories, and ls each directory separately.
Why use for DEPTH in... if you can do exactly the same with one command?
find -maxdepth 100 -type d
Which makes the arbitrary depth limit unnecessary
find -type d
Don't ever parse the output of ls, especially if you can let find handle that, too
find -not -type d
Make sure it works in the worst possible case:
find -not -type d -print0 | while read -r -d '' FILENAME; do
This stops read from eating certain escapes and choking on filenames with new-line characters.
You are repeating the entire ls | replace cycle for every single character. Don't - it kills performance. Loop over all the files in each directory once, and just use multiple seds, or multiple replacements in one sed command.
sed 's/á/a/g; s/í/i/g; ...'
(I was going to suggest sed 'y/áí/ai/', but unfortunately that doesn't seem to work with Unicode. Perhaps perl -CS -Mutf8 -pe 'y/áí/ai/' would.)
You're still thinking in ASCII: "other special chars - ASCII Codes 33.. ..255". Don't.
These days, most systems use Unicode in UTF-8 encoding, which has a much wider range of "special" characters - so big that listing them out one by one becomes pointless. (It is even multibyte - "e" is one byte, while an accented letter like "ė" takes two or more bytes.)
True ASCII has 128 characters. What you currently have in mind are the ISO 8859 character sets (sometimes called "ANSI") - in particular, ISO 8859-1. But they go all the way up to 8859-16, and only the "ASCII" part stays the same.
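If the goal is simply ASCII-safe names, iconv's transliteration mode covers that whole range at once instead of a hand-written per-character list (a sketch; the exact output depends on your locale and iconv implementation):
echo 'áíüé!.txt' | iconv -f UTF-8 -t ASCII//TRANSLIT
# typically prints aiue!.txt; characters with no ASCII equivalent come out as '?',
# which the later sed/tr cleanup passes can squash into '-'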
echo -n $(command) is rather useless.
There are much easier ways to find the directory and basename given a path. For example, you can do
directory=$(dirname "$path")
oldname=$(basename "$path")
# filter $oldname
mv "$path" "$directory/$newname"
Do not use egrep to check for errors. Check the program's return code. (Like you already do with cd.)
And instead of filtering out other errors, do...
if [[ -e $directory/$newname ]]; then
echo "target already exists, skipping: $oldname -> $newname"
continue
else
mv "$path" "$directory/$newname"
fi
The ton of sed 's/------------/-/g' calls can be changed to a single regexp:
sed -r 's/-{2,}/-/g'
The [ ]s in tr [foo] [bar] are unnecessary. They just cause tr to replace [ to [, and ] to ].
Seriously?
echo "$FOLDERNAME" | sed "s/$/\//g"
How about this instead?
echo "$FOLDERNAME/"
And finally, use detox.
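If detox is installed, a typical invocation looks like this; -n does a dry run (flag details may vary between versions, so check man detox):
detox -r -n directorythathasthefiles    # preview what would be renamed
detox -r -v directorythathasthefiles    # actually rename, reporting each change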

Try something like:
find . -type f -print0 | awk 'BEGIN {RS="\x00"} { printf "%s\x00", $0; gsub("[^[:alnum:]]", "-"); printf "%s\x00", $0 }' | xargs -0 -L 2 mv
Use of xargs(1) will ensure that each filename is passed as exactly one parameter. awk(1) is used to emit the new filename right after the old one.
One more trick: sed -E -e 's/-+/-/g' (or sed -e 's/--*/-/g' with basic regexes) will replace groups of more than one "-" with exactly one.

Assuming the rest of your script is right, your problem is that you are using read but you should use read -r. Notice how the backslash disappeared:
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&'()*+,:;<=>?#[\]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£.txt
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£
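A quick demonstration of the difference (any string containing a backslash will do):
$ printf '%s\n' 'foo\bar' | while read NAME; do echo "$NAME"; done
foobar
$ printf '%s\n' 'foo\bar' | while read -r NAME; do echo "$NAME"; done
foo\bar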

Ugh...
Some tips to clean up your script:
** Use sed to do translation on multiple characters at once, that'll clean things up and make it easier to manage:
dev:~$ echo 'áàaieeé!.txt' | sed -e 's/[áàã]/a/g; s/[éè]/e/g'
aaaieee!.txt
** Rather than renaming the file for each change, run all your filters, then do one move:
$ NEWNAME='áàaieeé!.txt'
$ NEWNAME="$(echo "$NEWNAME" | sed -e 's/[áàã]/a/g; s/[éè]/e/g')"
$ NEWNAME="$(echo "$NEWNAME" | sed -e 's/aa*/a/g')"
$ echo $NEWNAME
aieee!.txt
** Rather than doing an ls | read ... loop, use:
for OLDNAME in "$DIR"/*; do
blah
blah
blah
done
** Separate out your path traversal and renaming logic into two scripts. One script finds the files which need to be renamed; the other handles the normalization of a single file. Once you learn the 'find' command, you'll realize you can toss the first script :)
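For instance, once the per-file logic lives in its own script (normalize-one.sh is a hypothetical name for it), the traversal part is a single find call:
find "$DIR" -depth -not -type d -exec ./normalize-one.sh {} \;
The -depth flag only matters if you later rename directories as well; it makes find hand over a directory's contents before the directory itself.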

Related

Rm and Egrep -v combo

I want to remove all the logs except the current log and the log before that.
These log files are created every 20 minutes. So the file names are like
abc_23_19_10_3341.log
abc_23_19_30_3342.log
abc_23_19_50_3241.log
abc_23_20_10_3421.log
where 23 is today's date (it might include yesterday's date also), 19 is the hour (7 o'clock), and 10, 30, 50, 10 are the minutes.
In this case I want to keep abc_23_20_10_3421.log, which is the current log (the one currently being written), and abc_23_19_50_3241.log (the previous one),
and remove the rest.
I got it to work by creating a folder, putting the files to keep in that folder, removing the rest, and then deleting it. But that's too long...
I also tried this
files_nodelete=`ls -t | head -n 2 | tr '\n' '|'`
rm *.txt | egrep -v "$files_nodelete"
but it didn't work. If I put ls instead of rm, it works.
I am an amateur in Linux, so please suggest a simple idea or some logic. I tried xargs rm, but it didn't work.
I also read about mtime, but it seems a bit complicated since I am new to Linux.
I am working on a Solaris system.
Try the logadm tool in Solaris, it might be the simplest way to rotate logs. If you just want to get things done, it will do it.
http://docs.oracle.com/cd/E23823_01/html/816-5166/logadm-1m.html
If you want a solution similar to yours (but working), try this:
ls abc*.log | sort | head -n-2 | xargs rm
ls abc*.log: list all files, matching the pattern abc*.log
sort: sorts this list lexicographically (by name) from oldest to newest logfile
head -n-2: return all but the last two entries in the list (you can give -n a negative count)
xargs rm: compose the rm command with the entries from stdin
If there are two or fewer files in the directory, this command will return an error like
rm: missing operand
and will not delete any files.
It is usually not a good idea to use ls to point to files. Some files may cause havoc (files which have a newline or a weird character in their name are the usual examples).
Using shell globs: here is an interesting way: we count the files newer than the one we are about to remove!
pattern='abc*.log'
for i in $pattern ; do
[ -f "$i" ] || break ;
#determine if this is the most recent file, in the current directory
# [I add -maxdepth 1 to limit the find to only that directory, no subdirs]
if [ $(find . -maxdepth 1 -name "$pattern" -type f -newer "$i" -print0 | tr -cd '\000' | tr '\000' '+' | wc -c) -gt 1 ];
then
#there are 2 files more recent than $i that match the pattern
#we can delete $i
echo rm "$i" # remove the echo only when you are 100% sure that you want to delete all those files !
else
echo "$i is one of the 2 most recent files matching '${pattern}', I keep it"
fi
done
I only use the globbing mechanism to feed filenames to "find", and just use the terminating NUL of -print0 to count the output filenames (thus I have no problems with any special characters in those filenames; I just need to know how many files were output).
tr -cd "\000" will keep only the \000, i.e. the terminating NUL characters output by -print0. Then I translate each \000 to a single + character, and I count them with wc -c. If I see 0, "$i" was the most recent file. If I see 1, "$i" was the one just a bit older (so the find sees only the most recent one). And if I see more than 1, it means the 2 files (matching the pattern) that we want to keep are newer than "$i", so we can delete "$i".
I'm sure someone will step in with a better one, but the idea could be reused, I guess...
Thanks guyz for all the answers.
I found my answer
files=`ls -t *.txt | head -n 2 | tr '\n' '|' | rev |cut -c 2- |rev`
rm `ls -t | egrep -v "$files"`
Thank you for the help

Bash loop through directory including hidden file

I am looking for a way to make a simple loop in bash over everything my directory contains, i.e. files, directories and links, including hidden ones.
I would prefer it to be specifically in bash, but it has to be as general as possible. Of course, file names (and directory names) can contain whitespace, newlines, and symbols - everything but "/" and ASCII NUL (0x00), even as the first character. Also, the result should exclude the '.' and '..' directories.
Here is a generator of files that the loop has to deal with:
#!/bin/bash
mkdir -p test
cd test
touch A 1 ! "hello world" \$\"sym.dat .hidden " start with space" $'\n start with a newline'
mkdir -p ". hidden with space" $'My Personal\nDirectory'
So my loop should look like (but has to deal with the tricky stuff above):
for i in * ; do
echo ">$i<"
done
My closest try, using ls and a bash array (but it is not working), is:
IFS=$(echo -en "\n\b")
l=( $(ls -A .) )
for i in ${l[@]} ; do
echo ">$i<"
done
unset IFS
Or using bash arrays, but the ".." directory is not excluded:
IFS=$(echo -en "\n\b")
l=( [[:print:]]* .[[:print:]]* )
for i in ${l[@]} ; do
echo ">$i<"
done
unset IFS
* doesn't match files beginning with ., so you just need to be explicit:
for i in * .[^.]*; do
echo ">$i<"
done
.[^.]* will match all files and directories starting with ., followed by a non-. character, followed by zero or more characters. In other words, it's like the simpler .*, but excludes . and ... If you need to match something like ..foo, then you might add ..?* to the list of patterns.
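One caveat worth adding (bash-specific): a pattern that matches nothing is passed through literally, so you may want nullglob around the loop:
shopt -s nullglob
for i in * .[^.]* ..?*; do
echo ">$i<"
done
shopt -u nullglob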
As chepner noted in the comments below, this solution assumes you're running GNU bash along with GNU find and GNU sort...
GNU find can be prevented from recursing into subdirectories with the -maxdepth option. Then use -print0 to end every filename with a 0x00 byte instead of the newline you'd usually get from -print.
The sort -z sorts the filenames between the 0x00 bytes.
Then, you can use sed to get rid of the dot and dot-dot directory entries (although GNU find seems to exclude the .. already).
I also used sed to get rid of the ./ in front of every filename. basename could do that too, but older systems didn't have basename, and you might not trust it to handle the funky characters right.
(These sed commands each required two cases: one for a pattern at the start of the string, and one for the pattern between 0x00 bytes. These were so ugly I split them out into separate functions.)
The read command doesn't have a -z or -0 option like some commands, but you can fake it with -d "" and blanking the IFS environment variable.
The additional -r option prevents a backslash-newline combo from being interpreted as a line continuation. (A file called backslash\\nnewline would otherwise be mangled to backslashnewline.) It might be worth seeing if other backslash-combos get interpreted as escape sequences.
remove_dot_and_dotdot_dirs()
{
sed \
-e 's/^[.]\{1,2\}\x00//' \
-e 's/\x00[.]\{1,2\}\x00/\x00/g'
}
remove_leading_dotslash()
{
sed \
-e 's/^[.]\///' \
-e 's/\x00[.]\//\x00/g'
}
IFS=""
find . -maxdepth 1 -print0 |
sort -z |
remove_dot_and_dotdot_dirs |
remove_leading_dotslash |
while read -r -d "" filename
do
echo "Doing something with file '${filename}'..."
done
It may not be the most favorable way, but I tried the thing below:
while read line ; do echo $line; done <<< $(ls -a | grep -v -w ".")
Try the find command, something like:
find .
That will list all the files in all subdirectories, recursively.
To output only files, without the leading ./ prefix, try:
find . -type f -printf %P\\n
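If you then want to loop over everything in the current directory only (including hidden entries, excluding . and ..), here is a sketch using NUL delimiters; -mindepth/-maxdepth are GNU/BSD find options:
find . -mindepth 1 -maxdepth 1 -print0 |
while IFS= read -r -d '' name; do
name=${name#./}
echo ">$name<"
done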

awk system syntax throwing error

I'm trying to move each file in my directory individually into a new directory phonex and rename it to phonex.txt at the same time.
so e.g.
1.txt, 2.txt, jim.txt
become:
phone1.txt in directory phone1
phone2.txt in directory phone2
phone3.txt in directory phone3.
I'm a newbie to awk. I have managed to create the directories, but I cannot get the rename right.
I have tried:
ls|xargs -n1|awk ' {i++;system("mkdir phone"i);system("mv "$0" phone”i ”.txt -t phone"i)}'
which errors with lots of:
mv: cannot stat `phone”i': No such file or directory
and:
ls|xargs -n1|awk ' {i++;system("mkdir phone"i);system("mv "$0" phone”i ”/phone"i”.txt”)}'
error:
awk: 1: unexpected character 0xe2
xargs: /bin/echo: terminated by signal 13
Can anyone help me finish it off? TIA!
Piping ls into xargs into awk is completely unnecessary in this scenario. What you are trying to do can be accomplished using a simple loop:
for f in *.txt; do
i=$((i+1))
dir="phone$i"
mkdir "$dir" && mv "$f" "$dir/$dir.txt"
done
Depending on your shell, the increment of $i can be done in a different way (like ((++i)) in bash) but this way is POSIX-compliant so should work on any modern shell.
By the way, the reason for your original error is that you are using curly quotes ”, which are not understood by the shell. You should always only use single ' and double " in the shell.
Here it is if you want it with incrementing numbers; otherwise Tom Fenech's way is the way to go!
for i in *.txt;do d=phone$((++X));mkdir "$d"; mv "$i" "$d/$d.txt";done
Also, you may want to set X to zero (X=0) before doing this, in case it is already set to something else.

Removing 10 Characters of Filename in Linux

I just downloaded about 600 files from my server and need to remove the last 11 characters from the filename (not including the extension). I use Ubuntu and I am searching for a command to achieve this.
Some examples are as follows:
aarondyne_kh2_13thstruggle_or_1250556383.mus should be renamed to aarondyne_kh2_13thstruggle_or.mus
aarondyne_kh2_darknessofunknow_1250556659.mp3 should be renamed to aarondyne_kh2_darknessofunknow.mp3
It seems that some duplicates might exist after I do this, but if the command fails to complete and tells me what the duplicates would be, I can always remove those manually.
Try using the rename command. It allows you to rename files based on a regular expression:
The following line should work out for you:
rename 's/_\d+(\.[a-z0-9A-Z]+)$/$1/' *
The following changes will occur:
aarondyne_kh2_13thstruggle_or_1250556383.mus renamed as aarondyne_kh2_13thstruggle_or.mus
aarondyne_kh2_darknessofunknow_1250556659.mp3 renamed as aarondyne_kh2_darknessofunknow.mp3
You can check the actions rename will do via specifying the -n flag, like this:
rename -n 's/_\d+(\.[a-z0-9A-Z]+)$/$1/' *
For more information on how to use rename simply open the manpage via: man rename
Not the prettiest, but very simple:
echo "$filename" | sed -e 's!\(.*\)...........\(\.[^.]*\)!\1\2!'
You'll still need to write the rest of the script, but it's pretty simple.
find . -type f -exec sh -c 'mv {} `echo -n {} | sed -E -e "s/[^/]{10}(\\.[^\\.]+)?$/\\1/"`' ";"
One way to go:
You get a list of your files, one per line (by ls maybe), then:
ls....|awk '{o=$0;sub(/_[^_.]*\./,".",$0);print "mv "o" "$0}'
This will print the mv a b commands, e.g.:
kent$ echo "aarondyne_kh2_13thstruggle_or_1250556383.mus"|awk '{o=$0;sub(/_[^_.]*\./,".",$0);print "mv "o" "$0}'
mv aarondyne_kh2_13thstruggle_or_1250556383.mus aarondyne_kh2_13thstruggle_or.mus
To execute, just pipe it to sh.
I assume there are no spaces in your filenames.
This script assumes each file has just one extension. It would, for instance, rename "foo.something.mus" to "foo.mus". To keep all extensions, remove one hash mark (#) from the first line of the loop body. It also assumes that the base of each filename has at least 12 characters, so that removing 11 doesn't leave you with an empty name.
for f in *; do
ext=${f##*.}
base=${f%.$ext}
new_f=${base%???????????}.$ext
if [ -f "$new_f" ]; then
echo "Will not rename $f, $new_f already exists" >&2
else
mv "$f" "$new_f"
fi
done

Unix: How to delete files listed in a file

I have a long text file with a list of file masks I want to delete.
Example:
/tmp/aaa.jpg
/var/www1/*
/var/www/qwerty.php
I need to delete them. I tried rm `cat 1.txt` and it says the list is too long.
I found this command, but when I check folders from the list, some of them still have files:
xargs rm <1.txt
A manual rm call removes files from such folders, so it is not a permissions issue.
This is not very efficient, but will work if you need glob patterns (as in /var/www/*)
for f in $(cat 1.txt) ; do
rm "$f"
done
If you don't have any patterns and are sure your paths in the file do not contain whitespaces or other weird things, you can use xargs like so:
xargs rm < 1.txt
Assuming that the list of files is in the file 1.txt, then do:
xargs rm -r <1.txt
The -r option causes recursion into any directories named in 1.txt.
If any files are read-only, use the -f option to force the deletion:
xargs rm -rf <1.txt
Be cautious with input to any tool that does programmatic deletions. Make certain that the files named in the input file are really to be deleted. Be especially careful about seemingly simple typos. For example, if you enter a space between a file and its suffix, it will appear to be two separate file names:
file .txt
is actually two separate files: file and .txt.
This may not seem so dangerous, but if the typo is something like this:
myoldfiles *
Then instead of deleting all files that begin with myoldfiles, you'll end up deleting myoldfiles and all non-dot-files and directories in the current directory. Probably not what you wanted.
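A cautious habit that catches this kind of typo (just a sketch): run the same list through ls first and review the output, then swap in rm:
xargs ls -ld <1.txt     # review: 'No such file or directory' here usually means a bad entry
xargs rm -r <1.txt      # run the deletion only once the listing looks right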
Use this:
while IFS= read -r file ; do rm -- "$file" ; done < delete.list
If you need glob expansion you can omit quoting $file:
IFS=""
while read -r file ; do rm -- $file ; done < delete.list
But be warned that file names can contain "problematic" content, and I would not use the unquoted version. Imagine this pattern in the file
*
*/*
*/*/*
This would delete quite a lot from the current directory! I would encourage you to prepare the delete list in a way that glob patterns aren't required anymore, and then use quoting like in my first example.
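One way to do that preparation (a sketch; expanded.list is a hypothetical intermediate file): expand the patterns once into concrete paths, inspect the result by hand, then run the quoted loop from the first example on it. Matched names containing whitespace would still end up split across lines, so the inspection step matters.
# word splitting and globbing are intentional on the unquoted $pattern
while IFS= read -r pattern ; do printf '%s\n' $pattern ; done < delete.list > expanded.list
# inspect expanded.list, then:
while IFS= read -r file ; do rm -- "$file" ; done < expanded.list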
You could use '\n' to define the newline character as the delimiter.
xargs -d '\n' rm < 1.txt
Be careful with -rf, because it can delete what you don't want if 1.txt contains paths with spaces. That's why the newline delimiter is a bit safer.
On BSD systems, where xargs may lack the -d option, you can convert the newlines to NUL characters first and use the -0 option like this:
tr '\n' '\0' < 1.txt | xargs -0 rm
xargs -I{} sh -c 'rm "{}"' < 1.txt should do what you want. Be careful with this command as one incorrect entry in that file could cause a lot of trouble.
This answer was edited after @tdavies pointed out that the original did not do shell expansion.
You can use this one-liner:
cat 1.txt | xargs echo rm | sh
Which does shell expansion but executes rm the minimum number of times.
Just to provide another way, you can also simply use the following command:
$ cat to_remove
/tmp/file1
/tmp/file2
/tmp/file3
$ rm $( cat to_remove )
In this particular case, due to the dangers cited in other answers, I would
Edit in e.g. Vim and :%s/\s/\\\0/g, escaping all whitespace characters with a backslash.
Then :%s/^/rm -rf /, prepending the command. With -r you don't have to worry to have directories listed after the files contained therein, and with -f it won't complain due to missing files or duplicate entries.
Run all the commands: $ source 1.txt
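As a worked example (with a hypothetical entry), a list line like /tmp/my file.txt ends up after the two substitutions as
rm -rf /tmp/my\ file.txt
ready for source 1.txt to execute.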
cat 1.txt | xargs rm -f | bash - running this command will delete the listed files only.
cat 1.txt | xargs rm -rf | bash - running this command will also delete directories, recursively.
Here's another looping example. This one also contains an 'if-statement' as an example of checking to see if the entry is a 'file' (or a 'directory' for example):
for f in $(cat 1.txt); do if [ -f $f ]; then rm $f; fi; done
Here you can use set of folders from deletelist.txt while avoiding some patterns as well
foreach f (`cat deletelist.txt`)
rm -rf `ls -d "$f"/* | egrep -v "needthisfile|\.cpp$|\.h$"`
end
This will allow file names to have spaces (reproducible example).
# Select files of interest, here, only text files for ex.
find -type f -exec file {} \; > findresult.txt
grep ": ASCII text$" findresult.txt > textfiles.txt
# leave only the path to the file removing suffix and prefix
sed -i -e 's/:.*$//' textfiles.txt
sed -i -e 's/\.\///' textfiles.txt
#write a script that deletes the files in textfiles.txt
IFS_backup=$IFS
IFS=$(echo "\n\b")
for f in $(cat textfiles.txt);
do
rm "$f";
done
IFS=$IFS_backup
# save script as "some.sh" and run: sh some.sh
In case somebody prefers sed and removing without wildcard expansion:
sed -e "s/^\(.*\)$/rm -f -- \'\1\'/" deletelist.txt | /bin/sh
Reminder: use absolute pathnames in the file or make sure you are in the right directory.
And for completeness the same with awk:
awk '{printf "rm -f -- '\''%s'\''\n",$1}' deletelist.txt | /bin/sh
Wildcard expansion will work if the single quotes are removed, but this is dangerous if a filename contains spaces; you would then need to add quotes around the wildcard parts.
