Linux copy file and rename to substring of filename - linux

My files got this structure:
mynewfile-runtime-tested-1102-19.4-alpha.zip
mysdk-sdk-tested-1102-19.4-alpha.zip
sources-tested-1102-19.4-alpha.zip
I looking for a way how to dynamically detect and drop the suffix of tested-1102-19.4-alpha and to copy the files with new names so it will look like:
mynewfile-runtime.zip
mysdk-sdk.zip
sources.zip
The suffix should be detected dynamically by delimiters ('-'), my other chunk of file have suffix like nottested-404-11.2.34-beta and the other one is final-01-1-release. The only thing remain constant is the delimiter of '-'

for file in *.zip; do
mv "$file" "${file%-*-*-*-*.zip}.zip"
done
This is fully portable POSIX shell, without forks to sed or other programs.
The ${file%pattern} bit says to remove the shortest matching string.
You can also remove the longest match with %%, or from the left with # and ##, respectively.

To only move the files that match the pattern you can do this:
#!/bin/sh
suffix='-*-*-*-*.zip'
for file in *$suffix
do
trimmed=${file%$suffix}
echo mv "$file" "$trimmed".zip
done
Remove the echo when you are confident with the result.

Related

How to rename fasta header based on filename in multiple files?

I have a directory with multiple fasta file named as followed:
BC-1_bin_1_genes.faa
BC-1_bin_2_genes.faa
BC-1_bin_3_genes.faa
BC-1_bin_4_genes.faa
etc. (about 200 individual files)
The fasta header look like this:
>BC-1_k127_3926653_6 # 4457 # 5341 # -1 # ID=2_6;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.697
I now want to add the filename to the header since I want to annotate the sequences for each file.I tried the following:
for file in *.faa;
do
sed -i "s/>.*/${file%%.*}/" "$file" ;
done
It worked partially but it removed the ">" from the header which is essential for the fasta file. I tried to modify the "${file%%.*}" part to keep the carrot but it always called me out on bad substitutions.
I also tried this:
awk '/>/{sub(">","&"FILENAME"_");sub(/\.faa/,x)}1' *.faa
This worked in theory but only printed everything on my terminal rather than changing it in the respective files.
Could someone assist with this?
It's not clear whether you want to replace the earlier header, or add to it. Both scenarios are easy to do. Don't replace text you don't want to replace.
for file in ./*.faa;
do
sed -i "s/^>.*/>${file%%.*}/" "$file"
done
will replace the header, but include a leading > in the replacement, effectively preserving it; and
for file in ./*.faa;
do
sed -i "s/^>.*/&${file%%.*}/" "$file"
done
will append the file name at the end of the header (& in the replacement string evaluates to the string we are replacing, again effectively preserving it).
For another variation, try
for file in *.faa;
do
sed -i "/^>/s/\$/ ${file%%.*}/" "$file"
done
which says on lines which match the regex ^>, replace the empty string at the end of the line $ with the file name.
Of course, your Awk script could easily be fixed, too. Standard Awk does not have an option to parallel the -i "in-place" option of sed, but you can easily use a temporary file:
for file in ./*.faa;
do
awk '/>/{ $0 = $0 " " FILENAME);sub(/\.faa/,"")}1' "$file" >"$file.tmp" &&
mv "$file.tmp" "$file"
done
GNU Awk also has an -i inplace extension which you could simply add to the options of your existing script if you have GNU Awk.
Since FASTA files typically contain multiple headers, adding to the header rather than replacing all headers in a file with the same string seems more useful, so I changed your Awk script to do that instead.
For what it's worth, the name of the character ^ is caret (carrot is 🥕). The character > is called greater than or right angle bracket, or right broket or sometimes just wedge.
You just need to detect the pattern to replace and use regex to implement it:
fasta_helper.sh
location=$1
for file in $location/*.faa
do
full_filename=${file##*/}
filename="${full_filename%.*}"
#scape special chars
filename=$(echo $filename | sed 's_/_\\/_g')
echo "adding file name: $filename to: $full_filename"
sed -i -E "s/^[^#]+/>$filename /" $location/$full_filename
done
usage:
Just pass the folder with fasta files:
bash fasta_helper.sh /foo/bar
test:
lectures
Regex: matching up to the first occurrence of a character
Extract filename and extension in Bash
https://unix.stackexchange.com/questions/78625/using-sed-to-find-and-replace-complex-string-preferrably-with-regex
Locating your files
Suggesting to first identify your files with find command or ls command.
find . -type f -name "*.faa" -printf "%f\n"
A find command to print only file with filenames extension .faa. Including sub directories to current directory.
ls -1 "*.faa"
An ls command to print files and directories with extension .faa. In current directory.
Processing your files
Once you have the correct files list, iterate over the list and apply sed command.
for fileName in $(find . -type f -name "*.faa" -printf "%f\n"); do
stripedFileName=${fileName/.*/} # strip extension .faa
sed -i "1s|\$| $stripedFileName|" "fileName" # append value of stripedFileName at end of line 1
done

rename all files after extension

Is it possible to write a script to rename all files after the extension?
Example in the folder, there are :
hello.txt-123ahr
bye.txt-56athe
test.txt-98hg12
I want the output:
hello.txt
bye.txt
test.txt
If you just want to remove everything from the dash forwards, you can use Parameter expansion:
#!/bin/bash
for file in *.txt-* ; do
mv "$file" "${file%-*}"
done
Where ${file%-*} means "remove from $file everytning from the last dash". If you want to start from the first dash, use %%.
Note that you might overwrite some files if their leading parts are equivalent, e.g. hello.txt-123abc and hello.txt-456xyz.

find returning inverted results

In a few words a wrote this little script to clean up some directories where I had consolidated directories/files from multiple sources where I used the cp command with the --backup=numbered feature so that files with identical names would have a suffix like .~1~ appended to avoid overwriting. I then ran fdupes to remove duplicate files, in some cases fdupes removed the file which did not have the suffix appended from the cp command (the original file) so I wanted to scan the directories looking for files with the suffix appended by the cp command and if the file does not exist with the suffix removed I would move mv the file otherwise I would leave it to avoid deleting anything as fdupes did not think it was a duplicate.
The issues is the test condition if [ -f ... ] part of the code below returns inverted results than what it should and I cannot understand why. For example, when the file exists it would return false and when the file did not exist it would return true. I fixed it by reversing the actions that I wanted to do based on the inverted return code and verified it was working as intended and it was so I ran it as such but would like to know if anyone knows why it would behave the way it did. I am not a bash script expert by any means so its possible that I missed something simple.
#!/bin/bash
logfile=$$.log
exec > $logfile 2>&1
IFS='
'
#set -f
for FILE in $(find . -type f -regextype posix-extended -regex '^.*(\.~[0-9]+~)+$')
do
FILE2=${FILE%%.~[0-9]*} # remove the suffix
if [ -f "${FILE2}" ]
then
echo ERROR: "${FILE2}" already exists!
else
echo "${FILE}" renamed "${FILE2}"
mv "${FILE}" "${FILE2}"
fi
done
You might be able to see the problem by modifying your script to show both FILE and FILE2 in the error message. There are a few minor problems with the script which could cause some confusion (but not the "inverted" logic):
find output is not sorted. If you had more than one backup file, a randomly chosen one would replace the original file;
you could sort the output using an expression like |sort -t~ -n -k2 on the end of the find-command.
the regular expression allows multiple matches of the ~[0-9]~ pattern. Conceivably you could have some odd file which ends with ~1~~2~.
the part where the suffix is removed assumes a single ~[0-9]~ is on the end of the filename. An embedded ~0, e.g., foo~0bar~1~ would reduce FILE to foo. The workaround for that would be more cumbersome (since the suffix-stripping uses globbing), but could be done with a case statement which matched an explicit number of digits (likely three digits would be enough).

Linux : Copy Multiple files and remove it's extension

I have got a directory containing files of type *.cpp.So i would like to copy each file in the directory and to paste it in the same directory by using
cp -a *.cpp
with an option to remove the .cpp while pasting.Is it possible ?
Here is a simple bash script. This script assumes that the file name only contains one "." character and splits based on that.
#!/bin/sh
for f in *.cpp; do
#This line splits the file name on the delimiter "."
baseName=`echo $f | cut -d "." -f 1`
newExtension=".new"
cp $f $baseName$newExtension
done
You can do this just by means of bash parameter extension, as mentioned in the bash manual:
${parameter%%word}
Remove matching suffix pattern. The word is expanded to produce
a pattern just as in pathname expansion. If the pattern matches
a trailing portion of the expanded value of parameter, then the
result of the expansion is the expanded value of parameter with
the shortest matching pattern (the ``%'' case) or the longest
matching pattern (the ``%%'' case) deleted.
...
for i in *.cpp
do
cp -a $i ${i%%.cpp}
done
Can use rename, optionally with -f to force rewrite existing files.
rename -f 's/\.ext$//' *.ext
To preview actions, but don't change files, use -n switch (No action).
This is not copy but move :-(

How to remove the extension of a file?

I have a folder that is full of .bak files and some other files also. I need to remove the extension of all .bak files in that folder. How do I make a command which will accept a folder name and then remove the extension of all .bak files in that folder ?
Thanks.
To remove a string from the end of a BASH variable, use the ${var%ending} syntax. It's one of a number of string manipulations available to you in BASH.
Use it like this:
# Run in the same directory as the files
for FILENAME in *.bak; do mv "$FILENAME" "${FILENAME%.bak}"; done
That works nicely as a one-liner, but you could also wrap it as a script to work in an arbitrary directory:
# If we're passed a parameter, cd into that directory. Otherwise, do nothing.
if [ -n "$1" ]; then
cd "$1"
fi
for FILENAME in *.bak; do mv "$FILENAME" "${FILENAME%.bak}"; done
Note that while quoting your variables is almost always a good practice, the for FILENAME in *.bak is still dangerous if any of your filenames might contain spaces. Read David W.'s answer for a more-robust solution, and this document for alternative solutions.
There are several ways to remove file suffixes:
In BASH and Kornshell, you can use the environment variable filtering. Search for ${parameter%word} in the BASH manpage for complete information. Basically, # is a left filter and % is a right filter. You can remember this because # is to the left of %.
If you use a double filter (i.e. ## or %%, you are trying to filter on the biggest match. If you have a single filter (i.e. # or %, you are trying to filter on the smallest match.
What matches is filtered out and you get the rest of the string:
file="this/is/my/file/name.txt"
echo ${file#*/} #Matches is "this/` and will print out "is/my/file/name.txt"
echo ${file##*/} #Matches "this/is/my/file/" and will print out "name.txt"
echo ${file%/*} #Matches "/name.txt" and will print out "/this/is/my/file"
echo ${file%%/*} #Matches "/is/my/file/name.txt" and will print out "this"
Notice this is a glob match and not a regular expression match!. If you want to remove a file suffix:
file_sans_ext=${file%.*}
The .* will match on the period and all characters after it. Since it is a single %, it will match on the smallest glob on the right side of the string. If the filter can't match anything, it the same as your original string.
You can verify a file suffix with something like this:
if [ "${file}" != "${file%.bak}" ]
then
echo "$file is a type '.bak' file"
else
echo "$file is not a type '.bak' file"
fi
Or you could do this:
file_suffix=$(file##*.}
echo "My file is a file '.$file_suffix'"
Note that this will remove the period of the file extension.
Next, we will loop:
find . -name "*.bak" -print0 | while read -d $'\0' file
do
echo "mv '$file' '${file%.bak}'"
done | tee find.out
The find command finds the files you specify. The -print0 separates out the names of the files with a NUL symbol -- which is one of the few characters not allowed in a file name. The -d $\0means that your input separators are NUL symbols. See how nicely thefind -print0andread -d $'\0'` together?
You should almost never use the for file in $(*.bak) method. This will fail if the files have any white space in the name.
Notice that this command doesn't actually move any files. Instead, it produces a find.out file with a list of all the file renames. You should always do something like this when you do commands that operate on massive amounts of files just to be sure everything is fine.
Once you've determined that all the commands in find.out are correct, you can run it like a shell script:
$ bash find.out
rename .bak '' *.bak
(rename is in the util-linux package)
Caveat: there is no error checking:
#!/bin/bash
cd "$1"
for i in *.bak ; do mv -f "$i" "${i%%.bak}" ; done
You can always use the find command to get all the subdirectories
for FILENAME in `find . -name "*.bak"`; do mv --force "$FILENAME" "${FILENAME%.bak}"; done

Resources