GREP a range of files with a numeric filename - linux

I have files that are located in a temp folder that I need to move to another folder, the files are named in sequence as so:
1_492724_860619121.dbf.gz
1_492725_860619121.dbf.gz
1_492726_860619121.dbf.gz
...
1_493069_860619121.dbf.gz
I used to move these files monthly so I used grep on the month in question :
for i in `ls -ltr | grep Jul|awk '{print $9}'`; do mv $i JulFolder; done
Now I only want to move a range of files based on their name :
from 1_492724_860619121.dbf.gz to 1_493053_860619121.dbf.gz
What is the correct way to combine grep and awk to select the desired files?
Note that awk '{print $9}' is used to select the right column containing the files' name from ls -ltr.

Did you try with a bash range?
mv 1_{492724..493053}_860619121.dbf.gz somefolder/
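One thing to keep in mind with the brace range: it expands to every name in the range whether or not the file exists, so mv reports an error for each gap in the sequence. A minimal sketch of a loop that skips missing numbers is below. The helper name move_range and its three parameters are hypothetical; the file-name pattern is the one from the question.

```shell
# move_range FIRST LAST DEST: hypothetical helper, not part of the answer above.
# The body runs in a subshell, so its variables do not leak into the caller.
move_range() (
    n=$1
    mkdir -p -- "$3" || exit 1
    while [ "$n" -le "$2" ]; do
        file="1_${n}_860619121.dbf.gz"
        # Move the file only if it exists, so gaps in the sequence are skipped.
        if [ -e "$file" ]; then
            mv -- "$file" "$3/" || exit 1
        fi
        n=$((n + 1))
    done
)
```

It would be called from the temp folder as, for example, move_range 492724 493053 somefolder.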

Can be done with plain POSIX-shell grammar:
#!/bin/sh
min=492724
max=493053
src_dir=./
dst_dir=~/somewhere
mkdir -p "$dst_dir"
# Iterates path in src_dir matching the pattern
for path in "$src_dir"/1_*_*.dbf.gz; do
# Trims out leading directory and 1_ prefix from path
file_part=${path##*/1_}
# Trims out trailing _* from file_part to keep only number
number=${file_part%%_*}
# Checks number is within desired range
if [ "$number" -ge "$min" ] && [ "$number" -le "$max" ]; then
# Moves the file
mv -- "$path" "$dst_dir/"
fi
done

You can try the following (change FROM and TO as needed):
for i in `ls -1|awk -F_ '{if($2 >= FROM && $2 <= TO) print $0}' FROM=492724 TO=493053`
do
mv -- "$i" toFolder
done

Related

Replace filename to a string of the first line in multiple files in bash

I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work, I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!
You can use something like this:
grep '>' *.fasta | while read -r line ; do
new_name="$(echo "$line" | cut -d' ' -f 6)"
old_name="$(echo "$line" | cut -d':' -f 1)"
mv -- "$old_name" "$new_name.fasta"
done
It searches the *.fasta files and handles every matching line.
It splits each grep result on spaces and takes the 6th field as the new name.
It splits each grep result on : and takes the first field as the old name.
It moves (renames) the old file name to the new one.
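To make the two cut calls concrete, here is a small sketch that applies them to one sample line. The line is built by hand from the question's File_1.fasta header, in the "file:match" form grep prints when it is given several files; nothing is renamed, the mv command is only printed.

```shell
# Sample line in the form grep prints for multiple files: "<file>:<matching line>".
line='File_1.fasta:>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds'

old_name=$(printf '%s\n' "$line" | cut -d':' -f1)   # text before the first colon
new_name=$(printf '%s\n' "$line" | cut -d' ' -f6)   # sixth space-separated field

# Dry run: show the command instead of executing it.
printf 'mv %s %s.fasta\n' "$old_name" "$new_name"
```

This prints mv File_1.fasta sP171215.fasta. Note that the field number only works while every header has the sP word in the same position.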
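There are several things going on with this code.
First, I don't actually get this particular error myself, which is probably down to different grep versions. Most likely the warning appears because -r is given without a file or directory operand, so an older GNU grep falls back to reading standard input; naming the directory to search (for example .) avoids it. The single quotes around '>' already stop the shell from treating it as a redirection, so no extra escaping is needed.
Secondly:
fname=$("echo $match|awk '{print $6}'")
The inner double quotes turn the whole pipeline into a single command name, which is not what you intended. If anything, your code should look like this:
fname="$(echo $match|awk '{print $6}')"
Lastly, a quoted "$(...)" in a for loop yields one single item containing all the output, so read the matches line by line instead. To retrieve your data properly, this should be your final code:
grep -Hr '>' . | while IFS= read -r match; do
fname="$(echo "$match" | cut -d: -f1)"
new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
echo mv "$fname" "$new_fname"
done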
Explanations:
grep -H -> you want your grep to explicitly use "Include Filename", just in case other shell environments decide to alias grep to grep -h (no filenames)
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
That said, I don't see why you would search for '>' rather than directly for 'sP', like this:
grep -Hro "sP[0-9]*" . | while IFS= read -r match
This is not the exact same behaviour, and has different edge cases, but it just might work for you.
Quite straightforward in (g)awk :
create a file "script.awk":
FNR == 1 {
for (i=1; i<=NF; i++) {
if (index($i, "sP")==1) {
print "mv", FILENAME, $i ".fasta"
nextfile
}
}
}
use it :
awk -f script.awk *.fasta > cmmd.txt
check the content of the output.
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
if ok, launch rename with . cmmd.txt
For all fasta files in directory, search their first line for the first word starting with sP and rename them using that word as the basename.
Using a bash array:
for f in *.fasta; do
arr=( $(head -1 "$f") )
for word in "${arr[@]}"; do
[[ "$word" =~ ^sP ]] && echo mv "$f" "${word}.fasta" && break
done
done
or using grep:
for f in *.fasta; do
word=$(head -1 "$f" | grep -o "\bsP\w*")
[ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.

bash - loop through subdirectories, cat files and rename with directory name

I have a folder structure like this ...
data/
---B1/
name_x_1.gz
name_y_1.gz
name_z_2.gz
name_p_2.gz
---C1
name_s_1.gz
name_t_1.gz
name_u_2.gz
name_v_2.gz
I need to go in to each subdirectory (e.g. B1) and perform the following:
cat *_1.gz > B1_1.gz
cat *_2.gz > B1_2.gz
I'm having problems with the file naming part. I can get in directories using the following:
for d in */; do
cat *_1.gz > $d_1.gz
cat *_2.gz > $d_2.gz
done
However I get an error that $d is a directory -- how do I strip the name to create the concatenated filename?
Thanks
Taking your question verbatim: if you have a variable d that you know ends in / (as is the case in your example), you can get its value with the last character stripped by writing ${d:0:-1}, i.e. the substring from the beginning up to, but excluding, the last character. The negative length requires bash 4.2 or later.
Of course in your case, I would rather write the loop as
for d in *; do
which already creates the names without a trailing slash. But this is still probably not what you want, because d would take the names of the entries in the directory you have cd'ed to, whereas you want the name of the directory itself. You can obtain this with, for instance, $(basename "$PWD"), which turns your loop into something like
cd B1
prefix=$(basename "$PWD") # This set prefix to B1
for f in *
do
# Since your original code indicates that you want to create a *copy* of the file
# with a new name, I do the same here.
cp -v "$f" "${prefix}_$f" #
done
You can also use cat, as in your original solution, if you prefer.
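As a side note on the naming problem itself, the sketch below (an illustration, not part of the answer above) shows the POSIX ${d%/} expansion as an alternative to ${d:0:-1}, and why braces matter when an underscore follows the variable name. The value B1/ is taken from the question.

```shell
d='B1/'             # what the glob */ yields for the directory B1
name=${d%/}         # remove one trailing slash, if present
# Braces are required here: "$name_1.gz" would expand a variable called name_1.
out="${name}_1.gz"
printf '%s\n' "$out"
```

This prints B1_1.gz. The same parsing rule is why $d_1.gz in the question does not refer to the variable d at all.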
If you're using bash, parameter expansion lets you do everything in the shell itself, without spawning an external process such as basename. The ${dir##*/} expansion used here is also POSIX compliant.
#!/bin/bash
for dir in data/*; do
cat "$dir/"*_1.gz > "$dir/${dir##*/}_1.gz"
cat "$dir/"*_2.gz > "$dir/${dir##*/}_2.gz"
done
Sure, just descend into the directory.
# assuming PWD = data/
for d in */; do
(
cd "$d"
cat *_1.gz > "$(basename "$d")"_1.gz
cat *_2.gz > "$(basename "$d")"_2.gz
)
done
how do I strip the name to create the concatenated filename?
The simplest and most portable is with basename.
This requires Ed, which should hopefully be present on your machine. If not, I trust your distribution will have a package for it.
#!/bin/sh
cat >> edprint+.txt << EOF
1p
q
EOF
cat >> edpop+.txt << EOF
1d
wq
EOF
b1="${PWD}/data/B1"
c1="${PWD}/data/C1"
find "${b1}" -maxdepth 1 -type f > b1stack
find "${c1}" -maxdepth 1 -type f > c1stack
while [ $(wc -l b1stack | cut -d' ' -f1) -gt 0 ]
do
b1line=$(ed -s b1stack < edprint+.txt)
b1name=$(basename "${b1line}")
b1suffix=$(echo "${b1name}" | cut -d'_' -f3)
b1fixed="B1_${b1suffix}"
# Caution: files that share a suffix map to the same target, so later moves overwrite earlier ones
mv -v "${b1line}" "${b1}/${b1fixed}"
ed -s b1stack < edpop+.txt
done
while [ $(wc -l c1stack | cut -d' ' -f1) -gt 0 ]
do
c1line=$(ed -s c1stack < edprint+.txt)
c1name=$(basename "${c1line}")
c1suffix=$(echo "${c1name}" | cut -d'_' -f3)
c1fixed="C1_${c1suffix}"
# Caution: files that share a suffix map to the same target, so later moves overwrite earlier ones
mv -v "${c1line}" "${c1}/${c1fixed}"
ed -s c1stack < edpop+.txt
done
rm -v ./edprint+.txt
rm -v ./edpop+.txt
rm -v ./b1stack
rm -v ./c1stack

How to clean up multiple file names using bash?

I have a directory with ~250 .txt files in it. Each of these files has a title like this:
Abraham Lincoln [December 01, 1862].txt
George Washington [October 25, 1790].txt
etc...
However, these are terrible file names for reading into python and I want to iterate over all of them to change them to a more suitable format.
I've tried similar things for changing single variables that are shared across many files. But I can't wrap my head around how I should iterate over these files and change the formatting of their names while still keeping the same information.
The ideal output would be something like
1862_12_01_abraham_lincoln.txt
1790_10_25_george_washington.txt
etc...
Please try the straightforward (tedious) bash script:
#!/bin/bash
declare -A map=(["January"]="01" ["February"]="02" ["March"]="03" ["April"]="04" ["May"]="05" ["June"]="06" ["July"]="07" ["August"]="08" ["September"]="09" ["October"]="10" ["November"]="11" ["December"]="12")
pat='^([^[]+) \[([A-Za-z]+) ([0-9]+), ([0-9]+)]\.txt$'
for i in *.txt; do
if [[ $i =~ $pat ]]; then
newname="$(printf "%s_%s_%s_%s.txt" "${BASH_REMATCH[4]}" "${map["${BASH_REMATCH[2]}"]}" "${BASH_REMATCH[3]}" "$(tr 'A-Z ' 'a-z_' <<< "${BASH_REMATCH[1]}")")"
mv -- "$i" "$newname"
fi
done
for file in *.txt; do
# extract parts of the filename to be differently formatted with a regex match
[[ $file =~ (.*)\[(.*)\] ]] || { echo "invalid file $file"; exit; }
# format extracted strings and generate the new filename
formatted_date=$(date -d "${BASH_REMATCH[2]}" +"%Y_%m_%d")
name="${BASH_REMATCH[1]// /_}" # replace spaces in the name with underscores
f="${formatted_date}_${name,,}" # convert name to lower-case and append it to date string
new_filename="${f::-1}.txt" # remove trailing underscore and add `.txt` extension
# do what you need here
echo "$new_filename"
# mv -- "$file" "$new_filename"
done
I like to pull the filename apart, then put it back together.
GNU date can also parse the date for you, which is simpler than using sed or a big case statement to convert "October" to "10".
#! /usr/bin/bash
if [ "$1" == "" ] || [ "$1" == "--help" ]; then
echo "Give a filename like \"Abraham Lincoln [December 01, 1862].txt\" as an argument"
exit 2
fi
filename="$1"
# remove the brackets
filename=`echo "$filename" | sed -e 's/[\[]//g;s/\]//g'`
# cut out the name
namepart=`echo "$filename" | awk '{ print $1" "$2 }'`
# cut out the date
datepart=`echo "$filename" | awk '{ print $3" "$4" "$5 }' | sed -e 's/\.txt//'`
# format up the date (relies on GNU date)
datepart=`date --date="$datepart" +"%Y_%m_%d"`
# put it back together with underscores, in lower case
final=`echo "$namepart $datepart.txt" | tr '[A-Z]' '[a-z]' | sed -e 's/ /_/g'`
echo mv \"$1\" \"$final\"
EDIT: converted to BASH, from Bourne shell.
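The awk field numbers in this script assume a name of exactly two words. As a hedged alternative for the "pull the filename apart" step, the sketch below splits on the bracket instead, using only POSIX parameter expansion. The helper name split_title and the | separator in its output are hypothetical choices for illustration.

```shell
# split_title FILE: hypothetical helper that prints "<name>|<date>" for a title
# such as "Abraham Lincoln [December 01, 1862].txt".
# The body runs in a subshell, so its variables do not leak into the caller.
split_title() (
    base=${1%.txt}              # drop the extension
    name=${base%% \[*}          # everything before " ["
    date_part=${base#*\[}       # everything after the "["
    date_part=${date_part%\]}   # drop the closing "]"
    printf '%s|%s\n' "$name" "$date_part"
)
```

For the first file in the question, split_title 'Abraham Lincoln [December 01, 1862].txt' prints Abraham Lincoln|December 01, 1862. The date part can then be handed to date --date as in the script above.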

Find file with largest number of lines in single directory

I'm trying to create a function that only outputs the file with the largest number of lines in a directory (and not any sub-directories). I'm being asked to make use of the wc function but don't really understand how to read each file individually and then sort them just to find the largest. Here is what I have so far:
#!/bin/bash
function sort {
[ $# -ne 1 ] && echo "Invalid number of arguments">&2 && exit 1;
[ ! -d "$1" ] && echo "Invalid input: not a directory">&2 && exit 1;
# Insert function here ;
}
# prompt if wanting current directory
# if yes
# sort $PWD
# if no
#sort $directory
This solution is almost pure Bash (wc is the only external command used):
shopt -s dotglob # Include filenames with initial '.' in globs
shopt -s nullglob # Make globs produce nothing when nothing matches
dir=$1
maxlines=-1
maxfile=
for file in "$dir"/* ; do
[[ -f $file ]] || continue # Skip non-files
[[ -L $file ]] && continue # Skip symlinks
numlines=$(wc -l < "$file")
if (( numlines > maxlines )) ; then
maxfile=$file
maxlines=$numlines
fi
done
[[ -n "$maxfile" ]] && printf '%s\n' "$maxfile"
Remove the shopt -s dotglob if you don't want to process files whose names begin with a dot. Remove the [[ -L $file ]] && continue if you want to process symlinks to files.
This solution should handle all filenames (ones containing spaces, ones containing glob characters, ones beginning with '-', ones containing newlines, ...), but it runs wc for each file so it may be unacceptably slow compared to solutions that feed many files to wc at once if you need to handle directories that have large numbers of files.
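For comparison, here is a rough sketch of the batch approach mentioned above, where find hands many files to a single wc call. It is not as robust as the loop: it assumes a find that supports -maxdepth (GNU and BSD do, POSIX does not require it) and file names without newlines. The function name most_lines is hypothetical.

```shell
# most_lines DIR: hypothetical helper that prints the path of the regular file
# directly inside DIR that has the most lines.
most_lines() {
    find "$1" -maxdepth 1 -type f -exec wc -l {} + |
        grep -v '^ *[0-9]* total$' |   # drop wc's summary line(s)
        sort -n |                      # smallest count first
        tail -n 1 |                    # keep the largest
        sed 's/^ *[0-9]* //'           # strip the count, leaving the path
}
```

Because find prefixes every path with DIR, a file that is literally named total still shows up as DIR/total and is not mistaken for the summary line.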
How about this:
wc -l * | sort -nr | head -2 | tail -1
wc -l counts lines (you get an error for directories, though); sort -nr sorts in reverse order, treating the first column as a number; head -2 takes the first two lines and tail -1 keeps the second of them, since we need to skip over the total line.
wc -l * 2>/dev/null | sort -nr | head -2 | tail -1
The 2>/dev/null throws away all the errors, if you want a neater output.
Use a function like this:
my_custom_sort() {
for i in "${1+$1/}"*; do
[[ -f "$i" ]] && wc -l "$i"
done | sort -n | tail -n1 | cut -d" " -f2
}
And use it with or without directory (in latter case, it uses the current directory):
my_custom_sort /tmp
/tmp/helloworld.txt

How can I count the different file types within a folder using linux terminal?

Hey, I'm stuck on how to count the number of files of each type/extension recursively in a folder. I also need to print the counts to a .txt file.
For example, I have 10 .txt files and 20 .docx files mixed up in multiple folders.
Help me !
find ./ -type f |awk -F . '{print $NF}' | sort | awk '{count[$1]++}END{for(j in count) print j,"("count[j]" occurrences)"}'
Gets all filenames with find, then uses awk to get the extension, then uses awk again to count the occurrences.
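Since the question also asks for the result in a .txt file, here is a minimal sketch of the same find-then-count idea that writes to a file. The function name count_extensions and its two parameters are hypothetical. It only looks at files whose name contains a dot, and it assumes file names without newlines.

```shell
# count_extensions DIR OUTFILE: hypothetical helper that writes one
# "<count> <extension>" line per extension found under DIR, most common first.
count_extensions() {
    find "$1" -type f -name '*.*' |   # only names that contain a dot
        sed 's/.*\.//' |              # keep the text after the last dot
        sort | uniq -c |              # count identical extensions
        sort -rn > "$2"
}
```

A call could look like count_extensions . /tmp/counts.txt. Keeping the output file outside the folder being counted avoids it showing up in its own totals.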
Just with bash: version 4 required for this code
#!/bin/bash
shopt -s globstar nullglob
declare -A exts
for f in **/*; do # with globstar, **/* already includes the top-level entries
[[ -f $f ]] || continue # only count files
filename=${f##*/} # remove directories from pathname
ext=${filename##*.}
[[ $filename == "$ext" ]] && ext="no_extension"
: ${exts[$ext]=0} # initialize array element if unset
(( exts[$ext]++ ))
done
for ext in "${!exts[@]}"; do
echo "$ext ${exts[$ext]}"
done | sort -k2nr | column -t
this one seems unsolved so far, so here is how far I got counting files and ordering them:
find . -type f | sed -n 's/..*\.//p' | sort -f | uniq -ic
