I have a directory with ~250 .txt files in it. Each of these files has a title like this:
Abraham Lincoln [December 01, 1862].txt
George Washington [October 25, 1790].txt
etc...
However, these are terrible file names for reading into Python, and I want to iterate over all of them to change them to a more suitable format.
I've tried similar things for changing single variables that are shared across many files, but I can't wrap my head around how to iterate over these files and change the formatting of their names while still keeping the same information.
The ideal output would be something like
1862_12_01_abraham_lincoln.txt
1790_10_25_george_washington.txt
etc...
You can try this straightforward (if tedious) bash script:
#!/bin/bash
# Map month names to two-digit numbers
declare -A map=(["January"]="01" ["February"]="02" ["March"]="03" ["April"]="04" ["May"]="05" ["June"]="06" ["July"]="07" ["August"]="08" ["September"]="09" ["October"]="10" ["November"]="11" ["December"]="12")
# Matches: Name [Month DD, YYYY].txt
pat='^([^[]+) \[([A-Za-z]+) ([0-9]+), ([0-9]+)]\.txt$'
for i in *.txt; do
    if [[ $i =~ $pat ]]; then
        newname="$(printf "%s_%s_%s_%s.txt" "${BASH_REMATCH[4]}" "${map["${BASH_REMATCH[2]}"]}" "${BASH_REMATCH[3]}" "$(tr 'A-Z ' 'a-z_' <<< "${BASH_REMATCH[1]}")")"
        mv -- "$i" "$newname"
    fi
done
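Before letting `mv` loose on 250 files, the same regex and month map can be dry-run against a single sample name taken from the question (no files are touched; only December is mapped here for brevity):

```shell
# Dry run: apply the regex and month map to one sample filename.
declare -A map=(["December"]="12")
pat='^([^[]+) \[([A-Za-z]+) ([0-9]+), ([0-9]+)]\.txt$'
sample='Abraham Lincoln [December 01, 1862].txt'
if [[ $sample =~ $pat ]]; then
    # BASH_REMATCH: [1]=name [2]=month [3]=day [4]=year
    newname="$(printf '%s_%s_%s_%s.txt' "${BASH_REMATCH[4]}" "${map[${BASH_REMATCH[2]}]}" \
        "${BASH_REMATCH[3]}" "$(tr 'A-Z ' 'a-z_' <<< "${BASH_REMATCH[1]}")")"
    echo "$newname"   # 1862_12_01_abraham_lincoln.txt
fi
```

Once the echoed names look right, swap the `echo` back for the `mv`.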
for file in *.txt; do
    # extract parts of the filename to be differently formatted with a regex match
    [[ $file =~ (.*)\[(.*)\] ]] || { echo "invalid file $file"; exit 1; }
    # format extracted strings and generate the new filename
    formatted_date=$(date -d "${BASH_REMATCH[2]}" +"%Y_%m_%d")
    name="${BASH_REMATCH[1]// /_}" # replace spaces in the name with underscores
    f="${formatted_date}_${name,,}" # convert name to lower-case and append it to date string
    new_filename="${f::-1}.txt" # remove trailing underscore and add `.txt` extension
    # do what you need here
    echo "$new_filename"
    # mv -- "$file" "$new_filename"
done
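For reference, here is what each step of that loop produces for one sample name (this relies on GNU date for the `-d` parsing; the trailing underscore that `${f::-1}` removes comes from the space captured before the bracket):

```shell
# One iteration of the loop above, unrolled on a sample name (GNU date required).
line='Abraham Lincoln [December 01, 1862].txt'
[[ $line =~ (.*)\[(.*)\] ]]
formatted_date=$(date -d "${BASH_REMATCH[2]}" +"%Y_%m_%d")  # 1862_12_01
name="${BASH_REMATCH[1]// /_}"   # Abraham_Lincoln_  (note the trailing underscore)
f="${formatted_date}_${name,,}"  # 1862_12_01_abraham_lincoln_
new_filename="${f::-1}.txt"      # drop trailing underscore, add extension
echo "$new_filename"             # 1862_12_01_abraham_lincoln.txt
```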
I like to pull the filename apart, then put it back together.
Also, GNU date can parse out the date, which is simpler than using sed or a big case statement to convert "October" to "10".
#!/usr/bin/bash
if [ "$1" == "" ] || [ "$1" == "--help" ]; then
    echo "Give a filename like \"Abraham Lincoln [December 01, 1862].txt\" as an argument"
    exit 2
fi
filename="$1"
# remove the brackets
filename=$(echo "$filename" | sed -e 's/\[//g;s/\]//g')
# cut out the name
namepart=$(echo "$filename" | awk '{ print $1" "$2 }')
# cut out the date
datepart=$(echo "$filename" | awk '{ print $3" "$4" "$5 }' | sed -e 's/\.txt//')
# format up the date (relies on GNU date)
datepart=$(date --date="$datepart" +"%Y_%m_%d")
# put it back together with underscores, in lower case
final=$(echo "$namepart $datepart.txt" | tr '[A-Z]' '[a-z]' | sed -e 's/ /_/g')
echo mv \"$1\" \"$final\"
EDIT: converted to BASH, from Bourne shell.
Related
I have files located in a temp folder that I need to move to another folder; the files are named in sequence like so:
1_492724_860619121.dbf.gz
1_492725_860619121.dbf.gz
1_492726_860619121.dbf.gz
...
1_493069_860619121.dbf.gz
I used to move these files monthly, so I used grep on the month in question:
for i in `ls -ltr | grep Jul|awk '{print $9}'`; do mv $i JulFolder; done
Now I only want to move a range of files based on their name :
from 1_492724_860619121.dbf.gz to 1_493053_860619121.dbf.gz
What is the correct use of the combination of grep and awk to select the desired files?
Note that awk '{print $9}' is used to select the column containing the file names from ls -ltr.
Did you try with a bash range?
mv 1_{492724..493053}_860619121.dbf.gz somefolder/
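One caveat: brace expansion generates every name in the range whether or not the file exists, so a plain mv will complain about any gaps in the sequence. A guarded loop avoids that; the sketch below creates throwaway files in a temp directory (with a deliberate gap) just to illustrate:

```shell
# Brace expansion emits every name in the range; guard each mv with an
# existence test so gaps in the sequence are skipped silently.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir -p somefolder
touch 1_492724_860619121.dbf.gz 1_492726_860619121.dbf.gz  # note the gap at ...725
for f in 1_{492724..492726}_860619121.dbf.gz; do
    [ -e "$f" ] && mv -- "$f" somefolder/
done
ls somefolder
```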
This can be done with plain POSIX-shell grammar:
#!/bin/sh
min=492724
max=493053
src_dir=./
dst_dir=~/somewhere
mkdir -p "$dst_dir"
# Iterate over paths in src_dir matching the pattern
for path in "$src_dir"/1_*_*.dbf.gz; do
    # Trim the leading directory and 1_ prefix from path
    file_part=${path##*/1_}
    # Trim the trailing _* from file_part to keep only the number
    number=${file_part%%_*}
    # Check number is within the desired range
    if [ "$number" -ge "$min" ] && [ "$number" -le "$max" ]; then
        # Move the file
        mv -- "$path" "$dst_dir/"
    fi
done
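The two parameter expansions doing the real work can be checked in isolation on a sample path:

```shell
# How the two trims isolate the sequence number from one sample path.
path=./1_492724_860619121.dbf.gz
file_part=${path##*/1_}   # longest '*/1_' prefix removed -> 492724_860619121.dbf.gz
number=${file_part%%_*}   # longest '_*' suffix removed   -> 492724
echo "$number"
```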
You can try the below (change FROM and TO as needed):
for i in $(ls -1 | awk -F_ '{if($2 >= FROM && $2 <= TO) print $0}' FROM=492724 TO=493053)
do
    mv "$i" toFolder
done
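The awk range filter can be tried on its own by feeding it two sample names, one inside the range and one outside (the trailing FROM=... TO=... arguments are awk command-line variable assignments):

```shell
# The awk filter alone: keeps only names whose second _-separated field
# falls in [FROM, TO].
out=$(printf '1_492724_860619121.dbf.gz\n1_493054_860619121.dbf.gz\n' \
    | awk -F_ '{if($2 >= FROM && $2 <= TO) print $0}' FROM=492724 TO=493053)
echo "$out"   # only the first name survives
```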
I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work; I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!
You can use something like this:
grep '>' *.fasta | while read -r line ; do
    new_name="$(echo "$line" | cut -d' ' -f 6)"
    old_name="$(echo "$line" | cut -d':' -f 1)"
    mv "$old_name" "$new_name.fasta"
done
It searches the *.fasta files and handles every matched line:
it splits each grep result on spaces and takes the 6th field as the new name
it splits each grep result on : and takes the first field as the old name
it moves/renames the file from the old filename to the new filename
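A safe way to try this is on throwaway files in a temp directory. Note it needs at least two .fasta files: grep only prefixes filenames when given more than one file, and the cut on : relies on that prefix.

```shell
# Exercise the rename loop on two throwaway files built from the question's data.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
echo '>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds' > File_1.fasta
echo '>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds' > File_2.fasta
grep '>' *.fasta | while read -r line; do
    new_name="$(echo "$line" | cut -d' ' -f 6)"   # 6th space-separated field: sP...
    old_name="$(echo "$line" | cut -d':' -f 1)"   # filename prefix added by grep
    mv "$old_name" "$new_name.fasta"
done
ls   # sP131957.fasta  sP171215.fasta
```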
There are several things going on with this code.
For a start, I actually don't get this particular error; this might be due to different versions.
It may come down to grep interpreting '>' the same as > because the shell expansion goes wrong. I would suggest using "\>" instead.
Secondly:
fname=$("echo $match|awk '{print $6}'")
The quotes inside serve an unintended purpose. Your code should look like this, if anything:
fname="$(echo $match|awk '{print $6}')"
Lastly, to properly retrieve your data, this should be your final code:
grep -Hr "\>" | while read -r match; do
    fname="$(echo "$match" | cut -d: -f1)"
    new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
    echo mv "$fname" "$new_fname"
done
Explanations:
grep -H -> you want your grep to explicitly use "Include Filename", just in case other shell environments decide to alias grep to grep -h (no filenames)
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
Although, I don't see why you would search for '>' and not directly for 'sP', as in:
grep -Hro "sP[0-9]*" | while read -r match
This is not the exact same behaviour, and has different edge cases, but it just might work for you.
Quite straightforward in (g)awk.
Create a file "script.awk":
FNR == 1 {
    for (i=1; i<=NF; i++) {
        if (index($i, "sP")==1) {
            print "mv", FILENAME, $i ".fasta"
            nextfile
        }
    }
}
Use it:
awk -f script.awk *.fasta > cmmd.txt
Check the content of the output:
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
If it looks OK, execute the renames with: . cmmd.txt
For all fasta files in directory, search their first line for the first word starting with sP and rename them using that word as the basename.
Using a bash array:
for f in *.fasta; do
    arr=( $(head -1 "$f") )
    for word in "${arr[@]}"; do
        [[ "$word" =~ ^sP ]] && echo mv "$f" "${word}.fasta" && break
    done
done
or using grep:
for f in *.fasta; do
    word=$(head -1 "$f" | grep -o "\bsP\w*")
    [ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.
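The grep extraction on its own, applied to one sample header line from the question (`\b` and `\w` are GNU grep extensions):

```shell
# Pull the sP-word out of a single header line.
header='>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds'
word=$(echo "$header" | grep -o "\bsP\w*")
echo "$word"   # sP171215
```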
I just started learning Linux shell scripting. I have to compare these two files in a Linux shell script for version control. Example:
file1.txt
275caa62391ff4f3096b1e8a4975de40 apple
awd6s54g64h6se4h6se45wahae654j6 ball
e4rby1s6y4653a46h153a41bqwa54tvi cat
r53aghe4354hr35a4hr65a46eeh5j45ro castor
file2.txt
275caa62391ff4f3096b1e8a4975de40 apple
js65fg4a64zgr65f4w65ea465fa65gh7 ball
wroghah4a65ejdtse5z4g6sa7H658aw7 candle
wagjh54hr5ae454zrwrh354aha4564re castor
How do I sort these text files into newly added (a name that is in file2 but not in file1), deleted (a name that is in file1 but not in file2), and changed (same name but different checksum)?
I tried using diff, bcompare, and vimdiff, but I am not getting proper output as a text file.
Thanks in advance
I don't know if such a command exists, but I've taken the liberty of writing a sorting mechanism in Bash. Although it works, I suggest you recreate it in a language of your own choice.
#! /bin/bash
# Sets the array delimiter to a newline
IFS=$'\n'
# If $1 is empty, default to 'file1.txt'. Same for $2.
FILE1=${1:-file1.txt}
FILE2=${2:-file2.txt}
DELETED=()
ADDED=()
CHANGED=()
# Loop over array $1 and print content
function array_print {
    # -n creates a "pointer" to an array. This
    # way you can pass large arrays to functions.
    local -n array=$1
    echo "$1: "
    for i in "${array[@]}"; do
        echo "$i"
    done
}
# This function loops over the entries in file_in and checks
# if they exist in file_tst. Unless doubles are found, a
# callback is executed.
function array_sort {
    local file_in="$1"
    local file_tst="$2"
    local callback=${3:-true}
    local -n arr0=$4
    local -n arr1=$5
    while read -r line; do
        tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
        tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
        hit=$(grep "$tst_name" "$file_tst")
        # If found, skip. Nothing is changed.
        [[ $hit != $line ]] || continue
        # Run callback
        $callback "$hit" "$line" arr0 arr1
    done < "$file_in"
}
# If tst is empty, line will be added to not_found. For file 1 this
# means that file doesn't exist in file2, thus is deleted. Otherwise
# the file is changed.
function callback_file1 {
    local tst=$1
    local line=$2
    local -n not_found=$3
    local -n found=$4
    if [[ -z $tst ]]; then
        not_found+=($line)
    else
        found+=($line)
    fi
}
# If tst is empty, line will be added to not_found. For file 2 this
# means that file doesn't exist in file1, thus is added. Since the
# callback for file 1 already filled all the changed files, we do
# nothing with the fourth parameter.
function callback_file2 {
    local tst=$1
    local line=$2
    local -n not_found=$3
    if [[ -z $tst ]]; then
        not_found+=($line)
    fi
}
array_sort "$FILE1" "$FILE2" callback_file1 DELETED CHANGED
array_sort "$FILE2" "$FILE1" callback_file2 ADDED CHANGED
array_print ADDED
array_print DELETED
array_print CHANGED
exit 0
Since it might be hard to understand the code above, I've written it out. I hope it helps :-)
while read -r line; do
    tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
    tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
    hit=$(grep "$tst_name" "$FILE2")
    # If found, skip. Nothing is changed.
    [[ $hit != $line ]] || continue
    # If the name does not occur, it's deleted (exists in
    # file1, but not in file2)
    if [[ -z $hit ]]; then
        DELETED+=($line)
    else
        # If the name occurs, it's changed. Otherwise it would
        # not get here, due to the previous if-statement.
        CHANGED+=($line)
    fi
done < "$FILE1"
while read -r line; do
    tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
    tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
    hit=$(grep "$tst_name" "$FILE1")
    # If found, skip. Nothing is changed.
    [[ $hit != $line ]] || continue
    # If the name does not occur, it's added (exists in
    # file2, but not in file1)
    if [[ -z $hit ]]; then
        ADDED+=($line)
    fi
done < "$FILE2"
Files which are only in file1.txt:
awk 'NR==FNR{a[$2];next} !($2 in a)' file2.txt file1.txt > only_in_file1.txt
Files which are only in file2.txt:
awk 'NR==FNR{a[$2];next} !($2 in a)' file1.txt file2.txt > only_in_file2.txt
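Recreating the sample files from the question shows what the two one-liners produce (NR==FNR is true only while awk reads the first file, so `a` collects the names from it before the second file is scanned):

```shell
# The two awk one-liners applied to the question's sample data.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '275caa62391ff4f3096b1e8a4975de40 apple' \
              'awd6s54g64h6se4h6se45wahae654j6 ball' \
              'e4rby1s6y4653a46h153a41bqwa54tvi cat' \
              'r53aghe4354hr35a4hr65a46eeh5j45ro castor' > file1.txt
printf '%s\n' '275caa62391ff4f3096b1e8a4975de40 apple' \
              'js65fg4a64zgr65f4w65ea465fa65gh7 ball' \
              'wroghah4a65ejdtse5z4g6sa7H658aw7 candle' \
              'wagjh54hr5ae454zrwrh354aha4564re castor' > file2.txt
only_in_1=$(awk 'NR==FNR{a[$2];next} !($2 in a)' file2.txt file1.txt)
only_in_2=$(awk 'NR==FNR{a[$2];next} !($2 in a)' file1.txt file2.txt)
echo "$only_in_1"   # the 'cat' line (deleted)
echo "$only_in_2"   # the 'candle' line (added)
```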
Then something like this answer:
awk compare columns from two files, impute values of another column
e.g:
awk 'FNR==NR{a[$1]=$1;next}{print $0,a[$1]?a[$1]:"NA"}' file2.txt file1.txt | grep NA | awk '{print $1,$2}' > md5sdiffer.txt
You'll need to decide how you want to present these, though.
There might be a more elegant way to handle the final example (as opposed to marking lines NA and then re-filtering), but it's still enough to go on.
I'm new to bash scripting and learning how commands work, and I've stumbled on this problem.
I have a file /home/fedora/file.txt
Inside of the file is like this:
[apple] This is a fruit.
[ball] This is a sport's equipment.
[cat] This is an animal.
What I wanted is to retrieve words between "[" and "]".
What I tried so far is :
while IFS='' read -r line || [[ -n "$line" ]];
do
    echo "$line" | awk -F"[" '{print $2}' | awk -F"]" '{print $1}'
done < /home/fedora/file.txt
I can print the words between "[" and "]".
Then I wanted to put the echoed word into a variable, but I don't know how to.
Any help I will appreciate.
Try this:
variable="$(echo "$line" | awk -F"[" '{print $2}' | awk -F"]" '{print $1}')"
or
variable="$(awk -F'[][]' '{print $2}' <<< "$line")"
or complete
while IFS='[]' read -r foo fruit rest; do echo $fruit; done < file
or with an array:
while IFS='[]' read -ra var; do echo "${var[1]}"; done < file
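For one sample line, here is how the IFS='[]' split lands in the read variables (the field before the opening bracket is empty, which is why the word ends up in the second variable):

```shell
# How IFS='[]' splits one sample line.
line='[apple] This is a fruit.'
IFS='[]' read -r foo fruit rest <<< "$line"
echo "$fruit"   # apple  (foo is empty, rest holds the description)
```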
In addition to using awk, you can use the native parameter expansion/substring extraction provided by bash. Below # indicates a trim from the left, while % is used to trim from the right. (note: a single # or % indicates removal up to the first occurrence, while ## or %% indicates removal of all occurrences):
#!/bin/bash
[ -r "$1" ] || { ## validate input is readable
    printf "error: insufficient input. usage: %s filename\n" "${0##*/}"
    exit 1
}
## read each line and separate label and value
while read -r line || [ -n "$line" ]; do
    label=${line#[}     # trim initial [ from left
    label=${label%%]*}  # trim from ] through end of string
    value=${line##*] }  # trim from left through '] '
    printf " %-8s -> '%s'\n" "$label" "$value"
done <"$1"
exit 0
Input
$ cat dat/labels.txt
[apple] This is a fruit.
[ball] This is a sport's equipment.
[cat] This is an animal.
Output
$ bash readlabel.sh dat/labels.txt
apple -> 'This is a fruit.'
ball -> 'This is a sport's equipment.'
cat -> 'This is an animal.'
So I have this function with the following output:
AGsg4SKKs74s62#
I need to find a way to scramble the characters without deleting anything, i.e., all characters must be present after I scramble them.
I can only use bash utilities, including awk and sed.
echo 'AGsg4SKKs74s62#' | sed 's/./&\n/g' | shuf | tr -d "\n"
Output (e.g.):
S7s64#2gKAGsKs4
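The output is random, so it can't be compared against a fixed string, but sorting the characters of the input and the output shows that the pipeline only reorders them (this relies on GNU sed accepting \n in the replacement, and on shuf from coreutils):

```shell
# Verify the shuf pipeline preserves the multiset of characters.
s='AGsg4SKKs74s62#'
scrambled=$(echo "$s" | sed 's/./&\n/g' | shuf | tr -d '\n')
sorted_in=$(echo "$s" | sed 's/./&\n/g' | sort | tr -d '\n')
sorted_out=$(echo "$scrambled" | sed 's/./&\n/g' | sort | tr -d '\n')
echo "$scrambled"
[ "$sorted_in" = "$sorted_out" ] && echo "same characters"
```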
Here's a pure Bash function that does the job:
scramble() {
    # $1: string to scramble
    # return in variable scramble_ret
    local a=$1 i
    scramble_ret=
    while ((${#a})); do
        ((i=RANDOM%${#a}))
        scramble_ret+=${a:i:1}
        a=${a::i}${a:i+1}
    done
}
See if it works:
$ scramble 'AGsg4SKKs74s62#'
$ echo "$scramble_ret"
G4s6s#2As74SgKK
Looks all right.
I know that you haven't mentioned Perl but it could be done like this:
perl -MList::Util=shuffle -F'' -lane 'print shuffle @F' <<<"AGsg4SKKs74s62#"
-a enables auto-split mode and -F'' sets the field separator to an empty string, so each character goes into a separate array element. The array is shuffled using the function provided by the core module List::Util.
Here is my solution. Usage: shuffleString "any-string". Performance was not a consideration, since this is bash.
function shuffleString() {
    local line="$1"
    local i p
    for i in $(seq 1 ${#line}); do
        p=$((RANDOM % ${#line}))
        if [[ $p -lt $i ]]; then
            line="${line:0:$p}${line:$i:1}${line:$p+1:$i-$p-1}${line:$p:1}${line:$i+1}"
        elif [[ $p -gt $i ]]; then
            line="${line:0:$i}${line:$p:1}${line:$i+1:$p-$i-1}${line:$i:1}${line:$p+1}"
        fi
    done
    echo "$line"
}