In my script I am taking a text file and splitting it into sections. Before doing any splitting, I am reformatting the name of the text file. PROBLEM: creating a folder/directory named after the formatted file name; this is where the segments are placed. The script breaks when the text file has spaces in its name, which is exactly why I am trying to reformat the name first and then do the rest of the operations. How could I do so in that sequence?
execute script: text_split.sh -s "my File .txt" -c 2
text_split.sh
# remove whitespace and format file name
FILE_PATH="/archive/"
find $FILE_PATH -type f -exec bash -c 'mv "$1" "$(echo "$1" \
| sed -re '\''s/^([^-]*)-\s*([^\.]*)/\L\1\E-\2/'\'' -e '\''s/ /_/g'\'' -e '\''s/_-/-/g'\'')"' - {} \;
sleep 1
# arg1: path to input file / source
# create directory
function fallback_out_file_format() {
__FILE_NAME=`rev <<< "$1" | cut -d"." -f2- | rev`
__FILE_EXT=`rev <<< "$1" | cut -d"." -f1 | rev`
mkdir -p $FILE_PATH${__FILE_NAME};
__OUT_FILE_FORMAT="$FILE_PATH${__FILE_NAME}"/"${__FILE_NAME}-part-%03d.${__FILE_EXT}"
echo $__OUT_FILE_FORMAT
exit 1
}
# Set variables and default values
OUT_FILE_FORMAT=''
# Grab input arguments
while getopts "s:c" OPTION
do
case $OPTION in
s) SOURCE=$(echo "$OPTARG" | sed 's/ /\\ /g' ) ;;
c) CHUNK_LEN="$OPTARG" ;;
?) usage
exit 1
;;
esac
done
if [ -z "$OUT_FILE_FORMAT" ] ; then
OUT_FILE_FORMAT=$(fallback_out_file_format $SOURCE)
fi
Your script takes a filename argument, specified with -s, then modifies a hard-coded directory by renaming the files it contains, then uses the initial filename to generate an output directory and filename. It definitely sounds like the workflow should be adjusted. For instance, instead of trying to correct all the bad filenames in /archive/, just fix the name of the file specified with -s.
To get filename and extension, use bash's string manipulation ability, as shown in this question:
filename="${fullfile##*/}"
extension="${filename##*.}"
name="${filename%.*}"
You can trim whitespace from the input string using tr -d ' '.
You can then join this to your FILE_PATH variable with something like this:
FILE_NAME=$(echo "$1" | tr -d ' ')
FILE_PATH="/archive/"
FILE_PATH=$FILE_PATH$FILE_NAME
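Put together as a replacement for the renaming step and the fallback_out_file_format logic, a minimal sketch could look like this (it assumes $SOURCE holds the -s value as a bare filename directly under $FILE_PATH; quoting every expansion is what keeps the spaces from breaking anything):
FILE_PATH="/archive/"
CLEAN_NAME=$(basename "$SOURCE" | tr -d ' ')   # "my File .txt" -> "myFile.txt"
mv "$FILE_PATH$SOURCE" "$FILE_PATH$CLEAN_NAME" # rename only this one file
BASE="${CLEAN_NAME%.*}"                        # "myFile"
EXT="${CLEAN_NAME##*.}"                        # "txt"
mkdir -p "$FILE_PATH$BASE"                     # directory named after the file
OUT_FILE_FORMAT="$FILE_PATH$BASE/${BASE}-part-%03d.$EXT"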
You can escape the space using a backslash \
Now the user may not always provide the backslash, so the script can use sed to insert a \ before every space:
sed 's/ /\\ /g'
You can obtain the new directory name as
dir_name=`echo "$1" | sed 's/ /\\ /g'`
I have an input file input.txt that contains the following values:
# time(t) Temperature Pressure Velocity(u, v, w)
t T P u v w
0 T0 P0 (u0 v0 w0)
0.0015 T1 P1 (u1 v1 w1)
0.0021 T2 P2 (u2 v2 w2)
0.0028 T3 P3 (u3 v3 w3)
0.0031 T4 P4 (u4 v4 w4)
0.0041 T5 P5 (u5 v5 w5)
... ... ... ... ...
... ... ... ... ...
1.5001 TN PN (uN vN wN)
where Ti, Pi, ui, vi, and wi for i = 0 to N are floating-point numbers.
I have, on the other hand, some directories that correspond to the times:
0 # this is a directory
0.0015 # this is a directory also
0.0021 # ...etc.
0.0028
0.0031
...
...
I have a template myTemplate.txt file that looks like the following:
# This is my template file
The time of the simulation is: {%TIME%}
The Temperature is {%T%}
The pressure is {%P%}
The velocity vector is: ({%U%} {%V%} {%W%})
My goal is to create a file output.txt under each time directory using the template file myTemplate.txt and populate the values from the input file input.txt.
I have tried the following:
# assume the name of the directory perfectly matches the time in input file
inputfile="input.txt"
times = $(find . -maxdepth 1 -type d)
for eachTime in $times
do
line=$(sed -n "/^$eachTime/p" $inputfile)
T=$(echo "$line" cut -f2 ) # get temperature
P=$(echo "$line" | cut -f3 ) # get pressure
U=$(echo "$line" | cut -f4 | tr -d '(') # remove '('
V=$(echo "$line" | cut -f5 )
W=$(echo "$line" | cut -f6 | tr -d ')' ) # remove ')'
# I am stuck here, How can I generate a file output.txt from
# the template and save it under the directory.
done
I am stuck in the step where I need to populate the values in the template file and generate a file output.txt under each directory.
Any help on how to achieve that, or maybe a suggestion for an efficient way to accomplish this task using standard Linux utilities such as sed or awk, is very much appreciated.
I have adapted your bash script which contains multiple typos/errors.
This is not the most efficient way to accomplish this but I have tested it on your data and it works:
Create a script file generate.sh:
#!/bin/bash
timedir=$(find * -maxdepth 1 -type d) # use * to get rid of ./ at the beginning
templateFile='./myTemplate.txt' # the path to your template file
for eachTime in $timedir
do
# use bash substitution to replace . with \. in times
# in order to avoid unexpected matches
line="$(grep -m 1 -e '^'${eachTime//./\.} input.txt)"
if [ -z "$line" ]
then
echo "***Error***: Data at time: $eachTime were not found!" >&2
exit 1
fi
# the line below is redundant since time is already known
# replace tabs and spaces with a single space
line=$(echo "$line" | tr -s '[:blank:]' ' ' )
Time=$(echo "$line" | cut -d' ' -f1 )
Temperature=$(echo "$line" | cut -d' ' -f2 )
Pressure=$(echo "$line" | cut -d' ' -f3 )
U=$(echo "$line" | tr -d '()' | cut -d' ' -f4 )
V=$(echo "$line" | tr -d '()' | cut -d' ' -f5 )
W=$(echo "$line" | tr -d '()' | cut -d' ' -f6 )
# Create a temporary file
buff_file="$(mktemp)"
# Copy the template to that file
cp "$templateFile" "$buff_file"
# Use sed to replace the values
sed -i "s/{%TIME%\}/$eachTime/g" "$buff_file"
sed -i "s/{%T%}/$Temperature/g" "$buff_file"
sed -i "s/{%P%}/$Pressure/g" "$buff_file"
sed -i "s/{%U%}/$U/g" "$buff_file"
sed -i "s/{%V%}/$V/g" "$buff_file"
sed -i "s/{%W%}/$W/g" "$buff_file"
# Copy that temporary file under the time directory
cp "$buff_file" "$eachTime"/output.txt
# delete the temporary file
rm "$buff_file"
done
echo "Done!"
Run the script:
chmod +x generate.sh
./generate.sh
I have checked that a file output.txt is created under each time directory and contains the correct values from input.txt. The script should also raise an error if a time is not found.
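For instance, with the template above, the generated 0.0015/output.txt should come out as:
# This is my template file
The time of the simulation is: 0.0015
The Temperature is T1
The pressure is P1
The velocity vector is: (u1 v1 w1)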
This is a working prototype; note that there is no error handling for missing directories, wrong input formatting, etc.
$ awk 'NR==FNR {temp=temp sep $0; sep=ORS;next}
FNR==2 {for(i=1;i<=NF;i++) h[$i]=i}
FNR>2 {text=temp;
sub("{%TIME%}", $h["t"], text);
# add other sub(..., text) substitutions!
print text > ($1 "/output.txt")}' template.txt input.txt
This only replaces the time, but you can repeat the same pattern for the other variables.
Reads the template file and saves it in variable temp. Reads the input file and captures the header names into array h for easy reference. For each data line, do the replacements and save to the corresponding directory (assumes it exists).
This should be trivial to read:
sub("{%TIME%}", $h["t"], text) substitute {%TIME%} with the value of $h["t"] in variable text.
$h["t"] means the value at index h["t"], which we put the index of t in the header line, which is 1. So instead of writing $1 we can write $h["t"] so the variable we're referring to is documented in place.
The other variable you'll refer to again with the names "T", "P", etc.
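For completeness, a sketch of the same one-liner with the remaining substitutions filled in, following that pattern (the gsub that strips the parentheses around the velocity columns is an assumption based on the input format shown above, and it is untested beyond that sample):
$ awk 'NR==FNR {temp=temp sep $0; sep=ORS; next}
FNR==2 {for(i=1;i<=NF;i++) h[$i]=i}
FNR>2 {text=temp;
gsub(/[()]/,"");                   # drop the parentheses around u v w
sub("{%TIME%}", $h["t"], text);
sub("{%T%}", $h["T"], text);
sub("{%P%}", $h["P"], text);
sub("{%U%}", $h["u"], text);
sub("{%V%}", $h["v"], text);
sub("{%W%}", $h["w"], text);
print text > ($1 "/output.txt")}' template.txt input.txt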
After some string conversion of heterogeneous data, there are files with the following content:
file1.txt:
mat 445
file2.txt:
mat 734.2
and so on. But there are also intruders that do not match that pattern, e. g.
filen.txt:
mat 1
FBW
I would like to keep every line that starts with "mat" and process it further, while all other lines should be deleted.
The following does not work (and seems rather ponderous):
for f in *.txt ; do
if [[ ${f:0:3} == "mat" ]]; then
# do some string conversion with that line, which is not important here
sed -i -e 's/^.*\(mat.*\).*$/\1/' $f
sed -i -e 's/ //g' $f
tr '.' '_' < $f
sed -i -e 's/^/\<http:\/\/uricorn.fly\/tib\_lok\_sys\#/' "$f"
sed -i -e 's/\(.*\)[0-9]/&> /' "$f"
else
# delete the line that does not match the pattern
sed -i -e '^[mat]/d' $f
fi
done
As the comments below point out, the if condition is incorrect: it matches the file's name, not its content.
Desired output should then be:
file1.txt
<http://uricorn.fly/tib_lok_sys#mat445>
file2.txt
<http://uricorn.fly/tib_lok_sys#mat734_2>
filen.txt
<http://uricorn.fly/tib_lok_sys#mat1>
How can this be achieved?
Source data, with some extras added to the last 2 files:
$ for s in 1 2 n
do
fn="file${s}.txt"
echo "+++++++++++ ${fn}"
cat "${fn}"
done
+++++++++++ file1.txt
mat 445
+++++++++++ file2.txt
mat 734.2.3
+++++++++++ filen.txt
mat 1 2 3
FBW
One awk solution that implements the most recent set of question edits:
awk -i inplace ' # overwrite the source file
/^mat/ { gsub(/ /,"") # if line starts with "^mat" then remove spaces ...
gsub(/\./,"_") # and replace periods with underscores
printf "<http://uricorn.fly/tib_lok_sys#%s>\n", $0 # print the desired output
}
' file{1,2,n}.txt
NOTES:
the -i inplace option requires GNU awk 4.1.0 (or better)
remove comments to declutter code
The above generates the following:
$ for s in 1 2 n
do
fn="file${s}.txt"
echo "+++++++++++ ${fn}"
cat "${fn}"
done
+++++++++++ file1.txt
<http://uricorn.fly/tib_lok_sys#mat445>
+++++++++++ file2.txt
<http://uricorn.fly/tib_lok_sys#mat734_2_3>
+++++++++++ filen.txt
<http://uricorn.fly/tib_lok_sys#mat123>
Sed:
sed -ri '/^mat/{s/[ ]//g;s/[.]/_/g;s#^(.*)$#<http://uricorn.fly/tib_lok_sys\#\1>#g}' *.txt
Search for lines starting with mat, then remove the spaces, replace . with _, and finally wrap the whole line in the URI string (the # in the replacement is escaped because # is also the s delimiter).
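Note that this leaves lines such as FBW untouched; if those should be removed as well, as the question asks, a possible variant along the same lines deletes the non-matching lines first:
sed -ri '/^mat/!d; s/[ ]//g; s/[.]/_/g; s#^(.*)$#<http://uricorn.fly/tib_lok_sys\#\1>#g' *.txt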
The other answers are far more elegant, but none worked on my system, so here is what eventually did:
for f in *.txt ; do
# Remove every line that does not contain 'mat'
sed -i '/mat/!d' "$f"
# Remove every character until 'mat' begins
sed -i -e 's/^.*\(mat.*\).*$/\1/' "$f"
# Remove the blank between 'mat' and number
sed -i -e 's/ //g' "$f"
# Replace the dot in subcategories with an underscore (written back in place)
tr '.' '_' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
# Add URI
sed -i -e 's/^/\<http:\/\/uricorn.fly\/tib\_lok\_sys\#/' "$f"
sed -i -e 's/\(.*\)[0-9]/&> /' "$f"
# Drop adjacent duplicate lines in place (plain uniq only prints)
uniq "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work, I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!
you can use something like this:
grep '>' *.fasta | while read -r line ; do
new_name="$(echo $line | cut -d' ' -f 6)"
old_name="$(echo $line | cut -d':' -f 1)"
mv $old_name "$new_name.fasta"
done
It searches in *.fasta files and handles every matched line
it splits each result of grep by spaces and gets the 6th element as the new name
it splits each result of grep by : and gets the first element as the old name
it moves/renames from the old filename to the new filename
There are several things going on with this code.
For a start... I actually don't get this particular error; this might be due to different versions.
It might come down to grep interpreting '>' the same as > if the bash expansion goes wrong. I would suggest maybe going for "\>".
Secondly:
fname=$("echo $match|awk '{print $6}'")
The quotes inside serve an unintended purpose. Your code should look like this, if anything:
fname="$(echo $match|awk '{print $6}')"
Lastly, to properly retrieve your data, this should be your final code:
for match in "$(grep -Hr "\>")"; do
fname="$(echo "$match" | cut -d: -f1)"
new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
echo mv "$fname" "$new_fname"
done
Explanations:
grep -H -> you want your grep to explicitly use "Include Filename", just in case other shell environments decide to alias grep to grep -h (no filenames)
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
Although, I don't see why you would search for '>' and not directly for 'sP', as such:
for match in "$(grep -Hro "sP[0-9]*")"
This is not the exact same behaviour, and has different edge cases, but it just might work for you.
Quite straightforward in (g)awk :
create a file "script.awk":
FNR == 1 {
for (i=1; i<=NF; i++) {
if (index($i, "sP")==1) {
print "mv", FILENAME, $i ".fasta"
nextfile
}
}
}
use it :
awk -f script.awk *.fasta > cmmd.txt
check the content of the output.
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
if ok, launch rename with . cmmd.txt
For all fasta files in the directory, search their first line for the first word starting with sP and rename them using that word as the basename.
Using a bash array:
for f in *.fasta; do
arr=( $(head -1 "$f") )
for word in "${arr[#]}"; do
[[ "$word" =~ ^sP* ]] && echo mv "$f" "${word}.fasta" && break
done
done
or using grep:
for f in *.fasta; do
word=$(head -1 "$f" | grep -o "\bsP\w*")
[ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.
I have the following code:
names=$(ls *$1*.txt)
head -q -n 1 $names | cut -d "_" -f 2
where the first line finds and stores all names matching the command line input into a variable called names, and the second grabs the first line in each file (element of the variable names) and outputs the second part of the line based on the "_" delim.
This is all good, however I would like to prepend the filename (stored as lines in the variable names) to the output of cut. I have tried:
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names" cut -d "_" -f 2
however this only prints out the filenames
I have tried
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names"; cut -d "_" -f 2
and again I only print out the filenames.
The desired output is:
$
filename1.txt <second character>
where there is a single whitespace between the filename and the result of cut.
Thank you.
Best approach, using awk
You can do this all in one invocation of awk:
awk -F_ 'NR==1{print FILENAME, $2; exit}' *"$1"*.txt
On the first line of the first file, this prints the filename and the value of the second column, then exits.
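If the goal is actually one line of output per matching file rather than only the first file, a small variant of the same idea might be (a sketch; nextfile is available in GNU awk and most current awks):
awk -F_ 'FNR==1{print FILENAME, $2; nextfile}' *"$1"*.txt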
Pure bash solution
I would always recommend against parsing ls - instead I would use a loop. You can avoid the use of awk to read the first line of the file by using bash built-in functionality:
for i in *"$1"*.txt; do
IFS=_ read -ra arr <"$i"
echo "$i ${arr[1]}"
break
done
Here we read the first line of the file into an array, splitting it into pieces on the _.
Maybe something like that will satisfy your need BUT THIS IS BAD CODING (see comments):
#!/bin/bash
names=$(ls *$1*.txt)
for f in $names
do
pattern=`head -q -n 1 $f | cut -d "_" -f 2`
echo "$f $pattern"
done
If I didn't misunderstand your goal, this also works.
I've always done it this way, I just found out that this is a deprecated way to do it.
#!/bin/bash
names=$(ls *"$1"*.txt)
for e in $names;
do echo $e `echo "$e" | cut -c2-2`;
done
The bash script I wrote is supposed to modify my text files. The problem is the speed of operation. There are 4 lines in each file that I want to modify.
This is my bash script to modify all .txt files in a given folder:
srcdir="$1" //source directory
cpr=$2 //given string argument
find $srcdir -name "*.txt" | while read i; do
echo "#############################"
echo "$i"
echo "Custom string: $cpr"
echo "#############################"
# remove document name and title
sed -i 's_document=.*\/[0-9]\{10\}\(, User=team\)\?__g' $i
# remove document date
sed -i 's|document date , [0-9]\{2\}\/[0-9]\{2\}\/[0-9]\{4\} [0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\} MDT||g' $i
# remove document id
sed -i 's|document id = 878h67||g' $i
# replace new producer
sed_arg="-i 's|Reproduced by $cpr|john smith|g' $i"
eval sed "$sed_arg"
done
I don't know how to concatenate all my sed commands into one command or two, so the job would be done faster (I think!).
I have tried the regex OR operator | but with no success.
Have you tried
sed -i -e 's/pattern/replacement/g' -e 's/pattern1/replace1/g' file
sed -i '
s_document=.*\/[0-9]\{10\}\(, User=team\)\?__g;
s|document date , [0-9]\{2\}\/[0-9]\{2\}\/[0-9]\{4\} [0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\} MDT||g;
s|document id = 878h67||g;
s|Reproduced by '"$cpr"'|john smith|g' $i