bash manipulate multiple strings in a file

I am looking to strip the filename part from the input lines below, and I am using this file:
cat <<EOF >./tz.txt
2019/12/_MG_0263.CR2.xmp: bopt:keywordlist="pinhole,car,2019"
2019/12/_MG_0262.CR2.xmp: bopt:keywordlist="pinhole,car,2019"
2020/06/ok/_MG_0003.CR2.xmp: bopt:keywordlist="lowkey,car,Chiaroscuro,2020"
2020/06/ok/_MG_0002.CR2.xmp: bopt:keywordlist="lowkey,car,Chiaroscuro,2020"
2020/04/_MG_0137.CR2.xmp: bopt:keywordlist="red,car,2020"
2020/04/_MG_0136.CR2.xmp: bopt:keywordlist="red,car,2020"
2020/04/_MG_0136.CR2.xmp: bopt:keywordlist="red,car,2020"
EOF
and now I am using the script below (stored in file ab.sh) to exclude the filename.xmp: bopt: part (e.g. _MG_0263.CR2.xmp: bopt:) from each line, so that the output looks like this:
2019/12/ keywordlist="pinhole,car,2019"
2019/12/ keywordlist="pinhole,car,2019"
2020/06/ok/ keywordlist="lowkey,car,Chiaroscuro,2020"
2020/06/ok/ keywordlist="lowkey,car,Chiaroscuro,2020"
2020/04/ keywordlist="red,car,2020"
2020/04/ keywordlist="red,car,2020"
2020/04/ keywordlist="red,car,2020"
Above is the complete expected output. Some folders may have a different structure, like 2020/06/ok/.
The script code is below:
#!/bin/bash
file="./tz.txt"
while read line ; do
# variable a generates the folder structure with a variable range of considered columns
# using awk to figure out how many columns (aka folders) there are in the structure
a=$( cut -d"/" -f 1-$( awk -F'/' '{ print NF-1 }' $line ) $line )
# (the inner awk is meant to produce the field count that the cut command uses)
# then b variable stores the last bit in the string
b=$( cut -d":" -f 3 $line )
# and below combine results from above variables
echo ${a} ${b}
done < ${file}
The attached image illustrates the logic used to split the string into columns and keep only the relevant data.
The problem is that I get the errors below and I am not sure where I've gone wrong.
Thank you for any suggestions or help
$ sh ~/ab.sh
awk: fatal: cannot open file `2019/12/_MG_0263.CR2.xmp:' for reading (No such file or directory)
cut: '2019/12/_MG_0263.CR2.xmp:': No such file or directory
cut: 'bopt:keywordlist="pinhole,car,2019"': No such file or directory
cut: '2019/12/_MG_0263.CR2.xmp:': No such file or directory
cut: 'bopt:keywordlist="pinhole,car,2019"': No such file or directory
awk: fatal: cannot open file `2019/12/_MG_0262.CR2.xmp:' for reading (No such file or directory)
cut: '2019/12/_MG_0262.CR2.xmp:': No such file or directory
cut: 'bopt:keywordlist="pinhole,car,2019"': No such file or directory
cut: '2019/12/_MG_0262.CR2.xmp:': No such file or directory
cut: 'bopt:keywordlist="pinhole,car,2019"': No such file or directory
awk: fatal: cannot open file `2020/06/ok/_MG_0003.CR2.xmp:' for reading (No such file or directory)
cut: '2020/06/ok/_MG_0003.CR2.xmp:': No such file or directory
cut: 'bopt:keywordlist="lowkey,car,Chiaroscuro,2020"': No such file or directory
cut: '2020/06/ok/_MG_0003.CR2.xmp:': No such file or directory
cut: 'bopt:keywordlist="lowkey,car,Chiaroscuro,2020"': No such file or directory
....

One awk idea to replace the while loop:
awk -F':' '
{ gsub(/[^/]+$/,"",$1) # strip everything after last "/" from 1st field
print $1, $3
}' "${file}"
# or as a one-liner sans comments:
awk -F':' '{gsub(/[^/]+$/,"",$1); print $1, $3}' "${file}"
This generates:
2019/12/ keywordlist="pinhole,car,2019"
2019/12/ keywordlist="pinhole,car,2019"
2020/06/ok/ keywordlist="lowkey,car,Chiaroscuro,2020"
2020/06/ok/ keywordlist="lowkey,car,Chiaroscuro,2020"
2020/04/ keywordlist="red,car,2020"
2020/04/ keywordlist="red,car,2020"
2020/04/ keywordlist="red,car,2020"
One sed alternative:
$ sed -En 's|^(.*)/[^/]+:.*:([^:]+)$|\1/ \2|p' "${file}"
Where:
-En - enable support for extended regexes, suppress automatic printing of input lines
Since the data includes the / character, we use | as the sed script delimiter.
^(.*)/ - [1st capture group] match everything up to the last / before ...
[^/]+: - matching everything that's not a / up to the 1st :, then ...
.*: - match everything up to next :
([^:]+)$ - [2nd capture group] lastly match everything at end of line that is not :
\1/ \2 - print 1st capture group + / + 2nd capture group
This generates:
2019/12/ keywordlist="pinhole,car,2019"
2019/12/ keywordlist="pinhole,car,2019"
2020/06/ok/ keywordlist="lowkey,car,Chiaroscuro,2020"
2020/06/ok/ keywordlist="lowkey,car,Chiaroscuro,2020"
2020/04/ keywordlist="red,car,2020"
2020/04/ keywordlist="red,car,2020"
2020/04/ keywordlist="red,car,2020"

First of all, the final parameter to the awk command should be a filename. You are passing it a variable containing the contents of one line of the input file. This is why you are getting the awk: fatal: cannot open file errors.
Secondly, you are making the same mistake with the cut command, resulting in the : No such file or directory errors.
Both awk and cut are designed to work on complete files. You can chain them together so that the output of one becomes the input of another by using the pipe character: |. For example:
cat ${file} | awk ... | cut ...
But this can quickly become complex and unwieldy. A better solution is to use the Stream Editor sed. sed reads its input line by line and can perform quite complex operations on each line before outputting the result, line by line.
This should do what you want:
#!/bin/bash
file="/tz.txt"
sed -En 's/^([0-9]{4}\/[0-9]{2}\/).*bopt:(.*)$/\1 \2/p' ${file}
Here is an explanation of the quoted expression:
s/pat/rep/p Search for pat and if found, replace with rep and print the result.
In our case, pat is:
^ The beginning of a line
( Start remembering what follows
[0-9]{4} Any digit repeated exactly 4 times
\/ The / character (escaped)
[0-9]{2}\/ Any digit repeated exactly 2 times, followed by /
) Stop remembering
.*bopt: any 0 or more characters followed by bopt:
(.*) remember 0 or more characters...
$ ...up to the end of the line.
And rep is:
\1 \2 The first thing remembered, followed by a space, followed by the second thing we remembered.
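For completeness, the same split can be done in pure bash with parameter expansion, avoiding external commands entirely. A minimal sketch, assuming every line has the path/file: bopt:rest shape shown in the sample:
#!/bin/bash
file="./tz.txt"
while IFS= read -r line; do
  path=${line%/*}/        # drop everything after the last "/", then re-append the "/"
  rest=${line##*bopt:}    # drop everything up to and including "bopt:"
  echo "${path} ${rest}"
done < "${file}"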


awk, sed, grep specific strings from a file in Linux

Here is part of the complete file that I am trying to filter:
Hashmode: 13761 - VeraCrypt PBKDF2-HMAC-SHA256 + XTS 512 bit + boot-mode (Iterations: 200000)
Speed.#2.........: 2038 H/s (56.41ms) @ Accel:128 Loops:32 Thr:256 Vec:1
Speed.#3.........: 2149 H/s (53.51ms) @ Accel:128 Loops:32 Thr:256 Vec:1
Speed.#*.........: 4187 H/s
The aim is to print the following:
13761 VeraCrypt PBKDF2-HMAC-SHA256 4187 H/s
Here is what I tried.
The complete file is called complete.txt
cat complete.txt | grep Hashmode | awk '{print $2,$4,$5}' > mode.txt
Output:
13761 VeraCrypt PBKDF2-HMAC-SHA256
Then:
cat complete.txt | grep Speed.# | awk '{print $2,$3}' > speed.txt
Output:
4187 H/s
Then:
paste mode.txt speed.txt
The issue is that the lines do not match. There are approx 200 types of modes to filter within the file 'complete.txt'
I also have a feeling that this can be done using a much simpler command with sed or awk.
I am guessing you are looking for something like the following.
awk '/Hashmode:/ { if(label) print label, speed; label = $2 " " $4 " " $5 }
/Speed\.#/ { speed = $2 " " $3 }
END { if (label) print label, speed }' complete.txt
We match up the Hashmode line with the last Speed.# line which follows, then print when we see a new Hashmode, or reach end of file. (Failing to print the last one is a common beginner bug.)
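Run against the sample above, this prints exactly the requested line:
13761 VeraCrypt PBKDF2-HMAC-SHA256 4187 H/s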
This might work for you (GNU sed):
sed -E '/Hashmode:/{:a;x;s/^[^:]*: (\S+) -( \S+ \S+ ).*\nSpeed.*:\s*(\S+ \S+).*/\1\2\3/p;x;h;d};H;$!d;ba' file
If a line contains Hashmode, swap to the hold space, use pattern matching to manipulate its contents into the desired format and print it, then swap back to the pattern space, copy the current line to the hold space and delete the current line.
Otherwise, append the current line to the hold space and delete the current line, unless the current line is the last line in the file, in which case process the line as if it contained Hashmode.
N.B. The first time Hashmode is encountered, nothing is output. Subsequent matches and the end-of-file condition will be the only times printing occurs.

Using awk to delete multiple lines using argument passed on via function

My input.csv file is semicolon separated, with the first line being a header for attributes. The first column contains customer numbers. The function is being called through a script that I activate from the terminal.
I want to delete all lines containing the customer numbers that are entered as arguments for the script. EDIT: And then export the file as a different file, while keeping the original intact.
bash deleteCustomers.sh 1 3 5
Currently only the last argument is filtered from the csv file. I understand that this is happening because the output file gets overwritten each time the loop runs, restoring all previously deleted arguments.
How can I match all the lines to be deleted, and then delete them (or print everything BUT those lines), and then output it to one file containing ALL edits?
delete_customers () {
echo "These customers will be deleted: "$#""
for i in "$#";
do
awk -F ";" -v customerNR=$i -v input="$inputFile" '($1 != customerNR) NR > 1 { print }' "input.csv" > output.csv
done
}
delete_customers "$@"
Here's some sample input (first piece of code is the first line in the csv file). In the output CSV file I want the same formatting, with the lines for some customers completely deleted.
Klantnummer;Nationaliteit;Geslacht;Title;Voornaam;MiddleInitial;Achternaam;Adres;Stad;Provincie;Provincie-voluit;Postcode;Land;Land-voluit;email;gebruikersnaam;wachtwoord;Collectief ;label;ingangsdatum;pakket;aanvullende verzekering;status;saldo;geboortedatum
1;Dutch;female;Ms.;Josanne;S;van der Rijst;Bliek 189;Hellevoetsluis;ZH;Zuid-Holland;3225 XC;NL;Netherlands;JosannevanderRijst@dayrep.com;Sourawaspen;Lae0phaxee;Klant;CZ;11-7-2010;best;tand1;verleden;-137;30-12-1995
2;Dutch;female;Mrs.;Inci;K;du Bois;Castorweg 173;Hengelo;OV;Overijssel;7557 KL;NL;Netherlands;InciduBois@gustr.com;Hisfireeness;jee0zeiChoh;Klant;CZ;30-8-2015;goed ;geen;verleden;188;1-8-1960
3;Dutch;female;Mrs.;Lusanne;G;Hijlkema;Plutostraat 198;Den Haag;ZH;Zuid-Holland;2516 AL;NL;Netherlands;LusanneHijlkema@dayrep.com;Digum1969;eiTeThun6th;Klant;Achmea;12-2-2010;best;mix;huidig;-335;9-3-1973
4;Dutch;female;Dr.;Husna;M;Hoegee;Tiendweg 89;Ameide;ZH;Zuid-Holland;4233 VW;NL;Netherlands;HusnaHoegee@fleckens.hu;Hatimon;goe5OhS4t;Klant;VGZ;9-8-2015;goed ;gezin;huidig;144;12-8-1962
5;Dutch;male;Mr.;Sieds;D;Verspeek;Willem Albert Scholtenstraat 38;Groningen;GR;Groningen;9711 XA;NL;Netherlands;SiedsVerspeek@armyspy.com;Thade1947;Taexiet9zo;Intern;CZ;17-2-2004;beter;geen;verleden;-49;12-10-1961
6;Dutch;female;Ms.;Nazmiye;R;van Spronsen;Noorderbreedte 180;Amsterdam;NH;Noord-Holland;1034 PK;NL;Netherlands;NazmiyevanSpronsen@jourrapide.com;Whinsed;Oz9ailei;Intern;VGZ;17-6-2003;beter;mix;huidig;178;8-3-1974
7;Dutch;female;Ms.;Livia;X;Breukers;Everlaan 182;Veenendaal;UT;Utrecht;3903
Try this in a loop:
awk -v variable=$var '$1 != variable' input.csv
awk - to make decision based on columns
-v - to use a variable into a awk command
variable - store the value for awk to process
$var - to search for a specific string in run-time
!= - to check for inequality, i.e. keep only the non-matching lines
input.csv - your input file
This is standard awk behavior: with -v the shell variable is available to awk at run time, and the condition produces output that doesn't contain the value you passed. This way, you get all the rows that do not match your variable. Hope this is helpful. :)
Thanks
This bash script should work:
#!/bin/bash
FILTER="!/(^"$(echo "$@" | sed -e "s/ /\|^/g")")/ {print}"
awk "$FILTER" input.csv > output.csv
The idea is to build the relevant awk FILTER and then use it.
Assuming the call parameters are: 1 2 3, the filter will be: !/(^1|^2|^3)/ {print}
!: to invert matching
^: Beginning of the line
The input data are in the input.csv file and output result will be in the output.csv file.
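Note that these patterns are prefix matches, so deleting customer 1 would also delete customers 10, 100 and so on. A more robust sketch (hypothetical file name deleteCustomers.sh; exact match on the first ;-separated field, header line always kept):
#!/bin/bash
# usage: ./deleteCustomers.sh 1 3 5
ids=$(IFS=,; echo "$*")              # join the arguments into "1,3,5"
awk -F';' -v ids="$ids" '
  BEGIN { n = split(ids, a, ","); for (i = 1; i <= n; i++) del[a[i]] }
  FNR == 1 || !($1 in del)           # keep the header and all non-matching customers
' input.csv > output.csv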

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given in a separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
this code works well for the multiline sequences but IDs have to be inserted separately to the code.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.
I guess that the good thing would be to modify the RS variable in the code (2) but all of my attempts failed so far. Can, please, anybody help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
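A commented version of that one-liner, same logic spread out for readability:
awk -F'>' '
  NR == FNR { ids[$0]; next }    # 1st file: store every wanted ID as an array key
  NF > 1    { f = ($2 in ids) }  # a ">ID" header line: set the flag if the ID is wanted
  f                              # while the flag is set, print (header and sequence lines)
' id.txt seq.fasta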
The following awk may help here:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
seqnm=$1
idx0_idx1=$(grep -n "$seqnm" multi-fasta.idx)
idx0=$(echo "$idx0_idx1" | cut -d ":" -f 1)      # line number within multi-fasta.idx
idx0plus1=$(expr "$idx0" + 1)
idx1=$(echo "$idx0_idx1" | cut -d ":" -f 2)      # line number of the title in multi-fasta.fa
idx2=$(head -n "$idx0plus1" multi-fasta.idx | tail -1 | cut -d ":" -f 1)   # next title's line in multi-fasta.fa
idx2minus1=$(expr "$idx2" - 1)
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > "${seqnm}.fasta"
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
At last, using sed, we extract the lines from idx1 through idx2 minus 1, which are the title and the sequence. (If every sequence occupied a fixed number of lines, a simple grep -A would do instead.)
The advantage of this ugly-hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is that this process is slow. For my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's still much faster than using samtools faidx.

How to remove all lines from a text file starting at first empty line?

What is the best way to remove all lines from a text file starting at first empty line in Bash? External tools (awk, sed...) can be used!
Example
1: ABC
2: DEF
3:
4: GHI
Lines 3 and 4 should be removed and the remaining content should be saved in a new file.
With GNU sed:
sed '/^$/Q' "input_file.txt" > "output_file.txt"
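Q is a GNU extension; a portable POSIX sed sketch does the same by suppressing automatic printing and quitting at the first empty line:
sed -n '/^$/q;p' "input_file.txt" > "output_file.txt"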
With AWK:
$ awk '/^$/{exit} 1' test.txt > output.txt
Contents of output.txt
$ cat output.txt
ABC
DEF
Walkthrough: for a line that matches ^$ (start-of-line immediately followed by end-of-line, i.e. an empty line), exit the whole script. For all other lines, print the whole line -- of course, we won't reach this part once a line has made us exit.
Bet there are some more clever ways to do this, but here's one using bash's read builtin. The question asks us to keep the lines before the blank in one file and send the lines after the blank to another file. You could send some of standard out one place and some another by using exec to reroute stdout mid-script, but I'm going to take a simpler approach and use a command line argument to say where the post-blank data should go:
#!/bin/bash
# script takes as argument the name of the file to send data to once a blank
# line is found
found_blank=0
: > "$1"                         # truncate the output file once, up front
while IFS= read -r stuff; do
  if [ -z "$stuff" ] ; then
    found_blank=1
  fi
  if [ "$found_blank" -eq 1 ] ; then
    echo "$stuff" >> "$1"        # append, so each line doesn't overwrite the last
  else
    echo "$stuff"
  fi
done
run it like this:
$ ./delete_from_empty.sh rest_of_stuff < demo
output is:
ABC
DEF
and 'rest_of_stuff' has the blank line followed by
GHI
if you want the before-blank lines to go somewhere else besides stdout, simply redirect:
$ ./delete_from_empty.sh after_blank < input_file > before_blank
and you'll end up with two new files: after_blank and before_blank.
Perl version
perl -e '
open $fh, ">","stuff";
open $efh, ">", "rest_of_stuff";
while(<>){
if ($_ !~ /\w+/){
$fh=$efh;
}
print $fh $_;
}
' demo
This creates two output files and iterates over the demo data. When it hits a blank line, it flips the output from one file to the other.
Creates
stuff:
ABC
DEF
rest_of_stuff:
<blank line>
GHI
Another awk would be:
awk -vRS= '1;{exit}' file
By setting the record separator RS to the empty string, we define the records as paragraphs separated by sequences of empty lines. It is now easy to adapt this to select the nth block, e.g. the second:
awk -vRS= -v n=2 '(FNR==n){print;exit}' file
There is a problem with this method when processing files with a DOS line-ending (CRLF). There will be no empty lines as there will always be a CR in the line. But this problem applies to all presented methods.
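A hedged workaround for the CRLF case is to strip the carriage returns before the data reaches awk:
tr -d '\r' < file | awk -vRS= '1;{exit}'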

Remove all text from last dot in bash

I have a file named test.txt which has:
abc.cde.ccd.eed.12345.5678.txt
abcd.cdde.ccdd.eaed.12346.5688.txt
aabc.cade.cacd.eaed.13345.5078.txt
abzc.cdae.ccda.eaed.29345.1678.txt
abac.cdae.cacd.eead.18145.2678.txt
aabc.cdve.cncd.ened.19945.2345.txt
If I want to remove everything beyond the first . like:
cde.ccd.eed.12345.5678.txt
cdde.ccdd.eaed.12346.5688.txt
cade.cacd.eaed.13345.5078.txt
cdae.ccda.eaed.29345.1678.txt
cdae.cacd.eead.18145.2678.txt
cdve.cncd.ened.19945.2345.txt
Then I will do
for i in `cat test.txt`; do echo ${i#*.}; done
but If I want to remove everything after the last . like:
abc.cde.ccd.eed.12345.5678
abcd.cdde.ccdd.eaed.12346.5688
aabc.cade.cacd.eaed.13345.5078
abzc.cdae.ccda.eaed.29345.1678
abac.cdae.cacd.eead.18145.2678
aabc.cdve.cncd.ened.19945.2345
what should I do?
With awk:
awk 'BEGIN{FS=OFS="."} NF--' file
In case there are no empty lines, this works. It sets the input and output field separators to the dot (.), then decreases the number of fields by one so that the last one is dropped. The pattern is true for any line that had at least one field, so awk performs the default action {print $0}, that is, it prints the rebuilt line.
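If the file might contain empty lines or lines without any dot, a guarded variant of the same idea (a sketch; gawk, mawk and BusyBox awk all rebuild $0 when NF is assigned):
awk 'BEGIN{FS=OFS="."} NF>1{NF--} 1' file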
With sed:
sed 's/\.[^.]*$//' file
This catches the last block of . + text + end of line and replaces it with nothing. That is, it removes it.
With rev and cut:
rev file | cut -d'.' -f2- | rev
rev reverses the line, so that cut can print from the 2nd word to the end. Then, rev back to get the correct output.
With bash:
while IFS= read -r line
do
echo "${line%.*}"
done < file
This performs a parameter expansion that removes the shortest match of .* from the end of the $line content.
With grep:
grep -Po '.*(?=\.)' file
A look-ahead prints just what comes before the last dot (-P requires GNU grep built with PCRE support).
All of them return:
abc.cde.ccd.eed.12345.5678
abcd.cdde.ccdd.eaed.12346.5688
aabc.cade.cacd.eaed.13345.5078
abzc.cdae.ccda.eaed.29345.1678
abac.cdae.cacd.eead.18145.2678
aabc.cdve.cncd.ened.19945.2345
