How to mark a line in a target file when the line matches - linux

My bash script reads each line from a file, /tmp/file.CSV, until EOF,
and checks whether that line matches a line in another file, /tmp/target.CSV (on a full match the script needs to add "+" at the beginning of the matched line).
For example, take
line="/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11" ( from /tmp/file.CSV )
We see that $line has a full match with the line:
1,ull,LINUX,"/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11",fnt,rfdr,OK ( from /tmp/target.CSV )
so we need to add "+" to that line in /tmp/target.CSV:
+1,ull,LINUX,"/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11",fnt,rfdr,OK
Please advise how to do that with a sed, awk, or perl one-liner in my bash script.
more /tmp/target.CSV
1,ull,LINUX,"/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11",fnt,rfdr,OK
2,Ama,LINUX,"/VPNfig/EME/EM8/Franlecom Eana SA/Amen",comrse,temporal,OK
3,ArnTel,LINUX,"/VPConfig/EME/EM3/ArmenTem Armenia)/ArmenTe",Coers,FAIL
4,Ahh,LINUX,"/VPConfig/EMA/EM/llk/AAe",Coers,FAIL
142,ucell,LINUX,/VPNAAonfig/EMEA/EM3/Ucell/ede3fc34,Glo,G/rvrev443,OK
more file.CSV
/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11
/VPNfig/EME/EM8/Franlecom Eana SA/Amen
/VPConfig/EME/EM3/ArmenTem Armenia)/ArmenTe
/VPConfig/EME/EM0/TTR/Ar
/VPNAAonfig/EMEA/EM3/Ucell/ede3fc34
My bash code:
while read -r line
do
    grep -iq "$line" /tmp/target.CSV
    if [[ $? -ne 0 ]]
    then
        echo "$line NOT MATCH target.CSV"
    else
        sed .................
    fi
done < /tmp/file.CSV
Example of expected results (for the /tmp/target.CSV and file.CSV files above):
more /tmp/target.CSV
+1,ull,LINUX,"/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11",fnt,rfdr,OK
+2,Ama,LINUX,"/VPNfig/EME/EM8/Franlecom Eana SA/Amen",comrse,temporal,OK
+3,ArnTel,LINUX,"/VPConfig/EME/EM3/ArmenTem Armenia)/ArmenTe",Coers,FAIL
4,Ahh,LINUX,"/VPConfig/EMA/EM/llk/AAe",Coers,FAIL
more file.CSV
+/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11
+/VPNfig/EME/EM8/Franlecom Eana SA/Amen
+/VPConfig/EME/EM3/ArmenTem Armenia)/ArmenTe
/VPConfig/EME/EM0/TTR/Ar
+/VPNAAonfig/EMEA/EM3/Ucell/ede3fc34
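One way to fill in the missing sed step is sketched below. This is a minimal sketch, not the only answer: it assumes GNU sed for `-i`, keeps the question's case-insensitive grep (note the sed address itself stays case-sensitive), and uses made-up sample paths instead of the real files.

```shell
# Sketch of the missing sed step: for each pattern line in file.csv,
# prefix the matching line in target.csv with "+", in place.
# Assumes GNU sed for -i; sample data below is made up for the demo.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '/VPN/EM3/aaa' '/VPN/EM9/zzz' > file.csv
printf '%s\n' '1,ull,LINUX,"/VPN/EM3/aaa",OK' '4,Ahh,LINUX,"/VPN/EM8/bbb",FAIL' > target.csv

while IFS= read -r line
do
    if grep -qiF -- "$line" target.csv
    then
        # escape characters that are special inside a sed /address/
        esc=$(printf '%s\n' "$line" | sed 's/[][\\/.*^$]/\\&/g')
        sed -i "/$esc/ s/^/+/" target.csv
    else
        echo "$line NOT MATCH target.csv"
    fi
done < file.csv

cat target.csv
```

Because the pattern is escaped before being used as a sed address, slashes and dots in the paths are matched literally.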

awk -F\" -v OFS=\" 'FNR==NR{ a[$0]++; next} $2 in a { $0 = "+" $0 } 1' file.csv target.csv
Output:
+1,ull,LINUX,"/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11",fnt,rfdr,OK
+2,Ama,LINUX,"/VPNfig/EME/EM8/Franlecom Eana SA/Amen",comrse,temporal,OK
+3,ArnTel,LINUX,"/VPConfig/EME/EM3/ArmenTem Armenia)/ArmenTe",Coers,FAIL
4,Ahh,LINUX,"/VPConfig/EMA/EM/llk/AAe",Coers,FAIL
Or
awk -F\" -v OFS=\" 'FNR==NR{ a[$0]++; next} { print ($2 in a ? "+" : " ") $0 }' file.csv target.csv
awk -F\" -v OFS=\" 'FNR==NR{ a[$0]++; next} { $0 = ($2 in a ? "+" : " ") $0 } 1' file.csv target.csv
Output:
+1,ull,LINUX,"/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11",fnt,rfdr,OK
+2,Ama,LINUX,"/VPNfig/EME/EM8/Franlecom Eana SA/Amen",comrse,temporal,OK
+3,ArnTel,LINUX,"/VPConfig/EME/EM3/ArmenTem Armenia)/ArmenTe",Coers,FAIL
4,Ahh,LINUX,"/VPConfig/EMA/EM/llk/AAe",Coers,FAIL
And this one works whether or not each line already starts with a single space, so it is safe to re-run:
awk -F\" -v OFS=\" 'FNR==NR{ a[$0]++; next} { sub(/^ ?/, $2 in a ? "+" : " ") } 1' file.csv target.csv
Update (1)
awk -F, -v OFS=, 'FNR==NR{ sub(/[ \t\r]*$/, ""); a[$0]++; next} { t = $4; gsub(/(^"|"$)/, "", t); sub(/^[ \t]*/, t in a ? "+" : " "); } 1' file.csv target.csv
Output:
+1,ull,LINUX,"/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11",fnt,rfdr,OK
+2,Ama,LINUX,"/VPNfig/EME/EM8/Franlecom Eana SA/Amen",comrse,temporal,OK
+3,ArnTel,LINUX,"/VPConfig/EME/EM3/ArmenTem Armenia)/ArmenTe",Coers,FAIL
4,Ahh,LINUX,"/VPConfig/EMA/EM/llk/AAe",Coers,FAIL
+142,ucell,LINUX,/VPNAAonfig/EMEA/EM3/Ucell/ede3fc34,Glo,G/rvrev443,OK
Update (2)
awk -F, -v OFS=, 'FNR==NR{ sub(/[ \t\r]$/, ""); a[$0]++; b[FNR]=$0; next} { t = $4; gsub(/(^"|"$)/, "", t); r = " "; if (t in a) { c[t]++; r = "+" }; sub(/^[ \t]*/, r); } 1; END { for (i = 1; i in b; ++i) { t = b[i]; sub(/^[ \t]*/, t in c ? "+" : " ", t); print t > "/dev/stderr" } }' file.csv target.csv > new_target.csv 2> new_file.cs

Try this Perl one-liner:
perl -pi -e '$_="+".$_ if($_=~m{/VPNfig/EME/EM3/Ucll/ucelobeconn/6EKoHH11}is);' /tmp/target.CSV

Related

Adding double quotes around non-numeric columns by awk

I have a file like this:
2018-01-02;1.5;abcd;111
2018-01-04;2.75;efgh;222
2018-01-07;5.25;lmno;333
2018-01-09;1.25;prs;444
I'd like to add double quotes around the non-numeric columns, so the new file should look like this:
"2018-01-02";1.5;"abcd";111
"2018-01-04";2.75;"efgh";222
"2018-01-07";5.25;"lmno";333
"2018-01-09";1.25;"prs";444
I tried this so far; I know that this is not the correct way:
head myfile.csv -n 4 | awk 'BEGIN{FS=OFS=";"} {gsub($1,echo $1 ,$1)} 1' | awk 'BEGIN{FS=OFS=";"} {gsub($3,echo "\"" $3 "\"",$3)} 1'
Thanks in advance.
You may use this awk that sets ; as input/output delimiter and then wraps each field with "s if that field is non-numeric:
awk '
BEGIN {
FS = OFS = ";"
}
{
for (i=1; i<=NF; ++i)
$i = ($i+0 == $i ? $i : "\"" $i "\"")
} 1' file
"2018-01-02";1.5;"abcd";111
"2018-01-04";2.75;"efgh";222
"2018-01-07";5.25;"lmno";333
"2018-01-09";1.25;"prs";444
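The `$i+0 == $i` test deserves a note: fields (and split() results) that look like numbers carry the "numeric string" attribute, so they compare numerically against their coerced value and pass, while any other string fails. A small self-contained check (the sample tokens are made up):

```shell
# Demonstrate the numeric test used above: a numeric string compares equal
# to itself plus 0; any other string does not.
awk 'BEGIN {
    n = split("1.5 abcd 111 2018-01-02", f, " ")
    for (i = 1; i <= n; ++i)
        print f[i], (f[i]+0 == f[i] ? "numeric" : "string")
}'
```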
Alternative gnu-awk solution:
awk -v RS='[;\n]' '$0+0 != $0 {$0 = "\"" $0 "\""} {ORS=RT} 1' file
Using GNU awk and typeof(): "Fields that are numeric strings have the strnum attribute. Otherwise, they have the string attribute."
$ gawk 'BEGIN {
FS=OFS=";"
}
{
for(i=1;i<=NF;i++)
if(typeof($i)=="string")
$i=sprintf("\"%s\"",$i)
}1' file
Some output:
"2018-01-02";1.5;"abcd";111
Edit: if some of the fields are already quoted:
$ gawk 'BEGIN {
FS=OFS=";"
}
{
for(i=1;i<=NF;i++)
if(typeof($i)=="string")
gsub(/^"?|"?$/,"\"",$i)
}1' <<< string,123,"quoted string"
Output:
"string",123,"quoted string"
Further enhancing anubhava's solution (including handling fields that are already double-quoted):
gawk -e 'sub(".+",$-_==+$-_?"&":(_)"&"_\
)^gsub((_)_, _)^(ORS = RT)' RS='[;\n]' \_='\42'
"2018-01-02";1.5;"abcd";111
"2018-01-04";2.75;"efgh";222
"2018-01-07";5.25;"lmno";333
"2018-01-09";1.25;"prs";444

Regex issue for match a column value

I wrote a script to extract column values from a file that don't match the pattern defined in a column metadata file.
But it is not returning the right output. Can anyone point out the issue? I am trying to match a string with double quotes; the quotes also need to be matched.
Code:
awk -F'|' -v n="$col_pos" -v m="$col_patt" 'NR!=1 && $n !~ "^" m "$" {
printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
count++
}
END {print count}' $input_file
Run output:
++ awk '-F|' -v n=4 -v 'm="[a-z]+#gmail.com"' 'NR!=1 && $n !~ "^" m "$" {
printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
count++
}
END {print count}' /test/data/infa_shared/dev/SrcFiles/datawarehouse/poc/BNX.csv
10,22,"00AF","abc#gmail.com",197,10,1/1/2020 12:06:10.260 PM,"BNX","Hard b","50","Us",1,"25" -- this line is not expected in the output, as it matches the email pattern "[a-z]+#gmail.com" extracted from the file below
Input file for pattern extraction (file_col_metadata):
FILE_ID~col_POS~COL_START_POS~COL_END_POS~datatype~delimited_ind~col_format~columnlength
5~4~~~char~Y~"[a-z]+#gmail.com"~100
If you replace awk -F'|' ... with awk -F',' ... it will work: the input file is comma-delimited, so with -F'|' the whole line is a single field and $4 is empty, which never matches the pattern.
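To illustrate the fix (the file name and sample rows below are made up, modeled on the question's data): with -F',' the quoted email lands in field 4, so the pattern from the metadata file, quotes included, matches it as intended.

```shell
# With -F',' the quoted email is $4, so the metadata pattern
# "[a-z]+#gmail.com" (quotes included) anchors against it correctly.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'id,a,b,email' \
              '10,22,"00AF","abc#gmail.com"' \
              '11,23,"00B0","BAD#gmail.com"' > BNX.csv

awk -F',' -v n=4 -v m='"[a-z]+#gmail.com"' 'NR!=1 && $n !~ "^" m "$" {
    print FILENAME ":" FNR ":" $0
    count++
}
END { print count+0 }' BNX.csv
```

Only the uppercase address fails `[a-z]+`, so exactly one row is reported.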

manipulating files using awk linux

I have a 1.txt file (with field separator as ||o||):
aidagolf6#gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bf6265003ae067b19b88fa4359d5c392||o||Aida||o||Aida||o||Garic Gara
aidagolf6#gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolfa#hotmail.com||o||14f87ec1e760d16c0380c74ec7678b04||o||Aida||o||Aida||o||Rodriguez Puerto
2.txt (with field separator as :):
bf6265003ae067b19b88fa4359d5c392:hyworebu:#
14f87ec1e760d16c0380c74ec7678b04:sujycugu
I want a result.txt file (match the 2nd column of 1.txt against the first column of 2.txt; if they match, replace the 2nd column of 1.txt with the 2nd column of 2.txt):
aidagolf6#gmail.com||o||hyworebu:#||o||Aida||o||Aida||o||Garic Gara
aidagolfa#hotmail.com||o||sujycugu||o||Aida||o||Aida||o||Rodriguez Puerto
And a left.txt file (which contains unmatched rows from 1.txt that have no match in 2.txt):
aidagolf6#gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?
The script I am trying is:
awk -F '[|][|]o[|][|]' -v s1="||o||" '
NR==FNR {
a[$2] = $1;
b[$2]= $3s1$4s1$5;
next
}
($1 in a){
$1 = "";
sub(/:/, "")
print a[$1]s1$2s1b[$1] > "result.txt";
next
}' 1.txt 2.txt
The problem is that the script applies the ||o|| field separator to 2.txt as well, which gives wrong results.
EDIT
Modified script:
awk -v s1="||o||" '
NR==FNR {
a[$2] = $1;
b[$2]= $3s1$4s1$5;
next
}
($1 in a){
$1 = "";
sub(/:/, "")
print a[$1]s1$2s1b[$1] > "result.txt";
next
}' FS = "||o||" 1.txt FS = ":" 2.txt
Now I am getting the following error:
awk: fatal: cannot open file `FS' for reading (No such file or directory)
I've modified your original script:
awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
a[$2] = $1;
b[$2] = $3 s1 $4 s1 $5;
c[$2] = $0; # keep the line for left.txt
}
NR != FNR {
split($0, d, ":");
r = substr($0, index($0, ":") + 1); # right side of the 1st ":"
if (a[d[1]] != "") {
print a[d[1]] s1 r s1 b[d[1]] > "result.txt";
c[d[1]] = ""; # drop from the list of left.txt
}
}
END {
for (var in c) {
if (c[var] != "") {
print c[var] > "left.txt"
}
}
}' 1.txt 2.txt
Next verion changes the order of file reading to reduce memory consumption:
awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
split($0, a, ":");
r = substr($0, index($0, ":") + 1); # right side of the 1st ":"
map[a[1]] = r;
}
NR != FNR {
if (map[$2] != "") {
print $1 s1 map[$2] s1 $3 s1 $4 s1 $5 > "result.txt";
} else {
print $0 > "left.txt"
}
}' 2.txt 1.txt
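A quick self-contained check of this second version, run on miniature 1.txt/2.txt samples (rows shortened from the question; the "#" in the addresses is kept as it appears in the post):

```shell
# Run the two-pass awk above on small samples and show result.txt/left.txt.
cd "$(mktemp -d)" || exit 1
cat > 1.txt <<'EOF'
a#x.com||o||bf62||o||Aida||o||Aida||o||Garic
b#y.com||o||14f8||o||Aida||o||Aida||o||Rodr
c#z.com||o||d3a6||o||Aida||o||Aida||o||Muji
EOF
cat > 2.txt <<'EOF'
bf62:hyworebu:#
14f8:sujycugu
EOF

awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
    split($0, a, ":");
    r = substr($0, index($0, ":") + 1);   # right side of the 1st ":"
    map[a[1]] = r;
}
NR != FNR {
    if (map[$2] != "") {
        print $1 s1 map[$2] s1 $3 s1 $4 s1 $5 > "result.txt";
    } else {
        print $0 > "left.txt"
    }
}' 2.txt 1.txt

cat result.txt left.txt
```

Matched rows get their hash replaced by everything after the first ":" of the 2.txt line; unmatched rows fall through to left.txt.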
The final version uses a file-based database (DB_File), which minimizes memory consumption, although I'm not sure whether Perl is acceptable on your system.
perl -e '
use DB_File;
use Fcntl;    # provides O_CREAT and O_RDWR
$file1 = "1.txt";
$file2 = "2.txt";
$result = "result.txt";
$left = "left.txt";
my $dbfile = "tmp.db";
tie(%db, "DB_File", $dbfile, O_CREAT|O_RDWR, 0644) or die "$dbfile: $!";
open(FH, $file2) or die "$file2: $!";
while (<FH>) {
chop;
@_ = split(/:/, $_, 2);
$db{$_[0]} = $_[1];
}
close FH;
open(FH, $file1) or die "$file1: $!";
open(RESULT, "> $result") or die "$result: $!";
open(LEFT, "> $left") or die "$left: $!";
while (<FH>) {
@_ = split(/\|\|o\|\|/, $_);
if (defined $db{$_[1]}) {
$_[1] = $db{$_[1]};
print RESULT join("||o||", @_);
} else {
print LEFT $_;
}
}
close FH;
untie %db;
'
rm tmp.db

Merge two files using awk in linux

I have a 1.txt file:
betomak#msn.com||o||0174686211||o||7880291304ca0404f4dac3dc205f1adf||o||Mario||o||Mario||o||Kawati
zizipi#libero.it||o||174732943.0174732943||o||e10adc3949ba59abbe56e057f20f883e||o||Tiziano||o||Tiziano||o||D'Intino
frankmel#hotmail.de||o||0174844404||o||8d496ce08a7ecef4721973cb9f777307||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||536c1287d2dc086030497d1b8ea7a175||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||9893ac33a018e8d37e68c66cae23040e||o||Nabile||o||Nabile||o||Nassime
donaldduck#yahoo.com||o||174912161.0174912161||o||0c770713436695c18a7939ad82bc8351||o||Donald||o||Donald||o||Duck
cernakova#centrum.cz||o||0174991962||o||d161dc716be5daf1649472ddf9e343e6||o||Dagmar||o||Dagmar||o||Cernakova
trgsrl#tiscali.it||o||0175099675||o||d26005df3e5b416d6a39cc5bcfdef42b||o||Esmeralda||o||Esmeralda||o||Trogu
catherinesou#yahoo.fr||o||0175128896||o||2e9ce84389c3e2c003fd42bae3c49d12||o||Cat||o||Cat||o||Sou
ermimurati24#hotmail.com||o||0175228687||o||a7766a502e4f598c9ddb3a821bc02159||o||Anna||o||Anna||o||Beratsja
cece_89#live.fr||o||0175306898||o||297642a68e4e0b79fca312ac072a9d41||o||Celine||o||Celine||o||Jacinto
kendinegel39#hotmail.com||o||0175410459||o||a6565ca2bc8887cde5e0a9819d9a8ee9||o||Adem||o||Adem||o||Bulut
A 2.txt file:
9893ac33a018e8d37e68c66cae23040e:134:#a1
536c1287d2dc086030497d1b8ea7a175:~~#!:/92\
8d496ce08a7ecef4721973cb9f777307:demodemo
The FS for 1.txt is "||o||" and for 2.txt is ":".
I want to merge the two files into a single file result.txt, on the condition that the 3rd column of 1.txt matches the 1st column of 2.txt, in which case it is replaced by the 2nd column of 2.txt.
The expected output will contain all the matching lines; here is one of them:
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime
I tried the script:
awk -F"||o||" 'NR==FNR{s=$0; sub(/:[^:]*$/, "", s); a[s]=$NF;next} {s = $5; for (i=6; i<=NF; ++i) s = s "," $i; if (s in a) { NF = 5; $5=a[s]; print } }' FS=: <(tr -d '\r' < 2.txt) FS="||o||" OFS="||o||" <(tr -d '\r' < 1.txt) > result.txt
But I am getting an empty file as the result. Any help would be highly appreciated.
If your actual input files are the same as the samples shown, then the following awk may help.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
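Why the indices 9, 13, 17, 21? With FS="|", every "||o||" delimiter contributes four "|"-separated slots (two empty fields and an "o"), so the real columns land at positions 1, 5, 9, and so on. A quick check with made-up values:

```shell
# Each "||o||" yields the field pattern <col> "" o "" <col>, so with FS="|"
# the real columns sit at positions 1, 5, 9, ...
echo 'a||o||b||o||c' | awk -F'|' '{ print $1, $5, $9 }'
```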
EDIT: Since the OP has changed the requirement a bit, here is code for the new ask. It additionally creates two files: one with the ids present in 1.txt but NOT in 2.txt, and the other vice versa.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt
You can use this awk to get your output:
awk -F ':' 'NR==FNR{a[$1]=$2 FS $3; next} FNR==1{FS=OFS="||o||"; gsub(/[|]/, "\\\\&", FS)}
$3 in a{$3=a[$3]; print}' file2 file1 > result.txt
cat result.txt
frankmel#hotmail.de||o||0174844404||o||demodemo:||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||~~#!:/92\||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime

Need help on formating a line using sed

The following is a line that I want to split into tab-separated parts:
>VFG000676(gb|AAD32411)_(lef)_anthrax_toxin_lethal_factor_precursor_[Anthrax_toxin_(VF0142)]_[Bacillus_anthracis_str._Sterne]
The output that I want is:
>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t[Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]
I used this command
grep '>' x.fa | sed 's/^>\(.*\) (gi.*) \(.*\) \[\(.*\)\].*/\1\t\2\t\3/' | sed 's/ /_/g' > output.tsv
but the output is not what I want.
UPDATE: I finally fixed the issue by using the following code:
grep '>' VFs_no_block.fa | sed 's/^>\(.*\)\((.*)\) \((.*)\) \(.*\) \(\[.*(.*)]\) \(\[.*]\).*/\1\t\2\t\3\t\4\t\5\t\6/' | sed 's/ /_/g' > VFDB_annotation_reference.tsv
Change OFS="\\t" to OFS="\t" if you really wanted literal tabs:
$ cat tst.awk
BEGIN { OFS="\\t" }
{
c=0
while ( match($0,/\[[^][]+\]|\([^)(]+\)|[^][)(]+/) ) {
tgt = substr($0,RSTART,RLENGTH)
gsub(/^_+|_+$/,"",tgt)
if (tgt != "") {
printf "%s%s", (c++ ? OFS : ""), tgt
}
$0 = substr($0,RSTART+RLENGTH)
}
print
}
$ awk -f tst.awk file
>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t[Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]
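For comparison, here is the same tokenizing loop inlined with a real tab OFS (the header line is shortened here for the demo; the field contents are made up):

```shell
# Same match()/substr() loop as tst.awk above, but with a literal tab OFS,
# so the output is genuinely tab-separated.
printf '%s\n' '>VFG000676(gb|AAD32411)_(lef)_lethal_factor_[Anthrax_(VF0142)]' |
awk 'BEGIN { OFS = "\t" }
{
    c = 0
    while (match($0, /\[[^][]+\]|\([^)(]+\)|[^][)(]+/)) {
        tgt = substr($0, RSTART, RLENGTH)
        gsub(/^_+|_+$/, "", tgt)        # trim the underscore separators
        if (tgt != "")
            printf "%s%s", (c++ ? OFS : ""), tgt
        $0 = substr($0, RSTART + RLENGTH)
    }
    print ""
}'
```

Each iteration peels off the leftmost bracketed group, parenthesized group, or plain run, and only non-empty tokens (after trimming underscores) are emitted.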
