Comparing files in Unix/Linux

I have 2 files, file1.txt and file2.txt
file1.txt
name|mandatory|
age|mandatory|
address|mandatory|
email|mandatory|
country|not-mandatory|
file2.txt
gabrielle||nashville|gabrielle#outlook.com||
These are my exact data files. In file1, column 1 is the field name and column 2 says whether that field must be non-null in file2.
In file2, the data is a single row with fields separated by |.
The age field, marked mandatory in file1, is empty in file2 (which is a single row), so that is what the output should report.
Expected output:
age mandatory
The code I have only works when file2 is in the same format as file1, with the mandatory column replaced by the field's data:
awk -F '|' '
NR==FNR && $3=="mandatory" {m[$2]++}
NR>FNR && $3=="" && m[$2] {printf "%s mandatory\n", $2}
' file1.txt file2.txt

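You have to iterate over the fields of the data row: for (i = 1; i <= NF; ++i).
awk -F '|' '
# First file: remember each field name and its flag, indexed by line number
NR==FNR { name[NR]=$1; man[NR]=$2 }
# Second file: field i of the row corresponds to line i of file1
NR!=FNR {
for (i = 1; i <= NF; ++i) {
if ($i == "" && man[i] == "mandatory") {
printf("Field %s is mandatory!\n", name[i]);
}
}
}
' file1.txt file2.txt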
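If file2 ever holds more than one data row, the same idea can be extended to report the row as well. The following is only a sketch under that assumption: the second data row (sam) and the "row N:" output format are made up for illustration and are not part of the question.

```shell
# Sketch: check every row of file2 against the mandatory flags in file1.
# The second data row and the "row N:" prefix are illustrative assumptions.
tmp=$(mktemp -d)
printf '%s\n' 'name|mandatory|' 'age|mandatory|' 'address|mandatory|' \
  'email|mandatory|' 'country|not-mandatory|' > "$tmp/file1.txt"
printf '%s\n' 'gabrielle||nashville|gabrielle#outlook.com||' \
  'sam|31||sam#example.com|us|' > "$tmp/file2.txt"

missing=$(awk -F'|' '
  # file1: store field names and whether each one is mandatory
  NR == FNR { name[++n] = $1; req[n] = ($2 == "mandatory"); next }
  # file2: report every mandatory field that is empty in this row
  {
    for (i = 1; i <= n; i++)
      if (req[i] && $i == "")
        print "row " FNR ": " name[i] " mandatory"
  }
' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$missing"
rm -rf "$tmp"
```

Looping up to the number of fields defined in file1 (n) rather than NF means a short data row is still checked for every mandatory field.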

Related

csv file manipulation in unix and append value to each line

I have the below csv file
,,,Test File,
,todays Date:,01/10/2018,Generation date,10/01/2019 11:20:58
Header 1,Header 2,Header 3,Header 4,Header 5
,My account no,100102GFC,,
A,B,C,D,E
A,B,C,D,E
A,B,C,D,E
TEST
I need to extract today's date, which is in the 3rd column of the 2nd line,
and also the account number, which is in the 3rd column of the 4th line.
Below is the new file that I have to create. The values extracted
from the 2nd and 4th lines need to be appended to the end of each data line.
The new file will contain the data from the 5th line through line n-1 (the last line is dropped):
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
Could you please help me do this in a shell script?
Here is what I tried. I am new to shell scripting and am unable to combine these steps.
To extract the date from the second row:
sed -sn 2p test.csv| cut -d ',' -f 3
To extract the account no (4th row):
sed -sn 4p test.csv| cut -d ',' -f 3
To extract the actual data
tail -n +5 test.csv | head -n -1>temp.csv
Try awk:
awk -F, 'NR==2{d=$3}NR==4{a=$3}NR>4{if (line) print line; line = $0 "," d "," a;}' Inputfile.csv
Eg:
$ cat file1
,,,Test File,
,todays Date:,01/10/2018,Generation date,10/01/2019 11:20:58
Header 1,Header 2,Header 3,Header 4,Header 5
,My account no,100102GFC,,
A,B,C,D,E
A,B,C,D,E
A,B,C,D,E
TEST
$ awk -F, 'NR==2{d=$3}NR==4{a=$3}NR>4{if (line) print line; line = $0 "," d "," a;}' file1
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
I misread the question at first and updated this answer after the question was edited.
In the awk command:
NR is the current line number, -F sets the field separator, d stores the date and a stores the account number.
Each data line $0 is simply concatenated with d and a.
Because you don't want the last line, I used the variable line to delay printing by one record. The last line is never printed (it is still saved in line, and could be used if an END block were added).
You can also try Perl:
$ cat dawn.txt
,,,Test File,
,todays Date:,01/10/2018,Generation date,10/01/2019 11:20:58
Header 1,Header 2,Header 3,Header 4,Header 5
,My account no,100102GFC,,
A,B,C,D,E
A,B,C,D,E
A,B,C,D,E
TEST
$ perl -F, -lane ' $dt=$F[2] if $.==2 ; $ac=$F[2] if $.==4; if($.>4 and ! eof) { print "$_,$dt,$ac" } ' dawn.txt
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == 2 { date = $3 }
NR == 4 { acct = $3 }
NR>4 && NF>1 { print $0, date, acct }
$ awk -f tst.awk file
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
or, depending on your requirements and actual input data:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == 2 { date = $3 }
NR == 4 { acct = $3 }
NR>4 {
if (out != "") {
print out
}
out = $0 OFS date OFS acct
}
$ awk -f tst.awk file
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC
A,B,C,D,E,01/10/2018,100102GFC

Merge two files using awk in linux

I have a 1.txt file:
betomak#msn.com||o||0174686211||o||7880291304ca0404f4dac3dc205f1adf||o||Mario||o||Mario||o||Kawati
zizipi#libero.it||o||174732943.0174732943||o||e10adc3949ba59abbe56e057f20f883e||o||Tiziano||o||Tiziano||o||D'Intino
frankmel#hotmail.de||o||0174844404||o||8d496ce08a7ecef4721973cb9f777307||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||536c1287d2dc086030497d1b8ea7a175||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||9893ac33a018e8d37e68c66cae23040e||o||Nabile||o||Nabile||o||Nassime
donaldduck#yahoo.com||o||174912161.0174912161||o||0c770713436695c18a7939ad82bc8351||o||Donald||o||Donald||o||Duck
cernakova#centrum.cz||o||0174991962||o||d161dc716be5daf1649472ddf9e343e6||o||Dagmar||o||Dagmar||o||Cernakova
trgsrl#tiscali.it||o||0175099675||o||d26005df3e5b416d6a39cc5bcfdef42b||o||Esmeralda||o||Esmeralda||o||Trogu
catherinesou#yahoo.fr||o||0175128896||o||2e9ce84389c3e2c003fd42bae3c49d12||o||Cat||o||Cat||o||Sou
ermimurati24#hotmail.com||o||0175228687||o||a7766a502e4f598c9ddb3a821bc02159||o||Anna||o||Anna||o||Beratsja
cece_89#live.fr||o||0175306898||o||297642a68e4e0b79fca312ac072a9d41||o||Celine||o||Celine||o||Jacinto
kendinegel39#hotmail.com||o||0175410459||o||a6565ca2bc8887cde5e0a9819d9a8ee9||o||Adem||o||Adem||o||Bulut
A 2.txt file:
9893ac33a018e8d37e68c66cae23040e:134:#a1
536c1287d2dc086030497d1b8ea7a175:~~#!:/92\
8d496ce08a7ecef4721973cb9f777307:demodemo
FS for 1.txt is "||o||" and for 2.txt is ":"
I want to merge the two files into a single file, result.txt. Where the 3rd column of 1.txt matches the 1st column of 2.txt, it should be replaced by the 2nd column of 2.txt.
The expected output will contain all the matching lines.
Here is one of them:
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime
I tried the script:
awk -F"||o||" 'NR==FNR{s=$0; sub(/:[^:]*$/, "", s); a[s]=$NF;next} {s = $5; for (i=6; i<=NF; ++i) s = s "," $i; if (s in a) { NF = 5; $5=a[s]; print } }' FS=: <(tr -d '\r' < 2.txt) FS="||o||" OFS="||o||" <(tr -d '\r' < 1.txt) > result.txt
But I am getting an empty file as the result. Any help would be highly appreciated.
If your actual input files look the same as the samples shown, the following awk may help.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
EDIT: Since the OP has changed the requirement a bit, here is code for the new ask. It also creates 2 files: one with the ids present in 1.txt and NOT in 2.txt, and the other with the reverse.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt
You can use this awk to get your output:
awk -F ':' 'NR==FNR{a[$1]=$2 FS $3; next} FNR==1{FS=OFS="||o||"; gsub(/[|]/, "\\\\&", FS)}
$3 in a{$3=a[$3]; print}' file2 file1 > result.txt
cat result.txt
frankmel#hotmail.de||o||0174844404||o||demodemo:||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||~~#!:/92\||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime
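One point the answers rely on without spelling out: when FS is longer than one character, awk treats it as a regular expression, and in "||o||" the | is the alternation operator rather than a literal bar. That is why the separator is either split on a single | or escaped above. The sketch below takes a different route and is only one possible approach: it splits each 1.txt line with a bracket-expression regex and cuts each 2.txt line at its first colon, so replacement values containing ":" stay intact. The three-line subset of 1.txt is chosen here just to keep the example short.

```shell
# Sketch: merge using a literal-safe separator regex and a first-colon split.
tmp=$(mktemp -d)
cat > "$tmp/1.txt" <<'EOF'
frankmel#hotmail.de||o||0174844404||o||8d496ce08a7ecef4721973cb9f777307||o||Melanie||o||Melanie||o||Kiesel
sofianomovic#msn.fr||o||174902297.0174902297||o||9893ac33a018e8d37e68c66cae23040e||o||Nabile||o||Nabile||o||Nassime
donaldduck#yahoo.com||o||174912161.0174912161||o||0c770713436695c18a7939ad82bc8351||o||Donald||o||Donald||o||Duck
EOF
cat > "$tmp/2.txt" <<'EOF'
9893ac33a018e8d37e68c66cae23040e:134:#a1
536c1287d2dc086030497d1b8ea7a175:~~#!:/92\
8d496ce08a7ecef4721973cb9f777307:demodemo
EOF

merged=$(awk '
  NR == FNR {
    # 2.txt: key is everything before the first colon, value is the rest
    p = index($0, ":")
    if (p) repl[substr($0, 1, p - 1)] = substr($0, p + 1)
    next
  }
  {
    # 1.txt: [|] matches a literal bar, so this regex is the text ||o||
    n = split($0, f, /[|][|]o[|][|]/)
    if (f[3] in repl) {
      f[3] = repl[f[3]]
      out = f[1]
      for (i = 2; i <= n; i++) out = out "||o||" f[i]
      print out
    }
  }
' "$tmp/2.txt" "$tmp/1.txt")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

Rebuilding the line by hand avoids depending on OFS and on how a particular awk re-joins fields after an assignment.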

Multi-input files for awk

I have two CSV files, the first one looks like below:
File1:
3124,3124,0,2,,1,0,1,1,0,0,0,0,0,0,0,0,1106,11
6118,6118,0,0,,0,0,1,0,0,0,0,1,1,1,1,1,5156,51
6679,6679,0,0,,1,0,1,0,0,0,0,0,1,0,1,0,1106,11
5249,5249,0,0,,0,0,1,1,0,0,0,0,0,0,0,0,1106,13
2658,2658,0,0,,1,0,1,1,0,0,0,0,0,0,0,0,1197,11
4322,4322,0,0,,1,0,1,1,0,0,0,0,0,0,0,0,1307,13
File2:
7792,1307,2012-06-07,,,,
5249,4001,2016-07-02,,,,
6001,1334,2017-01-23,,,,
2658,4001,2009-02-09,,,,
9279,1326,2014-12-20,,,,
What I need:
If $2 in file2 is 4001, match $1 of file2 against $1 of file1; if $18 in file1 is 1106 for the matched line, print that line of file1.
the expected output:
5249,5249,0,0,,0,0,1,1,0,0,0,0,0,0,0,0,1106,13
I have tried something as the following, but with no success.
awk 'NR=FNR {A[$1]=$1;next} {print $1}'
P.S: The files are compressed, so I have to use the zcat command
I would try something like:
$ cat t.awk
BEGIN { FS = "," }
# Processing first file
NR == FNR { if ($18 == 1106) a[$1] = $0; next }
# Processing second file
$2 == 4001 && $1 in a { print a[$1] }
$ awk -f t.awk file1.txt file2.txt
5249,5249,0,0,,0,0,1,1,0,0,0,0,0,0,0,0,1106,13
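The question's P.S. says the files are compressed. A minimal sketch of one way to handle that without decompressing to disk is shown below. It assumes gzip-compressed inputs with the hypothetical names file1.csv.gz and file2.csv.gz, and it uses gzip -dc, which does the same job as zcat for .gz files. file1 is piped in on standard input and file2 is read inside awk through a command pipe with getline, so the logic is inverted compared with the answer above: the ids wanted from file2 are collected first, then file1 is filtered.

```shell
# Sketch: apply the same filter to gzip-compressed inputs.
# File names and the reduced sample rows are illustrative assumptions.
tmp=$(mktemp -d)
printf '%s\n' \
  '3124,3124,0,2,,1,0,1,1,0,0,0,0,0,0,0,0,1106,11' \
  '5249,5249,0,0,,0,0,1,1,0,0,0,0,0,0,0,0,1106,13' \
  '2658,2658,0,0,,1,0,1,1,0,0,0,0,0,0,0,0,1197,11' | gzip -c > "$tmp/file1.csv.gz"
printf '%s\n' \
  '7792,1307,2012-06-07,,,,' \
  '5249,4001,2016-07-02,,,,' \
  '2658,4001,2009-02-09,,,,' | gzip -c > "$tmp/file2.csv.gz"

result=$(gzip -dc "$tmp/file1.csv.gz" |
  awk -F, -v cmd="gzip -dc '$tmp/file2.csv.gz'" '
    BEGIN {
      # Read file2 through a pipe; remember ids whose second column is 4001
      while ((cmd | getline line) > 0) {
        split(line, f, ",")
        if (f[2] == 4001) want[f[1]]
      }
      close(cmd)
    }
    # Standard input is file1: print wanted ids whose $18 is 1106
    $18 == 1106 && ($1 in want)
  ')

printf '%s\n' "$result"
rm -rf "$tmp"
```

In a shell with process substitution (bash, ksh, zsh), passing <(zcat file1.gz) <(zcat file2.gz) as the two file arguments to the original t.awk is a shorter alternative.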

Comparing two CSV files in linux

I have two CSV files with me in the following format:
File1:
No.1, No.2
983264,72342349
763498,81243970
736493,83740940
File2:
No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"
I need to compare the two files and output the matched and unmatched values.
I did it through awk:
#!/bin/bash
awk 'BEGIN {
FS = OFS = ","
}
if (FNR==1){next}
NR>1 && NR==FNR {
a[$1];
next
}
FNR>1 {
print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
delete a[$1]
}
END {
for (x in a) {
print x FS "In file1 but not in file2"
}
}'file1 file2
But the output is:
"7938493",In file2 but not in file1
"2153187",In file2 but not in file1
"8172470",In file2 but not in file1
7938493,In file1 but not in file2
2153187,In file1 but not in file2
8172470,In file1 but not in file2
Can you please tell me where I am going wrong?
Here are some corrections to your script:
BEGIN {
# FS = OFS = ","
FS = "[,\"]+"
OFS = ", "
}
# if (FNR==1){next}
FNR == 1 {next}
# NR>1 && NR==FNR {
NR==FNR {
a[$1];
next
}
# FNR>1 {
$2 in a {
# print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
print ($2 in a) ? $2 OFS "Match" : $2 OFS "In file2 but not in file1"
delete a[$2]
}
END {
for (x in a) {
print x, "In file1 but not in file2"
}
}
This is an awk script, so you can run it like awk -f script.awk file1 file2. Doing so gives these results:
$ awk -f script.awk file1 file2
736493, Match
763498, In file1 but not in file2
983264, In file1 but not in file2
The main problem with your script was that it didn't correctly handle the double quotes around the numbers in file2. I changed the input field separator so that the double quotes are treated as part of the separator to deal with this. As a result, the first field $1 in the second file is empty (it is the bit between the start of the line and the first "), so you need to use $2 to refer to the first value you're interested in. Aside from that, I removed some redundant conditions from your other blocks and used OFS rather than FS in your first print statement.
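Note that the corrected script only reports matches and file1-only values, because its last pattern is $2 in a. If values that appear only in file2 should be listed too, as the question asks, one possible variation is sketched below. It keeps a plain comma separator and strips the double quotes from the key instead of folding them into FS. The file names and the final sort (added only to make the order deterministic) are choices made for this example.

```shell
# Sketch: report matches, file2-only values and file1-only values.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
No.1, No.2
983264,72342349
763498,81243970
736493,83740940
EOF
cat > "$tmp/file2" <<'EOF'
No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"
EOF

report=$(awk -F, '
  FNR == 1 { next }                  # skip the header line of each file
  { key = $1; gsub(/"/, "", key) }   # drop optional double quotes
  NR == FNR { seen[key]; next }      # file1: remember its keys
  {
    # file2: classify each key, removing matched keys from seen
    if (key in seen) { print key ", Match"; delete seen[key] }
    else print key ", In file2 but not in file1"
  }
  END {
    # whatever is left was never matched by file2
    for (k in seen) print k ", In file1 but not in file2"
  }
' "$tmp/file1" "$tmp/file2" | LC_ALL=C sort)

printf '%s\n' "$report"
rm -rf "$tmp"
```

With the sample data this lists 736493 as a match, the two other file2 values as file2-only, and the two other file1 values as file1-only.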

Merging Multiple records into a Unique records with all the non-null values

Suppose I have 3 records :
P1||1234|
P1|56001||
P1|||NJ
I want to merge these 3 records into one with all the attributes. Final record :
P1|56001|1234|NJ
Is there any way to achieve this in Unix/Linux?
I assume you are asking for a solution with bash, awk, sed, etc.
You could try something like:
$ cat test.txt
P1||1234|
P1|56001||
P1|||NJ
$ cat test.txt | awk -F'|' '{ for (i = 1; i <= NF; i++) print $i }' | egrep '.+' | sort | uniq | awk 'BEGIN{ c = "" } { printf c $0; c = "|" } END{ printf "\n" }'
1234|56001|NJ|P1
Briefly, the first awk splits each line on the '|' separator and prints each field on its own line. egrep removes the empty lines. After that, sort and uniq remove duplicate values. Finally, the second awk joins the lines with the '|' separator.
Update:
If I understand correctly, here's what you are looking for:
$ cat test.txt | awk -F'|' '{ for (i = 1; i <= NF; i++) if($i) col[i]=$i } END{ for (i = 1; i <= length(col); i++) printf col[i] (i == length(col) ? "\n" : "|")}'
P1|56001|1234|NJ
In your example, the 1st row has 1234 and the 2nd row has 56001.
I don't get why 56001 goes before 1234 in your final result. I assume it is a typo/mistake.
An awk one-liner could do the job:
awk -F'|' '{for(i=2;i<=NF;i++)if($i)a[$1]=(a[$1]?a[$1]"|":"")$i}END{print $1"|"a[$1]}'
with your data:
kent$ echo "P1||1234|
P1|56001||
P1||NJ"|awk -F'|' '{for(i=2;i<=NF;i++)if($i)a[$1]=(a[$1]?a[$1]"|":"")$i}END{print $1"|"a[$1]}'
P1|1234|56001|NJ
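If the expected output P1|56001|1234|NJ is read as "keep each value in its own column", then the merge has to be done per key and per column position. The sketch below works under that reading. The P2 rows are invented here to show that several keys are handled, and when a column has more than one non-empty value for a key, the last one wins, which is an arbitrary choice.

```shell
# Sketch: merge records sharing the same first field, column by column.
# The P2 rows are illustrative and not part of the question.
merged=$(printf '%s\n' 'P1||1234|' 'P1|56001||' 'P1|||NJ' 'P2|7||' 'P2|||TX' |
awk -F'|' -v OFS='|' '
  {
    if (!($1 in nf)) order[++n] = $1     # remember keys in first-seen order
    if (NF > nf[$1]) nf[$1] = NF         # widest record seen for this key
    for (i = 2; i <= NF; i++)
      if ($i != "") val[$1, i] = $i      # last non-empty value wins
  }
  END {
    for (k = 1; k <= n; k++) {
      key = order[k]
      line = key
      for (i = 2; i <= nf[key]; i++) line = line OFS val[key, i]
      print line
    }
  }')

printf '%s\n' "$merged"
```

Testing $i != "" instead of if ($i) keeps values such as 0, which awk would otherwise treat as false.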
