linux shell get multi file intersection

linux shell get multi file intersection - linux

I have a few txt file examples 1.txt 2.txt 3.txt 4.txt
I want to get 1.txt 2.txt 3.txt 4.txt content intersection
cat 1.txt 2.txt | sort | uniq -c > tmp.txt
cat tmp.txt 3.txt | sort | uniq -c > tmp2.txt
and so on ....
Is there a better way?
input text
1.txt
1
2
3
4
2.txt
1
2
3
3.txt
1
2
4.txt
1
5
expected output:
1

With your shown samples please try following awk code.
1st solution: This considers that you may have duplicates values of lines with in a single Input_file itself then you may try following:
awk '
!arr2[FILENAME,$0]++{
arr1[$0]++
}
END{
for(i in arr1){
if(arr1[i]==(ARGC-1)){
print i
}
}
}
' *.txt
2nd solution: This solution assumes that there is no duplicates in Input_file if this is the case then try following:
awk '
{
arr[$0]++
}
END{
for(i in arr){
if(arr[i]==(ARGC-1)){
print i
}
}
}
' *.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
arr[$0]++ ##Creating an array named arr with index of $0 and keep increasing its value.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through array arr here.
if(arr[i]==(ARGC-1)){ ##Checking condition if value of current item in arr is Equal to total number of files then print it.
print i
}
}
}
' *.txt ##Passing all .txt files as an input to awk program from here.

Related

Print the input search string if grep doesn't match

I have file1
BOB
JOHN
SALLY
I have file2
There was a boy called JOHN and he was playing with FRED while
JILL went off to find a bucket of water from TOM but she
fell down the hill.
I want to iterate through the file1 words and search for these in file2.
I want to print the words that are NOT found in file2.
So the output would be
BOB
SALLY
I guess it is if the grep fails, I'd like to print the string that grep was searching for.
I'm starting here:
grep -o -f file1 file2
But of course, this returns
JOHN
How would I get the original search strings that didn't match - to print instead?

Here is a grep one liner to get this done:
grep -vxFf <(tr '[[:blank:]]' '\n' < file2) file1
BOB
SALLY
Using tr to convert space/tab to newline first then using grep -vxFf to get non-matching words in file1.
Or as David suggested in comments below:
grep -vxFf <(printf '%s\n' $(<file2)) file1

With your shown samples could you please try following.
awk '
FNR==NR{
arr[$0]
next
}
{
for(i in arr){
if(index($0,i)){
delete arr[i]
next
}
}
}
END{
for(i in arr){
print i
}
}
' file1 file2

If the order isn't critical, you can use:
awk '
FNR == NR { a[$1]=0; next }
{ for (i=1;i<=NF;i++)
if ($i in a)
a[$i]++
}
END {
for (i in a)
if (!a[i])
print i
}
' file1 file2
Example Use/Output
$ awk '
> FNR == NR { a[$1]=0; next }
> { for (i=1;i<=NF;i++)
> if ($i in a)
> a[$i]++
> }
> END {
> for (i in a)
> if (!a[i])
> print i
> }
> ' file1 file2
SALLY
BOB

implementing Excel-vlookup-like function with awk

I have a question about vlookup function implementation with awk. I have a csv file having id-score pairs like this (say 1.csv):
id,score
1,16
3,12
5,13
11,8
13,32
17,37
23,74
29,7
31,70
41,83
There are "unscored" guys. I also have a csv file including all registered guys both scored and unscored like this (say, 2.csv) (I transposed for the want of space)
id,1,3,5,7,11,13,17,19,23,29,31,37,41
I would like to generate id-score pairs according to 2nd csv file so as to include both scored and unscored guys. For unscored guys, NAN would be used instead of the digit.
In other words, final result is desired to be like this:
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83
When I tried to create a new table with the following awk command, it did not work to me. Thanks in advance for any advice.
awk 'FNR==NR{a[$1]++; next} {print $0, (a[$1]) ? a[$2] : "NAN"}' 1.csv 2.csv

here is your script with fixes: set field separators; save the score value for each id; print the value from lookup, if missing NaN
$ awk 'BEGIN {FS=OFS=","}
FNR==NR {a[$1]=$2; next}
{print $1, (($1 in a)?a[$1]:"NAN")}' file1 file2
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83

With bash and join:
echo "id,score"
join --header -j 1 -t ',' <(sort 1.csv | grep -v '^id') <(tr ',' '\n' < 2.csv | grep -v '^id' | sort) -e "NAN" -a 2 -o 2.1,1.2 | sort -n
Output:
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83
See: man join

With awk could you please try following, written with shown samples in GNU awk. Considering(like your shown samples) your both the Input_files have headers in their first line.
awk -v counter=2 '
FNR==1{
next
}
FNR==NR{
a[FNR]=$0
b[FNR]=$1
next
}
{
if($0==b[counter]){
print a[counter]
counter++
}
else{
print $0",NA"
}
}
' FS="," 1.csv <(tr ',' '\n' < 2.csv)
Explanation: Adding detailed explanation for above.
awk -v counter=2 ' ##Starting awk program from here and setting counter as 2.
FNR==1{ ##Checking condition if line is 1st then do following.
next ##next will skip all further statements from here.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when Input_file 1.csv is being read.
a[FNR]=$0 ##Creating array a with index FNR and value of current line.
b[FNR]=$1 ##Creating array b with index FNR and value of 1st field of current line.
next ##next will skip all further statements from here.
}
{
if($0==b[counter]){ ##Checking condiiton if current line is same as array b with index counter value then do following.
print a[counter] ##Printing array a with index of counter here.
counter++ ##Increasing count of counter by 1 each time cursor comes here.
}
else{ ##Else part of for above if condition starts here.
print $0",NA" ##Printing current line and NA here.
}
}
' FS="," 1.csv <(tr ',' '\n' < 2.csv) ##Setting FS as , for Input_file 1.csv and sending 2.csv output by changing comma to new line to awk.

An awk solution could be:
awk -v FS=, -v OFS=, '
NR == 1 { print; next }
NR == FNR { score[$1] = $2; next }
{ for (i = 2; i <= NF; ++i)
print $i, score[$i] == "" ? "NAN" : score[$i] }
' 1.csv 2.csv

extract data using sed or awk in linux

I am trying to merge data from 2 text files based on some condition.
I have two files:
1.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||0f8caa3ced5dc172901a427410d20540
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||f139d8b49085d012af9048bb1cba3534
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||d7ca4d78fc79a795695ae1c161ce82ea
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
2.txt
f139d8b49085d012af9048bb1cba3534: 12883 #: "#
d7ca4d78fc79a795695ae1c161ce82ea: 123422
0f8caa3ced5dc172901a427410d20540 :: demo
Contains the matching lines from 1.txt and hash is replaced with corresponding value in 2.txt
result.txt
gera077 || o || emi_riv_90#hotmail.com || or || 200.45.113.254 || o ||: demo
glen-666-||glen-666#hotmail.com||||84.196.42.167||||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
Contains the non-matching lines from 1.txt
left.txt
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
The script I am trying is :
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print c[j] > "left.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt > result.txt
But it is giving me empty result.txt
I am facing difficulty in debugging the issue.
Any help would be highly appreciated.

Try following awk(completely based on your shown Input_file(s) and considering that your 2.txt will not have any duplicates on it too) and let me know if this helps you.
awk 'FNR==NR{a[$NF]=$0;next} $1~/:/{sub(/:/,"",$1);flag=1} ($1 in a){val=$1;if($0 ~ /:/ && !flag){sub(/[^:]*/,"");sub(/:/,"")};print a[val] OFS $0 > "result.txt";flag="";delete a[val]} END{for(i in a){print a[i]>"left.txt"}}' FS="|" 1.txt FS=" " OFS="||o||" 2.txt
Output will be 2 files named results.txt and left.txt. Will add non-one liner form and explanation too for above code shortly.
Adding a non-one liner form of solution too now.
awk '
FNR==NR{ ##FNR and NR both are awk out of the box variables and they denote line numbers in Input_file(s), difference between them is FNR value will be RESET when it complete reading 1 Input_file and NR value will be keep increasing till it completes reading all the Input_file(s).
a[$NF]=$0; ##Creating an array named a whose index is $NF(value of last field of current line) and value is current line.
next ##next is awk out of the box keyword which will skip all further statements now.
}
$1~/:/{ ##Checking condition here if current lines 1st field has a colon in it then do following:
sub(/:/,"",$1); ##Using sub function of awk which will substitute colon with NULL of 1st field of current line of current Input_file.
flag=1 ##Setting a variable named flag here(basically to make sure that 1st colon is substituted so need for another colon removal.
}
($1 in a){ ##Checking a condition here if current line $1 is present in array a then do following:
val=$1; ##Setting variable named val value to $1 here.
if($0 ~ /:/ && !flag){ ##Checking condition here if current line is having colon and variable flag is NOT NULL then do following:
sub(/[^:]*/,""); ##Substituting all the values from starting to till colon comes with NULL.
sub(/:/,"")}; ##Then substituting only 1 colon here.
print a[val] OFS $0 > "result.txt"; ##printing the value of array a whose index is variable val OFS(output field separator) current line values to output file named results.txt here.
flag=""; ##Unsetting the value of variable flag here.
delete a[val] ##Deleting the value of array a whose index is variable val here.
}
END{ ##Starting end section of this awk program here. which will be executed once all Input_file(s) have been read.
for(i in a){ ##Traversing through the array a now.
print a[i]>"left.txt"} ##Printing the value of array a(which will basically provide those values which are NOT matched in both files) in left.txt file.
}
' FS="|" 1.txt FS=" " OFS="||o||" 2.txt ##Setting FS="|" for 1.txt Input_file and then setting FS=" " and OFS="||o||" for 2.txt Input_file, 1.txt and 2.txt are Input_files for this program to run.

This awk script may also help.
$ awk 'BEGIN{FS="\|";OFS="|"}NR==FNR{data[$1]=$2;}
NR!=FNR{if($NF in data){
$NF=data[$NF];print >"result.txt"
}else{
print >"left.txt"}
}' <( sed 's/\s*:\s*/|/' 2.txt) 1.txt 2>/dev/null
Output
$ cat result.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||: demo
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
$ cat left.txt
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
We have preprocessed the first file - using sed - to make its field delimiter | and used process substitution to pass the result to awk.

Matching files using awk in linux

I have 2 files:
1.txt:
e10adc3949ba59abbe56e057f20f883e
f8b46e989c5794eec4e268605b63eb59
e3ceb5881a0a1fdaad01296d7554868d
2.txt:
e10adc3949ba59abbe56e057f20f883e:1111
679ab793796da4cbd0dda3d0daf74ec1:1234
f8b46e989c5794eec4e268605b63eb59:1#/233:
I want 2 files as output:
One is result.txt which contains lines from 2.txt whose match is in 1.txt
and another is left.txt which contains lines from 1.txt whose match is not in 2.txt
Expected output of both files is below:
result.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d
I tried 1-2 approaches with awk but not succeeded. Any help would be highly appreciated.
My script:
awk '
FNR==NR{
val=$1;
sub(/[^:]*/,"");
sub(/:/,"");
a[val]=$0;
next
}
!($NF in a){
print > "left.txt";
next
}
{
print $1,$2,a[$NF]> "result.txt"
}
' FS=":" 2.txt FS=":" OFS=":" 1.txt

Following awk may help you in same.
awk 'FNR==NR{a[$1]=$0;next} ($0 in a){print a[$0] > "results.txt";next} {print > "left.txt"}' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt
EDIT: Adding explanation of code too here.
awk '
FNR==NR{ ##FNR==NR condition will be TRUE when first Input_file is being read by awk. Where FNR and NR are the out of the box variables for awk.
a[$1]=$0; ##creating an array named a whose index is $1 and value is $2 from 2.txt Input_file.
next ##next is out of the box keyword from awk and will skip all further statements of awk.
}
($0 in a){ ##Checking here condition if current line of Input_file 1.txt is present in array named a then do following.
print a[$0] > "results.txt"; ##Printing the current line into output file named results.txt, since current line is coming in array named a(which was created by 1st file).
next ##next is awk keyword which will skip further statements for awk code now.
}
{
print > "left.txt" ##Printing all lines which skip above condition(which means they did not come into array a) to output file named left.txt as per OP need.
}
' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt ##Setting FS(field separator) as colon for 2.txt and Setting FS to space for 1.txt here. yes, we could set multiple field separators for different Input_file(s).

How about this one:
awk 'BEGIN{ FS = ":" }NR==FNR{ a[$0]; next }$1 in a{ print $0 > "results.txt"; delete a[$1]; next }END{ for ( i in a ) print i > "left.txt" }' 1.txt 2.txt
Output:
results.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d

Merge two files using awk in linux

I have a 1.txt file:
betomak#msn.com||o||0174686211||o||7880291304ca0404f4dac3dc205f1adf||o||Mario||o||Mario||o||Kawati
zizipi#libero.it||o||174732943.0174732943||o||e10adc3949ba59abbe56e057f20f883e||o||Tiziano||o||Tiziano||o||D'Intino
frankmel#hotmail.de||o||0174844404||o||8d496ce08a7ecef4721973cb9f777307||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||536c1287d2dc086030497d1b8ea7a175||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||9893ac33a018e8d37e68c66cae23040e||o||Nabile||o||Nabile||o||Nassime
donaldduck#yahoo.com||o||174912161.0174912161||o||0c770713436695c18a7939ad82bc8351||o||Donald||o||Donald||o||Duck
cernakova#centrum.cz||o||0174991962||o||d161dc716be5daf1649472ddf9e343e6||o||Dagmar||o||Dagmar||o||Cernakova
trgsrl#tiscali.it||o||0175099675||o||d26005df3e5b416d6a39cc5bcfdef42b||o||Esmeralda||o||Esmeralda||o||Trogu
catherinesou#yahoo.fr||o||0175128896||o||2e9ce84389c3e2c003fd42bae3c49d12||o||Cat||o||Cat||o||Sou
ermimurati24#hotmail.com||o||0175228687||o||a7766a502e4f598c9ddb3a821bc02159||o||Anna||o||Anna||o||Beratsja
cece_89#live.fr||o||0175306898||o||297642a68e4e0b79fca312ac072a9d41||o||Celine||o||Celine||o||Jacinto
kendinegel39#hotmail.com||o||0175410459||o||a6565ca2bc8887cde5e0a9819d9a8ee9||o||Adem||o||Adem||o||Bulut
A 2.txt file:
9893ac33a018e8d37e68c66cae23040e:134:#a1
536c1287d2dc086030497d1b8ea7a175:~~#!:/92\
8d496ce08a7ecef4721973cb9f777307:demodemo
FS for 1.txt is "||o||" and for 2.txt is ":"
I want to merge two files in a single file result.txt based on the condition that the 3rd column of 1.txt must match with 1st column of 2.txt file and should be replaced by the 2nd column of 2.txt file.
The expected output will contain all the matching lines:
I am showing you one of them:
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime
I tried the script:
awk -F"||o||" 'NR==FNR{s=$0; sub(/:[^:]*$/, "", s); a[s]=$NF;next} {s = $5; for (i=6; i<=NF; ++i) s = s "," $i; if (s in a) { NF = 5; $5=a[s]; print } }' FS=: <(tr -d '\r' < 2.txt) FS="||o||" OFS="||o||" <(tr -d '\r' < 1.txt) > result.txt
But getting an empty file as the result. Any help would be highly appreciated.

If your actual Input_file(s) are same as shown sample then following awk may help you in same.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
EDIT: Since OP has changed requirement a bit so providing code as per new ask where it will create 2 files too 1 file which will have ids present in 1.txt and NOT in 2.txt and other will be vice versa of it.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt

You can use this awk to get your output:
awk -F ':' 'NR==FNR{a[$1]=$2 FS $3; next} FNR==1{FS=OFS="||o||"; gsub(/[|]/, "\\\\&", FS)}
$3 in a{$3=a[$3]; print}' file2 file1 > result.txt
cat result.txt
frankmel#hotmail.de||o||0174844404||o||demodemo:||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||~~#!:/92\||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

linux shell get multi file intersection - linux

Related

Print the input search string if grep doesn't match

implementing Excel-vlookup-like function with awk

extract data using sed or awk in linux

Matching files using awk in linux

Merge two files using awk in linux

Categories

Resources