I have two text files:
File-1:
PRKCZ
TNFRSF14
PRDM16
MTHFR
File-2 (contains two tab-delimited columns):
atherosclerosis GRAB1|PRKCZ|TTN
cardiomyopathy,hypercholesterolemia PRKCZ|MTHFR
Pulmonary arterial hypertension,arrhythmia PRDM16|APOE|GATA4
Now, for each name in File-1, I also want to print the corresponding disease names from File-2 wherever it matches. So the output would be:
PRKCZ atherosclerosis,cardiomyopathy,hypercholesterolemia
PRDM16 Pulmonary arterial hypertension,arrhythmia
MTHFR cardiomyopathy,hypercholesterolemia
I have tried the code:
$ awk '{k=$1}
NR==FNR{if(NR>1)a[k]=","b"="$1";else{a[k]="";b=$1}next}
k in a{print $0a[k]}' File1 File2
but I did not obtain the desired output. Can anybody help correct it, please?
You can do this with the following awk script:
script.awk
BEGIN { FS="[\t]" }
NR==FNR { split($2, tmp, "|")
for( ind in tmp ) {
name = tmp[ ind ]
if (name in disease) { disease[ name ] = disease[ name ] "," $1 }
else { disease[ name ] = $1 }
}
next
}
{ if( $1 in disease) print $1, disease[ $1 ] }
Use it like this: awk -f script.awk File-2 File-1 (note that File-2 comes first).
Explanation:
the BEGIN block sets tab as the field separator.
the NR == FNR block is executed for the first argument (File-2): it reads the diseases together with the names, splits the names, and appends the disease to a dictionary under each of the names.
the last block is executed only for the second argument (File-1), thanks to the next in the previous block: it outputs the diseases stored under the name taken from $1.
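For the sample files above, running it should produce exactly the desired output (names are printed in File-1 order; TNFRSF14 is skipped because it never appears in File-2):
$ awk -f script.awk File-2 File-1
PRKCZ atherosclerosis,cardiomyopathy,hypercholesterolemia
PRDM16 Pulmonary arterial hypertension,arrhythmia
MTHFR cardiomyopathy,hypercholesterolemia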
Related
I have a source.txt file like the one below, containing two columns of data. The columns of source.txt are wrapped in [ ] (square brackets), as shown:
[hot] [water]
[16] [boots and, juice]
and I have another target.txt file, which contains empty lines plus full stops at the end of each line:
the weather is today (foo) but we still have (bar).
= (
the next bus leaves at (foo) pm, we can't forget to take the (bar).
I want to replace the foo of each nth line of target.txt with the respective contents of the first column of source.txt, and likewise replace the bar of each nth line of target.txt with the respective contents of the second column of source.txt.
I searched other sources to understand how I would do it. I already have a command that I use to "replace each nth occurrence of foo with the numerically respective nth line of a supplied file", but I couldn't adapt it:
awk 'NR==FNR {a[NR]=$0; next} /foo/{gsub("foo", a[++i])} 1' source.txt target.txt > output.txt;
I remember seeing a way to use gsub with a file containing two columns of data, but I don't remember what exactly the difference was.
EDIT: the target.txt text sometimes contains symbols such as =, ( and ) between words. I added these symbols to the sample because some answers will not work when they are present in target.txt.
Note: the number of lines in target.txt, and therefore the number of occurrences of foo and bar in the file, can vary; I just showed a sample. But foo and bar each occur exactly once per row.
With your shown samples, please try the following answer, written and tested in GNU awk.
awk -F'\\[|\\] \\[|\\]' '
FNR==NR{
foo[FNR]=$2
bar[FNR]=$3
next
}
NF{
gsub(/\<foo\>/,foo[++count])
gsub(/\<bar\>/,bar[count])
}
1
' source.txt FS=" " target.txt
Explanation: adding a detailed, commented version of the above.
awk -F'\\[|\\] \\[|\\]' ' ##Setting field separator as [ OR ] [ OR ] here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when source.txt will be read.
foo[FNR]=$2 ##Creating foo array with index of FNR and value of 2nd field here.
bar[FNR]=$3 ##Creating bar array with index of FNR and value of 3rd field here.
next ##next will skip all further statements from here.
}
NF{ ##If line is NOT empty then do following.
gsub(/\<foo\>/,foo[++count]) ##Globally substituting foo with array foo value, whose index is count.
gsub(/\<bar\>/,bar[count]) ##Globally substituting bar with array of bar with index of count.
}
1 ##printing line here.
' source.txt FS=" " target.txt ##Mentioning Input_files names here.
EDIT: Adding the following solution also, which will handle any number of occurrences of [...] in the source file and match them in the target file. Since this is a working solution for the OP (confirmed in comments), adding it here. Fair warning: this will fail when source.txt contains a &.
awk '
FNR==NR{
while(match($0,/\[[^]]*\]/)){
arr[++count]=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
}
next
}
{
line=$0
while(match(line,/\(?[[:space:]]*(\<foo\>|\<bar\>)[[:space:]]*\)?/)){
val=substr(line,RSTART,RLENGTH)
sub(val,arr[++count1])
line=substr(line,RSTART+RLENGTH)
}
}
1
' source.txt target.txt
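If you need to guard against a literal & in source.txt (the caveat mentioned above), one possible adjustment to the main loop, a sketch rather than part of the confirmed answer, is to escape the replacement text before calling sub():
while(match(line,/\(?[[:space:]]*(\<foo\>|\<bar\>)[[:space:]]*\)?/)){
val=substr(line,RSTART,RLENGTH)
repl=arr[++count1]
gsub(/&/,"\\\\&",repl)   ##Turn each "&" into "\&" so sub() inserts it literally.
sub(val,repl)
line=substr(line,RSTART+RLENGTH)
}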
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN {
FS="[][]"
tags["foo"]
tags["bar"]
}
NR==FNR {
map["foo",NR] = $2
map["bar",NR] = $4
next
}
{
found = 0
head = ""
while ( match($0,/\([^)]+\)/) ) {
tag = substr($0,RSTART+1,RLENGTH-2)
if ( tag in tags ) {
if ( !found++ ) {
lineNr++
}
val = map[tag,lineNr]
}
else {
val = substr($0,RSTART,RLENGTH)
}
head = head substr($0,1,RSTART-1) val
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}
$ awk -f tst.awk source.txt target.txt
the weather is today hot but we still have water.
= (
the next bus leaves at 16 pm, we can't forget to take the boots and, juice.
awk '
NR==FNR { # build lookup
# delete gumph
gsub(/(^[[:space:]]*\[)|(\][[:space:]]*$)/, "")
# split
split($0, a, /\][[:space:]]+\[/)
# store
foo[FNR] = a[1]
bar[FNR] = a[2]
next
}
!/[^[:space:]]/ { next } # ignore blank lines
{ # do replacements
VFNR++ # FNR - (ignored lines)
# can use sub if foo/bar only appear once
gsub(/\<foo\>/, foo[VFNR])
gsub(/\<bar\>/, bar[VFNR])
print
}
' source.txt target.txt
Note: \< and \> are not in POSIX, but they are accepted by some versions of awk (e.g. gawk). I'm not sure whether POSIX awk regex has a "word boundary" at all.
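A rough POSIX-only workaround, a sketch rather than part of the answer above, is to loop with match() and test the neighbouring characters by hand:
# Replace every whole-word occurrence of the literal string word in $0.
# Safe here because "foo" and "bar" contain no regex metacharacters.
function subword(word, repl,    s, out, pre, post) {
    s = $0; out = ""
    while (match(s, word)) {
        pre  = (RSTART > 1) ? substr(s, RSTART - 1, 1) : ""
        post = substr(s, RSTART + RLENGTH, 1)
        if (pre !~ /[[:alnum:]_]/ && post !~ /[[:alnum:]_]/)
            out = out substr(s, 1, RSTART - 1) repl      # word boundary: replace
        else
            out = out substr(s, 1, RSTART + RLENGTH - 1) # inside a word: keep
        s = substr(s, RSTART + RLENGTH)
    }
    $0 = out s
}
With it, the two gsub calls become subword("foo", foo[VFNR]) and subword("bar", bar[VFNR]).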
I have two files. The first column is common to both files, and I would like to merge them, generating output where the second field of file1 is copied as a third column onto every line of file2 whose first column matches.
file1
412234;mark
413234;raja
file2
412234;value1
412234;value2
412234;value3
412234;value4
413234;value1
413234;value2
413234;value3
Output file
412234;value1;mark
412234;value2;mark
412234;value3;mark
412234;value4;mark
413234;value1;raja
413234;value2;raja
413234;value3;raja
Try this:
awk -F';' 'BEGIN{FS=OFS=";"} FNR==NR{a[$1]=$2; next} ($1 in a){print $1, $2, a[$1]}' file1 file2
explanation:
-F';' means that awk will use ; as the field separator;
BEGIN{FS=OFS=";"} also sets the output field separator, used by the print function;
awk parses all files sequentially; the condition:
FNR==NR
is true only while parsing the first file.
While parsing file1, it saves an array a with the first field as index and the second field as value;
a is expected to be
a[412234] = mark
a[413234] = raja
($1 in a) is the condition to be met: it is true when the first field of a file2 line is found as an index of array a.
If true then execute:
print $1, $2, a[$1]
which, since OFS is ;, prints the matching line from file2 followed by the value of array a saved from file1.
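Since both sample files are already sorted on the key, the same merge could also be done with join(1); an alternative for comparison, not part of the awk answer:
join -t';' -o 2.1,2.2,1.2 file1 file2
This prints the file2 key and value followed by the file1 name, dropping unpaired lines just like the ($1 in a) test does.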
----- EDIT
In case file1 contains multiple lines with the same index, you need to save all the values in vectors and then scan the whole vector for multiple matches on file2:
awk -F';' '
function vlen(a) {n=0; for(i in a) n++; return n}  # helper function defined here
function contained(val, vect) {found=0; for (x in vect) if (vect[x]==val) found=1; return found}  # helper function defined here
BEGIN {FS=OFS=";"}  # set the output field separator
FNR==NR {n=vlen(a); a[n]=$1; b[n]=$2; next}  # scan file1 and save all indexes and values in parallel vectors
{ if (contained($1,a)) { for (i in a) if (a[i]==$1) print $1, $2, b[i] } else print $1, $2 }  # for each line in file2, scan the whole vector a looking for matches
' file1 file2
Here we define the vlen and contained helper functions.
Would you try the following:
awk '
BEGIN {FS=OFS=";"}
NR==FNR {
c[$1]++
a[$1,c[$1]]=$2
next
}
{
if (c[$1]) {
for (i=1; i<=c[$1]; i++) {
$3=a[$1,i]; print
}
} else {
print
}
}' file1 file2
Result with the file1 and file2 provided in the OP's last comment:
412234;value1;mark
412234;value1;raja
412234;value2;mark
412234;value2;raja
413234;value1
413234;value2
If the index in the 1st column (such as 412234) appears more than once
in file1, we need to preserve the existing value in the 2nd column
(such as mark) without overwriting.
Then an array c is introduced to count the occurrences of the index.
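To make that concrete: with a file1 that contains both 412234;mark and 412234;raja, after the first block the arrays would hold (a sketch):
c[412234] = 2
a[412234,1] = "mark"
a[412234,2] = "raja"
so the for loop prints each matching file2 line once per stored value.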
Note that the order of the result differs from the OP's expected output.
I hope it is acceptable.
I am working on my first bash scripts and got stuck at a point where I need help from the forum.
How can I implement the below in a shell script? (Any suggestions/pointers are appreciated!)
Requirement:
Compare 2 files on a matching KEY inside each long string, and persist to a 3rd file only the long strings that differ in other attributes (say, the value of USER differs). Also, skip some attributes during the comparison.
Input FILE1-
AAUTOX=Y;ACCT=;ACTION=C;APRICE=99.975;AQTY=5541;USER=Sam,bpl;CONFIRM=Y;KEY=29976DYE4;DEPT=MYNA-CLCD -- same
AAUTOX=Y;ACCT=;ACTION=C;APRICE=05.975;AQTY=3451;USER=Todd,chr;CONFIRM=N;KEY=29976DYE5;DEPT=MYNA-CLCD -- diff (USER=Todd,chr) write in result file
Input FILE2-
AAUTOX=Y;ACCT=;ACTION=C;APRICE=99.975;AQTY=5541;USER=Sam,bpl;CONFIRM=Y;KEY=29976DYE4;DEPT=MYNA-CLCD -- same
AAUTOX=Y;ACCT=;ACTION=C;APRICE=05.975;AQTY=3451;USER=Alan,ncr;CONFIRM=N;KEY=29976DYE5;DEPT=MYNA-CLCD -- diff (USER=Alan,ncr) write in result file
AAUTOX=Y;ACCT=;ACTION=C;APRICE=17.000;AQTY=6453;USER=Todd,chr;CONFIRM=N;KEY=29976DYE6;DEPT=MYNA-CLCD -- no match (KEY) found write in result file
Output FILE3:
FILE1:AAUTOX=Y;ACCT=;ACTION=C;APRICE=05.975;AQTY=3451;USER=Todd,chr;CONFIRM=N;KEY=29976DYE5;DEPT=MYNA-CLCD
FILE2:AAUTOX=Y;ACCT=;ACTION=C;APRICE=05.975;AQTY=3451;USER=Alan,ncr;CONFIRM=N;KEY=29976DYE5;DEPT=MYNA-CLCD
FILE1:
FILE2:AAUTOX=Y;ACCT=;ACTION=C;APRICE=17.000;AQTY=6453;USER=Todd,chr;CONFIRM=N;KEY=29976DYE6;DEPT=MYNA-CLCD
and so on for each differing line...
Approach in my mind (it's a first cut and could be improved later):
Read FILE1 line by line (awk or read?)
For each line:
a) read FILE2 for the matching unique KEY (which command to use here? Can awk read a file based on a key? I could grep the KEY from FILE2, but then how do I break the line into fields for comparison?)
b) compare each field of FILE1.LINE with the matching FILE2.LINE and, if they differ, write both to the 3rd result file (awk breaks a line into fields $1, $2, ... so I could compare those, though I am not sure how to do it with the read command).
This uses GNU awk 4.* for sorted_in (see http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal); with other awks you can pipe to sort or otherwise determine the key order:
$ cat tst.awk
BEGIN { FS="[;=]" }
{
delete name2val
for (i=1; i<=NF; i+=2) { name2val[$i] = $(i+1) }
key = name2val["KEY"]
keys[key]
recs[key,FILENAME] = $0
for (name in name2val) { vals[key,FILENAME,name] = name2val[name] }
}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
file1 = ARGV[1]
file2 = ARGV[2]
for (key in keys) {
state = "SAME"
if ( (key,file1) in recs ) {
if ( (key,file2) in recs ) {
for (name in name2val) {
if (name != "CONFIRM") {
if (vals[key,file1,name] != vals[key,file2,name]) {
state = "DIFF"
}
}
}
} else { state = "FILE1_ONLY" }
} else { state = "FILE2_ONLY" }
if (state != "SAME") {
print file1":", recs[key,file1]
print file2":", recs[key,file2]
print ""
}
}
}
$ gawk -f tst.awk FILE1 FILE2
FILE1: AAUTOX=Y;ACCT=;ACTION=C;APRICE=05.975;AQTY=3451;USER=Todd,chr;CONFIRM=N;KEY=29976DYE5;DEPT=MYNA-CLCD -- diff (USER=Todd,chr) write in result file
FILE2: AAUTOX=Y;ACCT=;ACTION=C;APRICE=05.975;AQTY=3451;USER=Alan,ncr;CONFIRM=N;KEY=29976DYE5;DEPT=MYNA-CLCD -- diff (USER=Alan,ncr) write in result file
FILE1:
FILE2: AAUTOX=Y;ACCT=;ACTION=C;APRICE=17.000;AQTY=6453;USER=Todd,chr;CONFIRM=N;KEY=29976DYE6;DEPT=MYNA-CLCD -- no match (KEY) found write in result file
I want to transform a file from this format
1;a;34;34;a
1;a;34;23;d
1;a;34;23;v
1;a;4;2;r
1;a;3;2;d
2;f;54;3;f
2;f;34;23;e
2;f;23;5;d
2;f;23;23;g
3;t;26;67;t
3;t;34;45;v
3;t;25;34;h
3;t;34;23;u
3;t;34;34;z
to this format
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
These are CSV files, so it should work with awk or sed, but I have failed so far. If the first value is the same, I want to append the last three values to the first line, and this should run until the last entry in the file.
Here is some code in awk, but it does not work:
#!/usr/bin/awk -f
BEGIN{ FS = " *; *"}
{ ORS = "\;" }
{
x = $1
print $0
}
{ if (x == $1)
print $3, $4, $5
else
print "\n"
}
END{
print "\n"
}
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ curr = $1 FS $2 }
curr == prev {
sub(/^[^;]*;[^;]*/,"")
printf "%s", $0
next
}
{
printf "%s%s", (NR>1?ORS:""), $0
prev = curr
}
END { print "" }
$ awk -f tst.awk file
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
If I understand you correctly, you want to build a line from fields 3-5 of all lines sharing the same first two fields (preceded by those two fields). Then:
awk -F \; 'key != $1 FS $2 { if(NR != 1) print line; key = $1 FS $2; line = key } { line = line FS $3 FS $4 FS $5 } END { print line }' filename
That is
key != $1 FS $2 { # if the key (first two fields) changed
if(NR != 1) print line; # print the line (except at the very
# beginning, to not get an empty line there)
key = $1 FS $2 # remember the new key
line = key # and start building the next line
}
{
line = line FS $3 FS $4 FS $5 # take the value fields from each line
}
END { # and at the very end,
print line # print the last line (that the block above
} # cannot handle)
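For the record, running this on the sample input should print the same joined lines (short groups are not padded here):
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z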
You got good answers in awk. Here is one in perl:
perl -F';' -lane'
$key = join ";", @F[0..1]; # Establish your key
$seen{$key}++ or push @rec, $key; # Remember the order
push @{ $h{$key} }, @F[2..$#F] # Build your data structure
}{
$, = ";"; # Set the output list separator
print $_, @{ $h{$_} } for @rec' file # Print as per order
This is going to seem a lot more complicated than the other answers, but it's adding a few things:
It computes the maximum number of fields over all built-up lines
It appends any missing fields as blanks to the end of the built-up lines
The POSIX awk on a Mac doesn't maintain the order of array elements, even when the keys are numbered, when using the for (key in array) syntax. To maintain the output order, you can keep track of it as I've done or pipe to sort afterwards.
Having matching numbers of fields in the output appears to be a requirement per the specified output. Without knowing what it should be, this awk script loads all the lines first, computes the maximum number of fields for an output line, then outputs the lines, adjusted as needed, in order.
#!/usr/bin/awk -f
BEGIN {FS=OFS=";"}
{
key = $1
# create an order array for the mac's version of awk
if( key != last_key ) {
order[++key_cnt] = key
last_key = key
}
val = a[key]
# build up an output line in array a for the given key
start = (val=="" ? $1 OFS $2 : val)
a[key] = start OFS $3 OFS $4 OFS $5
# count number of fields for each built up output line
nf_a[key] += 3
}
END {
# compute the max number of fields per any built up output line
for(k in nf_a) {
nf_max = (nf_a[k]>nf_max ? nf_a[k] : nf_max)
}
for(i=1; i<=key_cnt; i++) {
key = order[i]
# compute the number of blank flds necessary
nf_pad = nf_max - nf_a[key]
blank_flds = nf_pad!=0 ? sprintf( "%*s", nf_pad, OFS ) : ""
gsub( / /, OFS, blank_flds )
# output lines along with appended blank fields in order
print a[key] blank_flds
}
}
If the desired number of fields in the output lines is known ahead of time, simply appending the blank fields on key switch, without all these arrays, would work and make a simpler script (see the sketch after the output below).
I get the following output:
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
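As for the simpler variant mentioned above, here is a minimal sketch. It assumes the total number of value fields is known up front (15 for this sample, i.e. five 3-field groups), so padding can happen on key switch without any lookahead arrays:
#!/usr/bin/awk -f
BEGIN { FS = OFS = ";"; MAXF = 15 }   # MAXF: known number of value fields
$1 != last {                          # key switch: flush the previous group
    if (NR > 1) print pad(line, nf)
    line = $1 OFS $2; nf = 0; last = $1
}
{ line = line OFS $3 OFS $4 OFS $5; nf += 3 }
END { print pad(line, nf) }
# append empty fields until the line holds MAXF value fields
function pad(s, n) { while (n++ < MAXF) s = s OFS; return s }
It produces the same three lines as above, including the trailing ;;; on the short group.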
I want to use awk to combine values from multiple lines of different lengths into one line when they match. In the following sample, lines match on the value of the first field, and the values from the second field are aggregated into a list.
Input, sample csv:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
How can I write an awk expression (or maybe some other shell expression) to check whether the first field's value matches that of the next/previous line, and then print the aggregated second-field values separated by a pipe?
awk '
BEGIN {FS=";"}
{ if ($1==prev) {sec=sec "|" $2; }
else { if (prev) { print prev ";" sec; };
prev=$1; sec=$2; }}
END { if (prev) { print prev ";" sec; }}'
This, as you requested, checks the consecutive lines.
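On the sample input, which is already grouped by the first field, it prints the requested output in input order:
222;a|b
555;f
4444;a|d|z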
Does this one-liner work?
awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' file
tested here:
kent$ cat a
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a
555;f
4444;a|d|z
222;a|b
If you want to keep it sorted, add a | sort at the end.
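For example, a numeric sort on the first field restores the original key order:
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a | sort -t';' -k1,1n
222;a|b
555;f
4444;a|d|z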
Slightly convoluted, but does the job:
awk -F';' \
'{
if (a[$1]) {
a[$1]=a[$1] "|" $2
} else {
a[$1]=$2
}
}
END {
for (k in a) {
print k ";" a[k]
}
}' file
Assuming that you have set the field separator (-F) to ;:
{
if ( $1 != last ) { print s; s = ""; }
last = $1;
s = s "|" $2;
} END {
print s;
}
The first line and the first character are slightly wrong, but that's an exercise for the reader :-). Two simple ifs suffice to fix that.
(Edit: Missed out last line.)
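For completeness, one way those fixes could look (a sketch; the answer above deliberately leaves them as an exercise):
{
    if ( $1 != last ) {
        if ( NR > 1 ) print s;    # skip the spurious empty first line
        s = $1 ";" $2;            # start the new group with its key
    } else {
        s = s "|" $2;
    }
    last = $1;
} END {
    print s;
}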
This should work:
Command:
awk -F';' '{if(a[$1]){a[$1]=a[$1]"|"$2}else{a[$1]=$2}}END{for (i in a){print i";" a[i] }}' fil
Input:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z