How can I make awk match lines in file 1 with lines in file 2 based on number ranges in file 2? (Linux)

I have the following two files:
file 1:
22
2
42
32
file 2:
1 10 valuea
11 20 valueb
21 30 valuec
31 40 valued
41 50 valuee
51 60 valuef
How can I make awk take each value from file 1, find the line in file 2 whose range in columns 1 and 2 contains it, and then print column 3 of that matching line? The output would resemble the following:
valuec
valuea
valuee
valued
I tried using the following AWK command (based on what I found in this post: How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix?), but it does not seem to be working correctly.
#!/bin/bash
awk 'FNR == NR { val[$1] = $1 }
FNR != NR { if (val[$1] >= $1 && val[$1] <= $2)
print $3
}' file1 file2
Also, I did not include them here for obvious reasons, but in the actual application of this script, file 1 would contain around 7,000 entries and file 2 around 68,000 entries.

An alternative awk script:
$ awk 'FNR == NR {a[$1]=$2; v[$1]=$3; next}
{for(k in a)
if(k+0<=$1 && $1+0<=a[k]) print v[k]}' file2 file1
valuec
valuea
valuee
valued
Note that file2 is given as the first file. This also handles values that fall into more than one range. The +0 forces a numeric comparison.
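With 7,000 values and 68,000 ranges, the loop above tests every range for every value. A minimal sketch of a faster lookup follows. It assumes the ranges in file2 are sorted by their first column and do not overlap (true of the sample data, but not stated for the real files), so it can use a binary search. The array names lo, hi and val are illustrative.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 22 2 42 32 > "$tmp/file1"
printf '%s\n' '1 10 valuea' '11 20 valueb' '21 30 valuec' \
              '31 40 valued' '41 50 valuee' '51 60 valuef' > "$tmp/file2"

# ASSUMPTION: file2 is sorted by column 1 and its ranges do not overlap.
result=$(awk '
# First file (file2): store the ranges in arrays indexed 1..n.
FNR == NR { n++; lo[n] = $1 + 0; hi[n] = $2 + 0; val[n] = $3; next }
# Second file (file1): binary search for the range containing the value.
{
    x = $1 + 0; l = 1; r = n
    while (l <= r) {
        m = int((l + r) / 2)
        if (x < lo[m])      r = m - 1
        else if (x > hi[m]) l = m + 1
        else { print val[m]; break }
    }
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike the loop over all ranges, this prints at most one match per value, so it is not a drop-in replacement if the ranges can overlap.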

Related

Awk script to concatenate two column and look for concatenated values in another file

I need your help solving this puzzle. Any kind of help is appreciated, and links to documents for learning how to deal with such scenarios would be helpful.
Concatenate column1 and column2 of File1. Then check for the concatenated value in column1 of File2. If found, extract the corresponding values of column2 and column3 of File2. Again, concatenate column1 and column2 of File2. Now look for this concatenated value in File1 and if found
For example: concatenate column1 (262881626) and column2 (10) of File1. Then look for this concatenated value (26288162610) in column1 of File2 and extract the corresponding column2 and column3 values of File2.
Now concatenate column1 and column2 of File2 again and look for this concatenated value (2628816261050) in File1. Multiply the exchange rate (2) fetched via the concatenated value (26288162610) by the taxable value (65) that corresponds to 2628816261050 in File1. Store the result of the multiplication in column4 (AD) of File1 only.
File1
Bill Doc LineNo Taxablevalue AD
262881626 10 245
262881627 10 32
262881628 20 456
262881629 30 0
262881630 40 45
2628816261050 11 65
2628816271060 12 34
2628816282070 13 45
2628816293080 14 0
2628816304090 15
File2
Bill.Doc Item Exch.Rate
26288162610 50 2
26288162710 60 1
26288162820 70 45
26288162930 80 1
26288163040 90 5
Output File
Bill Doc LineNo Taxablevalue AD
262881626 10 245
262881627 10 32
262881628 20 456
262881629 30 0
262881630 40
2628816261050 11 65 130
2628816271060 12 34 34
2628816282070 13 45 180
2628816293080 14 0 0
2628816304090 15
Though your expected output is not clear, could you please try the following and let me know if this helps you?
awk -F"|" 'FNR==NR{a[$1$2]=$NF;next} {print $0,$1 in a?"|" a[$1]*$NF:""}' OFS="" File2 File1
Explanation:
awk -F"|" ' ##Setting the field separator to | (pipe) here.
FNR==NR{ ##Checking the condition FNR==NR, which is TRUE only while the first file, File2, is being read.
a[$1$2]=$NF; ##Creating an array named a whose index is $1$2 (first and second fields of the current line) and whose value is the last field.
next} ##next skips all further statements for this line.
{ ##Statements from here on are executed only while the 2nd Input_file, File1, is being read.
print $0,$1 in a?"|" a[$1]*$NF:"" ##Printing $0 (current line), then checking whether $1 of the current line is present in array a; if yes, print its value * $NF, else print an empty string.
}
' OFS="" File2 File1 ##Setting OFS to an empty string here and mentioning both Input_file names here.
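As a minimal sketch of how that lookup behaves, the following runs the same idea on three rows taken from the question. It assumes the real files are pipe-delimited, as the -F"|" in the answer implies; the sample above is shown space-separated, so that is an assumption. The array name rate is illustrative.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# ASSUMPTION: pipe-delimited input, headers omitted for brevity.
printf '%s\n' '26288162610|50|2' '26288162710|60|1' > "$tmp/File2"
printf '%s\n' '262881626|10|245' '2628816261050|11|65' \
              '2628816271060|12|34' > "$tmp/File1"

result=$(awk -F'|' '
# File2: key is column1 followed by column2, value is the exchange rate.
FNR == NR { rate[$1 $2] = $NF; next }
# File1: append rate * taxable value when column1 matches a stored key.
{ print $0 (($1 in rate) ? "|" rate[$1] * $NF : "") }
' "$tmp/File2" "$tmp/File1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The keys are compared as strings, so the long concatenated numbers are never converted to floating point.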

How to divide a column by the corresponding value in another file?

I have multiple files (66) and want to divide column 3 of each file by its corresponding value in info.file and insert the new value in column 4 of each file.
My manual code is:
awk '{print $4=$3/NUmber from info.file}1' file
But doing this for each individual file takes me hours, so I want to automate it for all files. Thanks
file1:
chrm name value
4 a 8
3 b 4
file2:
chrm name value
3 g 6
5 s 12
info.file:
file_name average
file1 8
file2 6
file3 10
output:
file1:
chrm name value new_value
4 a 8 1
3 b 4 0.5
file2:
chrm name value new_value
3 g 6 1
5 s 12 2
Without error handling:
$ awk 'NR==FNR {a[$1]=$2; next}
FNR==1 {out=FILENAME".new"; print $0, "new_value" > out; next}
{v=$NF/a[FILENAME]; $++NF=v; print > out}' info file1 file2
This will generate the updated files:
$ head file{1,2}.new | column -t
==> file1.new <==
chrm name value new_value
4 a 8 1
3 b 4 0.5
==> file2.new <==
chrm name value new_value
3 g 6 1
5 s 12 2
Explanation
NR==FNR {a[$1]=$2; next}: scan the first file and save the file/value pairs in the associative array
FNR==1: on the header line of each data file
out=FILENAME".new": set an output filename
print $0, "new_value" > out: print the existing header with the new column name appended
v=$NF/a[FILENAME]: for every data line, scale the last field and assign the result to v
$++NF=v: increment the number of fields and assign the newly computed value to the last field
print > out: print the new line to the output file set before
info file1 file2: the list of data files must be preceded by the info file (info.file in the question)
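The answer above is explicitly "without error handling". A minimal sketch of one way to add some is below. It skips the header line of info.file, skips any data file that has no entry or a zero average, and closes each output file before opening the next. The file name file9 and the variable names avg and ok are hypothetical, introduced only for this illustration.

```shell
#!/bin/sh
tmp=$(mktemp -d)
(
    cd "$tmp" || exit 1
    printf '%s\n' 'file_name average' 'file1 8' 'file2 6' 'file3 10' > info.file
    printf '%s\n' 'chrm name value' '4 a 8' '3 b 4'  > file1
    printf '%s\n' 'chrm name value' '3 g 6' '5 s 12' > file2
    printf '%s\n' 'chrm name value' '1 z 5' > file9   # hypothetical: not in info.file

    awk '
    # info.file: remember the average per file name, ignoring the header.
    NR == FNR { if (FNR > 1) avg[$1] = $2; next }
    # Header line of each data file: decide whether it can be processed.
    FNR == 1 {
        if (out != "") close(out)
        ok = (FILENAME in avg) && (avg[FILENAME] + 0 != 0)
        if (!ok) { print "skipping " FILENAME > "/dev/stderr"; next }
        out = FILENAME ".new"
        print $0, "new_value" > out
        next
    }
    # Data lines: append column 3 divided by the average for this file.
    ok { print $0, $3 / avg[FILENAME] > out }
    ' info.file file1 file2 file9
)
result=$(cat "$tmp/file1.new" "$tmp/file2.new")
if [ -e "$tmp/file9.new" ]; then skipped=no; else skipped=yes; fi
printf '%s\n' "$result"
rm -rf "$tmp"
```

For the 66 real files, the data file list could be replaced by a shell glob, provided the names match the first column of info.file exactly.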
I have prepared the following nested awk command for you:
awk 'NR>1{system("awk -v div="$2" -f div_column3.awk "$1" | column -t > new_"$1);}' info.file
with div_column3.awk being an awk script file with the following content:
$ cat div_column3.awk
NR==1{print $0" new_value"}NR>1{print $0" "$3/div}

Bash column sum over a table of variable length

I'm trying to get the column sums (except for the first column) of a tab-delimited file containing numbers.
To find out the number of columns and store it in a variable I use:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1
Next, I want to calculate the sum of all numbers in each column in a for loop and store it in a variable again:
for c in 2:$col
do
num=$(cat file.txt | awk '{sum+$2 ; print $0} END{print sum}'| tail -n 1
done
this
num=$(cat file.txt | awk '{sum+$($c) ; print $0} END{print sum}'| tail -n 1
on its own, with a fixed number and without variable input, works fine, but I cannot get it to accept the for-loop variable.
Thanks for the support
P.S. It would also be fine if I could sum all columns (except the first one) at once, without the loop trouble.
Assuming you want the sums of the individual columns,
$ cat file
1 2 3 4
5 6 7 8
9 10 11 12
$ awk '
{for (i=2; i<=NF; i++) sum[i] += $i; if (NF > n) n = NF}  # n = widest row seen
END {for (i=2; i<=n; i++) printf "%d%s", sum[i], (i<n ? OFS : ORS)}
' file
18 21 24
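If the per-column loop from the question is still wanted, the usual way to hand a shell variable to awk is -v. The following is a minimal sketch under that approach, using the sample table above written out tab-delimited. The variable names cols, c, num and sums are illustrative.

```shell
#!/bin/sh
tmp=$(mktemp)
printf '1\t2\t3\t4\n5\t6\t7\t8\n9\t10\t11\t12\n' > "$tmp"

# Widest row determines the number of columns.
cols=$(awk -F'\t' 'NF > max { max = NF } END { print max }' "$tmp")

sums=""
c=2
while [ "$c" -le "$cols" ]; do
    # -v c="$c" copies the shell loop variable into the awk variable c,
    # so $c inside the awk program means "field number c".
    num=$(awk -F'\t' -v c="$c" '{ sum += $c } END { print sum + 0 }' "$tmp")
    sums="$sums${sums:+ }$num"
    c=$((c + 1))
done

printf '%s\n' "$sums"
rm -f "$tmp"
```

This reads the file once per column, so the single-pass array version is the better choice for large tables.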
In case you're not bound to awk, there's a nice tool for "command-line statistical operations" on textual files called GNU datamash.
With datamash, summing (probably the simplest operation of all) a 2nd column is as easy as:
$ datamash sum 2 < table
9
Assuming the table file holds tab-separated data like:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
To sum all columns from 2 to n use column ranges (available in datamash 1.2):
$ n=4
$ datamash sum 2-$n < table
9 12 15
To include headers, see the --header-out option.

Select rows in one file based on specific values in the second file (Linux)

I have two files:
One is "total.txt". It has two columns: the first column is natural numbers (indicator) ranging from 1 to 20, the second column contains random numbers.
1 321
1 423
1 2342
1 7542
2 789
2 809
2 5332
2 6762
2 8976
3 42
3 545
... ...
20 432
20 758
The other one is "index.txt". It has three columns:(1.indicator, 2:low value, 3: high value)
1 400 5000
2 600 800
11 300 4000
I want to output the rows of "total.txt" whose first column matches the first column of "index.txt". At the same time, the second column of each output row must be larger than (>) the second column of "index.txt" and smaller than (<) the third column of "index.txt".
The expected result is as follows:
1 423
1 2342
2 809
2 5332
2 6762
11 ...
11 ...
I have tried this:
awk '$1==(awk 'print($1)' index.txt) && $2 > (awk 'print($2)' index.txt) && $1 < (awk 'print($2)' index.txt)' total.txt > result.txt
But it failed!
Can you help me with this? Thank you!
You need to read both files in the same awk script. When you read index.txt, store the other columns in an array.
awk 'FNR == NR { low[$1] = $2; high[$1] = $3; next }
($1 in low) && $2 > low[$1] && $2 < high[$1] { print }' index.txt total.txt
FNR == NR is the common awk idiom to detect when you're processing the first file.
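A minimal, self-contained run of this approach on part of the sample data is sketched below. The `$1 in low` test is an added guard so that indicators missing from index.txt are never printed. Note that with the bounds exactly as written in index.txt, indicator 2 only admits values strictly between 600 and 800.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' '1 321' '1 423' '1 2342' '1 7542' '2 789' '2 809' \
              '2 5332' '3 42' '3 545' > "$tmp/total.txt"
printf '%s\n' '1 400 5000' '2 600 800' > "$tmp/index.txt"

result=$(awk '
# index.txt: remember the open interval (low, high) for each indicator.
FNR == NR { low[$1] = $2; high[$1] = $3; next }
# total.txt: print rows whose indicator is known and whose value is inside.
($1 in low) && $2 > low[$1] && $2 < high[$1]
' "$tmp/index.txt" "$tmp/total.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```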
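Use join like Barmar said:
# To join on the first columns
join -1 1 -2 1 total.txt index.txt
And if the files aren't sorted in lexical order by the first column, then:
join -1 1 -2 1 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt)
Note that join only pairs each row of total.txt with its low and high values; the range test on the second column still has to be applied to the joined output.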

Find the maximum values in 2nd column for each distinct values in 1st column using Linux

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
What I am looking for is the maximum value in the 2nd column for each of the values 1, 2, 3, 4, 5 in the 1st column.
ofile.dat
1 40
2 20
3 34
4 50
5 3
I found that someone has done this using another program, e.g. Get the maximum values of column B per each distinct value of column A
awk seems a prime candidate for this task. Simply traverse your input file and keep an array indexed by the first-column values, storing the value of column 2 whenever it is larger than the currently stored value. At the end of the traversal, iterate over the array to print the indices and their corresponding values:
awk '{
if (a[$1] < $2) {
a[$1]=$2
}
} END {
for (i in a) {
print i, a[i]
}
}' ifile.dat
The result is not guaranteed to be sorted numerically on the first column, because the order of a for (i in a) loop is unspecified, but that is easily fixable if required.
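One way to get sorted output, sketched here as an assumption about what "easily fixable" could look like, is to pipe the awk result through sort. This variant also seeds each maximum with the first value seen for a key, so it would work for negative numbers too. The array name max is illustrative.

```shell
#!/bin/sh
tmp=$(mktemp)
printf '%s\n' '1 10' '3 34' '1 4' '3 32' '5 3' '2 2' \
              '4 20' '3 13' '4 50' '1 40' '2 20' > "$tmp"

result=$(awk '
# Store the value if the key is new or the value beats the stored maximum.
!($1 in max) || $2 > max[$1] { max[$1] = $2 }
END { for (k in max) print k, max[k] }
' "$tmp" | sort -k1,1n)   # numeric sort on the first column

printf '%s\n' "$result"
rm -f "$tmp"
```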
Another way is to use sort.
First sort numerically on column 2 in decreasing order, then keep only one line per distinct value of column 1. As a one-liner:
sort -n -r -k 2 ifile.dat| sort -u -n -k 1
The easiest command to find the overall maximum value in the second column (not per group) is something like this:
sort -nrk2 data.txt | awk 'NR==1{print $2}'
When doing min/max calculations, always seed the min/max variable using the first value read:
$ cat tst.awk
!($1 in max) || $2>max[$1] { max[$1] = $2 }
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for (key in max) {
print key, max[key]
}
}
$ awk -f tst.awk file
1 40
2 20
3 34
4 50
5 3
The above uses GNU awk 4.* for PROCINFO["sorted_in"] to control output order, see http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal.
If the values in your 1st field are consecutive integers starting from 1, you could try one more awk solution:
awk '{if(!($1 in a) || $2>a[$1]){a[$1]=$2}} END{for(j=1;j<=length(a);j++){if(j in a){print j,a[j]}}}' Input_file
Here is one more way to do the same:
sort -k1 Input_file | awk 'prev != $1 && prev{print prev, val;val=prev=""} {val=val>$2?val:$2;prev=$1} END{print prev,val}'
