Read line range from a file and find largest value within the range in another file - linux

I'm looking to extract the largest value from a range of line numbers in a file, with the range being read from another file.
Define three files:
position_file: Containing two columns of integers defining a range of line numbers so col1[i] < col2[i]
full_data_file: Containing a single column of numerical data (>=0)
extracted_data_file: Containing for each line in position_file the largest value in full_data_file where the line number in full_data_file falls within the range defined in position_file
cat position_file
1 3
5 7
cat full_data_file
1
4.3
5.2
2.0
0.1
0
4
9
cat extracted_data_file
5.2
4
My current way of doing this is
while read pos1 pos2; do
  awk -v p1="$pos1" -v p2="$pos2" 'BEGIN {max=0} NR>=p1 && NR<=p2 && $1>max {max=$1} END {print max}' < full_data_file >> extracted_data_file
done < position_file
This is not a good way to do it because I repeatedly re-read full_data_file, which is very slow. I'm looking for a way to do this in a single pass. I'm not very accomplished at using arrays in awk, but I imagine the solution will probably (but not necessarily) utilize arrays in awk.
Thank you very much for your help.

You may use this awk:
awk 'FNR==NR {a[FNR]=$1; next}
     {max=a[$1]; for (i=$1+1; i<=$2; i++) if (a[i]>max) max=a[i]; print max}' full_data_file position_file > extracted_data_file
cat extracted_data_file
5.2
4
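For reference, here is the same one-pass program spread out with comments (a sketch; FNR==NR is true only while the first file, full_data_file, is being read):
awk '
  FNR == NR { a[FNR] = $1; next }  # 1st pass: remember each value by its line number
  {                                # 2nd pass: one line of position_file per range
    max = a[$1]
    for (i = $1 + 1; i <= $2; i++)
      if (a[i] > max) max = a[i]
    print max
  }
' full_data_file position_file > extracted_data_file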

Related

Linux SHELL script, read each row for different number of columns

I have a file with, for example, these values in it:
1 value1.1 value1.2
2 value2.1
3 value3.1 value3.2 value3.3
I need to read values from it using a shell script, but the number of columns in each row is different!
I know that if, for example, I want to read the second column I can do it like this (with the row number as an input parameter)
$ awk -v key=1 '$1 == key { print $2 }' input.txt
value1.1
But as I mentioned, the number of columns differs for each row.
How can I make this read dynamic?
For example:
if the input parameter is 1 it means I should read the columns from the first row, so the output should be
value1.1 value1.2
if the input parameter is 2 it means I should read the columns from the second row, so the output should be
value2.1
if the input parameter is 3 it means I should read the columns from the third row, so the output should be
value3.1 value3.2 value3.3
The point is that the number of columns is not static, and I should read the columns from that specific row until the end of the row.
Thank you
Then you can simply say:
awk -v key=1 'NR==key' input.txt
UPDATED
If you want to process the column data, there are several ways.
With awk you can say something like:
awk -v key=3 'NR==key {
  for (i=1; i<=NF; i++)
    printf "column %d = %s\n", i, $i
}' input.txt
which outputs:
column 1 = value3.1
column 2 = value3.2
column 3 = value3.3
In awk you can access each column value by $1, $2, $3 directly, or by $i indirectly, where the variable i holds one of 1, 2, 3.
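For instance, a quick illustration of indirect access:
$ echo "value3.1 value3.2 value3.3" | awk '{i = 2; print $i}'
value3.2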
If you prefer going with bash, try something like:
line=$(awk -v key=3 'NR==key' input.txt)
set -- $line # split into columns
for ((i=1; i<=$#; i++)); do
  echo "column $i = ${!i}"
done
which outputs the same results.
In bash, indirect access is a little more involved: you say ${!i}, where the variable i holds the number of the positional parameter to dereference.
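If you would rather avoid the indirect expansion, here is a sketch using a bash array instead of the positional parameters (same assumed input.txt):
line=$(awk -v key=3 'NR==key' input.txt)
read -ra cols <<< "$line"     # split the line into an array
for i in "${!cols[@]}"; do    # iterate over the (0-based) array indices
  echo "column $((i + 1)) = ${cols[i]}"
done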
Hope this helps.

Fast extraction of lines based on line numbers

I am looking for a fast way to extract lines of a file based on a list of line numbers read from a different file in bash.
Define three files:
position_file: Containing a single column of integers
full_data_file: Containing a single column of data
extracted_data_file: Containing those lines in full_data_file whose line numbers match the integers in position_file
My current way of doing this is
while read position; do
  awk -v pos="$position" 'NR==pos {print; exit}' < full_data_file >> extracted_data_file
done < position_file
The problem is that this is painfully slow and I'm trying to do this for a large number of rather large files. I was hoping someone might be able to suggest a faster way.
Thank you for your help.
The right way to do this with awk:
Input files:
$ head pos.txt data.txt
==> pos.txt <==
2
4
6
8
10
==> data.txt <==
a
b
c
d
e
f
g
h
i
j
$ awk 'NR==FNR{ a[$1]; next } FNR in a' pos.txt data.txt > result.txt
$ cat result.txt
b
d
f
h
j
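The same one-liner, commented (NR==FNR is true only while the first file, pos.txt, is being read):
awk 'NR==FNR { a[$1]; next }   # 1st file (pos.txt): store each wanted line number as a key
     FNR in a                  # 2nd file (data.txt): print lines whose number is a stored key
' pos.txt data.txt > result.txt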

AWK field contains number range

I'm trying to use awk to output lines from a semi-colon (;) delimited text file in which the third field contains a number from a certain range. e.g.
[root@example ~]# cat foo.csv
john doe; lawyer; section 4 stand 356; area 5
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
I want awk to give me all lines in which the third field contains a number between 200 and 300 regardless of the other text in the field.
You may use a regular expression, like this:
awk -F\; '$3 ~ /\y2[0-9][0-9]\y/' foo.csv
Here \y is the GNU awk word-boundary operator, so the pattern matches a standalone number from 200 to 299 in the third field.
A better version that lets you simply pass the boundaries on the command line without changing the regular expression could look like the following.
(Since it is a more complex script, I recommend saving it to a file.)
filter.awk
BEGIN { FS=";" }
{
  # Split the 3rd field by sequences of non-numeric characters
  # and store the pieces in 'a'. 'a' will contain the numbers
  # of the 3rd field (plus optional empty strings if $3 does
  # not start or end with a number).
  split($3, a, "[^0-9]+")
  # Iterate through 'a' and check whether a number is within the range.
  for (i in a) {
    if (a[i] != "" && a[i] >= low && a[i] < high) {
      print
      next
    }
  }
}
Call it like this:
awk -v high=300 -v low=200 -f filter.awk foo.csv
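With the sample foo.csv above, this selects the second and third lines (289 and 210 fall within [200,300); the first line's 4 and 356 do not):
$ awk -v high=300 -v low=200 -f filter.awk foo.csv
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6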
grep alternative:
grep '^[^;]*;[^;]*;[^;]*\b2[0-9][0-9]\b' foo.csv
The output:
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
If 300 should be an inclusive boundary, you may use the following:
grep '^[^;]*;[^;]*;[^;]*\b\(2[0-9][0-9]\|300\)\b' foo.csv

AWK compare two columns in two separate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are separated by tabulators and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Did anybody tried to do a similar thing? :)
Thanks in advance for help!
Your script is fine, but you need to provide each file to awk individually, and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt.gz) <(zcat file1.txt.gz)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed and FNR is the number
of the current record within its file. Thus NR == FNR is only
true when awk is processing the first file given to it (which in our case is file2.txt).
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want to make a nice collection of things. This is a
pithy way to make a collection of all the values we've seen in the 5th column of the
first file. The next statement, which follows, says to immediately fetch the next
available record without evaluating any more statements in the awk program.
Summarizing the above, this line says "If you're reading the first file (file2.txt),
save the value of column 5 in the array called a and move on to the next record without
continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can get past that first line of
the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
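A tiny demonstration of that membership test (only the key's existence matters, not its value):
$ awk 'BEGIN { a["x"]; print ("x" in a), ("y" in a) }'
1 0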
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a

Bash - sum values from an array in one line

I have this array:
array=(1 2 3 4 4 3 4 3)
I can get the largest number with:
echo "num: $(printf "%d\n" ${array[#]} | sort -nr | head -n 1)"
#outputs 4
But I want to sum up all the 4's instead, meaning I want it to output 12 (there are 3 occurrences of 4). Any ideas?
dc <<<"$(printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1) * p"
sort to get max value at end
uniq -c to get only unique values, with a count of how many times they appear
tail to get only the last line (with the max value and its count)
dc to multiply the value by the count
I picked dc for the multiplication step because it's RPN, so you don't have to split up the uniq -c output and insert anything in the middle of it - just add stuff to the end.
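Step by step with the sample array:
$ printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1
      3 4
$ dc <<< "3 4 * p"
12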
Using awk:
$ printf "%d\n" "${array[@]}" | sort -nr | awk 'NR>1 && p!=$0{print x;exit;}{x+=$0;p=$0;}'
12
Using sort, the numbers are sorted (-n) in reverse (-r) order, and awk keeps summing the numbers until it finds one that differs from the previous one.
You can do this with awk:
awk -v RS=" " '{sum[$0]+=$0; if($0>max) max=$0} END{print sum[max]}' <<<"${array[@]}"
Setting RS (record separator) to space allows you to read your array entries as separate records.
sum[$0]+=$0 means sum is a map of cumulative sums for each input value; if($0>max) max=$0 tracks the largest number seen so far; END{print sum[max]} prints the sum accumulated for the largest number. For the sample array, sum[4] ends up 4+4+4=12, which is what gets printed.
<<<"${array[@]}" is a here-string that feeds a string (in this case all elements of the array, joined by spaces) to awk's stdin.
This way there is no piping or looping involved - a single command does all the work.
Using only bash:
nums="${array[*]}"          # join the elements into a single string
echo $(( ${nums// /+} ))
Join the elements into one string, replace all spaces with plus signs, and evaluate the result in a double-parentheses arithmetic expression. Note that ${array// /+} on the bare array name would expand only element 0, hence the intermediate variable, and that this sums all elements (24 for the sample array), not just the occurrences of the largest value.
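If you want to sum only the occurrences of the maximum in pure bash, here is a sketch (assumes integer values, as in the sample array):
max=$(printf '%s\n' "${array[@]}" | sort -n | tail -n 1)
sum=0
for v in "${array[@]}"; do
  (( v == max )) && (( sum += v ))   # add only the values equal to the maximum
done
echo "$sum"   # prints 12 for the sample array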
