Gawk not filtering out larger number? - linux

My simple gawk filter used in my program is not filtering out a value that is a digit longer than the rest.
Here is my text file:
172 East Fourth Street Toronto 4 1890 1500000 6
2213 Mt. Vernon Avenue Vaughn 2 890 500000 4
One Lincoln Plaza Toronto 2 980 900000 1
The columns are separated by tabs.
My gawk script:
echo "Enter max price"
read price
gawk -F "\t+" '$5 <= "'$price'"' file
The 1500000 value appears if I enter a value of 150001 or greater. I think it has to do with gawk not reading the last digit correctly. I'm not permitted to change the original text file and I need to use the gawk command. Any help is appreciated!

Your awk command performs lexical comparison rather than numerical comparison, because the RHS - the price value - is enclosed in double-quotes.
Removing the double-quotes would help, but it's advisable to reformulate the command as follows:
gawk -F '\t+' -v price="$price" '$5 <= price' file
The shell variable $price is now passed to Awk using -v, as Awk variable price, which is the safe way to pass values to awk - you can then use a single-quoted awk script without having to splice in shell variables or having to worry about which parts may be expanded by the shell up front.
Afterthought: As Ed Morton points out in a comment, to ensure that a field or variable is treated as a number, append +0 to it; e.g., $5 <= price+0 (conversely, append "" to force treatment as a string).
By default, Awk infers from the values involved and the context whether to interpret a given value as a string or a number - which may not always give the desired result.
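For instance, here is a minimal sketch (using the value from the question) showing the difference between string and numeric comparison:
# String comparison: "1500000" <= "150001" is true, because the comparison
# stops at the first differing character and '0' sorts before '1'.
echo '1500000' | gawk '{ print ($1 <= "150001") }'    # prints 1
# Appending +0 forces a numeric comparison, which gives the expected result.
echo '1500000' | gawk '{ print ($1+0 <= 150001) }'    # prints 0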

You're really calling a separate gawk for each column? One will do:
gawk -F "\t+" -v OFS="\t" \
-v city="$city" \
-v bedrooms="$bedrooms" \
-v space="$space" \
-v price="$price" \
-v weeks="$weeks" '
$2 == city && $3 >= bedrooms && $4 >= space && $5 <= price && $6 <= weeks {
    $1 = $1; print
}
' listing |
sort -t $'\t' $sortby $ordering |
column -s $'\t' -t
(This is not an answer, just a comment that needs formatting)
The $1=$1 bit is an awk trick to make it rewrite the current record using the Output Field Separator, a single tab. It saves you a call to tr.
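To illustrate the effect in isolation, here is a minimal sketch with made-up input (mixed spaces and tabs in, single tabs out):
# Assigning to $1 forces awk to rebuild the record, joining the fields
# with OFS (a tab) instead of whatever separators the input used.
printf 'a b\tc\n' | gawk -F '[ \t]+' -v OFS='\t' '{ $1 = $1; print }'    # prints a<TAB>b<TAB>c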

Related

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (there are thousands of columns and they are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this, but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME"
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
With your input file and the implied conditions, this should work:
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
Or you can hard-code the values of c1 and c2 in the script instead.
To find the column indices, you can run:
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
or perhaps with this chain
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although this may produce false positives, since grep matches the names as regular expressions anywhere in the line...
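One way to reduce those false positives is to restrict the match to whole words with -w (a sketch; a header such as bad or add2 would then no longer match):
$ sed 1q file | tr -s ' ' \\n | nl | grep -wE 'ad|bd'
2 ad
3 bd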
You can rewrite the awk to be more succinct
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
NR==1 {
    for (i=1; i<=NF; i++) {
        f[$i] = i
    }
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated because of the tr '\t' in the command in your question, which looks like an attempt to convert tabs to newlines so you could map column names to numbers. If I'm wrong and the fields are just separated by runs of whitespace, then remove -F'\t' from the above.
Use the Miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (the delimiter is specified using --tsv) and writes the results to stdout together with the header. The header is then removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
For years it bugged me that there is no succinct way in Unix to do this sort of thing, although miller is a pretty good tool for it. Recently I wrote pick to choose columns by name, and additionally to modify, combine and add them by name, as well as to filter rows by clauses using column names. The solution to the above with pick is:
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns, -h is to omit it. To print columns you simply name them on the command line, e.g.
pick ad bd < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

find records longer/shorter than a particular col

this is my file: FILEABC.txt
Name|address|age|country
john|london|12|UK
adam|newyork|39|US|X12|123
jake|madrid|45|ESP
ram|delhi
joh|cal|34|US|788
I wanted to find the header count in the file, so I have this command:
cat FILEABC.txt | awk --field-separator='|' '{print NF}' | sort -n |uniq -c
The result I get for this command is:
cat FILEABC.txt | awk --field-separator='|' '{print NF}' | sort -n |uniq -c
1 2
3 4
1 5
1 6
My requirement is: how do I find those records that have only 2 fields, 4 fields, and so on, in my file?
For example,
if I want to see the records having only 2 columns:
ram|delhi
if I want to see records having more than 4 columns:
adam|newyork|39|US|X12|123
If you want to print only the records which have 2 fields, then the following may help:
awk -F"|" 'NF==2' Input_file
If you need lines which have more than 4 fields, change the above condition to NF>4; if you need lines which have more than 5 fields, use NF>5, and so on.
Explanation: -F"|" makes sure the field separator is a pipe. NF is a built-in awk variable that holds the TOTAL number of fields in a line, so the condition checks whether the number of fields equals 2. awk works on the pattern { action } model: when the condition is TRUE and no action is written, the default action, print, is applied to the current line.
In awk, the variable NF gives the total number of fields in a record/row. By default awk uses whitespace as the field separator; if you alter FS, NF is calculated based on the field separator you specify. So what you can do is:
awk -v FS='|' 'NF==2' infile
Which is same as
# Usual Syntax : awk 'condition { action }' infile
awk -v FS='|' 'NF==2{ print }' infile
For more than 4 fields,
awk -v FS='|' 'NF > 4' infile
You can also use grep to filter 2-column records:
grep '^[^|]*|[^|]*$' FILEABC.txt
It will output:
ram|delhi

AWK field contains number range

I'm trying to use awk to output lines from a semi-colon (;) delimited text file in which the third field contains a number from a certain range. e.g.
[root@example ~]# cat foo.csv
john doe; lawyer; section 4 stand 356; area 5
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
I want awk to give me all lines in which the third field contains a number between 200 and 300 regardless of the other text in the field.
You may use a regular expression, like this:
awk -F\; '$3 ~ /\y2[0-9][0-9]\y/' a.csv
A better version that allows you to simply pass the boundaries at the command line without changing the regular expression could look like the following:
(Since it is a more complex script I recommend to save it to a file)
filter.awk
BEGIN { FS=";" }
{
    # Split the 3rd field by sequences of non-numeric characters
    # and store the pieces in 'a'. 'a' will contain the numbers
    # of the 3rd field (plus optional empty strings if $3 does
    # not start or end with a number).
    split($3, a, "[^0-9]+")
    # Iterate through 'a' and check whether a number is within the range.
    for (i in a) {
        if (a[i] != "" && a[i] >= low && a[i] < high) {
            print
            next
        }
    }
}
Call it like this:
awk -v high=300 -v low=200 -f filter.awk a.csv
grep alternative:
grep '^[^;]*;[^;]*;[^;]*\b2[0-9][0-9]\b' foo.csv
The output:
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
If 300 should be inclusive boundary you may use the following:
grep '^[^;]*;[^;]*;[^;]*\b\(2[0-9][0-9]\|300\)\b' foo.csv

Bash- sum values from an array in one line

I have this array:
array=(1 2 3 4 4 3 4 3)
I can get the largest number with:
echo "num: $(printf "%d\n" ${array[#]} | sort -nr | head -n 1)"
#outputs 4
But I want to get all 4's and sum them up, meaning I want it to output 12 (there are 3 occurrences of 4) instead. Any ideas?
dc <<<"$(printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1) * p"
sort to get max value at end
uniq -c to get only unique values, with a count of how many times they appear
tail to get only the last line (with the max value and its count)
dc to multiply the value by the count
I picked dc for the multiplication step because it's RPN, so you don't have to split up the uniq -c output and insert anything in the middle of it - just add stuff to the end.
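With the array from the question, the intermediate steps look roughly like this (the exact spacing of the uniq -c output may vary):
$ printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1
      3 4
$ dc <<<"3 4 * p"
12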
Using awk:
$ printf "%d\n" "${array[#]}" | sort -nr | awk 'NR>1 && p!=$0{print x;exit;}{x+=$0;p=$0;}'
12
Using sort, the numbers are sorted (-n) in reverse (-r) order, and the awk keeps summing the numbers until it finds a number that differs from the previous one.
You can do this with awk:
awk -v RS=" " '{sum[$0]+=$0; if($0>max) max=$0} END{print sum[max]}' <<<"${array[@]}"
Setting RS (record separator) to space allows you to read your array entries as separate records.
sum[$0]+=$0 means sum is a map of cumulative sums for each input value; if($0>max) max=$0 tracks the largest number seen so far; END{print sum[max]} prints the sum for the largest number at the end.
<<<"${array[#]}" is a here-document that allows you to feed a string (in this case all elements of the array) as stdin into awk.
This way there is no piping or looping involved - a single command does all the work.
Using only bash:
s="${array[*]}"; echo $((${s// /+}))
Join the array elements into one string (stored here in the helper variable s), replace all spaces with plus, and evaluate using an arithmetic (double-parentheses) expansion.
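To see the string that the arithmetic expansion actually receives (s is just the helper variable defined above):
$ echo "${s// /+}"
1+2+3+4+4+3+4+3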

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab-delimited file?
I tried this to cut the first n columns, but I have no idea how to combine the first and last n columns:
cut -f 1-10 -d "<CTR>v <TAB>" filename
Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
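If the number of trailing fields you want varies, a generalized sketch of the same idea looks like this (n is an illustrative variable; each line is assumed to have at least n fields):
$ printf '/a/b/c/d\n/a/b/c\n' | awk -F/ -v n=2 '{ out = $(NF-n+1); for (i = NF-n+2; i <= NF; i++) out = out FS $i; print out }'
c/d
b/c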
You can also do this with cut, using -d for the delimiter and -f for the fields; $'\t' passes a literal tab as the delimiter:
cut -d$'\t' -f 1-3,7-
To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this using kurumi's answer which won't leave extra spaces, but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.
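For example, applied to a made-up 7-field line while skipping the first 2 and the last 3 fields:
$ echo 'a b c d e f g' | awk -v skipstart=2 -v skipend=3 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}'
c d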
You can use Bash for that:
while read -a cols; do echo "${cols[@]:0:1}" "${cols[@]: -1}"; done < file.txt
You can use awk; for example, to cut off the 1st, 2nd, and last 3 columns:
awk '{for(i=3;i<=NF-3;i++) printf "%s ", $i; print ""}' file
If you have a programming language such as Ruby (1.9+):
$ ruby -F"\t" -ane 'puts $F[2..-3].join("\t")' file
Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""
You can also cut by byte positions:
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
Note that -b selects byte positions rather than delimited fields, so it is only useful when the columns have fixed widths; for a tab-delimited file, -f (whose default delimiter is the tab) is the option to use.
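For instance, on fixed-width input (made-up data, just to illustrate the byte ranges):
$ printf 'AAAABBBBCCCC\n' | cut -b 1-4,9-12
AAAACCCC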
