AWK field contains number range - linux

I'm trying to use awk to output lines from a semicolon (;) delimited text file in which the third field contains a number from a certain range, e.g.
[root@example ~]# cat foo.csv
john doe; lawyer; section 4 stand 356; area 5
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
I want awk to give me all lines in which the third field contains a number between 200 and 300 regardless of the other text in the field.

You may use a regular expression, like this:
awk -F\; '$3 ~ /\y2[0-9][0-9]\y/' a.csv
(Note that \y is GNU awk's word-boundary operator, so this one-liner requires gawk.)
A better version lets you simply pass the boundaries on the command line without changing the regular expression:
(Since it is a more complex script, I recommend saving it to a file.)
filter.awk
BEGIN { FS=";" }
{
    # Split the 3rd field on sequences of non-numeric characters
    # and store the pieces in 'a'. 'a' will contain the numbers
    # of the 3rd field (plus optional empty strings if $3 does
    # not start or end with a number).
    split($3, a, "[^0-9]+")

    # Iterate through 'a' and check whether a number is within the range
    for (i in a) {
        if (a[i] != "" && a[i] >= low && a[i] < high) {
            print
            next
        }
    }
}
Call it like this:
awk -v high=300 -v low=200 -f filter.awk a.csv
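As a quick end-to-end check, the script can be exercised against the sample data from the question (file names as used above):

```shell
# Recreate the sample input from the question
cat > a.csv <<'EOF'
john doe; lawyer; section 4 stand 356; area 5
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
EOF

# The filter script: keep lines whose 3rd field contains a number in [low, high)
cat > filter.awk <<'EOF'
BEGIN { FS=";" }
{
    split($3, a, "[^0-9]+")
    for (i in a) {
        if (a[i] != "" && a[i] >= low && a[i] < high) {
            print
            next
        }
    }
}
EOF

awk -v high=300 -v low=200 -f filter.awk a.csv
```

Only the lines containing 289 and 210 in the third field are printed; 356 falls outside the range.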

grep alternative:
grep '^[^;]*;[^;]*;[^;]*\b2[0-9][0-9]\b' foo.csv
The output:
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
If 300 should be inclusive boundary you may use the following:
grep '^[^;]*;[^;]*;[^;]*\b\(2[0-9][0-9]\|300\)\b' foo.csv
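To see the inclusive variant in action (GNU grep, since \b and the \| alternation in basic regular expressions are GNU extensions), here is a run against a sample modified to contain an exact 300:

```shell
cat > foo.csv <<'EOF'
john doe; lawyer; section 4 stand 300; area 5
chris thomas; carpenter; stand 289 section 2; area 5
EOF

# Three non-semicolon runs pin the match inside the third field;
# \b...\b anchors the number on word boundaries
grep '^[^;]*;[^;]*;[^;]*\b\(2[0-9][0-9]\|300\)\b' foo.csv
```

Both lines match: 289 is in 200-299 and 300 is covered by the added alternative.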

Related

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (there are thousands, and they are unnumbered), so I need to use the column names instead. I have searched and tried multiple approaches, but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
With your input file and the implied conditions, this should work:
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
or you can replace c1 and c2 with the values in the script as well.
to find the column indices you can run
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
or perhaps with this chain
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although it may produce false positives due to the regex match...
You can rewrite the awk to be more succinct
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
NR==1 {
for (i=1; i<=NF; i++) {
f[$i] = i
}
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated because of the tr '\t' in the command in your question, which looks like an attempt to convert tabs to newlines to map column names to numbers. If I'm wrong and the fields are just separated by runs of white space, remove -F'\t' from the above.
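Putting that answer together with the sample from the question (here written with literal tabs; the real file may differ):

```shell
# Tab-separated version of the question's sample: header row plus six data rows
printf 'ID\tad\tbd\n1\ta\tfire\n2\tb\tair\n3\tc\twater\n4\tc\twater\n5\td\twater\n6\tc\tearth\n' > file

# First record: map each header name to its column index in f[].
# Remaining records: look fields up by name instead of by number.
awk -F'\t' '
NR==1 {
    for (i=1; i<=NF; i++) {
        f[$i] = i
    }
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
END { print cnt+0 }
' file
```

Rows 3 and 4 satisfy ad=c and bd=water, so this prints 2, matching the expected output.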
Use miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (delimiter is specified using --tsv) and writes the results to STDOUT together with the header. The header is removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
For years it bugged me that there was no succinct way in Unix to do this sort of thing, although miller is a pretty good tool for it. Recently I wrote pick to choose columns by name, and additionally to modify, combine and add them by name, as well as to filter rows by clauses using column names. The solution to the above with pick is
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns, -h is to omit it. To print columns you simply name them on the command line, e.g.
pick ad water < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

How to replace some cells number of .csv file if specific lines found in Linux

Lets say I have the following file.csv file content
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0.1", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0.4","no"
I want to search all lines that have APPLE and 201, and then replace the column 5 values to 0. So, my output would look like
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0","no"
I can do grep search
grep "APPLE" file.csv | grep 201
to find out the lines. But could not figure out how to modify column 5 values of these lines in the original file.
You can use awk for this:
awk 'BEGIN { FS=OFS="," } $2=="\"APPLE\"" { for (i=1;i<=NF;i++) { if ($i=="\"201\"") { $5="\"0\"" } } } 1' file.csv
Set the input and output field separators to , and then, when the second field is equal to APPLE in quotes, loop through the fields and check whether any of them is equal to 201 in quotes. If so, replace the 5th field with 0 in quotes; assigning to $5 makes awk rebuild the record using OFS, which is why OFS must also be a comma. The short-hand 1 prints each line, changed or otherwise.
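A simplified sketch of the same idea, assuming (as in the sample) that the 201 always sits in the fourth column, so the field loop can be replaced by a direct test:

```shell
# Recreate the sample from the question (note the stray space before "no" on line 2)
cat > file.csv <<'EOF'
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0.1", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0.4","no"
EOF

# Rows where field 2 is "APPLE" and field 4 is "201" get field 5 set to "0";
# OFS="," keeps the comma delimiter when awk rebuilds the modified records
awk 'BEGIN { FS=OFS="," }
     $2 == "\"APPLE\"" && $4 == "\"201\"" { $5 = "\"0\"" }
     1' file.csv
```

Untouched rows are passed through byte-for-byte, because awk only rebuilds a record when one of its fields is assigned.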

How do I use grep to get numbers larger than 50 from a txt file

I am relatively new to grep and unix. I am trying to get the names of people who have won more than 50 races from a txt file. So far the code I have used is, cat file.txt|grep -E "[5-9][0-9]$" but this is only giving me numbers from 50-99. How could I get it from 50-200. Thank you!!
driver    races  wins
Some_Man  90     160
Some_Man  10     80
the above is similar to the format of the data, although it is not tabulated.
Do you have to use grep? You could use awk like this:
awk '{ if ($N > 50) print $2 }' file.txt
where $N is replaced with the field number of the wins column, assuming your fields are delimited by spaces; otherwise you can use the -F flag to specify the delimiter.
if you must use grep, then it's a regular expression like the one you wrote, but your pattern stops at 199. To cover 50 to 200 inclusive you can do:
grep -E '(\b[5-9][0-9]|\b1[0-9][0-9]|\b200)$' file.txt
Input:
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
Awk would be a better candidate for this:
awk '$4>=50 && $4<=200 { print $0 }' file
Check whether the fourth space-delimited field ($4; change this to whatever field number it actually is) is both greater than or equal to 50 and less than or equal to 200, and print the line ($0) if the condition is met.
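Against the tabulated input shown above, the wins are actually in field 5, so the same pattern runs like this (NR>1 skips the header row; the file name is illustrative):

```shell
cat > file <<'EOF'
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
EOF

# Print every data line whose Wins value (field 5) is between 50 and 200
awk 'NR>1 && $5>=50 && $5<=200' file
```

Only Hamilton (94) and Vettel (53) fall in the range, so those two lines are printed.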

grep to search data in first column

I have a text file with two columns.
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
I want to search for product starts from "Abc" and ends with "def" then for those entries I want to add Cost.
I have used :
grep "^Abc|def$" myfile
but it is not working
Use awk. cat myfile | awk '{print $1}' | grep query
If you can use awk, try this:
text.txt
--------
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
With only awk:
awk '$1 ~ /^Abc.*def$/ { SUM += $2 } END { print SUM } ' test.txt
Result: 30
With grep and awk:
grep "^Abc.*def.*[0-9]*$" test.txt | awk '{SUM += $2} END {print SUM}'
Result: 30
Explanation:
awk reads each line and matches the first column with a regular expression (regex)
The first column has to start with Abc, followed by anything (zero or more times), and ends with def
If such match is found, add 2nd column to SUM variable
After reading all lines print the variable
Grep extracts each line that starts with Abc, followed by anything, followed by def, followed by anything, followed by a number (zero or more times) to end. Those lines are fed/piped to awk. Awk just increments SUM for each line it receives. After reading all lines received, it prints the SUM variable.
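The awk-only version can be checked directly against the sample file from the answer:

```shell
# Recreate the sample input
cat > test.txt <<'EOF'
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
EOF

# Match lines whose first field starts with Abc and ends with def,
# accumulate their second field, and print the total at the end
awk '$1 ~ /^Abc.*def$/ { sum += $2 } END { print sum }' test.txt
```

The two Abc...def rows contribute 10 and 20, so the total printed is 30; the header and the ajsk,,lll row are ignored.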
Thanks, edited. Do you want the command like this?
grep "^Abc.*def *.*$"
If you don't want to use cat, and also show the line numbers:
awk '{print $1}' filename | grep -n keyword
If applicable, you may consider caret ^: grep -E '^foo|^bar' it will match text at the beginning of the string. Column one is always located at the beginning of the string.
Regular expression > POSIX basic and extended
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

How to use grep or awk to process a specific column ( with keywords from text file )

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 Female Names , all in plain text , all in one file.
How can I "grep" the first column (since I am only matching first names) but still print the entire customer record?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
Would print the lines of your csv file that contain first names of any found in your file female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
Would output lines not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
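Here is that two-file idiom run end to end, with a small hypothetical names list and a second record added to the question's sample so there is something to match:

```shell
cat > female.names.txt <<'EOF'
Heather
Irene
Jane
EOF

cat > file.csv <<'EOF'
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
Jane,Doe,22,Oak,Boston,MA,jane#doe.com,01/01/1990
EOF

# First pass (NR==FNR is true only for the first file): store each name
# as an array key. Second pass: print records whose first field is a key.
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
```

Only the Jane record is printed, because "Jane" is the only first field present in the names file.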
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;

BEGIN {
    while (<ARGV>) {
        chomp;
        $names{$_} = 1;
    }
}

print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose, you have a file having the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines whose first field has a vowel as both its first and last letter. A simple grep would return both lines, but the following gives you only the first line, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then you want to find the lines whose third field has a vowel as both its first and last letter. Similarly, a simple grep would return both lines, but the following gives you only the second line, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The value in the first pair of curly braces, {1,}, specifies that the preceding character class, which covers the range from 0 to z in the ASCII table, can occur one or more times. After that comes the field separator, a space in this case. Change the value in the second pair of curly braces ({0} or {2}) to the desired field number minus one. Then use a regular expression to express your criteria.
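Both commands can be verified against the sample file (using grep -E, the modern spelling of egrep):

```shell
cat > test.txt <<'EOF'
abe 123 bdb 532
xyz 593 iau 591
EOF

# {0} repetitions of "field plus space": test field 1 ("abe" matches)
grep -E '^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]' test.txt

# {2} repetitions skip two fields: test field 3 ("iau" matches)
grep -E '^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]' test.txt
```

The first command prints only the abe line, the second only the xyz line, matching the explanation above.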
