Grep csv with arguments from another csv - linux

If I have the following in file1.csv (space delimited):
RED 4 VWX
BLUE 2 MNO
BLUE 7 DEF
PURPLE 6 JKL
BLACK 8 VWX
BROWN 1 MNO
RED 1 GHI
RED 7 ABC
And the following in file2.csv (comma delimited):
BROWN,2
RED,5
YELLOW,8
Is there a way to use file2.csv to search file1.csv for matching lines? Currently, if I want to use line 1 terms from file2.csv to search file1.csv, I have to manually enter the following:
grep "BROWN" file1.csv | grep "2"
I would like to automate this search to find lines in file1.csv that match BOTH items in a given line in file2.csv. I have tried some awk commands, but am having a hard time using awk output as an argument in grep. I am running all this through a standard Mac terminal (so I guess I'm using bash?) Any help is greatly appreciated. Thank you!

awk one-liner
awk 'FNR==NR{a[$1]=$2; next} ($1 in a) && a[$1]==$2' FS=, file2.csv FS=" " file1.csv
FNR==NR{a[$1]=$2; next}: reads the first input file, file2.csv, and builds an associative array a keyed by column 1 with the item number from column 2 as the value. The FS=, and FS=" " assignments on the command line set the field separator just before the file that follows each of them is read.
($1 in a) && a[$1]==$2: while reading the second input file, file1.csv, checks whether the column 1 value exists as a key in a and, if so, whether the item number in column 2 matches the stored value. When both are true the expression evaluates to 1 and the line is printed.
Or
simply using grep
grep -wf <(tr "," " " <file2.csv) file1.csv
Here we replace , with a space in file2.csv using tr, and use each resulting line as a pattern to search file1.csv via the -f option provided by our lovely grep.
-w matches on word boundaries, so that a pattern like ABC 1 won't match ABC 123.
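To see exactly what grep is being fed, you can run the tr step on its own:
$ tr "," " " <file2.csv
BROWN 2
RED 5
YELLOW 8
Note that with the sample data above none of those patterns actually occurs in file1.csv, so both this and the awk solution correctly print nothing; if file2.csv contained, say, BLUE,7 (not in the sample), the line BLUE 7 DEF would be printed.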

You can use awk to do the matching.
awk 'BEGIN { FS=","; }
NR == FNR { a[$1,$2]++; next; }
FNR == 1 { FS=" "; $0=$0; }
($1,$2) in a' file2.csv file1.csv
The second line creates an array whose keys are the (column 1, column 2) pairs from file2.csv. The third line switches the field separator to a space when we start processing file1.csv; the $0=$0 re-splits the current line with the new separator, since changing FS on its own only affects records read afterwards. The last line matches any line of file1.csv whose first two fields form a key in the array.

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (as there are thousands of columns and they are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this, but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' '\n' | cat -n | grep "COLNAME"
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
with your input file and the implied conditions this should work
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
or you can replace c1 and c2 with the values in the script as well.
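For example, with the two names written straight into the script instead of being passed with -v:
$ awk 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col["ad"]=="c" && $col["bd"]=="water"{count++} END{print count+0}' file
2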
to find the column indices you can run
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
or perhaps with this chain
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although may have false positives due to regex match...
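One way to cut down on those false positives (a small tweak to the chain above) is to ask grep for whole-word matches:
$ sed 1q file | tr -s ' ' \\n | nl | grep -wE 'ad|bd'
2 ad
3 bd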
You can rewrite the awk to be more succinct
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
NR==1 {
    for (i=1; i<=NF; i++) {
        f[$i] = i
    }
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated because of the tr '\t' in the command in your question, which looks like an attempt to convert tabs to newlines so you can map column names to numbers. If I'm wrong and the fields are just separated by runs of whitespace, remove -F'\t' from the above.
Use miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (delimiter is specified using --tsv) and writes the results to STDOUT together with the header. The header is removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
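If your mlr build provides the count verb (an assumption; check mlr --help), miller can also do the counting itself by chaining verbs with then, producing a single count record instead of the raw rows:
mlr --tsv filter '$ad == "c" && $bd == "water"' then count file.txt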
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
For years it bugged me there is no succinct way in Unix to do this sort of thing, although miller is a pretty good tool for this. Recently I wrote pick to choose columns by name, and additionally modify, combine and add them by name, as well as filtering rows by clauses using column names. The solution to the above with pick is
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns; -h omits it. To print columns you simply name them on the command line, e.g.
pick ad water < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

grep string after first occurrence of numbers

How do I get a string after the first occurrence of a number?
For example, I have a file with multiple lines:
34 abcdefg
10 abcd 123
999 abc defg
I want to get the following output:
abcdefg
abcd 123
abc defg
Thank you.
You could use Awk for this: loop through the fields in each line up to NF (the last field) and, once the first field containing a number is found, print everything that follows it on the line. The break statement exits the outer for loop after the first match.
awk '{ for(i=1;i<=NF;i++) if ($i ~ /[[:digit:]]+/) { for(j=i+1;j<=NF;j++) printf "%s%s", $j, (j<NF ? OFS : ORS); break } }' file
It is not clear exactly what you want, but you can try to express it in sed.
This removes everything before the first digit, the run of digits itself, and any spaces that follow:
sed 's/[^0-9]*[0-9]\+ *//'
Imagine the following input file:
001 ham
03spam
3 spam with 5 eggs
A quick solution with awk would be :
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $1}' <file>
This line substitutes the first stretch of non-digit characters followed by a run of digits with the empty string (""). That redefines $0, so you can then reprint the first field or the remainder of the line. This line gives exactly the following output.
ham
spam
spam
If you are interested in the remainder of the line
awk '{sub(/[^0-9]*[0-9]+ */,"",$0); print $0}' <file>
This will output:
ham
spam
spam with 5 eggs
Be aware that an extra " *" is needed in the regular expression to remove all trailing spaces after the number. Without it you would get
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $0}' <file>
 ham
spam
 spam with 5 eggs
You can remove the first run of digits and spaces using sed:
sed -E 's/[0-9 ]+//' file
grep can do the job:
$ grep -o -P '(?<=[0-9] ).*' inputFile
abcdefg
abcd 123
abc defg
For completeness, here is a solution with perl:
$ perl -lne 'print $1 if /[0-9]+\s*(.*)/' inputFile
abcdefg
abcd 123
abc defg

awk print number of row only in uniq column

I have data set like this:
1 A
1 B
1 C
2 A
2 B
2 C
3 B
3 C
And I have a script which calculates for me:
Number of occurrences of the search string
Number of rows
awk -v search="A" \
'BEGIN{count=0} $2 == search {count++} END{print count "\n" NR}' input
That works perfectly fine.
I would like to add to my awk one-liner the number of unique values in the first column.
So the output should be separated by \n:
2
8
3
I can do this in separate awk code, but I am not able to integrate it into my original awk code.
awk '{a[$1]++}END{for(i in a){print i}}' input | wc -l
Any idea how to integrate it into one awk solution without piping?
Looks like you want this:
awk -v search="A" '{a[$1]++}
$2 == search {count++}
END{OFS="\n";print count+0, NR, length(a)}' file
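Note that length(a) on an array is accepted by gawk and most modern awks but is not required by POSIX; if your awk rejects it, here is a sketch that counts the unique first-column values as they are seen instead:
awk -v search="A" '!seen[$1]++{u++}
$2 == search {count++}
END{OFS="\n"; print count+0, NR, u+0}' file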

Subtract a constant number from a column

I have two large files (~10GB) as follows:
file1.csv
name,id,dob,year,age,score
Mike,1,2014-01-01,2016,2,20
Ellen,2, 2012-01-01,2016,4,35
.
.
file2.csv
id,course_name,course_id
1,math,101
1,physics,102
1,chemistry,103
2,math,101
2,physics,102
2,chemistry,103
.
.
I want to subtract 1 from the "id" columns of these files:
file1_updated.csv
name,id,dob,year,age,score
Mike,0,2014-01-01,2016,2,20
Ellen,1, 2012-01-01,2016,4,35
file2_updated.csv
id,course_name,course_id
0,math,101
0,physics,102
0,chemistry,103
1,math,101
1,physics,102
1,chemistry,103
I have tried awk '{print ($1 - 1) "," $0}' file2.csv, but did not get the correct result:
-1,id,course_name,course_id
0,1,math,101
0,1,physics,102
0,1,chemistry,103
1,2,math,101
1,2,physics,102
1,2,chemistry,103
You've added an extra column in your attempt. Instead set your first field $1 to $1-1:
awk -F"," 'BEGIN{OFS=","} {$1=$1-1;print $0}' file2.csv
The semicolon separates the two commands. We set the input delimiter to comma (-F",") and the Output Field Separator to comma with BEGIN{OFS=","}. The command that subtracts 1 from the first field executes first and the print executes second, so the entire record, $0, contains the new $1 value when it's printed.
It might be helpful to only subtract 1 from records that are not your header. So you can add a condition to the first command:
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1-1} {print $0}' file2.csv
Now we only subtract when the record number (NR) is greater than 1. Then we just print the entire record.
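The same idea applied to both files from the question, remembering that id is the second column of file1.csv, and redirecting the output to the new files:
awk -F"," 'BEGIN{OFS=","} NR>1{$2=$2-1} {print $0}' file1.csv > file1_updated.csv
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1-1} {print $0}' file2.csv > file2_updated.csv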

Removing last column from rows that have three columns using bash

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it's only a handful of lines and only the first column.
Can anyone clue me into what I should be doing?
If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And it's similar to cut:
cut -f 1,2
The awk variable NF gives you the number of fields. So an expression like this should work for you.
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.
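A quick check of its behaviour, feeding it two tab-separated test lines built with printf:
$ printf 'a\tb\tc\nx\ty\n' | sed 's/\t[^\t]*//2g'
a	b
x	y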
awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
or in a much simpler way in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
