Find lines with a common value in a particular column - linux

Suppose I have a file like this
5 kata 45 buu
34 tuy 3 rre
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
21 plk 1 uio
23 kata 90 ty
I want to output only the lines that contain repeated values in the 4th column. Therefore, my desired output would be this one:
5 kata 45 buu
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
23 kata 90 ty
How can I perform this task?
I can identify and isolate the column of my interest with:
awk '{print $4}' file1 > file1_temp
and then check if there are repeated values and how many with:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' file1_temp
but that's definitely not what I would like to do.

A simple way to preserve the ordering would be to run through the file twice. The first time, keep a record of the counts, then print the ones with a count greater than 1 on the second pass:
awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
If you prefer not to loop through the file twice, you can keep track of things in a few arrays and do the printing in the END block:
awk '{ line[NR] = $0; col[NR] = $4; ++count[$4] }
END { for (i = 1; i <= NR; ++i) if (count[col[i]] > 1) print line[i] }' file
Here line stores the contents of the whole line, col stores the fourth column and count does the same as before.
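To double-check, the two-pass command can be run against the sample data from the question (written to a scratch file here purely for the demonstration):

```shell
# Recreate the sample file from the question.
cat > file <<'EOF'
5 kata 45 buu
34 tuy 3 rre
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
21 plk 1 uio
23 kata 90 ty
EOF

# Pass 1 (NR == FNR) counts each 4th-column value; pass 2 prints the
# lines whose value occurred more than once, preserving input order.
awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
```

The five lines of the desired output come back in their original order.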

Related

Looping though list of IDs to count matches in two columns

This is going to be a complicated one to explain, so bear with me.
I am doing a blastp comparison of multiple proteins, all vs all, and want the number of shared proteins between the genomes.
I have a large file of the query id and sequence id, example:
A A 100
A A 100
A A 100
A B 74
A B 47
A B 67
A C 73
A C 84
A C 74
A D 48
A D 74
A D 74
B A 67
B A 83
B A 44
B B 100
The file continues like that. I'd like to count the number of occurrences of A in column 1 and B in column 2. I have found a way to do this with awk:
awk -F, '$1=="A" && $2=="A"' file | wc -l
However, I have hundreds of genomes, and this would involve typing the awk script thousands of times to get the different combinations. I added the IDs from column 1 to a text file and tried a loop over all the IDs to cover all possible combinations:
for i in $(cat ID.txt); do input_file=file.csv; awk -F, '$1==$i && $2==$i' ${input_file} | wc -l; done
This is the output:
0
0
0
0
0
0
0
etc.
I'd like the output to be:
A A 60
A B 54
A C 34
A D 35
etc.
Any help would be appreciated.
If I'm understanding correctly, then you can collect the count for each pair into an array, and then print out the array once complete:
awk -F, '{++a[$1 FS $2]} END{for(entry in a){print entry, a[entry]}}' file
A,A 3
B,A 3
A,B 3
B,B 1
A,C 3
A,D 3
This is doing the following:
Increment the count in array a for the item with the key constructed from the concatenation of the first two columns, separated by the field separator FS (comma): {++a[$1 FS $2]}
Once the file processing is done (END), loop through the array, calling each array entry entry: for (entry in a)
In the loop, print the key/entry and the value {print entry, a[entry]}
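Note that for (entry in a) visits keys in an unspecified order, so piping through sort gives a stable result. A quick check, assuming the input really is comma-separated as the -F, flag implies (the question's sample is shown space-separated):

```shell
# Comma-separated version of the sample from the question.
cat > file <<'EOF'
A,A,100
A,A,100
A,A,100
A,B,74
A,B,47
A,B,67
A,C,73
A,C,84
A,C,74
A,D,48
A,D,74
A,D,74
B,A,67
B,A,83
B,A,44
B,B,100
EOF

# Count each (column 1, column 2) pair, then sort for a stable order.
awk -F, '{++a[$1 FS $2]} END{for(entry in a){print entry, a[entry]}}' file | sort
```

which prints A,A 3 / A,B 3 / A,C 3 / A,D 3 / B,A 3 / B,B 1, one pair per line.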
…input… | WHINY_USERS=1 \ # Not trying to insult anyone -
# this is a special shell parameter
# recognized by mawk-1 to have array
# indices pre-sorted, somewhat similar to gawk's
#
# PROCINFO["sorted_in"]="#ind_str_asc"
mawk '{ __[$!--NF]-- } END { for(_ in __) { print _, -__[_] } }' OFS=',' FS='[, \t]+'
A,A,3
A,B,3
A,C,3
A,D,3
B,A,3
B,B,1
If there's a chance of more than 3 columns in the input, then do:
{m,n,g}awk '
BEGIN { _ += _ ^= FS = "["(OFS=",")" \t]+"
} { __[$!(NF=_)]++
} END {
for(_ in __) { print _, __[_] } }'
Letting $1 = $1 (or, as here, the assignment to NF) take care of placing the comma between columns 1 and 2 saves having to do it manually.
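The rebuild trick is easy to see in isolation: assigning to any field (or to NF) makes awk re-join $0 using OFS.

```shell
# $1 = $1 is a no-op for the data but forces awk to rebuild the
# record, replacing the input separators with OFS.
echo 'a b c' | awk 'BEGIN{OFS=","} {$1 = $1; print}'
```

which prints a,b,c.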

Sum each row in a CSV file and sort it by specific value bash

I have a question. Taking the comma-separated CSV below, I want to run a bash script that sums the values in columns 7, 8 and 9 for each row and, for each specific city, shows the row with the maximum sum.
Original dataset:
Row,name,city,age,height,weight,good rates,bad rates,medium rates
1,john,New York,25,186,98,10,5,11
2,mike,New York,21,175,87,19,6,21
3,Sandy,Boston,38,185,88,0,5,6
4,Sam,Chicago,34,167,76,7,0,2
5,Andy,Boston,31,177,85,19,0,1
6,Karl,New York,33,189,98,9,2,1
7,Steve,Chicago,45,176,88,10,3,0
The desired output would be:
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
7,Steve,Chicago,45,176,88,10,3,0,13
I'm trying with this, but it only gives me the single highest sum (46); I need it per city, and I need the whole row shown. Any ideas how to continue?
awk 'BEGIN {FS=OFS=","}{sum = 0; for (i=7; i<=9;i++) sum += $i} NR ==1 || sum >max {max = sum}
You may use this awk:
awk '
BEGIN {FS=OFS=","}
NR==1 {
print $0, "max rates by city"
next
}
{
s = $7+$8+$9
if (s > max[$3]) {
max[$3] = s
rec[$3] = $0
}
}
END {
for (i in max)
print rec[i], max[i]
}' file
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
7,Steve,Chicago,45,176,88,10,3,0,13
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
or to get tabular output:
awk 'BEGIN {FS=OFS=","} NR==1{print $0, "max rates by city"; next} {s=$7+$8+$9; if (s > max[$3]) {max[$3] = s; rec[$3] = $0}} END {for (i in max) print rec[i], max[i]}' file | column -s, -t
Row name city age height weight good rates bad rates medium rates max rates by city
7 Steve Chicago 45 176 88 10 3 0 13
2 mike New York 21 175 87 19 6 21 46
5 Andy Boston 31 177 85 19 0 1 20
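Because END's for (i in max) walks the array in an unspecified order, the rows can come out shuffled, as in the output above. If the original row order matters, one option (a sketch of mine, not part of the answer) is to hold the header aside and numerically sort the remaining rows on the first field:

```shell
# Recreate the dataset from the question.
cat > file <<'EOF'
Row,name,city,age,height,weight,good rates,bad rates,medium rates
1,john,New York,25,186,98,10,5,11
2,mike,New York,21,175,87,19,6,21
3,Sandy,Boston,38,185,88,0,5,6
4,Sam,Chicago,34,167,76,7,0,2
5,Andy,Boston,31,177,85,19,0,1
6,Karl,New York,33,189,98,9,2,1
7,Steve,Chicago,45,176,88,10,3,0
EOF

# Same awk as the answer, then: read off the header line, print it,
# and sort the remaining rows numerically on the Row field.
awk 'BEGIN {FS=OFS=","}
     NR==1 {print $0, "max rates by city"; next}
     {s=$7+$8+$9; if (s > max[$3]) {max[$3]=s; rec[$3]=$0}}
     END {for (i in max) print rec[i], max[i]}' file |
  { IFS= read -r header; printf '%s\n' "$header"; sort -t, -k1,1n; }
```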

How to use AWK to continuously output lines from a file

I have a file with multiple lines, and I want to continuously output windows of lines from the file: the first time, print lines 1 to 5; the next time, lines 2 to 6; and so on.
I find AWK very useful and tried to write a script on my own, but it just outputs nothing.
Following is my code:
#!/bin/bash
for n in `seq 1 3`
do
N1=$n
N2=$((n+4))
awk -v n1="$N1" -v n2="$N2" 'NR == n1, NR == n2 {print $0}' my_file >> new_file
done
For example, I have an input file called my_file
1 99 tut
2 24 bcc
3 32 los
4 33 rts
5 642 pac
6 23 caas
7 231 cdos
8 1 caee
9 78 cdsa
Then I expect an output file as
1 99 tut
2 24 bcc
3 32 los
4 33 rts
5 642 pac
2 24 bcc
3 32 los
4 33 rts
5 642 pac
6 23 caas
3 32 los
4 33 rts
5 642 pac
6 23 caas
7 231 cdos
Could you please try the following, written and tested with the shown samples in GNU awk. You list every starting line number in the lines_from variable, and the till_lines variable says how many further lines to print from each starting line (e.g. from the 1st line, print the next 4 lines too). On another note, I tested the OP's code and it worked fine for me, generating new_file; still, calling awk in a bash loop is not good practice, so this is also an improvement in that respect.
awk -v lines_from="1,2,3" -v till_lines="4" '
BEGIN{
num=split(lines_from,arr,",")
for(i=1;i<=num;i++){ line[arr[i]] }
}
FNR==NR{
value[FNR]=$0
next
}
(FNR in line){
print value[FNR] > "output_file"
j=""
while(++j<=till_lines){ print value[FNR+j] > "output_file" }
}
' Input_file Input_file
When I see contents of output_file I could see following:
cat output_file
1 99 tut
2 24 bcc
3 32 los
4 33 rts
5 642 pac
2 24 bcc
3 32 los
4 33 rts
5 642 pac
6 23 caas
3 32 los
4 33 rts
5 642 pac
6 23 caas
7 231 cdos
Explanation: Adding detailed explanation for above.
awk -v lines_from="1,2,3" -v till_lines="4" ' ##Starting awk program from here and creating 2 variables lines_from and till_lines here, where lines_from will have all line numbers which one wants to print from. till_lines is the value till lines one has to print.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(lines_from,arr,",") ##Splitting lines_from into arr with delimiter of , here.
for(i=1;i<=num;i++){ ##Running a for loop from i=1 to till value of num here.
line[arr[i]] ##Creating array line with index of value of array arr with index of i here.
}
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1st time Input_file is being read.
value[FNR]=$0 ##Creating value with index as FNR and its value is current line.
next ##next will skip all further statements from here.
}
(FNR in line){ ##Checking condition if current line number is coming in array then do following.
print value[FNR] > "output_file" ##Printing value with index of FNR into output_file
j="" ##Nullifying value of j here.
while(++j<=till_lines){ ##Running while loop from j=1 to till value of till_lines here.
print value[FNR+j] > "output_file" ##Printing value of array value with index of FNR+j and print output into output_file
}
}
' Input_file Input_file ##Mentioning Input_file names here.
Another awk variant
awk '
BEGIN {N1=1; N2=5}
{ arr[NR] = $0 }
END {
while (arr[N2]) {
for (i=N1; i<=N2; i++)
print arr[i]
N1++
N2++
}
}
' file
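One difference worth knowing (my observation, not from the answer): this variant keeps sliding until the window falls off the end of the file, so on the nine-line sample it prints five windows (lines 1-5 through 5-9), not just the three the question asked for. A quick count:

```shell
# Recreate the sample input from the question.
cat > my_file <<'EOF'
1 99 tut
2 24 bcc
3 32 los
4 33 rts
5 642 pac
6 23 caas
7 231 cdos
8 1 caee
9 78 cdsa
EOF

awk '
BEGIN {N1=1; N2=5}
{ arr[NR] = $0 }
END {
  while (arr[N2]) {   # stop once the window end has no stored line
    for (i=N1; i<=N2; i++)
      print arr[i]
    N1++
    N2++
  }
}
' my_file | wc -l     # 5 windows x 5 lines = 25
```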

Use printf to format list that is uneven

I have a small list of student grades and need to format it side by side depending on the gender of the student, so one column is Male and the other Female. The problem is that the list doesn't alternate male, female, male, female; it is uneven.
I've tried using printf to format the output so the 2 columns are side by side, but the format is ruined because of the uneven list.
Name Gender Mark1 Mark2 Mark3
AA M 20 15 35
BB F 22 17 44
CC F 19 14 25
DD M 15 20 42
EE F 18 22 30
FF M 0 20 45
This is the list I am talking about ^^
awk 'BEGIN {print "Male" " Female"} {if (NR!=1) {if ($2 == "M") {printf "%-s %-s %-s", $3, $4, $5} else if ($2 == "F") {printf "%s %s %s\n", $3, $4 ,$5}}}' text.txt
So I'm getting results like
Male Female
20 15 35 22 17 44
19 14 25
15 20 42 18 22 30
0 20 45
But I want it like this:
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
I haven't added separators yet; I'm just trying to figure this out. I'm not sure if it would be better to put the marks into 2 arrays depending on gender and then print them out.
Another solution, which tries to handle the case where the M/F counts are unequal:
$ awk 'NR==1 {print "Male\tFemale"}
NR>1 {k=$2;$1=$2="";sub(/ +/,"");
if(k=="M") m[++mc]=$0; else f[++fc]=$0}
END {max=mc>fc?mc:fc;
for(i=1;i<=max;i++) print (m[i]?m[i]:"-") "\t" (f[i]?f[i]:"-")}' file |
column -ts$'\t'
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
Something like this?
awk 'BEGIN{format="%2s %2s %2s %2s\n";printf("Male Female\n"); }NR>1{if (s) { if ($2=="F") {printf(format, s, $3, $4, $5);} else {printf(format, $3,$4,$5,s);} s=""} else {s=sprintf("%2s %2s %2s", $3, $4, $5)}}' file
Another approach using awk
awk '
BEGIN {
print "Male\t\tFemale"
}
NR > 1 {
I = ++G[$2]
A[$2 FS I] = sprintf("%2d %2d %2d", $(NF-2), $(NF-1), $NF)
}
END {
M = ( G["M"] > G["F"] ? G["M"] : G["F"] )
for ( i = 1; i <= M; i++ )
print A["M" FS i] ? A["M" FS i] : OFS, A["F" FS i] ? A["F" FS i] : OFS
}
' OFS='\t' file
This might work for you (GNU sed):
sed -Ee '1c\Male Female' -e 'N;s/^.. M (.*)\n.. F(.*)/\1\2/;s/^.. F(.*)\n.. M (.*)/\2\1/' file
Change the header line. Then compare a pair of lines and re-arrange them as appropriate.
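A different sketch of the same pairing idea (mine, not from the answers above), using bash process substitution and paste: one awk pass selects the male rows, another the female rows, and paste joins them positionally, leaving the longer column ragged if the counts differ:

```shell
# Recreate the grade list from the question.
cat > text.txt <<'EOF'
Name Gender Mark1 Mark2 Mark3
AA M 20 15 35
BB F 22 17 44
CC F 19 14 25
DD M 15 20 42
EE F 18 22 30
FF M 0 20 45
EOF

# Pair the i-th male row with the i-th female row, tab-separated.
paste <(awk 'NR>1 && $2=="M" {print $3, $4, $5}' text.txt) \
      <(awk 'NR>1 && $2=="F" {print $3, $4, $5}' text.txt)
```

For the sample data this gives the three tab-separated pairs from the desired output (without the Male/Female header).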

extract a list of data from multiple files

I would like to ask for help on this. Thank you very much!
I have thousands of files, each containing 5 columns and the first column containing names.
$ cat file1
name math eng hist sci
Kyle 56 45 68 97
Angela 88 86 59 30
June 48 87 85 98
I also have a file containing a list of names that can be found in the 5-column files.
$ cat list.txt
June
Isa
Angela
Manny
Specifically, I want to extract, say, the data in the 3rd column, corresponding to the names in my list file, in a structured way: the thousands of files as columns and the names as rows. If a name in the list file is not present in a 5-column file, it should be reported as 0. Additionally, the columns should be headed with the filenames.
$ cat output.txt
names file1 file2 file3 file4
June 87 65 67 87
Isa 0 0 0 54
Angela 86 75 78 78
Manny 39 46 0 38
Using your test files list.txt and file1 (twice) for testing. First the awk:
$ cat program.awk
function isEmpty(arr, idx) { # using #EdMorton's test for array emptiness
for (idx in arr) # for figuring out the first data file
return 0 # https://stackoverflow.com/a/20078022/4162356
return 1
}
function add(n,a) { # appending grades for the chosen ones
if(!isEmpty(a)) { # if a is not empty
for(i in n) # iterate thru all chosen ones
n[i]=n[i] (n[i]==""?"":OFS) (i in a?a[i]:0) # and append
delete a # reset a for the next data file
}
}
FNR==1 { # for each new file
h=h (h==""?"":OFS) FILENAME # build header
add(n,a) # and process the previous file in hash a
}
NR==FNR { # chosen ones to hash n
n[$1]
next
}
$1 in n { # add chosen ones to a
a[$1]=$3 #
}
END {
add(n,a) # in the end
print h # print the header
for(i in n) # and names with grades
print i,n[i]
}
Running it:
$ awk -f program.awk list.txt file1 file1
list.txt file1 file1
Manny 0 0
Isa 0 0
Angela 86 86
June 87 87
$ cat awk-script
BEGIN{f_name="names"} # save the "names" to var f_name
NR==FNR{
a[$1]=$1;b[$1];next # set up 2 arrays a & b, whose keys are the names from "list.txt"
}
FNR==1{ # a new file is scanned
f_name=f_name"\t"FILENAME; # save the FILENAME to f_name
for(i in a){
a[i]=b[i]==""?a[i]:a[i]"\t"b[i]; # flush the value of b[i] to append to the value of a[i]
b[i]=0 # reset the value of b[i]
}
}
{ if($1 in b){b[$1]=$3} } # set $3 as the value of b[$1] if $1 existed in the keys of array b
END{
print f_name; # print the f_name
for(i in a){
a[i]=b[i]==""?a[i]:a[i]"\t"b[i]; # flush the value of b[i] into a[i] for the last file
print a[i] # print a[i]
}
}
Assuming more than one file (i.e., file1, file2, etc.) exists, you may use this command to get the result:
$ awk -f awk-script list.txt file*
names file1 file2
Manny 0 46
Isa 0 0
Angela 86 75
June 87 65
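For a quick end-to-end check, here is a compact variant of the same idea (my sketch, not one of the answers; file2's contents are made up to exercise the missing-name case). It also preserves the order of names from list.txt:

```shell
cat > list.txt <<'EOF'
June
Isa
Angela
Manny
EOF
# file1 is taken from the question; file2 is hypothetical.
cat > file1 <<'EOF'
name math eng hist sci
Kyle 56 45 68 97
Angela 88 86 59 30
June 48 87 85 98
EOF
cat > file2 <<'EOF'
name math eng hist sci
June 10 65 20 30
Manny 5 46 7 8
EOF

# Remember the wanted names (and their order), store each file's 3rd
# column per name, and print the table at the END; unset entries
# become 0 via the "+ 0" coercion.
awk '
NR == FNR    { want[$1]; order[++n] = $1; next }   # names from list.txt
FNR == 1     { files[++nf] = FILENAME; next }      # header line of each data file
($1 in want) { val[$1, nf] = $3 }                  # remember the 3rd column
END {
  printf "names"
  for (f = 1; f <= nf; f++) printf " %s", files[f]
  print ""
  for (i = 1; i <= n; i++) {
    printf "%s", order[i]
    for (f = 1; f <= nf; f++) printf " %s", val[order[i], f] + 0
    print ""
  }
}' list.txt file1 file2
```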
