AWK count occurrences of column A based on uniqueness of column B - linux

I have a file with several columns, and I want to count the occurrences of one column based on a second column's value being unique to the first column.
For example:
column 10 column 15
-------------------------------
orange New York
green New York
blue New York
gold New York
orange Amsterdam
blue New York
green New York
orange Sweden
blue Tokyo
gold New York
I am fairly new to using commands like awk and am looking to gain more practical knowledge.
I've tried some different variations of
awk '{A[$10 OFS $15]++} END {for (k in A) print k, A[k]}' myfile
but, not quite understanding the code, I didn't get the output I expected.
I am expecting output of
orange 3
blue 2
green 1
gold 1

With GNU awk. I assume tab is your field separator.
awk '{count[$10 FS $15]++}END{for(j in count) print j}' FS='\t' file | cut -d $'\t' -f 1 | sort | uniq -c | sort -nr
Output:
3 orange
2 blue
1 green
1 gold
I suppose it could be more elegant.
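If you would rather do it all inside one awk process, here is a rough sketch of the same idea (untested against your real data; it assumes the same tab separator and column numbers): the !seen[...]++ test counts each color/city pair only the first time it appears.
awk -F'\t' '!seen[$10 FS $15]++ { cities[$10]++ }
END { for (c in cities) print c, cities[c] }' file
The output is unsorted; pipe it through sort -k2,2nr if you want the largest counts first.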

Single GNU awk invocation version (works with non-GNU awk too; it just doesn't sort the output):
$ gawk 'BEGIN{ OFS=FS="\t" }
NR>1 { names[$2,$1]=$1 }
END { for (n in names) colors[names[n]]++;
PROCINFO["sorted_in"] = "@val_num_desc";
for (c in colors) print c, colors[c] }' input.tsv
orange 3
blue 2
gold 1
green 1
Adjust column numbers as needed to match real data.
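With a non-GNU awk, one option is to drop the PROCINFO line and sort outside awk instead; a sketch of that variant (same assumptions about the input as above):
$ awk 'BEGIN{ OFS=FS="\t" }
NR>1 { names[$2,$1]=$1 }
END { for (n in names) colors[names[n]]++;
for (c in colors) print c, colors[c] }' input.tsv | sort -t$'\t' -k2,2nr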
Bonus solution that uses sqlite3:
$ sqlite3 -batch -noheader <<EOF
.mode tabs
.import input.tsv names
SELECT "column 10", count(DISTINCT "column 15") AS total
FROM names
GROUP BY "column 10"
ORDER BY total DESC, "column 10";
EOF
orange 3
blue 2
gold 1
green 1

Related

Use values in a column to separate strings in another column in bash

I am trying to separate a column of strings using the values from another column; maybe an example will make it easier to understand.
The input is a table, with the strings in column 2 separated by a comma ,.
The third column is the field number that should be output, with , as the delimiter in the second column.
Ben mango,apple 1
Mary apple,orange,grape 2
Sam apple,melon,* 3
Peter melon 1
The output should look like this, where records that correspond to an asterisk should not be output (the Sam row is dropped):
Ben mango
Mary orange
Peter melon
I am able to generate the desired output using a for loop, but I think it is quite cumbersome:
IFS=$'\n'
for i in $(cat input.txt)
do
F=`echo $i | cut -f3`
paste <(echo $i | cut -f1) <(echo $i | cut -f2 | cut -d "," -f$F) | grep -v "\*"
done
Is there any one-liner to do it maybe using sed or awk? Thanks in advance.
The key to doing it in awk is the split() function, which populates an array based on a regular expression that matches the delimiters to split a string on:
$ awk '{ split($2, fruits, /,/); if (fruits[$3] != "*") print $1, fruits[$3] }' input.txt
Ben mango
Mary orange
Peter melon
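If a row could ever name a field number larger than the number of comma-separated items, a slightly more defensive variant (just a sketch; the sample data doesn't need it) also checks split()'s return value, which is the number of elements produced:
$ awk '{ n = split($2, fruits, /,/); if ($3 <= n && fruits[$3] != "*") print $1, fruits[$3] }' input.txt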

Search the amount of unique values and how many times they appear

I have a csv file with
value name date sentence
0000 name1 date1 I want apples
0021 name2 date1 I want bananas
0212 name3 date2 I want cars
0321 name1 date3 I want pinochio doll
0123 name1 date1 I want lemon
0100 name2 date1 I want drums
1021 name2 date1 I want grape
2212 name3 date2 I want laptop
3321 name1 date3 I want Pot
4123 name1 date1 I want WC
2200 name4 date1 I want ramen
1421 name5 date1 I want noodle
2552 name4 date2 I want film
0211 name6 date3 I want games
0343 name7 date1 I want dvd
I want to find the unique values in the name column (I know I have to use -f 2), but I also want to know how many times each appears, i.e. the number of sentences they made, e.g.:
name1,5
name2,3
name3,2
name4,2
name5,1
name6,1
name7,1
Then afterwards I want to make another dataset of how many people there are per appearance count:
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1
The first part can be done with the awk below:
awk -F" " 'NR>1 { print $2 } ' jerome.txt | sort | uniq -c
For the second part, you can pipe it through Perl and get the results as below
> awk -F" " 'NR>1 { print $2 } ' jerome.txt | sort | uniq -c | perl -lane '{$app{$F[0]}++} END {@c=sort keys %app; foreach($c[0] ..$c[$#c]) {print "$_ appearance,",defined($app{$_})?$app{$_}:0 }}'
1 appearance,3
2 appearance,2
3 appearance,1
4 appearance,0
5 appearance,1
EDIT1:
Second part using a Perl one-liner
> perl -lane '{$app{$F[1]}++ if $.>1} END {$app2{$_}++ for(values %app);@c=sort keys %app2;foreach($c[0] ..$c[$#c]) {print "$_ appearance,",$app2{$_}+0}}' jerome.txt
1 appearance,3
2 appearance,2
3 appearance,1
4 appearance,0
5 appearance,1
For the 1st report, you can use:
tail -n +2 file | awk '{print $2}' | sort | uniq -c
5 name1
3 name2
2 name3
2 name4
1 name5
1 name6
1 name7
For the 2nd report, you can use:
tail -n +2 file | awk '{print $2}'| sort | uniq -c | awk 'BEGIN{max=0} {map[$1]+=1; if($1>max) max=$1} END{for(i=1;i<=max;i++){print i" appearance,",(i in map)?map[i]:0}}'
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1
The complexity here comes from the fact that you wanted zeros and the custom "appearance" text in the output.
What you are after is a classic example of combining a set of Linux core tools in a pipeline:
This solves your first problem:
$ awk '(NR>1){print $2}' file | sort | uniq -c
5 name1
3 name2
2 name3
2 name4
1 name5
1 name6
1 name7
This solves your second problem:
$ awk '(NR>1){print $2}' file | sort | uniq -c | awk '{print $1}' | uniq -c
1 5
1 3
2 2
3 1
You'll notice that some of the formatting is missing, but this essentially solves your problem.
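If you want the exact wording from the question, one rough extra formatting stage could look like this (it still omits appearance counts that never occur, such as 4, and lists the counts in descending order):
$ awk '(NR>1){print $2}' file | sort | uniq -c | awk '{print $1}' | uniq -c | awk '{print $2" appearance,",$1}'
5 appearance, 1
3 appearance, 1
2 appearance, 2
1 appearance, 3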
Of course, in awk you can do it in one go, but I do believe you should try to understand the pipeline above first. Have a look at man sort and man uniq. The awk solution is:
Problem 1:
awk '(NR>1){a[$2]++}END{ for(i in a) print i "," a[i] }' file
name6,1
name7,1
name1,5
name2,3
name3,2
name4,2
name5,1
Problem 2:
awk '(NR>1){a[$2]++; m=(a[$2]<m?m:a[$2])}
END{ for(i in a) c[a[i]]++;
for(i=1;i<=m;++i) print i, "appearance,", c[i]+0
}' file
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1

File Manipulation in UNIX - Creating duplicate records based on the count from a column and manipulating only one column

I have a space-delimited file.txt with n columns. The 3rd column in file.txt is comma-delimited, and I want to create duplicate records in the same file.txt based on the number of comma-separated values in column 3, splitting that column so each record gets one value.
--file.txt
I have 0,1,2,3 apples
I have 2,3 bananas
I have 3 oranges
--desiredoutput.txt
I have 0 apples
I have 1 apples
I have 2 apples
I have 3 apples
I have 2 bananas
I have 3 bananas
I have 3 oranges
Awk solution:
awk '$3~/,/{
n=split($3, a, ","); f=$1 OFS $2; sub(/^[^ ]+ +[^ ]+ +[^ ]+/,"");
for (i=1; i<=n; i++) print f, a[i] $0; next
}1' file
The output:
I have 0 apples
I have 1 apples
I have 2 apples
I have 3 apples
I have 2 bananas
I have 3 bananas
I have 3 oranges
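The same expansion can also be written without editing $0 by hand, which some readers may find easier to follow (a sketch; it assumes the comma-separated values always sit in the third field of a space-delimited line):
awk '{ n=split($3, vals, ","); for (i=1; i<=n; i++) { $3=vals[i]; print } }' file
Assigning to $3 rebuilds the record with the output field separator, so each printed line carries exactly one of the split values.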

Linux command and/or script for duplicate lines retrieval

I would like to know if there's an easy way to locate duplicate lines in a text file that contains many entries (about 200,000 or more) and output a file with the duplicates' line numbers, keeping the source file intact. For instance, I have a file with tweets like this:
1. i got red apple
2. i got red apple in my stomach
3. i got green apple
4. i got red apple
5. i like blue bananas
6. i got red apple
7. i like blues music
8. i like blue bananas
9. i like blue bananas
I want the output to be a separate file like this:
4
6
8
9
where numbers will indicate the lines with duplicate entries (excluding the first occurrence of the duplicates). Also note that the matching pattern must be exactly the same sentence (like line 1 is different than line 2, 5 is different than 7 and so on).
Everything I could find with sort | uniq seems to match only the first word of the sentence rather than the whole sentence, so I'm wondering whether an awk script would be better for this task, or whether there is another command that can do it.
I also need the first file to be intact (not sorted or reordered in any way) and get only the line numbers as shown above because I want to manually delete these lines from two files. The first file contains the tweets and the second the hashtags of these tweets, so I want to delete the lines that contain duplicate tweets in both files, keeping the first occurrence.
You can try this awk:
awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
As per the comment, if you want every repeated occurrence after the first reported (not only the second one):
awk '$0 in a{print NR} {a[$0]++}' file
Output:
$ awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
4
8
$ awk '$0 in a{print NR} {a[$0]++}' file
4
6
8
9
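Once you have the line numbers saved to a file, say dups.txt (a hypothetical name), you can drop those lines from both of your files with the usual two-file awk idiom, keeping everything else in its original order (tweets.txt and hashtags.txt are placeholder names):
awk 'NR==FNR{del[$1]; next} !(FNR in del)' dups.txt tweets.txt > tweets_clean.txt
awk 'NR==FNR{del[$1]; next} !(FNR in del)' dups.txt hashtags.txt > hashtags_clean.txt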
You could use a Python script to do the same:
f = open("file")
lines = f.readlines()
count = len(lines)
ignore = []
for i in range(count):
    if i in ignore:
        continue
    for j in range(count):
        if j <= i:
            continue
        if lines[i] == lines[j]:
            ignore.append(j)
            print(j + 1)
Output:
4
6
8
9
Here is a method combining a few command line tools:
nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' |
cut -f 1
This pipeline:
numbers the lines with nl, left adjusted with no leading zeroes (-n ln)
sorts them (ignoring the first field, i.e., the line number) with sort
finds duplicate lines, ignoring the first field, with uniq; the --all-repeated=prepend option adds an empty line before each group of duplicate lines
removes all the empty lines and the first line of each group of duplicates with sed
removes everything but the line number with cut
This is what the output looks like at the different stages:
$ nl -n ln file
1 i got red apple
2 i got red apple in my stomach
3 i got green apple
4 i got red apple
5 i like blue bananas
6 i got red apple
7 i like blues music
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2
3 i got green apple
1 i got red apple
4 i got red apple
6 i got red apple
2 i got red apple in my stomach
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
7 i like blues music
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend

1 i got red apple
4 i got red apple
6 i got red apple

5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}'
4 i got red apple
6 i got red apple
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' | cut -f 1
4
6
8
9

AWK count number of times a term appears with respect to other columns

Given a CSV file:
id, fruit, binary
1, apple, 1
2, orange, 0
3, pear, 1
4, apple, 0
5, peach, 0
6, apple, 1
How can I calculate, for each unique value in fruit, the number of times the binary value equals 1 divided by the number of occurrences of that fruit in the fruit column?
Another way to do it is to sum the value of the binary column for each unique fruit.
For example:
The fruit apple appeared with binary = 1 two times and had a frequency of 3. Hence I will get 2/3.
How can I write this as efficient AWK code?
I know that I can do this to get the unique values from the second column:
cut -d , -f2 file.csv | sort | uniq
or
awk '{ a[$2]++ } END { for (b in a) { print b } }' file.csv
So my non-working code looks like this:
cat file.csv | awk '{ a[$2]++ } END { for (b in a) if ($3==1) {sum+=$3} END {print $0 sum}'
and
awk '{ a[$2]++ } END { for (b in a) { sum +=1 } }' file.csv
I need help correcting my syntax and merging the two awk snippets.
This should work for you:
$ cat file.csv
1, apple, 1
2, orange, 0
3, pear, 1
4, apple, 0
5, peach, 0
6, apple, 1
$ cat file.csv|awk -F',' '{ $3 == 1 && fruit[$2]++; tfruit[$2]++ } END { for ( fr in tfruit) { print fr, fruit[fr], tfruit[fr] } }'
pear 1 1
apple 2 3
orange 1
peach 1
Almost the same as the other answer, but printing 0 instead of blank.
AMD$ awk -F, 'NR>1{a[$2]+=$3;b[$2]++} END{for(i in a)print i, a[i], b[i]}' File
pear 1 1
apple 2 3
orange 0 1
peach 0 1
Taking , as the field separator: for all lines except the first, update array a, i.e. $2 (the fruit name) is used as the index, adding up the number of times binary is 1 for that fruit. Also increase b[$2] by one; this will be the number of times the fruit is seen. At the end, print the fruit, the binary count, and the number of times the fruit was seen. Hope it is clear.
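If you want the actual fraction (e.g. 2/3 for apple) printed rather than the two counts, here is a small sketch along the same lines (it assumes the header line is present and that fields are separated by a comma plus optional spaces; output order may vary):
awk -F' *, *' 'NR>1{ones[$2]+=$3; total[$2]++} END{for(f in total) printf "%s %d/%d = %.2f\n", f, ones[f], total[f], ones[f]/total[f]}' file.csv
pear 1/1 = 1.00
apple 2/3 = 0.67
orange 0/1 = 0.00
peach 0/1 = 0.00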
