Count the unique values and how many times they appear - linux

I have a csv file with
value name date sentence
0000 name1 date1 I want apples
0021 name2 date1 I want bananas
0212 name3 date2 I want cars
0321 name1 date3 I want pinochio doll
0123 name1 date1 I want lemon
0100 name2 date1 I want drums
1021 name2 date1 I want grape
2212 name3 date2 I want laptop
3321 name1 date3 I want Pot
4123 name1 date1 I want WC
2200 name4 date1 I want ramen
1421 name5 date1 I want noodle
2552 name4 date2 I want film
0211 name6 date3 I want games
0343 name7 date1 I want dvd
I want to find the unique values in the name column (I know I have to use -f 2), but I also want to know how many times they appear, i.e. the number of sentences they made.
eg: name1,5
name2,3
name3,2
name4,2
name5,1
name6,1
name7,1
Then afterwards I want to make another dataset of how many people there are per appearance count:
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1

The answer to the first part is the awk command below:
awk -F" " 'NR>1 { print $2 } ' jerome.txt | sort | uniq -c
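To try this end to end, you can recreate the sample data and run the pipeline (the file name jerome.txt is the one used in this answer; the rows are copied from the question):

```shell
# Recreate the sample data from the question, header included.
cat > jerome.txt <<'EOF'
value name date sentence
0000 name1 date1 I want apples
0021 name2 date1 I want bananas
0212 name3 date2 I want cars
0321 name1 date3 I want pinochio doll
0123 name1 date1 I want lemon
0100 name2 date1 I want drums
1021 name2 date1 I want grape
2212 name3 date2 I want laptop
3321 name1 date3 I want Pot
4123 name1 date1 I want WC
2200 name4 date1 I want ramen
1421 name5 date1 I want noodle
2552 name4 date2 I want film
0211 name6 date3 I want games
0343 name7 date1 I want dvd
EOF

# NR>1 skips the header; sort groups identical names together so that
# uniq -c (which only collapses adjacent duplicates) can count them.
awk 'NR>1 { print $2 }' jerome.txt | sort | uniq -c
```

The counts come out in alphabetical order of the names: 5 for name1, 3 for name2, 2 each for name3 and name4, and 1 each for name5 through name7.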
For the second part, you can pipe it through Perl and get the results as below
> awk -F" " 'NR>1 { print $2 } ' jerome.txt | sort | uniq -c | perl -lane '{$app{$F[0]}++} END {@c=sort keys %app; foreach($c[0] ..$c[$#c]) {print "$_ appearance,",defined($app{$_})?$app{$_}:0 }}'
1 appearance,3
2 appearance,2
3 appearance,1
4 appearance,0
5 appearance,1
>
EDIT1:
Second part using a Perl one-liner
> perl -lane '{$app{$F[1]}++ if $.>1} END {$app2{$_}++ for(values %app);@c=sort keys %app2;foreach($c[0] ..$c[$#c]) {print "$_ appearance,",$app2{$_}+0}}' jerome.txt
1 appearance,3
2 appearance,2
3 appearance,1
4 appearance,0
5 appearance,1
>

For the 1st report, you can use:
tail -n +2 file | awk '{print $2}' | sort | uniq -c
5 name1
3 name2
2 name3
2 name4
1 name5
1 name6
1 name7
For the 2nd report, you can use:
tail -n +2 file | awk '{print $2}'| sort | uniq -c | awk 'BEGIN{max=0} {map[$1]+=1; if($1>max) max=$1} END{for(i=1;i<=max;i++){print i" appearance,",(i in map)?map[i]:0}}'
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1
The extra complexity here comes from wanting the zero counts and the custom "appearance" text in the output.

What you are after is a classic example of combining a set of Linux core tools in a pipeline.
This solves your first problem:
$ awk '(NR>1){print $2}' file | sort | uniq -c
5 name1
3 name2
2 name3
2 name4
1 name5
1 name6
1 name7
This solves your second problem:
$ awk '(NR>1){print $2}' file | sort | uniq -c | awk '{print $1}' | uniq -c
1 5
1 3
2 2
3 1
You will notice the formatting is not quite what you asked for, but this essentially solves your problem.
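To understand what each stage contributes, it helps to run the pipeline one stage at a time. A minimal sketch with a shortened copy of the question's data (file name assumed to be file, as in the answer):

```shell
# A shortened copy of the question's data.
cat > file <<'EOF'
value name date sentence
0000 name1 date1 I want apples
0021 name2 date1 I want bananas
0212 name3 date2 I want cars
0321 name1 date3 I want pinochio doll
0123 name1 date1 I want lemon
EOF

# Stage 1: drop the header, keep only the name column.
awk '(NR>1){print $2}' file
# Stage 2: sort, so identical names become adjacent.
awk '(NR>1){print $2}' file | sort
# Stage 3: collapse adjacent duplicates, prefixing each with its count.
awk '(NR>1){print $2}' file | sort | uniq -c
```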
Of course in awk you can do it in one go, but I do believe that you should try to understand the above line. Have a look at man sort and man uniq. The awk solution is:
Problem 1:
awk '(NR>1){a[$2]++}END{ for(i in a) print i "," a[i] }' file
name6,1
name7,1
name1,5
name2,3
name3,2
name4,2
name5,1
Problem 2:
awk '(NR>1){a[$2]++; m=(a[$2]<m?m:a[$2])}
END{ for(i in a) c[a[i]]++;
for(i=1;i<=m;++i) print i, "appearance,", c[i]+0
}' file
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1

Related

How can I count distinct words in a document using Linux commands?

For the data set below I tried using the uniq command but did not get a satisfactory result.
Meredith Norris Thomas;Regular Air;HomeOffice
Kara Pace;Regular Air;HomeOffice
Ryan Foster;Regular Air;HomeOffice
Code:
cat HomeOffice_sales.txt |tr " " "\n" | tr ";" "\n"| uniq -c
The result I got was wrong: the words Regular, Air and HomeOffice each occur three times, so I expected counts of 3 (e.g. 3 HomeOffice):
1 Meredith
1 Norris
1 Thomas
1 Regular
1 Air
1 HomeOffice
1 Kara
1 Pace
1 Regular
1 Air
1 HomeOffice
1 Ryan
1 Foster
1 Regular
1 Air
1 HomeOffice
uniq only counts repeated lines that are together in the input, so you need to sort before piping to uniq.
tr ' ;' '\n\n' < HomeOffice_sales.txt | sort | uniq -c
You don't need multiple tr commands, you can give a list of input and output characters.
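A self-contained reproduction, using the sample lines from the question:

```shell
# Recreate the sample file from the question.
cat > HomeOffice_sales.txt <<'EOF'
Meredith Norris Thomas;Regular Air;HomeOffice
Kara Pace;Regular Air;HomeOffice
Ryan Foster;Regular Air;HomeOffice
EOF

# A single tr maps both ' ' and ';' to newlines; sort makes duplicate
# words adjacent so that uniq -c counts them correctly.
tr ' ;' '\n\n' < HomeOffice_sales.txt | sort | uniq -c
```

With this, Air, HomeOffice and Regular each get a count of 3, as expected.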

Substitute the second digit of a column with a constant number in linux

I have a table like:
Person age
name1 45
name2 13
name3 28
name4 89
I would like, in an automated way (since it's a big table), to change the second digit of the second column to 0, so that I have decade groups instead of exact ages:
Person age
name1 40
name2 10
name3 20
name4 80
Which is the neatest way to do that? Thanks!
Try this script, it does the trick (just remove the first line from your input file):
#!/bin/bash
file_path="/home/mobaxterm/Desktop/f.txt"
dest_file="/home/mobaxterm/Desktop/f2.txt"
echo "Person age" > "$dest_file"
while read -r p; do
    name=$(echo "$p" | cut -f1 -d' ')
    # Replace the second character of the age with a 0 (45 -> 40).
    age=$(echo "$p" | cut -f2 -d' ' | sed 's/\(.\)./\10/')
    echo "$name $age" >> "$dest_file"
done < "$file_path"
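As an alternative sketch (not part of the answer above): zeroing the second digit of a two-digit age is the same as rounding down to the nearest ten, so a single awk command can do the whole job, header included. The file name f.txt is assumed.

```shell
# Recreate the input table from the question.
cat > f.txt <<'EOF'
Person age
name1 45
name2 13
name3 28
name4 89
EOF

# Print the header (line 1) untouched; for every other line replace the
# age with its decade, e.g. int(45/10)*10 == 40.
awk 'NR==1 { print; next } { print $1, int($2/10)*10 }' f.txt
# Prints:
# Person age
# name1 40
# name2 10
# name3 20
# name4 80
```

This rounds arithmetically rather than editing a character, so it also behaves sensibly for three-digit ages (115 -> 110), where replacing the second character would give the wrong decade.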

Merging two files in bash with a twist in shell linux

The following question is somewhat tricky but seemingly simple; I need to use bash.
Let us suppose I have 2 text files. The first one is
FirstFile.txt
0 1
0 2
1 1
1 2
2 0
SecondFile.txt
0 1
0 2
0 3
0 4
0 5
1 0
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
I want to be able to create a new ThirdFile.txt that contains the values that are not in the first file, meaning if a pair also occurs in the first file I want it removed, knowing that 2 0 and 0 2 are the same pair.
Can you help me out?
Using awk, you can rearrange the columns so that the lower number is always first. When reading the first file, save them as keys in an associative array. When reading the second file, test if they're not found in the array.
awk '{if ($1 <= $2) { a = $1; b = $2; } else { a = $2; b = $1 } }
FNR==NR { arr[a, b] = 1; next; }
!arr[a, b]' FirstFile.txt SecondFile.txt > ThirdFile.txt
Results:
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
paste <(cut -f2 a.txt) <(cut -f1 a.txt) > tmp.txt
cat a.txt b.txt tmp.txt | sort | uniq -u
or
cat a.txt b.txt <(paste <(cut -f2 a.txt) <(cut -f1 a.txt)) | sort | uniq -u
Result
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
Explanation
uniq removes duplicate rows from a text file.
uniq requires that its input be sorted.
uniq -u prints only the rows that do not have duplicates.
So, cat a.txt b.txt | sort | uniq -u will almost get you there: Only rows in b.txt that are not in a.txt will get printed. However it doesn't handle the reversed cases, like '1 2' <-> '2 1'.
Therefore, you need a temp file that holds all the reversed removal keys. That's what paste <(cut -f2 a.txt) <(cut -f1 a.txt) does.
Note that cut assumes columns are separated by \t's. If they are not, you will need to specify a delimiter with, for example, -d ' '.
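For example, with the space-delimited files from the question, the delimiter must be given to both cut and paste. This is a sketch assuming the files are named a.txt and b.txt (a shortened b.txt is used here):

```shell
# The first file, and a shortened version of the second.
cat > a.txt <<'EOF'
0 1
0 2
1 1
1 2
2 0
EOF
cat > b.txt <<'EOF'
0 1
0 2
0 3
1 0
1 1
1 2
2 1
2 2
EOF

# A column-swapped copy of a.txt covers the reversed pairs; uniq -u then
# keeps only rows occurring exactly once across all three inputs.
cat a.txt b.txt <(paste -d' ' <(cut -d' ' -f2 a.txt) <(cut -d' ' -f1 a.txt)) \
    | sort | uniq -u > ThirdFile.txt
cat ThirdFile.txt
# Prints:
# 0 3
# 2 2
```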

Awk stops processing when only a header row exists in the file

I have an awk command to process files as below:
awk 'FNR==1 && NR!=1 { while (/1Name/) getline; } 1 { print } ' *.test.final | sort -t $'\t' -k1,1 > test.out
This combines multiple files which have the extension .test.final.
Each file has the same format, like below:
test1.test.final
1Name column1 column2
Test1_1 5 4
Test1_2 3 2
another file test2.test.final
1Name column1 column2
Test2_1 2 4
Test2_2 3 2
So the final result looks like below:
1Name column1 column2
Test1_1 5 4
Test1_2 3 2
Test2_1 2 4
Test2_2 3 2
But sometimes it just stops processing when a file contains no data rows, like below:
test3.test.final
1Name column1 column2
It just stops and does not process.
Anyone know why and how to fix this?
All files are tab delimited.
Thanks
I think you are overcomplicating the code by using a while and getline. That loop is also why it hangs: on a file with only a header, getline returns 0 at end of input without changing $0, so the /1Name/ condition stays true forever.
Just skip the header of the files when they are not the first one; in all other cases, print normally:
awk 'FNR==1 && NR!=1 {next} 1' *.test.final
Tested with all your *.test.final files and worked well:
$ awk 'FNR==1 && NR!=1 {next} 1' *.test.final
1Name column1 column2
Test1_1 5 4
Test1_2 3 2
Test2_1 2 4
Test2_2 3 2
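A quick way to reproduce this, including the header-only edge case that hangs the original getline version, is to generate three small tab-delimited files and run the fixed one-liner (file names are made up to match the question):

```shell
# Two normal files plus one that contains only the header row.
printf '1Name\tcolumn1\tcolumn2\nTest1_1\t5\t4\nTest1_2\t3\t2\n' > test1.test.final
printf '1Name\tcolumn1\tcolumn2\nTest2_1\t2\t4\nTest2_2\t3\t2\n' > test2.test.final
printf '1Name\tcolumn1\tcolumn2\n' > test3.test.final

# FNR==1 is true on the first line of every file; NR!=1 rules out the
# very first line of the whole run. So only the repeated headers are
# skipped, and a data-less file simply contributes nothing.
awk 'FNR==1 && NR!=1 {next} 1' *.test.final
```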

Cannot get this simple sed command

This sed command is described as follows
Delete the cars that are $10,000 or more. Pipe the output of the sort into a sed to do this, by quitting as soon as we match a regular expression representing 5 (or more) digits at the end of a record (DO NOT use repetition for this):
So far the command is:
$ grep -iv chevy cars | sort -nk 5
I have to add another pipe at the end of that command, I think, which "quits as soon as we match a regular expression representing 5 or more digits at the end of a record".
I tried things like
$ grep -iv chevy cars | sort -nk 5 | sed "/[0-9][0-9][0-9][0-9][0-9]/ q"
and other variations within the // but nothing works! What is the command that matches a regular expression representing 5 or more digits at the end of a record and quits?
Nominally, you should add a $ before the second / to match 5 digits at the end of the record. If you omit the $, then any sequence of 5 digits will cause sed to quit, so if there is another number (a VIN, perhaps) before the price, it might match when you didn't intend it to.
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/q'
On the whole, it's safer to use single quotes around the regex, unless you need to substitute a shell variable into it (or unless the regex contains single quotes itself). You can also specify the repetition:
grep -iv chevy cars | sort -nk 5 | sed '/[0-9]\{5,\}$/q'
The \{5,\} part matches 5 or more digits. If for any reason that doesn't work, you might find you're using GNU sed and you need to do something like sed --posix to get it working in the normal mode. Or you might be able to just remove the backslashes. There certainly are options to GNU sed to change the regex mechanism it uses (as there are with GNU grep too).
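Since the cars file isn't shown in the question, here is a sketch with made-up data to illustrate the quit behaviour. One caveat worth knowing: q prints the line it matched before quitting, so the first 5-digit price still appears in the output.

```shell
# Hypothetical data: make, doors, seats, wheels, price in field 5.
cat > cars <<'EOF'
ford 2 3 4 5000
bmw 2 3 4 10000
plym 2 3 4 9999
honda 2 3 4 25000
EOF

# Sort numerically on the price, then quit at the first record ending in
# 5 or more digits. Note the matched line itself is printed too.
sort -nk 5 cars | sed '/[0-9]\{5,\}$/q'
# Prints:
# ford 2 3 4 5000
# plym 2 3 4 9999
# bmw 2 3 4 10000
```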
Another way.
As you didn't post a file sample, I did this as a guess.
Here I'm looking for lines with the word "chevy" where field 5 is less than 10000.
awk '/chevy/ {if ( $5 < 10000 ) print $0} ' cars
I forgot the -i flag of grep, so the correct case-insensitive version is (note that IGNORECASE is a GNU awk feature):
awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
$ cat > cars
Chevy 2 3 4 10000
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 10000
CHEVY 2 3 4 2000
Prevy 2 3 4 1000
Prevy 2 3 4 10000
$ awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 2000
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/d'
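This d variant behaves differently from q: instead of stopping at the first match, it reads the whole input and deletes every matching line, so the matched lines never appear in the output and the input does not even need to be sorted. A sketch with made-up data (the real cars file isn't shown in the question):

```shell
# Hypothetical data: make, doors, seats, wheels, price in field 5.
cat > cars <<'EOF'
ford 2 3 4 5000
bmw 2 3 4 10000
plym 2 3 4 9999
honda 2 3 4 25000
EOF

# d deletes every line ending in 5+ digits wherever it occurs in the
# input, keeping the remaining lines in their original order.
sed '/[0-9][0-9][0-9][0-9][0-9]$/d' cars
# Prints:
# ford 2 3 4 5000
# plym 2 3 4 9999
```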
