How do I count distinct words in a document using a Linux command?

For the data set below I tried using the uniq command, but did not get a satisfactory result.
Meredith Norris Thomas;Regular Air;HomeOffice
Kara Pace;Regular Air;HomeOffice
Ryan Foster;Regular Air;HomeOffice
Code:
cat HomeOffice_sales.txt |tr " " "\n" | tr ";" "\n"| uniq -c
The result I got was wrong: Regular, Air and HomeOffice each appear three times, so I expected counts of 3 for them, but I got:
1 Meredith
1 Norris
1 Thomas
1 Regular
1 Air
1 HomeOffice
1 Kara
1 Pace
1 Regular
1 Air
1 HomeOffice
1 Ryan
1 Foster
1 Regular
1 Air
1 HomeOffice

uniq only counts repeated lines that are together in the input, so you need to sort before piping to uniq.
tr ' ;' '\n\n' < HomeOffice_sales.txt | sort | uniq -c
You don't need multiple tr commands; you can give tr a list of input and output characters.
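For the three sample lines above, that corrected pipeline should print roughly the following (the exact spacing of the uniq -c counts may differ):
3 Air
1 Foster
3 HomeOffice
1 Kara
1 Meredith
1 Norris
1 Pace
3 Regular
1 Ryan
1 Thomas
If you want the most frequent words first, append | sort -nr to the pipeline.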

Related

Find the unique values and how many times they appear

I have a csv file with the following content:
value name date sentence
0000 name1 date1 I want apples
0021 name2 date1 I want bananas
0212 name3 date2 I want cars
0321 name1 date3 I want pinochio doll
0123 name1 date1 I want lemon
0100 name2 date1 I want drums
1021 name2 date1 I want grape
2212 name3 date2 I want laptop
3321 name1 date3 I want Pot
4123 name1 date1 I want WC
2200 name4 date1 I want ramen
1421 name5 date1 I want noodle
2552 name4 date2 I want film
0211 name6 date3 I want games
0343 name7 date1 I want dvd
I want to find the unique values in the name column (I know I have to use -f 2), but I also want to know how many times each one appears, i.e. how many sentences they made, e.g.:
name1,5
name2,3
name3,2
name4,2
name5,1
name6,1
name7,1
Afterwards I want to build another data set showing how many people there are per appearance count:
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1
The answer to the first part uses awk, as below:
awk -F" " 'NR>1 { print $2 } ' jerome.txt | sort | uniq -c
For the second part, you can pipe it through Perl and get the results as below
> awk -F" " 'NR>1 { print $2 } ' jerome.txt | sort | uniq -c | perl -lane '{$app{$F[0]}++} END {@c=sort keys %app; foreach($c[0] ..$c[$#c]) {print "$_ appearance,",defined($app{$_})?$app{$_}:0 }}'
1 appearance,3
2 appearance,2
3 appearance,1
4 appearance,0
5 appearance,1
>
EDIT1:
Second part using a Perl one-liner
> perl -lane '{$app{$F[1]}++ if $.>1} END {$app2{$_}++ for(values %app);@c=sort keys %app2;foreach($c[0] ..$c[$#c]) {print "$_ appearance,",$app2{$_}+0}}' jerome.txt
1 appearance,3
2 appearance,2
3 appearance,1
4 appearance,0
5 appearance,1
>
For the 1st report, you can use:
tail -n +2 file | awk '{print $2}' | sort | uniq -c
5 name1
3 name2
2 name3
2 name4
1 name5
1 name6
1 name7
For the 2nd report, you can use:
tail -n +2 file | awk '{print $2}'| sort | uniq -c | awk 'BEGIN{max=0} {map[$1]+=1; if($1>max) max=$1} END{for(i=1;i<=max;i++){print i" appearance,",(i in map)?map[i]:0}}'
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1
The extra complexity here comes from wanting a 0 for missing counts and the custom "appearance" text in the output.
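For readability, the same second-report awk can also be spread over several lines with comments; this is just the one-liner above reformatted, not a different approach:
tail -n +2 file | awk '{print $2}' | sort | uniq -c |
awk 'BEGIN { max = 0 }
     {
       map[$1] += 1              # how many names share this appearance count
       if ($1 > max) max = $1    # remember the largest appearance count seen
     }
     END {
       for (i = 1; i <= max; i++)   # walk 1..max so missing counts print as 0
         print i " appearance,", ((i in map) ? map[i] : 0)
     }'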
What you are after is a classic example of combining a set of Linux core tools in a pipeline:
This solves your first problem:
$ awk '(NR>1){print $2}' file | sort | uniq -c
5 name1
3 name2
2 name3
2 name4
1 name5
1 name6
1 name7
This solves your second problem:
$ awk '(NR>1){print $2}' file | sort | uniq -c | awk '{print $1}' | uniq -c
1 5
1 3
2 2
3 1
You will notice that the formatting is a bit off, but this essentially solves your problem.
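If you want the same "N appearance" wording on that compact output, one more awk at the end relabels the columns (a sketch; unlike the solutions before and after this one, it will not print the missing "4 appearance, 0" row):
awk '(NR>1){print $2}' file | sort | uniq -c | awk '{print $1}' | uniq -c | awk '{print $2 " appearance,", $1}' | sort -n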
Of course in awk you can do it in one go, but I do believe that you should try to understand the above line. Have a look at man sort and man uniq. The awk solution is:
Problem 1:
awk '(NR>1){a[$2]++}END{ for(i in a) print i "," a[i] }' file
name6,1
name7,1
name1,5
name2,3
name3,2
name4,2
name5,1
Problem 2:
awk '(NR>1){a[$2]++; m=(a[$2]<m?m:a[$2])}
END{ for(i in a) c[a[i]]++;
for(i=1;i<=m;++i) print i, "appearance,", c[i]+0
}' file
1 appearance, 3
2 appearance, 2
3 appearance, 1
4 appearance, 0
5 appearance, 1

Merging two files in bash with a twist in shell linux

The following question is somewhat tricky but seemingly simple; I need to use bash.
Let us suppose I have 2 text files. The first one is
FirstFile.txt
0 1
0 2
1 1
1 2
2 0
SecondFile.txt
0 1
0 2
0 3
0 4
0 5
1 0
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
I want to be able to create a new Thirdfile.txt that contains the values that are not in the first file, meaning that if a pair also appears in the first file I want it removed, knowing that 2 0 and 0 2 are the same.
Can you help me out?
Using awk, you can rearrange the columns so that the lower number is always first. When reading the first file, save them as keys in an associative array. When reading the second file, test if they're not found in the array.
awk '{if ($1 <= $2) { a = $1; b = $2; } else { a = $2; b = $1 } }
FNR==NR { arr[a, b] = 1; next; }
!arr[a, b]' FirstFile.txt SecondFile.txt > ThirdFile.txt
Results:
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
paste <(cut -f2 a.txt) <(cut -f1 a.txt) > tmp.txt
cat a.txt b.txt tmp.txt | sort | uniq -u
or
cat a.txt b.txt <(paste <(cut -f2 a.txt) <(cut -f1 a.txt)) | sort | uniq -u
Result
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
Explanation
uniq removes duplicate rows from a text file.
uniq requires that its input be sorted.
uniq -u prints only the rows that do not have duplicates.
So, cat a.txt b.txt | sort | uniq -u will almost get you there: Only rows in b.txt that are not in a.txt will get printed. However it doesn't handle the reversed cases, like '1 2' <-> '2 1'.
Therefore, you need a temp file that holds all the reversed removal keys. That's what paste <(cut -f2 a.txt) <(cut -f1 a.txt) does.
Note that cut assumes columns are separated by \t's. If they are not, you will need to specify a delimiter with, for example, -d ' '.
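Since the sample files above appear to be space-separated rather than tab-separated, here is a sketch of the same one-liner with an explicit delimiter (assuming a single space between the two columns):
cat a.txt b.txt <(paste -d ' ' <(cut -d ' ' -f2 a.txt) <(cut -d ' ' -f1 a.txt)) | sort | uniq -u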

How can I sort a 10GB file?

I'm trying to sort a big table stored in a file. The format of the file is
(ID, intValue)
The data is sorted by ID, but what I need is to sort the data using the intValue, in descending order.
For example
ID | IntValue
1 | 3
2 | 24
3 | 44
4 | 2
to this table
ID | IntValue
3 | 44
2 | 24
1 | 3
4 | 2
How can I use the Linux sort command to do the operation? Or do you recommend another way?
How can I use the Linux sort command to do the operation? Or do you recommend another way?
As others have already pointed out, see man sort for -k & -t command line options on how to sort by some specific element in the string.
Now, sort also has a facility to help sort huge files which potentially don't fit into RAM: namely the -m command line option, which merges already sorted files into one. (See merge sort for the concept.) The overall process is fairly straightforward:
Split the big file into small chunks. Use for example the split tool with the -l option. E.g.:
split -l 1000000 huge-file small-chunk
Sort the smaller files. E.g.
for X in small-chunk*; do sort -t'|' -k2 -nr < $X > sorted-$X; done
Merge the sorted smaller files. E.g.
sort -t'|' -k2 -nr -m sorted-small-chunk* > sorted-huge-file
Clean-up: rm small-chunk* sorted-small-chunk*
The only thing you have to take special care about is the column header.
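Putting it together, a minimal sketch of the whole procedure (assuming the input is called huge-file, has a single header line, and uses '|' as the separator as in the example; the temporary file name "header" is just for illustration):
head -n 1 huge-file > header                                              # keep the header aside
tail -n +2 huge-file | split -l 1000000 - small-chunk                     # split only the data rows
for X in small-chunk*; do sort -t'|' -k2 -nr < "$X" > "sorted-$X"; done   # sort each chunk
cat header > sorted-huge-file                                             # put the header back on top
sort -t'|' -k2 -nr -m sorted-small-chunk* >> sorted-huge-file             # merge the sorted chunks
rm header small-chunk* sorted-small-chunk*                                # clean up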
How about:
sort -t' ' -k2 -nr < test.txt
where test.txt is
$ cat test.txt
1 3
2 24
3 44
4 2
gives sorting in descending order (option -r)
$ sort -t' ' -k2 -nr < test.txt
3 44
2 24
1 3
4 2
while this sorts in ascending order (without option -r)
$ sort -t' ' -k2 -n < test.txt
4 2
1 3
2 24
3 44
in case you have duplicates
$ cat test.txt
1 3
2 24
3 44
4 2
4 2
use the uniq command like this
$ sort -t' ' -k2 -n < test.txt | uniq
4 2
1 3
2 24
3 44

Cannot get this simple sed command

This sed command is described as follows
Delete the cars that are $10,000 or more. Pipe the output of the sort into a sed to do this, by quitting as soon as we match a regular expression representing 5 (or more) digits at the end of a record (DO NOT use repetition for this):
So far the command is:
$ grep -iv chevy cars | sort -nk 5
I think I have to add another pipe at the end of that command which "quits as soon as we match a regular expression representing 5 or more digits at the end of a record".
I tried things like
$ grep -iv chevy cars | sort -nk 5 | sed "/[0-9][0-9][0-9][0-9][0-9]/ q"
and other variations within the // but nothing works! What is the command which matches a regular expression representing 5 or more digits and quits according to this question?
Nominally, you should add a $ before the second / to match 5 digits at the end of the record. If you omit the $, then any sequence of 5 digits will cause sed to quit, so if there is another number (a VIN, perhaps) before the price, it might match when you didn't intend it to.
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/q'
On the whole, it's safer to use single quotes around the regex, unless you need to substitute a shell variable into it (or unless the regex contains single quotes itself). You can also specify the repetition:
grep -iv chevy cars | sort -nk 5 | sed '/[0-9]\{5,\}$/q'
The \{5,\} part matches 5 or more digits. If for any reason that doesn't work, you might find you're using GNU sed and you need to do something like sed --posix to get it working in the normal mode. Or you might be able to just remove the backslashes. There certainly are options to GNU sed to change the regex mechanism it uses (as there are with GNU grep too).
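If your sed supports extended regular expressions (the -E option in both GNU sed and BSD sed), the repetition can be written without the backslashes:
grep -iv chevy cars | sort -nk 5 | sed -E '/[0-9]{5,}$/q'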
Another way.
As you didn't post a file sample, I did it as a guess.
Here I'm looking for lines with the word "chevy" where field 5 is less than 10000.
awk '/chevy/ {if ( $5 < 10000 ) print $0} ' cars
I forgot the -i flag of grep... so the correct version is:
awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
$ cat > cars
Chevy 2 3 4 10000
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 10000
CHEVY 2 3 4 2000
Prevy 2 3 4 1000
Prevy 2 3 4 10000
$ awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 2000
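Note that IGNORECASE only works in gawk. With other awk implementations, a roughly equivalent case-insensitive match is to lowercase the record before matching (a sketch, using the same sample data):
awk 'tolower($0) ~ /chevy/ && $5 < 10000' cars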
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/d'

Why does "uniq" count identical words as different?

I want to calculate the frequency of the words in a file, where the words are one per line. The file is really big, so this might be the problem (it has 300k lines in this example).
I do this command:
cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt
and the problem is that it gives me a little bug: it considers the same words as different.
For example, the first entries are:
306 continua
278 apertura
211 eventi
189 murah
182 giochi
167 giochi
with giochi repeated twice as you can see.
At the bottom of the file it becomes even worse and it looks like this:
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 winchester
1 wind
1 wind
for all the words.
What am I doing wrong?
Try to sort first:
cat .temp_occ | sort| uniq -c | sort -k1,1nr -k2 > distribution.txt
Or use sort -u, which also eliminates duplicates (though it does not give you counts).
The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without
'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.
So running uniq on
a
b
a
will return:
a
b
a
Is it possible that some of the words have whitespace characters after them? If so you should remove them using something like this:
cat .temp_occ | tr -d ' ' | uniq -c | sort -k1,1nr -k2 > distribution.txt
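Note that tr -d ' ' deletes every space on the line, and this pipeline still feeds unsorted input to uniq. A combined sketch that trims only trailing whitespace and sorts before counting might look like this:
sed 's/[[:space:]]*$//' .temp_occ | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt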
