Why does "uniq" count identical words as different? - linux

I want to calculate the frequency of the words from a file, where the words are one by line. The file is really big, so this might be the problem (it counts 300k lines in this example).
I do this command:
cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt
and the problem is that it gives me a little bug: it considers the same words as different.
For example, the first entries are:
306 continua
278 apertura
211 eventi
189 murah
182 giochi
167 giochi
with giochi repeated twice as you can see.
At the bottom of the file it becomes even worse and it looks like this:
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 winchester
1 wind
1 wind
for all the words.
What am I doing wrong?

Try to sort first:
cat .temp_occ | sort| uniq -c | sort -k1,1nr -k2 > distribution.txt

Or use "sort -u" which also eliminates duplicates. See here.

The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without
'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.`
So running uniq on
a
b
a
will return:
a
b
a

Is it possible that some of the words have whitespace characters after them? If so you should remove them using something like this:
cat .temp_occ | tr -d ' ' | uniq -c | sort -k1,1nr -k2 > distribution.txt

Related

Finding the Number of strings in a File

I'm trying to write a very small program that will check the number of sub strings in a large text file. All it will do is count the first 2000 lines of the text file, find any "TTT" sub-strings, count them, and set a variable to that total. I'm a bit new to shell, so any help would be amazingly appreciated!
#!/bin/bash
$counter=(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
For what it's worth you might awk better suited for this task:
awk -F"ttt" '{j=(NF-1)+j}END{print j}' filename
This will split each record in your file by delimiter "ttt". Then it counts the number of fields, subtracts one, and adds that to the total.
A file like:
ttt tttttt something
1 5 ttt
tt
one more ttt record
Would be split (visualizing with pipe delim) like:
| || something
1 5 |
tt
one more | record
Counting the number of fields per record:
4
2
1
2
Subtracting one from that:
3
1
0
1
Which totals to 5, which is how many "ttt" substrings are present.
To incorporate this into your script (and fixing your other issue):
#!/bin/bash
counter=$(awk -F"ttt" '{j=(NF-1)+j}END{print j}' filename)
echo $counter
The change here is that when we set a variable in Bash we don't include the $ sign at the front. Only in referencing the variable do we include the $.
You have some minor syntax errors there, probably you meant this:
counter=$(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
Notice the tiny changes I made there to make it work.
Btw the grep TTT in the middle is redundant, you can simply drop it, that is:
counter=$(head -2000 [file name] | grep -o TTT | wc -l)
grep can already do what you want: counter=$(grep -c TTT $infile). You can limit the number of hits (not lines) with -m NUM, --max-count=NUM, which makes grep stop at the end of the file OR when NUM occurrences are found.

How can I sort a 10GB file?

I'm trying to sort a big table stored in a file. The format of the file is
(ID, intValue)
The data is sorted by ID, but what I need is to sort the data using the intValue, in descending order.
For example
ID | IntValue
1 | 3
2 | 24
3 | 44
4 | 2
to this table
ID | IntValue
3 | 44
2 | 24
1 | 3
4 | 2
How can I use the Linux sort command to do the operation? Or do you recommend another way?
How can I use the Linux sort command to do the operation? Or do you recommend another way?
As others have already pointed out, see man sort for -k & -t command line options on how to sort by some specific element in the string.
Now, the sort also has facility to help sort huge files which potentially don't fit into the RAM. Namely the -m command line option, which allows to merge already sorted files into one. (See merge sort for the concept.) The overall process is fairly straight forward:
Split the big file into small chunks. Use for example the split tool with the -l option. E.g.:
split -l 1000000 huge-file small-chunk
Sort the smaller files. E.g.
for X in small-chunk*; do sort -t'|' -k2 -nr < $X > sorted-$X; done
Merge the sorted smaller files. E.g.
sort -t'|' -k2 -nr -m sorted-small-chunk* > sorted-huge-file
Clean-up: rm small-chunk* sorted-small-chunk*
The only thing you have to take special care about is the column header.
How about:
sort -t' ' -k2 -nr < test.txt
where test.txt
$ cat test.txt
1 3
2 24
3 44
4 2
gives sorting in descending order (option -r)
$ sort -t' ' -k2 -nr < test.txt
3 44
2 24
1 3
4 2
while this sorts in ascending order (without option -r)
$ sort -t' ' -k2 -n < test.txt
4 2
1 3
2 24
3 44
in case you have duplicates
$ cat test.txt
1 3
2 24
3 44
4 2
4 2
use the uniq command like this
$ sort -t' ' -k2 -n < test.txt | uniq
4 2
1 3
2 24
3 44

Find unique lines

How can I find the unique lines and remove all duplicates from a file?
My input file is
1
1
2
3
5
5
7
7
I would like the result to be:
2
3
sort file | uniq will not do the job. Will show all values 1 time
uniq has the option you need:
-u, --unique
only print unique lines
$ cat file.txt
1
1
2
3
5
5
7
7
$ uniq -u file.txt
2
3
Use as follows:
sort < filea | uniq > fileb
You could also print out the unique value in "file" using the cat command by piping to sort and uniq
cat file | sort | uniq -u
While sort takes O(n log(n)) time, I prefer using
awk '!seen[$0]++'
awk '!seen[$0]++' is an abbreviation for awk '!seen[$0]++ {print}', print line(=$0) if seen[$0] is not zero.
It take more space but only O(n) time.
I find this easier.
sort -u input_filename > output_filename
-u stands for unique.
you can use:
sort data.txt| uniq -u
this sort data and filter by unique values
uniq -u has been driving me crazy because it did not work.
So instead of that, if you have python (most Linux distros and servers already have it):
Assuming you have the data file in notUnique.txt
#Python
#Assuming file has data on different lines
#Otherwise fix split() accordingly.
uniqueData = []
fileData = open('notUnique.txt').read().split('\n')
for i in fileData:
if i.strip()!='':
uniqueData.append(i)
print uniqueData
###Another option (less keystrokes):
set(open('notUnique.txt').read().split('\n'))
Note that due to empty lines, the final set may contain '' or only-space strings. You can remove that later. Or just get away with copying from the terminal ;)
#
Just FYI, From the uniq Man page:
"Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'."
One of the correct ways, to invoke with:
#
sort nonUnique.txt | uniq
Example run:
$ cat x
3
1
2
2
2
3
1
3
$ uniq x
3
1
2
3
1
3
$ uniq -u x
3
1
3
1
3
$ sort x | uniq
1
2
3
Spaces might be printed, so be prepared!
uniq -u < file will do the job.
uniq should do fine if you're file is/can be sorted, if you can't sort the file for some reason you can use awk:
awk '{a[$0]++}END{for(i in a)if(a[i]<2)print i}'
sort -d "file name" | uniq -u
this worked for me for a similar one. Use this if it is not arranged.
You can remove sort if it is arranged
This was the first i tried
skilla:~# uniq -u all.sorted
76679787
76679787
76794979
76794979
76869286
76869286
......
After doing a cat -e all.sorted
skilla:~# cat -e all.sorted
$
76679787$
76679787 $
76701427$
76701427$
76794979$
76794979 $
76869286$
76869286 $
Every second line has a trailing space :(
After removing all trailing spaces it worked!
thank you
Instead of sorting and then using uniq, you could also just use sort -u. From sort --help:
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run

Cannot get this simple sed command

This sed command is described as follows
Delete the cars that are $10,000 or more. Pipe the output of the sort into a sed to do this, by quitting as soon as we match a regular expression representing 5 (or more) digits at the end of a record (DO NOT use repetition for this):
So far the command is:
$ grep -iv chevy cars | sort -nk 5
I have to add another pipe at the end of that command I think which "quits as soon as we match a regular expression representing 5 or more digits at the end of a record"
I tried things like
$ grep -iv chevy cars | sort -nk 5 | sed "/[0-9][0-9][0-9][0-9][0-9]/ q"
and other variations within the // but nothing works! What is the command which matches a regular expression representing 5 or more digits and quits according to this question?
Nominally, you should add a $ before the second / to match 5 digits at the end of the record. If you omit the $, then any sequence of 5 digits will cause sed to quit, so if there is another number (a VIN, perhaps) before the price, it might match when you didn't intend it to.
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/q'
On the whole, it's safer to use single quotes around the regex, unless you need to substitute a shell variable into it (or unless the regex contains single quotes itself). You can also specify the repetition:
grep -iv chevy cars | sort -nk 5 | sed '/[0-9]\{5,\}$/q'
The \{5,\} part matches 5 or more digits. If for any reason that doesn't work, you might find you're using GNU sed and you need to do something like sed --posix to get it working in the normal mode. Or you might be able to just remove the backslashes. There certainly are options to GNU sed to change the regex mechanism it uses (as there are with GNU grep too).
Another way.
As you don't post a file sample, a did it as a guess.
Here I'm looking for lines with the word "chevy" where the field 5 is less than 10000.
awk '/chevy/ {if ( $5 < 10000 ) print $0} ' cars
I forgot the flag -i from grep ... so the correct is:
awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
$ cat > cars
Chevy 2 3 4 10000
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 10000
CHEVY 2 3 4 2000
Prevy 2 3 4 1000
Prevy 2 3 4 10000
$ awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 2000
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/d'

How to do sum from the file and move in particular way in another file in linux?

Acttualy this is my assignment.I have three-four file,related by student record.Every file have two-three student record.like this
Course Name:Opreating System
Credit: 4
123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25
243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15
Every file have different coursename.I did every coursename and studentid move
in one file but now i don't know how to add all marks and move to another file on same place where is id? Can you please tell me how to do it?
It looks like this:
Student# Operating Systems JAVA C++ Web Programming GPA
123456 76 63 50 82 67.75
243567 80 - 34 63 59
I did like this:
#!/bin/sh
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
echo "STUDENT ID" > rsh2
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
echo "GPA" >> rsh2
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
Some comments and a few pointers :
It would help to add 'comments' for each line of code that is not self evident ; i.e. code like mv f f.bak doesn't need to be commented, but I'm not sure what the intent of your many lines of code are.
You insert a comment with the '#' char, like
# concatenate all files that contain the word CREDITS into a file called rsh1
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
Also note that you consistently use all uppercase for your search targets, i.e. CREDITS, when your sample files shows mixed case. Either used correct case for your search targets, i.e.
`grep -l 'Credits'`
OR tell grep to -i(gnore case), i.e.
`grep -il 'Credits'
Your line
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
can be reduced to 1 call to sed (and you have the same case confusion thing going on), try
sed -n '/COURSE/i{;s/COURSE NAME: //gip;}' rsh1 >> rsh2
This means (-n don't print every line by default),
`gip` = global substitute,
= ignore case in matching
print only lines where substituion was made
So you're editing out the string COURSE NAME for any line that has COURSE in it, and only printing those lines' (you probably don't need the 'g' (global) specifier given that you expect only 1 instance per line)
Your line
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
Actually looks pretty good, very advanced, you're trying to 'fold' each 2 lines together into 1 line, right?
But,
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
I'm really confused by this, is this where you're trying to total a students score? (with a sort embedded I guess not). Why do you think you need a sort,
While it is possible to perform arithmetic in sed, it is super-crazy hard, so you can either use bash variables to calculate the values OR use a unix tool that is designed to process text AND perform logical and mathematical operations of the data presented, awk or perl come to mind here
Anyway, one solution to total each score is to use awk
echo "123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25" |\
awk '{for (i=2;i<=NF;i++) { tot+=$i }; print $1 "\t" tot }'
Will give you a clue on how to proceed for that.
Awk has predefined variables that it populates for each file, and each line of text that it reads, i.e.
$0 = complete line of text (as defined by the internal variables RS (RecordSeparator)
which defaults to '\n' new-line char, the unix end-of-line char
$1 = first field in text (as defined by the internal variables FS (FieldSeparator)
which defaults to (possibly multiple) space chars OR tab char
a line with 2 connected spaces chars and 1 tab char has 3 fields)
NF = Number(of)Fields in current line of data (again fields defined by value of FS as
described above)
(there are many others, besides, $0, $n, $NF, $FS, $RS).
you can programatically increment for values like $1, $2, $3, by using a variable as in the example code, like $i (i is a variable that has a number between 2 and NF. The leading '$'
says give me the value of field i (i.e. $2, $3, $4 ...)
Incidentally, your problem could be easily solved with a single awk script, but apparently, you're supposed to learn about cat, cut, grep, etc, which is a very worthwhile goal.
I hope this helps.

Resources